Impala is a SQL query engine that processes data stored in a Hadoop cluster. I will explain Impala over the next few blog posts. This first post covers the introduction, features, pros and cons, and architecture of Impala.
What is Impala?
Impala is an MPP (Massively Parallel Processing) SQL query engine that processes big data stored in a Hadoop cluster. It is open source software written in C++ and Java. It is designed for low-latency queries and offers higher performance than other SQL engines for Hadoop.
In other words, Impala is a high-performance engine that provides fast access to data stored in HDFS.
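To make the MPP idea concrete, here is a minimal sketch in Python. It is not how Impala is implemented. It only illustrates the general pattern: each node aggregates the rows it stores locally, and a coordinator merges the partial results. The node names, the sample rows, and the functions `partial_sum`, `merge`, and `run_query` are all hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: one list of (category, amount) rows per node.
# In a real cluster these would be data blocks stored locally on each node.
NODE_BLOCKS = {
    "node1": [("books", 12.0), ("games", 30.0)],
    "node2": [("books", 8.0), ("music", 5.0)],
    "node3": [("games", 10.0), ("music", 15.0)],
}


def partial_sum(rows):
    """Runs on one node: aggregate only the rows stored on that node."""
    totals = {}
    for category, amount in rows:
        totals[category] = totals.get(category, 0.0) + amount
    return totals


def merge(partials):
    """Runs on the coordinator: combine the per-node partial results."""
    merged = {}
    for part in partials:
        for category, amount in part.items():
            merged[category] = merged.get(category, 0.0) + amount
    return merged


def run_query(node_blocks):
    """Similar in spirit to: SELECT category, SUM(amount) ... GROUP BY category."""
    # Threads simulate nodes working in parallel on their own data.
    with ThreadPoolExecutor(max_workers=len(node_blocks)) as pool:
        partials = list(pool.map(partial_sum, node_blocks.values()))
    return merge(partials)


print(run_query(NODE_BLOCKS))
```

The point of the design is that only small partial results travel to the coordinator, while the raw rows stay on the node where they are stored.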
Features of Impala
The following are the features of Impala.
- Impala is open source software, freely available under the Apache license.
- Impala supports in-memory data processing. It can access and analyze data stored on Hadoop data nodes without moving the data.
- Impala can access data using SQL-like queries.
- Impala accesses data in HDFS faster than other SQL engines for Hadoop.
- Users can store data in storage systems such as HDFS, HBase, and Amazon S3.
- Impala can be integrated with Business Intelligence (BI) tools such as Tableau, Pentaho, MicroStrategy, and Zoomdata.
- Impala supports various file formats such as LZO-compressed text, SequenceFile, Avro, RCFile, and Parquet.
- Impala uses the same metadata, ODBC driver, and SQL syntax as Apache Hive.
Advantages of Impala
The following are advantages of Impala.
- With Impala, users can process data stored in HDFS quickly using their existing SQL knowledge.
- Because processing happens where the data is stored, data in Hadoop does not need to be transformed or moved before it is queried.
- Users can access data stored in HDFS, HBase, and Amazon S3 without any knowledge of MapReduce, because Impala supports standard SQL queries.
- Querying data usually involves several complex steps such as ETL (Extract-Transform-Load); Impala shortens the process by removing some of those steps.
Drawbacks of Impala
The following are disadvantages of Impala.
- Impala does not support custom serialization and deserialization (SerDe).
- Among custom formats, Impala can read only text files. It cannot read user-defined binary files.
- A table must be refreshed whenever new records or files are added to its data directory in HDFS.
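The last drawback can be illustrated with Impala's `REFRESH` statement, which tells Impala to pick up data files added to a table's directory outside of Impala. The sketch below only builds the statement text. The helper name `refresh_statement`, the identifier check, and the database and table names are assumptions made for illustration. Sending the statement to a cluster (for example through impala-shell or a client driver) is not shown.

```python
import re

# Conservative identifier check (an assumption for this sketch, stricter than
# Impala itself may be): letters, digits, and underscores only.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def refresh_statement(table, database=None):
    """Build a REFRESH statement for a table whose HDFS files have changed."""
    for name in (table, database):
        if name is not None and not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    qualified = f"{database}.{table}" if database else table
    return f"REFRESH {qualified}"


# Hypothetical table: after new files land in its HDFS directory,
# this statement would be run before querying the table again.
print(refresh_statement("page_views", database="web_logs"))
```

Validating the names before building the string keeps the helper from producing a malformed or unsafe statement when the table name comes from user input.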
Architecture of Impala
Impala consists of two main processes: Impalad and the Impala state store.
Impalad is the distributed query engine process. It runs on the data nodes of the Hadoop cluster, where it plans and executes queries. The Impala state store holds metadata about the Impalad processes running on each data node. When an Impalad process is added to or removed from the cluster, this metadata is updated through the Impala state store process.
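The relationship between the two processes can be sketched as a small in-memory model. This is a toy illustration, not Impala's actual protocol: the classes `StateStore` and `Impalad`, their methods, and the host names are hypothetical, and real processes communicate over the network rather than through direct callbacks.

```python
class StateStore:
    """Toy stand-in for the Impala state store: tracks which nodes are live."""

    def __init__(self):
        self._members = set()
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callback that receives every membership update."""
        self._subscribers.append(callback)

    def register(self, host):
        """Called when an Impalad process joins the cluster."""
        self._members.add(host)
        self._broadcast()

    def deregister(self, host):
        """Called when an Impalad process leaves the cluster."""
        self._members.discard(host)
        self._broadcast()

    def _broadcast(self):
        # Send an immutable snapshot so subscribers cannot change shared state.
        snapshot = frozenset(self._members)
        for callback in self._subscribers:
            callback(snapshot)


class Impalad:
    """Toy stand-in for one Impalad process running on a data node."""

    def __init__(self, host, statestore):
        self.host = host
        self.known_peers = frozenset()
        # Subscribe first so this node also sees its own registration.
        statestore.subscribe(self._on_update)
        statestore.register(host)

    def _on_update(self, members):
        self.known_peers = members


store = StateStore()
dn1 = Impalad("datanode1", store)
dn2 = Impalad("datanode2", store)
print(sorted(dn1.known_peers))  # both nodes are known
store.deregister("datanode2")
print(sorted(dn1.known_peers))  # only datanode1 remains
```

Keeping membership in one place means each Impalad only needs to talk to the state store to learn which peers are available for query work.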
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE