In the previous post, we covered the features, pros, cons, and architecture of Impala. In this post, we will examine existing systems and compare them with Impala.
Before Hadoop was available, only a few companies and organizations were able to analyze massive data sets. The core of Hadoop is its Hadoop Distributed File System (HDFS) and the MapReduce framework. Data in Hadoop is stored across the distributed HDFS, and users retrieve the data they want using MapReduce.
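To make the MapReduce model concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is only an illustration of the idea, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example, and a real job would run these steps in parallel across many machines reading from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input lines standing in for file blocks stored on HDFS.
counts = word_count([
    "Hadoop stores data in HDFS",
    "MapReduce reads data from HDFS",
])
```

Even this tiny job needs three explicit steps, which hints at why writing raw MapReduce programs was considered inconvenient.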
Subsequently, Hive was created to address the shortcomings of MapReduce, such as ‘inconvenience of use’ and ‘slow processing’. Although Hive was more convenient to use, one problem remained: analyzing data was still slow because Hive uses MapReduce internally.
That is why HBase, a column-oriented NoSQL database, was developed. HBase makes fast input/output of key-value data possible, providing a database environment that processes data in real time on a Hadoop-based system.
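The key-value access pattern of a wide-column store can be sketched with a toy in-memory table. This is an assumption-laden illustration, not the HBase client API: the class `TinyWideColumnTable`, its `put` and `get` methods, and the `family:qualifier` column names are all hypothetical.

```python
from collections import defaultdict


class TinyWideColumnTable:
    """Toy in-memory stand-in for a wide-column table (not the HBase API)."""

    def __init__(self):
        # row key -> {column name -> value}; each row may hold different columns
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell, addressed by row key and column name."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or a copy of the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = TinyWideColumnTable()
table.put("user#1", "info:name", "Kim")
table.put("user#1", "info:city", "Seoul")
table.put("user#2", "info:name", "Lee")
```

Every read and write goes straight to a row key, so there is no scan over the data set. That direct lookup is what makes this kind of access fast.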
Impala is a real-time SQL system that runs on HDFS. SQL is familiar to most developers, and its big advantage is that it makes data easy to manipulate. Since Impala supports both SQL and real-time processing, it can be used as a BI system.
|GFS & MapReduce||HDFS & MapReduce||Batch|
|Sawzall||Pig & Hive||Batch|
Impala vs RDB
Impala uses a query language similar to SQL and HiveQL. The following table shows the main differences between Impala and a relational database (RDB).
|Impala||RDB|
|Query language is similar to SQL and HiveQL||Uses SQL|
|Cannot update/delete individual records||Can update/delete individual records|
|Does not support transactions||Supports transactions|
|Does not support indexing||Supports indexing|
|Able to store/manage massive data (petabytes)||Unable to store/manage massive data|
Impala vs Hive vs HBase
There are similarities and differences between Impala, Hive and HBase. The following table shows the comparison.
|Impala||Hive||HBase|
|A tool to manage or analyze data stored on Hadoop.||Data warehouse software. Accesses or manages distributed datasets based on Hadoop.||Wide-column store database based on Hadoop. Uses the BigTable concept.|
|Relational model||Relational model||Wide-column store|
|Developed in C++||Developed in Java||Developed in Java|
|Offers JDBC and ODBC APIs||Offers JDBC, ODBC and Thrift APIs||Offers Java, RESTful and Thrift APIs|
|Supports any language that can use JDBC or ODBC||Supports C++, Java, PHP and Python||Supports C, C#, C++, Groovy, Java, PHP, Python and Scala|
|Does not support triggers||Does not support triggers||Supports triggers|
– Available as open source
– Supports server-side scripting
– Follows ACID-like properties such as durability and concurrency
– Uses sharding for partitioning
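As a rough illustration of what sharding means, the sketch below assigns each record key to one of a fixed number of shards by hashing it. This is a generic hash-based scheme chosen for simplicity; it is an assumption for illustration and does not describe how Impala, Hive, or HBase actually place data. The helper names `shard_for` and `partition` are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index using a stable hash."""
    # hashlib gives the same result on every run, unlike the built-in hash() for str.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(keys, num_shards):
    """Distribute keys over shards; returns {shard index: [keys]}."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical record keys spread over three shards.
keys = ["user#1", "user#2", "user#3", "user#4", "user#5"]
shards = partition(keys, 3)
```

Because the same key always maps to the same shard, any node can work out where a record lives without consulting a central index.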
Impala vs Hive
The key difference between Impala and Hive is real-time processing. Hive uses MapReduce to access data, while Impala uses its own distributed query engine. Cloudera gives three reasons for Impala's high performance.
- Impala puts a lower load on the CPU than Hive and uses the freed capacity for I/O bandwidth. As a result, Impala shows 3 to 4 times higher performance than Hive on purely I/O-bound queries.
- For complex queries, Hive must run several MapReduce stages or a reduce-side join. Because Impala processes complex queries (queries with one or more join operations) with its own distributed query engine rather than the MapReduce framework, it performs 7 to 45 times better than Hive.
- If the data blocks needed for analysis are already in the file cache, Impala is 20 to 90 times faster than Hive.
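The cost of a reduce-side join mentioned above can be seen in a small sketch that contrasts it with a single-pass in-memory hash join. Both functions are simplified, single-process illustrations written for this post and are not taken from Hive or Impala. The hash join is a generic technique used here for comparison, and the sample tables and the names `reduce_side_join` and `hash_join` are hypothetical.

```python
from collections import defaultdict

users = [(1, "Kim"), (2, "Lee")]                 # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "cup")]   # (user_id, item)


def reduce_side_join(left, right):
    """Join in the MapReduce style: tag, shuffle by key, then pair up."""
    # Map: tag every record with the table it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    # Shuffle: group the tagged records by join key.
    groups = defaultdict(list)
    for key, record in tagged:
        groups[key].append(record)
    # Reduce: within each key, pair every left record with every right record.
    joined = []
    for key, records in groups.items():
        lefts = [v for tag, v in records if tag == "L"]
        rights = [v for tag, v in records if tag == "R"]
        joined.extend((key, l, r) for l in lefts for r in rights)
    return joined


def hash_join(build, probe):
    """Join in one pass: build a hash table on one side, probe with the other."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


mr_result = sorted(reduce_side_join(users, orders))
hj_result = sorted(hash_join(users, orders))
```

Both approaches produce the same rows. The difference is that the reduce-side version has to tag and regroup every record from both tables before any pairing can happen. In a real cluster that regrouping means moving the data across the network between stages.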
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE