Hive 2: http://bitnine.net/blog-computing/about-hive-2-hiveql/
In the previous post, we looked at the features, structure, operators, and example queries of Hive 1.x. This post covers the features of Hive 2.0.
HPLSQL: Added Procedural SQL
Procedural SQL is similar to Oracle's PL/SQL and Teradata's stored procedures.
- Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, and exceptions (SIGNAL).
HPLSQL aims to be compatible with the main features of existing procedural SQL dialects, to maximize reuse of existing scripts.
Currently, HPLSQL runs outside of Hive and communicates with it through JDBC.
- Users run commands with the hplsql binary.
- There are two goals: for Hive's parser to run HPLSQL directly, and for HPLSQL procedures to be stored.
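As a rough illustration of the constructs listed above, the following shell sketch writes a small HPL/SQL script and runs it with the hplsql binary when that binary is on the PATH. The file name hplsql_demo.sql and the procedure name greet are made up for this example, and the syntax may need adjusting for your HPL/SQL version.

```shell
#!/bin/sh
# Write a small HPL/SQL script. File and procedure names are illustrative only.
cat > hplsql_demo.sql <<'EOF'
CREATE PROCEDURE greet(IN name STRING, OUT result STRING)
BEGIN
  SET result = 'Hello, ' || name || '!';
END;

DECLARE msg STRING;
CALL greet('Hive', msg);
PRINT msg;

-- A loop with a branch inside it
FOR i IN 1..3 LOOP
  IF i = 2 THEN
    PRINT 'two';
  ELSE
    PRINT i;
  END IF;
END LOOP;
EOF

# Run the script only when the hplsql binary is available.
if command -v hplsql >/dev/null 2>&1; then
  hplsql -f hplsql_demo.sql
else
  echo "hplsql not found; script written to hplsql_demo.sql"
fi
```

Keeping the procedural code in a file and passing it with -f mirrors how existing PL/SQL-style scripts would typically be reused.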
LLAP (Live Long And Process): Sub-second Queries
- Saves time at query startup (no container allocation, no JVM startup cost).
Data is cached using an asynchronous I/O elevator.
- Hot data is cached in memory (the cache is column-aware, so hot columns are cached).
Where the case is suitable, operators can run inside LLAP.
- Large datasets and ETL-style queries are generally not suitable.
- For security, LLAP does not run user code.
An interface is being provided so that other engines can read data securely and in parallel.
- LLAP is currently read-only; a write path is not set up yet.
- It does not work with ACID.
- Users must decide how a query runs: LLAP only, mixed mode, or Tez only.
- It can currently read only ORC files.
- It is integrated with Tez as its execution engine.
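Since the user has to choose where a query runs, a minimal sketch of the session settings may help. It assumes the hive.execution.mode and hive.llap.execution.mode properties behave as commented, that sales_orc is a hypothetical ORC table, and that HIVE_JDBC_URL is an environment variable you define yourself. Check the values against your Hive version before relying on them.

```shell
#!/bin/sh
# Session settings for choosing where a query runs.
# sales_orc is a hypothetical table stored as ORC.
cat > llap_mode.sql <<'EOF'
SET hive.execution.engine=tez;
SET hive.execution.mode=llap;
-- Assumed meaning: "all" runs work in LLAP where possible,
-- "none" keeps everything in ordinary Tez containers.
SET hive.llap.execution.mode=all;
SELECT COUNT(*) FROM sales_orc;
EOF

# HIVE_JDBC_URL is an assumed variable, e.g. jdbc:hive2://host:10000
if command -v beeline >/dev/null 2>&1 && [ -n "${HIVE_JDBC_URL:-}" ]; then
  beeline -u "$HIVE_JDBC_URL" -f llap_mode.sql
else
  echo "beeline or HIVE_JDBC_URL not available; settings written to llap_mode.sql"
fi
```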
HBase Metastore: Fast Query Planning.
Adds an option to store Hive metadata in HBase.
- On Hive 1.2, planning a query that reads thousands of partitions takes more than 5 minutes, and most of that time is spent collecting metadata.
- The ORM layer generates a cumbersome schema.
- Because a single-node RDBMS cannot hold much data, caching opportunities are limited.
- All metadata operations are subject to the RDBMS limit on simultaneous connections.
- HBase addresses all of the above.
Goal: reduce the metadata access time for a query with thousands of partitions to 200 milliseconds.
Hive on Spark Improvements
- Dynamic partition pruning
- Use of Spark persistence for self-joins, self-unions, and CTEs
- Vectorized map-join and other map-join improvements
- Parallel order by
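To try dynamic partition pruning from the list above, something like the following could be used. This is only a sketch: the tables sales (assumed to be partitioned by ds) and dates are hypothetical, and HIVE_JDBC_URL is an environment variable assumed for this example.

```shell
#!/bin/sh
# Enable Spark as the engine and turn on dynamic partition pruning.
# sales (partitioned by ds) and dates are hypothetical tables.
cat > spark_dpp.sql <<'EOF'
SET hive.execution.engine=spark;
SET hive.spark.dynamic.partition.pruning=true;
-- The filter on the small dates table can prune partitions of sales at run time.
SELECT s.item_id, SUM(s.amount)
FROM sales s JOIN dates d ON s.ds = d.ds
WHERE d.quarter = 'Q1'
GROUP BY s.item_id;
EOF

# HIVE_JDBC_URL is an assumed variable, e.g. jdbc:hive2://host:10000
if command -v beeline >/dev/null 2>&1 && [ -n "${HIVE_JDBC_URL:-}" ]; then
  beeline -u "$HIVE_JDBC_URL" -f spark_dpp.sql
else
  echo "beeline or HIVE_JDBC_URL not available; query written to spark_dpp.sql"
fi
```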
CBO (Cost-Based Optimizer) Improvements
Hive's CBO uses Apache Calcite.
CBO is enabled by default in 2.0 (this was not the case in 1.x).
The main focus of CBO work is BI queries (using TPC-DS as a guide).
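A cost-based optimizer depends on statistics, so a quick way to see it at work is to gather statistics and then inspect the plan. The sketch below assumes unpartitioned TPC-DS style tables named store_sales and item already exist in your database, and that HIVE_JDBC_URL is an environment variable you set yourself.

```shell
#!/bin/sh
# Gather statistics and inspect the plan for a BI-style query.
# store_sales and item are assumed to exist (TPC-DS style, unpartitioned).
cat > cbo_check.sql <<'EOF'
SET hive.cbo.enable=true;
-- Table-level and column-level statistics feed the optimizer's cost estimates.
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
EXPLAIN
SELECT i.i_category, SUM(ss.ss_net_paid)
FROM store_sales ss JOIN item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_category;
EOF

# HIVE_JDBC_URL is an assumed variable, e.g. jdbc:hive2://host:10000
if command -v beeline >/dev/null 2>&1 && [ -n "${HIVE_JDBC_URL:-}" ]; then
  beeline -u "$HIVE_JDBC_URL" -f cbo_check.sql
else
  echo "beeline or HIVE_JDBC_URL not available; query written to cbo_check.sql"
fi
```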
Hive 2.0 Incompatibilities
Java 7 and 8 are supported (Java 6 is not).
Hadoop 2.x is required; Hadoop 1.x is no longer supported.
MapReduce is no longer recommended; use Tez or Spark instead.
- MapReduce is scheduled for removal in a future release.
Some default configurations have changed:
- Bucketing is enforced by default.
- If the metastore schema does not exist, Hive no longer creates it automatically.
- SQL standard authorization is used by default.
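Because the metastore schema is no longer created automatically, it has to be initialized explicitly before first use. A minimal sketch with the schematool utility that ships with Hive follows; the DB_TYPE variable and its derby default are assumptions for this example, so substitute the database your metastore actually uses.

```shell
#!/bin/sh
# DB_TYPE is an assumed variable; derby is used here only as an example default.
DB_TYPE="${DB_TYPE:-derby}"

# Initialize the metastore schema explicitly, since Hive 2.0 will not create it.
if command -v schematool >/dev/null 2>&1; then
  schematool -dbType "$DB_TYPE" -initSchema
else
  echo "schematool not found; would run: schematool -dbType $DB_TYPE -initSchema"
fi
```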
There is a plan to replace the Hive CLI with the Beeline CLI.
- Deploying a secure cluster is easier for users when all access is managed through ODBC/JDBC.
- Keeping a single access path is much cleaner.
- A separate HiveServer2 is not needed, because HS2 can be embedded in Beeline.
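For readers used to the Hive CLI, here is a small sketch of the Beeline equivalent. A JDBC URL with no host (jdbc:hive2://) starts Beeline with an embedded HiveServer2, while a URL such as jdbc:hive2://host:10000 connects to a remote one. The HS2_URL variable is an assumption made for this example.

```shell
#!/bin/sh
# HS2_URL is an assumed variable. An empty host means embedded HiveServer2;
# set it to something like jdbc:hive2://host:10000 for a remote server.
HS2_URL="${HS2_URL:-jdbc:hive2://}"
QUERY='SHOW DATABASES;'

if command -v beeline >/dev/null 2>&1; then
  beeline -u "$HS2_URL" -e "$QUERY"
else
  echo "beeline not found; would connect to $HS2_URL"
fi
```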
Hive-on-MR, in other words the functionality implemented on MapReduce, will eventually be removed. We recommend Tez or Spark as the distributed execution engine instead of MapReduce. If you still want to use MapReduce, you can do so on Hive 1.x. However, you should plan your Hive deployment with the future roadmap in mind.
BITNINE GLOBAL INC., THE COMPANY SPECIALIZING IN GRAPH DATABASE