With the growing size of data everyday, the need for business Intelligence is also thriving like never before. Business Intelligence is about analyzing data to gather some business insights to take better business action and this analysis can be performed with the help of some ETL tools.
Apache Hive is an ETL tool that can query large data sets stored on a Hadoop Distributed File System. Hive framework was designed by Facebook for their Business Analysts who had limited knowledge about any programming language but were excellent at running SQL queries to find business insights.
Characteristics of Apache Hive
Schema on Read:
Hive is not a database because it does not keep data and schema together. Data is always stored on a Hadoop distributed file system, Hive just imposes schema on top of that data too while reading and that is why it is called schema on Read.
SQL like Syntax:
Hive uses SQL-Like language called Hive Query Language (HQL) to perform analytics on top of data. Apache Hive is extremely famous among Business Analysts & testers who are well acquainted with SQL queries.
Apache Hive is built on top of Hadoop Framework development and it runs MapReduce jobs in the background. HQL queries are converted into Java programs, which run as Mappers & Reducers on top of Hadoop to perform analysis.
Hive provides an interface between User and HDFS to run HQL queries against data stored in Hadoop File System.Various supported interfaces by Hive are Hive Command Line, Hive Web UI and Hive HDInsight.
Hive stores the schema and table metadata such as column name, column type, HDFS location (where data is actually stored) on a separate database. By default, a Derby database is used internally to store the schema but it can be replaced by any other database such as postgres, mysql for production environments.
HiveQL Process Engine:
This process engine communicates with the metastore to compile and parse HQL queries.
It processes the HQL queries to create a corresponding MapReduce job to run on top of the Hadoop distributed file system to access data and to produce the final output.
Create Database in Hive
To start working with Hive, we need a database in which we can create tables.Below commands can be used to create, describe and drop a database in Hive.
There are two types of tables in Hive – External & Managed tables. Major difference between the two is that if you drop a Managed table, both data & schema attached to the table will be deleted whereas if you drop an external table then only schema associated with the table will be deleted although the data will remain intact on HDFS.
Create Managed Table
Create External Table
Hive supports various ways to perform data analysis such as
You can aggregate data using various aggregation API provided by Hive such as count, sum, avg etc.
You can order the data based on one or more columns. Ordering/Sorting in distribution is little different from traditional systems. Either you can sort the data across all machines that will result in data shuffling or you can sort the data within a single machine. Sorting the data within a single machine will not guarantee order across all machines.
You can also join the data based on some keys to perform aggregation on top of that joined data.
With the concept of windowing in Hive, you can perform analysis on data falling in some window. For example, you want to find the top searched websites in the last 10 minutes or the highest sold product in the last 7 days.
You can also partition your data in HDFS. Concept of partitioning data can make queries run faster. For example, in population based data, you can partition data based on country and then on city. Suppose I want to perform some aggregation on population data based on country/city, I can directly jump to partition based on country/city instead of searching entire data that will ultimately decrease query latency.
In this Blog, we will look into Aggregation, Partitioning and Joining.
Suppose I want to find out all the cities in which the number of employees is greater than 100. I can run below HQL query.
City with more than 100 Employees
There are numerous built-in aggregation operations provided by hive such as sum, Min, Max, Avg. In addition, Hive provides String, Date and Math operations to perform advanced Analytics to perform some formatting or advanced analytics.
As already discussed, partitioning makes HQL run faster. In below example, we partitioned the employees data based on employee city. Running the query below will create physical directories on HDFS based on the number of cities and all data belonging to the same city will go to the same partition. For example, Partition with Newyork City will have all employees from Newyork city.
Partitioned based on city
You can also join data using Hive. For example, we have department and employee data on HDFS and we want to join this data to fetch our employee with department information in the final result. There are various types of joins supported by Hive – Inner Join, Left Outer Join and Right Outer Join. You can join multiple tables together using Join queries in Hive.
Let’s take an example, Suppose we have employee and department information at two separate locations and I want to join this data to get the employee name & his department name in the final result. Below figures represent the dataset and final outcome after applying inner join.
Inner Join based on EmpId
Apache Hive is liked by Business Analyst, Testers, Data Analyst & even Developers because of its flexibility and easiness to run SQL-Like query on top of Big Data solutions companies.
However, there are other ETL tools coming into the marketplace everyday such as Presto, Impala but Hive has already secured its place and is well used across the industry.