With data volumes growing every day, the need for Business Intelligence is greater than ever. Business Intelligence is about analyzing data to gather insights that drive better business decisions, and this analysis can be performed with the help of an ETL tool.
Apache Hive is an ETL tool that can query large datasets stored on the Hadoop Distributed File System (HDFS). The Hive framework was designed at Facebook for Business Analysts who had limited knowledge of programming languages but were excellent at writing SQL queries to find business insights.
Characteristics of Apache Hive
Schema on Read:
Hive is not a database because it does not keep data and schema together. Data is always stored on the Hadoop Distributed File System; Hive just imposes a schema on top of that data at read time, which is why this is called schema on read.
SQL like Syntax:
Hive uses an SQL-like language called Hive Query Language (HQL) to perform analytics on top of data. Apache Hive is popular among Business Analysts and testers who are well acquainted with SQL queries.
Apache Hive is built on top of the Hadoop framework and runs MapReduce jobs in the background. HQL queries are converted into Java programs, which run as Mappers and Reducers on Hadoop to perform the analysis.
Hive provides an interface between the user and HDFS to run HQL queries against data stored in the Hadoop file system. Interfaces supported by Hive include the Hive Command Line, Hive Web UI and Hive HDInsight.
Hive stores the schema and table metadata such as column names, column types and HDFS location (where the data is actually stored) in a separate database called the metastore. By default, an embedded Derby database is used to store the schema, but it can be replaced by another database such as PostgreSQL or MySQL for production environments.
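As a sketch, pointing the metastore at MySQL is done through JDBC properties in `hive-site.xml`; the hostname, database name and credentials below are placeholders you would replace with your own:

```xml
<!-- hive-site.xml: switch the metastore from embedded Derby to MySQL -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- placeholder host and database name -->
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <!-- placeholder password -->
    <value>hivepassword</value>
  </property>
</configuration>
```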
HiveQL Process Engine:
This process engine communicates with the metastore to parse and compile HQL queries.
It processes HQL queries and creates the corresponding MapReduce jobs, which run on top of the Hadoop Distributed File System to access the data and produce the final output.
Create Database in Hive
To start working with Hive, we need a database in which we can create tables. The commands below can be used to create, describe and drop a database in Hive.
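A minimal sketch of these commands (the database name `company` is illustrative):

```sql
-- Create a database; IF NOT EXISTS avoids an error if it already exists
CREATE DATABASE IF NOT EXISTS company;

-- Show the database's name, comment and HDFS location
DESCRIBE DATABASE company;

-- Switch to the database before creating tables in it
USE company;

-- Drop the database; CASCADE also drops any tables inside it
DROP DATABASE IF EXISTS company CASCADE;
```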
There are two types of tables in Hive – external and managed. The major difference between the two is that if you drop a managed table, both the data and the schema attached to the table are deleted, whereas if you drop an external table, only the schema associated with the table is deleted; the data remains intact on HDFS.
Create Managed Table
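A sketch of a managed table definition (the table name and columns are illustrative):

```sql
-- Managed table: Hive owns both the schema and the underlying data,
-- so DROP TABLE removes the files from HDFS as well
CREATE TABLE employees (
  emp_id INT,
  name   STRING,
  city   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```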
Create External Table
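The same table defined as external (the HDFS path is a placeholder):

```sql
-- External table: Hive tracks only the schema; dropping the table
-- leaves the files at the given LOCATION untouched on HDFS
CREATE EXTERNAL TABLE employees_ext (
  emp_id INT,
  name   STRING,
  city   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/employees';
```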
Hive supports various ways to perform data analysis, such as:
You can aggregate data using the aggregation functions provided by Hive, such as count, sum and avg.
You can order the data based on one or more columns. Ordering/sorting in a distributed system is a little different from a traditional system: either you sort the data across all machines, which results in data shuffling, or you sort the data within each machine. Sorting within a single machine does not guarantee order across all machines.
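Hive exposes this distinction as `ORDER BY` versus `SORT BY` (the `employees` table and columns here are illustrative):

```sql
-- ORDER BY: a total order across all machines, enforced by funnelling
-- everything through a single reducer (slower, but globally sorted)
SELECT name, salary FROM employees ORDER BY salary DESC;

-- SORT BY: sorts within each reducer only; faster, but the combined
-- output is not guaranteed to be ordered across reducers
SELECT name, salary FROM employees SORT BY salary DESC;
```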
You can also join data based on some keys and then perform aggregation on top of that joined data.
With the concept of windowing in Hive, you can perform analysis on data falling in some window. For example, you may want to find the top searched websites in the last 10 minutes or the highest-selling product in the last 7 days.
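As a sketch of the second example, a windowing function can rank products by units sold over the last 7 days (the `sales` table with `sale_date`, `product` and `units` columns is an assumption for illustration):

```sql
-- Highest-selling product over the last 7 days
SELECT product, total_units
FROM (
  SELECT product,
         SUM(units) AS total_units,
         -- RANK() is computed over the grouped totals
         RANK() OVER (ORDER BY SUM(units) DESC) AS rnk
  FROM sales
  WHERE sale_date >= DATE_SUB(CURRENT_DATE, 7)
  GROUP BY product
) ranked
WHERE rnk = 1;
```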
You can also partition your data on HDFS. Partitioning can make queries run faster. For example, in population data, you can partition first by country and then by city. If I want to aggregate population data for a particular country/city, I can jump directly to that partition instead of scanning the entire dataset, which ultimately decreases query latency.
In this blog, we will look at aggregation, partitioning and joining.
Suppose I want to find all cities in which the number of employees is greater than 100. I can run the HQL query below.
City with more than 100 Employees
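A sketch of that query against the illustrative `employees` table:

```sql
-- Cities with more than 100 employees
SELECT city, COUNT(*) AS num_employees
FROM employees
GROUP BY city
HAVING COUNT(*) > 100;
```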
There are numerous built-in aggregation functions provided by Hive, such as sum, min, max and avg. In addition, Hive provides string, date and math functions for formatting and more advanced analytics.
As already discussed, partitioning makes HQL queries run faster. In the example below, we partition the employee data by city. Running the query creates physical directories on HDFS, one per city, and all data belonging to the same city goes to the same partition. For example, the New York partition will contain all employees from New York.
Partitioned based on city
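A sketch of the partitioned table and a dynamic-partition load from the unpartitioned table (table names are illustrative; the `SET` lines enable dynamic partitioning, which may be off by default):

```sql
-- Partitioned table: each distinct city becomes its own HDFS directory
CREATE TABLE employees_by_city (
  emp_id INT,
  name   STRING,
  salary DOUBLE
)
PARTITIONED BY (city STRING);

-- Allow Hive to create partitions dynamically from the city column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list
INSERT OVERWRITE TABLE employees_by_city PARTITION (city)
SELECT emp_id, name, salary, city FROM employees;
```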
You can also join data using Hive. For example, we may have department and employee data on HDFS and want to join them to include department information with each employee in the final result. Hive supports various types of joins: inner join, left outer join and right outer join. You can join multiple tables together using join queries.
Let's take an example. Suppose we have employee and department information at two separate locations, and I want to join these datasets to get each employee's name and department name in the final result. The figures below represent the datasets and the final outcome after applying an inner join.
Inner Join based on EmpId
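A sketch of the join (the `employees` and `departments` tables, and the `emp_id` join key matching the EmpId column in the figures, are illustrative):

```sql
-- Inner join on EmpId: only employees that have a matching
-- department row appear in the result
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d
  ON e.emp_id = d.emp_id;
```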
Apache Hive is liked by Business Analysts, testers, Data Analysts and even developers because of its flexibility and the ease of running SQL-like queries on top of Big Data. New tools such as Presto and Impala keep entering the marketplace, but Hive has already secured its place and is widely used across the industry.