Why Is Apache Hive Used to Perform Advanced Analytics?


With data growing every day, the need for business intelligence is thriving like never before. Business intelligence is about analyzing data to gather insights that drive better business decisions, and this analysis can be performed with the help of ETL tools.

Apache Hive is an ETL tool that can query large datasets stored on the Hadoop Distributed File System (HDFS). The Hive framework was designed at Facebook for business analysts who had limited knowledge of programming languages but were excellent at writing SQL queries to find business insights.

Characteristics of Apache Hive


Schema on Read:

Hive is not a database because it does not keep data and schema together. Data is always stored on the Hadoop distributed file system; Hive simply imposes a schema on top of that data while reading it, which is why this approach is called schema on read.
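
As a minimal sketch of this idea, assuming a pipe-delimited file already sits at the hypothetical HDFS path /user/cloudera/raw_events, the same bytes can be read through two different schemas:

-- Two external tables reading the same (hypothetical) HDFS directory.
-- The schema is applied only when the data is read, never on write.
CREATE EXTERNAL TABLE events_raw (line STRING)
LOCATION '/user/cloudera/raw_events';

CREATE EXTERNAL TABLE events_typed (eventId INT, eventName STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/cloudera/raw_events';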

SQL like Syntax:

Hive uses an SQL-like language called Hive Query Language (HQL) to perform analytics on top of data. Apache Hive is popular among business analysts and testers who are well acquainted with SQL queries.

Higher Abstraction:

Apache Hive is built on top of the Hadoop framework and runs MapReduce jobs in the background. HQL queries are compiled into MapReduce jobs, which run as mappers and reducers on the Hadoop cluster to perform the analysis.

Hive Architecture


User Interface:

Hive provides an interface between the user and HDFS to run HQL queries against data stored in the Hadoop file system. The interfaces supported by Hive include the Hive command line, the Hive Web UI, and Hive HDInsight.

Meta Store:

Hive stores the schema and table metadata, such as column names, column types, and the HDFS location where the data is actually stored, in a separate database. By default, an embedded Derby database is used to store this schema, but it can be replaced by another database such as PostgreSQL or MySQL for production environments.
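
A quick way to see what the metastore holds for a table (a sketch, using the employees table created later in this post):

-- Prints the column types, HDFS location, table type (managed/external)
-- and other metadata that Hive keeps in the metastore.
DESCRIBE FORMATTED employees;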

HiveQL Process Engine:

This process engine communicates with the metastore to parse and compile HQL queries.

Execution Engine:

It processes the compiled HQL query, creating a corresponding MapReduce job that runs on top of the Hadoop distributed file system to access the data and produce the final output.
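
To see what the execution engine will run, you can prefix any query with EXPLAIN; the sketch below assumes the employees table created later in this post:

-- Shows the MapReduce stages Hive compiles the query into,
-- without actually executing it.
EXPLAIN
SELECT city, count(*) FROM employees GROUP BY city;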

Create Database in Hive

To start working with Hive, we need a database in which we can create tables. The commands below can be used to create, describe, and drop a database in Hive.

Hive Database

CREATE DATABASE IF NOT EXISTS company;
DESCRIBE DATABASE company;
DROP DATABASE company;
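
Two related commands are also worth knowing; a small sketch:

SHOW DATABASES;   -- list all databases
USE company;      -- make company the current database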

External/Managed Tables

There are two types of tables in Hive: external and managed tables. The major difference between the two is that if you drop a managed table, both the data and the schema attached to the table are deleted, whereas if you drop an external table, only the schema associated with the table is deleted and the data remains intact on HDFS.

Create Managed Table

CREATE TABLE employees (
  empId INT,
  fName STRING,
  lName STRING,
  city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Create External Table

CREATE EXTERNAL TABLE employees (
  empId INT,
  fName STRING,
  lName STRING,
  city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/cloudera/employees';
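
To populate the managed table, a LOAD DATA statement moves an existing HDFS file into the table's directory; the file path below is hypothetical:

-- Moves the file from the given HDFS path into the table's warehouse directory.
LOAD DATA INPATH '/user/cloudera/employees.txt' INTO TABLE employees;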

Hive Analysis


Hive supports various ways to perform data analysis, such as:

Aggregation:

You can aggregate data using the various aggregation functions provided by Hive, such as count, sum, and avg.

Ordering:

You can order the data based on one or more columns. Ordering/sorting in a distributed system is a little different from traditional systems: either you sort the data across all machines, which results in data shuffling, or you sort the data within each machine. Sorting the data within a single machine does not guarantee order across all machines.
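
In HQL terms, ORDER BY gives a total order (all rows funnel through a single reducer), while SORT BY sorts only within each reducer; a sketch on the employees table:

-- Total order across the whole dataset (funnels through a single reducer):
SELECT empId, city FROM employees ORDER BY city;

-- Sorted only within each reducer; faster, but no global ordering:
SELECT empId, city FROM employees SORT BY city;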

Joining:

You can also join data based on some keys and then perform aggregation on top of the joined data.

Windowing:

With the concept of windowing in Hive, you can perform analysis on data falling within some window. For example, you may want to find the most searched websites in the last 10 minutes or the highest-selling product in the last 7 days.
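
As a sketch of the second example, assuming a hypothetical sales table with product, qty, and sold_on columns (and Hive 1.2+ for current_date):

-- Rank products by units sold over the last 7 days.
SELECT product, total_qty,
       rank() OVER (ORDER BY total_qty DESC) AS sales_rank
FROM (
  SELECT product, SUM(qty) AS total_qty
  FROM sales
  WHERE sold_on >= date_sub(current_date, 7)
  GROUP BY product
) t;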

Partitioning:

You can also partition your data on HDFS. Partitioning can make queries run faster. For example, in population data, you can partition the data based on country and then on city. If I want to perform some aggregation on population data for a particular country or city, I can jump directly to the matching partition instead of scanning the entire dataset, which ultimately decreases query latency.
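
As a sketch, assuming a hypothetical population table partitioned by country and city with a pop_count column, a filter on the partition columns scans only the matching directories:

-- Only the directories for country='US' and city='New York' are scanned.
SELECT SUM(pop_count)
FROM population
WHERE country = 'US' AND city = 'New York';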

In this blog, we will look into aggregation, partitioning, and joining.

Aggregation

Suppose I want to find all the cities in which the number of employees is greater than 100. I can run the HQL query below.

City with more than 100 Employees

SELECT city, count(*)
FROM employees
GROUP BY city
HAVING count(*) > 100;

Hive provides numerous built-in aggregation functions such as sum, min, max, and avg. In addition, Hive provides string, date, and math functions for formatting and more advanced analytics.
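
For instance, assuming a salary column on the employees table (not part of the earlier DDL), string and math functions can be mixed into an aggregation:

-- upper() is a string function; round() and avg() are math/aggregate functions.
SELECT upper(city) AS city_name, round(avg(salary), 2) AS avg_salary
FROM employees
GROUP BY upper(city);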

Partitioning

As already discussed, partitioning makes HQL queries run faster. In the example below, we partition the employees data based on the employee's city. Running the query below creates physical directories on HDFS, one per city, and all data belonging to the same city goes to the same partition. For example, the New York partition will hold all employees from New York.

Partitioned based on city

CREATE EXTERNAL TABLE employees_partitioned (
  empId INT,
  fName STRING,
  lName STRING)
PARTITIONED BY (city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
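
To actually populate the partitions, a dynamic-partition insert can copy rows from the flat employees table created earlier; the two SET properties shown are standard Hive settings:

-- Allow fully dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (city) must come last in the SELECT list.
INSERT OVERWRITE TABLE employees_partitioned PARTITION (city)
SELECT empId, fName, lName, city FROM employees;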

Joining

You can also join data using Hive. For example, suppose we have department and employee data on HDFS and we want to join them to fetch each employee along with their department information in the final result. Hive supports various types of joins: inner join, left outer join, right outer join, and full outer join. You can join multiple tables together using join queries in Hive.

Let's take an example. Suppose we have employee and department information at two separate locations, and we want to join them to get the employee name and department name in the final result. The figure below represents the datasets and the final outcome after applying an inner join.

[Figure: employee and department datasets and the inner-join result]

Inner Join based on EmpId

SELECT e.emp_name, d.dep_name
FROM employee e
JOIN department d ON (e.dep_id = d.dep_id);
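
A left outer join, for contrast, keeps every employee even when no matching department exists; a sketch on the same tables:

-- Employees with no matching department row get NULL for dep_name.
SELECT e.emp_name, d.dep_name
FROM employee e
LEFT OUTER JOIN department d ON (e.dep_id = d.dep_id);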

Conclusion

Apache Hive is liked by business analysts, testers, data analysts, and even developers because of its flexibility and the ease of running SQL-like queries on top of big data.

However, while other tools such as Presto and Impala keep entering the marketplace, Hive has already secured its place and is widely used across the industry.
