
Advanced Analytics Using Hive


With the growing size of data every day, the need for Business Intelligence is also thriving like never before. Business Intelligence is about analyzing data to gather business insights and take better business actions, and this analysis can be performed with the help of an ETL tool.

Apache Hive is an ETL tool that can query large datasets stored on the Hadoop Distributed File System (HDFS). The Hive framework was designed at Facebook for Business Analysts who had limited knowledge of programming languages but were excellent at running SQL queries to find business insights.

Characteristics of Apache Hive


Schema on Read:

Hive is not a database because it does not keep data and schema together. Data is always stored on the Hadoop Distributed File System; Hive just imposes a schema on top of that data, and only while reading it, which is why this approach is called schema on read.

SQL like Syntax:

Hive uses a SQL-like language called Hive Query Language (HQL) to perform analytics on top of data. Apache Hive is extremely popular among Business Analysts & Testers who are well acquainted with SQL queries.

Higher Abstraction:

Apache Hive is built on top of the Hadoop framework and runs MapReduce jobs in the background. HQL queries are converted into Java programs, which run as Mappers & Reducers on top of Hadoop to perform the analysis.

Hive Architecture


User Interface:

Hive provides an interface between the user and HDFS to run HQL queries against data stored in the Hadoop File System. The interfaces supported by Hive include the Hive Command Line, Hive Web UI and Hive HD Insight.

Meta Store:

Hive stores the schema and table metadata, such as column names, column types and the HDFS location where the data is actually stored, in a separate database. By default, an embedded Derby database is used to store the schema, but it can be replaced by any other database such as PostgreSQL or MySQL for production environments.
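As a rough sketch, pointing the metastore at an external MySQL database typically means setting the JDBC connection properties in hive-site.xml. The host name, database name and credentials below are placeholders, not values from this post:

```xml
<!-- hive-site.xml: store Hive's schema metadata in MySQL instead of Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

The MySQL JDBC driver jar must also be on Hive's classpath for this to work.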

HiveQL Process Engine:

This process engine communicates with the metastore to parse and compile HQL queries.

Execution Engine:

It processes HQL queries to create the corresponding MapReduce jobs, runs them on top of the Hadoop Distributed File System to access the data, and produces the final output.
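You can inspect what the engine plans to run by prefixing a query with the EXPLAIN keyword; for example, assuming the employees table created later in this post:

```sql
-- Print the stages and map/reduce operators Hive will execute for this query
EXPLAIN
SELECT city, count(*)
FROM employees
GROUP BY city;
```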

Create Database in Hive

To start working with Hive, we need a database in which we can create tables. The commands below can be used to create, describe and drop a database in Hive.

Hive Database

CREATE DATABASE IF NOT EXISTS company;
DESCRIBE DATABASE company;
DROP DATABASE company;

External/Managed Tables

There are two types of tables in Hive – External & Managed tables. The major difference between the two is that if you drop a Managed table, both the data & the schema attached to the table are deleted, whereas if you drop an External table, only the schema associated with the table is deleted and the data remains intact on HDFS.

Create Managed Table

CREATE TABLE employees (
  empId INT,
  fName STRING,
  lName STRING,
  city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
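Once the managed table exists, data can be loaded into it, for example from a local pipe-delimited file. The file path below is a placeholder:

```sql
-- Copy a local '|'-delimited file into the table's warehouse directory
LOAD DATA LOCAL INPATH '/tmp/employees.txt'
OVERWRITE INTO TABLE employees;
```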

Create External Table

CREATE EXTERNAL TABLE employees (
  empId INT,
  fName STRING,
  lName STRING,
  city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/cloudera/employees';

Hive Analysis


Hive supports various ways to perform data analysis, such as:

Aggregation:

You can aggregate data using the various aggregation functions provided by Hive, such as count, sum and avg.
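A minimal sketch over the employees table defined above, combining two of these functions in one query:

```sql
-- Total number of employees and number of distinct cities they live in
SELECT count(*) AS total_employees,
       count(DISTINCT city) AS distinct_cities
FROM employees;
```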

Ordering:

You can order the data based on one or more columns. Ordering/sorting in a distributed system is a little different from in a traditional system: either you sort the data across all machines, which results in data shuffling, or you sort the data within a single machine. Sorting the data within a single machine does not guarantee order across all machines.
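In HQL these two strategies correspond to ORDER BY, which produces a total order by funnelling everything through a single reducer, and SORT BY, which only orders each reducer's output. A sketch using the employees table above:

```sql
-- Total order across all data (single reducer; can be slow on large datasets)
SELECT empId, fName FROM employees ORDER BY fName;

-- Order only within each reducer's output (no global ordering guarantee)
SELECT empId, fName FROM employees SORT BY fName;
```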

Joining:

You can also join data based on some keys and then perform aggregation on top of that joined data.

Windowing:

With the concept of windowing in Hive, you can perform analysis on data falling within some window. For example, you may want to find the most searched websites in the last 10 minutes or the highest selling product in the last 7 days.
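Hive expresses this through window functions with an OVER clause. A sketch, assuming a hypothetical sales table with sale_date, product and amount columns (none of which appear elsewhere in this post):

```sql
-- For each day, compute the day's total revenue and rank products by amount
SELECT sale_date,
       product,
       sum(amount) OVER (PARTITION BY sale_date) AS daily_total,
       rank() OVER (PARTITION BY sale_date ORDER BY amount DESC) AS day_rank
FROM sales;
```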

Partitioning:

You can also partition your data on HDFS. Partitioning data can make queries run faster. For example, in population-based data, you can partition the data by country and then by city. If I want to perform some aggregation on population data for a particular country/city, I can jump directly to the partition for that country/city instead of scanning the entire dataset, which ultimately decreases query latency.

In this blog, we will look into Aggregation, Partitioning and Joining.

Aggregation

Suppose I want to find all the cities in which the number of employees is greater than 100. I can run the below HQL query.

City with more than 100 Employees

SELECT city, count(*)
FROM employees
GROUP BY city
HAVING count(*) > 100;

There are numerous built-in aggregation operations provided by Hive, such as sum, min, max and avg. In addition, Hive provides string, date and math functions for formatting and more advanced analytics.

Partitioning

As already discussed, partitioning makes HQL queries run faster. In the below example, we partition the employees data by employee city. Running the below query creates physical directories on HDFS, one per city, and all data belonging to the same city goes to the same partition. For example, the partition for New York will hold all employees from New York.

Partitioned based on city

CREATE EXTERNAL TABLE employees_partitioned (
  empId INT,
  fname STRING,
  lname STRING)
PARTITIONED BY (city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
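One way to populate such a table is a dynamic partition insert, sketched below with the unpartitioned employees table from earlier as the source. Hive derives the city partition from the data itself, which requires enabling dynamic partitioning first:

```sql
-- Allow Hive to create partitions from the values in the data
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (city) must come last in the SELECT list
INSERT OVERWRITE TABLE employees_partitioned PARTITION (city)
SELECT empId, fName, lName, city FROM employees;
```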

Joining

You can also join data using Hive. For example, if we have department and employee data on HDFS, we can join the two to produce employees together with their department information in the final result. Hive supports various types of joins – Inner Join, Left Outer Join and Right Outer Join. You can join multiple tables together using join queries in Hive.

Let’s take an example. Suppose we have employee and department information at two separate locations, and I want to join these datasets to get each employee’s name & department name in the final result. The below figures represent the datasets and the final outcome after applying an inner join.


Inner Join based on EmpId

SELECT e.emp_name, d.dep_name
FROM employee e
JOIN department d
  ON (e.dep_id = d.dep_id);
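For comparison, a left outer join keeps every employee even when no matching department exists, returning NULL for the department columns in that case:

```sql
-- Keep all employees; dep_name is NULL for employees with no matching department
SELECT e.emp_name, d.dep_name
FROM employee e
LEFT OUTER JOIN department d
  ON (e.dep_id = d.dep_id);
```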

Conclusion

Apache Hive is liked by Business Analysts, Testers, Data Analysts & even Developers because of its flexibility and the ease of running SQL-like queries on top of Big Data. New tools such as Presto and Impala keep arriving in the marketplace, but Hive has already secured its place and is widely used across the industry.

