Sudhish Koloth
4 min readDec 5, 2020

Critical Role of Hive Big Data for Analytics

Author : Sudhish Koloth

Abstract

This article helps all the new big data developers to get more insight into the Hive big data table. The power of Hive tables is its extreme use cases, which fits various data analytics needs.

Keywords: Hive, Big Data, part files, Hdfs

1. Hive Big Data Table

Hive Big Data table is one of the stacks which comes along with Hadoop infrastructure. Hive tables are mapped to Hadoop file systems, and it is data warehousing built on Hadoop. Like all other RDMS and NoSQL tables, the Hive tables also have columns and rows.

The simplest way to create the Hive tables is to create an HQL file and run on the Hadoop edge nodes. There are two types of Hive table that developers can create, i.e., Managed table or External Table. Managed tables are managed and controlled by Hive and support ACID, whereas External tables data does not support ACID, and it is not tightly coupled. In most analytics-based applications, we prefer an External table. Below is one same example on how to create a customer events external table

Table Creation

For the External table, you have to use the keyword External. In the above example, database/schema is called CUSTOMER, and the table under it is customer_click_events. Followed by all the column names and partition columns and location where data resides. Now we can go over the downside and benefits of this design.

1.1 Importance of Partition Columns

The most important step while creating Hive tables is to decide partition columns.

The partition column decision is purely driven on how the data analytics team uses the data and generates reports or intelligence based on it. So in the above Hive table, I am taking an example of a typical use case of the customer events that any sector/business will have. This use case is to find how many customers purchased company products/or signed up and demographics related to the customer base. Let’s say we want to find number of customers who purchased product in particular zip code during the day.

So, here in the table, we can partition the table by day at least. To empower great performance results and powerful analytics, it is suggested to identify more data patterns and decide to partition those columns or some cases you can use clustering command for sorting the data. Here in this example I used partition by day, hour and zip code. This will allow me to pull data based on in a day, for each hour and for each zip code I would like to know total order purchased.

It is as simple as commanding, hey Siri — “ can you tell me for today and for previous hour, how many purchasing happened in zip code 10001?”.

There is no hard rule on choosing particular column/columns as a partition column. Data analytics patterns and their needs purely drive it.

The Hive table for the above example will have data organized into the Hdfs folder in ORC format, specified while creating a table. Also, as you observed, there are few table properties settings that you can use for performance benefits.

Partition Layout

1.1 Partition Data Benefits and considerations

This use case fits small and medium scale companies where the customer is geographically focused on specific regions or cities.

For big companies where the customer base is vast, you have to reconsider your partition columns. For the above example, companies with massive customer bases where you have customers in all the zip codes. And in the United States alone, we have around 40K zip codes, so each hour creates 40K directories, and in a day, it creates 24*40K. So sometimes there are issues in NameNodes when there are lots of directories. In this case, the better approach would be partition by day and hour, and this will have only 24 folders per day. Also note, as a practice, write bigger files in hdfs folder rather than writing smaller files.

Deciding on these factors depends on the data’s size, business analytics needs, and available infrastructure.

2. Analytics & Reporting

One of the main advantages of Hive tables are reporting and smart data analytics. For the above example, it is easy for companies to find out how many customers purchased the product. Also, it will give opportunities to identify which section of people are interested in their product

It also helps them analyze what pitfalls in the applications where customers are unable to go further are. Some customers added items to the cart, but for some reason it became difficult to understand the user interface and exit the page/app. So these things help to determine areas of improvement. In conjunction, reports can also be made on top of this data and publish to the dashboard.

· Hive table helps in-depth drive analysis on customer actions.

· Hive table is robust to extract any historical data

· Data Analytics results can be used for overall product improvement

· Reports helps determine budgeting

· In some cases, fraudulent analytics can be performed

· The historical data comes in handy during any audit occurs.

· These data are further used for machine learning and build AI on top of this