In my previous post, we discussed the map, array and struct data types and their implementation in Hive. Continuing on the Hive theme, this post will introduce partitioning and bucketing as methods for segmenting large data sets to improve query performance.
If you have previous experience working in the relational database world, then the concept of partitions and partitioning will not be new. Partitions are fundamentally horizontal slices of data which allow large sets of data to be segmented into more manageable chunks.
Partitioning in this manner takes many different forms, with boundaries defined on either a single value or a range of values for the one or more columns that act as a slicer. This is commonly seen in data warehouse environments using dates (such as transaction or order date) and occasionally geography to partition large fact tables.
In SQL Server, this partitioning support (for single columns) is built in through the use of partition schemes and functions. In Hive, partitioning is also built in for both managed and external tables through the table definition, as seen below.
CREATE TABLE customer (
    -- column list shown for illustration only
    customer_id INT,
    name STRING,
    email STRING
)
PARTITIONED BY (region STRING, country STRING);
A couple of interesting things become apparent as you digest the sample above:
- Multiple 'slicer' or partitioning columns are supported (i.e. Region/Country)
- The partitioning columns do not need to be included in the table definition, as they function as both the slicer and a column. You will still be able to include them in the query projection.
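To illustrate the second point, a query against our Customer table can still project the partitioning columns even though they are not part of the column list (the name column here is purely illustrative):

```sql
SELECT name, region, country
FROM customer;
```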
Behind the scenes, the partitioning statement alters the way managed tables are stored. If you browse HDFS, or whatever underlying file system you are using, the normal file structure for a given database and table looks like <database>.db/<table>, or in the case of our example, ERP.db/Customer. Data files are then written directly to the Customer directory.
When you define the table with partitions, the underlying structure is changed such that sub-directories are created for each slicer or partitioning column. If multiple slicers are specified, the sub-directories will be nested based on the order of the columns in the definition statement. To provide a better visual picture, let's circle back to our Customer table with Region and Country partitions and see how the directories now enumerate:
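Assuming a handful of hypothetical region and country values, Hive creates one sub-directory per key=value pair, nested in column order:

```
ERP.db/Customer/region=NA/country=US/
ERP.db/Customer/region=NA/country=CA/
ERP.db/Customer/region=EMEA/country=DE/
ERP.db/Customer/region=EMEA/country=FR/
```

Data files for each partition land inside the deepest matching directory.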
Now when a Hive query is issued that contains either one or both of our partition slicers in the criteria or predicate (the WHERE clause) what effectively happens is partition elimination where only those data directories that are needed are scanned. In the event that the slicers are not used all directories will be scanned (i.e. a table scan) and the partitioning has no effect.
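For example, a query that filters on both slicers, like the sketch below (the literal values are hypothetical), would only scan the single matching sub-directory:

```sql
SELECT *
FROM customer
WHERE region = 'NA'
  AND country = 'US';
```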
A few final important notes on partitioning:
Although the term bucketing may not be familiar to you, you are already familiar with the concept behind it. Much like partitioning, bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. The syntax to create a table with bucketing is listed below:
CREATE TABLE `order` (
    -- column list shown for illustration only; ORDER is a
    -- reserved word in HiveQL, hence the backticks
    order_id INT,
    username STRING,
    order_total DOUBLE
)
PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Now that you've seen it, let's talk about how it works. Unlike partitioning, where each value of the slicer or partition key gets its own space, in clustering a hash is taken of the field value and the rows are then distributed across buckets. In the example above, we created 25 buckets and are clustering on username. Each bucket would then contain multiple users, while restricting each username to a single bucket.
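Conceptually, the bucket a row lands in is the hash of the clustering column modulo the bucket count. You can sketch the idea in HiveQL itself using the built-in hash() and pmod() functions (the staging table name is hypothetical, and this is only an illustration of the assignment logic, not how you load the table):

```sql
-- which of the 25 buckets would each username hash into?
SELECT username,
       pmod(hash(username), 25) AS bucket_number
FROM some_staging_table;
```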
There's a danger with this technique, though. Bucketing is entirely dependent on data being correctly loaded to the table. To properly load data to a table utilizing this technique, you need to either:
- Set hive.enforce.bucketing = true before inserting, so that Hive allocates one reducer per bucket and hashes the rows for you, or
- Manually set the number of reduce tasks to match the bucket count (25 in our example) and use CLUSTER BY on the bucketing column in your INSERT ... SELECT statement.
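A minimal sketch of the first approach, assuming a hypothetical staging table orders_staging (the column names and the 'Acme' partition value are illustrative):

```sql
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE `order` PARTITION (company = 'Acme')
SELECT order_id, username, order_total
FROM orders_staging
WHERE company = 'Acme';
```

With the setting enabled, Hive handles the hashing and bucket file layout during the insert.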
In this post, we introduced the horizontal partitioning capabilities within Hive. In addition to the partitioning capabilities you may already be familiar with, Hive also supports the ability to bucket data, which distributes data more evenly when a more natural slicer or key is not available. In the next post we will introduce Hive Indexes and Views.
Till next time!