Within the Hadoop ecosystem, you can use HDFS to load and store data and MapReduce to do both simple and hardcore processing. One of the missing pieces to the puzzle that is familiar to data warehousing professionals is the ability to interact with the data. Enter HIVE. Hive got is start at Facebook as they struggled to deal with the massive quantity of data accumulating daily within their Hadoop cluster.
While it was easy for developers to write MapReduce jobs in a variety of languages, there was no easy was to expose the data in a familiar paradigm for analyst. In response, Facebook developed Hive as a data warehousing layer over Hadoop. This new layer allowed for data to be exposed as tables that supported both simple and sophisticated ad-hoc queries using a query language called HiveQL to summarize and analyze data.
With a little bit of background behind us, let's look at using Hive on HDInsight.
HDInsight Hive Demostration
1. Start by downloading the sample IIS Log file, which we will use throughout the demo from HERE. Once downloaded, unpack the zip to the c:\temp\ directory on your HDInsight instance.
2. Let's upload the file to HDFS using the Hadoop command-line. For this demo the HDFS path I am using is /user/Administrator/input but yours may vary:
hadoop fs -copyFromLocal c:\temp\w3c.txt /user/Administrator/input
3. Browse the HDFS file system to ensure was loaded successful:
hadoop fs -ls /user/Administrator/input
4. Launch the Hive command-line by entering "Hive" in the Hadoop command-line.
5. We will be working with an internal table which is denoted by the absences of the EXTERNAL keyword in the create statement. Internal tables actually move the data that is added to the table to the \home\hive\warehouse directory in HDFS. External tables on the other had are metadata only and should be used if data is going to be shared among multiple Hadoop applications. When data is stored in Hive its stored using Derby by default although other options are available for storage. It's also worth noting that Hive tables can be partitioned which can significantly improve performance. I will discuss partitioning and bucketing in a future blog post.
Create a table to hold the IIS Log data using the following script:
drop table iislog;
create external table iislog (
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
Alternatively, you download the above script in an HQL file HERE. The HQL script can be run from the Hadoop command-line using the following command:
hive -f c:\temp\create_iislog_table.hql
6. After the script completes, we can browse the Hive tables and individual table structures using:
Take note that each hive query statement is terminated with a semicolon ';'.
7. Now, lets load data into our new Hive table from our IIS log file:
load data inpath '/user/Administrator/input/w3c.txt'
overwrite into table iislog;
8. Getting data into a Hive table is only the tip of the proverbial iceberg. The most power part of Hive is the Hive Query Language (HiveQL or HQL). HiveQL is a language loosely based on the SQL-92 standard that allow for data to be retrieved much as it is in a modern RDBMS such as SQL Server. While the syntax is familiar, behind the scenes HiveQL translates the sql-like syntax into MapReduce jobs, which are submitted to Hadoop for execution.
You will find many of the expected features such as joins and aggregations and many others you may not be as familiar with such as: Multi-Table Inserts, Sampling, Arrays and Map Support. In addition, to further extend the possibilities HiveQL supports the use of custom map/reducers within the data stream. With a basic understanding of HiveQL, let's look at a few examples to query data from the iislog table.
In the query below we start by using a correlated subquery to filter out the header rows from the IIS log file. Next, we simply do a count, grouping by the URL to get a count of hits per page within the logfile. We you execute the command take not that a MapReduce job is actually created and executed for you as detailed above.
SELECT sdate,csuristem from iislog WHERE sdate NOT RLIKE '.*#.*'
GROUP BY (filtered.csuristem);
The next example, we will perform a similar operation, except this time we will use the average aggregate to calculate the average time taken for each page.
SELECT sdate,csuristem,timetaken from iislog WHERE sdate NOT RLIKE '.*#.*'
GROUP BY (filtered.csuristem);
As a final point and to further illustrate what's going on behind the scenes to return your results to you, open the Map/Reduce Administration link found on the desktop. When you scroll down to the Completed Jobs section you will notice that all the work you have done to this point is found in the for of Map/Reduce jobs.
Before wrapping up, I want to point out a couple of other interesting things as well. First, if you want to extract the results of the query into a file you prepend the following code on either of the select statements:
insert overwrite directory '/user/Administrator/output/results.txt'
Also, you can use the EXPLAIN command in conjunction with the select statement to get a glimpse of some of what is going on behind the scenes.
You can clean-up the example by dropping the hive table using:
DROP TABLE iislog;
The focus of this post has been primarily on Hive itself. I would be remiss if I didn't at least mention the availability of the HiveODBC component which allows you access and incorporation data from Hive tables into tools such as Excel and other environments such as an existing SQL Server data warehouse or Analysis Services deployment. We will dig more into HiveODBC and integration in future blog posts.
I hope this post has opened your eyes to what's possible with Hive and Hadoop.
Till next time,