In previous posts, we looked at what it takes to get started with Hadoop on Windows using HDInsight. We also looked at Hive, the data warehousing framework built on top of Hadoop. In this post, we will dig a little deeper into the Hadoop ecosystem, focusing on the parallel language and runtime known as Pig.
Pig: More than just bacon
Pig got its start at Yahoo in 2006, originally created as a research tool intended to allow for ad-hoc queries and exploration of large semi-structured data sets. Pig consists of two parts:
- Pig Latin - a high-level procedural language for querying and processing data
- Pig Engine - parses, optimizes, and then compiles and distributes queries using Map/Reduce
The Pig Engine is built on top of Map/Reduce, with Pig Latin serving as a layer of abstraction that reduces the time and complexity needed to approach large-scale data. This abstraction makes data exploration and rapid iteration more efficient: empirically, working in Pig takes about 5% of the development time and results in about 5% of the code of an equivalent Map/Reduce program.
Functionally, Pig is a hybrid between SQL and Map/Reduce. Common SQL operations such as joins, filters, and grouping are supported, while Pig also offers the flexibility of the optional, runtime-applied data schemas found in Map/Reduce. Other features include query optimization and automatic query rewriting, which are found in neither SQL nor Map/Reduce. With that, let's dive in and take a look.
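To make the SQL comparison concrete, here is a sketch of a familiar SQL aggregation expressed in Pig Latin. The file path, relation names, and field names below are purely illustrative, not part of the example we work through later:

```pig
-- The SQL query
--   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
-- expressed in Pig Latin (all names here are hypothetical):
employees = load '/data/employees.txt' using PigStorage(',')
    as (name:chararray, dept:chararray);
by_dept   = group employees by dept;
counts    = foreach by_dept generate group, COUNT(employees);
dump counts;
```

Note that each Pig Latin statement names an intermediate relation, which is what makes step-by-step exploration and iteration so natural compared to a single monolithic SQL statement.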
Relations, Bags and Tuples, Oh my!
A big part of making Pig work for you is understanding the various data structures and types that you will encounter. The basis, or outermost container, of every Pig query is the relation. A relation is simply a bag (sometimes referred to as the outer bag). A bag is a collection of tuples, a tuple is an ordered collection of fields, and fields are represented by one of six simple data types: int, long, float, double, chararray (string), and bytearray. Pig also supports the map data structure, which is simply a collection of key/value pairs.
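A schema declaration is a quick way to see how these structures nest. The sketch below is illustrative only (the file path and field names are hypothetical); it shows simple fields alongside a tuple, a bag, and a map in one relation:

```pig
-- A relation is an outer bag of tuples; each tuple holds fields,
-- and a field can itself be a tuple, a bag, or a map.
sessions = load '/data/sessions.txt'
    as (ip:chararray,                           -- simple field
        visit:tuple(page:chararray, ms:long),   -- tuple field
        pages:bag{t:tuple(url:chararray)},      -- bag of tuples
        headers:map[]);                         -- map of key/value pairs
```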
A Simple First Example
To get our feet wet, let's look at a simple example. This example will load a data file from the HDFS file system, project a schema onto the data, apply a filter, and then perform a simple aggregation. For this example, we will use the w3c log data available in the Getting Started materials of the HDInsight install to get a count of page views by IP address.
1. The first step is to load the log data and project a schema onto it. The log columns are delimited by a blank space character, so we will need to specify that. To keep this demo simple, we will work with only the first three columns of the data and will work purely with chararrays (strings).
log = load '/user/Administrator/input/w3c.txt'
using PigStorage(' ')
as (date:chararray, time:chararray, ip:chararray);
2. Since the log file we are working with has headers, we need to filter them out before any data is processed. We use the matches operator with a regular expression pattern to apply the filter.
noheaders = filter log by NOT (date matches '.*#.*');
3. Now we are ready to group the data by IP address:
grouped = group noheaders by (ip);
4. Next, apply our aggregate to each group:
byip = foreach grouped generate group,COUNT(noheaders);
5. Notice that up until this point, no real processing beyond syntax checking has occurred. We are going to use the store command, which will cause the compilation and execution of the script using Map/Reduce and then store the results in tab-delimited format on HDFS. Use the following command and then watch as the Map/Reduce job is created, submitted, and executed:
store byip into '/user/Administrator/Output/PageViewsByIP' using PigStorage('\t');
6. At the Hadoop command prompt, you can use the -ls and -cat commands to list the output directory and then preview the results file.
hadoop fs -ls /user/Administrator/Output/PageViewsByIP
hadoop fs -cat /user/Administrator/Output/PageViewsByIP/part-r-00000
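Putting the steps above together, the full script looks like this. (The three column names in the load schema are those referenced in the later statements, date, time, and ip, all loaded as chararray; adjust them to match your log layout.)

```pig
-- Complete page-views-by-IP script assembled from steps 1-5.
log = load '/user/Administrator/input/w3c.txt'
    using PigStorage(' ')
    as (date:chararray, time:chararray, ip:chararray);

-- Drop header lines (rows whose date field contains '#').
noheaders = filter log by NOT (date matches '.*#.*');

-- Group by IP address and count the rows in each group.
grouped = group noheaders by (ip);
byip = foreach grouped generate group, COUNT(noheaders);

-- Trigger execution and write tab-delimited results to HDFS.
store byip into '/user/Administrator/Output/PageViewsByIP' using PigStorage('\t');
```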
In this first post, we introduced Pig as a tool for querying and processing data within the Hadoop ecosystem. We walked through a simple example that illustrated projecting a schema onto the data, filtering, grouping, and then processing the data to get a count by IP. In the next post, we are going to look at a more advanced task that loads NOAA weather station data to calculate the average temperature for each station. If you have not prepped and loaded the NOAA data, please check out my previous blog post.
Till next time!