In the previous two posts of this series we built a foundation for designing and building a recommendation engine. In the first post we built an understanding of what a recommendation engine looks like and how it works. In the second post we introduced Mahout as a platform for building one. This post builds on that foundation as we begin designing a recommendation engine end-to-end, starting with the ELT process for moving data into a Hadoop cluster.
Sources of Data
If you recall from the first blog post, there are two core types of data that we use to generate recommendations: explicit and implicit. Common data such as user activities, product ratings, feedback, and purchase/shopping cart history is typically found in a relational database and is mostly tabular in nature.
Other data, such as clicks and page views, comes from web server logs, which are more unstructured in nature and will require additional processing, in the form of something like a MapReduce job, before it can be used. Regardless of the source, the data must be moved into HDFS storage, and for that you have a number of options.
Data Movement Tools
There are a number of tools that are available for moving data. This section briefly summarizes some of the most common options.
Hive ODBC Driver
If you have any level of familiarity with HDInsight then you've undoubtedly seen the Hive ODBC Driver demos that pull data out of a Hive table and into Excel. But did you know that you can use this same driver to push data back into Hive? For tabular sources of data this can be a powerful solution, particularly when you combine it with an enterprise ETL tool such as SQL Server Integration Services (SSIS) on SQL Server 2012, which includes a built-in ODBC source and destination courtesy of Attunity.
One consideration for this data-loading option is throughput. If you are dealing with a very large volume of data, this option may not be for you. Be sure to perform some tests and tune your potential solution before you get too far down the road. You can download the current beta Microsoft Hive ODBC driver from http://www.microsoft.com/en-us/download/details.aspx?id=37134 and keep an eye out for the Simba Hive ODBC driver when it becomes available (http://www.bloomberg.com/article/2013-06-26/a93EumhT1cNI.html).
Sqoop
I have an in-depth blog post (HERE) on the Sqoop tool so I won't spend too much time on it beyond a basic description. Sqoop is a data movement tool designed to efficiently move data between Hadoop (HDFS) and relational databases or other structured data stores. It functions by generating and running MapReduce jobs that do the heavy lifting. Sqoop can connect to SQL Server databases (and many others) through a JDBC driver, and it can be scheduled through Oozie, the workflow scheduler of choice in Hadoop, or orchestrated into an SSIS workflow.
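As a rough sketch, a Sqoop import of a ratings table from SQL Server into HDFS might look like the following. The server name, database, table, credentials, and HDFS path are all hypothetical placeholders, and the command is printed as a dry run rather than executed:

```shell
# Dry-run sketch of a Sqoop import from SQL Server into HDFS.
# The server, database, table, and target directory are placeholders.
SQL_SERVER="dbserver.example.com"
DATABASE="WebStore"
TABLE="ProductRatings"

SQOOP_CMD="sqoop import \
  --connect 'jdbc:sqlserver://${SQL_SERVER}:1433;databaseName=${DATABASE}' \
  --username sqoop_user -P \
  --table ${TABLE} \
  --target-dir /data/ratings \
  --num-mappers 4"

# Printed rather than executed; drop the echo/variable and run the command
# on a cluster with Sqoop and the SQL Server JDBC driver installed.
echo "$SQOOP_CMD"
```

The `--num-mappers` flag controls how many parallel MapReduce tasks Sqoop spawns for the import, which is the main knob for throughput against the source database.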
Flume
Apache Flume is a distributed data movement service capable of collecting, aggregating, and moving large amounts of log data in near real-time through a streaming architecture. It is a popular, open-source enterprise option found in some of the largest Hadoop ecosystems. It abstracts away much of the complexity involved in the manageability, scalability, configuration, and fault-tolerance concerns associated with data movement.
Flume functions as a series of sources and sinks that come in the form of an agent (the originator of the data), a processor (an optional step that can extract/transform the data), and a collector (which persists the data to storage in HDFS). Built in are tools for handling deployment and scalability scenarios, such as easily aggregating logs from 100 different web servers. This is a vast simplification of the process but is sufficient for this post. For more information, check out the project website at http://flume.apache.org/.
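To make the source/channel/sink arrangement concrete, a minimal single-agent Flume configuration might look like the sketch below; the agent name, log path, and HDFS path are all hypothetical placeholders:

```
# Hypothetical single-agent Flume configuration: tail a web server log
# and write the events to HDFS. All names and paths are placeholders.
agent1.sources = weblog
agent1.channels = mem
agent1.sinks = hdfsout

# Source: follow the access log as new lines are appended
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/httpd/access.log
agent1.sources.weblog.channels = mem

# Channel: buffer events in memory between source and sink
agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 10000

# Sink: persist events to date-partitioned directories in HDFS
agent1.sinks.hdfsout.type = hdfs
agent1.sinks.hdfsout.hdfs.path = /flume/weblogs/%Y-%m-%d
agent1.sinks.hdfsout.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsout.channel = mem
```

Scaling out to many web servers follows the same pattern: an agent like this one on each server, with the sinks pointed at a collector tier that fans the events into HDFS.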
AzCopy
AzCopy is a tool specific to Windows Azure and Azure Blob Storage (ASV) that is similar to robocopy or xcopy. Deployed as a command-line utility, this tool allows you to easily copy files into and out of your Azure storage account and could easily be scripted to move data such as web server logs into Blob Storage for HDInsight to process. This is clearly a fairly rudimentary approach, and for production systems you will need to build some scaffolding around this tool for monitoring and reliability purposes. You can find the AzCopy tool on GitHub at: https://github.com/downloads/WindowsAzure/azure-sdk-downloads/AzCopy.zip
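As a sketch, uploading a folder of logs to a blob container might look like the following. The storage account, container, and key are placeholders, and the command is printed as a dry run since AzCopy itself runs on Windows:

```shell
# Dry-run sketch of an AzCopy upload. The local folder, storage account,
# container, and key are all placeholders.
SOURCE_DIR='C:\logs'
DEST_URL="https://mystorageaccount.blob.core.windows.net/weblogs"

AZCOPY_CMD="AzCopy ${SOURCE_DIR} ${DEST_URL} /destkey:<your-storage-key> /S"

# Printed rather than executed; run the command itself from a Windows
# machine with AzCopy installed.
echo "$AZCOPY_CMD"
```

A scheduled task wrapping a command like this is the simplest form of the scaffolding mentioned above; monitoring and retry logic would be layered on from there.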
Other Options
I would be remiss if I didn't mention that there are many other ways to get data into your HDInsight/Hadoop cluster. One of the more common roll-your-own approaches is to transfer or FTP files onto the cluster and then use the Hadoop shell's copyFromLocal command to load the files into HDFS. There are numerous other options which can be investigated if none of the above work for you.
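For example, loading a staged log file from the cluster's local disk into HDFS is a single Hadoop shell command. The file name and HDFS directory below are placeholders, shown as a dry run:

```shell
# Dry-run sketch: load a staged web log into HDFS via the Hadoop shell.
# The local file name and HDFS directory are placeholders; drop the echo
# to execute this on a cluster.
LOCAL_FILE="access_2013-07-01.log"
HDFS_DIR="/data/weblogs"

echo hadoop fs -copyFromLocal "$LOCAL_FILE" "$HDFS_DIR/"
```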
The objective of this post was pretty simple: introduce the most common tools available for moving data into HDInsight. Multiple options are available, and each has strengths and weaknesses in terms of when and how it should be used. The data, its volume, and its velocity will (and should) largely drive your decision. In the next post in this series, we will discuss the data transformations and normalizations that need to be considered to prepare the data for the recommendation algorithms.
Till next time!