I am passionate about analytics, data mining and machine learning, and I think most organizations do too little in this area. That's why one of my favorite parts of the Hadoop ecosystem is Mahout.
Mahout is a scalable machine learning library that ships with multiple out-of-the-box machine learning and data mining algorithms, covering clustering, classification, collaborative filtering and frequent pattern mining.
If you are using HDInsight in the cloud, Mahout comes pre-installed. Unfortunately, if you are running a local HDInsight instance on Windows Server, you must deploy Mahout on your own.
While this may sound like a daunting task, the good news is that under the covers HDInsight is a standard instance of Hadoop. Let's take a look at what it takes to get Mahout up and running.
1. Download the zipped Mahout 0.7 distribution from the Apache website: http://www.apache.org/dyn/closer.cgi/mahout/
2. Extract the contents of the zip file to c:\Hadoop and rename the extracted folder to mahout-0.7 for simplicity.
3. Now we are going to test the installation using the Simple Recommendation Engine demo: http://www.windowsazure.com/en-us/manage/services/hdinsight/recommendation-engine-using-mahout/
4. Follow the lab to generate the required files or, for expediency, you can download them here:
5. Once you have downloaded the files, place them in the c:\temp\ directory on your HDInsight instance.
6. Open the Hadoop Command Line console by clicking the link found either on the desktop or on the Start menu.
7. The first step as directed by the lab is to copy the test files from the local file system into HDFS. Use the following commands to deploy both text files to HDFS:
hadoop fs -copyFromLocal c:\temp\mInput.txt input/mInput.txt
hadoop fs -copyFromLocal c:\temp\users.txt input/users.txt
8. Browse HDFS and verify that the files now exist:
hadoop fs -ls input/
9. I won't explain what the sample job does, since the lab referenced above covers that well. We will simply use the sample job to verify that the Mahout distribution is configured and ready for use (enter the command on a single line):
hadoop jar C:\Hadoop\mahout-0.7\mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=input/mInput.txt --output=output --usersFile=input/users.txt
10. The job will take several minutes to run to completion. When it completes, let's copy the results to a text file in the local temp directory:
hadoop fs -copyToLocal output/part-r-00000 c:\temp\output.txt
11. Optionally, to clean up after the test, use the following commands to remove the output and temp directories from HDFS:
hadoop fs -rmr -skipTrash temp
hadoop fs -rmr -skipTrash output
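If you copied the results to c:\temp\output.txt in step 10, you may want to load them into something easier to work with than raw text. Below is a minimal Python sketch for doing that. It assumes the output follows the layout I have seen from RecommenderJob: one line per user, with the user ID, a tab, and then a bracketed, comma-separated list of itemID:score pairs. The function name parse_recommendations and the sample values are my own and purely illustrative; check the layout against your own output file before relying on it.

```python
def parse_recommendations(lines):
    """Turn lines like '1<TAB>[104:3.5,106:2.0]' into {1: [(104, 3.5), (106, 2.0)]}.

    Assumed layout (not confirmed by the lab): userID, a tab, then a
    bracketed list of itemID:score pairs separated by commas.
    """
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, _, items = line.partition("\t")
        items = items.strip().strip("[]")
        recommendations = []
        for pair in items.split(","):
            if not pair:
                continue  # a user with an empty list: "[]"
            item_id, _, score = pair.partition(":")
            recommendations.append((int(item_id), float(score)))
        results[int(user_id)] = recommendations
    return results


if __name__ == "__main__":
    # Made-up sample lines, used only to show the shape of the result.
    sample = ["1\t[104:3.5,106:2.0]", "2\t[]"]
    for user_id, recs in parse_recommendations(sample).items():
        print(user_id, recs)
```

To use it on the real file, pass an open file handle instead of the sample list, for example parse_recommendations(open(r"c:\temp\output.txt")).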
That's it. Your Hadoop instance now has Mahout support!
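If you would rather try the job against a few ratings of your own than the lab files, the two inputs are simple to produce. The sketch below is a rough Python example, assuming the format RecommenderJob reads as far as I know: the preferences file has one userID,itemID,rating triple per line, and the users file has one user ID per line. The helper names and the tiny data set are hypothetical and exist only for illustration.

```python
import os


def format_preferences(preferences):
    """Render (user_id, item_id, rating) tuples as 'user,item,rating' lines."""
    return "".join(f"{user},{item},{rating}\n" for user, item, rating in preferences)


def format_users(user_ids):
    """Render user IDs one per line, for the --usersFile argument."""
    return "".join(f"{user}\n" for user in user_ids)


def write_inputs(directory, preferences, user_ids):
    """Write mInput.txt and users.txt (the file names used in the steps above)."""
    input_path = os.path.join(directory, "mInput.txt")
    users_path = os.path.join(directory, "users.txt")
    with open(input_path, "w", newline="\n") as handle:
        handle.write(format_preferences(preferences))
    with open(users_path, "w", newline="\n") as handle:
        handle.write(format_users(user_ids))
    return input_path, users_path


if __name__ == "__main__":
    # Made-up ratings: (user, item, rating). Not taken from the lab.
    sample = [(1, 101, 5), (1, 102, 3), (2, 101, 2), (2, 103, 5)]
    print(format_preferences(sample), end="")
    print(format_users([1, 2]), end="")
```

Calling write_inputs(r"c:\temp", sample, [1, 2]) would place both files where step 5 expects them, after which steps 7 onward apply unchanged.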
Till next time!