In my next couple of blog entries, I will be focusing on Pig and then MapReduce. Before that, however, I need to prepare a dataset and load it into HDFS. The data I will be working with is weather data, specifically the NOAA Global Summary of the Day (GSOD) data, available for over 9,000 weather stations. GSOD data can be downloaded from the NOAA ftp site at the following address: ftp://ftp.ncdc.noaa.gov/pub/data/gsod.
For this demo, I am only going to focus on a single full year's worth of data (2012). The data on the ftp site is organized by year, and within each year's directory are per-station logs compressed using gzip. The entire collection of files for a year can be downloaded as a single tar archive: gsod_####.tar (where #### is the year). In addition to the GSOD data, you will need to download the ish-history.txt file, which contains metadata about each weather station. After you have downloaded the archive, extract its contents to your working directory: c:\temp\Weatherdata\2012\.
The remainder of this demo will assume the following:
- You have downloaded and installed Cygwin (http://www.cygwin.com/), a Unix-like environment for Windows
- You have the required library to decompress gzip
Preparing Data for HDFS
1. Open Cygwin and cd to your working directory. If you are new to Cygwin, a tip I found useful is that Ctrl-L will clear the screen.
2. Before we start, let's get comfortable using the bash commands. The first command we are going to use will simply list all the files in our working directory: ls
Let's build on that command now, using the powerful find command to locate all the files, and then pipe the output to the word count (wc) command to see how many files we have to process: ls -ld $(find . -type f) | wc -l
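The counting pipeline above can be sketched on a small scratch directory. The directory and file names below are stand-ins for illustration; the real working directory holds 12,408 station logs.

```shell
# Build a scratch directory with a few stand-in station files
# (names are hypothetical, modeled on the GSOD naming pattern).
cd "$(mktemp -d)"
touch 007034-99999-2012.op.gz 008268-99999-2012.op.gz 010010-99999-2012.op.gz

# find prints one line per regular file; wc -l counts those lines
find . -type f | wc -l   # prints 3
```

The same idea scales unchanged from 3 files to 12,408: find enumerates, wc counts.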
3. We are now ready to begin processing the 12,408 individual files. Since we have so many small compressed files, we need to decompress them, combine them into one large file, and then split that file into more manageable (approx. 250Mb) chunks. Decompressing the files with the gnu zip tools is trivial, although it may take several minutes for the process to complete. Enter the following command at the prompt: gunzip *
After the process completes, you are left with 12,408 files that are relatively small (1-12Kb). The files have an *.op extension, and you can browse a file's contents by typing: cat 007034-99999-2012.op. We are not concerned with the contents of the files at this point.
4. The cat command will also allow us to combine all the data in the files by piping it into a new file called 2012_WeatherData.op: cat *.op >> 2012_WeatherData.op. Note that >> appends, so running the command a second time would duplicate the data.
The resulting file is approximately 554Mb. You can verify the file's size by typing: ls -l 2012_WeatherData.op
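The concatenation step can be sketched on two tiny stand-in .op files. The scratch directory, file names, and contents below are assumptions for illustration only.

```shell
# Two tiny stand-in station files (hypothetical names and contents)
cd "$(mktemp -d)"
printf 'station 010010 records\n' > 010010-99999-2012.op
printf 'station 010020 records\n' > 010020-99999-2012.op

# cat streams every .op file into the combined file; >> appends, so a
# second run would duplicate the data (and the glob would then match
# the combined file itself, since its name also ends in .op)
cat *.op >> 2012_WeatherData.op

ls -l 2012_WeatherData.op   # shows the combined file's size
```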
5. Before we upload this file, we are going to split it into 250Mb chunks to make it more efficient to work with in HDFS. Before splitting the file, create a new directory (mkdir splits) and then change to the new directory (cd splits).
6. The split command will break the single large file into chunks. We want 250Mb chunks, so we pass that size to the split command using -b. The full command is: split -b 250m ../2012_WeatherData.op
The result is 3 files in the splits directory that are named automatically by the split process (xaa, xab, and xac by default).
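The split step can be sketched with a small stand-in file. The scratch directory and the 1 KB chunk size below are assumptions chosen so the example runs instantly; the demo itself uses -b 250m on the ~554Mb file.

```shell
# A 2,500-byte stand-in for the combined file
cd "$(mktemp -d)"
head -c 2500 /dev/zero > 2012_WeatherData.op
mkdir splits && cd splits

# -b sets the chunk size; output chunks are named xaa, xab, ... by default
split -b 1k ../2012_WeatherData.op

ls   # xaa and xab hold 1,024 bytes each; xac holds the remaining 452
```

As with the real 554Mb file, the last chunk simply holds whatever remains after the full-size chunks are written.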
7. The final step is to upload the files to HDFS. Open the Hadoop Command Prompt. The copyFromLocal command will push the files from the local file system into HDFS (the path I am using is /user/Administrator/WeatherData/2012). To complete the demo, enter:
hadoop fs -copyFromLocal c:\temp\Weatherdata\2012\splits\ /user/Administrator/WeatherData/2012
After the command completes, verify the files are in HDFS using the Hadoop Name Node Status webpage, or by listing the directory from the command line: hadoop fs -ls /user/Administrator/WeatherData/2012
In this demo, we walked through data preparation, a commonly required activity when loading data into HDFS. If this were a repetitive process, you would want to take the time to script these operations so they could be run with little work. In the next post, we will start using the data as we get going with a simple MapReduce program.
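Such a script could wrap the whole pipeline in one function. The sketch below uses the demo's chunk size and HDFS path; the function name, argument layout, and the Cygwin /cygdrive path in the example invocation are assumptions for illustration, and the hadoop upload only runs if the CLI is present.

```shell
#!/bin/bash
# Sketch: decompress, combine, split, and (if hadoop is available) upload.
# A starting point under the demo's assumptions, not a production script.
prepare_weather_data() {
    local workdir=$1 year=$2 hdfs_dir=$3 chunk=${4:-250m}
    cd "$workdir" || return 1
    gunzip -f ./*.gz                        # decompress all station logs
    # Combine into one file; run once -- a re-run's glob would also
    # match the combined file itself, since its name ends in .op
    cat ./*.op > "${year}_WeatherData.op"
    mkdir -p splits && cd splits
    split -b "$chunk" "../${year}_WeatherData.op"   # ~250Mb chunks by default
    # Upload only if the hadoop CLI exists on this machine
    if command -v hadoop >/dev/null 2>&1; then
        hadoop fs -copyFromLocal . "$hdfs_dir"
    fi
}

# Example invocation for the demo (via Cygwin's /cygdrive path mapping):
# prepare_weather_data /cygdrive/c/temp/Weatherdata/2012 2012 /user/Administrator/WeatherData/2012
```

Parameterizing the year and paths means next year's gsod_2013.tar extract can be processed with a single call.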
Till next time,