If my previous post we have looked at different means and methods for loading and subsequently working with data in a Hadoop environment. Largely missing from the discussion to date however is how SQL Server and other relational database play in this sandbox. While there are multiple points of integration the focus of this post will be on SQL-to-Hadoop tool better known as Sqoop.
Have a Double Sqoop
Sqoop is a relatively new command-line tool whose primary purpose is efficiently moving data between Hadoop (HDFS) and structured data stores such as our beloved SQL Server. The tool accomplishes this through the use of the following commands which are executed as MapReduce jobs:
- import - Used to load data from a target table or defined query from SQL Server to HDFS. Supports advanced functionality such as incremental load, multiple file formats and loading data in parallel (multiple mappers)
- import-all-tables - Similar to import except that it supports a list of tables to import
- export - Used to load data from HDFS to SQL Server. The target table must exist prior to the export command running and this command supports both Inserts and Updates. Multiple mappers can be used to load data in parallel but transaction boundaries are at the mapper level. For non-SQL Server destinations, built-in staging table support is available. I hope that sometime in the near future this will be extended to include SQL Server.
- job - Allows you to save import/export jobs including parameters so that they can be re-executed at a later date. Incremental import jobs will have state persisted by updating the job parameters.
- metastore - Configures sqoop to allow multiple users to share jobs.
- merge - Used to flatten dataset in HDFS. Primarily used to combine the output from incremental job runs where multiple files exists that contain duplicate rows. The merge job will produce a single flattened file removing duplicates so that only the newest record is kept.
- codegen - Generates Java classes that encapsulate and interpreted imported data
- create-hive-table - Creates a Hive metastore based on a SQL Server table. If the table has not already been loaded to HDFS the command will run an import as well.
Other commands that are supported include eval, list-databases and list-tables which use a specified connection to execute queries and explore the structure of the relational data store.
Working with Sqoop
Importing and exporting data with Sqoop is straight-forward. In the following sections we will look at three examples: importing, exporting and creating a hive table. Before we get started note that you all the commands below will need to be executed from a command-prompt at c:\Hadoop\sqoop-1.4.2\bin (note this location may change).
The minimum requirements to execute the import command are a connection, a table and a target HDFS directory. The connection to SQL Server takes places using the JDBC driver and requires a connection string passed in the following format:
Using this connection string we could easily import the DimProductCategory table from the AdventureWorks database to HDFS (/user/Administrator/AdventureWorks).
(Line Breaks added for clarity)
Note that after the connection, we only need to specify the SQL Server table and the target directory. The default output to HDFS is a CSV file output.
The -m parameter allows us to control parallelism by setting the number of MapReduce jobs to use. The default behavior for this setting is to split the workload based on the primary key for the table. If this is not the desired behavior you can configure the split column by passing the column name with the --split-by parameter.
We can extend the above example to handle a more complex scenario. Instead of working with an entire table, we can craft a query to return a list of Product Categories and Subcategories rather than just a single table. The only difference is the use of the --query parameter instead of --table as seen below:
sqoop import --connect "jdbc:sqlserver://localhost;database=AdventureWorksDW2012;username=Hadoop;password=********" --query "SELECT EnglishProductCategoryName,EnglishProductSubcategoryName FROM DimProductCategory JOIN DimProductSubcategory ON (DimProductCategory.ProductCategoryKey = DimProductSubcategory.ProductCategoryKey) WHERE $CONDITIONS" --target-dir /user/Administrator/AdventureWorks/ProductCategory -m 1
Getting data out of HDFS and back into SQL Server is straight-forward as the only required parameters are the --table and --export-dir parameters. There are a couple of caveats to be aware of however:
- The SQL Server table with an exact matching schema for the file on HDFS must exist.
- The Export command by default uses 4 tasks to load data in parallel. Each tasks has its own transaction container which could lead to transactional inconsistencies. Sqoop supports the use of a staging table to overcome this limitation for other RDBMS systems. Unfortunately, this option isn't currently available for SQL Server.
- Each task gets its own connection to SQL Server and transactions by default are committed every 10,000 records.
The command to load data from HDFS to SQL Server is below:
sqoop export --connect "jdbc:sqlserver://localhost;database=HadoopIntegration;username=Hadoop;password=********" --table DimProductCategory --export-dir /user/Administrator/AdventureWorks/ProductCategory/part-m-00000
If you need to perform updates, set the --update-key to the key column. If you need to perform both inserts and updates use the --update-mode parameter with a value of allowinsert.
Creating a Hive Table
Out-of-the-box Sqoop supports creating a Hive table based on a SQL Server table. The create-hive-table command will read the table schema from SQL Server, create the Hive metastore schema and if necessary perform the import of data from SQL Server to HDFS.
sqoop create-hive-table --connect "jdbc:sqlserver://localhost;database=HadoopIntegration;username=Hadoop;password=password" --table DimProductCategory --hive-table ProductCategory
In this post we looked at how you can use Sqoop to easily integrate Hadoop with your existing SQL Server environment. Although support for SQL Server in this tool is relatively young it greatly simplifies both importing, and exporting data as well as allowing you to easily create Hive tables over RDBMS tables.
Till next time!