In my last post "Getting Started with Data Quality Services", we walked through installing the DQS client and server and briefly discussed important DQS concepts. In this blog post, we will take the next steps by building out a knowledge base. More specifically this post will focus on creating domains, establishing business rules to validate those domains and then creating composite domains to allow for cross-domain validation. We will follow-up in subsequent posts by training our knowledge base with domain data and then putting the knowledge base to use in a data cleansing project.
Before we move forward let's set a little bit of the groundwork. Will be build our DQS demo around a data set that contains every NFL and MLB team, their stadium and the stadium's capacity. We will define domains for each data element and then put together a series of business rules to validate our data. With that let's get started.
DQS Blog Series Index
Part 1: Getting Started with Data Quality Services (DQS) 2012
Part 2: Building Out a Knowledge Base
Part 3: Knowledge Discovery in DQS
Part 4: Data Cleansing in DQS
Part 5 : Building a Matching Policy in DQS
Part 6: Matching Projects in DQS
Part 7: Activity Monitoring, Configuration & Security in DQS
Building the Knowledge Base (KB)
Fundamental to the ability to cleanse and correct data within DQS is the knowledge base. Knowledge bases are collections of knowledge about related data domains. These knowledge bases that are built and managed both interactively with a data steward and through a machine-learning knowledge discovery process. To get started, we need to launch the Data Quality Client and connect to the DQS server.
Once the client launches, you will notice three distinct regions within the application. We will start within the "Knowledge Base Management" section. Start by clicking "New Knowledge Base" under the "Knowledge Base Management" section. When the "New Knowledge Base" screen appears you will see prompts for both a name and description.
Data Quality Client Home Screen
For our demo you we will name our KB, "SportsTeamsKB". Before we move on, note that you have several other options available on this screen. You have the option to duplicate an existing knowledge base using the "Create Knowledge Base from" prompt as well as selecting an activity. Because we are walking through all the functionality of DQS we will start with the default of "Domain Management." Similar results can be achieve by starting with Knowledge Discovery as you will see in subsequent posts.
New Knowledge Base Screen
When you click "Next", you will be presented with the Domain Management screen. This is where the meat of the knowledge base is defined. Before we start defining our domains, let's look at the icons that are available the list of domains within the KB. From left to right, you will see options to Create a New Domain, Create a Composite Domain, Copy a Domain, Import a Domain, Export KB Data and finally Delete a Domain. We won't discuss importing and exporting in this series but we will cover the other functionalities as we progress through the demo.
Domain Management Actions
With that out of the way, we can start creating the four domains we need for our demo. Start by clicking the "Create a Domain" icon. A dialog will pop-up that will prompted you for a domain name, description, data type as well as several other configurable options. For brevity and clarity I've included a table below that outlines what each of these options do.
Create Domain Dialog
The domains we need to create are as follows:
Defining Business Rules
Once you create a domain, you will notice that depending on the data type type you configured you have either four or five tabs available. The four tabs that are available for every data type are Domain Properties, Reference Data, Domain Rules and Domain Values. The fifth tab is only available for the string data type is Term-Based Relations.
To quickly go through the tabs, you should notice that the Domain Properties tab contains the same information that you used to set-up the domain initially. The Reference Data tab is where you set-up a data market service provider (covered in subsequent posts) and the next tab is the Domain Rules tab which is the tab that is relevant to our current discussion. The other tabs, Domain Values and Term-Based Relations we will discuss in the Knowledge Discovery post.
Let's start defining our business rules with our League domain. Ensure the League domain is selected by clicking on it in the domain list and then clicking on the "Domain Rules" tab. In the upper-right hand corner above the grid you will notice an "Add a new domain rule" icon. Click that icon to start the process.
For our first rule we are going to go simple. We are going to define a rule that requires that a valid League value to have a length equal to three. As such, let's name our rule in the grid, "Valid League Acronym". Notice in the "Build a Rule" section that a condition has been supplied by default. Select "Length is equal to" from the drop-down list and then enter "3" in the prompt.
Domain Rule Definition
There's more to note before we move on. The business rule support multiple conditions that are joined using either and/or logic. We also have the ability to test our business rules as well as analyze the impact the rule will have on any current domain data that exists. When you are done making changes, click "Apply All Rules".
As an exercise for you, go ahead and create a business rule on the Capacity domain to ensure that the value is greater than zero.
Occasionally when putting together a data quality solution you will encounter a scenario where it is impossible to accurately validate a value without the context of one or more other values. For example when validating a postal code, we could not fully validate the value "33810" without the city and/or state. "33810" is certainly a valid postal code if you are in Lakeland, Florida but it is not valid if your in Atlanta, Georgia.
Luckily for us, DQS has the concept of Composite Domains (CD) which helps solve this problem. Composite Domains allow us to bring together multiple individual domains so that we can define cross-domain rules. To illustrate this let's point out a couple of problems we might face in our sports data set.
The first problem we might face has to do with our Capacity domain. MLB and NFL stadium capacities are in distinct ranges that do not overlap. MLB teams typically play in smaller stadiums (34,000 to 56,000), while NFL teams play in larger stadiums (61,000 to 92,000).
The second problem we could potentially face is that there is a Giants team in both NFL and in MLB. To accurate validate and cleanse our Team domain we must be able to check the League domain in order to properly rename a Team value of "Giants" to the proper full team name either New York Giants or San Francisco Giants.
To illustrate this, let's build a composite domain that consist of our Capacity, League and Team domains. Start by clicking the "Create a Composite Domain" icon above the list of the KB domains. The Create a Composite Domain dialog appears and prompts you for a name, description and the selection of domains to include in the CD. For our demo we will name our CD, "SportsTeams" and then add our Capacity, League and Team domains. Click "OK" when you are finished.
Create a Composite Domain Dialog
Back on the Domain Management screen you will notice that your CD shows up in your list of domains and is differentiate with an icon with multiple circles. You will notice your tab options look similar as well. For the purpose of our demo, we are going to be working in the CD Rules tab to define the cross-domain rules we need to accurately validate our data set.
To build our first business rule for validation of stadium capacity, click the "Add a New Domain Rule" icon. Name the business rule "MLB Stadium Capacity". In the Build a Rule section, select the League domain and then choose "Value is equal to". Enter "MLB" in the prompt. Under the Then section, choose the Capacity domain. First we choose "Value is greater than" and enter "30000" in the prompt. To add our closing condition, click the "Adds a new condition to the selected clause" icon. You will notice that the logical operator defaults to "AND", which is what we want. Now select "Value is less than or equal to" and enter "60000" in the prompt. This completes the MLB half of the rule. I will leave the NFL half of the rule as an exercise for you.
CD Rule ' MLB Stadium Capacity
The second rule we need to define is one that will rename a team name "Giants" to the correct full name based on the league. Add a new domain rule and name it "New York Giants Rename". In the Build a Rule section select League as the domain and then "Value is equal to" and enter "NFL" in the prompt. Now, before we add a second condition we need to click the domain level, you will notice that the blue bar to the left of your condition turns gray. Then click the add condition icon. For the second condition, select the Team domain. Next choose "Value is equal to" and enter "Giants" in the prompt.
Now we need to set the team name to appropriate value. Under the Then section, choose the Team domain. Again select "Value is equal to" and then enter the correct full name in the prompt, "New York Giants". This completes half the problem resolution and as before I will leave the second half as an exercise for you to complete.
CD Rule ' New York Giants Rename
Once you have finished the composite domain you can click "Finish" and then "Publish" your KB. Once your KB has been published it is available for other activities such as knowledge discovery, cleansing and matching.
Publish KB Dialog
Other Domain Features
Before we completely wrap up the shell of our knowledge base I want to point out a couple of other features that are available.
Linked Domains - It is possible to link domains together so that you inherit all the values, rules and properties of a domain. Because a domain can only be bound to a single field, this is a useful feature when you have two fields in a data set that need the same domain (think Billing State and Shipping State). You would not want to manage two domains separately so linked domains make it possible to manage a single domain and have linked domains inherit the changes.
Term-Based Relations - Allows for multiple values that are the same except for spelling to be consider synonyms. Some common example of this is linking the values "Inc." to "Incorporated" or "St." to "Street".
To recap we covered creating a new knowledge base. We also set-up several domains that are relevant to our dataset. We then learned how to add rules to our domain to validate data at the domain level. We followed that by digging into composite domains and cross-domain validation using composite domain rules.
Next time we will dive into the knowledge discovery process. Once we have populated our demo knowledge basis with some baseline data we will look into running a cleansing project both through a Data Quality Project as well as through an SSIS project.
Till next time!