Writing for BIDN

Looking to give back to the community or learn through teaching others? Anyone can post blogs by clicking Add Blog Post after contacting us for permission.

«August 2015»
MonTueWedThuFriSatSun
272829

Power BI and Big Data

If you’re worked in the wide and diverse field of information technology for almost any amount of time, it probably hasn’t taken you long to discover that the one thing constant about IT is that the technologies and strategies involved change faster than you can learn them. And if you work in business intelligence like I do, you don’t have to look very far at all to see change. The Microsoft Power BI team rolls out a software update every month! If I want to stay learned up on the technology, I have to really be on top of things.

About ten years ago when Hadoop was first being developed at Yahoo, I don’t think anyone could have anticipated the size of the ripples (more likes cannonball sized splashes) being able to access Big Data could and would have on the IT industry. Hadoop (and other advances in hardware and software technologies) gave us something we never had before: The ability to access and report on data in real time on a scale never previously imagined gives an organization to identify and understand trends and patterns in the data and gain previously unknown insights. The organizations that are able to leverage big data will be the organizations that leave their competition in the dust.

Set Up and Configure the Hortonworks Sandbox in Azure

Not only does Power BI Desktop give us the ability to connect to Hadoop Distributed File System (HDFS) for reporting we can also mash it up with other more traditional and structured data sources with minimal effort required. But that’s not what this blog post is all about. This post is about setting up a virtual machine in Azure running Hadoop and connecting to our Hortonworks Sandbox with Power BI Desktop :).

The first thing you do if you don’t have access to a Hadoop cluster is to set up the Hortonworks Sandbox on Azure. The good news is its free (for the duration of the trial) and its super easy. Just follow the instructions at this link to set up the Hortonworks Sandbox.

Hadoop in Azure

Once that’s set up, you’ll need to add mapping for the IP address and host name to your hosts file. Devin Knight has a blog on this that you’ll find helpful.

Connecting to Hadoop with Power BI Desktop

Once your Hortonworks Sandbox is set up, you’re ready to set up your connection to Hadoop with Power BI Query. Start up the Power BI Desktop and click Get Data. Scroll down and select Hadoop File (HDFS) and click Connect.

Get Data with Power BI

From there you can follow the rest of the wizard to load the data into the semantic model.

Load Data with Power BI

Once the data is loaded, you’ll need to modify the query to navigate to the data you wish to use in your model.

In Power BI Desktop, go to the Home ribbon and click Edit Queries.

Read more

30

Three Best Practices for Power BI

Since the release of Power BI Desktop this past week, I’ve been really spending my extra time digging into the application focusing on learning and experimenting as much as I can. When my wife has been watching Law and Order: SVU reruns at night after the rug rats are in bed, I’ve been right there next to her designing Power BI dashboards like the total data nerd that I am. When my kids have been taking their naps during the weekend, I’ve been writing calculations in the model for my test dashboards. Or when I’ve been riding in the car back and forth to work I’ve been thinking of new things to do with Power BI Desktop.

Since I’ve been spending a decent amount of time with Power BI Desktop, I thought I’d take a moment to share three things to know and remember when designing your Power BI models and dashboards that I think will help you make the most of this tool and be effective at providing the data your business needs to succeed.

1. Optimize your Power BI Semantic Model

It probably hasn’t taken you long to figure this one out if you’ve built Power Pivot/Tabular models or at least it won’t when you do start developing Power BI dashboards. The visualizations in Power BI and Power View are heavily meta-data driven which means that column names, table or query names, formatting and more are surfaced to the user in the dashboard. So if you using a really whacky naming convention in your data warehouse for your tables like “dim_Product_scd2_v2” and the column names aren’t much better, these naming conventions are going to be shown to the users in the report visualizations and field list.

For example, take a look at the following report.

Power BI Dashboard without formatting

Notice anything wonky about it? Check the field names, report titles and number formatting. Not very pretty, is it? Now take a look at this report.

Power BI Dashboard with formatting

See the difference a little cleaned up metadata makes? All I did was spend a few minutes giving the fields user-friendly name and formatting the data types. This obviously makes a huge difference in the way the dashboard appears to the users. By the way, I should get into the movie production business. ;)

My point is that the names of columns, formatting, data types, data categories and relationships are all super important to creating clean, meaningful and user friendly dashboards. The importance of a well-defined semantic model cannot be understated in my opinion. A good rule of thumb is to spend 80% to 90% of your time on the data model (besides, designing the reports is the easy part).

I’d also like the mention the importance of the relationships between the objects in the semantic model. Chance are you will have a small group of power users that will want to design their own dashboards to meet their job’s requirements and that’s one of the beauties of Power BI. But when users began developing reports, they may query your model in unexpected ways that will generate unexpected behaviors and results. I only want to mention this because the relationships between the objects in the model will impact the results your users will see in their reports. Double check your relationships and ensure that they are correct, especially after you add new objects to the model since the

Read more
31

Power BI Fantasy Football Player Stats Dashboards for Download

Every year at Pragmatic Works some coworkers, including consultants, marketing staff, support team members, software development staff and project management, partake in a company fantasy football league. And with the recent release of the new Power BI Desktop, I thought what better way is there to prepare to completely annihilate my coworkers and friends in an imaginary nonsensical game than by creating some nifty Power BI dashboards based on last years player stats as recorded by Yahoo! Sports. So I thought I’d walk you through some of the steps I followed to leverage the Yahoo! Sports NFL player stats page as a data source and some of the query transformations I applied to prepare the data for reporting.

Power BI dashboard with Power BI Desktop

Click here to download my Fantasy Football Dashboards Power BI .pbix file.

If you’re completed new to Power BI Desktop I highly suggest you watch my video walkthrough of Power BI Desktop or read my blog post which walks you through each step of creating your first Power BI dashboards with Power BI Desktop. Last Friday, I also blogged about my three best practices for designing a killer Power BI solution, so take a look at that.

To create these dashboards, I simply navigated to the Yahoo! Sports NFL stats page and found the page for each position I’m interested in for this fantasy football season. I copied the URL to my clipboard. In Power BI Desktop, click Get Data and then use the Web data source option. Then all you have to do is copy and paste the URL into the text box and click OK.

Get data from web with Power BI Desktop

Then select the HTML table that contains your data and click Edit. We need to edit our query because there are some issues with the data. By clicking Edit, we can apply transformations to our query which will allow us to do things like rename columns, remove unwanted columns, modify data types, create custom columns and much more.

Get data from web with Power BI Desktop

One thing you’ll notice in the above screen grab is that the column names are in the first row, so we need to fix that.

On the Home ribbon of the Query Editor, just click the Use First Row As Headers button. Pre

Read more
12
3456789
10111213141516
17181920212223
24252627282930
31123456

Building Out a DQS Knowledge Base

  • 29 March 2012
  • Author: cprice1979
  • Number of views: 8547
  • 0 Comments

In my last post "Getting Started with Data Quality Services", we walked through installing the DQS client and server and briefly discussed important DQS concepts. In this blog post, we will take the next steps by building out a knowledge base. More specifically this post will focus on creating domains, establishing business rules to validate those domains and then creating composite domains to allow for cross-domain validation. We will follow-up in subsequent posts by training our knowledge base with domain data and then putting the knowledge base to use in a data cleansing project. 

Before we move forward let's set a little bit of the groundwork. Will be build our DQS demo around a data set that contains every NFL and MLB team, their stadium and the stadium's capacity. We will define domains for each data element and then put together a series of business rules to validate our data. With that let's get started.

 

DQS Blog Series Index

Part 1: Getting Started with Data Quality Services (DQS) 2012

Part 2: Building Out a Knowledge Base

Part 3: Knowledge Discovery in DQS

Part 4: Data Cleansing in DQS

Part 5 : Building a Matching Policy in DQS

Part 6: Matching Projects in DQS

Part 7: Activity Monitoring, Configuration & Security in DQS

 

Building the Knowledge Base (KB)

Fundamental to the ability to cleanse and correct data within DQS is the knowledge base. Knowledge bases are collections of knowledge about related data domains. These knowledge bases that are built and managed both interactively with a data steward and through a machine-learning knowledge discovery process. To get started, we need to launch the Data Quality Client and connect to the DQS server. 

Once the client launches, you will notice three distinct regions within the application. We will start within the "Knowledge Base Management" section. Start by clicking "New Knowledge Base" under the "Knowledge Base Management" section. When the "New Knowledge Base" screen appears you will see prompts for both a name and description.

 

1-25-2012-10-21-55-AM_thumb

Data Quality Client Home Screen

 

For our demo you we will name our KB, "SportsTeamsKB". Before we move on, note that you have several other options available on this screen. You have the option to duplicate an existing knowledge base using the "Create Knowledge Base from" prompt as well as selecting an activity. Because we are walking through all the functionality of DQS we will start with the default of "Domain Management." Similar results can be achieve by starting with Knowledge Discovery as you will see in subsequent posts.

 

1-23-2012-4-05-38-PM_thumb

New Knowledge Base Screen

 

When you click "Next", you will be presented with the Domain Management screen. This is where the meat of the knowledge base is defined. Before we start defining our domains, let's look at the icons that are available the list of domains within the KB. From left to right, you will see options to Create a New Domain, Create a Composite Domain, Copy a Domain, Import a Domain, Export KB Data and finally Delete a Domain. We won't discuss importing and exporting in this series but we will cover the other functionalities as we progress through the demo.

 

1-25-2012-10-26-22-AM_thumb

Domain Management Actions

 

With that out of the way, we can start creating the four domains we need for our demo. Start by clicking the "Create a Domain" icon. A dialog will pop-up that will prompted you for a domain name, description, data type as well as several other configurable options. For brevity and clarity I've included a table below that outlines what each of these options do.

 

1-23-2012-4-06-50-PM_thumb

Create Domain Dialog

 

The domains we need to create are as follows:

 

Domain Name Data Type Format Output
Team String Capitalize
League String Upper Case
Stadium String Capitalize
Capacity Decimal #,##0

 

Definition of Domain Properties
Use Leading Values - Relevant when you define synonyms. If this value is checked, the synonym is replaced with the defined leading value.
Normalize String - Replaces special characters with either spaces or NULL to improve matching accuracy.
Format Output To - Allows you to specify the output format for data values.
Enable Speller and Language - Allows the DQS speller to propose syntax and validation corrections for the Language specified.
Disable Syntax Error Algorithms -  Allows you so skip syntax checks on domains where is may not be relevant such as on a SKU or VIN number.

 

Defining Business Rules

Once you create a domain, you will notice that depending on the data type type you configured you have either four or five tabs available. The four tabs that are available for every data type are Domain Properties, Reference Data, Domain Rules and Domain Values. The fifth tab is only available for the string data type is Term-Based Relations.

 

1-23-2012-4-08-17-PM_thumb
Domain Properties Tab
1-23-2012-4-08-40-PM_thumb
Reference Data Tab
1-23-2012-4-09-50-PM_thumb
Domain Rules Tab
1-25-2012-11-18-04-AM_thumb
Domain Values Tab
1-25-2012-11-18-35-AM_thumb
Term-Based Relations Tab
 

 

To quickly go through the tabs, you should notice that the Domain Properties tab contains the same information that you used to set-up the domain initially. The Reference Data tab is where you set-up a data market service provider (covered in subsequent posts) and the next tab is the Domain Rules tab which is the tab that is relevant to our current discussion. The other tabs, Domain Values and Term-Based Relations we will discuss in the Knowledge Discovery post. 

Let's start defining our business rules with our League domain. Ensure the League domain is selected by clicking on it in the domain list and then clicking on the "Domain Rules" tab. In the upper-right hand corner above the grid you will notice an "Add a new domain rule" icon. Click that icon to start the process. 

For our first rule we are going to go simple. We are going to define a rule that requires that a valid League value to have a length equal to three. As such, let's name our rule in the grid, "Valid League Acronym". Notice in the "Build a Rule" section that a condition has been supplied by default. Select "Length is equal to" from the drop-down list and then enter "3" in the prompt.

 

1-23-2012-4-11-58-PM_thumb

Domain Rule Definition

 

There's more to note before we move on. The business rule support multiple conditions that are joined using either and/or logic. We also have the ability to test our business rules as well as analyze the impact the rule will have on any current domain data that exists. When you are done making changes, click "Apply All Rules".

 

1-23-2012-4-10-38-PM_thumb
Business Rule Validation Tool
1-25-2012-11-36-06-AM_thumb
Domain Impact Analysis Results

 

As an exercise for you, go ahead and create a business rule on the Capacity domain to ensure that the value is greater than zero. 

Composite Domains

Occasionally when putting together a data quality solution you will encounter a scenario where it is impossible to accurately validate a value without the context of one or more other values. For example when validating a postal code, we could not fully validate the value "33810" without the city and/or state. "33810" is certainly a valid postal code if you are in Lakeland, Florida but it is not valid if your in Atlanta, Georgia. 

Luckily for us, DQS has the concept of Composite Domains (CD) which helps solve this problem. Composite Domains allow us to bring together multiple individual domains so that we can define cross-domain rules. To illustrate this let's point out a couple of problems we might face in our sports data set. 

The first problem we might face has to do with our Capacity domain. MLB and NFL stadium capacities are in distinct ranges that do not overlap. MLB teams typically play in smaller stadiums (34,000 to 56,000), while NFL teams play in larger stadiums (61,000 to 92,000). 

The second problem we could potentially face is that there is a Giants team in both NFL and in MLB. To accurate validate and cleanse our Team domain we must be able to check the League domain in order to properly rename a Team value of "Giants" to the proper full team name either New York Giants or San Francisco Giants. 

To illustrate this, let's build a composite domain that consist of our Capacity, League and Team domains. Start by clicking the "Create a Composite Domain" icon above the list of the KB domains. The Create a Composite Domain dialog appears and prompts you for a name, description and the selection of domains to include in the CD. For our demo we will name our CD, "SportsTeams" and then add our Capacity, League and Team domains. Click "OK" when you are finished.

 

1-23-2012-4-14-30-PM_thumb

Create a Composite Domain Dialog

 

Back on the Domain Management screen you will notice that your CD shows up in your list of domains and is differentiate with an icon with multiple circles. You will notice your tab options look similar as well.  For the purpose of our demo, we are going to be working in the CD Rules tab to define the cross-domain rules we need to accurately validate our data set. 

To build our first business rule for validation of stadium capacity, click the "Add a New Domain Rule" icon. Name the business rule "MLB Stadium Capacity". In the Build a Rule section, select the League domain and then choose "Value is equal to". Enter "MLB" in the prompt. Under the Then section, choose the Capacity domain. First we choose "Value is greater than" and enter "30000" in the prompt. To add our closing condition, click the "Adds a new condition to the selected clause" icon. You will notice that the logical operator defaults to "AND", which is what we want. Now select "Value is less than or equal to" and enter "60000" in the prompt. This completes the MLB half of the rule. I will leave the NFL half of the rule as an exercise for you.

 

1-23-2012-4-16-13-PM_thumb

CD Rule ' MLB Stadium Capacity

 

The second rule we need to define is one that will rename a team name "Giants" to the correct full name based on the league. Add a new domain rule and name it "New York Giants Rename". In the Build a Rule section select League as the domain and then "Value is equal to" and enter "NFL" in the prompt. Now, before we add a second condition we need to click the domain level, you will notice that the blue bar to the left of your condition turns gray. Then click the add condition icon. For the second condition, select the Team domain. Next choose "Value is equal to" and enter "Giants" in the prompt. 

Now we need to set the team name to appropriate value. Under the Then section, choose the Team domain. Again select "Value is equal to" and then enter the correct full name in the prompt, "New York Giants". This completes half the problem resolution and as before I will leave the second half as an exercise for you to complete.

 

1-25-2012-2-14-15-PM_thumb

CD Rule ' New York Giants Rename

 

Once you have finished the composite domain you can click "Finish" and then "Publish" your KB. Once your KB has been published it is available for other activities such as knowledge discovery, cleansing and matching.

 

1-23-2012-4-21-59-PM_thumb

Publish KB Dialog 

Other Domain Features

Before we completely wrap up the shell of our knowledge base I want to point out a couple of other features that are available. 

Linked Domains - It is possible to link domains together so that you inherit all the values, rules and properties of a domain. Because a domain can only be bound to a single field, this is a useful feature when you have two fields in a data set that need the same domain (think Billing State and Shipping State). You would not want to manage two domains separately so linked domains make it possible to manage a single domain and have linked domains inherit the changes. 

Term-Based Relations - Allows for multiple values that are the same except for spelling to be consider synonyms. Some common example of this is linking the values "Inc." to "Incorporated" or "St." to "Street". 

Recap

To recap we covered creating a new knowledge base. We also set-up several domains that are relevant to our dataset. We then learned how to add rules to our domain to validate data at the domain level. We followed that by digging into composite domains and cross-domain validation using composite domain rules. 

Next time we will dive into the knowledge discovery process. Once we have populated our demo knowledge basis with some baseline data we will look into running a cleansing project both through a Data Quality Project as well as through an SSIS project.

 

Till next time!

 

Chris

Print
Categories: Analysis Services
Tags:
Rate this article:
No rating

cprice1979cprice1979

Other posts by cprice1979

Please login or register to post comments.