Posted: 5/31/2011
I've been trying to consolidate a number of different sources and representations of the same data, specifically physician listings. I'd like to have a master physician table, and that led me to read about Master Data Services (MDS). MDS, though, seems to be just about storing the master data. I need to match up physicians across the different sources first. Does anyone have any any recommendations on tools which will help in the matching process?
I'm writing a lot of T-SQL code, and experimenting with SSIS Fuzzy Lookup tasks, to match different representations of the same physician before I store it. And eventually, some of this matching will have to be done manually. So, I'm looking for perhaps one tool which will assist in this in some coordinated fashion. At this time I feel like I am reinventing the wheel.My research has indicated that "Customer Data Integration" is the common term describing the techniques by which data is consolidated from different sources. I'm finding some big, expensive names which help with this, but anyone have recommendations on tools for the common database developer in a small shop? Thanks.
MDS is designed to be the central "point of truth" in situations like yours where the same entity is referred to by multiple names in different systems. Once you seutp MDS and define which version is "the truth", you refer to the MDS entry and it should trickle down to other ancillary systems if you're making changes and whatnot (NOTE: that's my current understanding, I have a book on MDS that I need to dig in to and grok everything). If you're rolling a homegrown solution it sounds like you'll be doing a series of lookups and fuzzy lookups (if you have Enterprise Edition) within SSIS in order to load your own version of "the truth" table. Are you currently doing everything natively or do you have any 3rd party tools now? If you're looking for 3rd party solution, is there a budget? I ask only because you've already seen there are solutions out there to help with this sort of situation but they're pricey.
That's my understanding as well on MDS, and it is the getting to "the truth" part that I am currently writing a lot of code to achieve. No, there is no budget for this, and this project will be over by the time anyone might approve the purchase of any 3rd party tools.
However, I did find something intruguing, thanks to Mega's blog post here about the Tech Ed sessions being available on Channel 9, and this might be just what I'm looking for:
Using Knowledge to Cleanse Data with Data Quality Services
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI207
I haven't had time to fully absorb the presentation, but I think this is a Denali Enterprise Edition feature. Looks interesting.
-Steve
Yeah the Data Quality Services in Denali definitely looks exciting and hopefully alleviates a lot of these types of pains.
In the meantime, I spoke with Joe Salvatore this morning about this issue and he suggested you'll probably have to a "waterfall" method in SSIS. This means you'd do an initial lookup against the other data source for direct matches. Those matches would go on down the matched output. The non-matched output would need to go to a fuzzy lookup task where you continue your comparisons until you filter down to what you need to move down the line to your final destination.