DQS and the structure of unstructured data

In these times of Big Data and of getting the best out of unstructured data, I thought I would try to get Data Quality Services (DQS) to make sense of unstructured data.

I read two articles in which DQS and unstructured data were mentioned together. The first is about data matching policies and was written by Gadi Peleg: http://blogs.msdn.com/b/dqs/archive/2011/11/02/matching-policy-a-closer-look-into-data-quality-services-data-matching.aspx

This article deals with matching policies in the case where you already have a strong DQS Knowledge Base, so that DQS is capable of identifying relationships between identical records within both structured and unstructured data. This of course requires a well-trained Knowledge Base and, to begin with, a lot of manual work to correct, approve and reject records.

Once your base is strong enough you can take advantage of matching policies. To do that you of course need to know your data very well, as well as the unstructured attributes you want to retrieve, match and eventually correct.

This is not the purpose of this blog post!

The second article was closer to what I want to achieve, but the examples it uses for unstructured data were frankly way too simple to qualify as unstructured. For instance, given a full name like “John B. Doe” and knowing that John is a first name and Doe a surname, the KB would be able to conclude that B. is a middle name. Not impressive!

What I want is to start from a moderately trained base and run Knowledge Discovery on some unstructured data, so that the base becomes stronger and can ultimately be used to build a matching policy.

Since I am French and love wine, I decided to structure my otherwise unstructured wine cellar. I have a lot of bottles down there but no clear overview of what I have, so eventually some wine will become too old (frankly quite seldom) and I will feel sorry about it.

Ideally I want to be able to feed DQS with some wine information and have my KB grow without me having to do too much work. Well, I can tell you right away that it was easier said than done!

Unstructured Test 1 : 1 unknown element amongst 3 known elements

These are some of the ways I want to be able to describe my wine list :

Moulin a Vent 1998 75 cl Domaine de Montparnasse
0,5 l Chateau clef des Champs Graves 1989
Morgon 75 cl Domaine Pascal Berthier 2010
demi-bouteille Mercurey La Grange du Roy 1978
Montagne Saint-Émilion 2011 Prairie d’Argent 75 cl
Magnum Pomerol La grande chasse 1988

As you can see an item in the list is made of :

AOC name + Year + Size + Producer name

Producer name being the unknown element.

Size can furthermore be expressed in different ways: in liters, in centiliters, or even by a “technical” name like “Magnum”.
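To make the size synonyms concrete, here is a minimal Python sketch of what "the same size written differently" means. It is not part of DQS and not my actual KB: the function name `size_to_cl` and the synonym table are hypothetical, and the only assumption about named sizes is the usual convention that a demi-bouteille holds 37.5 cl and a magnum 150 cl.

```python
import re

# Hypothetical synonym table: named sizes and their volume in centiliters.
NAMED_SIZES_CL = {
    "demi-bouteille": 37.5,
    "magnum": 150.0,
}

# Conversion factors from the written unit to centiliters.
UNIT_TO_CL = {"cl": 1.0, "l": 100.0}


def size_to_cl(text: str) -> float:
    """Return the volume in centiliters for a size written as text."""
    cleaned = text.strip().lower()
    if cleaned in NAMED_SIZES_CL:
        return NAMED_SIZES_CL[cleaned]
    # Accept a comma or a dot as decimal separator, e.g. "0,5 l" or "0.5 l".
    match = re.fullmatch(r"(\d+(?:[.,]\d+)?)\s*(cl|l)", cleaned)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    amount = float(match.group(1).replace(",", "."))
    return amount * UNIT_TO_CL[match.group(2)]


if __name__ == "__main__":
    for written in ["75 cl", "0,5 l", "demi-bouteille", "Magnum"]:
        print(written, "->", size_to_cl(written), "cl")
```

In DQS itself this mapping is done with domain values and synonyms rather than code; the sketch only shows the kind of equivalence the synonyms have to express.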

First of all I started by creating a KB called Wines and some domains – all strings:

image

I populated some initial domain values in my KB, for example all the years from 1958 to 2012, which was easily done in Excel and imported into my KB.
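I built the year list in Excel, but the same one-column file could be generated with a few lines of Python. This is only a sketch: the column header `Year` and the function names are made up for the example, and it assumes the import step accepts a simple one-column file.

```python
import csv
import io


def year_values(first: int = 1958, last: int = 2012) -> list:
    """Return every year from first to last (inclusive) as strings."""
    return [str(year) for year in range(first, last + 1)]


def write_years(stream, years) -> None:
    """Write the years as a one-column table with a header row."""
    writer = csv.writer(stream)
    writer.writerow(["Year"])  # hypothetical header name
    for year in years:
        writer.writerow([year])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_years(buffer, year_values())
    print(buffer.getvalue())
```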

I also created a list of the possible bottle sizes and their synonyms:

image

I did the same for the AOCs since they are known values. I found a list of 339 French AOCs on Wikipedia and imported it into my AOC domain:

image

So as you can see it is quite structured so far.

What I cannot possibly know beforehand is the names of the producers. Of course there are some very famous ones that I could have inserted as domain values (but I can’t afford them, so I won’t need them anyway).

I then built a Composite Domain (CD) from these domains and, very importantly, set the CD to use delimiters (spaces in my case) and Knowledge Base Parsing. This basically tells my KB to parse the data into the single domains of my Composite Domain, based on the knowledge attached to those domains.

image

This is where the unstructured data comes in. I want to be able to describe my bottles in different ways and still have DQS put the information in the right places, so that my KB becomes better trained at the same pace as my nose for wine!

So if I was to feed my KB with :

Moulin a Vent 1998 75 cl Domaine de Montparnasse
0,5 l Chateau clef des Champs Graves 1989
Morgon 75 cl Domaine Pascal Berthier 2010
demi-bouteille Mercurey La Grange du Roy 1978

it will know that the remaining parts (Domaine de Montparnasse, Chateau clef des Champs, Domaine Pascal Berthier and La Grange du Roy) are producers without me having to tell it so.
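The idea behind this expectation can be sketched in a few lines of Python. To be clear, this is not how DQS implements Knowledge Base Parsing; it is a toy illustration under my own assumptions: the text is split on spaces, the known domains claim the tokens they recognize (longest run first), and whatever is left over is taken to be the producer. The names `KNOWN`, `MAX_WORDS` and `parse_bottle` are hypothetical, and the domain values are tiny stand-ins for the real lists.

```python
# Tiny stand-ins for the real domain values (the actual KB holds far more).
KNOWN = {
    "AOC": {"moulin a vent", "graves", "morgon", "mercurey"},
    "Year": {str(year) for year in range(1958, 2013)},
    "Size": {"75 cl", "0,5 l", "demi-bouteille", "magnum"},
}
MAX_WORDS = 3  # longest known value, counted in space-separated tokens


def parse_bottle(line: str) -> dict:
    """Assign tokens to known domains; leftover tokens become the producer."""
    tokens = line.split()
    result = {}
    leftover = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest run of tokens first so "Moulin a Vent" wins as a whole.
        for width in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            original = " ".join(tokens[i:i + width])
            for domain, values in KNOWN.items():
                if original.lower() in values and domain not in result:
                    result[domain] = original
                    i += width
                    matched = True
                    break
            if matched:
                break
        if not matched:
            # No domain recognizes this token: keep it for the producer.
            leftover.append(tokens[i])
            i += 1
    result["Producer"] = " ".join(leftover)
    return result


if __name__ == "__main__":
    samples = [
        "Moulin a Vent 1998 75 cl Domaine de Montparnasse",
        "0,5 l Chateau clef des Champs Graves 1989",
        "Morgon 75 cl Domaine Pascal Berthier 2010",
        "demi-bouteille Mercurey La Grange du Roy 1978",
    ]
    for sample in samples:
        print(parse_bottle(sample))
```

The sketch only works because exactly one element is unknown: everything the known domains do not claim ends up in a single leftover field. With a second unknown element there would be no way to tell which leftover token belongs where.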

So I gave it a shot and tried to run some Knowledge discovery on those examples. Here it goes :

image

The discovery process found 6 new producers; there weren’t any in the base before. All the other values were known beforehand.

Unstructured Test 2 : 2 unknown elements amongst 3 known elements

Optimistic as I always am, I thought I would be able to write the price of the wine in my data and keep it in my KB. Unfortunately it proved to be a greater challenge than I thought.

When I added price information to my data:

€5.00 Moulin a Vent 1998 75 cl Domaine de Montparnasse
0,5 l €4.50 Chateau clef des Champs Graves 1989
Morgon 75 cl €12.00 Domaine Pascal Berthier 2010
demi-bouteille Mercurey La Grange du Roy 1978 €18.00

and added a Price domain to my CD, the Knowledge Discovery process really failed, even though I kept the price somewhat structured: “€” followed by an amount in the format “12.34”. I even built a rule with a regular expression to validate the price format:

image
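The exact expression from the screenshot is not reproduced here. As a rough equivalent, assuming the format described above (a “€” sign immediately followed by digits, a dot and exactly two decimals), the check could look like this in Python. The names `PRICE_PATTERN` and `is_valid_price` are mine, not DQS terms.

```python
import re

# Assumed format: "€" directly followed by digits, a dot and two decimals.
PRICE_PATTERN = re.compile(r"€\d+\.\d{2}")


def is_valid_price(value: str) -> bool:
    """Return True when the whole value matches the assumed price format."""
    return PRICE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["€5.00", "€12.00", "€ 18.00", "12.00", "€4.5"]:
        print(repr(candidate), is_valid_price(candidate))
```

Note that a value with a space between the sign and the amount, or with the sign missing, does not pass this check. That matters when the data is first split on spaces.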

When doing Knowledge Discovery with the above data here is what I got :

image

Basically it’s all mixed up :

image

and the Price is not valid!!

image

So much for 2 unknown pieces of unstructured data.

I also tried splitting the Price into two parts: the currency and the amount.

I implemented a domain rule stating that a currency can only be £, € or $ and has a maximum length of 1. Even with those clear rules, Knowledge Discovery suggested “€ 18.00” as a new invalid currency: it was not able to isolate the currency from the amount, even though the currency is hard-coded as a domain value. I’m not impressed!!
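Written as plain Python, the currency rule and the split I was hoping for look roughly like this. It is a sketch of the intent only, not something DQS offers: `is_valid_currency` and `split_price` are hypothetical helper names.

```python
ALLOWED_CURRENCIES = {"£", "€", "$"}


def is_valid_currency(value: str) -> bool:
    """The domain rule: one of the three symbols, at most one character long."""
    return len(value) <= 1 and value in ALLOWED_CURRENCIES


def split_price(value: str) -> tuple:
    """Split a price such as '€18.00' into (currency, amount).

    Whitespace around and between the two parts is ignored.
    """
    value = value.strip()
    return value[:1], value[1:].strip()


if __name__ == "__main__":
    print(is_valid_currency("€"))        # the symbol alone satisfies the rule
    print(is_valid_currency("€ 18.00"))  # the unsplit value does not
    print(split_price("€ 18.00"))
```

The unsplit value “€ 18.00” can never satisfy the rule, which is consistent with DQS flagging it as invalid; the disappointing part is that it proposed the whole string as a currency instead of separating the two parts first.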

Conclusion

When it comes to data matching, DQS does a great job with structured data, but if you have unstructured data you will need a lot of structure around it to make DQS accept it. As far as I have tried, I only succeeded with one piece of unknown data amongst pieces of known data when using Knowledge Base Parsing. Maybe I should get some more wine so my KB gets smarter!!

Happy DQS’ing !!

Send me a message if you’re into French wines and want a copy of my KB and Excel spreadsheet!!
