Text Data Mining on Mosha’s blog
Posted by thomasivarssonmalmo on February 8, 2009
With the permission of the blog owner I have collected various blog posts since late 2004, excluding posts about subjects outside of core MDX discussions. This was copy-and-paste work into a single large text file: I marked the main text of each blog entry of interest and pasted it in, taking only the text and not the HTML source tags.
Text data mining in SQL Server 2005 or 2008 starts with using SSIS and the Term Extraction and Term Lookup transforms in the data flow. You start by creating a dictionary of terms that you think are relevant. Create a data flow task in the control flow and in the data flow you build something like in the picture below.
I have a flat file connection to my text file with all the blog entries. Import all text in a single column. One error that can occur with this task is truncation, so as the next picture shows, you can open the advanced editor of the flat file source and set it to ignore truncation errors. I recommend setting up an error flow, but that is outside the scope of this blog post.
The next task is the data conversion task, since the text data mining transforms only accept Unicode text.
The Term Extraction task is used to create the dictionary of interesting MDX key words. The first time you run it you will get a long list of frequent words, where 85 percent of them are rubbish. You set up the term extraction as in the following picture. I have selected Noun Phrase only, because it returned the most interesting and clear results. Have a look at Books Online for more information about this.
Now follows the long work of removing less interesting key words. Otherwise you will see that [Internet Sales Amount] is the most frequent key word in Mosha's blog. You build a table of rejected words that will filter them out of the list of key words each time you run the Term Extraction task. You reference this rejected-words list on the first tab of the same task. I have run SELECTs on the results of the Term Extraction task and inserted the rejected words with a simple INSERT INTO T-SQL statement.
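The maintenance of the exclusion list can be done with a few plain T-SQL statements. This is a sketch only; the table and column names (TermExtraction, RejectedTerms, Term, Score) are assumptions, so substitute whatever names your OLE DB destinations actually use:

```sql
-- Assumed names: TermExtraction is the OLE DB destination table for the
-- Term Extraction output (columns Term, Score); RejectedTerms is the
-- exclusion table referenced on the first tab of the Term Extraction task.

-- Review the most frequent terms to decide what to reject:
SELECT Term, Score
FROM dbo.TermExtraction
ORDER BY Score DESC;

-- Move uninteresting terms into the exclusion table
-- (separate INSERTs keep this compatible with SQL Server 2005):
INSERT INTO dbo.RejectedTerms (Term) VALUES (N'Internet Sales Amount');
INSERT INTO dbo.RejectedTerms (Term) VALUES (N'blog post');
```

After each run of the data flow you repeat this review-and-insert cycle until the remaining terms are genuinely interesting.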
When you have a dictionary of interesting key words, you can create a new data flow task and set it up like the first, with a connection to the same flat file. The only difference is that you use the Term Lookup task instead of Term Extraction. All other tasks remain the same, except that you will need a second OLE DB destination. Here you keep the key words and their frequency in significant parts of the text file. The reference table is the dictionary that we created in the first part (the term extraction output).
The relation between the flat file output and the reference table is set up on the second tab of the Term Lookup task.
Finally, what were the results of this text data mining of Mosha's blog?
First we have the most significant key words, which required a lot of work with the rejected-words table.
This dictionary is the reference table that I have used in the Term Lookup task.
The result from the Term Lookup is the following.
It can be a little harder to see, but to the left you have the key words from the dictionary and to the right the text fragment where they appeared. The frequency in the middle is the number of times the key word appears in the text fragment.
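To get an overall ranking rather than per-fragment counts, you can aggregate the Term Lookup output in T-SQL. Again a sketch under assumed names (TermLookupResults with columns Term, Frequency, TextFragment); use whatever your second OLE DB destination is called:

```sql
-- Assumed name: TermLookupResults is the OLE DB destination table for the
-- Term Lookup output (columns Term, Frequency, TextFragment).

-- Total occurrences of each key word across all text fragments:
SELECT Term, SUM(Frequency) AS TotalOccurrences
FROM dbo.TermLookupResults
GROUP BY Term
ORDER BY TotalOccurrences DESC;
```

This gives a single ranked list of the dictionary terms over the whole blog archive.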
My model can be improved by adding dates for the blog entries and by running a data mining algorithm on top of the two tables that I have created, but that is the subject of another blog entry.