Thomas.I Microsoft BI & Analytics

A Site Dedicated to General BI Issues and MS BI Issues: Architecture and Technology

Archive for the ‘MS Data Mining and Business Value’ Category

Calculate Rule Importance in MS Association Rules

Posted by thomasivarssonmalmo on July 9, 2010

To follow this blog post you will need to read my previous post about MS Association Rules and, more importantly, understand why and when you can use this algorithm.
 
When you look in the data mining viewer in Excel 2007/2010 or in BIDS, the result of running the MS Association Rules data mining algorithm can look like this on the rules tab. The picture below is from BIDS and the market basket mining model that is part of the Adventure Works SSAS demo project. There is one itemset, {Touring Tire, Touring Tire Tube}, that is visualised as two rules: Touring Tire -> Touring Tire Tube and Touring Tire Tube -> Touring Tire. Itemsets in MS Association Rules do not say anything about the relation between the products (Touring Tire and Touring Tire Tube). Rules, on the other hand, do exactly that and have the arrow between the products.
 
Note in the picture below that the strongest rule, Touring Tire Tube -> Touring Tire, has a lower probability (0,543) than the third strongest rule, Touring Tire -> Touring Tire Tube (0,860). I hope that this blog post helps you understand why, and shows how to check the calculation of rule importance.
 

Thanks to Jamie MacLennan and Bogdan Crivat at Predixion, who have helped me with answers to my questions.

Regarding the relation between rule probability and rule importance, Jamie explained it in the following way. If people always buy milk at the grocery store together with other products, milk will have a high probability of occurring with other products. All products that are bought together with milk will have a 100 percent probability. Such a rule is not that interesting and will have a lower importance than probability.

The starting point for calculating the rule importance is a correlation matrix with Touring Tire Tube and Touring Tire transactions. The data source for the market basket model is dbo.vAssocSeqLineItems in the AdventureWorksDW relational database. In order to follow how the MS Association Rules algorithm works I have added 1 to the count in all cells except the total row and the total column, which are sums of the adjusted cells.

 

                        Touring Tire    Not Touring Tire    Total
Touring Tire Tube            759               640          1 399
Not Touring Tire Tube        124            19 736         19 860
Total                        883            20 376         21 259

 

Some of the queries are these:

SELECT * FROM dbo.vAssocSeqLineItems
WHERE Model = 'Touring Tire'  -- 881 records

SELECT * FROM dbo.vAssocSeqLineItems
WHERE Model = 'Touring Tire Tube'  -- 1397 records

SELECT * FROM dbo.vAssocSeqLineItems v
JOIN dbo.vAssocSeqLineItems v1
  ON v.OrderNumber = v1.OrderNumber
WHERE v.Model = 'Touring Tire' AND v1.Model = 'Touring Tire Tube'  -- 758 records

-- Credit to Bogdan for this one.
SELECT DISTINCT OrderNumber FROM dbo.vAssocSeqLineItems
EXCEPT
(
 SELECT DISTINCT OrderNumber FROM
 (
  SELECT OrderNumber FROM dbo.vAssocSeqLineItems WHERE Model = 'Touring Tire Tube'
  UNION
  SELECT OrderNumber FROM dbo.vAssocSeqLineItems WHERE Model = 'Touring Tire'
 ) AS T
)  -- 19 735 records
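If you want to rebuild the whole matrix in one pass, a conditional aggregation over the order numbers gives the same raw counts before the +1 adjustment. This is only a sketch against dbo.vAssocSeqLineItems, assuming each order lists a model at most once; the figures in the comments follow from the record counts above.

-- Sketch: the raw 2x2 counts behind the matrix, before the +1 adjustment.
WITH Orders AS
(
 SELECT OrderNumber,
        MAX(CASE WHEN Model = 'Touring Tire' THEN 1 ELSE 0 END) AS HasTire,
        MAX(CASE WHEN Model = 'Touring Tire Tube' THEN 1 ELSE 0 END) AS HasTube
 FROM dbo.vAssocSeqLineItems
 GROUP BY OrderNumber
)
SELECT SUM(CASE WHEN HasTube = 1 AND HasTire = 1 THEN 1 ELSE 0 END) AS TubeAndTire, -- 758
       SUM(CASE WHEN HasTube = 1 AND HasTire = 0 THEN 1 ELSE 0 END) AS TubeNotTire, -- 639
       SUM(CASE WHEN HasTube = 0 AND HasTire = 1 THEN 1 ELSE 0 END) AS TireNotTube, -- 123
       SUM(CASE WHEN HasTube = 0 AND HasTire = 0 THEN 1 ELSE 0 END) AS Neither      -- 19 735
FROM Orders;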

With the correlation matrix cells completed you can go to the next step and calculate the importance of the rules. The figures in the BIDS viewer are:

1,43722003 Touring Tire -> Touring Tire Tube
1,93898162 Touring Tire Tube -> Touring Tire

Let us start with the importance of the rule Touring Tire Tube -> Touring Tire and check the numbers in the table above. Remember that +1 has been added to each cell and that the totals include this addition.

The calculation is done like this in several steps:

1. (Touring Tire, Touring Tire Tube) / (Touring Tire Tube Total)  = 759 / 1 399

2. (Touring Tire, Not Touring Tire Tube) / (Not Touring Tire Tube Total) = 124 / 19 860

3. Divide the ratio in point 1 by the ratio in point 2 = 86,89

4. Take Log10 (in Excel) of the result in point 3: Log10(86,89) = 1,9390. I am not an expert in mathematics, but Log10 simply compresses the scale.

The same calculation for Touring Tire -> Touring Tire Tube is as follows:

1. (Touring Tire, Touring Tire Tube) / (Touring Tire Total) = 759 / 883

2. (Touring Tire Tube, Not Touring Tire) / (Not Touring Tire Total) = 640 / 20 376

3. Divide the ratio in point 1 by the ratio in point 2 = 27,37

4. Take Log10 (in Excel) of the result in point 3: Log10(27,37) = 1,437
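If you prefer to let SQL Server do the arithmetic, the sketch below repeats both calculations with the adjusted cell values from the matrix above. LOG10 is the built-in T-SQL function, and the variable names are just my own labels for the cells.

-- Sketch: both importance figures computed from the adjusted (+1) matrix cells.
DECLARE @TubeAndTire float = 759,    -- (Touring Tire, Touring Tire Tube)
        @TireNotTube float = 124,    -- (Touring Tire, Not Touring Tire Tube)
        @TubeNotTire float = 640,    -- (Touring Tire Tube, Not Touring Tire)
        @TubeTotal float = 1399,     -- Touring Tire Tube total
        @NotTubeTotal float = 19860, -- Not Touring Tire Tube total
        @TireTotal float = 883,      -- Touring Tire total
        @NotTireTotal float = 20376; -- Not Touring Tire total

SELECT LOG10((@TubeAndTire / @TubeTotal) / (@TireNotTube / @NotTubeTotal)) AS TubeToTireImportance, -- 1,939
       LOG10((@TubeAndTire / @TireTotal) / (@TubeNotTire / @NotTireTotal)) AS TireToTubeImportance; -- 1,437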

I hope that these examples help you follow the calculation of MS Association Rules importance directly from the data source.

Happy Association!

Posted in MS Data Mining and Business Value

Association rules viewers

Posted by thomasivarssonmalmo on April 5, 2010

MS Association Rules is the data mining algorithm in Analysis Services that is recommended for market basket analysis. This means that you would like to see which products customers buy together in order to give recommendations to other customers. A lot of material exists on the web about how to set up this algorithm in BIDS, and there is also a data mining tutorial for Association Rules in Books On Line. In the Adventure Works cube project you have this algorithm ready to look at; my examples are from the SSAS 2008 R2 version but also apply to SSAS 2005 and SSAS 2008.
 
My post is about how to analyze the results with the visualization tools that are part of BIDS 2005 and later.
 
I would recommend beginners to start with the dependency network tool, which is the last tab of the mining model viewer in BIDS. In order to see it you will have to click on the Market Basket mining structure in the solution explorer window in BIDS. What you will see are the most important rules for products that were bought together by customers. The dependency network graph below is related to the rules tab in the mining model viewer. The itemsets tab does not have a connection with the dependency network tool.
 
 
Let us use the link strength slider to the left in the dependency network tool to show only the three strongest links. A link and a rule are the same thing here, and each rule or link shows products that are bought together.
 
 
The strongest rule is between the Touring Tire Tube and the Touring Tire products above.
 
 
 
The second most important rule above is between the water bottle and the road bottle cage products.
 
The third most important rule below is between the water bottle and the mountain bottle cage products.
 
 
If you move from the dependency network tab to the rules tab you will see these three rules at the top if you sort by the importance indicator/bar on that tab.
 
 
 
The higher the importance of a rule, the more likely it is that the products are bought together. The probability value to the left is less important when you want to detect these relationships.
 
On the itemsets tab you will see how frequently products occur together, but there is no rule stating that if you buy product A you will also buy product B. You will only see that products A and B occur together. It is more or less a tab for checking that your minimum support requirement is working. It is the same as stating that product A and B must occur together with a minimum frequency, like 10 times or more in the data set. You can actually see below that the itemsets with the largest support are not the most important rules.
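If you want to check a support figure outside the viewer, counting the orders that contain both products in the source view gives the same kind of number the itemsets tab shows. This is only a sketch against the dbo.vAssocSeqLineItems view that feeds the market basket model:

-- Sketch: the support of the {Touring Tire, Touring Tire Tube} itemset.
SELECT COUNT(DISTINCT v.OrderNumber) AS ItemsetSupport
FROM dbo.vAssocSeqLineItems v
JOIN dbo.vAssocSeqLineItems v1
  ON v.OrderNumber = v1.OrderNumber
WHERE v.Model = 'Touring Tire'
  AND v1.Model = 'Touring Tire Tube';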
 
 
 
 The statistical theory behind this will take many blog posts to explain so I will refer to Books On Line for more information about this.

Posted in MS Data Mining and Business Value

Understanding MS Clustering Visualizations in BIDS

Posted by thomasivarssonmalmo on November 11, 2009

This is my first blog post about understanding data mining from an information consumer's point of view, someone who understands almost nothing about the statistical foundation of this area.
 
For those of you who are new to data mining, it is a statistical approach to finding patterns in data that can save you from writing a lot of SQL or MDX.
 
I will use the well known bike buyer scenario that is part of the Adventure Works 2008 DW database. I am using the Adventure Works 2008 SSAS cube project where you can find the Targeted Mailing data mining structure. Have a look at the picture below.
 
Remember that MS Clustering does not require you to explain bike buyer (yes or no). Clustering gives you a result that can be used to explain the number of cars or any of the other attribute values that are part of the mining model and the mining structure of your choice. I will focus on the bike buyer attribute since it is well known and can be used to tie the different viewers in BIDS for SSAS 2008 together.
 
What I am doing here is simply using the mining models that are part of the Adventure Works SSAS demo project for SSAS 2008, but this blog post also applies to data mining in SSAS 2005. You will need to download that SSAS project and the source database from here.
 
Start up the Adventure Works cube demo project in BIDS (BI Developer Studio) 2008 and check the mining structures folder in that project. Click on "Targeted Mailing" and then go to the mining models tab. You will see several mining models, but in this blog post it is all about the "TM Clustering" mining model.
 
 
The next tab of interest is the mining model viewer. Select the TM Clustering mining model in the mining model listbox at the top left. The first visualization that shows up is the cluster diagram.
 
 
 
Clustering does not require you to set a target attribute like bike buyer when you build the mining model. This means that this algorithm is a good starting point when you do not yet know what the target attribute and the input attributes are. The result is your rows, or cases, arranged into groups where each group has similar values for each attribute. It is up to you to decide if these groups are useful or not.
 
After the mining structure is processed you can use the visualization tools in BIDS to see what patterns were found in your data source.
 
 
 
The cluster diagram above is set to show darker colours for the clusters where bike buyers are most frequent. You can see that by looking at the shading variable (Bike Buyer) and the state (=1, or true). Another interesting concept is the relation between the groups. It is not necessarily a good thing to have groups with strong relations, but I can be wrong about that; I will add more about this later. If you put the cursor on clusters 4 and 8 you will see that they have 71 % and 59 % probability of containing bike buyers.
 
Edit: Bogdan Crivat, of the MS DM team, added this description of the cluster diagram: "By default the shading is based on the size(population) of each cluster. The layout is based on the similarity of distribution of each cluster. In short, this means that clusters close to each other in this diagram are more similar than clusters that are far from each other." 
 
 
 
In the cluster profiles above you get a confirmation that cluster/group 4 and cluster/group 8 have more bike buyers, and a confirmation of the percentage probability for these two cluster groups. Do not forget to have a look at the other attributes for these two groups, because they will tell you more about how they differ from the other groups. This can be the hard part, but you can get a quick impression by seeing whether the proportions of the colours differ for each attribute in different clusters or groups.
 
 
 
In the cluster characteristics viewer (Cluster 4) you can see the probabilities for the most important attribute states or values within a group/cluster. This is why an attribute can occur several times here: both the attribute and its state or value are shown, together with the probability. Put the cursor on the Bike Buyer = 1 attribute and you will see the same percentage probability that we have seen since the cluster diagram.
 
 
Finally, we can see the cluster discrimination viewer above, where I have selected clusters 4 and 8. This tool can cause some confusion. Here you will see the largest differences between these two clusters for each attribute and attribute state. Bike buyer is at the end since it has the least difference between these two clusters. If you have two clusters with few differences you can have a problem, but that depends on what you are looking for. In that case you will probably have to review your mining model.
 
Bogdan Crivat also sent me a helpful comment about the difference between Cluster Characteristics and Cluster Discrimination (vs. Complement):
"The discrimination vs. complement view is probably the most useful visualization for any single cluster, as it emphasizes the cluster's specifics. Cluster 7 and 8 have a strong presence of North American customers, so this will appear in characteristics. However, if you compare cluster 8 against its complement, North America will likely not show up in the list (or, at least, it will not be on top)."
 
The complement is everything outside of the selected cluster.
 
Edit: I have changed the cluster numbers to reflect the strong North American customer presence in the sample I use.
 
It is up to you to decide the name of each cluster depending on the significant attributes and attribute values for each group. It is also possible to analyze the number of cars attribute and see how it is explained by creating different groups. Change the shading variable in the cluster diagram to the attribute, and the state to the value you are looking for.

Posted in MS Data Mining and Business Value

Sampling data with the Excel 2007 data mining add in

Posted by thomasivarssonmalmo on June 19, 2009

I have used the sample tool in Excel 2007 before, but I never thought that I could use it directly on a data source rather than only on an Excel 2007 table; you can. I am using SSAS 2008 as the data source and the Excel 2007 data mining tools for SSAS 2008. This can be useful even if you do not want to use the data mining add ins and only want to collect data for a pivot table. Think about building a pivot table on top of 100 thousand records, or doing it with a random sample of 10 thousand records.
 
You cannot join tables in this tool, only select single tables or views.
 
Edit: I had the wrong order of the pictures but that is fixed now.
 
Open Excel 2007 with an empty sheet and go to the data mining menu.
 
 
 
Click on the button next to the data source name box. You will see the select data source page.
 
 
You will see the empty data source query editor below.
 
  
Click on the button next to the server data source listbox. You will see the new Analysis Services data source form. Enter the data source name and the target server name, and browse for the database at the bottom in the catalog name listbox. I will select the AdventureWorksDW database in the catalog name listbox below.
 
 
After we have selected the database we will return to the data source query editor. I will select the well known target mail view in the AdventureWorksDW database on the next page below.
 
 
 
You will come back to the select data source page and see the query in the text box at the bottom. Click Next. You will arrive at the select sample type page, where you have no choice but to accept random sampling.
 

 
 
On the random sampling options page your only choice is the number of records for the random sample.
 
 
In the final page you enter the name of the spreadsheet tab. I will not show that. Click finish to get the random sample records into the spreadsheet.
 

 
Another approach is to write a TSQL select and use the random sampling features that have been available since SQL Server 2005. The built-in sampling in the data mining tools can be interesting for end users without TSQL coding skills.
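As a sketch of that TSQL approach (the view name below is the target mail view used elsewhere in these posts; treat the names as assumptions if your source differs):

-- Exact sample size, but the whole view is scanned and sorted.
SELECT TOP (10000) *
FROM dbo.vTargetMail
ORDER BY NEWID();

-- TABLESAMPLE is faster but page-based, approximate, and works on base tables only.
SELECT *
FROM dbo.FactInternetSales TABLESAMPLE (10000 ROWS);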

Posted in MS Data Mining and Business Value

Text Data Mining on Mosha’s blog

Posted by thomasivarssonmalmo on February 8, 2009

With the permission of the blog owner I have collected various blog posts since late 2004 and have excluded posts about subjects outside of core MDX discussions. This has been copy and paste work into a single large text file, of the text and not the HTML source tags. I have marked the main text of each blog entry of interest and simply copied and pasted it into a text file.
 
Text data mining in SQL Server 2005 or 2008 starts with using SSIS and the Term Extraction and Term Lookup transforms in the data flow. You start by creating a dictionary of terms that you think are relevant.  Create a data flow task in the control flow and in the data flow you build something like in the picture below.
 
 
 
I have a flat file connection to my text file with all the blog entries. Import all text into a single column. One error that can occur with this task is truncation, so in the next picture you can open the advanced editor of the flat file source and set it to ignore truncation errors. I recommend that you set up an error flow, but that is outside the scope of this blog post.
 
 
The next task is the data conversion task, since the text data mining transforms only accept Unicode.
 
 
The Term Extraction task is used to create the dictionary of interesting MDX key words. The first time you run it you will get a long list of frequent words where 85 percent of them are rubbish. You set up the term extraction as in the following picture. I have selected Noun Phrase only, because it returned the most interesting and clear results. Have a look at Books On Line for more information about this.
 
 
Now follows the long work of removing less interesting key words. If you skip it you will see that [Internet Sales Amount] is the most frequent key word in Mosha's blog. You build a table of rejected words that will filter them out from the list of key words each time you run the Term Extraction task. You refer to this rejected key words list in the same task on the first tab. I have done selects on the results of the Term Extraction task and inserted the rejected words with a simple Insert Into TSQL statement, as sketched below.
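As a minimal sketch of that step, assuming the term extraction output was written to a table called dbo.ExtractedTerms with the default Term and Score columns (the table names are my own):

-- The exclusion table that the Term Extraction transform points to on its first tab.
CREATE TABLE dbo.RejectedTerms (Term nvarchar(128) NOT NULL);

-- Move uninteresting terms from the extraction output into the exclusion list.
INSERT INTO dbo.RejectedTerms (Term)
SELECT Term
FROM dbo.ExtractedTerms
WHERE Term IN (N'Internet Sales Amount', N'Adventure Works'); -- example terms only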
 
 
 
When you have a dictionary of interesting key words you can create a new data flow task and set it up like the first, with a connection to the same flat file. The only difference is that you use the Term Lookup task instead of the Term Extraction task. All other tasks remain the same, except that you will need a second OLE DB destination. Here you keep the key words and their frequency in significant parts of the text file. The reference table is the dictionary that we created in the first part (the term extraction output).
 
 
 
 
 The relation between the flat file output and the reference table is set up in the second tab of the term lookup task.
 
 
Finally, what were the results of this text data mining of Mosha's blog?
 
First we have the most significant key words, which required a lot of work with the rejected words table.
 
 
This dictionary is the reference table that I have used in the terms lookup task.
 
The result from the terms lookup is the following.
 
 
It can be a little hard to see, but to the left you have the key words from the dictionary and to the right the text fragment where they appeared. The frequency in the middle is the number of times the key word appears in the text fragment.
 
My model can be improved by adding dates for the blog entries and to run a data mining algorithm on top of these two tables that I have created, but that is the subject of another blog entry.

Posted in MS Data Mining and Business Value

Data mining – Key influencers

Posted by thomasivarssonmalmo on December 3, 2008

This is my first blog post that uses SSAS 2008. It is also valid for SSAS 2005 users.
 
It is about the data mining components for Excel 2007 and a simple tool, "Table Analysis Tools" and "Analyze Key Influencers", that I like because it is simple. You only have to decide which column you would like to predict and which columns you think might decide the value of that predicted column. It only works with discrete or categorized columns, not continuous data like sales or income.
 
Instead of using the well known example of bike buyer patterns in the customer dimension, I would like to use this algorithm to find which attributes are most important for explaining the number of cars each customer has bought.
 
This is the start table in the data mining samples that is part of the installation of the Excel 2007 data mining add ins for SSAS 2005 and SSAS 2008.
 
 
I select the "Table Analysis Tools and "Analyze Key Influencers"
 
The first step in that wizard is to decide on what column to explain.
 
 
 
Optionally you can unselect columns that are of no interest, like a source table key (like ID in the next picture). You should also deselect columns with duplicate information; for example, avoid using both the key and the description columns (CustomerId and CustomerName) and select only one of them.
 
 
And this is the result, where I have filtered on the most important attributes below. The Naive Bayes algorithm that is used here evaluates the influencing attributes one by one. This means that you can only look at each number of cars and its influencers, not across different numbers of cars.
 
 
If the customers have zero cars this can be explained by education and the commute distance. One car is explained by another education degree and commute distance. More than one car is explained by education, the number of children, commute distance and income.

Posted in MS Data Mining and Business Value

Decision Trees in SSAS 2005 and Excel 2007

Posted by thomasivarssonmalmo on August 24, 2008

I have followed the data mining part of SQL Server since the release of SQL Server 2000 and bought the official MS book. My attempts to learn more did not have the highest priority, and I played with the decision tree algorithm, which is close to the analysis you do with the ProClarity decomposition tree. I did not see the business use before the release of SQL Server 2005 and, especially, the data mining add in for Excel 2007.
 
During this year I have used the time series algorithm in Excel 2007 to make forecasts in my daily work. It is called Forecast in Excel 2007 and is quite easy to use.
 
Have a look at Rafal Lukawiecki's great presentations. You can find the links on Vidas' site here.
 
Here is my first attempt to show how I understand data mining in SQL Server 2005 and later. My strategy is to find the algorithms with the best user interface and best business use. Since clustering and decision trees were the first algorithms, I will start with them. I will not focus on detailed explanations of how you build DM models but more on the end result.
 
Data mining will help you find patterns and relations in your data that are not apparent from the start. It can also confirm relations that you have already found. Finally, data mining can help you make predictions based on the patterns in historic data when new data arrives.
 
Decision trees is a very general data mining algorithm that can be used in several areas. The most frequently used example is understanding which customers buy bicycles by running the algorithm on customer attributes. The business value is being able to direct new marketing campaigns when you know more about your customers, so that you do not send offers to the wrong customers.
 
I can also think of a scenario in production planning or quality management, with analysis of which variables cause a high or low scrap rate.
 
The source is not the Adventure Works cube but a view that is part of the data source view of that project. In this case you will see that the number of cars owned is the most important variable for explaining whether you buy a bike or not. The thin line with two colors in each box shows the proportion of bike buyers in each group. The picture below is from the data mining viewer in BIDS.
 
Decision tree
 
In each box you can expand the next level, with the most important variable for explaining bike buying.
 

Decision tree next level 

Number of cars owned = 2 seems to be the variable that best explains if you buy a bicycle.  In the next level for that group or variable it is income that is the second best explanation.

There is also a dependency network mining model viewer in BIDS that will confirm these two attributes (number of cars owned and, second, income) as the most important explanations for buying bikes.

Decision tree dependency

 BIDS is not easy to use for end users. In order to promote data mining for end users MS released the data mining add ins for Excel 2007.

 In Excel 2007 you select the data mining tab in the ribbon and click the classification button. A wizard will start.

 Classification in excel 2007

You only have to select which attribute should be explained or predicted and which attributes will be used to explain it. You can also select whether you would like to build a temporary model in SSAS 2005 or a permanent one. Be aware that the Excel 2007 end users will need access to SSAS 2005 or later to build this model.

I am using the spreadsheet with customer data that is a part of the installation of the Excel 2007 data mining add ins and not the same data as used in BIDS.

The result in Excel 2007, with the tree viewer, is shown below. Here it is age that is the primary explanation of whether you buy a bike or not.

The result in Excel 2007

Excel 2007 also has the dependency network viewer.

Dependency network Excel 2007

The Excel 2007 classification part does not have all the options that BIDS provides, but it is a quick way to see patterns in your data without having to be a data mining expert or developer. At the same time you must check your results carefully. You should also be aware that you must work with these models continuously so that their predictive value does not degrade over time.

Update: When is this algorithm interesting from a data warehouse perspective? First, you can check the relation between attributes in the same dimension in order to uncover information you did not know about. Also, I have done some work with this algorithm to see how people register error codes in different countries in the same system.
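As a quick, hedged sketch of such a check with plain TSQL (the view and column names are those of the AdventureWorksDW target mail view; adjust them to your own source):

-- Share of bike buyers per number of cars owned.
SELECT NumberCarsOwned,
       COUNT(*) AS Customers,
       SUM(CASE WHEN BikeBuyer = 1 THEN 1 ELSE 0 END) AS BikeBuyers,
       AVG(CAST(BikeBuyer AS float)) AS BikeBuyerShare
FROM dbo.vTargetMail
GROUP BY NumberCarsOwned
ORDER BY NumberCarsOwned;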

 

Posted in MS Data Mining and Business Value