Understanding MS Clustering Visualizations in BIDS
Posted by thomasivarssonmalmo on November 11, 2009
This is my first blog post about understanding data mining from the point of view of an information consumer who understands almost nothing about the statistical foundation of this area.
For those of you who are new to data mining: it is a statistical approach to finding patterns in data, and it can save you from writing a lot of SQL or MDX.
I will use the well-known bike buyer scenario that is part of the Adventure Works 2008 DW database. I am using the Adventure Works 2008 SSAS cube project, where you can find the Targeted Mailing data mining structure. Have a look at the picture below.
Remember that MS Clustering does not require you to explain bike buyer (yes or no). Clustering gives you a result that can be used to explain the number of cars or any of the other attribute values that are part of the mining model and mining structure of your choice. I will focus on the bike buyer attribute since it is well known and can be used to tie the different viewers in BIDS for SSAS 2008 together.
What I am doing here is simply using the mining models that are part of the Adventure Works demo project for SSAS 2008, but this blog post also applies to data mining in SSAS 2005. You will need to download that SSAS project and the source database from here.
Start up the Adventure Works cube demo project in BIDS (BI Developer Studio) 2008 and check the mining structures folder in that project. Click on "Targeted Mailing" and then go to the mining models tab. You will see several mining models, but this blog post is all about the "TM Clustering" mining model.
The next tab of interest is the mining model viewer. Select the TM Clustering mining model in the mining model list box at the upper left. The first tab that will show up is the Cluster Diagram visualization.
Clustering does not require you to set a target attribute like bike buyer when you build the mining model. This makes the algorithm a good starting point when you do not yet know which attribute should be the target and which should be the inputs. The result is your rows, or cases, arranged into groups where the cases in each group have similar values for each attribute. It is up to you to decide if these groups are useful or not.
After the mining structure is processed you can use the visualization tools in BIDS to see what patterns were found in your data source.
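To make the idea of "cases arranged into groups with similar values" concrete, here is a minimal sketch of plain k-means in Python. It is not the SSAS implementation, and the cases (age, yearly income in thousands) and the starting centroids are made up for illustration. Note that no target attribute is passed in anywhere.

```python
from math import dist


def assign(cases, centroids):
    """Give each case the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(case, centroids[k]))
        for case in cases
    ]


def update(cases, labels, centroids):
    """Move each centroid to the mean of the cases assigned to it."""
    moved = []
    for k, old in enumerate(centroids):
        members = [c for c, label in zip(cases, labels) if label == k]
        if members:
            moved.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            moved.append(old)  # keep an empty cluster where it was
    return moved


def kmeans(cases, centroids, max_iter=20):
    """Alternate assign/update until the grouping stops changing."""
    labels = assign(cases, centroids)
    for _ in range(max_iter):
        centroids = update(cases, labels, centroids)
        new_labels = assign(cases, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical cases: (age, yearly income in thousands)
cases = [(25, 30), (27, 35), (30, 32), (55, 90), (60, 85), (58, 95)]
labels, centroids = kmeans(cases, centroids=[cases[0], cases[3]])
print(labels)
```

The younger, lower-income cases end up in one group and the older, higher-income cases in the other. Whether that grouping is useful is still for you to judge.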
The cluster diagram above is set to show darker colours for the clusters where bike buyers are most frequent. You can see that by looking at the shading variable (Bike Buyer) and the state (=1 or true). Another interesting concept is the relation between the groups. It is not necessarily a good thing to have groups with strong relations, but I could be wrong about that. I will add more about this later. If you put the cursor on clusters 4 and 8 you will see that their members have a 71% and 59% probability of being bike buyers.
Edit: Bogdan Crivat, of the MS DM team, added this description of the cluster diagram: "By default the shading is based on the size(population) of each cluster. The layout is based on the similarity of distribution of each cluster. In short, this means that clusters close to each other in this diagram are more similar than clusters that are far from each other."
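The two kinds of shading can be sketched in a few lines of Python. This is only an illustration under my own assumptions: every case is taken to belong to exactly one cluster, and the cases, the `cluster` key and the function name are invented. For each cluster it returns the population (the default shading) and the share of cases in the chosen state (the shading used in this post).

```python
from collections import defaultdict


def shading_values(cases, attribute, state):
    """Return {cluster: (population, share of cases where attribute == state)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for case in cases:
        totals[case["cluster"]] += 1
        if case[attribute] == state:
            hits[case["cluster"]] += 1
    return {c: (totals[c], hits[c] / totals[c]) for c in totals}


# Hypothetical cases, already assigned to a cluster
cases = [
    {"cluster": 4, "BikeBuyer": 1},
    {"cluster": 4, "BikeBuyer": 1},
    {"cluster": 4, "BikeBuyer": 1},
    {"cluster": 4, "BikeBuyer": 0},
    {"cluster": 1, "BikeBuyer": 1},
    {"cluster": 1, "BikeBuyer": 0},
    {"cluster": 1, "BikeBuyer": 0},
    {"cluster": 1, "BikeBuyer": 0},
]

shading = shading_values(cases, "BikeBuyer", 1)
for cluster, (population, share) in sorted(shading.items()):
    print(f"Cluster {cluster}: population={population}, P(BikeBuyer=1)={share:.0%}")
```

Both clusters have the same population here, so they would get the same default shading, but cluster 4 would be darker once the shading variable is set to Bike Buyer = 1.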
In the cluster profile above you get a confirmation that clusters 4 and 8 have more bike buyers, and a confirmation of the percentage probability for these two clusters. Do not forget to have a look at the other attributes for these two groups, because they will tell you more about how these groups differ from the others. This can be the hard part, but you can get a quick impression by checking whether the proportions of the colours for each attribute differ between clusters.
In the cluster characteristics viewer (Cluster 4) you can see the probabilities for the most important attribute states or values within a cluster. This is why an attribute can occur several times here. Both the attribute with its state or value and the probability of that state will be shown. Put the cursor on the bike buyer = 1 attribute and you will see the same percentage probability that we have seen since the cluster diagram.
Finally we can see the cluster discrimination visualizer above, where I have selected clusters 4 and 8. This tool can cause some confusion. Here you will see the attributes and attribute states on which these two clusters differ the most. Bike buyer is at the end since it shows the smallest difference between these two clusters. If you have two clusters with few differences you can have a problem, but that depends on what you are looking for. In that situation you will probably have to review your mining model.
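One way to think about this ranking is as a difference in proportions. The sketch below is my own simplification, not the scoring that BIDS actually uses, and the two small clusters are invented. It computes how common each attribute state is in each cluster and sorts the states by the size of the gap.

```python
from collections import Counter


def state_shares(cases):
    """Share of cases in which each (attribute, state) pair occurs."""
    counts = Counter((attr, value) for case in cases for attr, value in case.items())
    return {pair: n / len(cases) for pair, n in counts.items()}


def discriminate(cases_a, cases_b):
    """Rank (attribute, state) pairs by how differently the two clusters use them."""
    shares_a, shares_b = state_shares(cases_a), state_shares(cases_b)
    diffs = {
        pair: shares_a.get(pair, 0.0) - shares_b.get(pair, 0.0)
        for pair in set(shares_a) | set(shares_b)
    }
    # Largest absolute gap first; ties are broken by name so the order is stable
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), str(kv[0])))


# Hypothetical members of two clusters
cluster_4 = [
    {"Region": "Europe", "BikeBuyer": 1},
    {"Region": "Europe", "BikeBuyer": 1},
    {"Region": "Europe", "BikeBuyer": 1},
    {"Region": "Europe", "BikeBuyer": 0},
]
cluster_8 = [
    {"Region": "North America", "BikeBuyer": 1},
    {"Region": "North America", "BikeBuyer": 1},
    {"Region": "North America", "BikeBuyer": 0},
    {"Region": "North America", "BikeBuyer": 0},
]

ranked = discriminate(cluster_4, cluster_8)
for (attribute, state), gap in ranked:
    favours = "cluster 4" if gap > 0 else "cluster 8"
    print(f"{attribute} = {state}: gap {abs(gap):.2f}, favours {favours}")
```

In this made-up data Region separates the two clusters completely and lands at the top, while Bike Buyer differs only a little and lands at the bottom, which is the same pattern as in the viewer.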
Bogdan Crivat also sent me this helpful comment about the difference between Cluster Characteristics and Cluster Discrimination (vs. Complement):
"The discrimination vs. complement view is probably the most useful visualization for any single cluster, as it emphasizes cluster's specifics. Cluster 7 and 8 have a strong presence of North American customers, so this will appear in characteristics. However, if you compare cluster 8 against its complement, North America will likely not show up in the list (or, at least, it will not be on top)."
The complement is everything outside of the selected cluster.
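A small Python sketch shows why a state can be characteristic of a cluster without discriminating it from its complement. The cases, cluster numbers and function names below are invented for illustration, and a plain comparison of proportions stands in for whatever score the viewer really computes.

```python
def share(cases, attribute, state):
    """Fraction of the given cases where attribute == state."""
    return sum(1 for c in cases if c[attribute] == state) / len(cases)


def versus_complement(cases, cluster, attribute, state):
    """Share of a state inside one cluster and in everything outside it."""
    inside = [c for c in cases if c["cluster"] == cluster]
    outside = [c for c in cases if c["cluster"] != cluster]  # the complement
    return share(inside, attribute, state), share(outside, attribute, state)


# Hypothetical cases: North America dominates clusters 7 and 8
cases = (
    [{"cluster": 8, "Region": "North America"}] * 4
    + [{"cluster": 7, "Region": "North America"}] * 4
    + [{"cluster": 2, "Region": "North America"}] * 2
    + [{"cluster": 2, "Region": "Europe"}] * 2
)

inside, outside = versus_complement(cases, 8, "Region", "North America")
print(f"Inside cluster 8: {inside:.0%}, in the complement: {outside:.0%}")
```

North America is very common inside cluster 8, so it would rank high in the characteristics view, but it is almost as common in the complement, so it does little to set cluster 8 apart.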
Edit: I have changed the cluster numbers to reflect the clusters with a strong presence of North American customers in the sample I use.
It is up to you to decide the name of each cluster depending on the significant attributes and attribute values for each group. It is also possible to analyze the number of cars attribute and see how it is explained by the different groups. Change the shading variable in the cluster diagram to that attribute and the state to the value you are looking for.