Decision Trees in SSAS 2005 and Excel 2007
I have followed the data mining part of SQL Server since the release of SQL Server 2000 and bought MS official book. My attempts to learn more did not have the highest priority and I played with the decision tree algoritm, that is close to the analysis you do with ProClarity Decomposition tree. I have never seen the business use before the release of SQL Server 2005 and especially the data mining add in for Excel 2007.
During this year I have used the time series algorithm to do forecasts in Excel 2007 to make forcasts in my daily work. It is called forecast in Excel 2007 and is quite easy to use.
Have a look at Rafal Lukawiecki’s great presentations. You can find the links on Vidas site here
Here is my first attempt to show how I understand that data mining works in SQL Server 2005 and later. My strategy is to find the ones with best user interface and best business use. Since clustering and decision trees were the first algorithms I will start with them. I will not focus on a detailed explanations of how you build DM-models but more on the end result.
Data mining will help you with finding patterns and relations in your data that is not apparent from the start. It can also confirm the relations that you have already found. Finally data mining can help you with doing predictions, based on the patterns in historic data when new data arrives.
Decision trees is a very general data mining algorithm that can be used in several areas. The example that is most frequently used is how to understand what customers that buys bicycles by running this algorithm on customer attributes. The business value is to direct new market campains when you know more about your customers, so that you do not send offers to the wrong customers.
I can also think about a scenario of production planning or quality management with analysis of what variables that causes a high or low scrap rate.
The source is not the Adventure Works cube but a view that is part of the data source view of that project. In this case you will see that the number of cars owned is the most important variable for explaining if you buy a bike or not. The thin line with two colors in each box explains if you are a bike buyer or not in each group. This picture below is from the data mining viewer in BIDS.
In each box you can expand the next level of most important variable for explaining bike buying.
Number of cars owned = 2 seems to be the variable that best explains if you buy a bicycle. In the next level for that group or variable it is income that is the second best explanation.
There is also a dependency network mining model viewer in BIDS that will confirm these two attributes(number of cars owned and, second, income)as the most important explanations for buying bikes.
BIDS is not easy to use for end users. In order to promote data mining for end users MS released the data mining add ins for Excel 2007.
In Excel 2007 you select the data mining tab in the ribbon and click the classification button. A wizard will start.
You will only have to select what attribute that should be explained or predicted and what attributes that will will be used to explain it. You can also select if you would like to build a temporary model in SSAS 2005 or a permanent one. Be aware of that the Excel 2007 end users will need access to SSAS 2005 or later to build this model.
I am using the spreadsheet with customer data that is a part of the installation of the Excel 2007 data mining add ins and not the same data as used in BIDS.
The result in Excel 2007 is this with the tree viewer. Here it is age that is the primary explanation for if you buy bikes or not.
Excel 2007 also has the dependency network viewer.
The Excel 2007 classification part do not have all the options that BIDS provides but it is a quick way to see patterns in your data without having to be a data mining expert or developer. At the same time you must be careful and check your results carefully. Your should also be aware of that you must work with these models continously so that their prediction value will not degrade over time.
Update: When is this algorithm interesiting from a data warehouse perspective? First, you can check the relation between attributes in the same dimension in order to get undiscovered information. Also, I have done som work with this algorithm to see how people register error codes in different countries in the same system.