Visualize the ETL flow
Posted by thomasivarssonmalmo on November 8, 2009
There are several approaches of how to document the flow of data from the source systems into the data warehouse and data marts.
This blog post is about finding a good enough starting point for documenting the data flow from the data warehouse to the data marts. How much detail should you be showing? Should you start by tracking each attribute on the way from the data warehouse to the SSAS cube? What tool should you be using? Should you document the code and translate that into business descriptions that end users can understand?
Well it is easy to drown in an approach that is to big and will take to much time before delivering any useful. What I describe is an approch that is not the final approach but is god enough as a start but will able to improve over time into more detail.
Let us start with the tool. I will use MS Visio 2010 that will shortly be provided as a public CTP but the templates are the same as in Visio 2007. You will be able to do the same in Visio 2007.
First, this is the template that I choosed not to use.
If you look at the options of objects to use in the left pane you have a great toolbox. The problem is that it will require a lot of time to learn about these options and teach the viewers of the documentation what is presented. The number of objects are to many.
The other approach is to use the Gane-Sarson template in Visio.
What I am trying to achive here is to document what data stores and what code(SSIS packages and stored procedures) that are involved in the data flow from the data warehouse to the data marts. I am actually looking after redundant data transformations and a more efficient way of bulding the marts since there is a stove pipe problem behind this. To many data marts have been built to solve isolated report requirements.
Procedures and SSIS packages are classified as processes since they do something with data. Data stores are tables, views and cubes. I am aware of that you can discuss if a view or a SSAS cube is an interface. I think that a view that change data is a process and not an interface. For the moment I have classified cubes as data stores. I mainly use data store and process as the objects and try to fit in cubes, views, stored procedures and SSIS packages into that framework.
I have tried other approaches like documenting each measure and dimension attribute from the cube down to the data ware house. The problem is that it will not bring quick enough results to understanding the system. You will end up with large Excel lists with hundreds of columns and a collections of code that are to hard to understand even for BI-professionals.
The next step is to add business terms descriptions from the code to each process. I will discuss that in another blog post. When you have this general classificatin available you can start working with the details in each process.