I’ve been reading through the Data Warehouse Toolkit by Ralph Kimball. The case studies have been practical and applicable. Especially because they look at industries I haven't had the chance to work in. These concepts be used to plan and implement reporting out of just one operational system even if you’re not building a full data warehouse with multiple operational data sources.
Each case study adds new concepts. So far I’ve read chapter 3 Retail Sales, chapter 4 Inventory, and now I’m on chapter 5 Procurement.
The procurement case study lists some common analytic requirements, explores the option of using a single fact table or multiple fact tables to handle different transaction types, gives an example of a procurement accumulating snapshot table, and goes in detail on slowly changing dimensions.
I've included my brief notes on the section covering single vs multiple fact tables.
Single vs Multiple Fact Tables
Consider these factors when choosing between a single fact table or multiple fact tables to handle different transaction types:
- Are the analytic requirements more compatible with one approach versus the other?
- Are the different transaction types really different business processes? Control numbers for each type can be a sign that they are different.
- Are the metrics for the transaction types at differing granularities? If so you might need separate fact tables.
- How similar are the dimensions used between the different processes? If the facts have several dimensions that aren’t shared then you may need separate fact tables.
Use a bus matrix with additional columns for the answers to these questions for each process. This will highlight the differences and similarities between the processes.
The benefits of using separate fact tables include more descriptive dimensions and attributes with less generalized labels. Reuse of dimensions is still possible where the fact tables have them in common.
The drawbacks include more work loading, indexing and aggregating data. ETL could be more complex but it could also be simpler since there’s less work transforming the data to fit into one fact table.
The book itself includes very specific examples to flesh out these points and also has great diagrams.
Slowly Changing Dimensions (SCD)
Sometimes attributes in your dimensional tables will change. This change can affect fact table rows. For example when a product's category is changed. This kind of change will affect reporting - aggregates and OLAP cubes may need to be recalculated. Depending on the significance of the change the business may want to track the change historically - noting not just the current value but also the previous value and dates.
8 Ways to Handle SCDs
There are at least 8 methods for handling slowly changing dimension data in a data warehouse. Kimball group was one of the originators of these methods.
Retain original value - essentially do nothing. This dimensional attribute never changes. This can be useful for persistent durable keys and durable supernatural keys.
Overwrite the old value in the dimensional row with the current value. This method will always show the most recent value. Historical data is eliminated. aggregations and OLAP cubes will require recalculation. This can be used for small corrections.
Add a new row to the dimensional table to show the new attribute data. This is the standard way to represent historical information in a data warehouse. A dimensional table using Type 2 change tracking will also include administrative columns for effective date, expiration date, current row flag, natural key (like a product SKU), and surrogate key.
Historical reports, aggregates and OLAP cubes will NOT need recalculation. This is because the fact table rows will refer to the original surrogate key for the dimensional rows. For product history or distinct count of products be sure to use the natural key (SKU).
It's common to mix SCD techniques in the same dimensional table. For example when a type 1 change is made it will require multiple row updates if there are historical rows that product.
Add a new attribute column to hold the prior value and overwrite the existing data in the original column with the new value. This approach allows reporting on all fact rows with either the new or old attribute. It can be useful during transitional periods when users might want to see reporting both ways. This approach is best for significant changes that happen rarely and are scheduled. If an attribute changes every year then add yearly columns each year to track the attribute value over time. This approach does require recalculation of aggregations and OLAP cubes.
I'll review these next time.
Here's a link to the Kimball Group website where you can find out more: kimballgroup.com