Star Schema versus Snowflake Schema
This is a question that comes up in job interviews where data warehousing is a topic. These are some general summary level notes that I’ve taken during the course of my own research. To get really in depth with data warehouse concepts and case studies I highly recommend Ralph Kimball’s The Data Warehouse Toolkit. It really comes alive once you have some experience working with databases and doing reporting.
What is snowflake schema?
Snowflake schema is generally what we think of when talking about a traditional normalized database. The dimensions can connect to each other based on hierarchical relationships. So dimensions can branch off of the fact table and off of each other like snowflake branches.
What is star schema?
Star schema is more commonly used in data warehouse scenarios. It’s a denormalized schema where each dimension connects directly to the fact table - the fact table will have potentially many dimension foreign keys. The schema will look like a star in that the dimensions branch off of the fact table only.
What are the pros and cons of star schema and snowflake schema?
Because it is normalized any querying will involve more joins. Joins can affect performance as well as readability for the end user.
ETL can be more complex because there is dependency between dimensions - this also inhibits using parallelism in the ETL process.
It’s easier to do data maintenance because the data is less duplicated. You can update a lookup table instead of updating every row of a fact table.
It conserves more space because denormalization decreases data duplication.
When is snowflake schema used? A traditional operational database or a data warehouse.
What types of analysis are easier with snowflake schema? Metrics analysis such as calculating revenue per customer. Complex analysis dealing with many-to-many relationships. Essentially any kind of analysis can be done with snowflake schema as long as you're following the business logic of the database.
Snowflake schema is implemented with a transaction or fact table with foriegn keys to dimensions. Those dimensions could then have foreign keys joining them to further related dimensions.
Star Schema is denormalized. Querying will involve fewer joins. This makes it easier to reason about for users, and more performant for querying.
ETL involves simpler logic and is more parallel because there’s no dependency between dimensions.
Space usage can be higher because there is more data duplication.
When do you use Star schema? In a data warehouse and for OLAP cubes.
The star schema can be used for reporting over time, dimensional analysis and fast aggregation. However it is less flexible for analysis than a regular database or snowflake model.
Star schema is implemented using a fact table that has foreign keys to dimensions with (usually) many attributes. These dimensions generally do not connect to further dimensions.
If performance isn’t a major roadblock then a star schema style view could be built on top of a traditional snowflake schema database. This view could then be used by BI tools and end users for simpler querying. On the other hand if performance is an issue then the scale of data is there such that investing in development of a star schema enterprise data warehouse will probably be worthwhile.