When JOIN isn't right
How do I show data from two or more tables in a way that makes sense for my desired output - especially when there's a many to many relationship? It's a question that has come up several times for me recently in my own work and when helping others.
When working with table data it's easy to get stuck thinking the only way to show or summarize data is to use a join. What if the tables are related to each other in that they are similar data but from different departments or categories? There may not be a good way to join data like that. Separate columns for each department's measures aren't necessary. It could work better to put all the measures in one column and label the rows by department.
What's really needed is a union.
To help navigate these choices try starting with the end in mind. Using a spreadsheet make a very simple mock-up of what the end result should like like. Knowing exactly where everything should end up makes it easier to plan the steps to get there.
Let's say the final report is going to look like a pivot table with categories and sub categories but the data is currently in different tables. Then use a union to get all the data into the same table or result set in a way that a pivot table or Tableau can display effectively.
Tableau has great unioning tools in it's desktop application. It will match up same name columns for and label the source table. It also has a manual matching tool for when columns don't have similar names. Tableau will handle any columns that don't have a match at all as well.
Looker allows union when used with derived tables. The syntax is similar to SQL and implementation is much more manual than with Tableau.
In Excel, Power Query can also union data. Power Query is now called "Get & Transform". It refers to union as append.
Pandas can union DataFrames with the concat or append functions. The appended result will have a hierarchical index - similar to the way Tableau labels the source of each table.
Finally using SQL to union can be a bit more manual bit it's just as effective. Here are some documentation options:
- MySQL Union Clause
- Microsoft Union Clause
- PostgreSQL Union Clause
- Scroll down to the UNION Clause section
Make sure the tables have the same number of columns, in the same order, with the same data types. They don't have to have the same names. So if one table doesn't have enough columns then make a dummy column like so:
select null as dummy_col from dual
Give each table its own "label" column with the name of the source table or category to differentiate each table's data in the final result. Once the tables are unioned use the table layout in Tableau, the Pivot Table in Excel, write a SQL query to display the results or use any other reporting or data viz tool.
There are countless data manipulation tools that can combine datasets in many different ways. Learn how they work at a deeper level by getting comfortable with SQL. Thinking in terms of the relationships between tables will help when solving tricky data shaping puzzles in higher level tools like Tableau.