Comparing Two Versions of the Same Data File
You exported your customer list in January. Now it's February, and you have a new export. Someone asks: "What changed?" New customers, lost customers, updated emails, changed account tiers - you need a diff, but this isn't source code. It's a CSV with 15,000 rows.
Opening both files side by side and scrolling isn't going to cut it. What you actually need is a join - specifically a FULL OUTER JOIN on the unique identifier. That gives you every row from both files, lined up, with NULLs where something exists in one file but not the other.
Here's the approach, step by step.
Load both files
Open ExploreMyData and load
your first file - say, customers_january.csv.
Then load the second file, customers_february.csv.
Both files appear as separate tables in the app.
Before joining, check that both files have the same key column. If January uses "customer_id" and February uses "id", you'll need to rename one first (use Rename Column). The columns don't need to be in the same order, but the join key needs to match.
Join with FULL OUTER
On either table, open the Join operation. Select the
other file as the table to join with. Set the join type to Full Outer
and the join key to customer_id.
A FULL OUTER JOIN keeps everything. Rows that exist in both files get merged. Rows that exist only in January appear with NULLs for the February columns. Rows that exist only in February appear with NULLs for the January columns.
| jan_customer_id | jan_email | jan_tier | feb_customer_id | feb_email | feb_tier |
|---|---|---|---|---|---|
| 1001 | alice@co.com | starter | 1001 | alice@co.com | pro |
| 1002 | bob@co.com | pro | 1002 | bob.new@co.com | pro |
| 1003 | carol@co.com | starter | NULL | NULL | NULL |
| NULL | NULL | NULL | 1248 | dana@co.com | starter |
Full Outer Join on customer_id. Row 3 (carol) is NULL on the Feb side - she left. Row 4 (dana) is NULL on the Jan side - she's new. Row 2 has a changed email.
The result has columns from both files. DuckDB automatically prefixes them to avoid conflicts -
you'll see jan_email and
feb_email (or similar prefixes
based on your table names).
Find new rows (added in February)
These are rows where the January side is NULL - the customer exists in February but not January.
Apply a Filter:
jan_customer_id IS NULL.
Every row in the result is a new customer who appeared in the February export.
Count the rows. If you went from 15,000 to 15,400 customers, you'd expect roughly 400 new rows here (plus any that were both added and removed).
Find removed rows (gone from February)
The opposite filter: feb_customer_id IS NULL.
These are customers who were in January's export but disappeared from February's. Churned customers,
deleted accounts, or data that fell out of scope.
Filter: feb_customer_id IS NULL - customers present in January but gone from February:
| jan_customer_id | jan_name | jan_email | jan_tier | feb_customer_id |
|---|---|---|---|---|
| 1003 | Carol Vega | carol@co.com | starter | NULL |
| 1019 | Dan Marsh | dan@marsh.io | pro | NULL |
| 1041 | Elena Park | epark@example.com | starter | NULL |
3 customers removed - churned accounts, deleted records, or data that fell out of scope since January.
Find changed values
This is the interesting part. For rows that exist in both files, which fields actually changed? Use Add Column to create a change flag:
CASE WHEN jan_email != feb_email THEN 'email_changed' WHEN jan_tier != feb_tier THEN 'tier_changed' ELSE 'no_change' END
Name it change_type. Then
filter to exclude "no_change" to see only the rows where something is different.
For a broader check across more columns, you can create multiple flags or use a simpler approach:
CASE WHEN jan_email != feb_email OR jan_tier != feb_tier OR jan_phone != feb_phone THEN 'changed' ELSE 'same' END
This catches any row where at least one field differs between the two exports.
Putting it all together
For a complete diff summary, create a single status column with Add Column:
CASE WHEN jan_customer_id IS NULL THEN 'added' WHEN feb_customer_id IS NULL THEN 'removed' WHEN jan_email != feb_email OR jan_tier != feb_tier THEN 'modified' ELSE 'unchanged' END
Name it diff_status.
Now every row is labeled. Filter out "unchanged" to see just the differences, or use the Column
Explorer to see the distribution: how many added, removed, modified, and unchanged.
| jan_customer_id | jan_tier | feb_tier | diff_status |
|---|---|---|---|
| 1001 | starter | pro | modified |
| 1002 | pro | pro | unchanged |
| 1003 | starter | NULL | removed |
| NULL | NULL | starter | added |
| 1005 | enterprise | enterprise | unchanged |
The diff_status column labels every row. Filter out "unchanged" to focus on what actually changed between the two exports.
When to use this
This works for any scenario where you have two snapshots of the same data:
- Monthly customer exports - track churn and growth
- Inventory snapshots - what was added to or removed from stock
- Price lists - which products changed price and by how much
- Employee rosters - new hires, departures, role changes
- Configuration exports - what settings changed between deployments
The key requirement is a stable unique identifier that exists in both files. Without that, you can't match rows. If your data doesn't have a natural key, you might need to create a composite one (like combining name + date + amount) before joining.