← All posts
5 min read

Comparing Two Versions of the Same Data File

You exported your customer list in January. Now it's February, and you have a new export. Someone asks: "What changed?" New customers, lost customers, updated emails, changed account tiers - you need a diff, but this isn't source code. It's a CSV with 15,000 rows.

Opening both files side by side and scrolling isn't going to cut it. What you actually need is a join - specifically a FULL OUTER JOIN on the unique identifier. That gives you every row from both files, lined up, with NULLs where something exists in one file but not the other.

Here's the approach, step by step.

Load both files

Open ExploreMyData and load your first file - say, customers_january.csv. Then load the second file, customers_february.csv. Both files appear as separate tables in the app.

Before joining, check that both files have the same key column. If January uses "customer_id" and February uses "id", you'll need to rename one first (use Rename Column). The columns don't need to be in the same order, but the join key needs to match.

Join with FULL OUTER

On either table, open the Join operation. Select the other file as the table to join with. Set the join type to Full Outer and the join key to customer_id.

A FULL OUTER JOIN keeps everything. Rows that exist in both files get merged. Rows that exist only in January appear with NULLs for the February columns. Rows that exist only in February appear with NULLs for the January columns.

jan_customer_idjan_emailjan_tierfeb_customer_idfeb_emailfeb_tier
1001alice@co.comstarter1001alice@co.compro
1002bob@co.compro1002bob.new@co.compro
1003carol@co.comstarterNULLNULLNULL
NULLNULLNULL1248dana@co.comstarter

Full Outer Join on customer_id. Row 3 (carol) is NULL on the Feb side - she left. Row 4 (dana) is NULL on the Jan side - she's new. Row 2 has a changed email.

The result has columns from both files. DuckDB automatically prefixes them to avoid conflicts - you'll see jan_email and feb_email (or similar prefixes based on your table names).

Find new rows (added in February)

These are rows where the January side is NULL - the customer exists in February but not January.

Apply a Filter: jan_customer_id IS NULL. Every row in the result is a new customer who appeared in the February export.

Count the rows. If you went from 15,000 to 15,400 customers, you'd expect roughly 400 new rows here (plus any that were both added and removed).

Find removed rows (gone from February)

The opposite filter: feb_customer_id IS NULL. These are customers who were in January's export but disappeared from February's. Churned customers, deleted accounts, or data that fell out of scope.

Filter: feb_customer_id IS NULL - customers present in January but gone from February:

jan_customer_idjan_namejan_emailjan_tierfeb_customer_id
1003Carol Vegacarol@co.comstarterNULL
1019Dan Marshdan@marsh.ioproNULL
1041Elena Parkepark@example.comstarterNULL

3 customers removed - churned accounts, deleted records, or data that fell out of scope since January.

Find changed values

This is the interesting part. For rows that exist in both files, which fields actually changed? Use Add Column to create a change flag:

CASE WHEN jan_email != feb_email THEN 'email_changed' WHEN jan_tier != feb_tier THEN 'tier_changed' ELSE 'no_change' END

Name it change_type. Then filter to exclude "no_change" to see only the rows where something is different.

For a broader check across more columns, you can create multiple flags or use a simpler approach:

CASE WHEN jan_email != feb_email OR jan_tier != feb_tier OR jan_phone != feb_phone THEN 'changed' ELSE 'same' END

This catches any row where at least one field differs between the two exports.

Putting it all together

For a complete diff summary, create a single status column with Add Column:

CASE WHEN jan_customer_id IS NULL THEN 'added' WHEN feb_customer_id IS NULL THEN 'removed' WHEN jan_email != feb_email OR jan_tier != feb_tier THEN 'modified' ELSE 'unchanged' END

Name it diff_status. Now every row is labeled. Filter out "unchanged" to see just the differences, or use the Column Explorer to see the distribution: how many added, removed, modified, and unchanged.

jan_customer_idjan_tierfeb_tierdiff_status
1001starterpromodified
1002proprounchanged
1003starterNULLremoved
NULLNULLstarteradded
1005enterpriseenterpriseunchanged

The diff_status column labels every row. Filter out "unchanged" to focus on what actually changed between the two exports.

When to use this

This works for any scenario where you have two snapshots of the same data:

  • Monthly customer exports - track churn and growth
  • Inventory snapshots - what was added to or removed from stock
  • Price lists - which products changed price and by how much
  • Employee rosters - new hires, departures, role changes
  • Configuration exports - what settings changed between deployments

The key requirement is a stable unique identifier that exists in both files. Without that, you can't match rows. If your data doesn't have a natural key, you might need to create a composite one (like combining name + date + amount) before joining.

Compare your files now →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData