November 7, 2025 by Arif Aslam 5 min read

Comparing Two Versions of the Same Data File

You exported your customer list in January. Now it's February, and you have a new export. Someone asks: "What changed?" New customers, lost customers, updated emails, changed account tiers - you need a diff, but this isn't source code. It's a CSV with 15,000 rows.

Opening both files side by side and scrolling isn't going to cut it. What you actually need is a join - specifically a FULL OUTER JOIN on the unique identifier. That gives you every row from both files, lined up, with NULLs where something exists in one file but not the other.

Here's the approach, step by step.

Load both files

Open ExploreMyData and load your first file - say, customers_january.csv. Then load the second file, customers_february.csv. Both files appear as separate tables in the app.

Before joining, check that both files have the same key column. If January uses "customer_id" and February uses "id", you'll need to rename one first (use Copy Columns to copy it with the new name, then Delete Columns to remove the original). The columns don't need to be in the same order, but the join key needs to match.

Join with FULL OUTER

On either table, open the Join operation. Select the other file as the table to join with. Set the join type to Full Outer and the join key to customer_id.

A FULL OUTER JOIN keeps everything. Rows that exist in both files get merged. Rows that exist only in January appear with NULLs for the February columns. Rows that exist only in February appear with NULLs for the January columns.

jan_customer_id	jan_email	jan_tier	feb_customer_id	feb_email	feb_tier
1001	alice@co.com	starter	1001	alice@co.com	pro
1002	bob@co.com	pro	1002	bob.new@co.com	pro
1003	carol@co.com	starter	NULL	NULL	NULL
NULL	NULL	NULL	1248	dana@co.com	starter

Full Outer Join on customer_id. Row 3 (carol) is NULL on the Feb side - she left. Row 4 (dana) is NULL on the Jan side - she's new. Row 2 has a changed email.

The result has columns from both files. DuckDB automatically prefixes them to avoid conflicts - you'll see jan_email and feb_email (or similar prefixes based on your table names).

Find new rows (added in February)

These are rows where the January side is NULL - the customer exists in February but not January.

Apply a Filter: jan_customer_id IS NULL. Every row in the result is a new customer who appeared in the February export.

Count the rows. If you went from 15,000 to 15,400 customers, you'd expect roughly 400 new rows here (plus any that were both added and removed).

Find removed rows (gone from February)

The opposite filter: feb_customer_id IS NULL. These are customers who were in January's export but disappeared from February's. Churned customers, deleted accounts, or data that fell out of scope.

Filter: feb_customer_id IS NULL - customers present in January but gone from February:

jan_customer_id	jan_name	jan_email	jan_tier	feb_customer_id
1003	Carol Vega	carol@co.com	starter	NULL
1019	Dan Marsh	dan@marsh.io	pro	NULL
1041	Elena Park	epark@example.com	starter	NULL

3 customers removed - churned accounts, deleted records, or data that fell out of scope since January.

Find changed values

This is the interesting part. For rows that exist in both files, which fields actually changed? Use Add Column to create a change flag:

CASE WHEN jan_email != feb_email THEN 'email_changed' WHEN jan_tier != feb_tier THEN 'tier_changed' ELSE 'no_change' END

Name it change_type. Then filter to exclude "no_change" to see only the rows where something is different.

For a broader check across more columns, you can create multiple flags or use a simpler approach:

CASE WHEN jan_email != feb_email OR jan_tier != feb_tier OR jan_phone != feb_phone THEN 'changed' ELSE 'same' END

This catches any row where at least one field differs between the two exports.

Putting it all together

For a complete diff summary, create a single status column with Add Column:

CASE WHEN jan_customer_id IS NULL THEN 'added' WHEN feb_customer_id IS NULL THEN 'removed' WHEN jan_email != feb_email OR jan_tier != feb_tier THEN 'modified' ELSE 'unchanged' END

Name it diff_status. Now every row is labeled. Filter out "unchanged" to see just the differences, or use the Column Explorer to see the distribution: how many added, removed, modified, and unchanged.

jan_customer_id	jan_tier	feb_tier	diff_status
1001	starter	pro	modified
1002	pro	pro	unchanged
1003	starter	NULL	removed
NULL	NULL	starter	added
1005	enterprise	enterprise	unchanged

The diff_status column labels every row. Filter out "unchanged" to focus on what actually changed between the two exports.

When to use this

This works for any scenario where you have two snapshots of the same data:

Monthly customer exports - track churn and growth
Inventory snapshots - what was added to or removed from stock
Price lists - which products changed price and by how much
Employee rosters - new hires, departures, role changes
Configuration exports - what settings changed between deployments

The key requirement is a stable unique identifier that exists in both files. Without that, you can't match rows. If your data doesn't have a natural key, you might need to create a composite one (like combining name + date + amount) before joining.

Compare your files now →

Comparing Two Versions of the Same Data File

Load both files

Join with FULL OUTER

Find new rows (added in February)

Find removed rows (gone from February)

Find changed values

Putting it all together

When to use this

Try it yourself

Related tools

More from the blog

How to Join Multiple CSV Files Without Coding

Finding Patterns in Missing Data

How to Analyze a CSV File Without Uploading It Anywhere