Finding Near-Duplicate Rows in Your Data
You pull a customer report and notice something off. The total customer count is 4,200, but your billing system says 3,800. You start scrolling and there they are: "Sarah Chen" and "sarah chen" and " Sarah Chen". Three rows for one person. Her revenue is split across all three.
Exact duplicates are easy - run Remove Duplicates and you're done. Near-duplicates are trickier. The rows aren't identical; they just look the same to a human but not to a computer. A basic DISTINCT won't catch them.
The usual suspects
Near-duplicates show up in predictable ways:
- Case differences: "TechWave Solutions" vs "Techwave Solutions" vs "TECHWAVE SOLUTIONS"
- Extra whitespace: " John Smith " (leading/trailing spaces) vs "John  Smith" (double internal space) vs "John Smith"
- Trailing characters: "Acme Corp." vs "Acme Corp" vs "Acme Corp, Inc."
- Abbreviations: "St." vs "Street", "NY" vs "New York"
The first two categories - case and whitespace - account for probably 80% of near-duplicate issues. And they're the easiest to fix.
Near-duplicate rows for the same company - each variation is treated as a different entity by the database:
| company_name (raw) | revenue | contact_email |
|---|---|---|
| TechWave Solutions | $12,400 | sarah@techwave.io |
| techwave solutions | $8,200 | mike@techwave.io |
| TECHWAVE SOLUTIONS | $3,100 | ops@techwave.io |
| TechWave Solutions  | $5,500 | cfo@techwave.io |

A GROUP BY on company_name produces four separate groups (the fourth row hides a trailing space) whose revenues together sum to $29,200 - instead of one row reporting the $29,200 total.
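To see the split concretely, here's a sketch using SQLite in memory. The sample rows mirror the table above, with an assumed trailing space on the fourth name so that all four land in distinct groups:

```python
import sqlite3

# Hypothetical reproduction of the TechWave rows; the trailing space on the
# fourth name is an assumption to make it a distinct value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (company_name TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("TechWave Solutions", 12400),
        ("techwave solutions", 8200),
        ("TECHWAVE SOLUTIONS", 3100),
        ("TechWave Solutions ", 5500),  # note the trailing space
    ],
)

# GROUP BY on the raw column treats each variation as its own entity.
rows = conn.execute(
    "SELECT company_name, SUM(revenue) FROM customers GROUP BY company_name"
).fetchall()
print(rows)  # four groups instead of one
```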
The strategy: normalize, then deduplicate
You can't deduplicate on the original column because the values aren't equal. So create a normalized copy of the column, clean it up, and deduplicate on that instead.
Here's the approach in ExploreMyData:
Step 1: Copy the column
Use Copy Column from the Columns group to duplicate the column you want to match on. If your column is company_name, you'll get a new column called company_name_copy.
This copy is what we'll normalize. The original stays untouched so you keep the real name.
Step 2: Trim and lowercase
Apply Text Transform to the copy column. First, select "trim" to strip leading and trailing whitespace. Then apply it again with "lowercase" to normalize case.
After these two steps, all of these become the same value:
- "TechWave Solutions" → "techwave solutions"
- "TECHWAVE SOLUTIONS" → "techwave solutions"
- " Techwave Solutions " → "techwave solutions"
The SQL ExploreMyData generates:
TRIM(LOWER(company_name_copy))
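The trim-plus-lowercase normalization can be sanity-checked outside the tool. This sketch runs the same TRIM(LOWER(...)) expression through SQLite; the variant strings are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
variants = ["TechWave Solutions", "TECHWAVE SOLUTIONS", "  Techwave Solutions  "]

# Apply the same expression ExploreMyData generates, one value at a time.
normalized = [
    conn.execute("SELECT TRIM(LOWER(?))", (v,)).fetchone()[0] for v in variants
]
print(normalized)  # all three collapse to the same key
```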
Step 3: Remove extra internal spaces
Trim only handles leading and trailing whitespace. "John  Smith" (two spaces) is still different from "John Smith" (one space) after trimming. Use Find & Replace on the normalized column: find "  " (two spaces) and replace with " " (one space).
Under the hood: REPLACE(company_name_copy, '  ', ' ')
If your data might have three or more consecutive spaces, apply this step twice. The first pass turns triple spaces into doubles, the second pass collapses those to singles.
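The two-pass collapse is easy to verify with a minimal Python sketch - Python's str.replace behaves like SQL REPLACE here, scanning left to right without overlapping matches:

```python
def collapse_spaces(s: str, passes: int = 2) -> str:
    # Each pass mirrors one Find & Replace step: "  " (two spaces) -> " " (one space).
    for _ in range(passes):
        s = s.replace("  ", " ")
    return s

# Three consecutive spaces need both passes: the first turns them into two,
# the second collapses those into one.
print(collapse_spaces("John   Smith"))  # -> "John Smith"
```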
Both columns side by side - original preserved, normalized copy used for matching:
| company_name (original) | company_name_copy (normalized) |
|---|---|
| TechWave Solutions | techwave solutions |
| techwave solutions | techwave solutions |
| TECHWAVE SOLUTIONS | techwave solutions |
| TechWave Solutions | techwave solutions |
| Acme Corp. | acme corp. |
| ACME CORP | acme corp |
Four TechWave rows now share the same normalized key. Remove Duplicates on the copy column keeps only the first.
Step 4: Deduplicate on the normalized column
Now use Remove Duplicates from the Filter & Sort group. Select the normalized column (company_name_copy) as the column to check for duplicates.
ExploreMyData generates:
DISTINCT ON (company_name_copy)
This keeps the first row for each unique normalized value and drops the rest. "TechWave Solutions", "TECHWAVE SOLUTIONS", and " Techwave Solutions " all share the same normalized key, so only the first one survives.
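The keep-first behavior can be sketched in plain Python: a dict keyed on the normalized name keeps the first row it sees. The sample rows and emails are hypothetical:

```python
rows = [
    {"company_name": "TechWave Solutions", "email": "sarah@techwave.io"},
    {"company_name": "TECHWAVE SOLUTIONS", "email": "ops@techwave.io"},
    {"company_name": " Techwave Solutions ", "email": "cfo@techwave.io"},
]

seen = {}
for row in rows:
    key = row["company_name"].strip().lower()  # the normalized copy column
    seen.setdefault(key, row)                  # first row per key wins

deduped = list(seen.values())
print(deduped)  # only the first TechWave row survives
```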
Step 5: Clean up
Once you're satisfied with the deduplication, drop the normalized column. You don't need it anymore.
Use Drop Columns to remove company_name_copy.
Your data now has one row per unique customer with the original formatting intact.
Going further with abbreviations
Case and whitespace normalization handles the majority of near-duplicates. But what about "Acme Corp." vs "Acme Corporation"? These need domain-specific replacements.
Before deduplicating, add Find & Replace steps on the normalized column:
- Replace "corp." with "corporation"
- Replace "inc." with "incorporated"
- Replace "ltd." with "limited"
- Replace "st." with "street" (for address data)
Each replacement adds a pipeline step. The SQL stacks up:
REPLACE(REPLACE(col, 'corp.', 'corporation'), 'inc.', 'incorporated')
It's not fancy fuzzy matching, but it catches the patterns that actually appear in your data. Look at the duplicates you have, identify the specific variation patterns, and add targeted replacements. Practical beats clever.
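Stacked replacements are just a loop. A sketch with a hypothetical abbreviation map, applied after trim and lowercase so matching on lowercase forms is safe:

```python
# Hypothetical domain-specific replacements; extend based on the patterns
# that actually appear in your data.
ABBREVIATIONS = {
    "corp.": "corporation",
    "inc.": "incorporated",
    "ltd.": "limited",
}

def expand_abbreviations(name: str) -> str:
    # Each replacement mirrors one Find & Replace pipeline step.
    for short, full in ABBREVIATIONS.items():
        name = name.replace(short, full)
    return name

print(expand_abbreviations("acme corp."))  # -> "acme corporation"
```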
Full pipeline for near-duplicate removal - 7 steps, each targeting a specific cleanup concern:
| Step | Operation | Purpose |
|---|---|---|
| 1 | Copy Column | Preserve original; create company_name_copy |
| 2 | Text Transform: trim | Remove leading/trailing whitespace |
| 3 | Text Transform: lowercase | Collapse case differences |
| 4 | Find & Replace: "  " → " " | Collapse internal double spaces |
| 5 | Find & Replace: "corp." → "corporation" | Normalize abbreviations |
| 6 | Remove Duplicates (on copy column) | Keep first row per normalized name |
| 7 | Drop Columns: company_name_copy | Remove the helper column |
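The whole pipeline can be sketched as one SQL pass. This uses SQLite in memory with hypothetical sample rows; SQLite lacks DISTINCT ON, so MIN(rowid) stands in for "keep the first row per normalized key":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (company_name TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("TechWave Solutions", 12400),
    ("techwave solutions", 8200),
    ("TECHWAVE SOLUTIONS", 3100),
    ("Acme Corp.", 7000),
    ("ACME CORPORATION", 2000),
])

# Steps 2-5 collapse into one normalization expression in the GROUP BY;
# MIN(rowid) picks the first row of each normalized group.
conn.execute("""
    CREATE TABLE deduped AS
    SELECT company_name, revenue
    FROM customers
    WHERE rowid IN (
        SELECT MIN(rowid)
        FROM customers
        GROUP BY REPLACE(TRIM(LOWER(company_name)), 'corp.', 'corporation')
    )
""")
survivors = [r[0] for r in conn.execute("SELECT company_name FROM deduped")]
print(survivors)  # one row per entity, original formatting intact
```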
When to check before deleting
If you're working with revenue or transaction data, don't just blindly deduplicate. Two orders from the same customer might genuinely be two orders. Sort by your normalized column first and visually scan the groups. Are these truly duplicates, or just repeat customers?
The normalized column is useful even if you don't deduplicate. You can group by it to see which "duplicates" exist, count how many variations each entity has, and make an informed decision about what to merge.
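A quick sketch of that inspection step - grouping raw names by their normalized form and flagging only the keys with multiple variations (sample names are hypothetical):

```python
from collections import Counter

names = [
    "TechWave Solutions", "techwave solutions", "TECHWAVE SOLUTIONS",
    "Acme Corp.", "Beta LLC",
]

# Count raw rows per normalized key instead of deleting anything.
counts = Counter(n.strip().lower() for n in names)
suspects = {key: n for key, n in counts.items() if n > 1}
print(suspects)  # {'techwave solutions': 3}
```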