
Finding Near-Duplicate Rows in Your Data

You pull a customer report and notice something off. The total customer count is 4,200, but your billing system says 3,800. You start scrolling and there they are: "Sarah Chen" and "sarah chen" and " Sarah Chen". Three rows for one person. Her revenue is split across all three.

Exact duplicates are easy - run Remove Duplicates and you're done. Near-duplicates are trickier. The rows aren't identical. They just look the same to a human but not to a computer. A basic DISTINCT won't catch them.

The usual suspects

Near-duplicates show up in predictable ways:

  • Case differences: "TechWave Solutions" vs "Techwave Solutions" vs "TECHWAVE SOLUTIONS"
  • Extra whitespace: " John Smith " vs "John  Smith" (two internal spaces) vs "John Smith"
  • Trailing characters: "Acme Corp." vs "Acme Corp" vs "Acme Corp, Inc."
  • Abbreviations: "St." vs "Street", "NY" vs "New York"

The first two categories - case and whitespace - account for probably 80% of near-duplicate issues. And they're the easiest to fix.

Near-duplicate rows for the same company - each variation is treated as a different entity by the database:

company_name (raw)       | revenue | contact_email
TechWave Solutions       | $12,400 | sarah@techwave.io
techwave solutions       | $8,200  | mike@techwave.io
TECHWAVE SOLUTIONS       | $3,100  | ops@techwave.io
  TechWave Solutions     | $5,500  | cfo@techwave.io

A GROUP BY on company_name produces four separate rows - the $29,200 total is split across them instead of appearing as a single row.
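To see the split concretely, here's a minimal sketch of the same table using Python's built-in sqlite3 (the in-memory database and table name are illustrative, not something ExploreMyData produces):

```python
import sqlite3

# Recreate the four near-duplicate rows from the example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (company_name TEXT, revenue INTEGER)")
con.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [
        ("TechWave Solutions", 12400),
        ("techwave solutions", 8200),
        ("TECHWAVE SOLUTIONS", 3100),
        ("  TechWave Solutions  ", 5500),
    ],
)

# A naive GROUP BY treats each spelling as its own company.
rows = con.execute(
    "SELECT company_name, SUM(revenue) FROM companies GROUP BY company_name"
).fetchall()
print(len(rows))                 # 4 - one group per spelling
print(sum(r[1] for r in rows))   # 29200, but split across four rows
```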

The strategy: normalize, then deduplicate

You can't deduplicate on the original column because the values aren't equal. So create a normalized copy of the column, clean it up, and deduplicate on that instead.

Here's the approach in ExploreMyData:

Step 1: Copy the column

Use Copy Column from the Columns group to duplicate the column you want to match on. If your column is company_name, you'll get a new column called company_name_copy. This copy is what we'll normalize. The original stays untouched so you keep the real name.

Step 2: Trim and lowercase

Apply Text Transform to the copy column. First, select "trim" to strip leading and trailing whitespace. Then apply it again with "lowercase" to normalize case.

After these two steps, all of these become the same value:

  • "TechWave Solutions" → "techwave solutions"
  • "TECHWAVE SOLUTIONS" → "techwave solutions"
  • " Techwave Solutions " → "techwave solutions"

The SQL ExploreMyData generates: TRIM(LOWER(company_name_copy))
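Outside the tool, the same two-step normalization can be sketched in plain Python - str.strip and str.lower play the roles of TRIM and LOWER:

```python
variants = ["TechWave Solutions", "TECHWAVE SOLUTIONS", " Techwave Solutions "]

# Trim then lowercase; all three spellings collapse to one key.
normalized = {v.strip().lower() for v in variants}
print(normalized)  # {'techwave solutions'}
```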

Step 3: Remove extra internal spaces

Trim only handles leading and trailing whitespace. "John  Smith" (two internal spaces) is still different from "John Smith" (one space) after trimming. Use Find & Replace on the normalized column: find two consecutive spaces and replace them with a single space.

Under the hood: REPLACE(company_name_copy, '  ', ' ')

If your data might have three or more consecutive spaces, apply this step twice. The first pass turns triple spaces into doubles, the second pass collapses those to singles.
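The two-pass idea can be sketched with a small hypothetical helper (collapse_spaces is not an ExploreMyData feature, just an illustration of the repeated replace):

```python
def collapse_spaces(value: str, passes: int = 2) -> str:
    # Each pass mirrors REPLACE(col, '  ', ' '): every run of two
    # spaces becomes one. Two passes handle up to three in a row.
    for _ in range(passes):
        value = value.replace("  ", " ")
    return value

print(collapse_spaces("John  Smith"))   # 'John Smith'
print(collapse_spaces("John   Smith"))  # 'John Smith' - triple space needs both passes
```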

Both columns side by side - original preserved, normalized copy used for matching:

company_name (original)  | company_name_copy (normalized)
TechWave Solutions       | techwave solutions
techwave solutions       | techwave solutions
TECHWAVE SOLUTIONS       | techwave solutions
  TechWave Solutions     | techwave solutions
Acme Corp.               | acme corp.
ACME CORP                | acme corp

Four TechWave rows now share the same normalized key. Remove Duplicates on the copy column keeps only the first.

Step 4: Deduplicate on the normalized column

Now use Remove Duplicates from the Filter & Sort group. Select the normalized column (company_name_copy) as the column to check for duplicates.

ExploreMyData generates: DISTINCT ON (company_name_copy)

This keeps the first row for each unique normalized value and drops the rest. "TechWave Solutions", "TECHWAVE SOLUTIONS", and " Techwave Solutions " all share the same normalized key, so only the first one survives.
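The keep-first behavior can be sketched in Python - a dict preserves insertion order, so the first spelling seen for each normalized key survives, much like DISTINCT ON with the table's original row order:

```python
names = [
    "TechWave Solutions",
    "TECHWAVE SOLUTIONS",
    " Techwave Solutions ",
    "Acme Corp.",
]

# Map each normalized key to the first original spelling we saw.
first_seen: dict[str, str] = {}
for name in names:
    key = name.strip().lower()
    first_seen.setdefault(key, name)

print(list(first_seen.values()))  # ['TechWave Solutions', 'Acme Corp.']
```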

Step 5: Clean up

Once you're satisfied with the deduplication, drop the normalized column. You don't need it anymore. Use Drop Columns to remove company_name_copy. Your data now has one row per unique customer with the original formatting intact.

Going further with abbreviations

Case and whitespace normalization handles the majority of near-duplicates. But what about "Acme Corp." vs "Acme Corporation"? These need domain-specific replacements.

Before deduplicating, add Find & Replace steps on the normalized column:

  • Replace "corp." with "corporation"
  • Replace "inc." with "incorporated"
  • Replace "ltd." with "limited"
  • Replace "st." with "street" (for address data)

Each replacement adds a pipeline step. The SQL stacks up: REPLACE(REPLACE(col, 'corp.', 'corporation'), 'inc.', 'incorporated')
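The stacked replaces behave like a loop over a replacement map. This sketch (expand_abbreviations is a hypothetical name, not a tool feature) mirrors the nested REPLACE calls:

```python
ABBREVIATIONS = {
    "corp.": "corporation",
    "inc.": "incorporated",
    "ltd.": "limited",
}

def expand_abbreviations(name: str) -> str:
    # Equivalent to one nested REPLACE(...) per entry:
    # REPLACE(REPLACE(col, 'corp.', 'corporation'), 'inc.', 'incorporated'), ...
    for short, full in ABBREVIATIONS.items():
        name = name.replace(short, full)
    return name

print(expand_abbreviations("acme corp."))  # 'acme corporation'
```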

It's not fancy fuzzy matching, but it catches the patterns that actually appear in your data. Look at the duplicates you have, identify the specific variation patterns, and add targeted replacements. Practical beats clever.

Full pipeline for near-duplicate removal - 6 steps, each targeting a specific cleanup concern:

Step | Operation                               | Purpose
1    | Copy Column                             | Preserve original; create company_name_copy
2    | Text Transform: trim                    | Remove leading/trailing whitespace
3    | Text Transform: lowercase               | Collapse case differences
4    | Find & Replace: "corp." → "corporation" | Normalize abbreviations
5    | Remove Duplicates (on copy column)      | Keep first row per normalized name
6    | Drop Columns: company_name_copy         | Remove the helper column

When to check before deleting

If you're working with revenue or transaction data, don't just blindly deduplicate. Two orders from the same customer might genuinely be two orders. Sort by your normalized column first and visually scan the groups. Are these truly duplicates, or just repeat customers?

The normalized column is useful even if you don't deduplicate. You can group by it to see which "duplicates" exist, count how many variations each entity has, and make an informed decision about what to merge.
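That inspection can be sketched in Python too - count rows per normalized key and surface only the keys that occur more than once:

```python
from collections import Counter

names = ["TechWave Solutions", "TECHWAVE SOLUTIONS", "Acme Corp."]

# How many rows share each normalized key?
counts = Counter(n.strip().lower() for n in names)

# Keys with count > 1 are the candidate duplicate groups worth eyeballing.
suspects = {key: n for key, n in counts.items() if n > 1}
print(suspects)  # {'techwave solutions': 2}
```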


Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData