Finding Near-Duplicate Rows in Your Data
You pull a customer report and notice something off. The total customer count is 4,200, but your billing system says 3,800. You start scrolling and there they are: "Sarah Chen" and "sarah chen" and " Sarah Chen". Three rows for one person. Her revenue is split across all three.
Exact duplicates are easy - run Remove Duplicates and you're done. Near-duplicates are trickier. The rows aren't identical; they just look the same to a human but not to a computer. A basic DISTINCT won't catch them.
The usual suspects
Near-duplicates show up in predictable ways:
- Case differences: "TechWave Solutions" vs "Techwave Solutions" vs "TECHWAVE SOLUTIONS"
- Extra whitespace: " John Smith " (leading/trailing spaces) vs "John  Smith" (double internal space) vs "John Smith"
- Trailing characters: "Acme Corp." vs "Acme Corp" vs "Acme Corp, Inc."
- Abbreviations: "St." vs "Street", "NY" vs "New York"
The first two categories - case and whitespace - account for probably 80% of near-duplicate issues. And they're the easiest to fix.
Near-duplicate rows for the same company - each variation is treated as a different entity by the database:
| company_name (raw) | revenue | contact_email |
|---|---|---|
| TechWave Solutions | $12,400 | sarah@techwave.io |
| techwave solutions | $8,200 | mike@techwave.io |
| TECHWAVE SOLUTIONS | $3,100 | ops@techwave.io |
| TechWave Solutions  | $5,500 | cfo@techwave.io |

A GROUP BY on company_name produces four separate groups (the fourth row hides a trailing space) whose revenues together sum to $29,200 - instead of one row reporting the $29,200 total.
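To see the split concretely, here's a sketch using SQLite in memory. The sample rows mirror the table above, with an assumed trailing space on the fourth name so that all four land in distinct groups:

```python
import sqlite3

# Hypothetical reproduction of the TechWave rows; the trailing space on the
# fourth name is an assumption to make it a distinct value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (company_name TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("TechWave Solutions", 12400),
        ("techwave solutions", 8200),
        ("TECHWAVE SOLUTIONS", 3100),
        ("TechWave Solutions ", 5500),  # note the trailing space
    ],
)

# GROUP BY on the raw column treats each variation as its own entity.
rows = conn.execute(
    "SELECT company_name, SUM(revenue) FROM customers GROUP BY company_name"
).fetchall()
print(rows)  # four groups instead of one
```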
The strategy: normalize, then deduplicate
You can't deduplicate on the original column because the values aren't equal. So create a normalized copy of the column, clean it up, and deduplicate on that instead.
Here's the approach in ExploreMyData:
Step 1: Copy the column
Use Copy Column from the Columns group to duplicate the column you want to match on. If your column is company_name, you'll get a new column called company_name_copy.
This copy is what we'll normalize. The original stays untouched so you keep the real name.
Step 2: Trim and lowercase
Apply Text Transform to the copy column. First, select "trim" to strip leading and trailing whitespace. Then apply it again with "lowercase" to normalize case.
After these two steps, all of these become the same value:
- "TechWave Solutions" → "techwave solutions"
- "TECHWAVE SOLUTIONS" → "techwave solutions"
- " Techwave Solutions " → "techwave solutions"
The SQL ExploreMyData generates:
TRIM(LOWER(company_name_copy))
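The trim-plus-lowercase normalization can be sanity-checked outside the tool. This sketch runs the same TRIM(LOWER(...)) expression through SQLite; the variant strings are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
variants = ["TechWave Solutions", "TECHWAVE SOLUTIONS", "  Techwave Solutions  "]

# Apply the same expression ExploreMyData generates, one value at a time.
normalized = [
    conn.execute("SELECT TRIM(LOWER(?))", (v,)).fetchone()[0] for v in variants
]
print(normalized)  # all three collapse to the same key
```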
Step 3: Remove extra internal spaces
Trim only handles leading and trailing whitespace. "John  Smith" (two spaces) is still different from "John Smith" (one space) after trimming. Use Find & Replace on the normalized column: find "  " (two spaces) and replace with " " (one space).
Under the hood: REPLACE(company_name_copy, '  ', ' ')
If your data might have three or more consecutive spaces, apply this step twice. The first pass turns triple spaces into doubles, the second pass collapses those to singles.
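The two-pass collapse is easy to verify with a minimal Python sketch - Python's str.replace behaves like SQL REPLACE here, scanning left to right without overlapping matches:

```python
def collapse_spaces(s: str, passes: int = 2) -> str:
    # Each pass mirrors one Find & Replace step: "  " (two spaces) -> " " (one space).
    for _ in range(passes):
        s = s.replace("  ", " ")
    return s

# Three consecutive spaces need both passes: the first turns them into two,
# the second collapses those into one.
print(collapse_spaces("John   Smith"))  # -> "John Smith"
```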
Both columns side by side - original preserved, normalized copy used for matching:
| company_name (original) | company_name_copy (normalized) |
|---|---|
| TechWave Solutions | techwave solutions |
| techwave solutions | techwave solutions |
| TECHWAVE SOLUTIONS | techwave solutions |
| TechWave Solutions | techwave solutions |
| Acme Corp. | acme corp. |
| ACME CORP | acme corp |
Four TechWave rows now share the same normalized key. Remove Duplicates on the copy column keeps only the first.
Step 4: Deduplicate on the normalized column
Now use Remove Duplicates from the Filter & Sort group. Select the normalized column (company_name_copy) as the column to check for duplicates.
ExploreMyData generates:
DISTINCT ON (company_name_copy)
This keeps the first row for each unique normalized value and drops the rest. "TechWave Solutions", "TECHWAVE SOLUTIONS", and " Techwave Solutions " all share the same normalized key, so only the first one survives.
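The keep-first behavior can be sketched in plain Python: a dict keyed on the normalized name keeps the first row it sees. The sample rows and emails are hypothetical:

```python
rows = [
    {"company_name": "TechWave Solutions", "email": "sarah@techwave.io"},
    {"company_name": "TECHWAVE SOLUTIONS", "email": "ops@techwave.io"},
    {"company_name": " Techwave Solutions ", "email": "cfo@techwave.io"},
]

seen = {}
for row in rows:
    key = row["company_name"].strip().lower()  # the normalized copy column
    seen.setdefault(key, row)                  # first row per key wins

deduped = list(seen.values())
print(deduped)  # only the first TechWave row survives
```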
Step 5: Clean up
Once you're satisfied with the deduplication, drop the normalized column. You don't need it anymore.
Use Drop Columns to remove company_name_copy.
Your data now has one row per unique customer with the original formatting intact.
Going further with abbreviations
Case and whitespace normalization handles the majority of near-duplicates. But what about "Acme Corp." vs "Acme Corporation"? These need domain-specific replacements.
Before deduplicating, add Find & Replace steps on the normalized column:
- Replace "corp." with "corporation"
- Replace "inc." with "incorporated"
- Replace "ltd." with "limited"
- Replace "st." with "street" (for address data)
Each replacement adds a pipeline step. The SQL stacks up:
REPLACE(REPLACE(col, 'corp.', 'corporation'), 'inc.', 'incorporated')
It's not fancy fuzzy matching, but it catches the patterns that actually appear in your data. Look at the duplicates you have, identify the specific variation patterns, and add targeted replacements. Practical beats clever.
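Stacked replacements are just a loop. A sketch with a hypothetical abbreviation map, applied after trim and lowercase so matching on lowercase forms is safe:

```python
# Hypothetical domain-specific replacements; extend based on the patterns
# that actually appear in your data.
ABBREVIATIONS = {
    "corp.": "corporation",
    "inc.": "incorporated",
    "ltd.": "limited",
}

def expand_abbreviations(name: str) -> str:
    # Each replacement mirrors one Find & Replace pipeline step.
    for short, full in ABBREVIATIONS.items():
        name = name.replace(short, full)
    return name

print(expand_abbreviations("acme corp."))  # -> "acme corporation"
```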
Full pipeline for near-duplicate removal - 7 steps, each targeting a specific cleanup concern:
| Step | Operation | Purpose |
|---|---|---|
| 1 | Copy Column | Preserve original; create company_name_copy |
| 2 | Text Transform: trim | Remove leading/trailing whitespace |
| 3 | Text Transform: lowercase | Collapse case differences |
| 4 | Find & Replace: "  " → " " | Collapse internal double spaces |
| 5 | Find & Replace: "corp." → "corporation" | Normalize abbreviations |
| 6 | Remove Duplicates (on copy column) | Keep first row per normalized name |
| 7 | Drop Columns: company_name_copy | Remove the helper column |
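The whole pipeline can be sketched as one SQL pass. This uses SQLite in memory with hypothetical sample rows; SQLite lacks DISTINCT ON, so MIN(rowid) stands in for "keep the first row per normalized key":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (company_name TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("TechWave Solutions", 12400),
    ("techwave solutions", 8200),
    ("TECHWAVE SOLUTIONS", 3100),
    ("Acme Corp.", 7000),
    ("ACME CORPORATION", 2000),
])

# Steps 2-5 collapse into one normalization expression in the GROUP BY;
# MIN(rowid) picks the first row of each normalized group.
conn.execute("""
    CREATE TABLE deduped AS
    SELECT company_name, revenue
    FROM customers
    WHERE rowid IN (
        SELECT MIN(rowid)
        FROM customers
        GROUP BY REPLACE(TRIM(LOWER(company_name)), 'corp.', 'corporation')
    )
""")
survivors = [r[0] for r in conn.execute("SELECT company_name FROM deduped")]
print(survivors)  # one row per entity, original formatting intact
```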
When to check before deleting
If you're working with revenue or transaction data, don't just blindly deduplicate. Two orders from the same customer might genuinely be two orders. Sort by your normalized column first and visually scan the groups. Are these truly duplicates, or just repeat customers?
The normalized column is useful even if you don't deduplicate. You can group by it to see which "duplicates" exist, count how many variations each entity has, and make an informed decision about what to merge.
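A quick sketch of that inspection step - grouping raw names by their normalized form and flagging only the keys with multiple variations (sample names are hypothetical):

```python
from collections import Counter

names = [
    "TechWave Solutions", "techwave solutions", "TECHWAVE SOLUTIONS",
    "Acme Corp.", "Beta LLC",
]

# Count raw rows per normalized key instead of deleting anything.
counts = Counter(n.strip().lower() for n in names)
suspects = {key: n for key, n in counts.items() if n > 1}
print(suspects)  # {'techwave solutions': 3}
```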