Cleaning and Normalizing a CRM Export
You export your contacts from Salesforce, HubSpot, or whatever CRM your company uses. The CSV lands in your Downloads folder. You open it and immediately see the problems.
Sixty columns, most of which you don't need. "TechWave Solutions" appears as "TechWave Solutions", "TECHWAVE SOLUTIONS", "techwave solutions", and "TechWave Solutions " (with a trailing space). Contact names are in mixed case. Half the phone number fields are blank. And there are duplicate contacts - the same person entered twice because someone created a new record instead of updating the existing one.
This is the kind of mess that makes people open a spreadsheet and start manually fixing cells. Don't do that. Build a pipeline in ExploreMyData that fixes everything in the right order and can be reapplied next time you export.
| first_name | last_name | company_name | phone | |
|---|---|---|---|---|
| john | smith | John.Smith@TechWave.com | TECHWAVE SOLUTIONS | 415-555-0182 |
| LISA | NGUYEN | LNGUYEN@TECHWAVE.COM | TechWave Solutions | |
| Marcus | Webb | marcus.webb@acmecorp.com | Acme Corp Inc. | 312-555-0094 |
| john | smith | john.smith@techwave.com | techwave solutions | 415-555-0182 |
Inconsistent casing, trailing spaces, duplicate contact (rows 1 and 4), and a missing phone on row 2 - typical CRM export problems.
Step 1: Strip down to the columns you need
Start with the biggest win: getting rid of noise. Select Select Columns from the Columns group. In a typical CRM export, you might keep 10-15 columns out of 60:
contact_idfirst_name,last_nameemailphonecompany_namejob_titlecity,statecreated_date,last_activity_date
Uncheck everything else. The first pipeline card appears, and your table goes from impossibly wide to something you can actually read.
Do this first. Every subsequent operation runs faster on 10 columns than on 60, and you won't accidentally clean data in a column you're about to throw away.
Step 2: Trim whitespace from all text columns
Trailing spaces are invisible enemies. "TechWave Solutions" and "TechWave Solutions " look identical on screen but are treated as different values by every grouping and matching operation.
Select Text Transform from the Transform group. Choose each text column and apply "trim". This strips leading and trailing whitespace. You'll want to do this for first_name, last_name, email, company_name, city, and job_title.
Each column gets its own pipeline card. Under the hood, ExploreMyData uses TRIM(column_name). Six columns, six cards - but each one runs instantly.
Step 3: Capitalize names consistently
Your contacts have "john smith", "JOHN SMITH", and "John Smith". They're all the same person, but your CRM doesn't enforce formatting.
Use Text Transform again. Select first_name and apply "capitalize" (title case). Do the same for last_name. This converts everything to "John Smith" format.
For email, apply "lowercase" instead - emails should always be lowercase for matching purposes.
Pipeline sidebar - steps so far
- Select Columns - kept 10 of 62 columns
- Text Transform - trim
first_name - Text Transform - trim
last_name - Text Transform - trim
email - Text Transform - trim
company_name - Text Transform - capitalize
first_name - Text Transform - capitalize
last_name - Text Transform - lowercase
email
Each operation is a separate card - visible, editable, and applied in order.
Step 4: Normalize company name variations
This is the hardest problem in CRM data. "TechWave Solutions", "TECHWAVE SOLUTIONS", "Techwave", "TechWave Solutions Inc.", "TechWave Solutions, Inc" - all the same company.
Start by applying Text Transform with "capitalize" on company_name. That collapses the case variations into one form.
For the remaining variants (with and without "Inc.", abbreviations, etc.), use Find & Replace. Select company_name, find "Techwave Solutions Inc." and replace with "Techwave Solutions". Add another Find & Replace for "Techwave Solutions, Inc" to "Techwave Solutions".
Yes, this is manual. But it's documented. Each replacement is a pipeline card with the exact find/replace pair visible. When you export again in three months and there's a new variation, you add one more card. The old replacements still work.
For the most common issue - trailing suffixes like "Inc.", "LLC", "Ltd" - a single Find & Replace per suffix handles hundreds of companies at once.
Step 5: Fill missing phone numbers
Some contacts have phone numbers; some don't. If multiple contacts exist at the same company, there might be a shared office number on some records but not others.
Select Fill Missing from the Data group. Choose phone as the column. For many cases, "forward fill" works well - it uses the previous row's value when the current row is NULL.
For this to work well, sort by company_name first so contacts at the same company are grouped together. Then forward fill propagates a known phone number to the blank rows below it within the same company block.
If you'd rather mark unknowns explicitly, use "literal" mode and fill with "No phone on file". Either way, you've eliminated the NULLs.
Step 6: Remove duplicate contacts
Finally, deduplicate. Select Remove Duplicates and choose email as the column to check. Since email addresses are unique per person (and you already lowercased them), this catches duplicates even when names are slightly different.
This generates SELECT DISTINCT ON (email) *. The first row for each email address is kept; the rest are dropped. If you had 8,000 contacts and 500 were duplicates, you're down to 7,500 clean rows.
The pipeline you built
Scroll through the pipeline sidebar. You'll see something like a dozen cards:
- Select Columns (trim to essentials)
- Trim whitespace (multiple columns)
- Capitalize names
- Lowercase emails
- Capitalize company names
- Find & Replace company variations
- Sort by company
- Forward fill phone numbers
- Remove duplicates by email
Every step is visible, editable, and deletable. If you realize you need to also normalize job titles ("VP of Sales" vs "Vice President of Sales"), you insert a new card wherever it makes sense. The rest of the pipeline adjusts.
The real value is next quarter. Export from the CRM again, load the file, and the same pipeline cleans it the same way. Five minutes instead of an afternoon.
| first_name | last_name | company_name | phone | |
|---|---|---|---|---|
| John | Smith | john.smith@techwave.com | Techwave Solutions | 415-555-0182 |
| Lisa | Nguyen | lnguyen@techwave.com | Techwave Solutions | 415-555-0182 |
| Marcus | Webb | marcus.webb@acmecorp.com | Acme Corp | 312-555-0094 |
Consistent casing, trimmed whitespace, normalized company names, phone filled via forward fill, and duplicate removed - 7,500 clean rows from 8,000.