← All posts
5 min read

How to Standardize Inconsistent Company Names in a CSV

You exported a vendor list from your CRM. You want to know how much you're spending per company. Simple enough. Then you run a GROUP BY and get this:

  • IBM - 23 transactions
  • I.B.M. - 8 transactions
  • International Business Machines - 4 transactions
  • Microsoft - 31 transactions
  • microsoft - 12 transactions
  • MICROSOFT - 3 transactions
  • Microsoft Corp - 7 transactions

That's 7 rows where there should be 2. Your spend-per-vendor chart is wrong. Your top-10 list is wrong. Every downstream calculation is wrong because the same company is split across multiple spellings.

This is one of the most common data cleaning problems. The data was entered by different people (or different systems) over months or years, and nobody enforced a standard format. Now you're stuck cleaning it up before you can do anything useful.

Here's how to fix it in ExploreMyData, step by step.

Step 1: Trim whitespace and normalize case

Before you start mapping variations to canonical names, knock out the easy wins. A surprising number of "duplicate" company names are just case differences or trailing spaces. "Microsoft " with a trailing space is not the same as "Microsoft" to a database.

Open the toolbar and select Text Transform from the Transform group. Choose your company name column. Select trim and apply. Then open Text Transform again, same column, and select lowercase.

Under the hood, this generates:

TRIM(LOWER(vendor_name))

Two pipeline steps, and you've already collapsed Microsoft, microsoft, and MICROSOFT into one value. That knocks the Microsoft variations from 4 rows down to 2.

Before and after applying trim + lowercase to the vendor_name column:

vendor_name (before) vendor_name (after)
Microsoftmicrosoft
microsoftmicrosoft
MICROSOFTmicrosoft
Microsoft Corpmicrosoft corp
IBMibm
I.B.M.i.b.m.

Case differences are resolved; distinct variations like "corp" and "i.b.m." remain for the next steps.

Step 2: Use Find & Replace for known variations

After normalizing case, you still have microsoft corp as a separate entry from microsoft. And i.b.m. is still separate from ibm.

Open Find & Replace from the Transform group. Select the vendor name column, enter microsoft corp in Find and microsoft in Replace. Apply.

Do the same for the IBM variations. Replace i.b.m. with ibm, and international business machines with ibm.

Each step generates a REPLACE() call:

REPLACE(vendor_name, 'microsoft corp', 'microsoft')

This works well when you have a handful of known variations. But if you've got dozens of companies with multiple spellings each, doing them one at a time gets tedious. That's where Bulk Replace comes in.

Step 3: Bulk Replace for large-scale mapping

When you need to map many values at once, Bulk Replace is the better tool. It lets you define a set of source values and a single target value in one operation.

Open Bulk Replace from the Transform group. Select your vendor name column. For the first group, set the target to ibm and add i.b.m. and international business machines as source values.

Under the hood, ExploreMyData generates a CASE WHEN expression:

CASE WHEN vendor_name IN ('i.b.m.', 'international business machines') THEN 'ibm' ELSE vendor_name END

You can add as many groups as you need in a single operation. Map all the Salesforce variations (salesforce.com, sfdc, salesforce inc) to salesforce. Map all the Amazon variations. Do it all in one step.

Bulk Replace mapping multiple variations to one canonical name per company:

Source values Target value
i.b.m., international business machinesibm
microsoft corp, msftmicrosoft
salesforce.com, sfdc, salesforce incsalesforce
amazon web services, awsamazon

One operation handles all groups at once, generating a single CASE WHEN expression.

How to spot the variations in the first place

The hardest part isn't fixing the names. It's finding all the variations. Here's a practical approach: sort the vendor name column alphabetically. Similar names will cluster together, and you'll spot patterns fast. ibm and i.b.m. end up right next to each other.

Another approach: run a Group By on the vendor name column to see every unique value and its count. Companies with suspiciously low counts are often misspellings or alternate names of a higher-count entry. If "microsoft" has 31 rows and "msft" has 2, that's a strong hint.

A strategy that scales

For a file with 10-20 companies, the three-step approach above takes about five minutes. For files with hundreds of unique company names, you'll want a strategy:

  1. Trim and lowercase first. Always. This alone can cut your unique count by 20-30%.
  2. Strip common suffixes. Use Find & Replace to remove " inc", " corp", " llc", " ltd" from the end of names. That collapses "microsoft corp" into "microsoft" without needing to know every company in the file.
  3. Bulk Replace the rest. After the automated cleanup, group what's left and map manually.

The full pipeline after applying all three steps - final vendor_name values are clean and consistent:

Step Operation Result for "I.B.M." Result for "Microsoft Corp"
1Text Transform: trim + lowercasei.b.m.microsoft corp
2Find & Replace: remove " inc", " corp", " ltd"i.b.m.microsoft
3Bulk Replace: map remaining variationsibmmicrosoft

Every step is in the pipeline

Each operation you apply shows up as a step in the pipeline panel. You can see the exact SQL, reorder steps, or delete one if you made a mistake. The pipeline rebuilds automatically.

When you get next month's vendor export, load it into ExploreMyData and the same pipeline structure works again. The company names will be just as messy. But now you know the pattern to clean them.

Start cleaning your data →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData