← All posts
5 min read

Fixing Messy Phone Number Formats in Your Data

You're trying to deduplicate a customer list. Two records have the same name and address, but different phone numbers: (555) 123-4567 and 5551234567. Those are the same number. Your deduplication doesn't know that.

Phone numbers are one of the worst fields to work with in raw data. People enter them however they feel like it. Systems export them in whatever format was configured years ago. Merge two data sources and you get chaos. Here's what a typical phone column looks like:

  • (555) 123-4567
  • 555-123-4567
  • 5551234567
  • +1-555-123-4567
  • 555.123.4567
  • 1 (555) 123-4567
  • +15551234567
  • 555 123 4567

Eight formats. Same number. You can't match records, you can't count unique phone numbers, and you can't build any report that groups by phone. The fix is straightforward: strip every number down to just digits, then rebuild it in one consistent format.

Here's how to do it in ExploreMyData.

Step 1: Strip out all the formatting characters

The goal is to turn every phone number into a plain string of digits. That means removing parentheses, dashes, dots, spaces, and the plus sign. Use Find & Replace from the Transform group.

Select your phone column and run these replacements one at a time:

  1. Find (, replace with nothing (empty)
  2. Find ), replace with nothing
  3. Find -, replace with nothing
  4. Find ., replace with nothing
  5. Find (space), replace with nothing
  6. Find +, replace with nothing

Each step adds a REPLACE() call to the pipeline. After all six, your SQL looks like a chain:

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '')

Not pretty SQL, but it works. Every phone number in your column is now a string of digits. (555) 123-4567 became 5551234567. +1-555-123-4567 became 15551234567.

Phone column values after each Find & Replace step strips one character type:

Original After removing ( ) - . space +
(555) 123-45675551234567
555-123-45675551234567
555.123.45675551234567
+1-555-123-456715551234567
1 (555) 123-456715551234567
555 123 45675551234567

10-digit numbers are consistent; 11-digit numbers still have a leading country code "1" to handle in step 2.

Step 2: Handle the country code

After stripping formatting, some numbers are 10 digits (5551234567) and some are 11 (15551234567) because they had a leading "1" country code. You need to normalize this.

If all your data is US/Canada numbers, the simplest approach: use Extract Text from the Transform group. Select the phone column and extract the last 10 characters. This drops the leading "1" from 11-digit numbers and leaves 10-digit numbers unchanged.

The generated SQL uses RIGHT(phone, 10) or equivalently SUBSTR(phone, -10) to grab the last 10 characters.

Now every phone number is exactly 10 digits. 5551234567. Consistent. Matchable. Dedupable.

Step 3: Reassemble in a readable format

A 10-digit string works for matching, but it's not great for humans reading the data. If you want a formatted output like (555) 123-4567, you can rebuild it from the parts.

Use Extract Text to create three new columns from the phone column:

  • area_code: characters 1-3 - SUBSTR(phone, 1, 3)
  • prefix: characters 4-6 - SUBSTR(phone, 4, 3)
  • line: characters 7-10 - SUBSTR(phone, 7, 4)

Then use Combine Columns from the Columns group to join them back together. Select the three columns, set the separator as needed, and name the output column phone_formatted.

Or, for full control, the generated SQL for the combine step uses CONCAT():

CONCAT('(', area_code, ') ', prefix, '-', line)

The result: (555) 123-4567. Every row. Every time.

Three extracted columns are combined into a formatted phone number:

phone (digits) area_code prefix line phone_formatted
55512345675551234567(555) 123-4567
80055598768005559876(800) 555-9876
41555500114155550011(415) 555-0011

When to just keep the digits

Honestly? For most analysis, stop after step 2. The 10-digit string is the best format for matching, joining, and deduplication. It's unambiguous. Two records with the same 10-digit string are the same phone number, period.

Only do step 3 if you're exporting the data for human consumption - a mailing list, a customer directory, a report that someone will actually read. For analytics work, digits-only is cleaner.

What about international numbers?

If your data has numbers from multiple countries, the approach changes. You can't just take the last 10 digits because country codes vary in length (1 for US, 44 for UK, 91 for India). In that case, keep the full digit string including country code and standardize on the E.164 format: +15551234567. Strip all formatting characters (step 1), then prepend a "+" using Find & Replace or Combine Columns.

Final phone column - all 8 original formats normalized to the same output:

customer_name phone_original phone_formatted
Jordan Lee(555) 123-4567(555) 123-4567
Maria Santos+1-555-123-4567(555) 123-4567
David Osei555.123.4567(555) 123-4567
Priya Nair5551234567(555) 123-4567

Start cleaning your data →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData