December 25, 2025 by Arif Aslam 5 min read

Fixing Messy Phone Number Formats in Your Data

You're trying to deduplicate a customer list. Two records have the same name and address, but different phone numbers: (555) 123-4567 and 5551234567. Those are the same number. Your deduplication doesn't know that.

Phone numbers are one of the worst fields to work with in raw data. People enter them however they feel like it. Systems export them in whatever format was configured years ago. Merge two data sources and you get chaos. Here's what a typical phone column looks like:

(555) 123-4567
555-123-4567
5551234567
+1-555-123-4567
555.123.4567
1 (555) 123-4567
+15551234567
555 123 4567

Eight formats. Same number. You can't match records, you can't count unique phone numbers, and you can't build any report that groups by phone. The fix is straightforward: strip every number down to just digits, then rebuild it in one consistent format.

Here's how to do it in ExploreMyData.

Step 1: Strip out all the formatting characters

The goal is to turn every phone number into a plain string of digits. That means removing parentheses, dashes, dots, spaces, and the plus sign. Use Find & Replace from the Transform group.

Select your phone column and run these replacements one at a time:

Find (, replace with nothing (empty)
Find ), replace with nothing
Find -, replace with nothing
Find ., replace with nothing
Find (space), replace with nothing
Find +, replace with nothing

Each step adds a REPLACE() call to the pipeline. After all six, your SQL looks like a chain:

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '')

Not pretty SQL, but it works. Every phone number in your column is now a string of digits. (555) 123-4567 became 5551234567. +1-555-123-4567 became 15551234567.

Phone column values after each Find & Replace step strips one character type:

Original	After removing ( ) - . space +
(555) 123-4567	5551234567
555-123-4567	5551234567
555.123.4567	5551234567
+1-555-123-4567	15551234567
1 (555) 123-4567	15551234567
555 123 4567	5551234567

10-digit numbers are consistent; 11-digit numbers still have a leading country code "1" to handle in step 2.

Step 2: Handle the country code

After stripping formatting, some numbers are 10 digits (5551234567) and some are 11 (15551234567) because they had a leading "1" country code. You need to normalize this.

If all your data is US/Canada numbers, the simplest approach: use Extract Text from the Transform group. Select the phone column and extract the last 10 characters. This drops the leading "1" from 11-digit numbers and leaves 10-digit numbers unchanged.

The generated SQL uses RIGHT(phone, 10) or equivalently SUBSTR(phone, -10) to grab the last 10 characters.

Now every phone number is exactly 10 digits. 5551234567. Consistent. Matchable. Dedupable.

Step 3: Reassemble in a readable format

A 10-digit string works for matching, but it's not great for humans reading the data. If you want a formatted output like (555) 123-4567, you can rebuild it from the parts.

Use Extract Text to create three new columns from the phone column:

area_code: characters 1-3 - SUBSTR(phone, 1, 3)
prefix: characters 4-6 - SUBSTR(phone, 4, 3)
line: characters 7-10 - SUBSTR(phone, 7, 4)

Then use Combine Columns from the Columns group to join them back together. Select the three columns, set the separator as needed, and name the output column phone_formatted.

Or, for full control, the generated SQL for the combine step uses CONCAT():

CONCAT('(', area_code, ') ', prefix, '-', line)

The result: (555) 123-4567. Every row. Every time.

Three extracted columns are combined into a formatted phone number:

phone (digits)	area_code	prefix	line	phone_formatted
5551234567	555	123	4567	(555) 123-4567
8005559876	800	555	9876	(800) 555-9876
4155550011	415	555	0011	(415) 555-0011

When to just keep the digits

Honestly? For most analysis, stop after step 2. The 10-digit string is the best format for matching, joining, and deduplication. It's unambiguous. Two records with the same 10-digit string are the same phone number, period.

Only do step 3 if you're exporting the data for human consumption - a mailing list, a customer directory, a report that someone will actually read. For analytics work, digits-only is cleaner.

What about international numbers?

If your data has numbers from multiple countries, the approach changes. You can't just take the last 10 digits because country codes vary in length (1 for US, 44 for UK, 91 for India). In that case, keep the full digit string including country code and standardize on the E.164 format: +15551234567. Strip all formatting characters (step 1), then prepend a "+" using Find & Replace or Combine Columns.

Final phone column - all 8 original formats normalized to the same output:

customer_name	phone_original	phone_formatted
Jordan Lee	(555) 123-4567	(555) 123-4567
Maria Santos	+1-555-123-4567	(555) 123-4567
David Osei	555.123.4567	(555) 123-4567
Priya Nair	5551234567	(555) 123-4567

Start cleaning your data →

Fixing Messy Phone Number Formats in Your Data

Step 1: Strip out all the formatting characters

Step 2: Handle the country code

Step 3: Reassemble in a readable format

When to just keep the digits

What about international numbers?

Try it yourself

Related tools

More from the blog

How to Standardize Inconsistent Company Names in a CSV

How to Remove Test Data and Junk Rows from a CSV

5 Ways to Clean Messy Data Without Writing Code