Fixing Messy Phone Number Formats in Your Data
You're trying to deduplicate a customer list. Two records have the same name and address, but
different phone numbers: (555) 123-4567 and
5551234567. Those are the same number.
Your deduplication doesn't know that.
Phone numbers are one of the worst fields to work with in raw data. People enter them however they feel like it. Systems export them in whatever format was configured years ago. Merge two data sources and you get chaos. Here's what a typical phone column looks like:
(555) 123-4567555-123-45675551234567+1-555-123-4567555.123.45671 (555) 123-4567+15551234567555 123 4567
Eight formats. Same number. You can't match records, you can't count unique phone numbers, and you can't build any report that groups by phone. The fix is straightforward: strip every number down to just digits, then rebuild it in one consistent format.
Here's how to do it in ExploreMyData.
Step 1: Strip out all the formatting characters
The goal is to turn every phone number into a plain string of digits. That means removing parentheses, dashes, dots, spaces, and the plus sign. Use Find & Replace from the Transform group.
Select your phone column and run these replacements one at a time:
- Find
(, replace with nothing (empty) - Find
), replace with nothing - Find
-, replace with nothing - Find
., replace with nothing - Find
(space), replace with nothing - Find
+, replace with nothing
Each step adds a REPLACE() call to the pipeline.
After all six, your SQL looks like a chain:
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '')
Not pretty SQL, but it works. Every phone number in your column is now a string of digits.
(555) 123-4567 became
5551234567.
+1-555-123-4567 became
15551234567.
Phone column values after each Find & Replace step strips one character type:
| Original | After removing ( ) - . space + |
|---|---|
| (555) 123-4567 | 5551234567 |
| 555-123-4567 | 5551234567 |
| 555.123.4567 | 5551234567 |
| +1-555-123-4567 | 15551234567 |
| 1 (555) 123-4567 | 15551234567 |
| 555 123 4567 | 5551234567 |
10-digit numbers are consistent; 11-digit numbers still have a leading country code "1" to handle in step 2.
Step 2: Handle the country code
After stripping formatting, some numbers are 10 digits (5551234567)
and some are 11 (15551234567) because they had a leading "1" country code.
You need to normalize this.
If all your data is US/Canada numbers, the simplest approach: use Extract Text from the Transform group. Select the phone column and extract the last 10 characters. This drops the leading "1" from 11-digit numbers and leaves 10-digit numbers unchanged.
The generated SQL uses RIGHT(phone, 10) or
equivalently SUBSTR(phone, -10) to grab the last 10 characters.
Now every phone number is exactly 10 digits. 5551234567.
Consistent. Matchable. Dedupable.
Step 3: Reassemble in a readable format
A 10-digit string works for matching, but it's not great for humans reading the data. If you want
a formatted output like (555) 123-4567, you can
rebuild it from the parts.
Use Extract Text to create three new columns from the phone column:
area_code: characters 1-3 -SUBSTR(phone, 1, 3)prefix: characters 4-6 -SUBSTR(phone, 4, 3)line: characters 7-10 -SUBSTR(phone, 7, 4)
Then use Combine Columns from the Columns group to join them back together.
Select the three columns, set the separator as needed, and name the output column
phone_formatted.
Or, for full control, the generated SQL for the combine step uses CONCAT():
CONCAT('(', area_code, ') ', prefix, '-', line)
The result: (555) 123-4567. Every row. Every time.
Three extracted columns are combined into a formatted phone number:
| phone (digits) | area_code | prefix | line | phone_formatted |
|---|---|---|---|---|
| 5551234567 | 555 | 123 | 4567 | (555) 123-4567 |
| 8005559876 | 800 | 555 | 9876 | (800) 555-9876 |
| 4155550011 | 415 | 555 | 0011 | (415) 555-0011 |
When to just keep the digits
Honestly? For most analysis, stop after step 2. The 10-digit string is the best format for matching, joining, and deduplication. It's unambiguous. Two records with the same 10-digit string are the same phone number, period.
Only do step 3 if you're exporting the data for human consumption - a mailing list, a customer directory, a report that someone will actually read. For analytics work, digits-only is cleaner.
What about international numbers?
If your data has numbers from multiple countries, the approach changes. You can't just take the
last 10 digits because country codes vary in length (1 for US, 44 for UK, 91 for India). In that
case, keep the full digit string including country code and standardize on the E.164 format:
+15551234567. Strip all formatting characters (step 1),
then prepend a "+" using Find & Replace or Combine Columns.
Final phone column - all 8 original formats normalized to the same output:
| customer_name | phone_original | phone_formatted |
|---|---|---|
| Jordan Lee | (555) 123-4567 | (555) 123-4567 |
| Maria Santos | +1-555-123-4567 | (555) 123-4567 |
| David Osei | 555.123.4567 | (555) 123-4567 |
| Priya Nair | 5551234567 | (555) 123-4567 |