Using Regex to Extract Patterns from Text
Your data has a column called "address" and the values look like this: "Ship to: 123 Oak Lane, Portland, OR 97201". You need the zip code. Or you have a "notes" column full of strings like "Ref: INV-2024-0456" and you need to pull out those invoice numbers.
The data is there. It's just buried in free text. You could export to a script, write some Python, and re-import. Or you could extract it in place with a regular expression.
ExploreMyData's
Extract Text operation in regex mode uses DuckDB's
REGEXP_EXTRACT() function
to pull matching patterns out of any text column. The result goes into a new column,
and you keep working.
How Extract Text (regex mode) works
Select Extract Text from the toolbar, choose a text column,
and enter a regex pattern. ExploreMyData runs
REGEXP_EXTRACT(column, pattern)
against every row. It extracts the first match. If your pattern contains a capture group
(parentheses), it extracts the content of the first group instead of the full match.
Non-matching rows get NULL. No errors, no crashes -- just an empty cell you can handle later.
Extract Text operation - regex mode
- Source column: address
- Mode: Regex
- Pattern:
\d{5} - Output column name: zip_code
Generated SQL: REGEXP_EXTRACT(address, '\d{5}') AS zip_code
Non-matching rows return NULL. No errors thrown.
Pattern 1: Extract a 5-digit zip code
Given an address like "123 Oak Lane, Portland, OR 97201", you want the zip.
Pattern:
\d{5}
This matches exactly five consecutive digits. For most US address data, that's the zip code. It'll grab the first five-digit sequence it finds, so if the street number happens to be five digits ("12345 Main St, Portland, OR 97201"), you'd get "12345" instead of "97201".
A safer version:
\b\d{5}\b (word boundaries)
or even [A-Z]{2}\s+(\d{5})
with a capture group to grab the digits after a state abbreviation. More on capture groups below.
Pattern 2: Extract an invoice number
Your notes field contains "Ref: INV-2024-0456" and you want "INV-2024-0456".
Pattern:
INV-\d{4}-\d{4}
This matches the literal text "INV-", followed by exactly four digits, a dash, and four more
digits. Adjust the prefix and digit counts to match your actual format. If your invoices
look like "PO-123456", the pattern would be
PO-\d+.
Pattern 3: Extract an email address
A "contact_info" column contains mixed text: "John Smith, john.smith@acme.com, ext 4421". You want the email.
Pattern:
[\w.]+@[\w.]+
This matches one or more word characters or dots, then @, then more word characters or dots. It's not a perfect email validator (nothing is, really) but it works for extraction from semi-structured text. It'll grab "john.smith@acme.com" cleanly.
| contact_info | email_extracted |
|---|---|
| John Smith, john.smith@acme.com, ext 4421 | john.smith@acme.com |
| Sales team lead - sarah@globex.com | sarah@globex.com |
| call 555-0192 or email r.jones@initech.com | r.jones@initech.com |
| No email on file | NULL |
Pattern [\w.]+@[\w.]+ extracts the first email-like string from each cell. Rows without a match get NULL.
Pattern 4: Extract a dollar amount
Transaction descriptions like "Payment of $1,234.56 received" and you need the numeric amount.
Pattern:
\$[\d,]+\.?\d*
This matches a literal dollar sign, one or more digits or commas, an optional decimal point, and optional trailing digits. You'll get "$1,234.56" as a string. To use it numerically, follow up with a Find & Replace to strip the "$" and commas, then Convert Type to DOUBLE.
If you only want the number without the dollar sign, use a capture group:
\$([\d,]+\.?\d*).
The parentheses tell REGEXP_EXTRACT to return just the captured portion.
Pattern 5: Capture groups for precision
Capture groups are the sharp tool in the regex toolbox. When your pattern contains parentheses, REGEXP_EXTRACT returns the content inside the first group, not the entire match.
A concrete example: cells contain "Qty: (24 units)" and you want just the number inside the parentheses.
Pattern:
\((\d+)\)
\( matches a literal opening
parenthesis (backslash-escaped because parentheses are special in regex).
(\d+) is a capture group
matching one or more digits.
\) matches the closing parenthesis.
The result is just "24" -- the digits inside the parens.
Another example: extract the state abbreviation from "Portland, OR 97201" using
,\s+([A-Z]{2})\s+\d{5}.
The full match is ", OR 97201" but the capture group returns just "OR".
The generated SQL
Every Extract Text step produces a straightforward SQL expression:
REGEXP_EXTRACT(address, '\d{5}') AS address_extracted
With a capture group:
REGEXP_EXTRACT(notes, 'INV-(\d{4}-\d{4})', 1) AS notes_extracted
You can see this in the pipeline view. If the SQL looks right but results are wrong, the issue is always the pattern.
Regex gotchas to know about
Backslashes in DuckDB. DuckDB regex uses standard syntax.
\d means "digit",
\w means "word character",
\s means "whitespace".
Enter these directly in the pattern field -- ExploreMyData handles the escaping.
First match only. REGEXP_EXTRACT returns the first match in each cell. If an address has two zip codes ("Ship from 10001 to 97201"), you'll get "10001". Reorder your pattern to be more specific if needed.
Case sensitivity. Regex is case-sensitive by default. If your
invoice numbers might be "inv-2024-0456" or "INV-2024-0456", use
[Ii][Nn][Vv]-\d{4}-\d{4}
or apply a Text Transform (uppercase) before extracting.
Empty vs NULL. Non-matching rows return NULL, not an empty string. Use a Filter to find rows where the extracted column is NULL to audit what didn't match.