March 27, 2026 by Arif Aslam 5 min read

Using Regex to Extract Patterns from Text

Your data has a column called "address" and the values look like this: "Ship to: 123 Oak Lane, Portland, OR 97201". You need the zip code. Or you have a "notes" column full of strings like "Ref: INV-2024-0456" and you need to pull out those invoice numbers.

The data is there. It's just buried in free text. You could export to a script, write some Python, and re-import. Or you could extract it in place with a regular expression.

ExploreMyData's Extract Text operation in regex mode uses DuckDB's REGEXP_EXTRACT() function to pull matching patterns out of any text column. The result goes into a new column, and you keep working.

How Extract Text (regex mode) works

Select Extract Text from the toolbar, choose a text column, and enter a regex pattern. ExploreMyData runs REGEXP_EXTRACT(column, pattern) against every row. It extracts the first match. If your pattern contains a capture group (parentheses), it extracts the content of the first group instead of the full match.

Non-matching rows get NULL. No errors, no crashes. Just an empty cell you can handle later.

Extract Text operation - regex mode

Source column: address
Mode: Regex
Pattern: \d{5}
Output column name: zip_code

Generated SQL: REGEXP_EXTRACT(address, '\d{5}') AS zip_code

Non-matching rows return NULL. No errors thrown.

Pattern 1: Extract a 5-digit zip code

Given an address like "123 Oak Lane, Portland, OR 97201", you want the zip.

Pattern: \d{5}

This matches exactly five consecutive digits. For most US address data, that's the zip code. It'll grab the first five-digit sequence it finds, so if the street number happens to be five digits ("12345 Main St, Portland, OR 97201"), you'd get "12345" instead of "97201".

A safer version: \b\d{5}\b (word boundaries) or even [A-Z]{2}\s+(\d{5}) with a capture group to grab the digits after a state abbreviation. More on capture groups below.

Pattern 2: Extract an invoice number

Your notes field contains "Ref: INV-2024-0456" and you want "INV-2024-0456".

Pattern: INV-\d{4}-\d{4}

This matches the literal text "INV-", followed by exactly four digits, a dash, and four more digits. Adjust the prefix and digit counts to match your actual format. If your invoices look like "PO-123456", the pattern would be PO-\d+.

Pattern 3: Extract an email address

A "contact_info" column contains mixed text: "John Smith, john.smith@acme.com, ext 4421". You want the email.

Pattern: [\w.]+@[\w.]+

This matches one or more word characters or dots, then @, then more word characters or dots. It's not a perfect email validator (nothing is, really) but it works for extraction from semi-structured text. It'll grab "john.smith@acme.com" cleanly.

contact_info	email_extracted
John Smith, john.smith@acme.com, ext 4421	john.smith@acme.com
Sales team lead - sarah@globex.com	sarah@globex.com
call 555-0192 or email r.jones@initech.com	r.jones@initech.com
No email on file	NULL

Pattern [\w.]+@[\w.]+ extracts the first email-like string from each cell. Rows without a match get NULL.

Pattern 4: Extract a dollar amount

Transaction descriptions like "Payment of $1,234.56 received" and you need the numeric amount.

Pattern: \$[\d,]+\.?\d*

This matches a literal dollar sign, one or more digits or commas, an optional decimal point, and optional trailing digits. You'll get "$1,234.56" as a string. To use it numerically, follow up with a Find & Replace to strip the "$" and commas, then Convert Type to DOUBLE.

If you only want the number without the dollar sign, use a capture group: \$([\d,]+\.?\d*). The parentheses tell REGEXP_EXTRACT to return just the captured portion.

Pattern 5: Capture groups for precision

Capture groups are the sharp tool in the regex toolbox. When your pattern contains parentheses, REGEXP_EXTRACT returns the content inside the first group, not the entire match.

A concrete example: cells contain "Qty: (24 units)" and you want just the number inside the parentheses.

Pattern: $(\d+)$

$ matches a literal opening parenthesis (backslash-escaped because parentheses are special in regex). (\d+) is a capture group matching one or more digits. $ matches the closing parenthesis. The result is just "24", the digits inside the parens.

Another example: extract the state abbreviation from "Portland, OR 97201" using ,\s+([A-Z]{2})\s+\d{5}. The full match is ", OR 97201" but the capture group returns just "OR".

The generated SQL

Every Extract Text step produces a straightforward SQL expression:

REGEXP_EXTRACT(address, '\d{5}') AS address_extracted

With a capture group:

REGEXP_EXTRACT(notes, 'INV-(\d{4}-\d{4})', 1) AS notes_extracted

You can see this in the pipeline view. If the SQL looks right but results are wrong, the issue is always the pattern.

Regex gotchas to know about

Backslashes in DuckDB. DuckDB regex uses standard syntax. \d means "digit", \w means "word character", \s means "whitespace". Enter these directly in the pattern field. ExploreMyData handles the escaping.

First match only. REGEXP_EXTRACT returns the first match in each cell. If an address has two zip codes ("Ship from 10001 to 97201"), you'll get "10001". Reorder your pattern to be more specific if needed.

Case sensitivity. Regex is case-sensitive by default. If your invoice numbers might be "inv-2024-0456" or "INV-2024-0456", use [Ii][Nn][Vv]-\d{4}-\d{4} or apply a Text Transform (uppercase) before extracting.

Empty vs NULL. Non-matching rows return NULL, not an empty string. Use a Filter to find rows where the extracted column is NULL to audit what didn't match.

Extract patterns from your data →

Using Regex to Extract Patterns from Text

How Extract Text (regex mode) works

Pattern 1: Extract a 5-digit zip code

Pattern 2: Extract an invoice number

Pattern 3: Extract an email address

Pattern 4: Extract a dollar amount

Pattern 5: Capture groups for precision

The generated SQL

Regex gotchas to know about

Try it yourself

Related tools

More from the blog

Text Extraction Patterns

Extract Email Domains

Split Address Fields