
Extracting Data From Messy Text Columns

You've got a column called "description" and it looks like this:

  • Invoice #INV-2024-0456 for client Acme Corp
  • Payment received - PO#8834 - $2,450.00
  • Ref: TXN-20240315-A | Warehouse pickup
  • Order SKU-4421 shipped to NY

There's useful structured data buried in each of those strings - invoice numbers, PO numbers, transaction IDs, SKUs. But it's all mashed into a free-text column. You need each piece in its own column so you can filter, sort, and match on it.

This happens all the time with CRM exports, ERP transaction logs, and anything where humans typed data into a notes field. The information is there, but getting it out requires pattern matching. Here's how to do it in ExploreMyData.

Extract text before or after a delimiter

The simplest extraction. Your value is separated by a consistent character - a dash, a pipe, a colon. Take the string Payment received - PO#8834 - $2,450.00. The PO number is between the first and second dashes.

Open Extract Text from the Transform group. Select the description column. Choose after_delimiter, enter " - " (space, dash, space) as the delimiter, and set the occurrence to 1 (first occurrence). This gives you everything after the first dash: PO#8834 - $2,450.00.

Then apply before_delimiter on the result with the same delimiter. Now you have PO#8834. Two steps, and the PO number is in its own column.

Under the hood, ExploreMyData generates SPLIT_PART() calls. SPLIT_PART splits the string on the delimiter and returns the Nth piece, so the two chained steps collapse into a single call that returns the second " - "-separated segment:

SPLIT_PART(description, ' - ', 2) AS po_number

description (original) | after first " - " | po_number (extracted)
Payment received - PO#8834 - $2,450.00 | PO#8834 - $2,450.00 | PO#8834
Payment received - PO#9021 - $780.00 | PO#9021 - $780.00 | PO#9021
Payment received - PO#7755 - $14,200.00 | PO#7755 - $14,200.00 | PO#7755

Two Extract Text steps: first after_delimiter(" - ", 1), then before_delimiter(" - ", 1) on the result.
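If you want to sanity-check the two-step logic outside the tool, it can be sketched in plain Python. The helper names after_delimiter and before_delimiter here are hypothetical stand-ins for the extraction modes, not part of any library:

```python
def after_delimiter(s, delim, occurrence=1):
    # Everything after the Nth occurrence of the delimiter.
    parts = s.split(delim)
    return delim.join(parts[occurrence:])

def before_delimiter(s, delim, occurrence=1):
    # Everything before the Nth occurrence of the delimiter.
    parts = s.split(delim)
    return delim.join(parts[:occurrence])

desc = "Payment received - PO#8834 - $2,450.00"
step1 = after_delimiter(desc, " - ")    # 'PO#8834 - $2,450.00'
step2 = before_delimiter(step1, " - ")  # 'PO#8834'
```

The composition of the two steps is exactly what a single SPLIT_PART call with index 2 computes.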

Extract text between delimiters

Sometimes the value you want is sandwiched between two different markers. The string Invoice #INV-2024-0456 for client Acme Corp has the invoice number between "#" and " for".

Use between_delimiters. Set the start delimiter to # and the end delimiter to " for" (including the leading space). Result: INV-2024-0456.

This is one of the most useful extraction modes. It works for anything that has consistent start and end markers, even if the text between them varies in length.
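As a rough mental model, between_delimiters behaves like the following Python sketch (the function name is made up for illustration; this is not ExploreMyData's implementation):

```python
def between_delimiters(s, start, end):
    # Text after the first `start`, up to the next `end`.
    # Returns '' when either marker is missing.
    _, found_start, rest = s.partition(start)
    if not found_start:
        return ""
    value, found_end, _ = rest.partition(end)
    return value if found_end else ""

between_delimiters("Invoice #INV-2024-0456 for client Acme Corp", "#", " for")
# → 'INV-2024-0456'
```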

Extract with regex for complex patterns

Delimiters work when the structure is consistent. But sometimes the patterns are more fluid. You need "any sequence that looks like INV-YYYY-NNNN" regardless of what surrounds it.

The regex extraction mode handles this. Enter a regular expression pattern, and ExploreMyData extracts the first match from each row.

For invoice numbers that follow the INV-YYYY-NNNN pattern:

INV-\d{4}-\d{4}

For SKU codes like SKU-4421:

SKU-\d+

For any dollar amount:

\$[\d,]+\.?\d*

ExploreMyData generates DuckDB's REGEXP_EXTRACT() function:

REGEXP_EXTRACT(description, 'INV-\d{4}-\d{4}') AS invoice_number

If a row doesn't match the pattern, the result is an empty string. No errors, no crashed queries. You can filter out empty results afterward if needed.

description (original) | invoice_number (regex extracted)
Invoice #INV-2024-0456 for client Acme Corp | INV-2024-0456
Re: INV-2024-0891 outstanding balance | INV-2024-0891
Order SKU-4421 shipped to NY | (empty - no match)
Credit note for INV-2024-0456 | INV-2024-0456

Pattern used: INV-\d{4}-\d{4}. Rows without a matching pattern return an empty string, not an error.
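Before wiring a pattern into a pipeline step, it's worth testing it against a few sample rows. Python's re module uses the same basic syntax as DuckDB's regex engine for patterns like these, so it makes a convenient scratchpad:

```python
import re

rows = [
    "Invoice #INV-2024-0456 for client Acme Corp",
    "Payment received - PO#8834 - $2,450.00",
    "Order SKU-4421 shipped to NY",
]

def extract(pattern, text):
    # Mirror REGEXP_EXTRACT: first match, or empty string when there is none.
    m = re.search(pattern, text)
    return m.group(0) if m else ""

for row in rows:
    print({
        "invoice": extract(r"INV-\d{4}-\d{4}", row),
        "sku": extract(r"SKU-\d+", row),
        "amount": extract(r"\$[\d,]+\.?\d*", row),
    })
```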

Position-based extraction

Some codes have fixed positions. If every row starts with a 6-character product code, or the last 4 characters are always a location code, you can extract by position.

Use Add Column with DuckDB's substring functions:

First 6 characters: LEFT(description, 6) AS product_code

Last 4 characters: RIGHT(description, 4) AS location_code

Characters 3 through 8: SUBSTRING(description, 3, 6) AS middle_part

Position-based extraction is fast and reliable when the format is truly fixed-width. It breaks immediately if it isn't, so verify on a few rows before applying.
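The SQL functions map directly onto string slicing, which makes them easy to verify on a sample value. Here is a Python sketch using a made-up fixed-width layout (6-character product code at the start, 4-character location code at the end):

```python
# Hypothetical fixed-width value, for illustration only.
desc = "PRD441-bulk order, pallet A-NY01"

product_code = desc[:6]    # LEFT(description, 6)
location_code = desc[-4:]  # RIGHT(description, 4)
middle_part = desc[2:8]    # SUBSTRING(description, 3, 6) -- SQL positions are 1-based
```

Note the off-by-one: SUBSTRING counts from 1, Python slices from 0.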

Checking if a pattern exists

Sometimes you don't need to extract the value - you just need to know if it's there. Does this description mention an invoice? Does it contain a dollar amount?

Use Add Column with a CASE expression:

CASE WHEN REGEXP_MATCHES(description, 'INV-\d{4}-\d{4}') THEN 'Yes' ELSE 'No' END AS has_invoice

Now you can filter to only rows that contain (or don't contain) an invoice number. This is useful for triaging mixed-format data - separating rows that follow the expected pattern from ones that need manual review.
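The CASE expression is just a boolean test dressed up as a Yes/No column. In Python terms, it amounts to:

```python
import re

def has_invoice(description):
    # Same logic as the CASE/REGEXP_MATCHES flag:
    # 'Yes' when the pattern appears anywhere in the string.
    return "Yes" if re.search(r"INV-\d{4}-\d{4}", description) else "No"

print(has_invoice("Re: INV-2024-0891 outstanding balance"))  # Yes
print(has_invoice("Order SKU-4421 shipped to NY"))           # No
```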

Chaining extractions for complex parsing

Real data often needs multiple extraction passes. Take this transaction description:

Ref: TXN-20240315-A | Warehouse pickup

You want three things: the reference ID, the date embedded in it, and the fulfillment method.

  1. Extract between "Ref: " and " |" to get TXN-20240315-A
  2. Extract after "| " to get Warehouse pickup
  3. On the reference ID, use regex \d{8} to extract 20240315, then convert it to a DATE type

Three pipeline steps, and you've turned one messy column into three clean, typed columns. Each step is visible in the pipeline, so you can see exactly what happened and adjust if a future file has a slightly different format.

description (original) | ref_id | txn_date | fulfillment
"Ref: TXN-20240315-A | Warehouse pickup" | TXN-20240315-A | 2024-03-15 | Warehouse pickup
"Ref: TXN-20240407-B | Home delivery" | TXN-20240407-B | 2024-04-07 | Home delivery
"Ref: TXN-20240519-C | Locker collection" | TXN-20240519-C | 2024-05-19 | Locker collection

Three pipeline steps: extract between "Ref: " and " |", extract after "| ", then use regex \d{8} on the ref_id and cast as DATE.
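The whole chain can be sketched as one Python function, assuming every row follows the "Ref: ... | ..." format shown above (a real pipeline would handle rows that don't):

```python
import re
from datetime import datetime

def parse_description(desc):
    # Step 1: between "Ref: " and " |" -> reference ID.
    ref_id = desc.split("Ref: ", 1)[1].split(" |", 1)[0]
    # Step 2: after "| " -> fulfillment method.
    fulfillment = desc.split("| ", 1)[1]
    # Step 3: \d{8} inside the ref ID, cast to a date.
    raw = re.search(r"\d{8}", ref_id).group(0)
    txn_date = datetime.strptime(raw, "%Y%m%d").date()
    return ref_id, txn_date, fulfillment

parse_description("Ref: TXN-20240315-A | Warehouse pickup")
```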

When extraction fails on some rows

Not every row will match your pattern. Some descriptions might be blank. Others might use a different format. That's fine. Failed extractions return empty strings or NULLs, not errors.

After extracting, filter where the new column is empty to see which rows didn't match. Often these are edge cases that reveal a second format you didn't know about. Handle those with a second extraction step, or flag them for manual review.
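The triage itself is a one-liner: keep the rows where extraction came back empty. A small Python sketch of that filter, with illustrative sample rows:

```python
import re

rows = [
    "Invoice #INV-2024-0456 for client Acme Corp",
    "Order SKU-4421 shipped to NY",
    "",
]

def extract(pattern, text):
    m = re.search(pattern, text)
    return m.group(0) if m else ""

# Rows where the invoice pattern found nothing: candidates for a
# second extraction pattern or manual review.
unmatched = [r for r in rows if extract(r"INV-\d{4}-\d{4}", r) == ""]
```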

Start extracting from your text data →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData