Extracting Data From Messy Text Columns
You've got a column called "description" and it looks like this:
Invoice #INV-2024-0456 for client Acme Corp
Payment received - PO#8834 - $2,450.00
Ref: TXN-20240315-A | Warehouse pickup
Order SKU-4421 shipped to NY
There's useful structured data buried in each of those strings - invoice numbers, PO numbers, transaction IDs, SKUs. But it's all mashed into a free-text column. You need each piece in its own column so you can filter, sort, and match on it.
This happens all the time with CRM exports, ERP transaction logs, and anything where humans typed data into a notes field. The information is there, but getting it out requires pattern matching. Here's how to do it in ExploreMyData.
Extract text before or after a delimiter
The simplest extraction: your value is separated by a consistent character - a dash, a pipe, a colon. Take the string Payment received - PO#8834 - $2,450.00. The PO number sits between the first and second dashes.
Open Extract Text from the Transform group. Select the description column. Choose after_delimiter, enter " - " (space-dash-space) as the delimiter, and set occurrence to 1 (first occurrence). This gives you everything after the first dash: PO#8834 - $2,450.00. Then apply before_delimiter on the result with the same delimiter. Now you have PO#8834. Two steps, and the PO number is in its own column.
Under the hood, ExploreMyData generates SPLIT_PART() calls. Combined, the two steps are equivalent to taking the second delimited part:
SPLIT_PART(description, ' - ', 2) AS po_number
| description (original) | after first " - " | po_number (extracted) |
|---|---|---|
| Payment received - PO#8834 - $2,450.00 | PO#8834 - $2,450.00 | PO#8834 |
| Payment received - PO#9021 - $780.00 | PO#9021 - $780.00 | PO#9021 |
| Payment received - PO#7755 - $14,200.00 | PO#7755 - $14,200.00 | PO#7755 |
Two Extract Text steps: first after_delimiter(" - ", 1), then before_delimiter(" - ", 1) on the result.
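If you want to check what a given delimiter and occurrence setting will produce before running it, the same two-step logic is easy to sketch in plain Python. The function names here are illustrative, not ExploreMyData's API:

```python
def after_delimiter(text, delim, occurrence=1):
    # Everything after the Nth occurrence of the delimiter.
    parts = text.split(delim)
    if len(parts) <= occurrence:
        return ""  # delimiter not found often enough: empty result, not an error
    return delim.join(parts[occurrence:])

def before_delimiter(text, delim, occurrence=1):
    # Everything before the Nth occurrence of the delimiter.
    parts = text.split(delim)
    if len(parts) <= occurrence:
        return ""
    return delim.join(parts[:occurrence])

s = "Payment received - PO#8834 - $2,450.00"
step1 = after_delimiter(s, " - ", 1)        # "PO#8834 - $2,450.00"
step2 = before_delimiter(step1, " - ", 1)   # "PO#8834"
```

Note that after_delimiter has to rejoin the remaining parts, which is why "everything after the first dash" still contains the second dash.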
Extract text between delimiters
Sometimes the value you want is sandwiched between two different markers. The string Invoice #INV-2024-0456 for client Acme Corp has the invoice number between "#" and " for".
Use between_delimiters. Set the start delimiter to "#" and the end delimiter to " for" (including the leading space, so the result carries no trailing whitespace). Result: INV-2024-0456.
This is one of the most useful extraction modes. It works for anything that has consistent start and end markers, even if the text between them varies in length.
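The between-two-markers logic can be sketched in a few lines of Python. This is an illustrative equivalent, not ExploreMyData's implementation; like the tool, it returns an empty string when either marker is missing:

```python
def between_delimiters(text, start, end):
    # Text between the first occurrence of `start` and the next occurrence of `end`.
    i = text.find(start)
    if i == -1:
        return ""          # start marker missing: empty, not an error
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return ""          # end marker missing after the start marker
    return text[i:j]

result = between_delimiters(
    "Invoice #INV-2024-0456 for client Acme Corp", "#", " for"
)  # "INV-2024-0456"
```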
Extract with regex for complex patterns
Delimiters work when the structure is consistent. But sometimes the patterns are more fluid. You need "any sequence that looks like INV-YYYY-NNNN" regardless of what surrounds it.
The regex extraction mode handles this. Enter a regular expression pattern, and ExploreMyData extracts the first match from each row.
For invoice numbers that follow the INV-YYYY-NNNN pattern:
INV-\d{4}-\d{4}
For SKU codes like SKU-4421:
SKU-\d+
For any dollar amount:
\$[\d,]+\.?\d*
ExploreMyData generates DuckDB's
REGEXP_EXTRACT() function:
REGEXP_EXTRACT(description, 'INV-\d{4}-\d{4}') AS invoice_number
If a row doesn't match the pattern, the result is an empty string. No errors, no crashed queries. You can filter out empty results afterward if needed.
| description (original) | invoice_number (regex extracted) |
|---|---|
| Invoice #INV-2024-0456 for client Acme Corp | INV-2024-0456 |
| Re: INV-2024-0891 outstanding balance | INV-2024-0891 |
| Order SKU-4421 shipped to NY | (empty - no match) |
| Credit note for INV-2024-0456 | INV-2024-0456 |
Pattern used: INV-\d{4}-\d{4}. Rows without a matching pattern return an empty string, not an error.
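To try a pattern on a few sample values before applying it to a whole column, the first-match-or-empty behavior is easy to reproduce with Python's stdlib re module. The regexp_extract helper below is a hypothetical stand-in mirroring DuckDB's default (whole first match, empty string on no match):

```python
import re

def regexp_extract(text, pattern):
    # First match of the pattern, or "" if the row doesn't match.
    m = re.search(pattern, text)
    return m.group(0) if m else ""

rows = [
    "Invoice #INV-2024-0456 for client Acme Corp",
    "Re: INV-2024-0891 outstanding balance",
    "Order SKU-4421 shipped to NY",
]
results = [regexp_extract(row, r"INV-\d{4}-\d{4}") for row in rows]
# ["INV-2024-0456", "INV-2024-0891", ""]
```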
Position-based extraction
Some codes have fixed positions. If every row starts with a 6-character product code, or the last 4 characters are always a location code, you can extract by position.
Use Add Column with DuckDB's substring functions:
- First 6 characters: LEFT(description, 6) AS product_code
- Last 4 characters: RIGHT(description, 4) AS location_code
- Characters 3 through 8: SUBSTRING(description, 3, 6) AS middle_part
Position-based extraction is fast and reliable when the format is truly fixed-width. It breaks immediately if it isn't, so verify on a few rows before applying.
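These SQL functions map directly onto string slicing, which makes them easy to verify on sample values. A Python sketch of the equivalence, using a made-up fixed-width code AB1234-NYC; note that SUBSTRING is 1-indexed while Python slices are 0-indexed:

```python
def left(text, n):
    return text[:n]                 # LEFT(text, n)

def right(text, n):
    return text[-n:] if n else ""   # RIGHT(text, n)

def substring(text, start, length):
    # SUBSTRING(text, start, length) counts characters from 1.
    return text[start - 1:start - 1 + length]

code = "AB1234-NYC"   # hypothetical fixed-width product/location code
product_code = left(code, 6)        # "AB1234"
location_code = right(code, 3)      # "NYC"
middle = substring(code, 3, 4)      # "1234"
```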
Checking if a pattern exists
Sometimes you don't need to extract the value - you just need to know if it's there. Does this description mention an invoice? Does it contain a dollar amount?
Use Add Column with a CASE expression:
CASE WHEN REGEXP_MATCHES(description, 'INV-\d{4}-\d{4}') THEN 'Yes' ELSE 'No' END AS has_invoice
Now you can filter to only rows that contain (or don't contain) an invoice number. This is useful for triaging mixed-format data - separating rows that follow the expected pattern from ones that need manual review.
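The CASE expression boils down to "does the pattern match anywhere in the string?". A sketch of the same check in Python (illustrative, not ExploreMyData's implementation):

```python
import re

def has_invoice(text):
    # Mirrors CASE WHEN REGEXP_MATCHES(...) THEN 'Yes' ELSE 'No' END.
    return "Yes" if re.search(r"INV-\d{4}-\d{4}", text) else "No"

has_invoice("Credit note for INV-2024-0456")   # "Yes"
has_invoice("Order SKU-4421 shipped to NY")    # "No"
```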
Chaining extractions for complex parsing
Real data often needs multiple extraction passes. Take this transaction description:
Ref: TXN-20240315-A | Warehouse pickup
You want three things: the reference ID, the date embedded in it, and the fulfillment method.
- Step 1: Extract between "Ref: " and " |" to get TXN-20240315-A
- Step 2: Extract after "| " to get Warehouse pickup
- Step 3: On the reference ID, use regex \d{8} to extract 20240315, then convert it to a DATE type
Three pipeline steps, and you've turned one messy column into three clean, typed columns. Each step is visible in the pipeline, so you can see exactly what happened and adjust if a future file has a slightly different format.
| description (original) | ref_id | txn_date | fulfillment |
|---|---|---|---|
| Ref: TXN-20240315-A \| Warehouse pickup | TXN-20240315-A | 2024-03-15 | Warehouse pickup |
| Ref: TXN-20240407-B \| Home delivery | TXN-20240407-B | 2024-04-07 | Home delivery |
| Ref: TXN-20240519-C \| Locker collection | TXN-20240519-C | 2024-05-19 | Locker collection |
Three pipeline steps: extract between "Ref: " and " |", extract after "| ", then use regex \d{8} on the ref_id and cast as DATE.
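The three steps can be sketched as one Python function, assuming the reference ID always embeds a YYYYMMDD date (the function name is illustrative, not part of any API):

```python
import re
from datetime import datetime

def parse_description(text):
    # Step 1: between "Ref: " and " |" -> reference ID
    m = re.search(r"Ref: (.*?) \|", text)
    ref_id = m.group(1) if m else ""
    # Step 2: after "| " -> fulfillment method
    _, _, fulfillment = text.partition("| ")
    # Step 3: \d{8} on the ref_id, then cast to a date
    d = re.search(r"\d{8}", ref_id)
    txn_date = datetime.strptime(d.group(0), "%Y%m%d").date() if d else None
    return ref_id, txn_date, fulfillment

ref_id, txn_date, fulfillment = parse_description(
    "Ref: TXN-20240315-A | Warehouse pickup"
)  # "TXN-20240315-A", date(2024, 3, 15), "Warehouse pickup"
```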
When extraction fails on some rows
Not every row will match your pattern. Some descriptions might be blank. Others might use a different format. That's fine. Failed extractions return empty strings or NULLs, not errors.
After extracting, filter where the new column is empty to see which rows didn't match. Often these are edge cases that reveal a second format you didn't know about. Handle those with a second extraction step, or flag them for manual review.
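That triage step can be sketched in Python too, reusing the invoice pattern from earlier - rows whose extraction comes back empty are collected as candidates for a second pass or manual review:

```python
import re

def extract_invoice(text):
    m = re.search(r"INV-\d{4}-\d{4}", text)
    return m.group(0) if m else ""

rows = [
    "Invoice #INV-2024-0456 for client Acme Corp",
    "Order SKU-4421 shipped to NY",
    "",
]

# Rows where extraction produced nothing - the edge cases worth inspecting.
unmatched = [r for r in rows if extract_invoice(r) == ""]
# ["Order SKU-4421 shipped to NY", ""]
```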