Building a Multi-Step Data Pipeline
You load a messy export from your CRM. It has bad rows at the top, inconsistent date formats, a "full name" column that should be two columns, no derived columns like days since signup, and the output needs to be filtered and sorted for a specific stakeholder. That's at least six operations. In a spreadsheet, you'd do them one at a time, overwriting cells, hoping you don't make a mistake you can't undo.
In ExploreMyData, each operation becomes a step in a pipeline. The pipeline lives in the sidebar - every step is visible, editable, and shows the SQL it generates. This post walks through building a real 6-step pipeline from scratch.
The scenario
You've exported a contacts CSV from your CRM. Here's what's wrong with it:
- The first 3 rows are metadata headers, not real data (a common CRM export artifact).
- The "name" column has full names like "Jane Smith" - you need first and last separately.
- The "signup_date" column is a string like "03/15/2025" - not a real date type.
- There's no "days since signup" column, which your report needs.
- You only want active users from the last 90 days.
- The output should be sorted by signup date, newest first.
Let's build the pipeline.
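To make the walkthrough concrete, here is a hypothetical sample of what such an export might look like (the metadata lines, names, and values are invented for illustration; only the column names come from the scenario above):

```text
Export generated: 2025-06-13
Source: AcmeCRM v4
Fields: name, email, signup_date, status
name,email,signup_date,status
Jane Smith,jane@example.com,03/15/2025,active
Mary Jane Watson,mj@example.com,11/02/2024,inactive
Bob Lee,bob@example.com,05/01/2025,active
```

The first three lines are the metadata headers the scenario describes, and signup_date is a text column in MM/DD/YYYY form.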
Step 1: Remove bad rows
The first three rows are metadata - column descriptions, export timestamps, that kind of thing.
Use the Filter operation to remove them. You could filter where "name" is not empty and doesn't contain the header text. Or, if the real rows are clearly identifiable, filter on a condition like email contains '@' to keep only genuine contact rows.
After applying, the pipeline sidebar shows one card: "Filter - keep rows where email contains '@'". The card displays the generated SQL and a green "applied" status.
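In plain Python terms (a sketch of the idea, not ExploreMyData's engine, with hypothetical sample rows), the keep-rows-where-email-contains-'@' filter amounts to:

```python
# Hypothetical rows: the first three entries are the CRM's metadata lines.
rows = [
    {"name": "Export generated: 2025-06-13", "email": ""},
    {"name": "Source: AcmeCRM v4", "email": ""},
    {"name": "Fields: name, email, signup_date, status", "email": ""},
    {"name": "Jane Smith", "email": "jane@example.com"},
    {"name": "Bob Lee", "email": "bob@example.com"},
]

# Equivalent of: WHERE email LIKE '%@%'
contacts = [row for row in rows if "@" in row["email"]]
print(len(contacts))  # 2 -- the three metadata rows are gone
```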
Step 2: Split the name column
Use the Split Column operation. Select the "name" column, set the delimiter to a space, and set the number of parts to 2. This creates "name_1" (first name) and "name_2" (last name).
Now the sidebar shows two cards. Each one tells you exactly what it does. If a name has more than one space (like "Mary Jane Watson"), the split takes the first part and puts everything else in the second part. For more advanced name handling, see the name column post.
Pipeline sidebar - 2 steps
Keep rows where email contains '@'
WHERE email LIKE '%@%'
Split "name" on space into 2 parts → name_1, name_2
SPLIT_PART(name, ' ', 1) AS name_1, SPLIT_PART(name, ' ', 2) AS name_2
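A plain-Python sketch of the two-part split described above, where the first token becomes the first name and the remainder becomes the last name (this illustrates the described behavior, not the tool's internals):

```python
def split_name(full_name):
    # str.split with maxsplit=1 cuts on the first space only:
    # "Mary Jane Watson" -> ("Mary", "Jane Watson").
    parts = full_name.split(" ", 1)
    name_1 = parts[0]
    name_2 = parts[1] if len(parts) > 1 else ""  # single-word names get an empty last part
    return name_1, name_2

print(split_name("Mary Jane Watson"))  # ('Mary', 'Jane Watson')
```

Note that DuckDB's SPLIT_PART(name, ' ', 2) by itself returns only the second token ("Jane"); keeping the remainder in the last part requires extra logic beyond a bare SPLIT_PART.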
Step 3: Convert the date column
The "signup_date" column is text. Use Convert Type to cast it to DATE. ExploreMyData uses TRY_CAST(), so any values that can't be parsed become NULL instead of throwing an error. After this step, the column is a real date - it sorts correctly and supports date arithmetic.
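The TRY_CAST() behavior is easy to sketch in Python: parse if possible, return None (SQL's NULL) otherwise. The %m/%d/%Y format matches the "03/15/2025" strings in this scenario; the tool's actual parsing rules may differ.

```python
from datetime import date, datetime

def try_cast_date(value, fmt="%m/%d/%Y"):
    """Return a date, or None (like SQL NULL) when the value can't be parsed."""
    try:
        return datetime.strptime(value, fmt).date()
    except (ValueError, TypeError):
        return None

print(try_cast_date("03/15/2025"))  # 2025-03-15
print(try_cast_date("not a date"))  # None
```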
Step 4: Add a calculated column
You need "days since signup." Use the Expression operation (or Add Calculated Column) and enter CURRENT_DATE - signup_date. Name the output column "days_since_signup". DuckDB handles date subtraction natively and returns the number of days. The calculated column post and date operations guide cover more expression patterns.
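The same arithmetic in Python, with a fixed "today" so the result is reproducible (the dates are hypothetical):

```python
from datetime import date

def days_since(signup, today):
    # Subtracting two dates yields a timedelta; .days is the whole-day count,
    # matching DuckDB's CURRENT_DATE - signup_date.
    return (today - signup).days

print(days_since(date(2025, 3, 15), date(2025, 6, 13)))  # 90
```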
Step 5: Filter to active users, last 90 days
Add another Filter operation. Use two conditions with AND:
- status is "active"
- days_since_signup <= 90
Notice that you're filtering on a column you created in step 4. Each pipeline step builds on the previous result, so "days_since_signup" exists by the time this filter runs. The complex filters post covers more advanced AND/OR/NOT patterns.
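As a sketch, the two AND-ed conditions applied to rows shaped like the earlier steps' output (sample values hypothetical):

```python
rows = [
    {"name_1": "Jane", "status": "active",   "days_since_signup": 90},
    {"name_1": "Bob",  "status": "active",   "days_since_signup": 43},
    {"name_1": "Mary", "status": "inactive", "days_since_signup": 12},
    {"name_1": "Ana",  "status": "active",   "days_since_signup": 200},
]

# Both conditions must hold, like:
# WHERE status = 'active' AND days_since_signup <= 90
active_recent = [r for r in rows
                 if r["status"] == "active" and r["days_since_signup"] <= 90]
print([r["name_1"] for r in active_recent])  # ['Jane', 'Bob']
```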
Step 6: Sort by signup date
Add a Sort operation. Select "signup_date" and set the direction to descending. Newest signups appear first.
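In Python terms, a descending sort on the date column (sample rows hypothetical):

```python
from datetime import date

rows = [
    {"name_1": "Jane", "signup_date": date(2025, 3, 15)},
    {"name_1": "Bob",  "signup_date": date(2025, 5, 1)},
]

# Like ORDER BY signup_date DESC: newest first
report = sorted(rows, key=lambda r: r["signup_date"], reverse=True)
print([r["name_1"] for r in report])  # ['Bob', 'Jane']
```

Because step 3 made signup_date a real date, this sorts chronologically rather than alphabetically on strings like "03/15/2025".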
The pipeline sidebar now shows six cards, top to bottom:
- Filter - keep rows where email contains '@'
- Split Column - split "name" by space into 2 parts
- Convert Type - cast signup_date to DATE
- Expression - add days_since_signup
- Filter - keep active users, last 90 days
- Sort - signup_date descending
Pipeline sidebar - 6 steps complete
Editing and reordering steps
This is where pipelines beat spreadsheets. Click any card to edit it. Change the filter condition, rename the output column, adjust the sort direction. When you apply the edit, every step downstream re-executes automatically.
Need to move a step? You can reorder cards in the sidebar. The pipeline rebuilds in the new order. Want to remove step 2 entirely? Delete the card. Steps 3 through 6 re-execute against the new state (though they may break if they depended on columns from the deleted step - more on that next).
When steps break
Pipeline steps can have three statuses:
- Applied (green) - the step ran successfully.
- Warning (yellow) - the step ran but something looks off (like a column that might not exist in all cases).
- Broken (red) - the step failed. This usually happens when a step references a column that was removed or renamed by an earlier edit.
If step 3 breaks, steps 4 through 6 won't execute because they depend on step 3's output. The sidebar makes this visible immediately - you can see exactly which step failed and why. Fix it, and the downstream steps recover.
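A minimal sketch of that dependency rule (the idea, not ExploreMyData's actual engine): steps run in order, the first failure is marked broken, and everything downstream is skipped.

```python
def run_pipeline(data, steps):
    """Run (name, fn) steps in order; stop at the first failure."""
    statuses = {}
    for name, fn in steps:
        try:
            data = fn(data)
            statuses[name] = "applied"
        except Exception:
            statuses[name] = "broken"
            break  # downstream steps depend on this output, so they never run
    return data, statuses

steps = [
    ("filter",  lambda rows: [r for r in rows if "@" in r["email"]]),
    ("convert", lambda rows: [r["missing_column"] for r in rows]),  # fails: column gone
    ("sort",    lambda rows: sorted(rows)),
]
_, statuses = run_pipeline([{"email": "jane@example.com"}], steps)
print(statuses)  # {'filter': 'applied', 'convert': 'broken'}; 'sort' never ran
```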
The SQL trail
Every card shows the SQL it generates. This isn't just for debugging - it's documentation. When you come back to this analysis in two weeks, the pipeline tells you exactly what was done and in what order. Each step has a plain-language description and the precise SQL. No guessing.
Compare this to a spreadsheet where transformations are invisible. You see the final result but have no idea what formulas, sorts, or manual edits got you there. The pipeline is a ledger of every transformation.
Tips for longer pipelines
- Clean first, calculate second. Put filters and type conversions at the top. Calculations and aggregations come after the data is clean.
- Name your output columns clearly. When you add a calculated column, give it a descriptive name. Future steps (and future you) will thank you.
- Use the SQL view for debugging. If a result looks wrong, click the card and read the SQL. Often the issue is a column name mismatch or a type you forgot to convert.
- Don't be afraid to delete and rebuild. Unlike a spreadsheet, deleting a pipeline step doesn't destroy your source data. The original file is untouched. You're always working on a derived view.