
Building a Multi-Step Data Pipeline

You load a messy export from your CRM. It has bad rows at the top, inconsistent date formats, a "full name" column that should be two columns, no revenue calculations, and the output needs to be filtered and sorted for a specific stakeholder. That's at least six operations. In a spreadsheet, you'd do them one at a time, overwriting cells, hoping you don't make a mistake you can't undo.

In ExploreMyData, each operation becomes a step in a pipeline. The pipeline lives in the sidebar - every step is visible, editable, and shows the SQL it generates. This post walks through building a real 6-step pipeline from scratch.

The scenario

You've exported a contacts CSV from your CRM. Here's what's wrong with it:

  • The first three rows are metadata headers, not real data (a common CRM export artifact).
  • The "name" column has full names like "Jane Smith" - you need first and last separately.
  • The "signup_date" column is a string like "03/15/2025" - not a real date type.
  • There's no "days since signup" column, which your report needs.
  • You only want active users from the last 90 days.
  • The output should be sorted by signup date, newest first.

Let's build the pipeline.

Step 1: Remove bad rows

The first three rows are metadata - column descriptions, export timestamps, that kind of thing. Use the Filter operation to remove them. One option is to filter where "name" is non-empty and doesn't contain the header text. Simpler still, if every real contact row has an email address, filter on a condition like email contains '@' to keep only real contact rows.

After applying, the pipeline sidebar shows one card: "Filter - keep rows where email contains '@'". The card displays the generated SQL and a green "applied" status.
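ExploreMyData writes the SQL for you, but the filter logic is simple enough to sketch outside the tool. A minimal plain-Python equivalent of WHERE email LIKE '%@%' (the list-of-dicts row shape and the helper name are illustrative, not the tool's API):

```python
# Hypothetical sketch of step 1: keep rows where email contains '@'
# (equivalent to WHERE email LIKE '%@%'). Metadata rows fail the test.
def keep_contact_rows(rows):
    """Drop metadata/header rows that lack a real email address."""
    return [row for row in rows if "@" in row.get("email", "")]

rows = [
    {"email": "Exported 2025-03-20"},  # metadata artifact, no '@'
    {"email": "jane@example.com", "name": "Jane Smith"},
]
clean = keep_contact_rows(rows)
```

The same idea scales to any "keep only real rows" condition: pick a column that is always populated in genuine data and never in the junk rows.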

Step 2: Split the name column

Use the Split Column operation. Select the "name" column, set the delimiter to a space, and set the number of parts to 2. This creates "name_1" (first name) and "name_2" (last name).

Now the sidebar shows two cards. Each one tells you exactly what it does. If a name has more than one space (like "Mary Jane Watson"), SPLIT_PART takes the parts by position: "Mary" goes to name_1, "Jane" goes to name_2, and "Watson" is dropped. For more advanced name handling, see the name column post.

Pipeline sidebar - 2 steps

1 - Filter applied

Keep rows where email contains '@'

WHERE email LIKE '%@%'

2 - Split Column applied

Split "name" on space into 2 parts → name_1, name_2

SPLIT_PART(name, ' ', 1) AS name_1, SPLIT_PART(name, ' ', 2) AS name_2
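The SQL on the card mirrors simple positional splitting. A hedged Python sketch of the same behavior, including how SPLIT_PART handles a missing second token (it returns an empty string):

```python
# Sketch of step 2, mirroring SPLIT_PART(name, ' ', 1) and
# SPLIT_PART(name, ' ', 2): parts are taken by position, extras dropped.
def split_name(name):
    parts = name.split(" ")
    name_1 = parts[0] if len(parts) > 0 else ""
    name_2 = parts[1] if len(parts) > 1 else ""  # "" when no second token
    return name_1, name_2
```

Running it on "Mary Jane Watson" returns ("Mary", "Jane") - the third token is discarded, which is why multi-word surnames need the more advanced handling covered in the name column post.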

Step 3: Convert the date column

The "signup_date" column is text. Use Convert Type to cast it to DATE. ExploreMyData uses TRY_CAST(), so any values that can't be parsed become NULL instead of throwing an error. After this step, the column is a real date - it sorts correctly and supports date arithmetic.
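The TRY_CAST behavior - unparseable values become NULL rather than failing the whole step - is worth internalizing. A minimal Python sketch of the same tolerance, assuming the MM/DD/YYYY format from the scenario:

```python
from datetime import datetime, date

# Sketch of step 3's TRY_CAST semantics: bad input yields None (NULL),
# never an exception. Format string matches "03/15/2025"-style values.
def try_parse_date(text):
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except (TypeError, ValueError):
        return None
```

After this conversion, comparisons and sorting work on real dates instead of string order (where "12/01/2024" would sort after "03/15/2025").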

Step 4: Add a calculated column

You need "days since signup." Use the Expression operation (or Add Calculated Column) and enter: CURRENT_DATE - signup_date. Name the output column "days_since_signup". DuckDB handles date subtraction natively and returns the number of days. The calculated column post and date operations guide cover more expression patterns.
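In DuckDB, subtracting one DATE from another yields an integer day count; in Python, the analogous subtraction yields a timedelta, so you take .days. A small sketch (the helper name and the explicit today parameter are mine, for testability):

```python
from datetime import date

# Sketch of step 4: CURRENT_DATE - signup_date as an integer day count.
def days_since_signup(signup_date, today=None):
    today = today or date.today()
    return (today - signup_date).days
```

Passing today explicitly also shows why the column is dynamic: re-running the pipeline tomorrow produces different values, which matters for the 90-day filter in the next step.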

Step 5: Filter to active users, last 90 days

Add another Filter operation. Use two conditions with AND:

  • status is "active"
  • days_since_signup <= 90

Notice that you're filtering on a column you created in step 4. Each pipeline step builds on the previous result, so "days_since_signup" exists by the time this filter runs. The complex filters post covers more advanced AND/OR/NOT patterns.
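The two-condition AND filter reads the same in any language. A sketch of the predicate, applied to illustrative rows (the row dicts are mine; the condition matches status = 'active' AND days_since_signup <= 90):

```python
# Sketch of step 5: both conditions must hold (AND).
def is_recent_active(row):
    return row["status"] == "active" and row["days_since_signup"] <= 90

rows = [
    {"name_1": "Jane", "status": "active", "days_since_signup": 5},
    {"name_1": "Ana", "status": "inactive", "days_since_signup": 10},
    {"name_1": "Bo", "status": "active", "days_since_signup": 120},
]
active_recent = [r for r in rows if is_recent_active(r)]
```

Only Jane survives: Ana fails the status check and Bo fails the 90-day check, even though each passes the other condition.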

Step 6: Sort by signup date

Add a Sort operation. Select "signup_date" and set the direction to descending. Newest signups appear first.
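The sort step corresponds to ORDER BY signup_date DESC. A Python sketch, assuming signup_date is already a real date from step 3 (string dates would sort lexically, which is the bug step 3 prevents):

```python
from datetime import date

# Sketch of step 6: newest signups first (ORDER BY signup_date DESC).
rows = [
    {"name_1": "Ana", "signup_date": date(2025, 1, 2)},
    {"name_1": "Jane", "signup_date": date(2025, 3, 15)},
]
rows_sorted = sorted(rows, key=lambda r: r["signup_date"], reverse=True)
```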

The pipeline sidebar now shows six cards, top to bottom:

  1. Filter - keep rows where email contains '@'
  2. Split Column - split "name" by space into 2 parts
  3. Convert Type - cast signup_date to DATE
  4. Expression - add days_since_signup
  5. Filter - keep active users, last 90 days
  6. Sort - signup_date descending

Pipeline sidebar - 6 steps complete

1 - Filter keep rows where email contains '@' applied
2 - Split Column split "name" by space → name_1, name_2 applied
3 - Convert Type signup_date → DATE applied
4 - Add Column days_since_signup = CURRENT_DATE - signup_date applied
5 - Filter status = 'active' AND days_since_signup <= 90 applied
6 - Sort signup_date descending applied

Editing and reordering steps

This is where pipelines beat spreadsheets. Click any card to edit it. Change the filter condition, rename the output column, adjust the sort direction. When you apply the edit, every step downstream re-executes automatically.

Need to move a step? You can reorder cards in the sidebar. The pipeline rebuilds in the new order. Want to remove step 2 entirely? Delete the card. Steps 3 through 6 re-execute against the new state (though they may break if they depended on columns from the deleted step - more on that next).

When steps break

Pipeline steps can have three statuses:

  • Applied (green) - the step ran successfully.
  • Warning (yellow) - the step ran but something looks off (like a column that might not exist in all cases).
  • Broken (red) - the step failed. This usually happens when a step references a column that was removed or renamed by an earlier edit.

If step 3 breaks, steps 4 through 6 won't execute because they depend on step 3's output. The sidebar makes this visible immediately - you can see exactly which step failed and why. Fix it, and the downstream steps recover.
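The cascade is easy to model: once one step fails, everything downstream is blocked rather than run against stale or missing columns. A hypothetical sketch of that propagation - the run_pipeline function and the callable-per-step shape are mine, not ExploreMyData's internals:

```python
# Hypothetical model of status propagation: a failure blocks all
# downstream steps instead of executing them.
def run_pipeline(steps):
    statuses = []
    failed = False
    for step in steps:
        if failed:
            statuses.append("blocked")
            continue
        try:
            step()
            statuses.append("applied")
        except Exception:
            statuses.append("broken")
            failed = True
    return statuses

# Step 2 fails (say, a reference to a deleted column), so step 3 is blocked.
statuses = run_pipeline([lambda: None, lambda: 1 / 0, lambda: None])
```

Fixing the broken step and re-running turns the blocked steps back to applied, which is the recovery behavior the sidebar shows.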

The SQL trail

Every card shows the SQL it generates. This isn't just for debugging - it's documentation. When you come back to this analysis in two weeks, the pipeline tells you exactly what was done and in what order. Each step has a plain-language description and the precise SQL. No guessing.

Compare this to a spreadsheet where transformations are invisible. You see the final result but have no idea what formulas, sorts, or manual edits got you there. The pipeline is a ledger of every transformation.

Tips for longer pipelines

  • Clean first, calculate second. Put filters and type conversions at the top. Calculations and aggregations come after the data is clean.
  • Name your output columns clearly. When you add a calculated column, give it a descriptive name. Future steps (and future you) will thank you.
  • Use the SQL view for debugging. If a result looks wrong, click the card and read the SQL. Often the issue is a column name mismatch or a type you forgot to convert.
  • Don't be afraid to delete and rebuild. Unlike a spreadsheet, deleting a pipeline step doesn't destroy your source data. The original file is untouched. You're always working on a derived view.

Build your pipeline →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData