Cleaning Messy Data

End-to-end walkthrough: upload a messy file, build a cleaning pipeline, export clean results.

The Problem

You have a CSV export with missing values, inconsistent text, duplicate rows, and columns with wrong types. Here's how to fix it all in a few minutes.

Step 1: Upload and Inspect

Upload your file and open the Column Explorer to spot problems. Look for:

  • Columns with many null values (visible in value counts)
  • Inconsistent text (e.g., "US", "USA", "United States" in the same column)
  • Columns detected as text that should be numeric or dates

Step 2: Remove Duplicates

Use Remove Duplicates to eliminate exact duplicate rows. Choose specific columns if you want to deduplicate by a key field (like email or order ID).

SELECT DISTINCT * FROM data

Step 3: Fill Missing Values

Use Fill Missing on columns with nulls. Choose a fill method:

  • Literal: Replace nulls with a specific value (e.g., "unknown", 0)
  • Forward: Use the previous row's value
  • Backward: Use the next row's value
SELECT *, COALESCE(status, 'unknown') AS status FROM data

Step 4: Standardize Text

Use Text Transform to fix case issues (e.g., lowercase all email addresses). Then use Bulk Replace to map variations to a single value.

-- Standardize country codes
CASE WHEN country IN ('US', 'USA', 'United States') THEN 'US'
     WHEN country IN ('UK', 'GB', 'United Kingdom') THEN 'UK'
     ELSE country END

Step 5: Fix Data Types

Use Convert Type to cast columns to the right type. DuckDB uses TRY_CAST which returns null on invalid values instead of erroring.

Step 6: Filter Invalid Rows

Use Filter to remove rows that are still invalid after cleaning. For example, remove rows where a required field is still null.

SELECT * FROM data WHERE order_id IS NOT NULL AND amount > 0

Step 7: Export

Click Export to download your clean data as CSV or Parquet.

Tips

  • Build your pipeline incrementally - apply one operation at a time and check the results
  • Use the SQL preview to verify each step before applying
  • You can click any pipeline card to edit its settings
  • If something goes wrong, delete the card and try again