Cleaning Messy Data
End-to-end walkthrough: upload a messy file, build a cleaning pipeline, export clean results.
The Problem
You have a CSV export with missing values, inconsistent text, duplicate rows, and columns with wrong types. Here's how to fix it all in a few minutes.
Step 1: Upload and Inspect
Upload your file and open the Column Explorer to spot problems. Look for:
- Columns with many
nullvalues (visible in value counts) - Inconsistent text (e.g., "US", "USA", "United States" in the same column)
- Columns detected as text that should be numeric or dates
Step 2: Remove Duplicates
Use Remove Duplicates to eliminate exact duplicate rows. Choose specific columns if you want to deduplicate by a key field (like email or order ID).
SELECT DISTINCT * FROM data
Step 3: Fill Missing Values
Use Fill Missing on columns with nulls. Choose a fill method:
- Literal: Replace nulls with a specific value (e.g., "unknown", 0)
- Forward: Use the previous row's value
- Backward: Use the next row's value
SELECT *, COALESCE(status, 'unknown') AS status FROM data
Step 4: Standardize Text
Use Text Transform to fix case issues (e.g., lowercase all email addresses). Then use Bulk Replace to map variations to a single value.
-- Standardize country codes
CASE WHEN country IN ('US', 'USA', 'United States') THEN 'US'
WHEN country IN ('UK', 'GB', 'United Kingdom') THEN 'UK'
ELSE country END
Step 5: Fix Data Types
Use Convert Type to cast columns to the right type. DuckDB uses TRY_CAST which returns null on invalid values instead of erroring.
Step 6: Filter Invalid Rows
Use Filter to remove rows that are still invalid after cleaning. For example, remove rows where a required field is still null.
SELECT * FROM data WHERE order_id IS NOT NULL AND amount > 0
Step 7: Export
Click Export to download your clean data as CSV or Parquet.
Tips
- Build your pipeline incrementally - apply one operation at a time and check the results
- Use the SQL preview to verify each step before applying
- You can click any pipeline card to edit its settings
- If something goes wrong, delete the card and try again