How to Remove Test Data and Junk Rows from a CSV
Someone hands you an export from the production database. You open it, check the row count: 48,000 rows. Great. Then you start scrolling and see this:
- Order ID: TEST-001, customer: Test User, email: test@test.com, revenue: $0.00
- Order ID: DEMO-2024-05, customer: Demo Account, email: admin@example.com, revenue: $99,999.00
Test orders. Demo accounts. QA entries. Placeholder transactions with $0 revenue. Internal accounts with inflated dollar amounts. They're all mixed in with real customer data, and they're wrecking your numbers.
Your average order value is wrong because of the $0 test transactions. Your revenue total is inflated by the $99,999 demo entries. Your customer count includes fake accounts. You need to get rid of all of it before you can trust any calculation.
Here's how to clean it up in ExploreMyData.
Identify the patterns
Test data usually follows predictable patterns. Before you start filtering, spend a minute figuring out what yours looks like. Sort by the order ID column - test IDs often cluster together alphabetically because they start with "TEST" or "DEMO". Sort by email to spot the @test.com and @example.com addresses. Sort by revenue to find the $0.00 entries at the bottom.
In most exports, test data falls into a few categories:
- Prefixed IDs: order IDs starting with "TEST", "DEMO", "QA", or "SANDBOX"
- Fake emails: addresses containing "test.com", "example.com", or "@yourcompany.com"
- Zero-value transactions: $0.00 revenue, 0 quantity
- Placeholder names: "Test User", "Jane Doe", "Foo Bar"
- Internal accounts: employee email domains, specific account IDs
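These categories boil down to a handful of predicate checks. As a rough sketch outside the tool, here's what the same pattern detection could look like in plain Python (the column names order_id, email, customer_name, and revenue are assumptions based on the example export, and the pattern lists should be adapted to your own data):

```python
# Assumed junk-data patterns, based on the categories above.
TEST_ID_PREFIXES = ("TEST", "DEMO", "QA", "SANDBOX")
FAKE_EMAIL_DOMAINS = ("test.com", "example.com")
PLACEHOLDER_NAMES = {"Test User", "Jane Doe", "Foo Bar"}

def is_junk(row: dict) -> bool:
    """Return True if the row matches any test-data pattern."""
    if row["order_id"].startswith(TEST_ID_PREFIXES):
        return True
    if any(domain in row["email"] for domain in FAKE_EMAIL_DOMAINS):
        return True
    if row["customer_name"] in PLACEHOLDER_NAMES:
        return True
    if float(row["revenue"]) == 0.0:  # zero-value transaction
        return True
    return False
```

Running is_junk over a sample of rows before filtering is a quick way to confirm your pattern list actually catches the junk you spotted while sorting.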
Orders export sorted by order_id - test and demo rows cluster near the top alphabetically:
| order_id | customer_name | email | revenue |
|---|---|---|---|
| DEMO-2024-05 | Demo Account | demo@example.com | $99,999.00 |
| ORD-10042 | Sarah Chen | sarah@techwave.io | $249.00 |
| ORD-10043 | Marcus Webb | mwebb@acme.com | $89.00 |
| TEST-001 | Test User | test@test.com | $0.00 |
| TEST-002 | Test User | test@test.com | $0.00 |
| ORD-10044 | Priya Sharma | priya@dataflow.com | $399.00 |
The TEST-* and DEMO-* rows are test/demo entries to be removed. They're easy to spot once sorted.
Build your filter with the Condition Builder
Open Filter from the Filter & Sort group. This opens the Condition Builder, where you can combine multiple conditions with AND/OR logic.
For a typical cleanup, you want to keep rows that pass all of these conditions. Set the logic to AND and add:
- order_id does NOT start with TEST
- order_id does NOT start with DEMO
- email does NOT contain test.com
- email does NOT contain example.com
- revenue is not equal to 0
Click Apply. The generated SQL is a single WHERE clause:
WHERE NOT (order_id LIKE 'TEST%') AND NOT (order_id LIKE 'DEMO%') AND NOT (email LIKE '%test.com%') AND NOT (email LIKE '%example.com%') AND revenue != 0
One pipeline step. All five conditions evaluated together. The row count drops from 48,000 to 45,200 - meaning about 6% of your "data" was junk.
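If you wanted to reproduce the same five-condition AND filter outside the tool, it maps directly to a single boolean expression. A minimal Python sketch (column names are assumptions from the example export):

```python
def keep_row(row: dict) -> bool:
    """Mirror the generated WHERE clause: all five conditions must hold."""
    return (
        not row["order_id"].startswith("TEST")
        and not row["order_id"].startswith("DEMO")
        and "test.com" not in row["email"]
        and "example.com" not in row["email"]
        and float(row["revenue"]) != 0
    )

def filter_rows(rows: list[dict]) -> list[dict]:
    """Keep only rows that pass every condition (AND logic)."""
    return [r for r in rows if keep_row(r)]
```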
Five filter conditions combined with AND logic to exclude junk rows:
| # | Column | Condition | Value |
|---|---|---|---|
| 1 | order_id | does not start with | TEST |
| 2 | order_id | does not start with | DEMO |
| 3 | email | does not contain | test.com |
| 4 | email | does not contain | example.com |
| 5 | revenue | is not equal to | 0 |
All five conditions must be true for a row to be kept (AND logic). Row count: 48,000 → 45,200.
Clean up duplicates while you're at it
Test data has a way of creating duplicates too. A QA engineer runs the same test scenario five times, generating five rows with slightly different timestamps but otherwise identical data. Even after filtering out the obvious test rows, you might have duplicates in the real data.
After your filter step, add Remove Duplicates from the Filter & Sort group.
Choose the columns that define a unique record - maybe order_id alone, or a combination of email + order_date + product if order IDs aren't reliable. This generates SELECT DISTINCT ON (order_id) * and keeps the first occurrence of each.
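The keep-the-first-occurrence behavior is easy to sketch outside the tool as well. A minimal Python version, parameterized on whichever key columns you chose (the column names here are illustrative):

```python
def remove_duplicates(rows: list[dict], key_columns: tuple[str, ...]) -> list[dict]:
    """Keep the first occurrence of each unique key, dropping later repeats."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

Because the set tracks keys in input order, a QA scenario run five times with different timestamps collapses to the earliest row.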
Using OR logic for complex cases
Sometimes the patterns overlap in ways that need OR logic. Say you want to remove rows where the email contains "test.com" OR the customer name is exactly "Test User" OR the order ID starts with "QA". Any one of those conditions is enough to mark a row as junk.
In the Condition Builder, you can nest groups. Create an OR group with the three conditions above, then wrap the whole thing in a NOT. The result:
WHERE NOT (email LIKE '%test.com%' OR customer_name = 'Test User' OR order_id LIKE 'QA%')
This removes any row that matches at least one of the test patterns.
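The NOT-around-an-OR-group structure translates directly to code: one predicate where any match marks the row as junk, negated when deciding what to keep. A hedged sketch (column names assumed from the example):

```python
def matches_any_pattern(row: dict) -> bool:
    """OR group: any single matching condition marks the row as junk."""
    return (
        "test.com" in row["email"]
        or row["customer_name"] == "Test User"
        or row["order_id"].startswith("QA")
    )

def keep_clean(rows: list[dict]) -> list[dict]:
    """Wrap the OR group in NOT: keep rows that match none of the patterns."""
    return [r for r in rows if not matches_any_pattern(r)]
```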
Check your work
After filtering, do a quick sanity check. Sort by revenue descending - are there still suspiciously large values that look like test data? Sort by email - any remaining @yourcompany.com addresses that should be excluded? Check the row count: does the number make sense for the time period?
The pipeline makes this easy to iterate on. If you spot more junk, add another filter condition. If you removed too much, click into the filter step and adjust the conditions. Every change rebuilds the results instantly.
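The same three sanity checks can be scripted if you'd rather automate them between exports. A small sketch (the internal domain and column names are placeholders for your own):

```python
def sanity_report(rows: list[dict], internal_domain: str = "yourcompany.com") -> dict:
    """Post-filter checks: largest revenues, internal emails, row count."""
    revenues = sorted((float(r["revenue"]) for r in rows), reverse=True)
    internal = [r for r in rows if internal_domain in r["email"]]
    return {
        "row_count": len(rows),          # does this match the period?
        "top_revenues": revenues[:5],    # any suspiciously large values left?
        "internal_emails": len(internal),# should usually be zero after cleanup
    }
```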
Pipeline steps and their effect on row count:
| Step | Operation | Rows remaining | Rows removed |
|---|---|---|---|
| - | Original data | 48,000 | - |
| 1 | Filter (5 conditions, AND) | 45,200 | 2,800 |
| 2 | Remove Duplicates (on order_id) | 44,890 | 310 |
About 6.5% of the original file was junk or duplicate data.
Make it a habit
If you get regular exports from the same system, the test data patterns will be the same every time. The order of operations is always: filter out junk first, then analyze. Fifteen minutes of cleaning saves hours of second-guessing numbers that look off because a demo account with $99,999 in revenue is hiding in the data.