Working with Large Files: Sampling and Limiting Rows
Your file has 2 million rows. Maybe it's a year of transaction logs, a full customer database export, or every event from a tracking system. You can't meaningfully scroll through all of it, and you don't need to. What you need is a representative slice - enough data to understand the structure, spot patterns, and decide what to do next.
ExploreMyData can handle large files because DuckDB WASM is a real analytical engine, not a JavaScript array in memory. But even with a fast engine, working with a focused subset makes exploration faster and your analysis clearer.
Strategy 1: Limit to the first N rows
The simplest approach. If you just want to see what the data looks like - column names, types, sample values - the first 1,000 rows are usually enough.
- Open the toolbar and select Limit from the Filter & Sort group.
- Enter the number of rows (e.g., 1000).
- Apply. The grid now shows only the first 1,000 rows.
This adds a LIMIT 1000 step to the pipeline. Every operation you apply after this - filters, transforms, aggregations - runs against this smaller set, which keeps things responsive.
Pipeline with Limit applied (showing first 1,000 of 2,000,000 rows):
Showing 1,000 of 2,000,000 rows • Pipeline: [Limit: 1,000]

| # | event_id | user_id | event_type | created_at | value |
|---|---|---|---|---|---|
| 1 | EVT-00001 | USR-4421 | page_view | 2025-01-01 00:00:04 | 1 |
| 2 | EVT-00002 | USR-8803 | click | 2025-01-01 00:00:11 | 1 |
| 3 | EVT-00003 | USR-2210 | purchase | 2025-01-01 00:00:33 | 49.99 |
The full 2M-row file is loaded, but operations only run against 1,000 rows. Instant response. Remove the Limit step when ready to work with the full dataset.
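Under the hood, a Limit step is just a LIMIT clause in the query the pipeline runs. The sketch below uses Python's built-in sqlite3 as a stand-in for DuckDB (the LIMIT semantics are the same); the table and column names are invented for illustration.

```python
import sqlite3

# Build a small stand-in table. ExploreMyData's real engine is DuckDB WASM;
# sqlite3 is used here only because it ships with Python.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_id TEXT, event_type TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f"EVT-{i:05d}", "page_view") for i in range(1, 2001)],
)

# The Limit step: the full table is loaded, but downstream operations
# only ever see the first N rows.
rows = con.execute("SELECT * FROM events LIMIT 1000").fetchall()
print(len(rows))  # 1000
```

Any later filter or aggregation would query this 1,000-row result instead of the full table, which is why the grid stays responsive.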
The downside: the first N rows aren't random. If your file is sorted by date, you're only seeing the earliest records. If it's sorted by customer ID, you might only see customers in one region. This is fine for a quick look, but not for analysis that needs to represent the full dataset.
Strategy 2: Top/Bottom N by a column
Instead of arbitrary first rows, you can take the top or bottom N rows based on a specific column. This is useful when you want to focus on extremes: the highest-revenue transactions, the most recent orders, or the oldest unresolved tickets.
- Select Top/Bottom from the Filter & Sort group.
- Choose "top" or "bottom."
- Enter the number of rows (e.g., 500).
- Select the column to rank by (e.g., "revenue" or "created_at").
Top 500 by revenue gives you the biggest deals. Bottom 100 by score gives you the lowest-performing items. Top 10,000 by date gives you the most recent records. Each of these is a meaningful sample for different questions.
Top 500 by revenue from the full 2M-row file - meaningful, not arbitrary:
Showing 500 of 2,000,000 rows • Pipeline: [Top 500 by revenue desc]

| # | order_id | customer | revenue | order_date |
|---|---|---|---|---|
| 1 | ORD-198204 | Acme Corp | 142,300.00 | 2025-11-03 |
| 2 | ORD-177891 | GlobalTech | 118,050.00 | 2025-09-17 |
| 3 | ORD-203441 | Sunrise Media | 76,400.00 | 2025-12-01 |
Instead of the first 500 rows (which could all be from January), you get the 500 highest-value orders from anywhere in the 2M-row dataset.
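A Top/Bottom step corresponds to ORDER BY plus LIMIT, so the selected rows can come from anywhere in the file. A minimal sketch, again using sqlite3 in place of DuckDB and made-up column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, revenue REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(f"ORD-{i:06d}", float(i % 997)) for i in range(1, 10001)],
)

# Top N by a column: rank the whole table, keep the N highest.
# "Bottom N" is the same query with ASC instead of DESC.
top = con.execute(
    "SELECT order_id, revenue FROM orders ORDER BY revenue DESC LIMIT 5"
).fetchall()
```

Unlike a plain LIMIT, this scans all 10,000 rows before keeping five, which is why the result is meaningful rather than arbitrary.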
Strategy 3: Filter to a meaningful subset
Often the best sample isn't random - it's a specific slice. One month of data. One region. One product category. This gives you a subset small enough to explore but coherent enough to draw conclusions from.
Filter by date range:
- Select Filter from the toolbar.
- Choose a date column (e.g., "order_date").
- Set the operator to "greater than or equal" with a value of "2025-10-01".
- Apply, then add a second filter: "less than" with "2026-01-01".
Now you're working with just Q4 data. If that's still too much, stack another filter on top - narrow to one region or one product line.
Filter by category:
- Filter the "region" column with "equals" set to "North America".
- Or filter "status" to "active" to exclude historical records.
Each filter is a pipeline step, so you can add and remove them freely. The data grid updates immediately.
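Stacked filters translate to WHERE predicates combined with AND. The sketch below assumes the column names used above (order_date, region) and uses sqlite3 as a stand-in; ISO-formatted dates compare correctly as strings in both engines.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_date TEXT, region TEXT, revenue REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2025-09-30", "EMEA", 10.0),           # before Q4 - excluded
        ("2025-10-15", "North America", 20.0),  # kept
        ("2025-12-31", "North America", 30.0),  # kept
        ("2026-01-01", "North America", 40.0),  # after Q4 - excluded
    ],
)

# Three stacked Filter steps: a date range plus a category filter.
q4_na = con.execute(
    """
    SELECT * FROM orders
    WHERE order_date >= '2025-10-01'
      AND order_date <  '2026-01-01'
      AND region = 'North America'
    """
).fetchall()
print(len(q4_na))  # 2
```

Removing a Filter step from the pipeline is equivalent to dropping its predicate from the WHERE clause, which is why filters can be toggled freely.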
Combining strategies
These approaches stack. You can filter to Q4 data, then take the top 1,000 by revenue. Or limit to 50,000 rows first for a quick overview, then remove the limit and apply targeted filters once you know what you're looking for.
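Stacked steps compose into one query: the filter narrows the set, then the ranking picks the top rows from within it. A sketch of "Q4 only, then top 10 by revenue", with invented data and sqlite3 standing in for DuckDB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, revenue REAL)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"2025-{(i % 12) + 1:02d}-15", i * 1.5) for i in range(1, 1001)],
)

# Filter step first (WHERE), then Top/Bottom step (ORDER BY ... LIMIT):
# the ranking only considers rows that survive the filter.
top_q4 = con.execute(
    """
    SELECT order_id, order_date, revenue FROM orders
    WHERE order_date >= '2025-10-01' AND order_date < '2026-01-01'
    ORDER BY revenue DESC
    LIMIT 10
    """
).fetchall()
```

The order of steps matters: filtering before ranking gives the top orders *within* Q4, whereas ranking first and filtering second could return fewer than 10 rows.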
A typical workflow with a large file:
- Start with a limit. Set it to 1,000 or 5,000 rows. Browse the data, check column types, open the Column Explorer on a few columns.
- Remove the limit. Delete the Limit step from the pipeline.
- Apply targeted filters. Now that you know what the columns contain, filter to the subset you actually care about.
- Use Top/Bottom if needed. Focus on the highest, lowest, or most recent records for your analysis.
Why this matters
Working with a 2-million-row file doesn't mean you need to see all 2 million rows. Most data exploration is about building intuition: what columns exist, what the values look like, where the problems are. You can build that intuition from a well-chosen subset, then apply what you learn to the full dataset.
The pipeline makes this safe. Every limit and filter is a reversible step. You never modify the original file, and you can always go back to the full dataset by removing your sampling steps.