Parquet vs CSV: When to Use Which Format
CSV and Parquet are two of the most common data file formats, but they work in fundamentally different ways. Understanding when to use each can save you significant time and storage.
CSV: the universal format
CSV (Comma-Separated Values) stores data as plain text, one row per line, with values separated by commas (or tabs, in TSV). Every tool on every platform can read CSV. It's human-readable: you can open it in a text editor and understand the data.
The tradeoffs:
- No type information: every value is a string. "42" could be a number, a zip code, or a category. Tools have to guess.
- No compression: a 1GB CSV is 1GB on disk and 1GB in memory.
- Row-oriented: reading a single column requires scanning every row.
- Delimiter ambiguity: commas inside values, inconsistent quoting, and encoding issues are common.
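The type-guessing problem is easy to reproduce. Here is a minimal sketch using pandas (the data is made up for illustration): a zip code with a leading zero is silently turned into an integer unless you declare the type on every read.

```python
import io
import pandas as pd

# A zip code with a leading zero, stored as text in a CSV.
csv_data = io.StringIO("city,zip\nBoston,02134\nAustin,78701\n")

# By default pandas guesses the type: "02134" becomes the integer 2134.
guessed = pd.read_csv(csv_data)
print(guessed["zip"].tolist())  # [2134, 78701] -- leading zero lost

# The fix is to declare the column type explicitly on every read.
csv_data.seek(0)
explicit = pd.read_csv(csv_data, dtype={"zip": str})
print(explicit["zip"].tolist())  # ['02134', '78701']
```

Because CSV carries no type information, that `dtype` argument has to be repeated by every consumer of the file; a Parquet file would store the column as a string once, at write time.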
Parquet: the columnar format
Apache Parquet stores data in a columnar binary format. Instead of storing rows together, it stores all values for each column together. This design decision has major implications:
- Built-in types: integers, floats, strings, dates, and booleans are stored natively. No guessing.
- Compression: columnar layout compresses extremely well. A 1GB CSV might be 100-200MB as Parquet.
- Column pruning: reading 3 columns out of 100 only reads those 3 columns from disk.
- Predicate pushdown: filtering can skip entire row groups without reading them.
The tradeoff: Parquet is a binary format. You can't open it in a text editor. You need a tool that understands the format.
When to use CSV
- Small datasets (under 100MB) where simplicity matters
- Data interchange between systems that only support plain text
- Human review, when someone needs to eyeball the raw data
- One-off exports where format doesn't matter
When to use Parquet
- Large datasets (100MB+) where storage and speed matter
- Analytics workloads that read a subset of columns
- Data pipelines where type safety prevents bugs
- Archival storage where compression saves cost
- Sharing between Python (pandas/polars), R, Spark, and SQL engines
How to convert between them
With ExploreMyData, you can open a Parquet file and export it as CSV (opening a CSV file works the same way). Open your file, optionally apply filters or transformations, then click Export. The result downloads as CSV.
DuckDB WASM reads both formats natively, so there's no conversion overhead: just drag your file onto the page and start exploring.
Quick comparison
| Feature | CSV | Parquet |
|---|---|---|
| Storage format | Text (row-oriented) | Binary (columnar) |
| Compression | None | Snappy, Gzip, Zstd |
| Type safety | No (everything is text) | Yes (native types) |
| Human-readable | Yes | No |
| Read single column | Must scan all rows | Reads only that column |
| Typical size on disk | 1x (baseline) | 5-10x smaller |
| Ecosystem support | Universal | Python, R, Spark, DuckDB, Arrow |