Chapter 18 of 25
Parquet vs CSV: Why Columns Beat Rows for Analytics
Created May 28, 2026 Updated Jun 7, 2026
The data is the same. The query is the same. The layout on disk is different. That's the whole story.
CSV is row-oriented. Every row is written as one line; all of its columns sit next to each other on disk:
user_id,timestamp,amount,country,device,...
abc,2026-05-28T10:00,42.0,US,iOS,...
def,2026-05-28T10:01,17.5,DE,Android,...
To read it, you scan the file top-to-bottom and parse every byte. Text. Untyped. Compresses poorly because adjacent values are heterogeneous.
Parquet is column-oriented. Inside a file, all user_id values from a batch sit together, then all timestamp values, then all amount values. Binary, typed, self-describing — the schema is embedded.
Now look at this query against a 50-column events table with a million rows:
SELECT AVG(amount) FROM events WHERE user_id = 'abc'
A CSV reader has to read all 50 columns for all million rows, parse every field, then discard the 48 columns the query doesn't need. If the columns are of similar width, roughly 96% of the I/O (48 of 50 columns) is spent on data you immediately throw away.
Parquet reads only user_id and amount — two columns. The other 48 never come off disk. That's the column-scan win, before any other optimization fires.
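The difference can be made concrete with a small sketch. It is not a Parquet reader: a five-column toy table (hypothetical data) stands in for the 50-column one, a CSV string stands in for the row layout, and a dict of lists stands in for the column layout. The helper names and the "fields parsed" counters are assumptions made for illustration.

```python
import csv
import io

# Hypothetical 5-column table standing in for the 50-column events table.
HEADER = ["user_id", "timestamp", "amount", "country", "device"]
ROWS = [
    ["abc", "2026-05-28T10:00", "42.0", "US", "iOS"],
    ["def", "2026-05-28T10:01", "17.5", "DE", "Android"],
    ["abc", "2026-05-28T10:02", "8.0", "US", "iOS"],
]


def avg_amount_row_oriented(csv_text, user):
    """Scan CSV text: every field of every row gets parsed."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    uid, amt = header.index("user_id"), header.index("amount")
    fields_parsed, amounts = 0, []
    for row in reader:
        fields_parsed += len(row)  # the whole line is parsed, needed or not
        if row[uid] == user:
            amounts.append(float(row[amt]))
    return sum(amounts) / len(amounts), fields_parsed


def avg_amount_column_oriented(columns, user):
    """Touch only the two columns the query names."""
    user_ids, amounts_col = columns["user_id"], columns["amount"]
    values_read = len(user_ids) + len(amounts_col)
    matches = [a for u, a in zip(user_ids, amounts_col) if u == user]
    return sum(matches) / len(matches), values_read


# Row layout: one CSV line per row.
buf = io.StringIO()
csv.writer(buf).writerows([HEADER] + ROWS)
csv_text = buf.getvalue()

# Column layout: one list per column, with amount stored as a typed column.
columns = {name: [row[i] for row in ROWS] for i, name in enumerate(HEADER)}
columns["amount"] = [float(a) for a in columns["amount"]]

row_avg, row_cost = avg_amount_row_oriented(csv_text, "abc")
col_avg, col_cost = avg_amount_column_oriented(columns, "abc")
print(row_avg, row_cost)  # same answer, 15 fields parsed
print(col_avg, col_cost)  # same answer, 6 values read
```

Both paths return the same average. The only thing that changes is how much data each one touches, and that gap widens with every extra column the query ignores.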
The columnar layout enables two more wins that compound:
Type-specific compression. Because a column stores values of the same type — often with many repeats — it compresses far better than a heterogeneous row. Two techniques drive most of it:
- Dictionary encoding: repeating strings become short integer indices, with the dictionary stored once. "US" repeated 100,000 times becomes one entry plus 100,000 small ints.
- Run-length encoding (RLE): "US,US,US,US" becomes (US, 4).
For columns with low cardinality (countries, enums, status codes), the stored size can shrink by a factor of tens or even hundreds.
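Both techniques fit in a few lines of Python. This is a simplified sketch of the ideas only. Parquet's real encodings are more involved (for example, the format combines run-length encoding with bit-packing of the dictionary indices), and the function names here are illustrative, not part of any library.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


country = ["US", "US", "US", "US", "DE", "DE", "US"]

dictionary, indices = dictionary_encode(country)
print(dictionary)  # ['US', 'DE']
print(indices)     # [0, 0, 0, 0, 1, 1, 0]

runs = run_length_encode(country)
print(runs)        # [('US', 4), ('DE', 2), ('US', 1)]

# Decoding is a lookup, so the encoding is lossless.
assert [dictionary[i] for i in indices] == country
```

Run-length encoding only pays off when equal values sit next to each other, which is one reason sorted or clustered columns compress better.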
Predicate pushdown. Parquet files are split into row groups (often sized around 128 MB, depending on the writer's settings). The file footer stores min/max statistics per column per row group. A filter like WHERE user_id = 'abc' can read the footer (kilobytes) first and skip every row group whose min/max range doesn't cover 'abc', without reading any of that row group's data. Skipping works best when the file is sorted or clustered on the filter column; if values are scattered randomly, most row groups' ranges will contain the value and few can be skipped. Combine this with partition pruning (directory-level skipping by year=2026/month=05/) and large queries can avoid most of the data entirely.
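The skipping logic can be sketched with plain Python structures. Here each "row group" is a dict of column lists and the "footer" is a list of (min, max) pairs for user_id. Both are hypothetical stand-ins for the on-disk format, not the actual Parquet metadata layout.

```python
def build_footer(row_groups):
    """Per-row-group (min, max) of user_id, computed once at write time."""
    return [(min(g["user_id"]), max(g["user_id"])) for g in row_groups]


def avg_amount(row_groups, footer, user):
    """Read only row groups whose [min, max] range could contain `user`."""
    amounts, groups_read = [], 0
    for group, (lo, hi) in zip(row_groups, footer):
        if not (lo <= user <= hi):
            continue  # skipped using the footer statistics alone
        groups_read += 1
        amounts += [
            a for u, a in zip(group["user_id"], group["amount"]) if u == user
        ]
    avg = sum(amounts) / len(amounts) if amounts else None
    return avg, groups_read


# Hypothetical data: three small row groups.
row_groups = [
    {"user_id": ["aaa", "abc", "abd"], "amount": [1.0, 42.0, 3.0]},
    {"user_id": ["def", "dzz", "eee"], "amount": [17.5, 2.0, 9.0]},
    {"user_id": ["abc", "xyz", "zzz"], "amount": [8.0, 5.0, 6.0]},
]
footer = build_footer(row_groups)

avg, groups_read = avg_amount(row_groups, footer, "abc")
print(avg, groups_read)  # 25.0 2
```

The second row group is skipped because 'abc' sorts below its minimum. The third has a wide range ('abc' to 'zzz'), so it has to be read even though only one of its rows matches. Wide ranges like that are what unsorted data tends to produce.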
When CSV still wins:
- Streaming row-by-row processing. Append-only ingestion, line-at-a-time tailing.
- Hand-off to humans / Excel / BI tools. Anything that touches Sheets or Excel ends up as CSV at the boundary.
- Tiny data. Below a few MB the overhead of Parquet metadata exceeds the savings.
The typical production pattern is exactly this split: data lives in Parquet inside the pipeline (cheap analytical reads, fast aggregations, smaller storage footprint), then gets converted to CSV only at the boundary when an analyst needs to open it in Excel. Parquet is the warehouse format; CSV is the export format.
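The export step at the boundary amounts to transposing columns back into rows. A minimal sketch, assuming the data is already in memory as a dict of equal-length column lists (a hypothetical stand-in for a table loaded from Parquet):

```python
import csv
import io


def columns_to_csv(columns):
    """Transpose column lists into rows and serialize them as CSV text."""
    names = list(columns)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(names)
    # zip(*columns) turns N column lists into a stream of N-field rows.
    writer.writerows(zip(*(columns[n] for n in names)))
    return buf.getvalue()


columns = {
    "user_id": ["abc", "def"],
    "amount": [42.0, 17.5],
    "country": ["US", "DE"],
}
csv_text = columns_to_csv(columns)
print(csv_text)
```

Type information is lost at this point: 42.0 becomes the text "42.0", and whatever opens the file has to infer the type again. That is one reason to keep this conversion at the very edge of the pipeline.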
Full breakdown — CSV edge cases (escaping, encoding, type inference), JSON for nested data, Parquet internals (row groups, page compression, schema evolution): Data Storage Formats.