

The Parquet We Could Have

2024-01-02
interests: compression, data engineering (explained for beginners)

Parquet is an extremely popular format for storing tabular data, used at many major tech companies. I don't have a source for this, but I'm pretty confident there's over an exabyte (10^{18} bytes) of data stored in Parquet. Countless millions of dollars are spent each year storing it, transferring it over the network, and processing it. And yet Parquet does somewhat poorly at compressing numerical columns.

I forked the Rust implementation of Parquet to demonstrate the Parquet we could have: one with excellent compression for numerical columns. I did this by adding pcodec as an encoding.

Results

I benchmarked against the numerical columns of 3 public, real-world datasets (methodology at bottom):

[Bar charts comparing Pco Parquet against Zstd Parquet (default level and level=10) on the 3 datasets: Pco Parquet gets better compression ratios and decompression speeds, but worse compression speeds.]

Observations

  • Pco Parquet achieves a much higher compression ratio (44-158% higher) and slightly faster decompression than Zstd Parquet. Pco Parquet's stats here are almost identical to those of standalone pco.
  • Pco Parquet sometimes compresses noticeably slower than Zstd Parquet at the default compression level (1). This differs from standalone pco because Parquet adds considerable writing overhead. However, compared to Zstd Parquet at a high compression level (10), Pco Parquet wins on all 3 metrics.

I argue that compression ratio is usually the most important metric by far. Read throughput is often bottlenecked on network or disk IO, making compression ratio the only metric of interest. Even when latency is the primary concern, compression ratio can matter more than decompression speed, since it reduces the time spent fetching the compressed data. Plus, better compression decreases storage costs.
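
To put rough numbers on that (all hypothetical: 10 GB of uncompressed data, a 1 GB/s link, made-up ratios and decompression speeds), a better ratio can win on end-to-end read latency even when decompression is slower:

```rust
// Back-of-the-envelope read latency: fetch the compressed bytes, then
// decompress them. All numbers here are made up for illustration.
fn read_seconds(uncompressed_gb: f64, ratio: f64, network_gbps: f64, decompress_gbps: f64) -> f64 {
    let compressed_gb = uncompressed_gb / ratio;
    // decompression throughput measured in uncompressed GB/s
    compressed_gb / network_gbps + uncompressed_gb / decompress_gbps
}

fn main() {
    // 2x ratio, 2 GB/s decompression: 5.0 s fetch + 5.0 s decompress = 10.00 s
    println!("{:.2} s", read_seconds(10.0, 2.0, 1.0, 2.0));
    // 4x ratio, 1.5 GB/s decompression: 2.5 s fetch + 6.67 s decompress ≈ 9.17 s
    println!("{:.2} s", read_seconds(10.0, 4.0, 1.0, 1.5));
}
```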

From these benchmarks, pco's only weak spot is in compression speed, but I have plans to improve it in the future.

Implementation

Given pcodec, hacking it into Parquet wasn't terribly complicated. This is not at all meant to be a good or fully-featured implementation, just a simple proof of concept.

To Parquet, pco is a type of encoding, as opposed to a compression. I've mentioned this detail in a previous post about pco's predecessor. This is because pco cares about the data type it's encoding. It goes from (a sequence of numbers -> bytes), as opposed to just (bytes -> bytes).
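
As a rough sketch of that difference (API names from memory, using the pco and zstd crates; check their docs for exact signatures), compressing the same numbers both ways looks something like this:

```rust
use pco::standalone::{simple_compress, simple_decompress};
use pco::ChunkConfig;

fn main() {
    let nums: Vec<f64> = (0..1_000_000).map(|i| i as f64 * 0.01).collect();

    // pco is number-aware: (sequence of numbers -> bytes) and back.
    let pco_bytes = simple_compress(&nums, &ChunkConfig::default()).unwrap();
    let recovered: Vec<f64> = simple_decompress(&pco_bytes).unwrap();
    assert_eq!(recovered, nums);

    // A general-purpose codec like zstd only sees opaque bytes: (bytes -> bytes).
    let raw: Vec<u8> = nums.iter().flat_map(|x| x.to_le_bytes()).collect();
    let zstd_bytes = zstd::encode_all(raw.as_slice(), 1).unwrap();

    println!("pco: {} bytes, zstd: {} bytes", pco_bytes.len(), zstd_bytes.len());
}
```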

Encodings have other implications too, like how Parquet decides to split data pages. A fully-featured integration would require different treatment from Parquet's other encodings on this matter, but that's beyond the scope of this blog post.

Methodology

Benchmarks were done on a single Apple M3 performance core.

For Parquet I used format version 2, library version 49.0.0. I took a quick look at version 1, but (as expected) version 2 was better in most cases. I used the default settings for everything except chunk size, where I chose 2^{20} after some experimentation to find what worked best for zstd on these datasets.
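
In arrow-rs terms, I believe those settings look roughly like the following. This is my guess at the calls (in particular, I'm assuming "chunk size" corresponds to the row group size); the benchmark code in the pcodec repo is the source of truth.

```rust
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::{WriterProperties, WriterVersion};

fn main() {
    let _props = WriterProperties::builder()
        // format version 2
        .set_writer_version(WriterVersion::PARQUET_2_0)
        // zstd at the default level (1); the high-level runs used 10 instead
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(1).unwrap()))
        // chunk size of 2^20 -- assumed here to mean the row group size
        .set_max_row_group_size(1 << 20)
        .build();
}
```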

For pco I used the default compression configuration.

For details about the datasets, see the pcodec repo's benchmarks.

Update (2024-01-06)

I was asked for some benchmarks comparing the BYTE_STREAM_SPLIT encoding, which applies to floats, so I did another comparison pictured below. Only the Taxi dataset had floats, so that's what I used. I threw in the PLAIN encoding too. These are both different from the default of dictionary encoding, which is treated somewhat specially. Unfortunately arrow-rs doesn't support BYTE_STREAM_SPLIT yet, so I used pyarrow and could only make a fair comparison for compression ratio.

[Bar chart showing better compression ratio for Pco Parquet than various Parquet encodings with zstd on the f64 columns of the Taxi dataset.]

It may be surprising to some that BYTE_STREAM_SPLIT does worse than PLAIN in this case. The reason is that most columns in the Taxi dataset are decimal-ish; most of their values are approximately multiples of 0.01 or some other base. As a result, there are obvious patterns in their mantissa bytes that zstd can exploit. But if we split the floats into 8 separate streams, 1 for each byte, then zstd can't exploit the correlation between mantissa bytes anymore.
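
To see what those patterns look like, here's a tiny snippet (my own illustration, not from the benchmark) that prints the little-endian bytes of a few decimal-ish doubles. Note how runs like ae 47 e1 7a recur across values; PLAIN keeps each value's 8 bytes contiguous so zstd can match those runs, whereas BYTE_STREAM_SPLIT scatters them across 8 separate streams.

```rust
// Multiples of 0.01 have repeating binary expansions, so their mantissa bytes
// come from a small family of recurring runs (e.g. `ae 47 e1 7a` below).
fn main() {
    for x in [0.01_f64, 0.02, 1.23] {
        let hex: Vec<String> = x.to_le_bytes().iter().map(|b| format!("{b:02x}")).collect();
        println!("{x}: {}", hex.join(" "));
        // prints:
        // 0.01: 7b 14 ae 47 e1 7a 84 3f
        // 0.02: 7b 14 ae 47 e1 7a 94 3f
        // 1.23: ae 47 e1 7a 14 ae f3 3f
    }
}
```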