
pcodec

2023-12-31
interests: compression, data engineering

I've made an upgrade to Quantile Compression, and I'm calling it pcodec, or pco ("pico") for short.

Just like its predecessor, pco is a format and algorithm for compression of numerical sequences. Columnar and time series data are the most obvious candidates. Pco's biggest improvement over Quantile Compression is in decompression speed, making it competitive in many more regimes. On my machine, it decompresses at 2-4GB/s on a single core. Compared to Quantile Compression, it typically gets

  • 300-400% faster decompression,
  • much better compression for certain floating point distributions,
  • ~1% better compression for other numerical data on average, and
  • 10-40% faster compression.

Why break with the .qco format?

Both .qco and .pco allow non-breaking format changes. I could have made .qco compress to a new format as long as the new decompression code could handle both old and new formats. I've made non-breaking changes several times already, but I opted to break in this case because the magnitude of the change was so large. Almost none of the original .qco decompression code could be reused for the new format.

The changes were so large because I've learned a lot about CPU performance in the 2.5 years since I started Quantile Compression.

Here's a comparison of the .qco format on the left vs .pco on the right:

[Diagram: how .qco files are structured (left) vs. how .pco files are structured (right)]

The major format changes include:

  • using tANS for entropy coding instead of Huffman codes.
  • using batches of numbers. Suppose the ith number is represented by the bit sequences a_i and b_i. Quantile Compression would encode the numbers as the concatenation a_0b_0a_1b_1\ldots. Pcodec encodes them in batches of 256 as a_0a_1\ldots a_{255}b_0b_1\ldots b_{255}a_{256}\ldots. There's a small sketch of this layout right after this list.
  • supporting multiple latent variables. Quantile Compression would always decompose a number into a bin (formerly known as prefix) and offset, but for certain types of data, pco decomposes each number into 2 bins and 2 offsets, corresponding to 2 latents. These latents get combined into the final number. This is handy for things like decimal-valued data, where we can express each number as x = l_0 * \text{float_multiplier} + l_1 * \text{machine_epsilon}.
  • (not in diagrams) using little-endian encoding instead of big-endian.
  • (not in diagrams) using a fixed number of offset bits for each bin, instead of a conditionally-resolved number of bits. For instance, suppose we use the bin [0, 5]. There are 6 possible values in that bin, so pco will always use 3 offset bits, covering [0, 7]. Quantile Compression would sometimes use 2 bits and sometimes 3, covering [0, 5] exactly.
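
Here's a minimal sketch of the batching layout from the second bullet (my own toy illustration, not pco's encoder); strings stand in for the bit sequences a_i and b_i:

const BATCH_SIZE: usize = 256;

// .qco-style: a_0 b_0 a_1 b_1 ...
fn interleaved(n: usize) -> Vec<String> {
  (0..n)
    .flat_map(|i| [format!("a_{i}"), format!("b_{i}")])
    .collect()
}

// .pco-style: within each batch of 256 numbers, all the a's, then all the b's
fn batched(n: usize) -> Vec<String> {
  let indices: Vec<usize> = (0..n).collect();
  let mut out = Vec::new();
  for batch in indices.chunks(BATCH_SIZE) {
    out.extend(batch.iter().map(|i| format!("a_{i}")));
    out.extend(batch.iter().map(|i| format!("b_{i}")));
  }
  out
}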

CPU performance learnings

Keep the frequently-accessed state of your program small so that it fits in cache. Pco mostly keeps its state under 32kB, which is small enough to fit in most L1 caches.

Do fewer passes over your data. This reduces how many times your program has to pull the data out of main memory/disk/wherever. A related phenomenon: would-be performance improvements to your code won't help if you're bottlenecked on memory bandwidth. For example, if your memory bandwidth is slower than serial addition, SIMD addition won't speed up the addition of large vectors. SIMD is most useful when you have multiple operations to do on your data while it's held in registers or a fast cache.
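
As a toy example of the fewer-passes point (my illustration, not pco code), fusing two loops into one means each element only gets pulled through the memory hierarchy once:

fn two_passes(data: &[f64]) -> (f64, f64) {
  // each pass streams the whole slice out of memory again
  let sum: f64 = data.iter().sum();
  let sum_of_squares: f64 = data.iter().map(|x| x * x).sum();
  (sum, sum_of_squares)
}

fn one_pass(data: &[f64]) -> (f64, f64) {
  // one streaming pass; each element is reused while it's still in a register
  data.iter().fold((0.0, 0.0), |(sum, sq), &x| (sum + x, sq + x * x))
}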

If you have a register-sized variable that gets modified in your hot loop, ensure it gets held in a register. In some cases, code like this will actually incur instructions to load and store self.foo on each iteration:

impl State {
  fn do_stuff(&mut self, data: &[u64]) { // element type chosen just for illustration
    for &x in data {
      // do something modifying self.foo, e.g.
      self.foo = self.foo.wrapping_add(x);
    }
  }
}

Even L1 cache has a latency of several cycles, so it can be more efficient to use an explicit stack variable for foo and write it back at the end:

fn do_stuff(data: &[u64], state: &mut State) {
  let mut foo = state.foo; // keep the hot value in a local, register-friendly variable
  for &x in data {
    // do something modifying foo, e.g.
    foo = foo.wrapping_add(x);
  }
  state.foo = foo; // write back once, after the loop
}

Make the sizes of your data elements nice. Situation 1: your hot loop does random access on a Vec<Foo> where each Foo is 7 bytes. Your program might calculate i * 7 to get the memory address it needs, which adds a few cycles of latency to the loop. You can probably improve performance by padding Foo to 8 bytes (e.g. by raising its alignment to 8), letting your processor effectively compute i << 3 instead; shifting is faster than multiplication. Situation 2: you're looping over j and reading a[j] and b[j] on each iteration, where a holds 32-bit ints and b holds 64-bit ints. Your program might increment the memory address for each of these separately, since the increments differ. If it's an option, you may see a small speedup by changing a and b to hold the same data type, saving your processor an instruction on every iteration.
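
For instance, here's a minimal sketch of situation 1 (hypothetical types, not pco's): padding a 7-byte element up to 8 bytes so that indexing compiles to a shift instead of a multiply:

// 7-byte elements: indexing a Vec<Packed> costs base + i * 7
#[repr(C)]
struct Packed {
  bytes: [u8; 7],
}

// padded to 8 bytes: indexing becomes base + (i << 3)
#[repr(C, align(8))]
struct Padded {
  bytes: [u8; 7],
}

fn main() {
  assert_eq!(std::mem::size_of::<Packed>(), 7);
  assert_eq!(std::mem::size_of::<Padded>(), 8);
}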

Look at the assembly code. This is the most important learning, because it generates new learnings and tells you whether your code has any of the above maladies. This has happened to me several times: I look at the Rust code, think it couldn't get any faster, check the assembly code, realize it's wasting a few instructions on something silly, tweak the Rust code, and see a performance improvement. It's a bit laborious, and it requires learning to read assembly, but it's essential for answering my "why" questions. And the good news is that you only need to read assembly, never write it. Once I learned what I needed for x86_64, it was pretty easy to adjust to AArch64.

I mentioned a few more performance things, including branching, in this reddit post.

Why pco decompresses so much faster than Quantile Compression

Many of the reasons are related to some of the format changes I mentioned above:

  • The Huffman codes Quantile Compression generated were too big to fit into a single table, requiring up to 3 lookups to find a leaf node. The code branched to decide whether it had hit a leaf node or not, hurting performance. tANS gets good compression with a single, nicely-sized table, eliminating the branching. I first heard of ANS from a random Redditor's comment on one of my posts; they had good instincts.
  • By using batches of numbers, pco can use SIMD to decode all the offsets. This was a substantial speedup I learned from Fabien Giesen.
  • Little-endian encoding is counterintuitive at first, but it actually simplified some of my code. Plus, all modern processors are little-endian by default (though many have big-endian support too).
  • The biggest speedup was using a fixed number of bits for each bin's offset instead of a conditionally-resolved number. This removed a high-entropy branch from the hot loop. On my machine, this single-handedly saved 3ns/number - compare that to the total decompression time of under 2ns per number now. Fortunately, the fractions of bits lost in this way were not important for compression ratio.
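
To make that last bullet concrete, here's a simplified sketch (offset_bits is a hypothetical helper of mine, not pco's real reader): with fixed-size offsets, the number of bits to read depends only on the bin, never on the value being decoded:

// every offset in a bin uses ceil(log2(width)) bits, even if that over-covers
// the bin slightly (e.g. the bin [0, 5] gets 3 bits, covering [0, 7])
fn offset_bits(bin_width: u64) -> u32 {
  if bin_width <= 1 {
    0
  } else {
    64 - (bin_width - 1).leading_zeros()
  }
}

fn main() {
  assert_eq!(offset_bits(6), 3); // bin [0, 5]
  assert_eq!(offset_bits(8), 3); // bin [0, 7]
  assert_eq!(offset_bits(9), 4); // bin [0, 8]
}

The decompressor can look this count up once per bin, so the per-number hot loop has no data-dependent branch.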

As I simplified decompression, reading and understanding the assembly became more tractable, allowing me to invest in the previous section's learnings.

Pcodec is more extensible to future change

During early development of Quantile Compression I encountered a challenge: millisecond timestamps in a microsecond-precise data type. They were all multiples of 1000, and any good codec should exploit that. So I made an amendment to the formula for a number. Instead of

x = \text{bin.lower_bound} + \text{offset}
we now had
x = \text{bin.lower_bound} + \text{offset} * \text{bin.gcd}
This posed one new problem and failed to fix an old one.

The new problem: multiplication is kinda slow. I worked around this by making specialized code: one loop for when all the GCDs are 1, and another for when they aren't (via a Rust generic type). You only pay for multiplication if you use it.
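
Here's a rough sketch of that kind of specialization (a const-generic rendering of the idea; my simplification, not Quantile Compression's actual code, which used a generic type):

// two monomorphized loops: with HAS_GCD = false the multiplication is
// compiled away entirely; with HAS_GCD = true it's included
fn decode_bin<const HAS_GCD: bool>(lower_bound: u64, gcd: u64, offsets: &[u64], dst: &mut Vec<u64>) {
  for &offset in offsets {
    let x = if HAS_GCD {
      lower_bound + offset * gcd
    } else {
      lower_bound + offset
    };
    dst.push(x);
  }
}

Callers pick the specialization once per chunk, e.g. decode_bin::<false>(...) when every GCD is 1, so the common case never pays for the multiplication.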

The old problem: decimal floats still got bad compression. Sometimes people compress floats that are all approximately(!) multiples of (say) 0.01. Or \pi. The GCD approach couldn't handle that.

For pco, I found an approach that keeps the code simple, handles both integer and float multipliers, and can be extended to many new types of special data if I become aware of them. Instead of changing the classic formula for combining offsets and bin lower bounds, we decompose each number into multiple latent variables. The new formula for a number with a GCD becomes

x = l_0 * \text{chunk.multiplier} + l_1
where l_0 and l_1 can be obtained by the classic formula:
l_i = \text{bin.lower_bound}_i + \text{offset}_i
And I've already mentioned the new formula for float multiples in a previous section.
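
To make that concrete with the millisecond-timestamp example (hypothetical names and numbers, not pco's real API):

// l_i = bin.lower_bound_i + offset_i
fn decode_latent(lower_bound: u64, offset: u64) -> u64 {
  lower_bound + offset
}

// x = l_0 * chunk.multiplier + l_1
fn combine(l0: u64, l1: u64, multiplier: u64) -> u64 {
  l0 * multiplier + l1
}

fn main() {
  // microsecond-precise timestamps that are all multiples of 1000:
  // l_0 carries the millisecond count; l_1 (the remainder) is always 0
  let multiplier = 1000;
  let l0 = decode_latent(1_700_000_000_000, 42);
  let l1 = decode_latent(0, 0);
  assert_eq!(combine(l0, l1, multiplier), 1_700_000_000_042 * 1000);
}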

Very often, a latent variable is trivial, like l_1 in the case of microsecond-precise millisecond timestamps. Instead of fancy generics, pco turns these off with an if check. Thanks to batching, checks like this can run outside of the hot loop without hurting performance.
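
A rough sketch of how such a check stays out of the hot loop (hypothetical structure, not pco's real decoder): the decision runs once per batch of 256, not once per number:

const BATCH_SIZE: usize = 256;

fn decode_batch(l0s: &[u64; BATCH_SIZE], l1s: Option<&[u64; BATCH_SIZE]>, multiplier: u64, dst: &mut [u64; BATCH_SIZE]) {
  // one check per batch, outside the per-number loops
  match l1s {
    Some(l1s) => {
      for i in 0..BATCH_SIZE {
        dst[i] = l0s[i] * multiplier + l1s[i];
      }
    }
    None => {
      // trivial second latent: skip decoding it entirely
      for i in 0..BATCH_SIZE {
        dst[i] = l0s[i] * multiplier;
      }
    }
  }
}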

And while Quantile Compression had hard-coded the special case of GCDs, pco is robust to future change. Adding a new mode requires no meaningful change to the format or data structures.

Parting words

  • Build flawed stuff, but learn as you go.
  • CPUs are lonely, maligned creatures waiting for someone who finally understands them.
  • Quantile Compression was good. Pcodec is fast, extensible, and better.