LZ optimal parsing
ZSTD encodes each match as three fields: litLen, matchLen, and matchOffset. These fields are encoded independently, each getting its own "price", fixed for an entire compressed block.
If we have the list of match candidates for an entire block, we can compute the matchOffset price for each candidate a priori, getting the benefits of maximal ILP and even employing AVX-512.
For matchLen prices, we should build one precomputed table for the entire compressed block. It may cover, e.g., lengths up to 256, with on-the-fly computation for larger lengths.
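A sketch of such a table (the cost model below is a made-up toy; a real implementation would derive prices from the block's entropy statistics, and all names here are illustrative):

```c
#include <stdint.h>

#define MIN_MATCHLEN   3
#define MAX_CACHED_LEN 256

static uint32_t matchLenPrice[MAX_CACHED_LEN + 1];

/* Toy stand-in for the real bit-cost model (~log2 of the length code);
   a real encoder would derive this from the block's entropy statistics. */
static uint32_t lenCost(uint32_t len) {
    uint32_t bits = 0;
    for (uint32_t v = len - MIN_MATCHLEN + 1; v > 1; v >>= 1)
        bits++;
    return 4 + bits;  /* hypothetical base cost plus extra bits */
}

/* Fill the table once per compressed block */
void fillMatchLenPrices(void) {
    for (uint32_t len = MIN_MATCHLEN; len <= MAX_CACHED_LEN; len++)
        matchLenPrice[len] = lenCost(len);
}

/* Table lookup for the common case, on-the-fly for rare long matches */
uint32_t getMatchLenPrice(uint32_t len) {
    return len <= MAX_CACHED_LEN ? matchLenPrice[len] : lenCost(len);
}
```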
Now the crucial part: let's compute each matchPrice[pos][len], i.e. the price of a match starting at each position with each possible length! Again, these can be computed independently, maximizing ILP.
Moreover, we can use CMOV to fly through positions, their candidates, and candidate lengths without any mispredicted jumps:
```c
curPos = position[curCandidate];
curOffsetPrice = offsetPrice[curCandidate];  // precomputed above
matchPrice[curPos][len] = price[len] + curOffsetPrice;
len++;
if (EQUIPROBABLE (len > maxLen[curCandidate])) {
    // Switch to the next candidate
    curCandidate++;
    len = startingLen[curCandidate];  // precomputed too: either MIN_MATCHLEN or 1 + prevCandidate.maxLen
}
```
Of course, matchPrice shouldn't be a rectangular matrix; instead, we should allocate only the necessary space for each position. We can also process a compressed block in smaller chunks to limit the matchPrice size (and the size of the per-candidate arrays too).
This loop is about 15 instructions, which means processing ~1e9 pos+len combinations per second on a 4 GHz Skylake, probably equivalent to ~100 MB/s of input data. It already looks much faster than existing optimal parsers, but not yet good enough for the real deal.
So let's dig deeper! Dropping price[len], this loop is actually the classic UNRLE problem: each offsetPrice[curCandidate] is repeated curCandidate.maxLen - prevCandidate.maxLen times. Knowing that, we can put SIMD to work.
Our new shiny loop will usually make one iteration per candidate (rather than one iteration per len):
```c
curLen = maxLen[curCandidate];
mask = ONES[curLen];
newPrices = broadcast( offsetPrice[curCandidate]);
simdPrices = select(mask, simdPrices, newPrices);
*matchPricesPtr = simdPrices;  // always write to RAM
// Go either to the next SIMD word or the next match candidate
if (filled entire SIMD word) {
    matchPricesPtr++;
    mask = 0;
} else {
    curCandidate++;
    if (new position)
        matchPricesPtr++;
}
```
It's again about 15 instructions, but now per candidate/SIMD word. With the usual ~1.5 candidates per position, this means > 500 MB/s.
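For reference, here is a scalar equivalent of this UNRLE expansion (array names follow the snippets above; in the SIMD version, one masked select per candidate fills a whole vector word):

```c
#include <stdint.h>

/* Expand per-candidate offset prices into a flat per-length array:
   candidate c covers lengths startingLen[c] .. maxLen[c], where
   startingLen[c] is either MIN_MATCHLEN or 1 + maxLen[c-1].
   Returns the number of entries written. */
int unrleOffsetPrices(const uint32_t *offsetPrice,
                      const uint32_t *startingLen,
                      const uint32_t *maxLen,
                      int numCandidates,
                      uint32_t *out)
{
    int n = 0;
    for (int c = 0; c < numCandidates; c++)
        for (uint32_t len = startingLen[c]; len <= maxLen[c]; len++)
            out[n++] = offsetPrice[c];
    return n;
}
```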
And now, once we have collected all the accessories, we can take on the final boss. The optimal parser loop does this (ignoring repeat codes):
```
for pos = 0 .. bufsize-1:
    price[pos+1] = min( price[pos+1], price[pos] + literalPrice[buf[pos]])
    for matchLen = MIN_MATCHLEN .. maxMatchLen[pos]:
        matchPrice = compute price of match with length matchLen starting at pos
        price[pos+matchLen] = min( price[pos+matchLen], price[pos] + matchPrice)
```
Ignoring literals and their prices for a moment, we can implement it as:
```c
curPrices = broadcast( price[pos]);
*pricePtr = min( *pricePtr, curPrices + *matchPtr);
if (no more SIMD words for this pos) {
    pos++;
    pricePtr++;
} else {
    pricePtr += SIMD width;
}
matchPtr += SIMD width;
```
Now let's bring literals back into the equation. Literals in ZSTD have unusual pricing: the price of a literal run is the sum of the individual literal prices (the bits required to encode each literal in the run), plus the price of encoding the run length, which we should precompute and save into a litRunPrice[runLength] table, just like the matchLen prices (see above).
This means that encoding a literal means extending the literal run that currently finishes at pos (and may have zero length) with one more literal, so the price of encoding a literal at position pos should be computed as:
```c
// Length of the literal run finishing the best encoding chosen for pos:
runLen = runLength[pos];
litPrice = literalPrice[buf[pos]] + litRunPriceDiff[runLen];
// using precomputed:
//   litRunPriceDiff[runLen] = litRunPrice[runLen+1] - litRunPrice[runLen]
// We can also precompute (and efficiently vectorize with AVX-512):
//   litPriceAt[pos] = literalPrice[buf[pos]]
```
So, once we have computed the final price[pos], the prices at pos+1 and pos+2 can be affected only by literals (since MIN_MATCHLEN == 3):
```c
// Earlier in the series: final price[pos] was established
price[pos+1] = min( price[pos+1], price[pos] + litPrice(pos));
price[pos+2] = min( price[pos+2], price[pos+1] + litPrice(pos+1));
// Now the final price[pos+1] and price[pos+2] are established.
// The real code should also update runLength[pos+1] and runLength[pos+2]
// according to the chosen path to these positions.
// We can also vectorize the litPrice(pos)/litPrice(pos+1) computation.
```
With all these prerequisites in place, we can take on the final boss. Let's recall it one last time:
```
for matchLen = MIN_MATCHLEN .. maxMatchLen[pos]:
    // Compute the price of the match with length matchLen starting at pos:
    matchPrice = matchLenPrice[matchLen] + matchOffsetPrice[pos][matchLen]
    price[pos+matchLen] = min( price[pos+matchLen], price[pos] + matchPrice)
```
This loop is perfectly vectorizable. The match price consists of three parts: litRun, matchLen, and offset. Of those, we account for the litRun price separately (see above). matchLenPrice[] is an array indexed by matchLen, the same for all matches. And matchOffsetPrice[pos][] is also an array indexed by matchLen, so the required matchPrice[pos][] array can be computed by a plain SIMD add.
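That per-position add can be written as a trivially auto-vectorizable loop (function and array names are illustrative):

```c
#include <stdint.h>

/* matchPrice[pos][len] = matchLenPrice[len] + matchOffsetPrice[pos][len]
   for `count` consecutive lengths. A plain elementwise add like this is
   exactly what compilers turn into SIMD adds. */
void addMatchPrices(const uint32_t *matchLenPrice,
                    const uint32_t *matchOffsetPrice,
                    uint32_t *matchPrice,
                    int count)
{
    for (int i = 0; i < count; i++)
        matchPrice[i] = matchLenPrice[i] + matchOffsetPrice[i];
}
```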
Similarly, the price[pos+matchLen] computation in the next line is also done with SIMD broadcast, add, and min operations (really, min should be replaced by cmp, since we need the flags to choose the proper values for other arrays like runLength[]).
So, the branchless main loop body is as simple as:
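A minimal sketch of such a body, emulating the SIMD broadcast, add, and min with a fixed-width scalar loop (W is an assumed lane count; names follow the snippets above):

```c
#include <stdint.h>

enum { W = 8 };  /* assumed SIMD lane count, e.g. 8 x 32-bit in AVX2 */

/* price[pos+len] = min(price[pos+len], price[pos] + matchPrice[pos][len])
   for W lengths at once; pricePtr points at the first affected price.
   The fixed-count loop stands in for broadcast + add + min. */
void mainLoopBody(uint32_t *pricePtr,
                  const uint32_t *matchPtr,
                  uint32_t posPrice)
{
    for (int i = 0; i < W; i++) {
        uint32_t cand = posPrice + matchPtr[i];
        pricePtr[i] = cand < pricePtr[i] ? cand : pricePtr[i];
    }
}
```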
It would be perfect if it weren't for two problems. First, it lacks the literal run price computation. We can sacrifice a little performance by recomputing it at each iteration. Actually, we suffer mainly because we lose the ability to skip positions without matches, rather than from the rare occurrences of match lengths > 10.
The second problem: any x86 CPU seriously suffers from writes to partially overlapping memory areas, and here we do that A LOT. One possible solution is to keep the prices, e.g., for the next 32 positions in vector registers, shifting data between them, one word per position, and use scalar writes to store into memory only the data for the current position.
Another possible solution is to make only aligned stores and loads, but shift data in vector registers to align them with the current position:
1. Compute one SIMD register with price[pos] + matchPrice[i] (3 SIMD operations: broadcast, load, and add)
2. Load the two corresponding price SIMD words from memory
3. Fill two SIMD registers with matchPrice[] shifted to match the loaded SIMD price words, filling the unused slots with 9999
4. Compare the pairs of corresponding registers and use the comparison result to conditionally select the best prices and their attributes like runLength[pos]
Taking into account that most of the work happens in steps 2..4, we should invert the logic and instead drop the unused part of the step 1 computation:
1. Compute two SIMD registers with mPrice[i] = price[pos] + matchPrice[i] (5 SIMD operations: broadcast, two loads, and two additions)
2. Load the corresponding simdPrice[pos] SIMD word from memory
3. Combine the two mPrice[i] registers to match the alignment of simdPrice[pos], filling the unused slots with 9999
4. Compare the aligned-combined mPrice[i] with simdPrice[pos] and use the comparison result to conditionally select the best prices and their attributes like runLength[pos]
We can also choose any other N:N, N:N+1, or N:N-1 combination, but 1:1 or 2:1 seems like the optimal choice. This way, we will make 1-2 iterations per position on average.
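Steps 1-4 of the 2:1 scheme, emulated in scalar code (names and the 9999 filler follow the steps above; shift is the assumed alignment offset between the candidate prices and the aligned price word):

```c
#include <stdint.h>

enum { W = 8 };   /* assumed SIMD lane count */
#define BIG 9999  /* filler for unused slots, as in the steps above */

/* mPrice[] holds 2*W candidate prices (the two "registers" of step 1).
   Lane i of the aligned simdPrice word corresponds to mPrice[i + shift];
   out-of-range lanes get BIG. Then cmp + select keeps the better price. */
void updateAlignedPrices(uint32_t *simdPrice,
                         const uint32_t *mPrice,
                         int shift)
{
    for (int i = 0; i < W; i++) {
        int j = i + shift;
        uint32_t cand = (j >= 0 && j < 2 * W) ? mPrice[j] : BIG;
        if (cand < simdPrice[i])  /* the cmp whose mask drives the select */
            simdPrice[i] = cand;
    }
}
```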
Further possible optimizations:
- move "price[pos] + " to after the mPrice[i] shuffling
- shuffle matchPrice[i] during precomputation, storing it as SIMD words aligned with simdPrice[pos]
- run the main loop over matchPrice[] SIMD words, loading curPos and baseLen from parallel arrays
- make the UNRLE loop store 1-4 SIMD words per match candidate, plus 9999 before MIN_MATCH and after maxMatchLen