CASSANDRA-17021: Support ZSTD dictionary compression #4399
Conversation
Force-pushed from 9279a19 to 7bf7e30.
Resolved review threads (outdated):
src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java
src/java/org/apache/cassandra/tools/nodetool/TrainCompressionDictionary.java (two threads)
conf/cassandra.yaml (outdated):
```yaml
compression_dictionary_refresh_interval: 3600s

# Initial delay before starting the first dictionary refresh cycle after node startup.
# This prevents all nodes from refreshing simultaneously when the cluster starts.
```
Does that mean all nodes would need to be started at the very same moment? Is this planned to be different per node in a real scenario? Because if all nodes have 10s, then all of them would have the same delay set, hence we would have the "simultaneous" situation again.
Maybe having a random initial delay in some range (e.g. 5 to 20 seconds) would be better?
I do not think starting all nodes at the same moment is a requirement. The refresh interval just defines how frequently each node should check the system table. When a new dictionary is trained, the node broadcasts a message to all live nodes to notify them that the dictionary is available; this happens outside of the periodic polling.
Combining both, you can consider the interval as the upper bound on how long it takes for the dictionary to be refreshed (when broadcasting fails at some nodes).
When some nodes are running with the prior dictionary or no dictionary, it is still not a huge issue from the perspective of reading and writing SSTables. Eventually, all nodes will pick up the latest.
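For readers skimming the thread, here is a minimal sketch of the poll-plus-broadcast scheme described above. The class and method names are illustrative only, not the actual classes in this patch, and it assumes dictionary ids are monotonically increasing.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a node periodically polls a system table for the latest
// dictionary (the upper bound on staleness) and also accepts push notifications
// when a peer announces a newly trained dictionary.
public class DictionaryRefreshSketch
{
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long currentDictionaryId = -1;

    public void start(long initialDelaySeconds, long refreshIntervalSeconds)
    {
        scheduler.scheduleWithFixedDelay(this::refreshFromSystemTable,
                                         initialDelaySeconds,
                                         refreshIntervalSeconds,
                                         TimeUnit.SECONDS);
    }

    // Periodic path: the fallback when a broadcast was missed.
    private void refreshFromSystemTable()
    {
        readLatestDictionaryId().ifPresent(this::maybeAdopt);
    }

    // Push path: invoked by the messaging layer when a peer broadcasts
    // that a new dictionary is available.
    public void onDictionaryAnnounced(long dictionaryId)
    {
        maybeAdopt(dictionaryId);
    }

    private synchronized void maybeAdopt(long dictionaryId)
    {
        if (dictionaryId > currentDictionaryId)
            currentDictionaryId = dictionaryId; // loading and installing the dictionary would go here
    }

    private Optional<Long> readLatestDictionaryId()
    {
        return Optional.empty(); // placeholder for a system-table read
    }
}
```

With both paths in place, the broadcast keeps nodes close to current and the periodic poll only matters when a broadcast is lost.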
What is the overhead of this? This seems like a huge bikeshed over something that has extremely low overhead.
@rustyrazorblade, I am not sure if I understand your comment. I would like to keep the settings as they are for now and not introduce randomized delay unless we see an actual need. If the default interval turns out to be sub-optimal, we can adjust that directly.
Sorry for the confusion. I'm not advocating for randomized delay. The point I was trying to make is that this operation appears to be so lightweight that debating whether it should run once a day or once an hour seems like a bikeshed. We're talking about an incredibly trivial amount of data and a very lightweight operation. Doing it once a day vs. once an hour, or even once a minute, is like the difference between one and three grains of sand on an entire beach.
conf/cassandra.yaml (outdated):
```yaml
# How long to keep compression dictionaries in the cache before they expire.
# Expired dictionaries will be removed from memory but can be reloaded if needed.
# Min unit: s
compression_dictionary_cache_expire: 3600s
```
It's probably worth pointing out that this is the same value (3600s) as compression_dictionary_refresh_interval. It feels like there might be some kind of a problem when both periods end at the same time.
Why such an aggressive expiration time? I'd make this 24h.
compression_dictionary_cache_expire and compression_dictionary_refresh_interval are not strongly connected, since they apply to different components. Cache entry expiry is renewed on every access, while compression_dictionary_refresh_interval only defines how frequently to read from the system table.
If there is no access to a cache entry for 1 hour, it is evicted. The chance is low that the dictionary is wanted right after eviction; even in that scenario, the cache entry is populated again from compressionInfo or the system table.
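A minimal sketch of the expire-after-access behaviour described here, assuming a Caffeine-style cache; the loader, key type, and class name are hypothetical, not the patch's CompressionDictionaryCache API:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

// Illustrative only: expire-after-access means every read renews the entry's
// lifetime, and an evicted dictionary is simply reloaded on the next lookup.
public class DictionaryCacheSketch
{
    private final LoadingCache<Long, byte[]> dictionaries = Caffeine.newBuilder()
        .expireAfterAccess(Duration.ofSeconds(3600)) // compression_dictionary_cache_expire
        .build(this::loadDictionary);                // repopulate on a cache miss

    public byte[] get(long dictionaryId)
    {
        return dictionaries.get(dictionaryId);
    }

    // Placeholder for loading from the SSTable's compressionInfo or the system table.
    private byte[] loadDictionary(long dictionaryId)
    {
        return new byte[0];
    }
}
```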
I am open to 24 hours. Was just placing some default values in the initial patch.
I think the eviction code is broken, see my full comment in JIRA. When compacting a table that I left overnight (assuming the dictionary was evicted) I get this:
```
WARN [CompactionExecutor:869] 2025-10-08T17:16:05,033 SSTableWriter.java:157 - Failed to open /Users/jhaddad/dev/cassandra/bin/../data/data/chunk4_2/keyvalue-1b255f4def2540a600000000000000f6/pa-2-big for writing
java.lang.IllegalStateException: Dictionary is being closed
	at org.apache.cassandra.io.compress.ZstdDictionaryCompressor.lambda$getOrCreate$1(ZstdDictionaryCompressor.java:100)
```
This should be fixed in 1ae168a.
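For context on the failure mode above, a rough sketch of a reference-count guard that keeps a lookup from racing with an eviction-triggered close; this is illustrative only and not the actual change in 1ae168a:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the race being discussed: eviction may close a shared
// dictionary while a compaction thread is trying to reuse it. Guarding the
// hand-off with a reference count lets the lookup reload instead of failing.
public class RefCountedDictionarySketch
{
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = held by the cache

    // Returns true if the caller now holds a reference; false if the dictionary
    // is already being closed and the caller should load a fresh copy instead.
    public boolean tryRef()
    {
        while (true)
        {
            int current = refs.get();
            if (current <= 0)
                return false; // closing or closed
            if (refs.compareAndSet(current, current + 1))
                return true;
        }
    }

    public void release()
    {
        if (refs.decrementAndGet() == 0)
            close(); // the last holder frees the native resources
    }

    private void close()
    {
        // freeing ZstdDictCompress / ZstdDictDecompress would go here
    }
}
```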
Resolved review thread:
src/java/org/apache/cassandra/db/compression/CompressionDictionaryCache.java
```java
{
    if (options == null || !options.containsKey("maxSamplingDurationSeconds"))
    {
        throw new IllegalArgumentException("maxSamplingDurationSeconds parameter is required for manual dictionary training");
    }
```
Can't that have some default when missing? It seems like a hassle to be forced to include it all the time.
The nodetool command is supposed to always provide a value, so the exception should never be thrown. I had been weighing both options for a while; the reasons for not providing a default here are that 1) it is not going to be used, and 2) two default values, one in nodetool and one in the server, could diverge and create confusion.
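To illustrate the "single default lives in nodetool" argument, a hypothetical sketch; the constant, its value, and the helper are invented for illustration, and only the maxSamplingDurationSeconds key comes from the patch:

```java
import java.util.Map;

// Illustrative only: the one default is defined on the nodetool side and always
// sent to the server, so the server can simply require the key to be present.
public class TrainingOptionsSketch
{
    // Hypothetical default; the real nodetool flag and value may differ.
    private static final long DEFAULT_MAX_SAMPLING_DURATION_SECONDS = 600;

    public static Map<String, String> buildOptions(Long maxSamplingDurationSeconds)
    {
        long effective = maxSamplingDurationSeconds != null
                         ? maxSamplingDurationSeconds
                         : DEFAULT_MAX_SAMPLING_DURATION_SECONDS;
        return Map.of("maxSamplingDurationSeconds", Long.toString(effective));
    }
}
```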
What about using a max # of samples instead of a duration? Duration is kind of weird, highly variable. Then a default also makes sense.
The reason for using a duration is to have a time bound for sampling. If using a max # of samples, there is a chance that the tool runs too long, even indefinitely, if no flush and no compaction occurs.
Having a time bound at least terminates the tool process in a controlled manner. I acknowledge that having training fail after 10 minutes is frustrating. It is up to operators to configure the tool properly; also, the SSTable-based sampling that you suggested and I just added should solve the issue better.
@yifan-c well, with a big gun comes a lot of responsibility :) I would also introduce a number of samples; if we are worried about the tool never ending, I would add duration as an additional boundary.
For example, max samples of 100,000 OR a duration of 6 hours; whichever happens first ends it.
@rustyrazorblade what do you think?
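A minimal sketch of the dual-bound idea (stop at max samples or max duration, whichever comes first); all names here are hypothetical, not the patch's trainer API:

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: sampling stops when either the sample budget or the time
// budget is exhausted, whichever happens first.
public class DualBoundSamplerSketch
{
    // Hypothetical source of sample chunks (flushed writes or sstable chunks).
    public interface SampleSource
    {
        byte[] nextSample();
    }

    public int collectSamples(SampleSource source, int maxSamples, long maxDurationSeconds)
    {
        long deadlineNanos = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxDurationSeconds);
        int collected = 0;
        while (collected < maxSamples && System.nanoTime() < deadlineNanos)
        {
            byte[] sample = source.nextSample();
            if (sample == null)
                break; // no more data to sample
            // feeding the sample to the dictionary trainer would go here
            collected++;
        }
        return collected;
    }
}
```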
Sounds good. I am open to removing the write-based sampling from the manual training tool invocation.
I would like to clarify whether we want to run flush on the table regardless, or only when there is no live sstable. I think the former is probably fine, as the training tool is used very sparsely, possibly only once.
That's a good question. What do you do when there are 10 SSTables already (or if somebody is changing the compression algorithm)? Do you take the last (newest) one or the first (oldest) one? I would take the newest one.
What if you flush and that flush does not contain enough data, i.e. it flushes too little? Do you then take an older one?
Stefan, please check out f7a66b4 first. The sstable sampling is already implemented in this commit. It basically does random sampling across all data on disk. What it does not have is running 'flush' in advance.
I think if there are no SSTables on disk, run flush. Otherwise, train on what's there.
@yifan-c thanks for explaining the sampling, that's what I had in mind too.
@yifan-c thanks, I must have overlooked that.
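Putting the two comments above together, a rough sketch of flush-if-empty plus random SSTable-chunk sampling, assuming zstd-jni's ZstdDictTrainer (addSample/trainSamples); TableHandle and its methods are hypothetical stand-ins, not Cassandra's actual APIs:

```java
import com.github.luben.zstd.ZstdDictTrainer;

// Illustrative only: flush if nothing is on disk yet, then feed randomly
// sampled chunks from the live sstables into a zstd dictionary trainer.
public class SSTableSamplingSketch
{
    // Hypothetical abstraction over a table's sstables; not Cassandra's actual API.
    public interface TableHandle
    {
        boolean hasLiveSSTables();
        void flush();
        Iterable<byte[]> randomChunks(int totalBytes); // random chunks across all sstables on disk
    }

    public byte[] train(TableHandle table, int sampleBudgetBytes, int dictSizeBytes)
    {
        if (!table.hasLiveSSTables())
            table.flush(); // nothing on disk yet: flush first, as suggested above

        ZstdDictTrainer trainer = new ZstdDictTrainer(sampleBudgetBytes, dictSizeBytes);
        for (byte[] chunk : table.randomChunks(sampleBudgetBytes))
        {
            if (!trainer.addSample(chunk))
                break; // the trainer's sample buffer is full
        }
        return trainer.trainSamples(); // may fail if there are too few samples
    }
}
```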
Force-pushed from db95727 to a06d9b3.
I think all comments are addressed. Please tag me if I missed any. (GH is not super friendly with many comments)
This reverts commit db95727.
Get rid of java.util.concurrent.CompletableFuture usage
Use clock
Relocate dictionary configurations and use IntKibibytesBound
Change compression_dictionary_training_sampling_rate data type from int to float
Address other various comments
Force-pushed from e4d024e to d9c7dea.
Pushed changes and triggered the CI: https://pre-ci.cassandra.apache.org/job/cassandra/123/pipeline-overview/
Add ZstdDictionaryCompressor and wire to the read and the write path
Resolve a correct dictionary compressor on the read path
lazily init ZstdDictCompress and ZstdDictDecompress
checksumming
Added compression dictionary table and accessors
Wire to System table access to CompressionDictionaryManager
Integrate dictionary trainer
Add nodetool to train dictionary
Cleanup
break CompressionDictionaryManager up
properly release dictionary with reference counting
Add ZstdDictionaryCompressorTest
Simplify trainer interface
Bump SSTable version
Add CompressionDictionaryCacheTest
Add ZstdDictionaryTrainerTest
update to use assertJ
Add CompressionDictionaryTrainingConfigTest
Update comments in CompressionDictionaryEventHandler and update exception handling in MessagingService
Add CompressionDictionaryEventHandlerTest
Add CompressionDictionarySchedulerTest
Minor update
Add CompressionDictionaryManagerTest
Add TrainCompressionDictionaryTest (nodetool)
Add CompressionDictionaryIntegrationTest
Tidy up
Add ZstdCompressionDictionaryTest
Add SystemDistributedKeyspaceCompressionDictionaryTest
Add --sampling-rate for the nodetool
Update consistency level from QUORUM to ONE for best effort
Add the missing setCompressionDictionaryManager to sstable writer builders
Add benchmark