Conversation

@yifan-c yifan-c commented Sep 26, 2025

Add ZstdDictionaryCompressor and wire to the read and the write path

Resolve a correct dictionary compressor on the read path

lazily init ZstdDictCompress and ZstdDictDecompress

checksumming

Added compression dictionary table and accessors

Wire to System table access to CompressionDictionaryManager

Integrate dictionary trainer

Add nodetool to train dictionary

Cleanup

break CompressionDictionaryManager up

properly release dictionary with reference counting

Add ZstdDictionaryCompressorTest

Simplify trainer interface

Bump SSTable version

Add CompressionDictionaryCacheTest

Add ZstdDictionaryTrainerTest

update to use assertJ

Add CompressionDictionaryTrainingConfigTest

Update comments in CompressionDictionaryEventHandler and update exception handling in MessagingService

Add CompressionDictionaryEventHandlerTest

Add CompressionDictionarySchedulerTest

Minor update

Add CompressionDictionaryManagerTest

Add TrainCompressionDictionaryTest (nodetool)

Add CompressionDictionaryIntegrationTest

Tidy up

Add ZstdCompressionDictionaryTest

Add SystemDistributedKeyspaceCompressionDictionaryTest

Add --sampling-rate for the nodetool

Update consistency level from QUORUM to ONE for best effort

Add the missing setCompressionDictionaryManager to sstable writer builders

Add benchmark

@yifan-c yifan-c force-pushed the CASSANDRA-17021/zstd-dict-compression-p1 branch from 9279a19 to 7bf7e30 on September 26, 2025 22:12
@yifan-c yifan-c changed the title from "Support ZSTD dictionary compression" to "CASSANDRA-17021: Support ZSTD dictionary compression" on Sep 26, 2025
@smiklosovic smiklosovic self-requested a review October 7, 2025 11:02
compression_dictionary_refresh_interval: 3600s

# Initial delay before starting the first dictionary refresh cycle after node startup.
# This prevents all nodes from refreshing simultaneously when the cluster starts.

Contributor

Does that mean all nodes would need to be started at the very same moment? Is this planned to be different per node in a real scenario? Because if all nodes have 10s, then all of them would have the same delay set, hence we would have the "simultaneous" situation again?

Maybe having a random initial delay in some range (e.g. 5 to 20 seconds) would be better?

Contributor Author

@yifan-c yifan-c Oct 8, 2025

I do not think starting all nodes at the same moment is a requirement. The refresh interval just defines how frequently each node should check the system table. When a new dictionary is trained, the node broadcasts a message to all live nodes to announce that the dictionary is available. This happens outside of the periodic polling.
Combining both, you can consider the interval as the upper bound on how long it takes for a dictionary to be refreshed (when broadcasting fails at some nodes).
Even when some nodes are running with the prior dictionary or no dictionary, it is not a huge issue from the perspective of reading and writing SSTables. Eventually, all nodes will pick up the latest.
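
For illustration, here is a minimal sketch of the polling-plus-broadcast behavior described above; all class and method names (DictionaryRefreshSketch, refreshFromSystemTable, onDictionaryBroadcast) are hypothetical and not the names used in the patch.

```java
// Illustrative sketch only -- names are hypothetical, not taken from the patch.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DictionaryRefreshSketch
{
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(long initialDelaySeconds, long refreshIntervalSeconds)
    {
        // Periodic poll of the system table; this interval is the upper bound on how
        // stale a node's dictionary can be if the broadcast below is missed.
        scheduler.scheduleWithFixedDelay(this::refreshFromSystemTable,
                                         initialDelaySeconds,
                                         refreshIntervalSeconds,
                                         TimeUnit.SECONDS);
    }

    // Called when another node announces a newly trained dictionary;
    // this happens outside of the periodic polling.
    public void onDictionaryBroadcast()
    {
        refreshFromSystemTable();
    }

    private void refreshFromSystemTable()
    {
        // Hypothetical: read the latest dictionary row and swap it in if it is newer.
    }
}
```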

Contributor

@rustyrazorblade rustyrazorblade Oct 8, 2025

What is the overhead of this? This seems like a huge bikeshed over something that has extremely low overhead.

Contributor Author

@rustyrazorblade, I am not sure I understand your comment. I would like to keep the settings as they are for now and not introduce a randomized delay unless we see an actual need. If the default interval turns out to be sub-optimal, we can adjust it directly.

Contributor

Sorry for the confusion. I'm not advocating for a randomized delay. The point I was trying to make is that this operation appears to be so lightweight that debating whether it should run once a day or once an hour seems like a bikeshed. We're talking about an incredibly trivial amount of data and a very lightweight operation. Doing it once a day vs. once an hour or even once a minute is like the difference between one and three grains of sand on an entire beach.

# How long to keep compression dictionaries in the cache before they expire.
# Expired dictionaries will be removed from memory but can be reloaded if needed.
# Min unit: s
compression_dictionary_cache_expire: 3600s

Contributor

Probably safe; I'm just saying that the fact this is the same (3600s) as compression_dictionary_refresh_interval feels like there might be some kind of a problem when both periods end at the same time.

Contributor

Why such an aggressive expiration time? I'd make this 24h.

Contributor Author

compression_dictionary_cache_expire and compression_dictionary_refresh_interval are not strongly connected, since they apply to different components. Cache entry expiry is renewed on every access, whereas compression_dictionary_refresh_interval only defines how frequently to read from the system table.
If a cache entry is not accessed for 1 hour, it is evicted. The chance is low that the dictionary is wanted right after eviction, and even in that scenario the cache entry is repopulated from compressionInfo or the system table.
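
A minimal sketch of the expire-after-access semantics described above, assuming a Caffeine loading cache; the key/value types and loader here are illustrative, not the actual CompressionDictionaryCache implementation.

```java
// Illustrative only: expire-after-access semantics, not the patch's cache code.
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class DictionaryCacheSketch
{
    // Hypothetical key/value types standing in for the real dictionary identifiers.
    private final LoadingCache<String, byte[]> cache =
        Caffeine.newBuilder()
                .expireAfterAccess(Duration.ofHours(1))  // compression_dictionary_cache_expire
                .build(this::loadFromSystemTable);       // repopulated on a miss after eviction

    public byte[] get(String dictionaryId)
    {
        // Every read renews the entry's expiry; an entry idle for an hour is evicted.
        return cache.get(dictionaryId);
    }

    private byte[] loadFromSystemTable(String dictionaryId)
    {
        // Hypothetical: read the serialized dictionary back from the system table.
        return new byte[0];
    }
}
```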

Contributor Author

I am open to 24 hours. I was just placing some default values in the initial patch.

Contributor

I think the eviction code is broken; see my full comment in JIRA. When compacting a table that I left overnight (assuming the dictionary was evicted) I get this:

WARN  [CompactionExecutor:869] 2025-10-08T17:16:05,033 SSTableWriter.java:157 - Failed to open /Users/jhaddad/dev/cassandra/bin/../data/data/chunk4_2/keyvalue-1b255f4def2540a600000000000000f6/pa-2-big for writing
java.lang.IllegalStateException: Dictionary is being closed
    at org.apache.cassandra.io.compress.ZstdDictionaryCompressor.lambda$getOrCreate$1(ZstdDictionaryCompressor.java:100)

Contributor Author

This should be fixed in 1ae168a.
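
For context, the commit list above mentions releasing dictionaries via reference counting. A minimal sketch of that kind of guard (a compressor only uses a dictionary if it can take a reference before the dictionary is freed) is shown below; the class and method names are illustrative, not the code in 1ae168a.

```java
// Illustrative ref-counting guard; not the actual fix in 1ae168a.
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedDictionarySketch
{
    // The owning cache holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);

    /** Returns true if a reference was acquired; false if the dictionary is already released. */
    public boolean tryRef()
    {
        while (true)
        {
            int current = refs.get();
            if (current <= 0)
                return false;                        // released (e.g. evicted): caller must reload
            if (refs.compareAndSet(current, current + 1))
                return true;
        }
    }

    public void unref()
    {
        if (refs.decrementAndGet() == 0)
            freeNativeResources();                   // exactly one thread observes the count hit zero
    }

    private void freeNativeResources()
    {
        // Hypothetical: release the ZstdDictCompress / ZstdDictDecompress handles here.
    }
}
```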

{
    if (options == null || !options.containsKey("maxSamplingDurationSeconds"))
    {
        throw new IllegalArgumentException("maxSamplingDurationSeconds parameter is required for manual dictionary training");

Contributor

Can that not have some default when missing? Seems like a hassle to be forced to include it all the time.

Contributor Author

The nodetool command is supposed to always provide a value, so the exception should never be thrown. I had been weighing both options for a while; the reasons for not providing a default here are that 1) it is not going to be used, and 2) two default values, one in nodetool and one in the server, could diverge and create confusion.

Contributor

What about using a max # of samples instead of a duration? Duration is kind of weird, highly variable. Then a default also makes sense.

Contributor Author

The reason for using a duration is to have a time bound on sampling. If we used a max # of samples, there is a chance the tool runs too long, even indefinitely, if no flush and no compaction occurs.
A time bound at least terminates the tool process in a controlled manner. I acknowledge that a training run failing after 10 minutes is frustrating. It is up to operators to configure the tool properly; also, the SSTable-based sampling that you suggested, and that I just added, should solve the issue better.

Contributor

@yifan-c well, with a big gun there comes a lot of responsibility :) I would also introduce a number of samples; if we are worried about that never ending, I would add the duration as an additional boundary.

For example, max samples of 100 000 OR a duration of 6 hours. Whichever happens first ends it.

@rustyrazorblade what do you think?
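
A minimal sketch of the dual bound proposed here (max sample count OR max duration, whichever comes first); the trainer interface and all names are hypothetical.

```java
// Illustrative "max samples OR max duration" bound; not the patch's trainer code.
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

public final class BoundedSamplingSketch
{
    /**
     * Feeds samples to a hypothetical trainer until either maxSamples samples have been
     * collected or maxDuration has elapsed, whichever happens first.
     */
    public static int sample(Supplier<byte[]> nextSample, Consumer<byte[]> trainer,
                             int maxSamples, Duration maxDuration)
    {
        long deadlineNanos = System.nanoTime() + maxDuration.toNanos();
        int collected = 0;
        while (collected < maxSamples && System.nanoTime() < deadlineNanos)
        {
            byte[] chunk = nextSample.get();  // e.g. the next sampled chunk of row data
            if (chunk == null)
                break;                        // source exhausted
            trainer.accept(chunk);
            collected++;
        }
        return collected;
    }
}
```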

Contributor Author

Sounds good. I am open to removing the write-based sampling from the manual training tool invocation.

I would like to clarify whether we want to run a flush on the table regardless, or only when there are no live SSTables. I think the former is probably fine, as the training tool is used very sparsely, possibly only once.

Contributor

@smiklosovic smiklosovic Oct 13, 2025

That's a good question. What do you do when there are 10 SSTables already (or if somebody is changing the compression algorithm)? Do you take the last one (the newest) or the first one (the oldest)? I would take the newest one.

What if you flush and that flush does not have enough data, i.e. it flushes too little? Do you then take an older one?

Contributor Author

Stefan, please check out f7a66b4 first. The SSTable sampling is already implemented in that commit; it does random sampling across all data on disk. What it does not do yet is run 'flush' in advance.
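
For illustration, a minimal sketch of random (Bernoulli) sampling over on-disk chunks at a given --sampling-rate; the chunk source and names are hypothetical, not the code in f7a66b4.

```java
// Illustrative Bernoulli sampling over on-disk chunks; names are hypothetical.
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Consumer;

public final class SSTableChunkSamplingSketch
{
    /** Feeds roughly samplingRate (0.0-1.0) of all chunks across the SSTables on disk to the trainer. */
    public static void sample(List<byte[]> chunks, double samplingRate, Consumer<byte[]> trainer)
    {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        for (byte[] chunk : chunks)
        {
            if (random.nextDouble() < samplingRate)
                trainer.accept(chunk);  // sampled chunk goes to dictionary training
        }
    }
}
```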

Contributor

I think if there are no SSTables on disk, run flush. Otherwise, train on what's there.

@yifan-c thanks for explaining the sampling, that's what I had in mind too.

Contributor

@yifan-c thanks, I must have overlooked that.

@yifan-c yifan-c force-pushed the CASSANDRA-17021/zstd-dict-compression-p1 branch from db95727 to a06d9b3 on October 8, 2025 22:58
Contributor Author

yifan-c commented Oct 8, 2025

I think all comments are addressed. Please tag me if I missed any. (GitHub is not super friendly with many comments.)

@apache apache deleted a comment from yifan-c Oct 9, 2025 (this event repeated 18 times)
@yifan-c yifan-c force-pushed the CASSANDRA-17021/zstd-dict-compression-p1 branch from e4d024e to d9c7dea on October 14, 2025 21:38
Contributor Author

yifan-c commented Oct 14, 2025

Pushed changes and triggered the CI: https://pre-ci.cassandra.apache.org/job/cassandra/123/pipeline-overview/
