CASSANDRA-17021: Support ZSTD dictionary compression #4399
Conversation
Force-pushed from 9279a19 to 7bf7e30.
Resolved review threads (outdated):
src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java
src/java/org/apache/cassandra/tools/nodetool/TrainCompressionDictionary.java (two threads)
conf/cassandra.yaml (outdated):
```yaml
compression_dictionary_refresh_interval: 3600s

# Initial delay before starting the first dictionary refresh cycle after node startup.
# This prevents all nodes from refreshing simultaneously when the cluster starts.
```
Does that mean all nodes would need to be started at the very same moment? Is this planned to be different per node in a real scenario? Because if all nodes have 10s, then all of them would have the same delay set, hence we would have the "simultaneous" situation again.
Maybe having a random initial delay in some range (e.g. 5 to 20 seconds) would be better?
I do not think starting all nodes at the same moment is a requirement. The refresh interval just defines how frequently each node should check the system table. When a new dictionary is trained, the node broadcasts a message to all live nodes to notify them that the dictionary is available; this happens outside of the periodic polling.
Combining both, you can consider the interval as the upper bound on how long it takes for the dictionary to be refreshed (when broadcasting fails at some nodes).
When some nodes are running with the prior dictionary or no dictionary, it is still not a huge issue from the perspective of reading and writing SSTables. Eventually, all nodes will pick up the latest.
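For readers skimming the thread, here is a minimal sketch of the poll-plus-broadcast scheme described above. The class and method names are illustrative only, not the actual classes in this patch, and it assumes dictionary ids are monotonically increasing.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a node periodically polls a system table for the latest
// dictionary (the upper bound on staleness) and also accepts push notifications
// when a peer announces a newly trained dictionary.
public class DictionaryRefreshSketch
{
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long currentDictionaryId = -1;

    public void start(long initialDelaySeconds, long refreshIntervalSeconds)
    {
        scheduler.scheduleWithFixedDelay(this::refreshFromSystemTable,
                                         initialDelaySeconds,
                                         refreshIntervalSeconds,
                                         TimeUnit.SECONDS);
    }

    // Periodic path: the fallback when a broadcast was missed.
    private void refreshFromSystemTable()
    {
        readLatestDictionaryId().ifPresent(this::maybeAdopt);
    }

    // Push path: invoked by the messaging layer when a peer broadcasts
    // that a new dictionary is available.
    public void onDictionaryAnnounced(long dictionaryId)
    {
        maybeAdopt(dictionaryId);
    }

    private synchronized void maybeAdopt(long dictionaryId)
    {
        if (dictionaryId > currentDictionaryId)
            currentDictionaryId = dictionaryId; // loading and installing the dictionary would go here
    }

    private Optional<Long> readLatestDictionaryId()
    {
        return Optional.empty(); // placeholder for a system-table read
    }
}
```

With both paths in place, the broadcast keeps nodes close to current and the periodic poll only matters when a broadcast is lost.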
What is the overhead of this? This seems like a huge bikeshed over something that has extremely low overhead.
@rustyrazorblade, I am not sure if I understand your comment. I would like to keep the settings as they are for now and not introduce randomized delay unless we see an actual need. If the default interval turns out to be sub-optimal, we can adjust that directly.
Sorry for the confusion. I'm not advocating for randomized delay. The point I was trying to make is that this operation appears to be so lightweight that debating whether it should run once a day or once an hour seems like a bikeshed. We're talking about an incredibly trivial amount of data and a very lightweight operation. Doing it once a day vs. once an hour, or even once a minute, is like the difference between one and three grains of sand on an entire beach.
conf/cassandra.yaml (outdated):
```yaml
# How long to keep compression dictionaries in the cache before they expire.
# Expired dictionaries will be removed from memory but can be reloaded if needed.
# Min unit: s
compression_dictionary_cache_expire: 3600s
```
It's probably worth pointing out that this is the same value (3600s) as compression_dictionary_refresh_interval. It feels like there might be some kind of a problem when both periods end at the same time.
Why such an aggressive expiration time? I'd make this 24h.
compression_dictionary_cache_expire and compression_dictionary_refresh_interval are not strongly connected, since they apply to different components. Cache entry expiry is renewed on every access, while compression_dictionary_refresh_interval only defines how frequently to read from the system table.
If there is no access to a cache entry for 1 hour, it is evicted. The chance is low that the dictionary is wanted right after eviction; even in that scenario, the cache entry is populated again from compressionInfo or the system table.
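A minimal sketch of the expire-after-access behaviour described here, assuming a Caffeine-style cache; the loader, key type, and class name are hypothetical, not the patch's CompressionDictionaryCache API:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

// Illustrative only: expire-after-access means every read renews the entry's
// lifetime, and an evicted dictionary is simply reloaded on the next lookup.
public class DictionaryCacheSketch
{
    private final LoadingCache<Long, byte[]> dictionaries = Caffeine.newBuilder()
        .expireAfterAccess(Duration.ofSeconds(3600)) // compression_dictionary_cache_expire
        .build(this::loadDictionary);                // repopulate on a cache miss

    public byte[] get(long dictionaryId)
    {
        return dictionaries.get(dictionaryId);
    }

    // Placeholder for loading from the SSTable's compressionInfo or the system table.
    private byte[] loadDictionary(long dictionaryId)
    {
        return new byte[0];
    }
}
```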
I am open to 24 hours. Was just placing some default values in the initial patch.
I think the eviction code is broken, see my full comment in JIRA. When compacting a table that I left overnight (assuming the dictionary was evicted) I get this:
```
WARN [CompactionExecutor:869] 2025-10-08T17:16:05,033 SSTableWriter.java:157 - Failed to open /Users/jhaddad/dev/cassandra/bin/../data/data/chunk4_2/keyvalue-1b255f4def2540a600000000000000f6/pa-2-big for writing
java.lang.IllegalStateException: Dictionary is being closed
	at org.apache.cassandra.io.compress.ZstdDictionaryCompressor.lambda$getOrCreate$1(ZstdDictionaryCompressor.java:100)
```
This should be fixed in 1ae168a.
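For context on the failure mode above, a rough sketch of a reference-count guard that keeps a lookup from racing with an eviction-triggered close; this is illustrative only and not the actual change in 1ae168a:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the race being discussed: eviction may close a shared
// dictionary while a compaction thread is trying to reuse it. Guarding the
// hand-off with a reference count lets the lookup reload instead of failing.
public class RefCountedDictionarySketch
{
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = held by the cache

    // Returns true if the caller now holds a reference; false if the dictionary
    // is already being closed and the caller should load a fresh copy instead.
    public boolean tryRef()
    {
        while (true)
        {
            int current = refs.get();
            if (current <= 0)
                return false; // closing or closed
            if (refs.compareAndSet(current, current + 1))
                return true;
        }
    }

    public void release()
    {
        if (refs.decrementAndGet() == 0)
            close(); // the last holder frees the native resources
    }

    private void close()
    {
        // freeing ZstdDictCompress / ZstdDictDecompress would go here
    }
}
```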
Resolved review thread:
src/java/org/apache/cassandra/db/compression/CompressionDictionaryCache.java
```java
{
    if (options == null || !options.containsKey("maxSamplingDurationSeconds"))
    {
        throw new IllegalArgumentException("maxSamplingDurationSeconds parameter is required for manual dictionary training");
    }
```
Can't that have some default when missing? It seems like a hassle to be forced to include it all the time.
The nodetool command is supposed to always provide a value, so the exception should never be thrown. I had been weighing both options for a while; the reasons for not providing a default here are that 1) it is not going to be used, and 2) two default values, one in nodetool and one in the server, could diverge and create confusion.
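To illustrate the "single default lives in nodetool" argument, a hypothetical sketch; the constant, its value, and the helper are invented for illustration, and only the maxSamplingDurationSeconds key comes from the patch:

```java
import java.util.Map;

// Illustrative only: the one default is defined on the nodetool side and always
// sent to the server, so the server can simply require the key to be present.
public class TrainingOptionsSketch
{
    // Hypothetical default; the real nodetool flag and value may differ.
    private static final long DEFAULT_MAX_SAMPLING_DURATION_SECONDS = 600;

    public static Map<String, String> buildOptions(Long maxSamplingDurationSeconds)
    {
        long effective = maxSamplingDurationSeconds != null
                         ? maxSamplingDurationSeconds
                         : DEFAULT_MAX_SAMPLING_DURATION_SECONDS;
        return Map.of("maxSamplingDurationSeconds", Long.toString(effective));
    }
}
```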
What about using a max # of samples instead of a duration? Duration is kind of weird, highly variable. Then a default also makes sense.
The reason for using a duration is to have a time bound for sampling. If using a max # of samples, there is a chance that the tool runs too long, even indefinitely, if no flush and no compaction occurs.
Having a time bound at least terminates the tool process in a controlled manner. I acknowledge that having training fail after 10 minutes is frustrating. It is up to operators to configure the tool properly; also, the SSTable-based sampling that you suggested and I just added should solve the issue better.
@yifan-c well, with a big gun comes a lot of responsibility :) I would also introduce a number of samples; if we are worried about the tool never ending, I would add duration as an additional boundary.
For example, max samples of 100,000 OR a duration of 6 hours; whichever happens first ends it.
@rustyrazorblade what do you think?
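A minimal sketch of the dual-bound idea (stop at max samples or max duration, whichever comes first); all names here are hypothetical, not the patch's trainer API:

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: sampling stops when either the sample budget or the time
// budget is exhausted, whichever happens first.
public class DualBoundSamplerSketch
{
    // Hypothetical source of sample chunks (flushed writes or sstable chunks).
    public interface SampleSource
    {
        byte[] nextSample();
    }

    public int collectSamples(SampleSource source, int maxSamples, long maxDurationSeconds)
    {
        long deadlineNanos = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxDurationSeconds);
        int collected = 0;
        while (collected < maxSamples && System.nanoTime() < deadlineNanos)
        {
            byte[] sample = source.nextSample();
            if (sample == null)
                break; // no more data to sample
            // feeding the sample to the dictionary trainer would go here
            collected++;
        }
        return collected;
    }
}
```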
Sounds good. I am open to removing the write-based sampling from the manual training tool invocation.
I would like to clarify whether we want to run flush on the table regardless, or only when there is no live sstable. I think the former is probably fine, as the training tool is used very sparsely, possibly only once.
That's a good question. What do you do when there are 10 SSTables already (or if somebody is changing the compression algorithm)? Do you take the last (newest) one or the first (oldest) one? I would take the newest one.
What if you flush and that flush does not contain enough data, i.e. it flushes too little? Do you then take an older one?
Stefan, please check out f7a66b4 first. The sstable sampling is already implemented in this commit. It basically does random sampling across all data on disk. What it does not have is running 'flush' in advance.
I think if there are no SSTables on disk, run flush. Otherwise, train on what's there.
@yifan-c thanks for explaining the sampling, that's what I had in mind too.
@yifan-c thanks, I must have overlooked that.
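Putting the two comments above together, a rough sketch of flush-if-empty plus random SSTable-chunk sampling, assuming zstd-jni's ZstdDictTrainer (addSample/trainSamples); TableHandle and its methods are hypothetical stand-ins, not Cassandra's actual APIs:

```java
import com.github.luben.zstd.ZstdDictTrainer;

// Illustrative only: flush if nothing is on disk yet, then feed randomly
// sampled chunks from the live sstables into a zstd dictionary trainer.
public class SSTableSamplingSketch
{
    // Hypothetical abstraction over a table's sstables; not Cassandra's actual API.
    public interface TableHandle
    {
        boolean hasLiveSSTables();
        void flush();
        Iterable<byte[]> randomChunks(int totalBytes); // random chunks across all sstables on disk
    }

    public byte[] train(TableHandle table, int sampleBudgetBytes, int dictSizeBytes)
    {
        if (!table.hasLiveSSTables())
            table.flush(); // nothing on disk yet: flush first, as suggested above

        ZstdDictTrainer trainer = new ZstdDictTrainer(sampleBudgetBytes, dictSizeBytes);
        for (byte[] chunk : table.randomChunks(sampleBudgetBytes))
        {
            if (!trainer.addSample(chunk))
                break; // the trainer's sample buffer is full
        }
        return trainer.trainSamples(); // may fail if there are too few samples
    }
}
```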
Force-pushed from db95727 to a06d9b3.
I think all comments are addressed. Please tag me if I missed any. (GH is not super friendly with many comments)
This reverts commit db95727.
Get rid of java.util.concurrent.CompletableFuture usage
Use clock
Relocate dictionary configurations and use IntKibibytesBound
Change compression_dictionary_training_sampling_rate data type from int to float
Address other various comments
Force-pushed from e4d024e to d9c7dea.
Pushed changes and triggered the CI: https://pre-ci.cassandra.apache.org/job/cassandra/123/pipeline-overview/
Add ZstdDictionaryCompressor and wire to the read and the write path
Resolve a correct dictionary compressor on the read path
lazily init ZstdDictCompress and ZstdDictDecompress
checksumming
Added compression dictionary table and accessors
Wire to System table access to CompressionDictionaryManager
Integrate dictionary trainer
Add nodetool to train dictionary
Cleanup
break CompressionDictionaryManager up
properly release dictionary with reference counting
Add ZstdDictionaryCompressorTest
Simplify trainer interface
Bump SSTable version
Add CompressionDictionaryCacheTest
Add ZstdDictionaryTrainerTest
update to use assertJ
Add CompressionDictionaryTrainingConfigTest
Update comments in CompressionDictionaryEventHandler and update exception handling in MessagingService
Add CompressionDictionaryEventHandlerTest
Add CompressionDictionarySchedulerTest
Minor update
Add CompressionDictionaryManagerTest
Add TrainCompressionDictionaryTest (nodetool)
Add CompressionDictionaryIntegrationTest
Tidy up
Add ZstdCompressionDictionaryTest
Add SystemDistributedKeyspaceCompressionDictionaryTest
Add --sampling-rate for the nodetool
Update consistency level from QUORUM to ONE for best effort
Add the missing setCompressionDictionaryManager to sstable writer builders
Add benchmark