Accelerate RDB loading by prefetching hashtable buckets #2311
Conversation
Signed-off-by: chzhoo <czawyx@163.com>
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## unstable #2311 +/- ##
============================================
+ Coverage 71.42% 71.49% +0.07%
============================================
Files 123 123
Lines 67030 67112 +82
============================================
+ Hits 47876 47982 +106
+ Misses 19154 19130 -24
5% is a great improvement!
One thing we might try that is simpler and might get us most of the improvement is to pre-fetch the key right after we parse it. That way we are prefetching the key while the subsequent processing of the value is happening.
So right after we read the key, originally on line 3302, we can use your new helper function to pre-fetch the bucket while the value loading is happening.
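For concreteness, a sketch of the suggested ordering inside rdbLoadRioWithLoadingCtx(); the value-loading call is paraphrased, not verbatim source:

```c
/* Read key. */
if ((key = rdbGenericLoadStringObject(rdb, RDB_LOAD_SDS, NULL)) == NULL) goto eoferr;

/* Kick off the bucket prefetch now, so the cache line is already on its way
 * while the (potentially much larger) value is still being decoded. */
dbPrefetch(db, key);

/* Read value; the prefetch overlaps with this work. */
val = rdbLoadObject(type, rdb, key, db->id, &error);
```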
Signed-off-by: chzhoo <czawyx@163.com>
Signed-off-by: chzhoo <czawyx@163.com>
Nice to see this improvement!
I have some questions.
- Did we try values larger than 16KB? It may not be a popular value size, but I'm wondering if we see degradation higher than 1-2%.
- Is there a way to write tests for this? Just to be sure that it doesn't break.
- Also, did you try backward compatibility? Basically something like a version upgrade. Ideally it shouldn't break, but it might be nice to still try it.
@@ -809,6 +809,11 @@ int kvstoreHashtableFindPositionForInsert(kvstore *kvs, int didx, void *key, has
    return hashtableFindPositionForInsert(ht, key, position, existing);
}

void kvstoreHashtablePrefetch(kvstore *kvs, int didx, const void *key) {
    hashtable *ht = createHashtableIfNeeded(kvs, didx);
Do we need to check this every time, or can we directly pass it?
Yes, we should keep checking this. It's similar to the kvstoreHashtableAdd implementation.
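For reference, the pattern being mirrored looks roughly like this. This is a sketch of the kvstore API's shape on the unstable branch, not quoted verbatim:

```c
/* Roughly the shape of kvstoreHashtableAdd: the per-dict hashtable is created
 * lazily, so any entry point that touches it goes through
 * createHashtableIfNeeded() first. The prefetch helper keeps the same guard. */
int kvstoreHashtableAdd(kvstore *kvs, int didx, void *entry) {
    hashtable *ht = createHashtableIfNeeded(kvs, didx);
    int ret = hashtableAdd(ht, entry);
    if (ret) cumulativeKeyCountAdd(kvs, didx, 1);
    return ret;
}
```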
Thanks @chzhoo for the benchmarks. I think the numbers look good to me!
Signed-off-by: chzhoo <czawyx@163.com>
src/rdb.c (Outdated)

@@ -3300,6 +3300,8 @@ int rdbLoadRioWithLoadingCtx(rio *rdb, int rdbflags, rdbSaveInfo *rsi, rdbLoadin

    /* Read key */
    if ((key = rdbGenericLoadStringObject(rdb, RDB_LOAD_SDS, NULL)) == NULL) goto eoferr;
    /* Prefetch the corresponding hashtable bucket to reduce cache misses. */
    dbPrefetch(db, key);
By doing the prefetch, we are repeating some work we will need to do later:

1. getKVStoreIndexForKey/getKeySlot (inside dbPrefetch and dbAddRDBLoad)
2. hashKey (inside kvstoreHashtablePrefetch and dictFindPositionForInsert)

In cluster mode, 1 could be quite expensive, since we will need to do the CRC hash and look up the hashtable. getKeySlot will attempt to cache the key slot in server.current_client, but I don't think we set this to a value during RDB load, so this will double the getKeySlot overhead.

I think we could set server.current_client to a fake client (similar to AOF load), which would allow this caching mechanism to be used and reduce the overhead there. Alternatively, you could return the key slot from dbPrefetch and pass it through to the RDB load later.
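A hypothetical sketch of that alternative; dbAddRDBLoadWithSlot and the exact signatures are illustrative, not from the PR:

```c
/* Hypothetical: dbPrefetch returns the dict index it computed so the caller
 * can pass it back at insert time and skip the second CRC/key-slot lookup. */
int dbPrefetch(serverDb *db, sds key) {
    int dict_index = getKVStoreIndexForKey(key); /* CRC-based slot in cluster mode */
    kvstoreHashtablePrefetch(db->keys, dict_index, key);
    return dict_index;
}

/* At the call site during RDB load: */
int slot = dbPrefetch(db, key);
/* ... value decoding happens here, overlapping with the prefetch ... */
dbAddRDBLoadWithSlot(db, key, val, slot); /* hypothetical variant of dbAddRDBLoad */
```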
I did overlook the fact that prefetch would trigger an extra CRC32 calculation when cluster mode is enabled. Thanks for pointing this out. I'll rerun the benchmark with cluster mode enabled to evaluate the performance impact, then finalize our optimization plan accordingly.
/* It is a helper function that prefetches the corresponding hashtable bucket
 * into the CPU cache before batch key insertion operations, reducing memory latency. */
void dbPrefetch(serverDb *db, sds key) {
    int dict_index = getKVStoreIndexForKey(key);
Was the benchmark done with cluster mode enabled? If not, I think we might see different results with cluster mode since getKVStoreIndexForKey short-circuits when cluster mode is disabled.
The previous benchmark was done with cluster mode disabled. I'll turn on cluster mode and run the benchmark again.
- I reran the benchmark with larger values (4KB and above) and cluster mode enabled.
- The reason for not testing small values is that the performance degradation is more pronounced in large-value scenarios.
- The maximum performance degradation is 2.3%. I'm hesitant about whether the increased code complexity is worthwhile for a minor improvement.
value_size * key_num = total_size | Loading time before optimization (seconds) | Loading time after optimization (seconds) | Performance boost
---|---|---|---
4KB*262144=1GB | 1.810 | 1.851 | -2.3%
4KB*524288=2GB | 3.692 | 3.713 | -0.6%
4KB*1048576=4GB | 7.230 | 7.361 | -1.8%
4KB*2097152=8GB | 14.395 | 14.635 | -1.7%
4KB*4194304=16GB | 28.784 | 29.295 | -1.8%
16KB*65536=1GB | 1.309 | 1.319 | -0.8%
16KB*131072=2GB | 2.603 | 2.649 | -1.8%
16KB*262144=4GB | 5.220 | 5.283 | -1.2%
16KB*524288=8GB | 10.412 | 10.510 | -0.9%
16KB*1048576=16GB | 20.750 | 21.063 | -1.5%
64KB*16384=1GB | 0.988 | 0.995 | -0.7%
64KB*32768=2GB | 1.963 | 1.974 | -0.6%
64KB*65536=4GB | 3.922 | 3.942 | -0.5%
64KB*131072=8GB | 7.833 | 7.876 | -0.5%
64KB*262144=16GB | 15.652 | 15.746 | -0.6%
256KB*4096=1GB | 0.905 | 0.906 | -0.1%
256KB*8192=2GB | 1.808 | 1.808 | 0.0%
256KB*16384=4GB | 3.829 | 3.841 | -0.3%
256KB*32768=8GB | 7.662 | 7.672 | -0.1%
256KB*65536=16GB | 15.286 | 15.320 | -0.2%
1024KB*1024=1GB | 0.896 | 0.896 | 0.0%
1024KB*2048=2GB | 1.784 | 1.786 | -0.1%
1024KB*4096=4GB | 3.564 | 3.581 | -0.5%
1024KB*8192=8GB | 7.115 | 7.161 | -0.6%
1024KB*16384=16GB | 14.191 | 14.253 | -0.4%
4096KB*256=1GB | 0.882 | 0.883 | -0.1%
4096KB*512=2GB | 1.774 | 1.775 | -0.1%
4096KB*1024=4GB | 3.535 | 3.536 | -0.0%
4096KB*2048=8GB | 7.041 | 7.068 | -0.4%
4096KB*4096=16GB | 14.050 | 14.087 | -0.3%
> The reason for not testing small values is that the performance degradation is more pronounced in large-value scenarios.

The cost of the CRC hash is O(n) in the key name length, not in the value size. What is the key name length you are using?

If the key name length is fixed across tests, I would expect small values to be even more impacted by the repeated CRC hash work, since the hash computation would be more expensive relative to the rest of the loading operation. Also consider that 1GiB of small values requires many more CRC hashes than 1GiB of large values (since it contains more keys).

Would it be possible to check the lower value sizes with cluster mode on, for comparison?
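To make the cost model concrete: the cluster key slot is a CRC over the key bytes, so the work scales with key name length only. A minimal sketch of the idea (hashtag handling omitted; the crc16 declaration mirrors the checksum Valkey's crc16.c provides, and keySlotSketch is an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

uint16_t crc16(const char *buf, int len); /* provided by crc16.c */

/* O(keylen) per key, independent of value size: a 16-byte key costs the same
 * CRC work whether its value is 8B or 4MB, and 1GiB of small values contains
 * far more keys, hence far more CRC invocations. */
static unsigned int keySlotSketch(const char *key, size_t keylen) {
    return crc16(key, (int)keylen) & 16383; /* 16384 slots */
}
```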
> The cost of the CRC hash is O(n) in the key name length, not in the value size. What is the key name length you are using?

I generated the RDB file for the benchmark using the following command, where key:000552565592 is an example key (16 bytes in size):

valkey-benchmark -t set -n $KEY_NUM -r 1000000000 -d $VALUE_SIZE -P 16

> If the key name length is fixed across tests, I would expect small values to be even more impacted by the repeated CRC hash work, since the hash computation would be more expensive relative to the rest of the loading operation. Also consider that 1GiB of small values requires many more CRC hashes than 1GiB of large values (since it contains more keys).
>
> Would it be possible to check the lower value sizes with cluster mode on, for comparison?

- I'll add benchmarks for smaller value sizes with cluster mode enabled later.
- I understand your concern that the additional CRC32/hash computations may impact performance. As a solution, I plan to modify the code to cache the CRC32/hash values, thereby reducing redundant calculations.
- I have implemented caching of key slots and hash values via dbPrefetch return parameters, eliminating the redundant computations. The code is now in a review-ready state.
- After rerunning the benchmarks, we observed performance improvements of approximately 10.9% in some scenarios, while no test case now exhibits significant degradation. This addresses the previously noted ~2% performance degradation in specific cases. Detailed results are presented in the tables below.

Benchmark results with cluster mode disabled:

value_size * key_num = total_size | Loading time before optimization (seconds) | Loading time after optimization (seconds) | Performance boost
---|---|---|---
8B*134217728=1GB | 84.954 | 75.715 | 10.9%
8B*268435456=2GB | 163.619 | 147.234 | 10.0%
32B*33554432=1GB | 23.215 | 20.958 | 9.7%
32B*67108864=2GB | 46.513 | 42.253 | 9.2%
256B*4194304=1GB | 4.453 | 4.129 | 7.3%
256B*8388608=2GB | 8.955 | 8.324 | 7.0%
4KB*262144=1GB | 1.752 | 1.727 | 1.4%
4KB*1048576=4GB | 7.004 | 6.944 | 0.9%
4KB*4194304=16GB | 28.029 | 27.641 | 1.4%
16KB*65536=1GB | 1.299 | 1.292 | 0.5%
16KB*262144=4GB | 5.150 | 5.141 | 0.2%
16KB*1048576=16GB | 20.588 | 20.587 | 0.0%
64KB*16384=1GB | 0.977 | 0.983 | -0.6%
64KB*65536=4GB | 3.905 | 3.918 | -0.3%
64KB*262144=16GB | 15.634 | 15.629 | 0.0%
256KB*4096=1GB | 0.909 | 0.912 | -0.3%
256KB*16384=4GB | 3.629 | 3.628 | 0.0%
256KB*65536=16GB | 14.399 | 14.412 | -0.1%
1024KB*1024=1GB | 0.888 | 0.891 | -0.3%
1024KB*4096=4GB | 3.542 | 3.545 | -0.1%
1024KB*16384=16GB | 14.152 | 14.150 | 0.0%
4096KB*256=1GB | 0.883 | 0.884 | -0.1%
4096KB*1024=4GB | 3.518 | 3.528 | -0.3%
4096KB*4096=16GB | 14.058 | 14.047 | 0.1%

Benchmark results with cluster mode enabled:

value_size * key_num = total_size | Loading time before optimization (seconds) | Loading time after optimization (seconds) | Performance boost
---|---|---|---
8B*134217728=1GB | 101.177 | 90.707 | 10.3%
8B*268435456=2GB | 195.051 | 176.312 | 9.6%
32B*33554432=1GB | 27.303 | 25.009 | 8.4%
32B*67108864=2GB | 54.820 | 49.913 | 9.0%
256B*4194304=1GB | 5.046 | 4.743 | 6.0%
256B*8388608=2GB | 10.083 | 9.495 | 5.8%
4KB*262144=1GB | 1.800 | 1.791 | 0.5%
4KB*1048576=4GB | 7.215 | 7.148 | 0.9%
4KB*4194304=16GB | 28.806 | 28.508 | 1.0%
16KB*65536=1GB | 1.310 | 1.313 | -0.2%
16KB*262144=4GB | 5.230 | 5.238 | -0.2%
16KB*1048576=16GB | 20.873 | 20.864 | 0.0%
64KB*16384=1GB | 0.984 | 0.983 | 0.1%
64KB*65536=4GB | 3.953 | 3.939 | 0.4%
64KB*262144=16GB | 15.693 | 15.749 | -0.4%
256KB*4096=1GB | 0.910 | 0.910 | 0.0%
256KB*16384=4GB | 3.610 | 3.609 | 0.0%
256KB*65536=16GB | 14.529 | 14.469 | 0.4%
1024KB*1024=1GB | 0.888 | 0.895 | -0.8%
1024KB*4096=4GB | 3.571 | 3.572 | -0.0%
1024KB*16384=16GB | 14.172 | 14.146 | 0.2%
4096KB*256=1GB | 0.885 | 0.886 | -0.1%
4096KB*1024=4GB | 3.525 | 3.528 | -0.1%
4096KB*4096=16GB | 14.064 | 14.060 | 0.0%

- For details of the benchmark, see the benchmark script below.
set -ex

# value_size(byte) * key_num
TEST_CASES=(
    # Small-value scenarios (8B to 256B)
    "8*134217728" "8*268435456"
    "32*33554432" "32*67108864"
    "256*4194304" "256*8388608"
    # Large-value scenarios (4KB to 4MB)
    "4096*262144" "4096*1048576" "4096*4194304"
    "16384*65536" "16384*262144" "16384*1048576"
    "65536*16384" "65536*65536" "65536*262144"
    "262144*4096" "262144*16384" "262144*65536"
    "1048576*1024" "1048576*4096" "1048576*16384"
    "4194304*256" "4194304*1024" "4194304*4096"
)

CLUSTER_ENABLED=yes
REPEAT_TESTS=5
OPTIMIZE_BRANCH_VALKEY_SERVER="./opt/src/valkey-server"
UNSTABLE_BRANCH_VALKEY_SERVER="./unstable/src/valkey-server"
VALKEY_CLI="./opt/src/valkey-cli"
VALKEY_BENCHMARK="./opt/src/valkey-benchmark"

wait_for_valkey_server_exit() {
    while true; do
        count=`ps aux | grep valkey-server | grep -v grep | wc -l`
        if [ "$count" -eq 0 ]; then
            break
        fi
        sleep 1
    done
    return 0
}

wait_for_loading_complete() {
    while true; do
        count=`$VALKEY_CLI info | grep -P "^loading:0" | wc -l`
        if [ "$count" -eq 1 ]; then
            break
        fi
        sleep 1
    done
}

run_benchmark() {
    local val_size=$1
    local key_num=$2
    local branch=$3
    local server_bin=$4
    local iteration=$5
    local log_file="./${branch}_log"

    echo "$branch val_size $val_size key_num $key_num time $iteration" >> $log_file
    $server_bin --cluster-enabled $CLUSTER_ENABLED --save '' --appendonly no --daemonize yes --logfile "$log_file" --pidfile ""
    sleep 2
    wait_for_loading_complete
    $VALKEY_CLI shutdown
    wait_for_valkey_server_exit
}

# Main test loop
for test_case in "${TEST_CASES[@]}"; do
    IFS='*' read -r val_size key_num <<< "$test_case"
    echo $val_size $key_num

    # Clean previous data
    rm -rf dump.rdb

    # Prepare test data
    $OPTIMIZE_BRANCH_VALKEY_SERVER --save '' --appendonly no --daemonize yes --pidfile ""
    sleep 2
    $VALKEY_CLI flushall
    $VALKEY_BENCHMARK -t set -n $key_num -r 10000000000 -d $val_size -P 32 -c 2
    $VALKEY_CLI save
    $VALKEY_CLI shutdown
    wait_for_valkey_server_exit

    # Run tests for the optimize branch
    for ((x = 0; x < $REPEAT_TESTS; x += 1)); do
        run_benchmark $val_size $key_num "optimize" $OPTIMIZE_BRANCH_VALKEY_SERVER $x
    done

    # Run tests for the unstable branch
    for ((x = 0; x < $REPEAT_TESTS; x += 1)); do
        run_benchmark $val_size $key_num "unstable" $UNSTABLE_BRANCH_VALKEY_SERVER $x
    done
done
Signed-off-by: chzhoo <czawyx@163.com>
Description
Under specific workloads, flame graph analysis reveals that during RDB loading, significant CPU time is consumed by hashtable bucket lookups (due to CPU cache misses) when inserting key-value pairs. By prefetching the corresponding buckets to reduce cache misses, we can accelerate RDB load times; our tests demonstrate approximately a 10.9% improvement in this scenario.
Original RDB loading logic:
Optimized loading logic:
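Conceptually, the optimized path boils down to touching the destination bucket's cache line before the insert needs it. A minimal sketch of the idea, assuming a GCC-style builtin and an illustrative bucket layout (this is not the PR's actual hashtable code):

```c
#include <stdint.h>

/* Illustrative bucket layout; the real hashtable's layout differs. */
typedef struct {
    uint8_t presence;
    void *entries[7];
} bucket;

/* Hint the CPU to pull the destination bucket's cache line in early.
 * __builtin_prefetch(addr, rw, locality): 0 = read, 3 = high temporal locality. */
static inline void bucketPrefetch(const bucket *table, uint64_t mask, uint64_t hash) {
    __builtin_prefetch(&table[hash & mask], 0, 3);
}
```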
Benchmark
Benchmark steps:
Benchmark result
Test Env