
Accelerate RDB loading by prefetching hashtable buckets #2311

Open · chzhoo wants to merge 6 commits into unstable from optimize_loaddata

Conversation

@chzhoo (Contributor) commented Jul 6, 2025

Description

Under specific workloads, flame graph analysis shows that during RDB loading a significant share of CPU time is spent on hashtable bucket lookups (due to CPU cache misses) when inserting key-value pairs. Prefetching the corresponding buckets reduces these cache misses and speeds up RDB loading; our tests show an improvement of approximately 10.9% in this scenario.

Original RDB loading logic:

  1. Parse key
  2. Parse value
  3. Perform insertion

Optimized loading logic:

  1. Parse key
  2. Prefetch hashtable bucket
  3. Parse value
  4. Perform insertion
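
For illustration, a simplified sketch of the reordered inner loop; this is not the literal PR diff, the signatures are approximate, and dbPrefetch/dbAddRDBLoad are the helper names referenced later in this discussion:

/* Simplified sketch of the reordered load loop (approximate signatures, not the PR's exact code). */
key = rdbGenericLoadStringObject(rdb, RDB_LOAD_SDS, NULL); /* 1. parse key */
dbPrefetch(db, key);                                       /* 2. start pulling the destination bucket into the CPU cache */
val = rdbLoadObject(type, rdb, key, db->id, NULL);         /* 3. parse value; the prefetch overlaps with this work */
dbAddRDBLoad(db, key, val);                                /* 4. insert; the bucket is likely already cached */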

Benchmark

Benchmark steps:

  1. Generate the RDB file using specified commands
  2. Load RDB file 5 times
  3. Take the minimum RDB loading time as the final benchmark
  4. For more details about the benchmark, see the discussion below

Benchmark result

| value_size * key_num = total_size | Loading time before optimization (s) | Loading time after optimization (s) | Performance Boost |
|---|---|---|---|
| 8B*134217728=1GB | 84.954 | 75.715 | 10.9% |
| 8B*268435456=2GB | 163.619 | 147.234 | 10.0% |
| 32B*33554432=1GB | 23.215 | 20.958 | 9.7% |
| 32B*67108864=2GB | 46.513 | 42.253 | 9.2% |
| 256B*4194304=1GB | 4.453 | 4.129 | 7.3% |
| 256B*8388608=2GB | 8.955 | 8.324 | 7.0% |

Test Env

  • OS: Ubuntu Server 20.04 LTS 64bit
  • CPU: Intel(R) Xeon(R) Platinum 8255C * 96
  • Memory: 384GB
  • File system: memory file system (to eliminate the distorting effect of disk I/O on benchmark results)
  • Valkey server was run with cluster mode disabled

Signed-off-by: chzhoo <czawyx@163.com>

codecov bot commented Jul 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.49%. Comparing base (293be3e) to head (c08aee8).
Report is 17 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2311      +/-   ##
============================================
+ Coverage     71.42%   71.49%   +0.07%     
============================================
  Files           123      123              
  Lines         67030    67112      +82     
============================================
+ Hits          47876    47982     +106     
+ Misses        19154    19130      -24     
| Files with missing lines | Coverage Δ |
|---|---|
| src/db.c | 90.14% <100.00%> (+0.03%) ⬆️ |
| src/hashtable.c | 82.59% <100.00%> (+0.27%) ⬆️ |
| src/kvstore.c | 95.07% <100.00%> (+0.03%) ⬆️ |
| src/rdb.c | 76.95% <100.00%> (+0.51%) ⬆️ |
| src/server.h | 100.00% <ø> (ø) |

... and 20 files with indirect coverage changes

@madolson (Member) left a comment

5% is a great improvement!

One thing we might try that is simpler and might get us most of the improvement is to pre-fetch the key right after we parse it. That way we are prefetching the key while the subsequent processing of the value is happening.

So right after we read the key, originally on line 3302, we can use your new helper function to pre-fetch the bucket while the value loading is happening.

Signed-off-by: chzhoo <czawyx@163.com>
@chzhoo chzhoo force-pushed the optimize_loaddata branch from c968fbb to 51c63cd Compare July 7, 2025 15:50
Signed-off-by: chzhoo <czawyx@163.com>
@chzhoo (Contributor, Author) commented Jul 7, 2025

> 5% is a great improvement!
>
> One thing we might try that is simpler and might get us most of the improvement is to pre-fetch the key right after we parse it. That way we are prefetching the key while the subsequent processing of the value is happening.
>
> So right after we read the key, originally on line 3302, we can use your new helper function to pre-fetch the bucket while the value loading is happening.

  • The new code has been submitted, and it is now ready for review.
  • Benchmark results have also been updated.

@sarthakaggarwal97 (Contributor) left a comment

Nice to see this improvement!

I have some questions.

  1. Did we try values larger than 16 KB? It may not be a popular value size, but I'm wondering if we see degradation higher than 1-2%.
  2. Is there a way to write tests for this? Just to be sure that it doesn't break.
  3. Also, did you try backward compatibility? Basically something like a version upgrade. Ideally it shouldn't break, but it might be nice to try it anyway.

@@ -809,6 +809,11 @@ int kvstoreHashtableFindPositionForInsert(kvstore *kvs, int didx, void *key, has
return hashtableFindPositionForInsert(ht, key, position, existing);
}

void kvstoreHashtablePrefetch(kvstore *kvs, int didx, const void *key) {
hashtable *ht = createHashtableIfNeeded(kvs, didx);
Contributor

Do we need to check this every time, or can we pass it in directly?

Contributor Author

Yes, we should keep checking this. It's similar to the kvstoreHashtableAdd implementation.
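
For context, the full helper around this hunk plausibly looks like the sketch below; only the two lines visible in the diff are from the PR, and hashtablePrefetch is an assumed name for the hashtable-level call added in src/hashtable.c:

/* Sketch only: mirrors the kvstoreHashtableAdd pattern by creating the per-dict
 * hashtable on demand before prefetching the bucket the key hashes to. */
void kvstoreHashtablePrefetch(kvstore *kvs, int didx, const void *key) {
    hashtable *ht = createHashtableIfNeeded(kvs, didx); /* same on-demand check as kvstoreHashtableAdd */
    hashtablePrefetch(ht, key);                         /* assumed hashtable-level helper */
}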

@chzhoo (Contributor, Author) commented Jul 9, 2025

> 1. Did we try values larger than 16 KB? It may not be a popular value size, but I'm wondering if we see degradation higher than 1-2%.
  • We've now extended benchmarks to include larger values (up to 4096KB). The maximum performance degradation observed was less than 2%. See the table below for detailed results.
  • Since I observed significant variations in benchmark results when testing on a disk-based file system, all subsequent tests were conducted on a memory file system to eliminate the distorting effects of disk I/O on benchmark outcomes.
| value_size * key_num = total_size | Loading time before optimization (s) | Loading time after optimization (s) | Performance Boost |
|---|---|---|---|
| 4KB*262144=1GB | 1.759 | 1.772 | -0.7% |
| 4KB*524288=2GB | 3.510 | 3.541 | -0.9% |
| 4KB*1048576=4GB | 6.992 | 7.071 | -1.1% |
| 4KB*2097152=8GB | 13.953 | 14.114 | -1.2% |
| 4KB*4194304=16GB | 27.834 | 28.205 | -1.3% |
| 16KB*65536=1GB | 1.301 | 1.303 | -0.2% |
| 16KB*131072=2GB | 2.594 | 2.605 | -0.4% |
| 16KB*262144=4GB | 5.139 | 5.191 | -1.0% |
| 16KB*524288=8GB | 10.319 | 10.356 | -0.4% |
| 16KB*1048576=16GB | 20.624 | 20.750 | -0.6% |
| 64KB*16384=1GB | 0.987 | 0.988 | -0.1% |
| 64KB*32768=2GB | 1.972 | 1.979 | -0.4% |
| 64KB*65536=4GB | 3.929 | 3.938 | -0.2% |
| 64KB*131072=8GB | 7.814 | 7.831 | -0.2% |
| 64KB*262144=16GB | 15.617 | 15.657 | -0.3% |
| 256KB*4096=1GB | 0.910 | 0.913 | -0.3% |
| 256KB*8192=2GB | 1.814 | 1.817 | -0.2% |
| 256KB*16384=4GB | 3.825 | 3.838 | -0.3% |
| 256KB*32768=8GB | 7.624 | 7.657 | -0.4% |
| 256KB*65536=16GB | 15.303 | 15.320 | -0.1% |
| 1024KB*1024=1GB | 0.899 | 0.898 | 0.1% |
| 1024KB*2048=2GB | 1.785 | 1.787 | -0.1% |
| 1024KB*4096=4GB | 3.566 | 3.571 | -0.1% |
| 1024KB*8192=8GB | 7.122 | 7.127 | -0.1% |
| 1024KB*16384=16GB | 14.221 | 14.210 | 0.1% |
| 4096KB*256=1GB | 0.894 | 0.895 | -0.1% |
| 4096KB*512=2GB | 1.770 | 1.778 | -0.5% |
| 4096KB*1024=4GB | 3.522 | 3.535 | -0.4% |
| 4096KB*2048=8GB | 7.074 | 7.045 | 0.4% |
| 4096KB*4096=16GB | 14.000 | 14.044 | -0.3% |

@chzhoo (Contributor, Author) commented Jul 9, 2025

> 2. Is there a way to write tests for this? Just to be sure that it doesn't break.

@chzhoo (Contributor, Author) commented Jul 9, 2025

> 3. Also, did you try backward compatibility? Basically something like a version upgrade. Ideally it shouldn't break, but it might be nice to try it anyway.

  • This PR introduces no changes to the RDB format.
  • Memory prefetching operates as a read-only mechanism.

I consider this update fundamentally safe. If I missed anything, please let me know.

@sarthakaggarwal97 (Contributor)

Thanks @chzhoo for the benchmarks. The numbers look good to me!

Signed-off-by: chzhoo <czawyx@163.com>
src/rdb.c Outdated
@@ -3300,6 +3300,8 @@ int rdbLoadRioWithLoadingCtx(rio *rdb, int rdbflags, rdbSaveInfo *rsi, rdbLoadin

/* Read key */
if ((key = rdbGenericLoadStringObject(rdb, RDB_LOAD_SDS, NULL)) == NULL) goto eoferr;
/* Prefetch the corresponding hashtable bucket to reduce cache misses. */
dbPrefetch(db, key);
Contributor

By doing the prefetch, we are repeating some work we will need to do later:

  1. getKVStoreIndexForKey/getKeySlot (inside dbPrefetch and dbAddRDBLoad)
  2. hashKey (inside kvstoreHashtablePrefetch and dictFindPositionForInsert)

In cluster mode, 1 could be quite expensive since we will need to do the CRC hash and lookup the hashtable.

getKeySlot will attempt to cache the keyslot on server.current_client, but I don't think we set that during RDB load, so this would double the getKeySlot overhead.

I think we could set server.current_client to a fake client (similar to AOF load), which would allow that caching mechanism to be used and reduce the overhead there. Alternatively, you could return the keyslot from dbPrefetch and pass it through to the RDB load later.
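
As an illustration of the second option, a minimal sketch with hypothetical names (the PR's eventual signatures may differ): the slot computed for the prefetch is handed back so the insertion path does not recompute it.

/* Hypothetical sketch, not the PR's code: return the slot computed during the
 * prefetch so the later insertion can reuse it instead of calling
 * getKVStoreIndexForKey/getKeySlot a second time. */
int dbPrefetchReturningSlot(serverDb *db, sds key) {
    int dict_index = getKVStoreIndexForKey(key); /* CRC-based slot lookup in cluster mode */
    kvstoreHashtablePrefetch(db->keys, dict_index, key);
    return dict_index; /* caller passes this through to the RDB-load insertion path */
}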

Contributor Author

I did overlook the fact that prefetch would trigger an extra CRC32 calculation when cluster mode is enabled. Thanks for pointing this out. I'll rerun the benchmark with cluster mode enabled to evaluate the performance impact, then finalize our optimization plan accordingly.

/* It is a helper function that prefetches the corresponding hashtable bucket
* into the CPU cache before batch key insertion operations, reducing memory latency. */
void dbPrefetch(serverDb *db, sds key) {
int dict_index = getKVStoreIndexForKey(key);
Contributor

Was the benchmark done with cluster mode enabled? If not, I think we might see different results with cluster mode since getKVStoreIndexForKey short-circuits when cluster mode is disabled.

@chzhoo (Contributor, Author), Jul 24, 2025

The previous benchmark was done with cluster mode disabled. I'll turn on cluster mode and run the benchmark again.

@chzhoo (Contributor, Author), Jul 24, 2025

  • I reran the benchmark with larger values (greater than 4KB) and cluster mode enabled.
  • The reason for not testing small values is that the performance degradation is more pronounced in large-value scenarios.
  • The maximum performance degradation is 2.3%. I'm hesitant about whether increasing code complexity is worthwhile for a minor improvement.
| value_size * key_num = total_size | Loading time before optimization (s) | Loading time after optimization (s) | Performance Boost |
|---|---|---|---|
| 4KB*262144=1GB | 1.810 | 1.851 | -2.3% |
| 4KB*524288=2GB | 3.692 | 3.713 | -0.6% |
| 4KB*1048576=4GB | 7.230 | 7.361 | -1.8% |
| 4KB*2097152=8GB | 14.395 | 14.635 | -1.7% |
| 4KB*4194304=16GB | 28.784 | 29.295 | -1.8% |
| 16KB*65536=1GB | 1.309 | 1.319 | -0.8% |
| 16KB*131072=2GB | 2.603 | 2.649 | -1.8% |
| 16KB*262144=4GB | 5.220 | 5.283 | -1.2% |
| 16KB*524288=8GB | 10.412 | 10.510 | -0.9% |
| 16KB*1048576=16GB | 20.750 | 21.063 | -1.5% |
| 64KB*16384=1GB | 0.988 | 0.995 | -0.7% |
| 64KB*32768=2GB | 1.963 | 1.974 | -0.6% |
| 64KB*65536=4GB | 3.922 | 3.942 | -0.5% |
| 64KB*131072=8GB | 7.833 | 7.876 | -0.5% |
| 64KB*262144=16GB | 15.652 | 15.746 | -0.6% |
| 256KB*4096=1GB | 0.905 | 0.906 | -0.1% |
| 256KB*8192=2GB | 1.808 | 1.808 | 0.0% |
| 256KB*16384=4GB | 3.829 | 3.841 | -0.3% |
| 256KB*32768=8GB | 7.662 | 7.672 | -0.1% |
| 256KB*65536=16GB | 15.286 | 15.320 | -0.2% |
| 1024KB*1024=1GB | 0.896 | 0.896 | 0.0% |
| 1024KB*2048=2GB | 1.784 | 1.786 | -0.1% |
| 1024KB*4096=4GB | 3.564 | 3.581 | -0.5% |
| 1024KB*8192=8GB | 7.115 | 7.161 | -0.6% |
| 1024KB*16384=16GB | 14.191 | 14.253 | -0.4% |
| 4096KB*256=1GB | 0.882 | 0.883 | -0.1% |
| 4096KB*512=2GB | 1.774 | 1.775 | -0.1% |
| 4096KB*1024=4GB | 3.535 | 3.536 | -0.0% |
| 4096KB*2048=8GB | 7.041 | 7.068 | -0.4% |
| 4096KB*4096=16GB | 14.050 | 14.087 | -0.3% |

Contributor

> The reason for not testing small values is that the performance degradation is more pronounced in large-value scenarios.

The cost of the CRC hash is O(n) on the key name length - not on the value size. What is the key name length you are using?

If key name length is fixed across tests - I would expect small values may be even more impacted by the CRC hash work being repeated - since the hash computation will be more expensive in comparison to the rest of the loading operation. Also consider that 1GiB of small values would require many more CRC hashes than 1GiB of large values (since it will contain more keys)
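
For a rough sense of scale using the tables above: 1 GiB of 8 B values corresponds to roughly 134 million keys, i.e. roughly 134 million extra CRC/hash computations if each one is repeated, whereas 1 GiB of 4096 KB values corresponds to only 256 keys.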

Would it be possible to check the lower value sizes with cluster mode on for comparison?

Contributor Author

> The cost of the CRC hash is O(n) on the key name length - not on the value size. What is the key name length you are using?

I generated the RDB file for benchmark using the following command, where key:000552565592 is an example key (16 bytes in size):

valkey-benchmark -t set -n $KEY_NUM -r 1000000000 -d $VALUE_SIZE -P 16

> If key name length is fixed across tests - I would expect small values may be even more impacted by the CRC hash work being repeated - since the hash computation will be more expensive in comparison to the rest of the loading operation. Also consider that 1GiB of small values would require many more CRC hashes than 1GiB of large values (since it will contain more keys)
>
> Would it be possible to check the lower value sizes with cluster mode on for comparison?

  • I'll add benchmarks for smaller value sizes with cluster mode enabled later.
  • I understand your concern that the additional CRC32/hash computations may impact performance. As a solution, I plan to modify the code to implement caching for CRC32/hash values, thereby reducing redundant calculations.

@chzhoo (Contributor, Author), Jul 26, 2025

  • I have implemented caching of key slots and hash values via dbPrefetch return parameters, eliminating redundant computations. The code is now ready for review.

  • After rerunning benchmarks, we observed performance improvements of approximately 10.9% in some scenarios, while no test cases now exhibit significant degradation. This addresses the previously noted ~2% performance degradation in specific cases. Detailed results are presented in the table below.

    Benchmark results with cluster mode disabled

    | value_size * key_num = total_size | Loading time before optimization (s) | Loading time after optimization (s) | Performance Boost |
    |---|---|---|---|
    | 8B*134217728=1GB | 84.954 | 75.715 | 10.9% |
    | 8B*268435456=2GB | 163.619 | 147.234 | 10.0% |
    | 32B*33554432=1GB | 23.215 | 20.958 | 9.7% |
    | 32B*67108864=2GB | 46.513 | 42.253 | 9.2% |
    | 256B*4194304=1GB | 4.453 | 4.129 | 7.3% |
    | 256B*8388608=2GB | 8.955 | 8.324 | 7.0% |
    | 4KB*262144=1GB | 1.752 | 1.727 | 1.4% |
    | 4KB*1048576=4GB | 7.004 | 6.944 | 0.9% |
    | 4KB*4194304=16GB | 28.029 | 27.641 | 1.4% |
    | 16KB*65536=1GB | 1.299 | 1.292 | 0.5% |
    | 16KB*262144=4GB | 5.150 | 5.141 | 0.2% |
    | 16KB*1048576=16GB | 20.588 | 20.587 | 0.0% |
    | 64KB*16384=1GB | 0.977 | 0.983 | -0.6% |
    | 64KB*65536=4GB | 3.905 | 3.918 | -0.3% |
    | 64KB*262144=16GB | 15.634 | 15.629 | 0.0% |
    | 256KB*4096=1GB | 0.909 | 0.912 | -0.3% |
    | 256KB*16384=4GB | 3.629 | 3.628 | 0.0% |
    | 256KB*65536=16GB | 14.399 | 14.412 | -0.1% |
    | 1024KB*1024=1GB | 0.888 | 0.891 | -0.3% |
    | 1024KB*4096=4GB | 3.542 | 3.545 | -0.1% |
    | 1024KB*16384=16GB | 14.152 | 14.150 | 0.0% |
    | 4096KB*256=1GB | 0.883 | 0.884 | -0.1% |
    | 4096KB*1024=4GB | 3.518 | 3.528 | -0.3% |
    | 4096KB*4096=16GB | 14.058 | 14.047 | 0.1% |

    Benchmark results with cluster mode enabled

    | value_size * key_num = total_size | Loading time before optimization (s) | Loading time after optimization (s) | Performance Boost |
    |---|---|---|---|
    | 8B*134217728=1GB | 101.177 | 90.707 | 10.3% |
    | 8B*268435456=2GB | 195.051 | 176.312 | 9.6% |
    | 32B*33554432=1GB | 27.303 | 25.009 | 8.4% |
    | 32B*67108864=2GB | 54.820 | 49.913 | 9.0% |
    | 256B*4194304=1GB | 5.046 | 4.743 | 6.0% |
    | 256B*8388608=2GB | 10.083 | 9.495 | 5.8% |
    | 4KB*262144=1GB | 1.800 | 1.791 | 0.5% |
    | 4KB*1048576=4GB | 7.215 | 7.148 | 0.9% |
    | 4KB*4194304=16GB | 28.806 | 28.508 | 1.0% |
    | 16KB*65536=1GB | 1.310 | 1.313 | -0.2% |
    | 16KB*262144=4GB | 5.230 | 5.238 | -0.2% |
    | 16KB*1048576=16GB | 20.873 | 20.864 | 0.0% |
    | 64KB*16384=1GB | 0.984 | 0.983 | 0.1% |
    | 64KB*65536=4GB | 3.953 | 3.939 | 0.4% |
    | 64KB*262144=16GB | 15.693 | 15.749 | -0.4% |
    | 256KB*4096=1GB | 0.910 | 0.910 | 0.0% |
    | 256KB*16384=4GB | 3.610 | 3.609 | 0.0% |
    | 256KB*65536=16GB | 14.529 | 14.469 | 0.4% |
    | 1024KB*1024=1GB | 0.888 | 0.895 | -0.8% |
    | 1024KB*4096=4GB | 3.571 | 3.572 | -0.0% |
    | 1024KB*16384=16GB | 14.172 | 14.146 | 0.2% |
    | 4096KB*256=1GB | 0.885 | 0.886 | -0.1% |
    | 4096KB*1024=4GB | 3.525 | 3.528 | -0.1% |
    | 4096KB*4096=16GB | 14.064 | 14.060 | 0.0% |
  • For details of the benchmark, see the benchmark script below.

set -ex 

# value_size(byte) * key_num
TEST_CASES=(
    # Small-value scenarios (8B to 256B)
    "8*134217728" "8*268435456"
    "32*33554432" "32*67108864"
    "256*4194304" "256*8388608"

    # Large-value scenarios (4KB to 4MB)
    "4096*262144" "4096*1048576" "4096*4194304"
    "16384*65536" "16384*262144" "16384*1048576"
    "65536*16384" "65536*65536" "65536*262144"
    "262144*4096" "262144*16384" "262144*65536"
    "1048576*1024" "1048576*4096" "1048576*16384"
    "4194304*256" "4194304*1024" "4194304*4096"
)

CLUSTER_ENABLED=yes
REPEAT_TESTS=5
OPTIMIZE_BRANCH_VALKEY_SERVER="./opt/src/valkey-server"
UNSTABLE_BRANCH_VALKEY_SERVER="./unstable/src/valkey-server"
VALKEY_CLI="./opt/src/valkey-cli"
VALKEY_BENCHMARK="./opt/src/valkey-benchmark"

wait_for_valkey_server_exit() {
    while true; do
        count=`ps aux | grep valkey-server | grep -v grep | wc -l`
        if [ "$count" -eq 0 ]; then
            break
        fi
        sleep 1
    done
    return 0
}

wait_for_loading_complete() {
    while true; do
        count=`$VALKEY_CLI info | grep -P "^loading:0" | wc -l`
        if [ "$count" -eq 1 ]; then
            break
        fi
        sleep 1
    done
}

run_benchmark() {
    local val_size=$1
    local key_num=$2
    local branch=$3
    local server_bin=$4
    local iteration=$5
    local log_file="./${branch}_log"

    echo "$branch val_size $val_size key_num $key_num time $iteration" >> $log_file

    $server_bin --cluster-enabled $CLUSTER_ENABLED --save '' --appendonly no --daemonize yes --logfile "$log_file" --pidfile ""
    sleep 2
    wait_for_loading_complete
    $VALKEY_CLI shutdown
    wait_for_valkey_server_exit
}

# Main test loop
for test_case in "${TEST_CASES[@]}"; do
    IFS='*' read -r val_size key_num <<< "$test_case"
    echo $val_size $key_num

    # Clean previous data
    rm -rf dump.rdb

    # Prepare test data
    $OPTIMIZE_BRANCH_VALKEY_SERVER --save '' --appendonly no --daemonize yes --pidfile ""
    sleep 2
    $VALKEY_CLI flushall
    $VALKEY_BENCHMARK -t set -n $key_num -r 10000000000 -d $val_size -P 32 -c 2
    $VALKEY_CLI save
    $VALKEY_CLI shutdown
    wait_for_valkey_server_exit

    # Run tests for the optimize branch
    for ((x = 0; x < $REPEAT_TESTS; x += 1)); do
        run_benchmark $val_size $key_num "optimize" $OPTIMIZE_BRANCH_VALKEY_SERVER $x
    done

    # Run tests for the unstable branch
    for ((x = 0; x < $REPEAT_TESTS; x += 1)); do
        run_benchmark $val_size $key_num "unstable" $UNSTABLE_BRANCH_VALKEY_SERVER $x
    done
done
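
Note that the script itself does not extract the timing; presumably each run's loading time is read from the server log captured in the per-branch log file (e.g. a "DB loaded from disk: ... seconds" style line emitted after a successful RDB load), and the minimum over the repeated runs is then taken as described in the steps above.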

@chzhoo chzhoo changed the title Accelerate RDB loading by prefetching hashtable buckets. Accelerate RDB loading by prefetching hashtable buckets Jul 26, 2025