Suboptimal P99 latency under pipelining workloads #4998

Open

romange opened this issue Apr 25, 2025 · 3 comments · May be fixed by #4994
Labels: enhancement (New feature or request)

Comments

@romange
Collaborator

romange commented Apr 25, 2025

We have not really focused on the combination of P99 latency and pipelining, especially in the context
of uncoordinated-omission access patterns.

While investigating the elevated-latency phenomenon, which shows up even under light CPU load, I noticed the following:

  1. The connection fiber yields after reading only 11 requests from the socket. This caps the batch size and limits how much we can optimize further down the road when processing batches of requests.
  2. When reading from the socket we cap the read buffer at 4KB, unless a single request requires more. This means that when reading a series of small requests we can pull at most as many requests as fit into 4KB, which hurts pipelining efficiency in a similar way.
  3. In cluster mode, we shard requests by hash tags. The Valkey Cluster spec allows a multi-key command to touch multiple tags as long as they all map to the same slot; e.g. `mget {foo}aaa {bar}aaa` is allowed as long as both `{foo}` and `{bar}` hash to the same slot. Dragonfly processes such requests correctly, but since we shard by tag rather than by slot, these commands run on multiple shards. Unfortunately, pipelining optimizations do not work well with multi-shard commands and lose their efficiency there (can be fixed in the future). A sketch of the tag-to-slot mapping appears after this list.
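
For context, here is a minimal sketch of the tag-to-slot mapping defined by the Valkey/Redis Cluster spec: the slot is `CRC16(tag) mod 16384`, where the tag is the content of the first non-empty `{...}` in the key, or the whole key if there is none. The names `Crc16`, `ExtractTag`, and `KeySlot` are illustrative, not Dragonfly's actual identifiers:

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>

// CRC16-CCITT (XMODEM), the variant the cluster spec mandates for slot hashing.
static uint16_t Crc16(std::string_view data) {
  uint16_t crc = 0;
  for (unsigned char c : data) {
    crc ^= static_cast<uint16_t>(c) << 8;
    for (int i = 0; i < 8; ++i)
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
  }
  return crc;
}

// Per the spec: hash only the content of the first non-empty {...} if present;
// otherwise hash the whole key.
static std::string_view ExtractTag(std::string_view key) {
  size_t open = key.find('{');
  if (open == std::string_view::npos) return key;
  size_t close = key.find('}', open + 1);
  if (close == std::string_view::npos || close == open + 1) return key;
  return key.substr(open + 1, close - open - 1);
}

static uint16_t KeySlot(std::string_view key) {
  return Crc16(ExtractTag(key)) % 16384;  // 16384 slots in the cluster
}

int main() {
  // Both keys carry the tag "foo", so they land in the same slot.
  std::printf("%u %u\n", unsigned(KeySlot("{foo}aaa")), unsigned(KeySlot("{foo}bbb")));
  // Different tags generally hash to different slots.
  std::printf("%u %u\n", unsigned(KeySlot("{foo}aaa")), unsigned(KeySlot("{bar}aaa")));
}
```

Sharding by this slot id, rather than by raw tag, is what the linked PR switches cluster mode to.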
@romange romange added the enhancement New feature or request label Apr 25, 2025
romange added a commit that referenced this issue Apr 25, 2025
Addresses #4998.
Removes unnecessary yielding that hampers pipeline efficiency. Moreover, it also increases the socket buffer, effectively allowing more requests to be processed in bulk.

Finally, it changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange romange linked a pull request Apr 25, 2025 that will close this issue
romange added a commit that referenced this issue Apr 25, 2025
Addresses #4998.
1. Removes unnecessary yielding when reading multiple requests, since it hampers pipeline efficiency.
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.
3. Changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange
Collaborator Author

romange commented Apr 25, 2025

Regarding (1): we should still yield in the connection fiber to let the AsyncFiber unload requests - otherwise we will keep reading from the socket until all the data is read or we reach the pipelining limit. That is sub-optimal because we lose the opportunity to kick off the pipeline in parallel with reading from the socket.

There are two possible solutions:

  1. Simple: add a configurable counter limit and yield like we do today (the default of 11 is probably still too low).
  2. Add a more sophisticated self-tuning heuristic (see the sketch after this list). For that we could take into account:
    a. A hardcoded upper limit (1000?).
    b. Dispatch queue memory usage: yield once it exceeds a certain threshold.
    c. How many commands we can squash in one step: there is no point in going much beyond this number if we would just end up dividing the dispatch stream into smaller batches for squashing.
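
For illustration, here is a minimal sketch of the CPU-time-based variant that the later commits describe (yielding based on CPU time spent since the last resume point, controlled by a flag). `YieldChecker`, `kDefaultCpuBudgetUsec`, and the budget value are assumptions for the sketch, not Dragonfly's actual API:

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical default; in the PR the budget comes from a flag with sane defaults.
constexpr uint64_t kDefaultCpuBudgetUsec = 500;

// CPU time consumed by the calling thread, in microseconds (Linux/POSIX).
static uint64_t ThreadCpuUsec() {
  timespec ts;
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  return uint64_t(ts.tv_sec) * 1'000'000 + uint64_t(ts.tv_nsec) / 1'000;
}

// Tracks CPU time spent since the last resume point. The connection fiber's
// read loop consults it instead of counting requests (the old "yield after
// 11 requests" rule), so yields track actual work done, not request count.
class YieldChecker {
 public:
  void OnResume() { start_usec_ = ThreadCpuUsec(); }

  bool ShouldYield() const {
    return ThreadCpuUsec() - start_usec_ >= kDefaultCpuBudgetUsec;
  }

 private:
  uint64_t start_usec_ = 0;
};
```

The read loop would call `OnResume()` whenever the connection fiber is resumed, parse requests until `ShouldYield()` returns true, then yield so the AsyncFiber can start draining the dispatch queue in parallel with further socket reads.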

@romange
Collaborator Author

romange commented Apr 25, 2025

Finally, whatever we do, we should add a comment explaining why we do it and link this issue for more details.

romange added a commit that referenced this issue Apr 26, 2025
Addresses #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point.
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.
3. Changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
romange added a commit that referenced this issue Apr 26, 2025
Addresses #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point (via a flag with sane defaults).
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.

Before this PR:
`./dragonfly  --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266

After this PR:
`./dragonfly  --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
romange added a commit that referenced this issue Apr 26, 2025
Fixes #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point (via a flag with sane defaults).
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.

`./dragonfly  --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266

`./dragonfly  --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange
Collaborator Author

romange commented Apr 27, 2025

Here is a general overview of the flow, plus the bugs we found:

[Image: diagram of the connection/dispatch flow annotated with the bugs found]

@romange romange self-assigned this Apr 27, 2025