Suboptimal P99 latency under pipelining workloads #4998

Open

romange opened this issue Apr 25, 2025 · 3 comments · May be fixed by #4994
Labels: enhancement (New feature or request)

Comments

@romange
Collaborator

romange commented Apr 25, 2025

We have not really focused on the combination of P99 latency and pipelining, especially in the context
of uncoordinated-omission access patterns.

While investigating the elevated-latency phenomenon, which shows up even under light CPU load, I noticed the following:

  1. The connection fiber yields after reading only 11 requests from the socket. This caps the batch size and limits how much we can optimize further down the road when processing batches of requests.
  2. When reading from the socket we cap the read buffer at 4KB, unless a single request requires more. This means that when reading a series of small requests we can pull at most as many requests as fit into 4KB, which hurts pipelining efficiency in a similar way.
  3. In cluster mode, we shard requests by hash tags. The Valkey Cluster spec allows a multi-key command to touch multiple tags as long as they all map to the same slot; e.g. `mget {foo}aaa {bar}aaa` is allowed as long as both `{foo}` and `{bar}` hash to the same slot. Dragonfly processes such requests correctly, but since we shard by tag rather than by slot, these commands run on multiple shards. Unfortunately, pipelining optimizations do not work well with multi-shard commands and lose their efficiency there (can be fixed in the future). A sketch of the tag-to-slot mapping appears after this list.
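
For context, here is a minimal sketch of the tag-to-slot mapping defined by the Valkey/Redis Cluster spec: the slot is `CRC16(tag) mod 16384`, where the tag is the content of the first non-empty `{...}` in the key, or the whole key if there is none. The names `Crc16`, `ExtractTag`, and `KeySlot` are illustrative, not Dragonfly's actual identifiers:

```cpp
#include <cstdint>
#include <cstdio>
#include <string_view>

// CRC16-CCITT (XMODEM), the variant the cluster spec mandates for slot hashing.
static uint16_t Crc16(std::string_view data) {
  uint16_t crc = 0;
  for (unsigned char c : data) {
    crc ^= static_cast<uint16_t>(c) << 8;
    for (int i = 0; i < 8; ++i)
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
  }
  return crc;
}

// Per the spec: hash only the content of the first non-empty {...} if present;
// otherwise hash the whole key.
static std::string_view ExtractTag(std::string_view key) {
  size_t open = key.find('{');
  if (open == std::string_view::npos) return key;
  size_t close = key.find('}', open + 1);
  if (close == std::string_view::npos || close == open + 1) return key;
  return key.substr(open + 1, close - open - 1);
}

static uint16_t KeySlot(std::string_view key) {
  return Crc16(ExtractTag(key)) % 16384;  // 16384 slots in the cluster
}

int main() {
  // Both keys carry the tag "foo", so they land in the same slot.
  std::printf("%u %u\n", unsigned(KeySlot("{foo}aaa")), unsigned(KeySlot("{foo}bbb")));
  // Different tags generally hash to different slots.
  std::printf("%u %u\n", unsigned(KeySlot("{foo}aaa")), unsigned(KeySlot("{bar}aaa")));
}
```

Sharding by this slot id, rather than by raw tag, is what the linked PR switches cluster mode to.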
@romange romange added the enhancement New feature or request label Apr 25, 2025
romange added a commit that referenced this issue Apr 25, 2025
Addresses #4998.
Removes unnecessary yielding that hampers pipeline efficiency. Moreover, it also increases the socket buffer, effectively allowing more requests to be processed in bulk.

Finally, it changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange romange linked a pull request Apr 25, 2025 that will close this issue
romange added a commit that referenced this issue Apr 25, 2025
Addresses #4998.
1. Removes unnecessary yielding when reading multiple requests, since it hampers pipeline efficiency.
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.
3. Changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange
Collaborator Author

romange commented Apr 25, 2025

Regarding (1): we should still yield in the connection fiber to let the AsyncFiber unload requests - otherwise we will keep reading from the socket until all the data is read or we reach the pipelining limit. That is sub-optimal because we lose the opportunity to kick off the pipeline in parallel with reading from the socket.

There are two possible solutions:

  1. Simple: add a configurable counter limit and yield like we do today (the default of 11 is probably still too low).
  2. Add a more sophisticated self-tuning heuristic (see the sketch after this list). For that we could take into account:
    a. A hardcoded upper limit (1000?).
    b. Dispatch queue memory usage: yield once it exceeds a certain threshold.
    c. How many commands we can squash in one step: there is no point in going much beyond this number if we would just end up dividing the dispatch stream into smaller batches for squashing.
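
For illustration, here is a minimal sketch of the CPU-time-based variant that the later commits describe (yielding based on CPU time spent since the last resume point, controlled by a flag). `YieldChecker`, `kDefaultCpuBudgetUsec`, and the budget value are assumptions for the sketch, not Dragonfly's actual API:

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical default; in the PR the budget comes from a flag with sane defaults.
constexpr uint64_t kDefaultCpuBudgetUsec = 500;

// CPU time consumed by the calling thread, in microseconds (Linux/POSIX).
static uint64_t ThreadCpuUsec() {
  timespec ts;
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  return uint64_t(ts.tv_sec) * 1'000'000 + uint64_t(ts.tv_nsec) / 1'000;
}

// Tracks CPU time spent since the last resume point. The connection fiber's
// read loop consults it instead of counting requests (the old "yield after
// 11 requests" rule), so yields track actual work done, not request count.
class YieldChecker {
 public:
  void OnResume() { start_usec_ = ThreadCpuUsec(); }

  bool ShouldYield() const {
    return ThreadCpuUsec() - start_usec_ >= kDefaultCpuBudgetUsec;
  }

 private:
  uint64_t start_usec_ = 0;
};
```

The read loop would call `OnResume()` whenever the connection fiber is resumed, parse requests until `ShouldYield()` returns true, then yield so the AsyncFiber can start draining the dispatch queue in parallel with further socket reads.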

@romange
Collaborator Author

romange commented Apr 25, 2025

Finally, whatever we do, we should add a comment explaining why we do it and link this issue for more details.

romange added a commit that referenced this issue Apr 26, 2025
Addresses #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point.
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.
3. Changes the sharding function for cluster mode to shard by slot id.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
romange added a commit that referenced this issue Apr 26, 2025
Addresses #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point (via a flag with sane defaults).
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.

Before this PR:
`./dragonfly  --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266

After this PR:
`./dragonfly  --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
romange added a commit that referenced this issue Apr 26, 2025
Fixes #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on CPU time spent since the last resume point (via a flag with sane defaults).
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.

`./dragonfly  --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266

`./dragonfly  --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange
Collaborator Author

romange commented Apr 27, 2025

Here is a general overview of the flow, plus the bugs we found:

[Image: diagram of the connection/dispatch flow annotated with the bugs found]

@romange romange self-assigned this Apr 27, 2025