IO Driver sensitive to scheduling delays #7612
Unanswered
olivergillespie
asked this question in General
TL;DR: Scheduling delays are common, and Tokio’s IO driver is especially sensitive to them. Can/should Tokio do anything about it?
Overview
Scheduling delays over 1ms are common [1] in many typical Linux configurations.
Scheduling delays that hit Tokio workers are usually limited to impacting one task, but a delay to the IO driver can block progress for some or all worker threads, since no IO events can be delivered until it runs again. The IO driver is also especially likely to experience scheduling delays: the blocking epoll_wait syscall in every turn means the driver thread sleeps and must be rescheduled on each wakeup, giving the kernel scheduler a fresh opportunity to delay it.
Thus, the expected impact of scheduling delays is amplified for Tokio applications that frequently perform IO.
Demonstration
(TODO: minimal repro with a busy background system, the IO worker hit by a delay, and all in-flight requests paused; a rough sketch follows.)
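A hypothetical sketch of what such a repro might look like (not from the original post): it saturates every core with plain spinning threads rather than reproducing a specific scheduler configuration, so the delays observed will vary by machine. Assumes the `tokio` crate with the "full" feature set.

```rust
use std::time::{Duration, Instant};
use tokio::io::{AsyncReadExt, AsyncWriteExt};

fn main() {
    // Busy background system: one spinning thread per core competes with the
    // runtime's threads, including whichever one is driving epoll.
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    for _ in 0..cores {
        std::thread::spawn(|| loop {
            std::hint::spin_loop();
        });
    }

    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        // Loopback echo server.
        let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
        let addr = listener.local_addr().unwrap();
        tokio::spawn(async move {
            loop {
                let (mut sock, _) = listener.accept().await.unwrap();
                tokio::spawn(async move {
                    let (mut r, mut w) = sock.split();
                    let _ = tokio::io::copy(&mut r, &mut w).await;
                });
            }
        });

        // Many concurrent in-flight requests. If the driver thread is delayed,
        // the expectation is a correlated burst of slow responses, not an
        // isolated outlier on a single task.
        let mut handles = Vec::new();
        for _ in 0..32 {
            handles.push(tokio::spawn(async move {
                let mut worst = Duration::ZERO;
                for _ in 0..500 {
                    let start = Instant::now();
                    let mut s = tokio::net::TcpStream::connect(addr).await.unwrap();
                    s.write_all(b"ping").await.unwrap();
                    let mut buf = [0u8; 4];
                    s.read_exact(&mut buf).await.unwrap();
                    worst = worst.max(start.elapsed());
                }
                worst
            }));
        }
        for h in handles {
            println!("worst latency: {:?}", h.await.unwrap());
        }
    });
}
```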
We see below, with data from my application, that scheduling delays usually have little impact, but delays to the IO driver reliably cause bursts of slow requests. The IO driver is especially sensitive to scheduling delays.
Requests are marked as individual green points on the scatter plot with their end time and latency. Scheduling delays (recorded with `bpftrace`) over 2ms affecting any of the Tokio worker threads are overlaid as red bands. Time spent in `io::driver::Driver::turn` is also traced with `bpftrace` and overlaid in blue. Where this overlaps with a scheduling delay, it shows as purple.

Mitigations
Scheduling delays can be reduced by tuning the kernel scheduler, using CPU affinity and isolation, changing scheduling priorities, or using a real-time scheduling policy. These may partially (tuning) or entirely (isolation) avoid scheduling delays for Tokio workers. Each of these has tradeoffs, though, and requires manual, sometimes brittle configuration; a sketch of the affinity option follows.
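As an illustration of the affinity option, a minimal sketch, assuming Linux, the `libc` crate, and that cores 0-1 have been reserved for the runtime (e.g. via `isolcpus`); real code should check return values:

```rust
use tokio::runtime::Builder;

fn main() {
    let rt = Builder::new_multi_thread()
        .enable_all()
        // Runs on every runtime thread as it starts, so the IO driver is
        // covered no matter which thread ends up driving it.
        .on_thread_start(|| unsafe {
            let mut set: libc::cpu_set_t = std::mem::zeroed();
            libc::CPU_ZERO(&mut set);
            libc::CPU_SET(0, &mut set); // assumed reserved core
            libc::CPU_SET(1, &mut set); // assumed reserved core
            // pid 0 means "the current thread".
            let _ = libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
        })
        .build()
        .unwrap();
    rt.block_on(async { /* application */ });
}
```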
Splitting one runtime into several reduces the maximum impact of any one delay, since each runtime has its own IO driver, but it adds configuration complexity while reducing efficiency and limiting work-stealing (see the sketch below).
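A minimal sketch of the several-runtimes layout (hypothetical; the sharding policy is left abstract):

```rust
fn main() {
    let make_rt = || {
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(2)
            .enable_all()
            .build()
            .unwrap()
    };
    // Two independent runtimes, each with its own IO driver, so a scheduling
    // delay to one driver only stalls the connections homed on that runtime.
    let rt_a = make_rt();
    let rt_b = make_rt();

    // Tasks spawned on rt_a can never be stolen by rt_b's workers, which is
    // the work-stealing / efficiency cost mentioned above.
    let a = rt_a.spawn(async { /* serve shard A of the connections */ });
    let b = rt_b.spawn(async { /* serve shard B of the connections */ });

    rt_a.block_on(a).unwrap();
    rt_b.block_on(b).unwrap();
}
```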
Detection
TODO: details. Users can monitor scheduling delays with `perf sched` or the sched tracepoints via `bpftrace` or similar. These can be correlated with request latencies, and with IO driver execution times by adding further tracing.
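For example, a hedged `bpftrace` sketch. The 2ms threshold and the thread name are assumptions (Tokio's default worker thread name, truncated to the kernel's 15-character comm limit, is "tokio-runtime-w"), and it approximates scheduling delay as wakeup-to-run time, so it misses delays from preemption:

```
// Record when a Tokio worker thread becomes runnable.
tracepoint:sched:sched_wakeup
/str(args->comm) == "tokio-runtime-w"/
{
	@woke[args->pid] = nsecs;
}

// Report if it took more than 2ms to actually get back on CPU.
tracepoint:sched:sched_switch
/@woke[args->next_pid]/
{
	$delay_us = (nsecs - @woke[args->next_pid]) / 1000;
	delete(@woke[args->next_pid]);
	if ($delay_us > 2000) {
		printf("pid %d delayed %d us\n", args->next_pid, $delay_us);
	}
}
```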
Discussion
Tokio cannot avoid scheduling delays, but it may be able to reduce their impact where its own architectural choices amplify the problem, which appears to be the case with the IO driver.
Mitigations may reduce performance for well-behaved systems, or trade throughput for better tail latency, so they may need to be user-configurable.
[1] The Linux scheduler's tick rate is set by CONFIG_HZ; 100, 250, and 1000 Hz are the common options, corresponding to tick periods of 10ms, 4ms, and 1ms respectively. Along with a variety of other details, this means in practice that a busy M:N system (M runnable threads competing for N cores, with M > N) is bound to frequently experience 1ms+ delays, and this is not a bug.
Replies: 1 comment 1 reply

My initial reaction to this is that you might be putting too much work on a single machine.