==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior can be defined by a set of BPF
programs - the BPF scheduler.

* sched_ext exports a full scheduling interface so that any scheduling
  algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it sees fit and schedule them
  together, as tasks aren't tied to specific CPUs at the time of wakeup.

* The BPF scheduler can be turned on and off dynamically anytime.

* The system integrity is maintained no matter what the BPF scheduler does.
  The default scheduling behavior is restored anytime an error is detected,
  a runnable task stalls, or on invoking the SysRq key sequence
  :kbd:`SysRq-S`.

* When the BPF scheduler triggers an error, debug information is dumped to
  aid debugging. The debug dump is passed to and printed out by the
  scheduler binary. The debug dump can also be accessed through the
  `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
  triggers a debug dump. This doesn't terminate the BPF scheduler and can
  only be read through the tracepoint.

Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
``tools/sched_ext`` contains the example schedulers. The following config
options should be enabled to use sched_ext:

.. code-block:: none

  CONFIG_BPF=y
  CONFIG_SCHED_CLASS_EXT=y
  CONFIG_BPF_SYSCALL=y
  CONFIG_BPF_JIT=y
  CONFIG_DEBUG_INFO_BTF=y
  CONFIG_BPF_JIT_ALWAYS_ON=y
  CONFIG_BPF_JIT_DEFAULT_ON=y
  CONFIG_PAHOLE_HAS_SPLIT_BTF=y
  CONFIG_PAHOLE_HAS_BTF_TAG=y

sched_ext is used only when the BPF scheduler is loaded and running.

If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
loaded.

When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
``SCHED_EXT`` tasks are scheduled by sched_ext.

However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
``SCHED_IDLE`` policies are scheduled by CFS.

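The switching rules above can be condensed into a small decision function.
The following is an illustrative userspace C model, not kernel code; the
``POL_*`` enum values are local stand-ins for the kernel's policy constants:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <string.h>

  /* Local stand-ins for the scheduling policies discussed above. */
  enum policy { POL_NORMAL, POL_BATCH, POL_IDLE, POL_EXT };

  /*
   * Which scheduler class handles a task with the given policy, depending
   * on whether a BPF scheduler is loaded and whether
   * SCX_OPS_SWITCH_PARTIAL is set in ops->flags.
   */
  static const char *scheduler_for(enum policy p, bool scx_loaded,
                                   bool switch_partial)
  {
          if (!scx_loaded)
                  return "cfs";   /* SCHED_EXT is treated as SCHED_NORMAL */
          if (switch_partial)
                  return p == POL_EXT ? "sched_ext" : "cfs";
          return "sched_ext";     /* full switch: all four policies move over */
  }

  int main(void)
  {
          /* No BPF scheduler loaded: everything runs under CFS. */
          assert(!strcmp(scheduler_for(POL_EXT, false, false), "cfs"));
          /* Loaded, full switch: SCHED_NORMAL is handled by sched_ext. */
          assert(!strcmp(scheduler_for(POL_NORMAL, true, false), "sched_ext"));
          /* Loaded, partial switch: only SCHED_EXT tasks move over. */
          assert(!strcmp(scheduler_for(POL_NORMAL, true, true), "cfs"));
          assert(!strcmp(scheduler_for(POL_EXT, true, true), "sched_ext"));
          return 0;
  }

The realtime classes are omitted from the model; they are never handled by
sched_ext regardless of these flags.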
Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
detection of any internal error including stalled runnable tasks aborts the
BPF scheduler and reverts all tasks back to CFS.

.. code-block:: none

  # make -j16 -C tools/sched_ext
  # tools/sched_ext/scx_simple
  local=0 global=3
  local=5 global=24
  local=9 global=44
  local=13 global=56
  local=17 global=72
  ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be determined as follows:

.. code-block:: none

  # cat /sys/kernel/sched_ext/state
  enabled
  # cat /sys/kernel/sched_ext/root/ops
  simple

``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
detailed information:

.. code-block:: none

  # tools/sched_ext/scx_show_state.py
  ops           : simple
  enabled       : 1
  switching_all : 1
  switched_all  : 1
  enable_state  : enabled (2)
  bypass_depth  : 0
  nr_rejected   : 0

If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
be determined as follows:

.. code-block:: none

  # grep ext /proc/self/sched
  ext.enabled : 1

The Basics
==========

Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
programs that implement ``struct sched_ext_ops``. The only mandatory field
is ``ops.name`` which must be a valid BPF object name. All operations are
optional. The following modified excerpt is from
``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.

.. code-block:: c

  /*
   * Decide which CPU a task should be migrated to before being
   * enqueued (either at wakeup, fork time, or exec time). If an
   * idle core is found by the default ops.select_cpu() implementation,
   * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
   * ops.enqueue() callback.
   *
   * Note that this implementation has exactly the same behavior as the
   * default ops.select_cpu implementation. The behavior of the scheduler
   * would be exactly the same if the implementation just didn't define
   * the simple_select_cpu() struct_ops prog.
   */
  s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          s32 cpu;
          /* Need to initialize or the BPF verifier will reject the program */
          bool direct = false;

          cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);

          if (direct)
                  scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

          return cpu;
  }

  /*
   * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
   * callback will only be invoked if we failed to find a core to dispatch
   * to in ops.select_cpu() above.
   *
   * Note that this implementation has exactly the same behavior as the
   * default ops.enqueue implementation, which just dispatches the task
   * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly
   * the same if the implementation just didn't define the simple_enqueue
   * struct_ops prog.
   */
  void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
  {
          scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
  }

  s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
  {
          /*
           * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
           * SCHED_BATCH tasks should use sched_ext.
           */
          return 0;
  }

  void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
  {
          exit_type = ei->type;
  }

  SEC(".struct_ops")
  struct sched_ext_ops simple_ops = {
          .select_cpu = (void *)simple_select_cpu,
          .enqueue    = (void *)simple_enqueue,
          .init       = (void *)simple_init,
          .exit       = (void *)simple_exit,
          .name       = "simple",
  };

Dispatch Queues
---------------

To match the impedance between the scheduler core and the BPF scheduler,
sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DSQ. A task is "dispatched" to a
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to consume the
global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
is invoked.

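The dispatch/consume mechanics above can be modeled with a plain ring
buffer. The sketch below is illustrative userspace C, not the kernel's DSQ
implementation:

.. code-block:: c

  #include <assert.h>

  #define DSQ_CAP 16

  /* A toy FIFO DSQ: task IDs between head and tail are queued. */
  struct dsq {
          int tasks[DSQ_CAP];
          int head, tail;
  };

  /* "Dispatch" a task to a DSQ: append it at the tail. */
  static void dsq_dispatch(struct dsq *q, int task)
  {
          q->tasks[q->tail++ % DSQ_CAP] = task;
  }

  static int dsq_pop(struct dsq *q)
  {
          if (q->head == q->tail)
                  return -1;      /* empty */
          return q->tasks[q->head++ % DSQ_CAP];
  }

  /* "Consume" a non-local DSQ: move its first task to the local DSQ. */
  static int dsq_consume(struct dsq *from, struct dsq *local)
  {
          int task = dsq_pop(from);

          if (task >= 0)
                  dsq_dispatch(local, task);
          return task >= 0;
  }

  /* A CPU picks from its local DSQ first, then falls back to the global DSQ. */
  static int pick_next(struct dsq *local, struct dsq *global)
  {
          int task = dsq_pop(local);

          if (task < 0 && dsq_consume(global, local))
                  task = dsq_pop(local);
          return task;            /* -1 means: invoke ops.dispatch() */
  }

  int main(void)
  {
          struct dsq local = {{0}, 0, 0}, global = {{0}, 0, 0};

          dsq_dispatch(&global, 1);
          dsq_dispatch(&local, 2);
          assert(pick_next(&local, &global) == 2);  /* local DSQ wins */
          assert(pick_next(&local, &global) == 1);  /* then global is consumed */
          assert(pick_next(&local, &global) == -1); /* empty: ops.dispatch() next */
          return 0;
  }

The model deliberately leaves out locking and the priority-queue mode; it
only captures the FIFO ordering and the local-then-global consumption order.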
Scheduling Cycle
----------------

The following briefly shows how a waking task is scheduled and executed.

1. When a task is waking up, ``ops.select_cpu()`` is the first operation
   invoked. This serves two purposes. First, it provides a CPU selection
   optimization hint. Second, it wakes up the selected CPU if it is idle.

   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
   binding. The actual decision is made at the last step of scheduling.
   However, there is a small performance gain if the CPU
   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.

   A side-effect of selecting a CPU is waking it up from idle. While a BPF
   scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()``
   by calling ``scx_bpf_dispatch()``. If the task is dispatched to
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
   Additionally, dispatching directly from ``ops.select_cpu()`` will cause
   the ``ops.enqueue()`` callback to be skipped.

   Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.

2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
   can make one of the following decisions:

   * Immediately dispatch the task to either the global or local DSQ by
     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
     ``SCX_DSQ_LOCAL``, respectively.

   * Immediately dispatch the task to a custom DSQ by calling
     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.

   * Queue the task on the BPF side.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If
   empty, it then looks at the global DSQ. If there still isn't a task to
   run, ``ops.dispatch()`` is invoked which can use the following two
   functions to populate the local DSQ.

   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
     currently can't be called with BPF locks held, this is being worked on
     and will be supported. ``scx_bpf_dispatch()`` schedules dispatches
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_consume()`` transfers a task from the specified non-local
     DSQ to the dispatching DSQ. This function cannot be called with any
     BPF locks held. ``scx_bpf_consume()`` flushes the pending dispatched
     tasks before trying to consume the specified DSQ.

4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
   the CPU runs the first one. If empty, the following steps are taken:

   * Try to consume the global DSQ. If successful, run the task.

   * If ``ops.dispatch()`` has dispatched any tasks, retry step #3.

   * If the previous task is an SCX task and still runnable, keep executing
     it (see ``SCX_OPS_ENQ_LAST``).

   * Go idle.

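The decision ladder in step 4 can be summarized as a pure function over the
resulting state. This is an illustrative model of the ordering only, not
kernel code:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  enum action { RUN_LOCAL, RUN_GLOBAL, RETRY_DISPATCH, KEEP_PREV, GO_IDLE };

  /*
   * What a CPU does after ops.dispatch() returns, given the resulting
   * state. The order of the checks mirrors steps 3 and 4 above.
   */
  static enum action after_dispatch(bool local_nonempty, bool global_nonempty,
                                    bool dispatched_any, bool prev_runnable_scx)
  {
          if (local_nonempty)
                  return RUN_LOCAL;       /* run the first task in the local DSQ */
          if (global_nonempty)
                  return RUN_GLOBAL;      /* consume the global DSQ, run the task */
          if (dispatched_any)
                  return RETRY_DISPATCH;  /* ops.dispatch() queued tasks: retry #3 */
          if (prev_runnable_scx)
                  return KEEP_PREV;       /* see SCX_OPS_ENQ_LAST */
          return GO_IDLE;
  }

  int main(void)
  {
          assert(after_dispatch(true, true, true, true) == RUN_LOCAL);
          assert(after_dispatch(false, true, true, true) == RUN_GLOBAL);
          assert(after_dispatch(false, false, true, true) == RETRY_DISPATCH);
          assert(after_dispatch(false, false, false, true) == KEEP_PREV);
          assert(after_dispatch(false, false, false, false) == GO_IDLE);
          return 0;
  }

Each check short-circuits the ones below it, which is why a busy local DSQ
makes all the later conditions irrelevant.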
Note that the BPF scheduler can always choose to dispatch tasks immediately
in ``ops.enqueue()`` as illustrated in the above simple example. If only the
built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
a task is never queued on the BPF scheduler and both the local and global
DSQs are consumed automatically.

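The 2^63 bound on custom DSQ IDs in step 2 implies that the top bit of the
64-bit ID space is what distinguishes the built-in DSQs such as
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL``. A sketch of that check follows;
the ``DSQ_FLAG_BUILTIN`` name is illustrative, not necessarily the kernel's:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Built-in DSQ IDs live at or above 2^63 in this model. */
  #define DSQ_FLAG_BUILTIN (UINT64_C(1) << 63)

  /* A DSQ ID that a BPF scheduler may create must be smaller than 2^63. */
  static bool dsq_id_is_custom(uint64_t dsq_id)
  {
          return !(dsq_id & DSQ_FLAG_BUILTIN);
  }

  int main(void)
  {
          assert(dsq_id_is_custom(0));
          assert(dsq_id_is_custom((UINT64_C(1) << 63) - 1));
          assert(!dsq_id_is_custom(UINT64_C(1) << 63));
          return 0;
  }
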
``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
more information.

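The difference between FIFO and priority-queue dispatching can be
illustrated with a sorted-array model. This is a userspace sketch under
stated assumptions, not the kernel data structure:

.. code-block:: c

  #include <assert.h>
  #include <stdint.h>

  #define CAP 8

  /* A toy user DSQ holding task IDs with their virtual times. */
  struct vdsq {
          int task[CAP];
          uint64_t vtime[CAP];
          int n;
  };

  /* FIFO dispatch (like scx_bpf_dispatch()): append at the tail. */
  static void dispatch_fifo(struct vdsq *q, int task, uint64_t vtime)
  {
          q->task[q->n] = task;
          q->vtime[q->n] = vtime;
          q->n++;
  }

  /* vtime dispatch (like scx_bpf_dispatch_vtime()): smallest vtime first. */
  static void dispatch_vtime(struct vdsq *q, int task, uint64_t vtime)
  {
          int i = q->n++;

          while (i > 0 && q->vtime[i - 1] > vtime) {
                  q->task[i] = q->task[i - 1];
                  q->vtime[i] = q->vtime[i - 1];
                  i--;
          }
          q->task[i] = task;
          q->vtime[i] = vtime;
  }

  int main(void)
  {
          struct vdsq fifo = { {0}, {0}, 0 }, pq = { {0}, {0}, 0 };

          dispatch_fifo(&fifo, 1, 300);
          dispatch_fifo(&fifo, 2, 100);
          assert(fifo.task[0] == 1);  /* FIFO keeps arrival order */

          dispatch_vtime(&pq, 1, 300);
          dispatch_vtime(&pq, 2, 100);
          assert(pq.task[0] == 2);    /* lower vtime is consumed first */
          return 0;
  }

The insertion sort stands in for whatever ordered structure the kernel
uses; only the consumption order it produces is the point of the example.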
Where to Look
=============

* ``include/linux/sched/ext.h`` defines the core data structures, ops table
  and constants.

* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
  The functions prefixed with ``scx_bpf_`` can be called from the BPF
  scheduler.

* ``tools/sched_ext/`` hosts example BPF scheduler implementations.

  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
    custom DSQ.

  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.

ABI Instability
===============

The APIs provided by sched_ext to BPF scheduler programs have no stability
guarantees. This includes the ops table callbacks and constants defined in
``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
``kernel/sched/ext.c``.

While we will attempt to provide a relatively stable API surface when
possible, these interfaces are subject to change without warning between
kernel versions.