RFC: Add task hooks for create/switch/finish events #49083

jpsamaroo · 2023-03-21T20:03:54Z

Explanation adapted from #39994:

Certain libraries are configured using global or thread-local state instead of passing handles to every function. CUDA, for example, has a cudaSetDevice function that binds a device to the current thread for all future API calls. This is at odds with Julia's task-based concurrency, which presents an execution environment that's local to the current task (e.g., in the case of CUDA, using a different device).

This PR adds a hook mechanism that can be used to detect task creation, task switches, and task finish, allowing synchronization of Julia's task-local environment with a library's global or thread-local state.

Further information

When writing code which interfaces with libraries, OS interfaces, or Julia's runtime itself, certain data and operations are intended to operate thread-locally or globally. For example:

CUDA and AMD's HIP expect that a "context" object will be registered in thread-local storage
OS timers measure a global or per-thread timer value
Julia's Base.gc_num() is a global set of counters

This becomes an issue when integrating such APIs with Julia's task system, because a task switch changes which code is executing partway through a given task's lifetime. That code might expect that a certain thread-local or global value, which was previously set or measured, is still at its original value; we currently can't make this guarantee.

We've had a variety of workarounds in place for this. In the GPU case, we can prefix every GPU API call with thread-local context setup and make sure that no yields happen between that setup and the API call itself. This works, but it is slow, cumbersome, and error-prone. When measuring task execution time or allocations, however, we have no way to guarantee that we're actually measuring the metrics for the task we care about.
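To make the problem and the workaround concrete, here is a minimal toy model in Python rather than Julia or CUDA code. Generators stand in for tasks, and a module-level variable stands in for the library's thread-local "current device". Every name in it (`set_device`, `launch`, `task`, `run`) is hypothetical and exists only for illustration.

```python
current_device = None  # stands in for a library's thread-local "current device"


def set_device(dev):
    """Hypothetical analogue of a call that binds a device to the current thread."""
    global current_device
    current_device = dev


def launch(log, who):
    """Hypothetical API call that implicitly uses whatever device is bound."""
    log.append((who, current_device))


def task(who, dev, log, rebind_before_each_call):
    set_device(dev)
    yield  # yield point: another task may run here and rebind the device
    if rebind_before_each_call:
        set_device(dev)  # the workaround: redo the setup right before the call
    launch(log, who)


def run(rebind):
    """Round-robin two tasks on one 'thread' and record which device each call saw."""
    log = []
    tasks = [task("A", 0, log, rebind), task("B", 1, log, rebind)]
    while tasks:
        alive = []
        for t in tasks:
            try:
                next(t)
                alive.append(t)
            except StopIteration:
                pass
        tasks = alive
    return log


broken = run(rebind=False)  # task A's call runs against B's device
fixed = run(rebind=True)    # each call sees its own device, at the cost of extra setup
```

In the first run, task A's call observes device 1 because task B rebound the shared state while A was suspended. The second run is correct only because the setup is repeated before every call, which is the slow and error-prone pattern described above.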

The solution proposed here is simple: allow a restricted form of user code to run during task switches. Such code is called a "hook", and each hook registered on a task executes when that task is switched away from or switched back to. This lets hooks account for changes in the task-to-thread relationship and in the task's running state when necessary. Additionally, hooks are triggered at task creation time to allow "inheriting" behavior or values from the parent task; at task start for any initialization at first run; and at task finish for any cleanup.

Unlike the implementation in #39994, this PR adds hooks on a per-task basis. This prevents hooks from running when they aren't needed; hook code is only run when a hook is registered to a given task. This makes it easy to add and remove hooks safely, because a task's hooks can and should only be modified by the task itself (or the parent task during task creation, when the task is stopped), so expensive synchronization is not necessary.

The hooks here run in 5 places:

Task creation (in the parent's context)
Task start for the first time
Task switch away
Task switch back
Task finishing

Hooks are passed an event code (currently just the numbers 0 through 4), and if the hook is running during task creation, the child task is also passed as the second argument (otherwise it is nothing). Hooks are treated similarly to finalizer callbacks, in that they are not allowed to yield.
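The five events and the hook signature can be illustrated with a small toy scheduler. This is a sketch in Python, not the PR's implementation: `TaskEvent`, `ToyTask`, and `run_round_robin` are hypothetical names, generators stand in for tasks, and the mapping of event names to the numbers 0 through 4 simply follows the order of the list above, which is an assumption.

```python
from enum import IntEnum


class TaskEvent(IntEnum):
    # Assumed numbering: the description only says "0 through 4".
    CREATE = 0
    START = 1
    SWITCH_AWAY = 2
    SWITCH_BACK = 3
    FINISH = 4


class ToyTask:
    """Generator-backed stand-in for a task that owns its own hook list."""

    def __init__(self, name, body, parent=None):
        self.name = name
        self.hooks = []  # per-task: an empty list means nothing extra runs
        self.started = False
        self.gen = body(self)
        if parent is not None:
            # Creation hooks belong to the parent, run in the parent's
            # context, and receive the child as the second argument.
            for hook in list(parent.hooks):
                hook(TaskEvent.CREATE, self)

    def _fire(self, event):
        for hook in list(self.hooks):
            hook(event, None)  # no child outside of task creation

    def step(self):
        """Run to the next yield point; return False once the task is done."""
        self._fire(TaskEvent.SWITCH_BACK if self.started else TaskEvent.START)
        self.started = True
        try:
            next(self.gen)
        except StopIteration:
            self._fire(TaskEvent.FINISH)
            return False
        self._fire(TaskEvent.SWITCH_AWAY)
        return True


def run_round_robin(tasks):
    pending = list(tasks)
    while pending:
        pending = [t for t in pending if t.step()]


log = []


def make_logger(owner):
    def hook(event, child):
        log.append((owner, event.name, child.name if child else None))
    return hook


def body(task):
    yield  # a single yield point: switched away once, switched back once


root = ToyTask("root", body)
root.hooks.append(make_logger("root"))
child = ToyTask("child", body, parent=root)  # fires root's hook with CREATE
child.hooks.append(make_logger("child"))
run_round_robin([child])
```

After this runs, `log` holds the parent's creation event followed by the child's start, switch-away, switch-back, and finish events. A task whose `hooks` list is empty pays only for iterating an empty list, which mirrors the per-task design described above.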

Overhead

Some basic benchmark results suggest that a task without hooks registered has almost no measurable overhead during task switches: https://gist.github.com/jpsamaroo/7da2124ce306ad6ae3e2588368bbdcda.

Preemption

It was pointed out that task hooks might make a future implementation of non-cooperative (preemptive) multitasking harder, because code may be interrupted outside of yield calls without the hooks being called. However, preemptive multitasking might itself break a lot of code that relies on the code between yield points being non-interruptible, so some way to disable preemption for regions of code will likely be necessary. Additionally, task hooks could potentially be marked as "preemption-safe" and thus run during preemption; if any hook on a task isn't marked as preemption-safe, then that task should not be preemptible.
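The "preemption-safe" rule suggested above is small enough to state as code. The following Python sketch only illustrates that rule; the `preemption_safe` flag, the `Hook` and `ToyTask` types, and `is_preemptible` are hypothetical and not part of this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hook:
    fn: Callable
    preemption_safe: bool = False  # hypothetical marker; unsafe unless declared


@dataclass
class ToyTask:
    hooks: List[Hook] = field(default_factory=list)


def is_preemptible(task):
    """A task may be preempted only if every one of its hooks is preemption-safe."""
    return all(h.preemption_safe for h in task.hooks)


no_hooks = ToyTask()
all_safe = ToyTask([Hook(print, preemption_safe=True)])
one_unsafe = ToyTask([Hook(print, preemption_safe=True), Hook(print)])
```

Under this rule a task without hooks is always preemptible, and a single unmarked hook is enough to opt a task out of preemption.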

Use cases

Managing thread-local state efficiently (such as for GPU libraries)
Precise measurement of metrics during a specific task's execution (https://gist.github.com/jpsamaroo/18d4132dd5115716fd1f02a628ebc874)
Implementation of context variables with task-local storage
Replacement of hard-coded RNG state fields with task-local storage
Replacement of logstate field with task-local storage
Tracking of task spawning with "viral hooks" (hook attaches itself to child tasks, recursively)
(Maybe) Implementation of structured concurrency concepts such as nurseries and cancellation
Quick freeing of per-task allocations via finish hook (rather than waiting on GC finalizers)
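As an illustration of the measurement use case, the sketch below accumulates only the time during which a given task is actually running. It is a Python toy model, not the linked gist: the fake clock, the string event names, `RunTimeHook`, `worker`, and `run_interleaved` are all hypothetical, and generators stand in for tasks.

```python
class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0

    def advance(self, dt):
        self.now += dt


clock = FakeClock()

START, SWITCH_AWAY, SWITCH_BACK, FINISH = "start", "away", "back", "finish"


class RunTimeHook:
    """Per-task hook that sums the intervals in which its task was running."""

    def __init__(self):
        self.total = 0
        self._resumed_at = None

    def __call__(self, event, child=None):
        if event in (START, SWITCH_BACK):
            self._resumed_at = clock.now
        elif event in (SWITCH_AWAY, FINISH):
            self.total += clock.now - self._resumed_at
            self._resumed_at = None


def worker(cost_per_slice, slices):
    for _ in range(slices):
        clock.advance(cost_per_slice)  # pretend to do some work
        yield                          # then let another task run


def run_interleaved(jobs):
    """Round-robin over (generator, hook) pairs, firing the hook around each slice."""
    started = set()
    pending = list(jobs)
    while pending:
        still_running = []
        for gen, hook in pending:
            hook(SWITCH_BACK if id(gen) in started else START)
            started.add(id(gen))
            try:
                next(gen)
            except StopIteration:
                hook(FINISH)
            else:
                hook(SWITCH_AWAY)
                still_running.append((gen, hook))
        pending = still_running


a_time, b_time = RunTimeHook(), RunTimeHook()
run_interleaved([(worker(3, 2), a_time), (worker(5, 2), b_time)])
```

Here the first task accumulates 6 clock units and the second 10, even though 16 units pass between the first task's start and its finish. A measurement taken only at start and finish would charge the first task for the second task's work.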

Todo:

Validate and possibly relax atomic usage
Consider alternatives to array of hooks
Add tests for various kinds of hooks
Add documentation

JeffBezanson · 2023-03-21T21:37:35Z

I agree there are lots of use cases for this, but our Task implementation is not particularly efficient and I really worry that it never will be. We should think carefully about how we can pay for this. Replacing logstate as you suggest and maybe also task_done_hook would be a good start.

jpsamaroo · 2023-03-21T21:50:22Z

> Replacing logstate as you suggest and maybe also task_done_hook would be a good start.

Should I do that in this PR, or create a follow-up?


jpsamaroo · 2023-04-05T15:21:36Z

The latest push attempts to switch logstate to a combination of task hooks and a TLS variable; it's currently hanging for me in precompile, and I would appreciate some help debugging it :)

jpsamaroo added the multithreading (Base.Threads and related functionality), needs tests (Unit tests are required for this change), and needs docs (Documentation for this change is required) labels Mar 21, 2023

vtjnash reviewed Mar 22, 2023


maleadt and others added 2 commits April 5, 2023 09:30

Move logstate into task-local storage

a676386

jpsamaroo force-pushed the jps/task-action-hooks branch from 9f2aa3c to a676386 Compare April 5, 2023 15:20
