RFC: Add task hooks for create/switch/finish events #49083

Draft: wants to merge 2 commits into base: master

@jpsamaroo jpsamaroo commented Mar 21, 2023

Explanation adapted from #39994:

Certain libraries are configured using global or thread-local state instead of passing handles to every function. CUDA, for example, has a cudaSetDevice function that binds a device to the current thread for all future API calls. This is at odds with Julia's task-based concurrency, which presents an execution environment that's local to the current task (e.g., in the case of CUDA, using a different device).

This PR adds a hook mechanism that can be used to detect task creation, task switches, and task finish, allowing synchronization of Julia's task-local environment with a library's global or thread-local state.

Further information

When writing code which interfaces with libraries, OS interfaces, or Julia's runtime itself, certain data and operations are intended to operate thread-locally or globally. For example:

  • CUDA and AMD's HIP expect that a "context" object will be registered in thread-local storage
  • OS timers measure a global or per-thread timer value
  • Julia's Base.gc_num() is a global set of counters

This becomes an issue when integrating such APIs with Julia's task system, because a task switch changes which code is executing on a thread partway through a given task's lifetime. The code that runs next might expect that a thread-local or global value, which it previously set or measured, still has its original value; we can't currently make this guarantee.

We've had a variety of workarounds in place for this. In the GPU case, we can prefix every GPU API call with thread-local context setup and make sure that no yields happen between that setup and the API call itself. This works, but is slow, cumbersome, and error-prone. In the case of measuring task execution time or allocations, however, we have no way to guarantee that we're actually measuring the metrics for the task we care about.
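To make the measurement problem concrete, here is a minimal sketch that uses only existing Base functions. The helper names `busy` and `measure_with_yield` are made up for illustration, and the behavior described assumes the second task ends up queued on the same thread, which is the default for tasks created with `@task`.

```julia
# Spin for roughly `seconds` without yielding, standing in for unrelated work.
function busy(seconds)
    t0 = time_ns()
    while (time_ns() - t0) < seconds * 1e9
    end
    return nothing
end

# Time a region of the current task that contains a yield point.
function measure_with_yield()
    other = @task busy(0.2)   # unrelated task; sticky by default, so queued on this thread
    schedule(other)
    t0 = time_ns()
    yield()                   # the scheduler is free to run `other` here
    elapsed = (time_ns() - t0) / 1e9
    wait(other)
    return elapsed
end

elapsed = measure_with_yield()
println("measured ", elapsed, " s, although the measuring task did almost no work itself")
```

The measuring task cannot tell how much of `elapsed` belongs to it, because nothing notifies it when it is switched away from or switched back to.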

The solution proposed here is straightforward: allow a restricted form of user code to run during task switches. Such code is called a "hook", and each hook registered on a task executes whenever that task is switched away from or switched back to. This lets hooks account for changes in the task-to-thread relationship and in the running state of a task when necessary. Additionally, hooks are triggered at task creation, to allow "inheriting" behavior or values from the parent task; at task start, for any initialization at first run; and at task finish, for any cleanup.

Unlike the implementation in #39994, this PR adds hooks on a per-task basis. This prevents hooks from running when they aren't needed; hook code is only run when a hook is registered to a given task. This makes it easy to add and remove hooks safely, because a task's hooks can and should only be modified by the task itself (or the parent task during task creation, when the task is stopped), so expensive synchronization is not necessary.

The hooks here run in 5 places:

  • Task creation (in the parent's context)
  • Task start for the first time
  • Task switch away
  • Task switch back
  • Task finishing

Hooks are passed an event code (currently just the numbers 0 through 4), and if the hook is running during task creation, the child task is also passed as the second argument (otherwise it is nothing). Hooks are treated similarly to finalizer callbacks, in that they are not allowed to yield.
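As a rough illustration of the calling convention described above, the following sketch defines a hook as a callable object that receives an event code and a second argument. Everything named here is an assumption: the constant names, the mapping of the codes 0 through 4 to events (taken from the order of the list above), and the `RunTimer` type are hypothetical, and the events are fired by hand because this sketch does not use the PR's registration API.

```julia
# Assumed names for the event codes 0 through 4, in the order the events are listed above.
const TASK_CREATE      = 0
const TASK_START       = 1
const TASK_SWITCH_FROM = 2
const TASK_SWITCH_TO   = 3
const TASK_FINISH      = 4

# Hypothetical hook state: accumulates the time during which its task was actually running.
mutable struct RunTimer
    running_since::UInt64   # time_ns() at the latest start or switch-back
    total_ns::UInt64        # accumulated running time in nanoseconds
end
RunTimer() = RunTimer(0, 0)

# The hook itself: takes the event code and, for creation events, the child task.
# It only reads a clock and updates a field, so it never yields.
function (timer::RunTimer)(event::Integer, child::Union{Task,Nothing})
    if event == TASK_START || event == TASK_SWITCH_TO
        timer.running_since = time_ns()
    elseif event == TASK_SWITCH_FROM || event == TASK_FINISH
        timer.total_ns += time_ns() - timer.running_since
    end
    return nothing
end

# Stand-in for the runtime: fire the events by hand in a plausible order.
timer = RunTimer()
timer(TASK_START, nothing)
timer(TASK_SWITCH_FROM, nothing)   # another task would run now; this interval is not counted
sleep(0.05)
timer(TASK_SWITCH_TO, nothing)
timer(TASK_FINISH, nothing)
println("time attributed to the task: ", timer.total_ns, " ns")
```

The time spent in `sleep` falls between a switch-away and a switch-back event, so it is excluded from `total_ns`. This is the kind of per-task accounting that is not possible without such notifications.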

Overhead

Some basic benchmark results suggest that a task without hooks registered has almost no measurable overhead during task switches: https://gist.github.com/jpsamaroo/7da2124ce306ad6ae3e2588368bbdcda.

Preemption

It was pointed out that task hooks might make a future implementation of non-cooperative (preemptive) multitasking harder, because code may be interrupted outside of yield calls without calling the hooks. However, preemptive multitasking might itself break lots of code that relies on code between yield points being non-interruptible, so some way to disable preemption for regions of code will likely be necessary. Additionally, task hooks could potentially be marked as "preemption-safe" and thus run during preemption; if any task hook isn't marked as preemption-safe, then that task should not be preemptible.

Use cases

  • Managing thread-local state efficiently (such as for GPU libraries)
  • Precise measurement of metrics during a specific task's execution (https://gist.github.com/jpsamaroo/18d4132dd5115716fd1f02a628ebc874)
  • Implementation of context variables with task-local storage
  • Replacement of hard-coded RNG state fields with task-local storage
  • Replacement of logstate field with task-local storage
  • Tracking of task spawning with "viral hooks" (hook attaches itself to child tasks, recursively)
  • (Maybe) Implementation of structured concurrency concepts such as nurseries and cancellation
  • Quick freeing of per-task allocations via finish hook (rather than waiting on GC finalizers)
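The "viral hooks" idea in the list above can be sketched with a toy model. `ToyTask`, `spawn_child`, and `SpawnTracker` are hypothetical stand-ins rather than Julia's `Task` or the PR's API, and the creation event code 0 is an assumption. The model only mimics the rule that a parent's hooks run at creation time with the child as the second argument.

```julia
# Toy stand-in for a task that carries its own hooks; not Julia's `Task`.
mutable struct ToyTask
    name::String
    hooks::Vector{Any}
end
ToyTask(name) = ToyTask(name, Any[])

const TASK_CREATE = 0   # assumed code for the creation event

# Model of task creation: the parent's hooks run with the child as second argument.
function spawn_child(parent::ToyTask, name)
    child = ToyTask(name)
    for hook in copy(parent.hooks)
        hook(TASK_CREATE, child)
    end
    return child
end

# A "viral" hook: records each spawned task and attaches itself to the child.
struct SpawnTracker
    seen::Vector{String}
end

function (tracker::SpawnTracker)(event, child)
    if event == TASK_CREATE && child !== nothing
        push!(tracker.seen, child.name)
        push!(child.hooks, tracker)
    end
    return nothing
end

tracker = SpawnTracker(String[])
root = ToyTask("root")
push!(root.hooks, tracker)
a = spawn_child(root, "a")
b = spawn_child(a, "b")      # tracked although the hook was only registered on root
println(tracker.seen)        # ["a", "b"]
```

Because the hook re-registers itself on every child it sees, the whole tree of tasks spawned from `root` is tracked after a single registration.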

Todo:

  • Validate and possibly relax atomic usage
  • Consider alternatives to array of hooks
  • Add tests for various kinds of hooks
  • Add documentation

@jpsamaroo added the labels multithreading (Base.Threads and related functionality), needs tests (Unit tests are required for this change), and needs docs (Documentation for this change is required) on Mar 21, 2023
@JeffBezanson
Member

I agree there are lots of use cases for this, but our Task implementation is not particularly efficient and I really worry that it never will be. We should think carefully about how we can pay for this. Replacing logstate as you suggest and maybe also task_done_hook would be a good start.

@jpsamaroo
Member Author

> Replacing logstate as you suggest and maybe also task_done_hook would be a good start.

Should I do that in this PR, or create a follow-up?

maleadt and others added 2 commits April 5, 2023 09:30
Certain libraries are configured using global or thread-local state
instead of passing handles to every function. CUDA, for example, has a
`cudaSetDevice` function that binds a device to the current thread for
all future API calls. This is at odds with Julia's task-based
concurrency, which presents an execution environment that's local to the
current task (e.g., in the case of CUDA, using a different device).

This PR adds a hook mechanism that can be used to detect task creation
and task switches, allowing synchronization of Julia's task-local
environment with a library's global or thread-local state.
@jpsamaroo jpsamaroo force-pushed the jps/task-action-hooks branch from 9f2aa3c to a676386 Compare April 5, 2023 15:20
@jpsamaroo
Member Author

The latest push attempts to switch logstate to a combination of task hooks and a TLS variable; it's currently hanging for me in precompile, and I would appreciate some help debugging it :)
