
Conversation

@guillaumeriousat
Contributor

This adds a worker_yield_t with an operator()(thread_index_t thread_idx) that is only called by worker threads while they wait between jobs. It defaults to micro_yield but can be customized by the user.

Right now, fork_union offers two mechanisms to control the behaviour of threads that do not have work to do:

  • sleep(microseconds), which tells the threads to assume a chill_k mood_t and call std::this_thread::sleep_for(microseconds) in a loop while waiting for a new job.
  • micro_yield(), which calls operator() of a user-definable micro_yield_t.

Since we are writing an application that is frequently used on consumer hardware alongside a DAW, multiple VST plugins, and a host of other random applications, relying on micro_yield() alone is not an option; it can use power-optimized instructions but still results in 100% CPU usage.

sleep() is much better for our use case except for the added latency. Since our application does real-time audio, a slow wakeup can cause a missed deadline, which generally hurts performance.

One possible way to resolve that dilemma is to wait for new jobs by doing an increasing number of micro_yield() calls before finally calling std::this_thread::sleep_for(microseconds). This form of tiered waiting, augmented by some sort of backoff measure, is described in this article and seems to be more or less common.
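A minimal sketch of that tiered wait, with illustrative thresholds and std::this_thread::yield() standing in for fork_union's micro_yield() (names and numbers here are assumptions, not fork_union API):

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Tiered wait: spin with a cheap yield for a bounded number of attempts,
// then escalate to a timed sleep. The caller owns the attempt counter and
// resets it when work arrives.
inline void tiered_wait(std::size_t &attempt) {
    constexpr std::size_t spin_limit_k = 1024; // tuning knob, illustrative
    if (attempt < spin_limit_k) {
        ++attempt;
        std::this_thread::yield(); // cheap stage; stands in for micro_yield()
    } else {
        std::this_thread::sleep_for(std::chrono::microseconds(100)); // back off
    }
}
```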

With fork_union as it is right now, I don't think we can implement that form of waiting for our application.

It would be possible to implement this kind of waiting scheme inside the worker's lambda using unsafe_for_threads and tying the wait procedure to acquiring a mutex. In pseudo-code it would look like this:

void process() {
    // unblock the previously spinning threads
    spin_mutex.unlock();
    // wait for the previously spinning threads to finish
    unsafe_join();
    // re-locking the mutex makes the threads spin (and eventually sleep) until it is unlocked again
    spin_mutex.lock();
    fu::unsafe_for_threads([&](auto thread_idx) {
        some_task();
        spin_mutex.lock();   // blocks here until the next process() call unlocks
        spin_mutex.unlock();
    });
}

This would work (maybe with some performance issues due to contention on the lock) if we never needed the full result of the computation and could live with potentially partial results until the next call to process.

The problem is that we call fork_union's for_n API inside a process function that is invoked by the OS's audio interrupt handling, and we need to do some single-threaded processing on the result of the multi-threaded computation, so we must join at the end of process. This means the threads are forced to sit in-between jobs across audio interrupt calls, calling micro_yield or sleep in a loop.

We could define a micro_yield_t that implements the waiting behaviour, using std::this_thread::get_id and counters to track which stage of the wait each thread is on.

I think std::this_thread::get_id would make it hard to implement a very performant solution because we would need some kind of map data structure. Furthermore, fork_union uses micro_yield() outside of _worker_loop to implement various waits, which makes interactions with std::this_thread::get_id harder to implement correctly.

This is why I propose to include a worker_yield_t with an operator()(thread_index_t thread_idx) that defaults to calling micro_yield() but is guaranteed to only be called in _worker_loop between jobs.
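As an illustration, such a worker_yield_t could keep per-thread backoff state indexed by thread_idx; no std::this_thread::get_id() lookup or map is needed. The alias, threshold, and sleep duration below are assumptions for the sketch, not fork_union's actual API:

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using thread_index_t = std::size_t; // assumed alias, mirrors fork_union's naming

// Hypothetical worker_yield_t: each waiting worker spins for a while,
// then backs off to a timed sleep. State is indexed by thread_idx, so
// lookups are a plain vector access.
struct tiered_worker_yield_t {
    static constexpr std::size_t spin_limit_k = 4096; // illustrative threshold
    std::vector<std::size_t> attempts_;
    explicit tiered_worker_yield_t(std::size_t threads) : attempts_(threads, 0) {}
    void operator()(thread_index_t thread_idx) {
        if (attempts_[thread_idx]++ < spin_limit_k)
            std::this_thread::yield(); // cheap spin stage
        else
            std::this_thread::sleep_for(std::chrono::microseconds(50)); // back off
    }
};
```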

Downsides:

  • A bit more complexity
  • Passing one more size_t parameter any time a wait needs to happen? I don't know what the performance impact is, but I'm guessing it must be minimal.
  • Not sure this can work for the linux_colocated_pool, since only the global thread seems to sleep?
  • Since a worker_yield_t (micro_yield_t) is instantiated in the _worker_loop function, the user needs to rely on static members to reset the wait-attempt count. We could pre-instantiate a worker_yield_t in a basic_pool and make it accessible, but this makes the API messier.
    • A possibly more elegant solution would be to add a user-definable void reset() member to worker_yield_t that could be called internally at every epoch change. This is a more constraining API, but it would work for our use case.
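The reset() idea from the last point could look roughly like this (a hypothetical sketch, not fork_union API):

```cpp
#include <cstddef>

// Hypothetical sketch of the reset() idea: the pool would call reset()
// internally at every epoch change, so the wait-attempt counter can live
// in the yield object itself instead of in static members.
struct resettable_yield_t {
    std::size_t attempt_ = 0;
    void operator()() { ++attempt_; /* spin or sleep depending on attempt_ */ }
    void reset() { attempt_ = 0; } // invoked by the pool on each epoch change
};
```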

I have tested these changes for our use case and the results are pretty good, but I'm open to any changes/suggestions if it means an API that enables us to overcome the challenges outlined above gets into main!

@ashvardanian
Owner

Hi @guillaumeriousat! How about we add a little bit of meta-programming to check if the existing yield object can receive a thread ID instead of having one more argument?

@guillaumeriousat
Contributor Author

I hadn't thought of that! I've implemented what I hope you had in mind:

Worker threads now invoke a user-supplied worker_yield_t::operator()(index_type) between jobs, so real-time audio workloads can tier micro_yield() before falling back to sleeps.
The default worker path now delegates to the configured micro_yield_t instead of hardwiring std::this_thread::yield(), keeping custom wait strategies effective.
Drops the extra template parameter by using if constexpr, so worker threads hand their index to worker_yield_t only when that overload exists.
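The if constexpr detection described above can be sketched with std::is_invocable_v; this is an illustrative reconstruction, not the exact fork_union code:

```cpp
#include <cstddef>
#include <type_traits>

using index_t = std::size_t; // assumed index type for the sketch

// If the user's yield object is invocable with a thread index, pass it;
// otherwise fall back to the zero-argument form. No extra template
// parameter is needed.
template <typename yield_t>
void call_yield(yield_t &yield, index_t thread_idx) {
    if constexpr (std::is_invocable_v<yield_t &, index_t>)
        yield(thread_idx); // thread-aware overload exists
    else
        yield();           // plain micro_yield-style callable
}
```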
@ashvardanian ashvardanian changed the title Add a worker_yield template to basic_pool to enable user-defined wait-between-job behaviour Thread-Aware micro_yield(thread_id) Oct 11, 2025
@ashvardanian ashvardanian merged commit 53e1b5d into ashvardanian:main-dev Oct 11, 2025
36 checks passed
