feat: detect CPUs and configure threading sensibly #291

tjohnson31415 · 2025-07-08T22:27:44Z

Description

When running inside a container, it is easy to oversaturate the CPU because the default for threading configurations is to use all CPUs on the system. This is especially true with --tensor-parallel becuase each worker creates its own thread pool. This PR adds code to detect the CPU limits allocated to a container (falling back to a simple os.cpu_count()) and considers the distributed world size to provide better default behavior of num_cpus / num_workers per process. There is also an internal configuration affecting threading during graph compilation called DT_PARALLEL_THREADS that does not check OMP_NUM_THREADS.

The PR adds a VLLM_SPYRE_UPDATE_THREAD_CONFIG env var configuration that is on by default but can be disabled to check and warn about threading configurations but not override anything.

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

github-actions · 2025-07-08T22:27:53Z

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

sducouedic

I am wondering whether (and how) we should test that code, what is the case in the CI/CD container?

sducouedic · 2025-07-15T09:48:07Z

vllm_spyre/platform.py

+                raise RuntimeError(
+                    f"{failed_detection_message} Use "
+                    "VLLM_SPYRE_UPDATE_THREAD_CONFIG=0 and configure manually."
+                )


Imo it would be better to have a warning message and keep the original threading values instead (ie. not changing anything), since you set VLLM_SPYRE_UPDATE_THREAD_CONFIG to 1 by default and there is the possibility that both the parsing of cpu.max fails and os.cpu_count() returns None

That's a good point since I made it on by default. I'll change it to just log instead of raising.

sducouedic · 2025-07-15T09:57:35Z

vllm_spyre/platform.py

+                f"Detected {cpu_count} CPUs from `os.cpu_count()`"
+
+        cpus_per_worker = math.ceil(
+            cpu_count / worker_count) if cpu_count is not None else None


I am not sure if worker_count will always contain a valid value. Can you add an assert for worker_count to be >0 in the beginning of the function

sducouedic · 2025-07-15T10:24:13Z

vllm_spyre/platform.py

+            if quota_str != 'max':
+                quota = int(quota_str)
+                period = int(period_str)
+                cpu_count = float(quota) / period


In python we don't need to cast to float in order to do a floating point division

Suggested change

cpu_count = float(quota) / period

cpu_count = quota / period

sducouedic · 2025-07-15T10:25:45Z

vllm_spyre/platform.py

+        #  - sets TORCHINDUCTOR_COMPILE_THREADS=1 (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/env_override.py#L38-L39)
+        #  - it will set OMP_NUM_THREADS=1 when using multiple workers (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/executor/multiproc_worker_utils.py#L304)
+        #  - has configurations for OMP thread binding (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/envs.py#L435-L438)
+        #    - the bind attempts to detect NUMA nodes (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/v1/worker/cpu_worker.py#L111)


thanks for the pointers

joerunde · 2025-07-18T17:46:24Z

vllm_spyre/platform.py

+        #    - the bind attempts to detect NUMA nodes (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/v1/worker/cpu_worker.py#L111)
+
+        # Always print current env for awareness
+        threading_envs = [


Thoughts on setting these as "constants" in this file?

feat: detect CPUs and set threading by default

72c6449

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

tjohnson31415 requested review from yannicks1, tdoublep, nikolaospapandreou and sducouedic as code owners July 8, 2025 22:27

sducouedic reviewed Jul 15, 2025

View reviewed changes

joerunde reviewed Jul 18, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: detect CPUs and configure threading sensibly #291

feat: detect CPUs and configure threading sensibly #291

tjohnson31415 commented Jul 8, 2025

Uh oh!

github-actions bot commented Jul 8, 2025

Uh oh!

sducouedic left a comment

Uh oh!

sducouedic Jul 15, 2025

Uh oh!

tjohnson31415 Jul 15, 2025

Uh oh!

sducouedic Jul 15, 2025

Uh oh!

sducouedic Jul 15, 2025

Uh oh!

sducouedic Jul 15, 2025

Uh oh!

joerunde Jul 18, 2025

Uh oh!

Uh oh!

feat: detect CPUs and configure threading sensibly #291

Are you sure you want to change the base?

feat: detect CPUs and configure threading sensibly #291

Conversation

tjohnson31415 commented Jul 8, 2025

Description

Uh oh!

github-actions bot commented Jul 8, 2025

Uh oh!

sducouedic left a comment

Choose a reason for hiding this comment

Uh oh!

sducouedic Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

tjohnson31415 Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

sducouedic Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

sducouedic Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

sducouedic Jul 15, 2025

Choose a reason for hiding this comment

Uh oh!

joerunde Jul 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!