feat: detect CPUs and configure threading sensibly #291

Open: wants to merge 1 commit into main

Conversation

tjohnson31415
Collaborator

Description

When running inside a container, it is easy to oversaturate the CPU because threading configurations default to using all CPUs on the system. This is especially true with --tensor-parallel because each worker creates its own thread pool. This PR adds code to detect the CPU limit allocated to the container (falling back to a simple os.cpu_count()) and divides it by the distributed world size, giving a better default of num_cpus / num_workers threads per process. There is also an internal configuration affecting threading during graph compilation, DT_PARALLEL_THREADS, which does not respect OMP_NUM_THREADS.
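A minimal sketch of the detection logic described above (the cgroup file path, function names, and fallback behavior are assumptions based on this description, not the PR's exact code):

```python
import math
import os
from pathlib import Path
from typing import Optional


def detect_cpu_count() -> Optional[float]:
    """Return the container's CPU allocation, falling back to os.cpu_count()."""
    # cgroup v2 exposes the quota as "<quota> <period>" in cpu.max,
    # or "max <period>" when the container is unlimited.
    cpu_max = Path("/sys/fs/cgroup/cpu.max")
    if cpu_max.is_file():
        quota_str, period_str = cpu_max.read_text().split()
        if quota_str != "max":
            return int(quota_str) / int(period_str)
    return os.cpu_count()  # may be None


def cpus_per_worker(worker_count: int) -> Optional[int]:
    """Default thread count per worker: ceil(num_cpus / num_workers)."""
    assert worker_count > 0
    cpu_count = detect_cpu_count()
    return math.ceil(cpu_count / worker_count) if cpu_count is not None else None
```

For example, a pod limited to 4 CPUs running with --tensor-parallel-size 2 would get 2 threads per worker instead of each worker spawning a full system-wide thread pool.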

The PR also adds a VLLM_SPYRE_UPDATE_THREAD_CONFIG env var, on by default, which can be disabled to only check and warn about threading configurations without overriding anything.

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
github-actions bot commented Jul 8, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@sducouedic sducouedic (Collaborator) left a comment

I am wondering whether (and how) we should test that code. What is the case in the CI/CD container?

Comment on lines +392 to +395
raise RuntimeError(
f"{failed_detection_message} Use "
"VLLM_SPYRE_UPDATE_THREAD_CONFIG=0 and configure manually."
)
Collaborator

Imo it would be better to log a warning and keep the original threading values (i.e. not change anything) instead of raising, since you set VLLM_SPYRE_UPDATE_THREAD_CONFIG to 1 by default and it is possible that both the parsing of cpu.max fails and os.cpu_count() returns None.
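The fallback the reviewer suggests might look roughly like this (the function name and message wording are illustrative, not the PR's code):

```python
import logging
import os

logger = logging.getLogger(__name__)


def update_thread_config(cpus_per_worker, failed_detection_message):
    """Warn and keep existing settings when CPU detection fails."""
    if cpus_per_worker is None:
        # Both cpu.max parsing and os.cpu_count() can come up empty.
        # Since the update path is enabled by default, warn instead of
        # raising and leave the current env vars untouched.
        logger.warning(
            "%s Keeping existing threading configuration. Set "
            "VLLM_SPYRE_UPDATE_THREAD_CONFIG=0 to configure manually.",
            failed_detection_message)
        return
    os.environ["OMP_NUM_THREADS"] = str(cpus_per_worker)
```

This keeps startup from failing on unusual hosts while still surfacing the problem in the logs.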

Collaborator Author

That's a good point since I made it on by default. I'll change it to just log instead of raising.

f"Detected {cpu_count} CPUs from `os.cpu_count()`"

cpus_per_worker = math.ceil(
cpu_count / worker_count) if cpu_count is not None else None
Collaborator

I am not sure worker_count will always contain a valid value. Can you add an assert that worker_count > 0 at the beginning of the function?

if quota_str != 'max':
quota = int(quota_str)
period = int(period_str)
cpu_count = float(quota) / period
Collaborator

In Python we don't need to cast to float in order to do floating-point division:

Suggested change
cpu_count = float(quota) / period
cpu_count = quota / period
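For reference, Python 3's `/` on two ints is already true division and yields a float, so the cast is redundant:

```python
quota, period = 50000, 100000  # e.g. a cgroup v2 cpu.max of "50000 100000"
cpu_count = quota / period
print(cpu_count)        # 0.5
print(type(cpu_count))  # <class 'float'>
```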

Comment on lines +331 to +334
# - sets TORCHINDUCTOR_COMPILE_THREADS=1 (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/env_override.py#L38-L39)
# - it will set OMP_NUM_THREADS=1 when using multiple workers (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/executor/multiproc_worker_utils.py#L304)
# - has configurations for OMP thread binding (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/envs.py#L435-L438)
# - the bind attempts to detect NUMA nodes (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/v1/worker/cpu_worker.py#L111)
Collaborator

thanks for the pointers

# Always print current env for awareness
threading_envs = [
Collaborator

Thoughts on setting these as "constants" in this file?
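One way to do that, lifting the list to a module-level constant (the constant name and the exact set of env vars are assumptions based on the variables mentioned in this thread):

```python
import os

# Module-level constant: the env vars inspected by the thread
# configuration code, defined once so the logging and update paths
# share the same list.
THREADING_ENVS = (
    "OMP_NUM_THREADS",
    "TORCHINDUCTOR_COMPILE_THREADS",
    "DT_PARALLEL_THREADS",
)


def log_threading_envs():
    """Print current values for awareness, '<unset>' when absent."""
    for name in THREADING_ENVS:
        print(f"{name}={os.environ.get(name, '<unset>')}")
```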
