-
Notifications
You must be signed in to change notification settings - Fork 18
feat: detect CPUs and configure threading sensibly #291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
Now you are good to go 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whether (and how) we should test that code, what is the case in the CI/CD container?
raise RuntimeError( | ||
f"{failed_detection_message} Use " | ||
"VLLM_SPYRE_UPDATE_THREAD_CONFIG=0 and configure manually." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo it would be better to have a warning message and keep the original threading values instead (ie. not changing anything), since you set VLLM_SPYRE_UPDATE_THREAD_CONFIG
to 1 by default and there is the possibility that both the parsing of cpu.max
fails and os.cpu_count()
returns None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point since I made it on by default. I'll change it to just log instead of raising.
f"Detected {cpu_count} CPUs from `os.cpu_count()`" | ||
|
||
cpus_per_worker = math.ceil( | ||
cpu_count / worker_count) if cpu_count is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if worker_count
will always contain a valid value. Can you add an assert for worker_count
to be >0
in the beginning of the function
if quota_str != 'max': | ||
quota = int(quota_str) | ||
period = int(period_str) | ||
cpu_count = float(quota) / period |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In python we don't need to cast to float in order to do a floating point division
cpu_count = float(quota) / period | |
cpu_count = quota / period |
# - sets TORCHINDUCTOR_COMPILE_THREADS=1 (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/env_override.py#L38-L39) | ||
# - it will set OMP_NUM_THREADS=1 when using multiple workers (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/executor/multiproc_worker_utils.py#L304) | ||
# - has configurations for OMP thread binding (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/envs.py#L435-L438) | ||
# - the bind attempts to detect NUMA nodes (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/v1/worker/cpu_worker.py#L111) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the pointers
# - the bind attempts to detect NUMA nodes (https://github.com/vllm-project/vllm/blob/baba0389f7e810a361fff5229ce20c2d5a2b1fac/vllm/v1/worker/cpu_worker.py#L111) | ||
|
||
# Always print current env for awareness | ||
threading_envs = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on setting these as "constants" in this file?
Description
When running inside a container, it is easy to oversaturate the CPU because the default for threading configurations is to use all CPUs on the system. This is especially true with
--tensor-parallel
becuase each worker creates its own thread pool. This PR adds code to detect the CPU limits allocated to a container (falling back to a simpleos.cpu_count()
) and considers the distributed world size to provide better default behavior ofnum_cpus / num_workers
per process. There is also an internal configuration affecting threading during graph compilation calledDT_PARALLEL_THREADS
that does not checkOMP_NUM_THREADS
.The PR adds a
VLLM_SPYRE_UPDATE_THREAD_CONFIG
env var configuration that is on by default but can be disabled to check and warn about threading configurations but not override anything.