Add test case for compiling multiple graphs #21044
Conversation
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging.
cc: @zou3519 @youkaichao
Code Review
This pull request introduces a new ignore_torch_compile decorator and adds test cases for compiling multiple graphs. There is a potential issue with the inheritance of the IGNORE_COMPILE_KEY that needs to be addressed.
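For context, the inheritance concern is the usual Python class-attribute pitfall: an attribute set on a parent class is visible on subclasses through normal attribute lookup. A generic illustration (the classes and the key's value below are hypothetical; only the attribute name comes from the review above):

IGNORE_COMPILE_KEY = "_ignore_compile_"  # value is illustrative only

class Parent:
    pass

class Child(Parent):
    pass

# Mark only the parent class.
setattr(Parent, IGNORE_COMPILE_KEY, True)

# getattr() follows the MRO, so the child also looks marked...
print(getattr(Child, IGNORE_COMPILE_KEY, False))  # True (inherited)

# ...unless the check is restricted to the class's own __dict__.
print(IGNORE_COMPILE_KEY in Child.__dict__)       # False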
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Looks reasonable to me. One minor comment.
cc @BoyuanFeng
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
def test_multi_graph_piecewise_compile(monkeypatch):
    assert VLLM_USE_V1

    monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")
@zou3519 I noticed that when I remove this and enable compile caching, the compile for CompiledAttentionTwo fails when running pytest tests/compile/piecewise/test_multiple_graphs.py::test_multi_graph_piecewise_compile. The error is:
E torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7f8addaf6cf0>' raised:
E RuntimeError: vLLM failed to compile the model. The most likely reason for this is that a previous compilation failed, leading to a corrupted compilation artifact. We recommend trying to remove ~/.cache/vllm/torch_compile_cache and try again to see the real issue.
E
E Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
I found that when I keep the prefix in the inductor compile cache dir here, the test passes. Before this change, the inductor cache dir would be /home/yhshin/.cache/vllm/torch_compile_cache/bbbe1d91c9/rank_0_0 for both attn_one and attn_two, but after the change they are separate (i.e. /home/yhshin/.cache/vllm/torch_compile_cache/bbbe1d91c9/rank_0_0/{prefix}).
For compiling multiple graphs in a single model, should we be setting the inductor cache dir to separate directories?
When we're compiling multiple graphs in a single model, each graph should be tagged with a different model tag. That will make it so that their intermediate artifacts are stored in a separate subfolder of the cache directory.
Are you saying that even with the set_model_tags, you need to set VLLM_DISABLE_COMPILE_CACHE=1?
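A minimal sketch of the tagging pattern being described, assuming set_model_tag is the context manager added in #19064 (the import paths below are assumptions and may not match the actual module layout):

import torch
from torch import nn
# Assumed import paths; set_model_tag may live elsewhere in the codebase.
from vllm.compilation.decorators import support_torch_compile
from vllm.config import VllmConfig, set_model_tag

@support_torch_compile
class Attn(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
        super().__init__()
        self.proj = nn.Linear(32, 32)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Each subgraph is constructed under its own tag, so its intermediate
# compilation artifacts land in a separate subfolder of the cache directory.
with set_model_tag("attn_one"):
    attn_one = Attn(vllm_config=vllm_config, prefix="attn_one")  # vllm_config from surrounding setup (not shown)
with set_model_tag("attn_two"):
    attn_two = Attn(vllm_config=vllm_config, prefix="attn_two")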
Are you saying that even with the set_model_tags, you need to set VLLM_DISABLE_COMPILE_CACHE=1?
Yes, and it seems to be due to the shared inductor cache dir. If I force it to be stored in a separate subfolder according to the model tag, it runs fine.
Having them share the inductor cache seems to be a conscious decision in #19064:
This PR re-organizes the cache directory structure, so that the same vLLM instances will use the same TORCHINDUCTOR_CACHE_DIR and TRITON_CACHE_DIR, but just different storage for vllm_compile_cache.py etc.
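To make the trade-off concrete, a purely illustrative sketch of forcing per-tag inductor cache subfolders (the helper below is hypothetical and not vLLM code; #19064 deliberately keeps one shared dir per instance instead):

import os

def per_tag_inductor_cache_dir(base_dir: str, model_tag: str) -> str:
    # Hypothetical helper: append the model tag so graphs compiled under
    # different tags do not share inductor artifacts.
    return os.path.join(base_dir, model_tag) if model_tag else base_dir

os.environ["TORCHINDUCTOR_CACHE_DIR"] = per_tag_inductor_cache_dir(
    os.path.expanduser("~/.cache/vllm/torch_compile_cache/bbbe1d91c9/rank_0_0"),
    "attn_two",
)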
looks good! a few minor comments
model(inputs[:2])
model(inputs[:1])

inputs[:2].fill_(1.0)
For the test, should we randomly initialize the inputs and check bitwise equivalence?
Updated to randomize the test input each test run. By bitwise equivalence, do you mean torch.equal() on the final output or something else? Using torch.equal (rtol=0, atol=0) fails currently, but that might be expected?
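For reference, a small sketch of the two exact-comparison options being discussed (standard PyTorch APIs; the tensors are placeholders, not the actual test outputs):

import torch

out_eager = torch.randn(8)
out_compiled = out_eager.clone()

# Bitwise equality: shapes, dtypes, and every element must match exactly.
assert torch.equal(out_eager, out_compiled)

# The same strictness via assert_close, which prints a useful diff on failure.
torch.testing.assert_close(out_eager, out_compiled, rtol=0, atol=0)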
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Purpose
In some cases, we might want to compile multiple subgraphs, e.g. if some submodules in the model are not traceable with torch dynamo. See #14913 and #19719 for motivating scenarios.
This PR adds a test case showcasing how this can be supported in vLLM for piecewise compilation. This requires several components:
- set_model_tag(tag) to ensure the compile cache is not shared between these subgraphs (initially added in #19064, "[torch.compile] reorganize the cache directory to support compiling multiple models").
- An ignore_torch_compile helper decorator for scenarios where a class inherits from a parent module that has the support_torch_compile decorator applied but does not want its own forward() compiled. Added test_ignore_torch_compile_decorator to cover this; a rough sketch of the pattern follows this list.
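A rough sketch of the decorator pattern described above, assuming ignore_torch_compile lives alongside support_torch_compile in vllm.compilation.decorators (the import path and module bodies are assumptions, not code from this PR):

import torch
from torch import nn
from vllm.compilation.decorators import (  # assumed import path
    ignore_torch_compile,
    support_torch_compile,
)
from vllm.config import VllmConfig

@support_torch_compile
class ParentModel(nn.Module):
    # Hypothetical module: its forward() is captured and compiled piecewise.
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

@ignore_torch_compile
class ChildModel(ParentModel):
    # Inherits from a support_torch_compile'd parent but opts out of
    # compiling its own forward(); it runs eagerly.
    pass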
Test Plan
Run the two newly added unit tests (see Test Result below).
Test Result
Both newly added unit tests (test_ignore_torch_compile_decorator and test_multi_graph_piecewise_compile) pass.