[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer #20059
base: main
Changes from 2 commits
@@ -3974,13 +3974,21 @@
    splitting certain operations such as attention into subgraphs. Thus this
    flag cannot be used together with splitting_ops. This may provide
    performance benefits for smaller models."""
Inline review thread on the name of the new separate_attention_routine flag added just below:

- I think this should be named better. Perhaps
- I think we must leave such a flag in the global config, which tells the compiler backend to do the right thing. Otherwise, how is the attention backend supposed to communicate its requirements to the compiler? At least for now, the
- I am not sure what name can be better. Btw, I'm afraid
- Good call on the name. Also makes sense we use this to communicate from attention backend to compiler. Let's make sure that happens inside
- I think we should figure out a different name for this; the current name doesn't indicate any relation to cudagraphs.
- I'm not as zoned into this PR as you folks are, but I have no clue what this flag is from the name.
- How about
- Just changed to
    separate_attention_routine: bool = False
    """
    Enable a distinct attention calls routine under an attention backend for full
    cuda graph capturing. This is because some attention backends like FlashMLA,
    FlashInfer, FA2, etc. implement different branches for mix prefill-decode and
    pure decode cases. This flag enables us to potentially capture the cudagraph
    separately for each branch.
    """

    pass_config: PassConfig = field(default_factory=PassConfig)
    """Custom inductor passes, see PassConfig for more details"""

    max_capture_size: int = field(default=None, init=False)  # type: ignore
    """not configurable, computed after init"""
    local_cache_dir: str = field(default=None, init=False)  # type: ignore
    """local cache dir for each rank"""
    bs_to_padded_graph_size: list[int] = field(
        default=None,  # type: ignore
@@ -4172,13 +4180,16 @@

    def set_splitting_ops_for_v1(self):
        # NOTE: this function needs to be called
        if self.splitting_ops and self.full_cuda_graph:
            raise ValueError("full_cuda_graph cannot be used together with "
                             "splitting_ops, as Full CUDA graph will override "
                             f"the splitting_ops: {self.splitting_ops}")

        # NOTE: When full_cuda_graph is True, instead of setting an empty
        # list and capture the full cudagraph inside the flattened fx graph,
        # we keep the piecewise fx graph structure but capture the full
        # cudagraph outside the fx graph. This reduces some cpu overhead when
        # the runtime batch_size is not cudagraph captured. This is only
        # supported for separate_attention_routine.
        if self.separate_attention_routine:
            assert self.full_cuda_graph, "separate_attention_routine requires full_cuda_graph to be True"

        if not self.splitting_ops:
-            self.splitting_ops = [] if self.full_cuda_graph else [
+            self.splitting_ops = [
                "vllm.unified_attention",
                "vllm.unified_attention_with_output",
            ]
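For context, here is a hedged usage sketch showing how the two options touched in this hunk are meant to be combined from user code. The exact constructor surface of CompilationConfig and of the LLM entry point may differ between vLLM versions, and the model name is only a placeholder.

from vllm import LLM
from vllm.config import CompilationConfig

# separate_attention_routine is only meaningful together with full_cuda_graph;
# enabling it alone would trip the assert in set_splitting_ops_for_v1 above.
compilation_config = CompilationConfig(
    full_cuda_graph=True,
    separate_attention_routine=True,
)

llm = LLM(model="facebook/opt-125m", compilation_config=compilation_config)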
@@ -4186,7 +4197,7 @@

@config
@dataclass(config=ConfigDict(arbitrary_types_allowed=True))
class VllmConfig:
    """Dataclass which contains all vllm-related configuration. This
    simplifies passing around the distinct configurations in the codebase.
    """
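Because VllmConfig bundles every sub-config, downstream components can read the new flag through a single object. The helper below is hypothetical (it is not part of this diff) and assumes the usual compilation_config attribute on VllmConfig.

from vllm.config import VllmConfig


def wants_separate_attention_cudagraphs(vllm_config: VllmConfig) -> bool:
    # True when the runner should capture the pure-decode attention branch
    # as its own full cudagraph, per the flag added in this PR.
    cc = vllm_config.compilation_config
    return bool(cc.full_cuda_graph and cc.separate_attention_routine)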