Why must max_num_batched_tokens be <= 65528 when LoRA is enabled? Can the limit be raised? #6247
Closed
junior-zsy asked this question in Q&A
Replies: 1 comment · 2 replies
-
Version 0.5.5 has already removed this restriction; see #7288.
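
For anyone landing here, a minimal sketch of what this means in practice, assuming vLLM >= 0.5.5 (after PR #7288). The model name and the token value below are illustrative assumptions, not taken from this thread:

```python
# Minimal sketch, assuming vLLM >= 0.5.5: a batched-token budget above
# the old 65528 cap should now be accepted with LoRA enabled.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # illustrative LoRA-capable base model
    enable_lora=True,
    max_num_batched_tokens=131072,      # > 65528: rejected before 0.5.5, accepted after
)
```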
-
When I used LoRA, I hit this error: "Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled." I would like to know the specific reason for this limit, and whether it can be raised to a larger value; I have a use case that needs this. Do you have any specific ideas for me? Thank you @Yard1 @simon-mo
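
For context, here is a sketch of the kind of validation that produced this error in vLLM versions before 0.5.5. Only the error text and the 65528 cap come from the message quoted above; the function and constant names are hypothetical. Notably, 65528 is the largest multiple of 8 below 2^16, which hints at a 16-bit indexing or alignment constraint in the custom LoRA kernel (an inference, not confirmed in this thread):

```python
# Hypothetical reconstruction of the pre-0.5.5 check; names are assumptions.
_MAX_LORA_BATCHED_TOKENS = 65528  # largest multiple of 8 below 2**16

def validate_scheduler_config(max_num_batched_tokens: int, lora_enabled: bool) -> None:
    # Reject token budgets the custom LoRA CUDA kernel could not handle.
    if lora_enabled and max_num_batched_tokens > _MAX_LORA_BATCHED_TOKENS:
        raise ValueError(
            "Due to limitations of the custom LoRA CUDA kernel, "
            "max_num_batched_tokens must be <= 65528 when LoRA is enabled."
        )
```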