Replies: 2 comments
- Please give our fork a look! Tenstorrent implemented the paged kernels needed for vLLM: https://github.com/tenstorrent/vllm/blob/dev/tt_metal/README.md
- There are some RFCs related to hardware support in the issues. You can look into them.
-
Hi, vLLM community,
I want to make vLLM support a new hardware: Tenstorrent's Grayskull (which is a general-purpose DLA, programmable like CUDA, but not CUDA). After reading the documentation and the code, I have some understanding and some questions, and I need the community's help to clarify my thoughts and check my understanding. Please correct me if I have any misunderstandings.
My understandings
- The core of vLLM is PagedAttention, which is a highly optimized "memory paging mechanism" implemented on CUDA (attention_kernel.cu).
- These CUDA kernels are bound to PyTorch in torch_bindings.cpp.
- To support a new hardware, I need to implement PagedAttention with a Tenstorrent Grayskull kernel (that will be a huge amount of work).
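To check my understanding of what the paged kernel actually computes, here is a naive pure-PyTorch sketch of a single decode step. The function name, argument names, and tensor layouts are my own assumptions for illustration, not vLLM's actual signatures:

```python
import torch

def paged_attention_reference(query, key_cache, value_cache, block_table, seq_len):
    """Naive reference for a single sequence's decode-step paged attention.

    query:       [num_heads, head_dim] for the newest token
    key_cache:   [num_blocks, block_size, num_heads, head_dim]
    value_cache: [num_blocks, block_size, num_heads, head_dim]
    block_table: [num_logical_blocks] physical block indices for this sequence
    seq_len:     number of valid tokens in the sequence
    """
    # Gather this sequence's K/V from scattered physical blocks into
    # contiguous tensors, then truncate padding in the last block.
    keys = key_cache[block_table].reshape(-1, *key_cache.shape[2:])[:seq_len]
    values = value_cache[block_table].reshape(-1, *value_cache.shape[2:])[:seq_len]
    # [seq_len, num_heads, head_dim] -> [num_heads, seq_len, head_dim]
    keys = keys.transpose(0, 1)
    values = values.transpose(0, 1)
    # Standard scaled dot-product attention for the newest token.
    scale = query.shape[-1] ** -0.5
    scores = (query.unsqueeze(1) @ keys.transpose(1, 2)) * scale
    probs = scores.softmax(dim=-1)
    out = probs @ values  # [num_heads, 1, head_dim]
    return out.squeeze(1)
```

If my mental model is right, the real CUDA kernel computes exactly this, but fused and without ever materializing the gathered contiguous K/V.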
My questions
- In torch_binding.py I saw that a lot of operations are bound. Do I need to implement them all, or just paged_attention_v2()?
- Could I implement only a forward() function to adapt to vLLM's interface, without PagedAttention? Would it still work, just with worse performance?

Thank you for reading my long questions, and thanks in advance for the help :D
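P.S. To make my second question concrete, here is roughly the kind of forward()-only adapter I have in mind: keep each sequence's K/V contiguous and call plain scaled-dot-product attention, with no paging at all. The class and method names are purely my own sketch, not anything from vLLM:

```python
import torch
import torch.nn.functional as F

class NaiveAttentionBackend:
    """Hypothetical backend that skips PagedAttention entirely: it keeps a
    single sequence's K/V contiguous and calls plain SDPA every decode step.
    Functionally it should match paged attention, just without the memory
    efficiency that paging provides."""

    def __init__(self):
        self.k_cache = []  # one [num_heads, 1, head_dim] entry per past token
        self.v_cache = []

    def forward(self, query, key, value):
        # query/key/value: [num_heads, 1, head_dim] for the newest token.
        self.k_cache.append(key)
        self.v_cache.append(value)
        keys = torch.cat(self.k_cache, dim=1)    # [num_heads, t, head_dim]
        values = torch.cat(self.v_cache, dim=1)
        # The newest token attends to all cached tokens, itself included.
        return F.scaled_dot_product_attention(query, keys, values)
```

If something like this is acceptable as a first step, I would port only this forward() to Grayskull and leave the paged kernels for later.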