TensorRT-LLM released #3658
Dampfinchen started this conversation in Ideas
Replies: 1 comment, 1 reply
https://www.tomshardware.com/news/nvidia-boosts-ai-performance-with-tensorrt

Could TensorRT-LLM be useful for CUDA acceleration? @slaren @JohannesGaessler

There are a few areas that I think could still significantly improve the performance of the CUDA backend, especially in prompt or batch processing:

I don't think that TensorRT is likely to help with these issues. Additionally, we generally try to avoid adding large dependencies to llama.cpp.