Commit 4ef45da

docs: Triton TRT-LLM user guide (#7529)
<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# TensorRT-LLM User Guide

## What is TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
(TRT-LLM) is an open-source library designed to accelerate and optimize the
inference performance of large language models (LLMs) on NVIDIA GPUs. TRT-LLM
offers an easy-to-use Python API to build TensorRT engines for LLMs,
incorporating state-of-the-art optimizations for efficient inference.
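
To give a sense of what that Python API looks like, here is a minimal sketch
based on TRT-LLM's high-level `LLM` API. The module layout, the
`SamplingParams` arguments, and the Hugging Face model name are taken from the
TRT-LLM quick-start examples and may differ across releases, so treat this as
illustrative rather than authoritative:

```python
# Minimal sketch of the TRT-LLM high-level Python API (the "LLM API").
# Assumes a recent tensorrt_llm release where LLM and SamplingParams are
# exported at the top level, as in the TRT-LLM quick-start examples.
from tensorrt_llm import LLM, SamplingParams


def main():
    # Engine building happens under the hood when the LLM object is created
    # from a Hugging Face model (illustrative model name).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(["What is TensorRT-LLM?"], sampling_params)

    for output in outputs:
        # Each result carries the prompt and one or more generated sequences.
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```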

## How to run TRT-LLM models with Triton Server via TensorRT-LLM backend

The
[TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend)
lets you serve TensorRT-LLM models with Triton Inference Server. Check out the
[Getting Started](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#getting-started)
section in the TensorRT-LLM Backend repo to learn how to utilize the
[NGC Triton TRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)
to prepare engines for your LLM models and serve them with Triton.
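
Once the server is up, you can exercise it from any HTTP client. The sketch
below assumes the example model repository from the backend's Getting Started
guide, where the callable model is named `ensemble`, Triton's HTTP endpoint
listens on port 8000, and the model exposes `text_input`, `max_tokens`, and
`text_output`; adjust these names to match your own repository:

```python
# Hedged example: send a prompt to a running Triton + TRT-LLM deployment
# through Triton's HTTP "generate" endpoint.
import requests

# Assumptions: Triton's HTTP port is 8000 and the top-level model is the
# "ensemble" model from the backend's example repository.
URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": "What is machine learning?",  # prompt
    "max_tokens": 64,                           # generation length cap
}

response = requests.post(URL, json=payload, timeout=60)
response.raise_for_status()
# The example ensemble returns the generated text under "text_output".
print(response.json()["text_output"])
```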

## How to use your custom TRT-LLM model

All the supported models can be found in the
[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) folder in
the TRT-LLM repo. Follow the examples to convert your models to TensorRT
engines.
56+
57+
After the engine is built, [prepare the model repository](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#prepare-the-model-repository)
58+
for Triton, and
59+
[modify the model configuration](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration).
60+
61+
Only the *mandatory parameters* need to be set in the model config file. Feel free
62+
to modify the optional parameters as needed. To learn more about the
63+
parameters, model inputs, and outputs, see the
64+
[model config documentation](ttps://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/model_config.md) for more details.
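
The config templates shipped with the backend mark those parameters with
`${...}` placeholders, and the backend repo provides a `tools/fill_template.py`
script to fill them in. As a rough illustration of what that step does, here is
a small, hypothetical Python helper (not the backend's script) that substitutes
placeholder values in a `config.pbtxt`; prefer the official script for real
deployments:

```python
# Hypothetical helper illustrating how ${...} placeholders in a TRT-LLM
# backend config.pbtxt template get filled with concrete values.
# The backend repo's tools/fill_template.py is the supported way to do this.
from pathlib import Path


def fill_config(template_path: str, output_path: str, values: dict) -> None:
    text = Path(template_path).read_text()
    for key, value in values.items():
        # Templates use placeholders such as ${triton_max_batch_size}.
        text = text.replace("${" + key + "}", str(value))
    Path(output_path).write_text(text)


if __name__ == "__main__":
    # Example mandatory parameters; the names follow the backend's templates
    # but should be checked against the model config documentation.
    fill_config(
        "triton_model_repo/tensorrt_llm/config.pbtxt",
        "triton_model_repo/tensorrt_llm/config.pbtxt",
        {
            "triton_backend": "tensorrtllm",
            "triton_max_batch_size": 64,
            "decoupled_mode": "True",
            "engine_dir": "/engines/llama/1-gpu",
        },
    )
```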

## Advanced Configuration Options and Deployment Strategies

Explore advanced configuration options and deployment strategies to optimize
and run Triton with your TRT-LLM models effectively:

- [Model Deployment](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#model-deployment): Techniques for efficiently deploying and managing your models in various environments.
- [Multi-Instance GPU (MIG) Support](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#mig-support): Run Triton and TRT-LLM models with MIG to optimize GPU resource management.
- [Scheduling](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#scheduling): Configure scheduling policies to control how requests are managed and executed.
- [Key-Value Cache](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#key-value-cache): Utilize KV cache and KV cache reuse to optimize memory usage and improve performance.
- [Decoding](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#decoding): Advanced methods for generating text, including top-k, top-p, top-k top-p, beam search, Medusa, and speculative decoding. A request-level sketch of passing decoding options follows this list.
- [Chunked Context](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#chunked-context): Split the context into several chunks and batch them during the generation phase to increase overall throughput.
- [Quantization](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#quantization): Apply quantization techniques to reduce model size and enhance inference speed.
- [LoRA (Low-Rank Adaptation)](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#lora): Use LoRA for efficient model fine-tuning and adaptation.
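
As a concrete example of the decoding options above, the request payload
accepted by the backend's example ensemble can carry sampling parameters
alongside the prompt. The field names below (`top_k`, `top_p`, `temperature`,
`beam_width`) follow that example model and are assumptions to verify against
the model config documentation:

```python
# Hedged sketch: request top-k / top-p sampling from a served TRT-LLM model.
# Field names follow the tensorrtllm_backend example ensemble and should be
# verified against your model's config.pbtxt and the model_config docs.
import requests

payload = {
    "text_input": "Write a haiku about GPUs.",
    "max_tokens": 48,
    "temperature": 0.7,   # soften the output distribution
    "top_k": 40,          # sample from the 40 most likely tokens
    "top_p": 0.9,         # nucleus sampling threshold
    # "beam_width": 2,    # beam search instead of sampling, if the engine
                          # was built with a max beam width > 1
}

response = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate", json=payload, timeout=60
)
response.raise_for_status()
print(response.json()["text_output"])
```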

## Tutorials

Make sure to check out the
[tutorials](https://github.com/triton-inference-server/tutorials) repo to see
more guides on serving popular LLM models with Triton Server and TensorRT-LLM,
as well as deploying them on Kubernetes.

## Benchmark

[GenAI-Perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf)
is a command line tool for measuring the throughput and latency of LLMs served
by Triton Inference Server. Check out the
[Quick Start](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#quick-start)
to learn how to use GenAI-Perf to benchmark your LLM models.

## Performance Best Practices

Check out the
[Performance Best Practices guide](https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html)
to learn how to optimize your TensorRT-LLM models for better performance.

## Metrics

Triton Server provides
[metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md)
indicating GPU and request statistics.
See the
[Triton Metrics](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#triton-metrics)
section in the TensorRT-LLM Backend repo to learn how to query the Triton
metrics endpoint to obtain TRT-LLM statistics.
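
For a quick look at what that endpoint returns, the sketch below scrapes
Triton's Prometheus-format metrics over HTTP. It assumes the default metrics
port (8002) and filters on a `trt_llm` substring, which is an assumption about
the metric naming rather than an exact list of metric names:

```python
# Hedged sketch: read Triton's Prometheus-format metrics endpoint and print
# the TRT-LLM related entries. Assumes the default metrics port (8002).
import requests

metrics = requests.get("http://localhost:8002/metrics", timeout=10).text

for line in metrics.splitlines():
    # TRT-LLM backend statistics are exposed alongside Triton's own metrics;
    # filtering on a substring avoids hard-coding metric names, which may vary.
    if "trt_llm" in line:
        print(line)
```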

## Ask questions or report issues

Can't find what you're looking for, or have a question or issue? Feel free to
ask questions or report issues on the relevant GitHub issues page:

- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/issues)
- [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend/issues)
- [Triton Inference Server](https://github.com/triton-inference-server/server/issues)
