docs/source/getting_started/troubleshooting.md (1 addition, 1 deletion)

@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form
 ## Out of memory
 
-If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider [using tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](#reducing-memory-usage) to reduce the memory consumption.
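The paragraph removed here (and reintroduced as a note in `offline_inference.md` below) describes the sharded-checkpoint workflow only in prose. As a rough illustration of that workflow, here is a minimal sketch: the conversion flags and the `sharded_state` load format are assumptions based on the referenced example script, and the model name, output path, and tensor-parallel size are placeholders, so verify them against the script's `--help` output for your vLLM version.

```python
# Rough sketch of the sharded-checkpoint workflow described in the removed
# paragraph. Flag names and paths are assumptions; verify them against
# examples/offline_inference/save_sharded_state.py in your vLLM checkout.
#
# Step 1 (run once, from a shell): save the checkpoint pre-sharded for the
# tensor-parallel size you intend to use, e.g.
#
#   python examples/offline_inference/save_sharded_state.py \
#       --model meta-llama/Llama-2-13b-hf \
#       --tensor-parallel-size 4 \
#       --output /path/to/sharded-ckpt
#
# Step 2: load the sharded checkpoint; each rank then reads only its own
# shard instead of the full model.
from vllm import LLM

llm = LLM(
    model="/path/to/sharded-ckpt",   # directory written in step 1
    load_format="sharded_state",     # assumed load-format name
    tensor_parallel_size=4,          # must match the conversion step
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```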
docs/source/serving/offline_inference.md (53 additions, 3 deletions)

@@ -59,6 +59,8 @@ model = LLM(
 Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
 
+(reducing-memory-usage)=
+
 ### Reducing memory usage
 
 Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

@@ -81,6 +83,12 @@ before initializing vLLM. Otherwise, you may run into an error like `RuntimeErro
 To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
 :::
 
+:::{note}
+With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+
+You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+:::
+
 #### Quantization
 
 Quantized models take less memory at the cost of lower precision.
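For the quantization option that closes the diff, a minimal sketch of loading a pre-quantized checkpoint follows; the model name is a placeholder and `quantization="awq"` assumes an AWQ-quantized checkpoint, so substitute whichever quantized model and method you actually use.

```python
# Sketch of the quantization option: load a pre-quantized checkpoint so the
# weights occupy less GPU memory, at the cost of some precision.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # placeholder: any AWQ-quantized checkpoint
    quantization="awq",                # must match how the checkpoint was quantized
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```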