Commit 89b0f84 (1 parent: 497a91e)

[doc] fix "Other AI accelerators" getting started page (#19457)

Signed-off-by: David Xia <david@davidxia.com>

3 files changed: +50 −43 lines
docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md

Lines changed: 3 additions & 3 deletions
@@ -19,7 +19,8 @@ to set up the execution environment. To achieve the best performance,
 please follow the methods outlined in the
 [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

-## Configure a new environment
+# --8<-- [end:requirements]
+# --8<-- [start:configure-a-new-environment]

 ### Environment verification

@@ -56,7 +57,7 @@ docker run \
 vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
 ```

-# --8<-- [end:requirements]
+# --8<-- [end:configure-a-new-environment]
 # --8<-- [start:set-up-using-python]

 # --8<-- [end:set-up-using-python]
@@ -183,7 +184,6 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
 | 0 | 0 | torch.compile |
 | 0 | 1 | PyTorch eager mode |
 | 1 | 0 | HPU Graphs |
-<figcaption>vLLM execution modes</figcaption>

 !!! warning
     In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
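A hedged sketch of selecting an execution mode via the `PT_HPU_LAZY_MODE` environment variable from the table above. The serving command and model name are placeholders, and the table's second column (not named in this hunk) is assumed to be off by default:

```console
# HPU Graphs (PT_HPU_LAZY_MODE=1), the configuration the warning above
# recommends for best performance on 1.18.0.
PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Llama-3.1-8B-Instruct

# torch.compile mode (PT_HPU_LAZY_MODE=0), experimental in 1.18.0 and intended
# only for validating functional correctness.
PT_HPU_LAZY_MODE=0 vllm serve meta-llama/Llama-3.1-8B-Instruct
```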

docs/getting_started/installation/ai_accelerator/neuron.inc.md

Lines changed: 28 additions & 27 deletions
@@ -1,8 +1,8 @@
 # --8<-- [start:installation]

-[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
-generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
-and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
+[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
+generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
+and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
 This tab describes how to set up your environment to run vLLM on Neuron.

 !!! warning
@@ -17,11 +17,12 @@
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23

-## Configure a new environment
+# --8<-- [end:requirements]
+# --8<-- [start:configure-a-new-environment]

 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

-The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to follow this
+The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to follow this
 [quick start guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami) using the Neuron Deep Learning AMI (Amazon machine image).

 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
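A brief verification sketch for the step above, assuming the Neuron DLAMI ships the standard `aws-neuronx-tools` utilities:

```console
# Activate the preinstalled NxD Inference virtual environment (path from the step above)
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate

# List the Neuron devices and NeuronCores visible to this instance to confirm
# the driver and runtime are healthy (neuron-ls comes with aws-neuronx-tools)
neuron-ls
```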
@@ -30,14 +31,14 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
     source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
     ```

-Refer to the [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)
+Refer to the [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)
 for alternative setup instructions including using Docker and manually installing dependencies.

 !!! note
-    NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
-    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
+    NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
+    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

-# --8<-- [end:requirements]
+# --8<-- [end:configure-a-new-environment]
 # --8<-- [start:set-up-using-python]

 # --8<-- [end:set-up-using-python]
@@ -59,14 +60,14 @@ pip install -U -r requirements/neuron.txt
 VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```

-AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
-[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
+AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
+[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
 available on vLLM V0. Please utilize the AWS Fork for the following features:

 - Llama-3.2 multi-modal support
-- Multi-node distributed inference
+- Multi-node distributed inference

-Refer to [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)
+Refer to [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)
 for more details and usage examples.

 To install the AWS Neuron fork, run the following:
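The exact install commands sit outside this hunk; the following is a plausible sketch (an assumption, not the verbatim doc text), using the branch named in the fork URL above and the same `VLLM_TARGET_DEVICE` convention shown earlier:

```console
# Clone the AWS Neuron fork at the branch referenced above
git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
cd upstreaming-to-vllm

# Install Neuron requirements and build vLLM for the Neuron target device
pip install -U -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
```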
@@ -101,11 +102,11 @@ Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dock
 [](){ #feature-support-through-nxd-inference-backend }
 ### Feature support through NxD Inference backend

-The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
-to perform most of the heavy lifting which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
-[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.
+The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
+to perform most of the heavy lifting which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
+[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.

-To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
+To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
 ```console
 override_neuron_config={
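A hedged usage sketch for the `override_neuron_config` setting described in this section; the model name is a placeholder and `vllm serve` is used as the CLI entrypoint:

```console
# Disable auto bucketing when launching the OpenAI-compatible server; the flag
# takes a JSON object, as shown in the CLI form in the next hunk.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --override-neuron-config '{"enable_bucketing": false}'
```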
@@ -117,33 +118,33 @@ or when launching vLLM from the CLI, pass
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```

-Alternatively, users can directly call the NxDI library to trace and compile your model, then load the pre-compiled artifacts
-(via `NEURON_COMPILED_ARTIFACTS` environment variable) in vLLM to run inference workloads.
+Alternatively, users can directly call the NxDI library to trace and compile your model, then load the pre-compiled artifacts
+(via `NEURON_COMPILED_ARTIFACTS` environment variable) in vLLM to run inference workloads.

 ### Known limitations

 - EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
   [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
   for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
-- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
-  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
+- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
+  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
   to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
-- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
+- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
   runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
 - Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
   to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
 - Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
   to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
   to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
-- Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
-  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
-  to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
-  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
+- Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
+  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
+  to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
+  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
   implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.


 ### Environment variables
-- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
+- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
   compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
   artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
   but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
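A sketch of the `NEURON_COMPILED_ARTIFACTS` workflow described above; the path and model name are placeholders:

```console
# Point vLLM at artifacts previously traced and compiled with NxDI so the
# server skips compilation at startup. If the directory is missing or invalid,
# Neuron falls back to a fresh compilation as described above.
export NEURON_COMPILED_ARTIFACTS=/path/to/neuron-compiled-artifacts/<unique_hash>
vllm serve meta-llama/Llama-3.1-8B-Instruct
```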

docs/getting_started/installation/ai_accelerator/tpu.inc.md

Lines changed: 19 additions & 13 deletions
@@ -58,43 +58,49 @@ assigned to your Google Cloud project for your immediate exclusive use.
 ### Provision Cloud TPUs with GKE

 For more information about using TPUs with GKE, see:
+
 - <https://cloud.google.com/kubernetes-engine/docs/how-to/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
 - <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>

-## Configure a new environment
+# --8<-- [end:requirements]
+# --8<-- [start:configure-a-new-environment]

 ### Provision a Cloud TPU with the queued resource API

 Create a TPU v5e with 4 TPU chips:

 ```console
 gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
-    --node-id TPU_NAME \
-    --project PROJECT_ID \
-    --zone ZONE \
-    --accelerator-type ACCELERATOR_TYPE \
-    --runtime-version RUNTIME_VERSION \
-    --service-account SERVICE_ACCOUNT
+    --node-id TPU_NAME \
+    --project PROJECT_ID \
+    --zone ZONE \
+    --accelerator-type ACCELERATOR_TYPE \
+    --runtime-version RUNTIME_VERSION \
+    --service-account SERVICE_ACCOUNT
 ```

 | Parameter name | Description |
 |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
-| TPU_NAME | The user-assigned name of the TPU which is created when the queued |
+| TPU_NAME | The user-assigned name of the TPU which is created when the queued resource request is allocated. |
 | PROJECT_ID | Your Google Cloud project |
-| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use |
-| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example |
-| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
-<figcaption>Parameter descriptions</figcaption>
+| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones] |
+| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
+| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images]. |
+| SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` |

 Connect to your TPU using SSH:

 ```bash
 gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
 ```

-# --8<-- [end:requirements]
+[TPU versions]: https://cloud.google.com/tpu/docs/runtimes
+[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
+[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
+
+# --8<-- [end:configure-a-new-environment]
 # --8<-- [start:set-up-using-python]

 # --8<-- [end:set-up-using-python]
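A filled-in sketch of the queued-resource request above for a v5e TPU with 4 chips. Every value is illustrative: the zone, runtime version, project, and service account are assumptions for your own environment, not values from this commit:

```console
# Request a v5e TPU with 4 chips via the queued resource API
# (zone and runtime version are assumed; check the TPU regions/zones and
# TPU VM images pages linked above for the correct values).
gcloud alpha compute tpus queued-resources create my-vllm-qr \
    --node-id my-vllm-tpu \
    --project my-gcp-project \
    --zone us-west4-a \
    --accelerator-type v5litepod-4 \
    --runtime-version v2-alpha-tpuv5-lite \
    --service-account tpu-service-account@my-gcp-project.iam.gserviceaccount.com

# Then connect over SSH as shown above:
gcloud compute tpus tpu-vm ssh my-vllm-tpu --zone us-west4-a
```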
