# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
- #
+
+ # ----------------------------------------------------------------------------------
+ # This module manages the patches for vllm. There are two folders in this module:
+ # - platform: contains the patches applied before the worker starts. They are applied by
+ #   `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
+ #   `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
+ # - worker: contains the patches applied when the worker starts. They are applied by
+ #   `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
+ #   each worker's `__init__` function.
+ #
+ # Each kind of patch then contains three folders:
+ # - patch_0_8_4: contains the patches applied when the vllm version is 0.8.4.
+ # - patch_main: contains the patches applied when the vllm version is the main branch.
+ # - patch_common: contains the patches applied to both 0.8.4 and the main branch.
+ #
+ # In the future, as the vllm version is upgraded, new patch folders such as
+ # patch_0_8_5, patch_0_8_6, etc. will be added to manage the patches for the different
+ # vllm versions, and patch_common will contain the patches applied to all
+ # vllm versions.
+ # Once a vllm version becomes too old for vllm-ascend to support, the related
+ # patch folder will be removed as well.
+ #
+ # Once a new patch is added to vllm-ascend, please add its description to this file as well.
+ # ----------------------------------------------------------------------------------
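+ #
+ # A minimal sketch of how the dispatch described above could work. This is an
+ # illustrative assumption, not the actual `vllm_ascend.utils.adapt_patch` code; the
+ # module paths are only examples of the folder layout described above:
+ #
+ #     import importlib
+ #
+ #     from vllm import __version__ as vllm_version
+ #
+ #     def adapt_patch(is_global_patch: bool = False) -> None:
+ #         # "platform" patches run before workers start, "worker" patches run in each worker.
+ #         base = ("vllm_ascend.patch.platform"
+ #                 if is_global_patch else "vllm_ascend.patch.worker")
+ #         # patch_common is always applied; the version-specific folder is applied on top.
+ #         versioned = "patch_0_8_4" if vllm_version.startswith("0.8.4") else "patch_main"
+ #         for pkg in ("patch_common", versioned):
+ #             # Importing the package executes its patch modules, which monkey-patch vllm.
+ #             importlib.import_module(f"{base}.{pkg}")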
+
+ # What's Patched and how it works:
+ # --------------------------------
+ # * Platform Patch:
+ # =================
+ #   ** File: platform/patch_0_8_4/patch_config.py **
+ #   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ #   1. `vllm.config.ModelConfig.__init__()`
+ #      Why:
+ #        Sleep mode is hard-coded to be supported on the cuda platform only.
+ #      How:
+ #        Use a new method to check whether sleep mode is available.
+ #      Related PR (if no, explain why):
+ #        https://github.com/vllm-project/vllm/pull/16562
+ #      Future Plan:
+ #        This patch is only used for 0.8.4 and can't be reverted; just keep it as it is.
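+ #
+ #      A rough sketch of the monkey-patch pattern used here (illustrative only; the
+ #      platform helper `is_sleep_mode_available` is an assumed name, not a confirmed
+ #      vllm-ascend API):
+ #
+ #        import vllm.config
+ #        from vllm.platforms import current_platform
+ #
+ #        def _patched_model_config_init(self, *args, **kwargs):
+ #            # Re-implements the upstream __init__, but replaces the hard-coded
+ #            # "sleep mode needs CUDA" check with a platform-level query such as the
+ #            # hypothetical current_platform.is_sleep_mode_available().
+ #            ...
+ #
+ #        vllm.config.ModelConfig.__init__ = _patched_model_config_init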
+ #
+ #   ** File: platform/patch_common/patch_distributed.py **
+ #   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ #   1. `vllm.distributed.parallel_state.destroy_model_parallel()`
+ #      Why:
+ #        vllm does not support an out-of-tree platform maintaining its own `CoordinatorGroup`. vllm-ascend
+ #        maintains EP and ETP groups inside this repo and needs a common interface to destroy them. This
+ #        patch adds an interface for destroying platform-owned `CoordinatorGroup`s to make sure all of them
+ #        can be properly destroyed.
+ #      How:
+ #        Call the platform method `destroy_platform_model_parallel` to destroy all the `CoordinatorGroup`s.
+ #      Related PR (if no, explain why):
+ #        No related PR; we want to add this ability to vllm.
+ #      Future Plan:
+ #        Remove this patch when vllm merges it.
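+ #
+ #      A minimal sketch of the wrapper (illustrative; the hook lookup via `getattr` is an
+ #      assumption about how the platform exposes `destroy_platform_model_parallel`):
+ #
+ #        import vllm.distributed.parallel_state as parallel_state
+ #        from vllm.platforms import current_platform
+ #
+ #        _unpatched_destroy = parallel_state.destroy_model_parallel
+ #
+ #        def destroy_model_parallel() -> None:
+ #            _unpatched_destroy()  # tear down TP/PP groups as upstream does
+ #            # Let the platform tear down its own coordinator groups (e.g. EP/ETP).
+ #            hook = getattr(current_platform, "destroy_platform_model_parallel", None)
+ #            if hook is not None:
+ #                hook()
+ #
+ #        parallel_state.destroy_model_parallel = destroy_model_parallel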
+ #   2. `vllm.distributed.stateless_init_torch_distributed_process_group()`
+ #      Why:
+ #        The stateless process group can only be initialized with the gloo and nccl backends. vllm-ascend
+ #        needs to initialize its own stateless process group for communication, so we add a platform-related
+ #        call to `stateless_init_torch_distributed_process_group` to enable other platforms that may provide
+ #        their own stateless process group initialization method.
+ #      How:
+ #        Call the platform method `platform_has_backend_register` to check whether the platform provides a
+ #        stateless process group initialization method, and call the platform method `platform_register_backend`
+ #        to initialize it.
+ #      Related PR (if no, explain why):
+ #        No related PR; we want to add this ability to vllm.
+ #      Future Plan:
+ #        Remove this patch when vllm merges it.
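+ #
+ #      A sketch of the extra branch inside the patched function (illustrative; the
+ #      argument list passed to `platform_register_backend` and the local names `pg`,
+ #      `rank`, `world_size` are assumptions):
+ #
+ #        from vllm.platforms import current_platform
+ #
+ #        # inside stateless_init_torch_distributed_process_group(...):
+ #        if backend == "gloo":
+ #            ...  # upstream gloo path, unchanged
+ #        elif backend == "nccl":
+ #            ...  # upstream nccl path, unchanged
+ #        elif current_platform.platform_has_backend_register():
+ #            # New: let the platform (e.g. an NPU/HCCL backend) register its own
+ #            # stateless process group for this backend string.
+ #            current_platform.platform_register_backend(pg, rank, world_size, backend)
+ #        else:
+ #            raise RuntimeError(f"Unsupported torch distributed backend: {backend}")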
+ #   3. `ParallelConfig.get_next_dp_init_port`
+ #      Why:
+ #        We want to get the dp port from an environment variable so that multi-node inference can be
+ #        properly initialized and run.
+ #      How:
+ #        Read the dp port from an environment variable to enable multi-node dp inference.
+ #      Related PR (if no, explain why):
+ #        No related PR; we want to add this ability to vllm.
+ #      Future Plan:
+ #        This is a workaround in vllm-ascend to enable multi-node dp inference. It may be removed once
+ #        vllm has a better plan for the multi-node dp inference implementation.
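+ #
+ #      A rough sketch of the override (illustrative; the environment variable name and
+ #      the fallback to `data_parallel_master_port` are assumptions):
+ #
+ #        import os
+ #        from vllm.config import ParallelConfig
+ #
+ #        def _patched_get_next_dp_init_port(self) -> int:
+ #            # Prefer an externally provided port so every node agrees on the rendezvous address.
+ #            env_port = os.environ.get("VLLM_DP_MASTER_PORT")
+ #            if env_port is not None:
+ #                return int(env_port)
+ #            # Otherwise keep the upstream behaviour of handing out consecutive ports.
+ #            port = self.data_parallel_master_port
+ #            self.data_parallel_master_port += 1
+ #            return port
+ #
+ #        ParallelConfig.get_next_dp_init_port = _patched_get_next_dp_init_port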
+ #
+ # * Worker Patch:
+ # ===============
+ #   ** File: worker/patch_common/patch_metrics.py **
+ #   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ #   1. `vllm.spec_decode.metrics.AsyncMetricsCollector.init_tensors` and
+ #      `vllm.spec_decode.metrics.AsyncMetricsCollector._copy_rejsample_metrics_async`
+ #      Why:
+ #        There is hard-coded cuda code (torch.cuda.Stream) in `AsyncMetricsCollector.init_tensors` and
+ #        `AsyncMetricsCollector._copy_rejsample_metrics_async`.
+ #      How:
+ #        Replace it with the corresponding npu method.
+ #      Related PR (if no, explain why):
+ #        https://github.com/vllm-project/vllm/pull/14411
+ #      Future Plan:
+ #        Revert it when the related PR is merged in vllm.
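+ #
+ #      A sketch of the replacement for `init_tensors` (illustrative; the exact body of the
+ #      patched method in vllm-ascend may differ):
+ #
+ #        import torch
+ #        from vllm.spec_decode.metrics import AsyncMetricsCollector
+ #
+ #        def _patched_init_tensors(self, rank, device_type="npu") -> None:
+ #            self._rank = rank
+ #            if isinstance(device_type, torch.device):
+ #                device_type = device_type.type
+ #            if device_type == "npu":
+ #                # NPU counterpart of the hard-coded torch.cuda.Stream()
+ #                self._copy_stream = torch.npu.Stream()
+ #
+ #        AsyncMetricsCollector.init_tensors = _patched_init_tensors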
+ #
+ #   2. `vllm.spec_decode.metrics.AsyncMetricsCollector.maybe_collect_rejsample_metrics`
+ #      Why:
+ #        There is hard-coded cuda code (current_platform.is_cuda_alike()) in
+ #        `AsyncMetricsCollector.maybe_collect_rejsample_metrics`.
+ #      How:
+ #        Change to use `current_platform.Event` to determine whether to return None.
+ #      Related PR (if no, explain why):
+ #        https://github.com/vllm-project/vllm/pull/14411
+ #      Future Plan:
+ #        Revert it when the related PR is merged in vllm.
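+ #
+ #      A sketch of the changed guard (illustrative; only the first check differs from upstream):
+ #
+ #        from vllm.platforms import current_platform
+ #        from vllm.spec_decode.metrics import AsyncMetricsCollector
+ #
+ #        def _patched_maybe_collect_rejsample_metrics(self, k: int):
+ #            # Skip collection when the platform has no Event support, instead of
+ #            # bailing out only on non-CUDA-alike platforms.
+ #            if current_platform.Event is None:
+ #                return None
+ #            ...  # the rest follows the upstream implementation
+ #
+ #        AsyncMetricsCollector.maybe_collect_rejsample_metrics = _patched_maybe_collect_rejsample_metrics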
+ #
+ #   ** File: worker/patch_common/patch_multi_step_worker.py **
+ #   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ #   1. `vllm.spec_decode.multi_step_worker.MultiStepWorker.sampler_output`
+ #      Why:
+ #        There is hard-coded cuda code (current_platform.is_cuda_alike()) in
+ #        `MultiStepWorker.sampler_output`, and we need to use the patched `TP1DraftModelRunner` in it.
+ #      How:
+ #        Make speculative decoding extensible to different backends:
+ #        - support registering attention metadata for the set of supported spec decode backends
+ #        - offer an API in the platform to determine whether spec decode is supported,
+ #          and deprecate is_cuda_alike in it.
+ #      Related PR (if no, explain why):
+ #        - https://github.com/vllm-project/vllm/pull/15195
+ #        - https://github.com/vllm-project/vllm-ascend/pull/395
+ #      Future Plan:
+ #        Revert it when the related PRs are merged in vllm and vllm-ascend.
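+ #
+ #      A sketch of the override (illustrative; the exact signature and the body of the
+ #      patched method are assumptions):
+ #
+ #        from vllm.spec_decode.multi_step_worker import MultiStepWorker
+ #
+ #        def _patched_sampler_output(self, execute_model_req, sample_len,
+ #                                    seq_ids_with_bonus_token_in_last_step):
+ #            # Same flow as upstream, but the CUDA-only fast path guarded by
+ #            # current_platform.is_cuda_alike() is replaced by a check against the
+ #            # patched, NPU-aware TP1DraftModelRunner.
+ #            ...
+ #
+ #        MultiStepWorker.sampler_output = _patched_sampler_output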
+ #
+ #   ** File: worker/patch_common/patch_spec_decode_worker.py **
+ #   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ #   1. `vllm.spec_decode.spec_decode_worker.SpecDecodeWorker.create_worker`
+ #      Why:
+ #        We need to use the patched `TP1DraftModelRunner` in `SpecDecodeWorker.create_worker`.
+ #        The main reason to overwrite `TP1DraftModelRunner` is the hard-coded
+ #        `FlashAttentionMetadata`.
+ #      How:
+ #        Ditto (same approach as `MultiStepWorker.sampler_output` above).
+ #      Related PR (if no, explain why):
+ #        - https://github.com/vllm-project/vllm/pull/15195
+ #        - https://github.com/vllm-project/vllm-ascend/pull/395
+ #      Future Plan:
+ #        Revert it when the related PRs are merged in vllm and vllm-ascend.
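+ #
+ #      A sketch of the draft-runner swap inside the patched `create_worker` (illustrative;
+ #      the Ascend import path and the `model_runner_cls` kwarg are assumptions):
+ #
+ #        # inside the patched SpecDecodeWorker.create_worker(...):
+ #        from vllm_ascend.worker.draft_model_runner import TP1DraftModelRunner
+ #
+ #        # Use the NPU draft runner, which does not assume FlashAttentionMetadata.
+ #        draft_worker_kwargs["model_runner_cls"] = TP1DraftModelRunner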