Description
GPUs: H20 96GB * 6
Model: Qwen2.5VL-7B
Installation: git clone
On autodl I start training with `source /etc/network_turbo && ray start --head && trinity run --config chord.yaml`; `source /etc/network_turbo` is autodl's academic-resource acceleration command, used only to make downloading HF resources easier.
The terminal then sits for a long time at the position shown in the log below; it looks like training stops progressing for some reason, and the GPU status is shown in the screenshot below:

As shown in the yaml, I have already set the batch size as small as possible. The GPUs are split as 4 cards for training and 2 cards for inference, but every time only the inference cards actually show any memory usage.
- Is this a symptom of OOM?
- Is it caused by the way my workflow is built? The code is attached below.
- For vlm-chord training on multiple H20s, how should the GPUs be allocated? I know you use 8 * H20; could you give me an example based on that setup?
**yaml**
```yaml
project: "lumbar_chord"
name: "chord-lingshu-7B"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./qwen2.5vl_checkpoints}
algorithm:
  algorithm_type: mix_chord
  repeat_times: 8
  kl_loss_fn_args:
    kl_coef: 0.0
  sample_strategy_args:
    expert_data_ratio: 0.20
  policy_loss_fn_args:  # feel free to change, we encourage you to try out different hyperparameters
    mu_warmup_steps: 200  # 0 for chord-mu and chord-phi
    mu_decay_steps: 400  # 200 for chord-mu and 0 for chord-phi
    mu_peak: 0.5  # 0.9 for chord-mu and 0.1 for chord-phi
    mu_valley: 0.02  # 0.05 for chord-mu and 0.1 for chord-phi
    enable_phi_function: true  # false for chord-mu and true for chord-phi
    clip_range: 0.2
    use_token_level_loss_in_sft: true
    use_dynamic_bsz: true
    ppo_mini_batch_size: 32  # 16 + 16
    ppo_micro_batch_size_per_gpu: 4
    ngpus_trainer: 4
    train_batch_size_expert: 16
    train_batch_size_usual: 16  # 2 batch_size * 8 repeat_times
model:
  model_path: for_chord
  max_response_tokens: 4096
  max_model_len: 11264
cluster:
  node_num: 1
  gpu_per_node: 6  # number of GPUs
buffer:
  total_epochs: 5
  batch_size: 2
  train_batch_size: 32
  explorer_input:
    taskset:  # used to train
      name: rl_dataset
      storage_type: file
      path: muskwff/lumbar_rl
      subset_name: 'default'
      split: 'train'
      format:
        prompt_key: 'problem'
        response_key: 'answer'
        image_key: 'images'
      rollout_args:
        temperature: 1.0
        logprobs: 0
    eval_tasksets: []
    default_workflow_type: 'LumbarWorkFlow'
    default_reward_fn_type: 'LumbarRewardFn'
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
    auxiliary_buffers:
      sft_dataset:
        total_epochs: 25
        name: sft_dataset
        storage_type: file
        path: muskwff/lumbar_sft
        split: 'train'
        format:
          prompt_key: 'problem'
          response_key: 'answer'
          image_key: 'images'
explorer:
  eval_interval: 10
  runner_per_model: 8
  rollout_model:
    engine_num: 2
    tensor_parallel_size: 1
    gpu_memory_utilization: 0.95
    enable_prefix_caching: false
    enforce_eager: true
    dtype: bfloat16
    seed: 42
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 1200
trainer:
  save_interval: 200
  trainer_config:
    actor_rollout_ref:
      model:
        use_remove_padding: true
      actor:
        use_dynamic_bsz: true
        ppo_max_token_len_per_gpu: 10240
        ulysses_sequence_parallel_size: 4
        optim:
          lr: 1e-6
      ref:
        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}
```
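
For reference, here is a small sanity-check sketch of how I read the GPU and batch split out of this config. This is just my own arithmetic over the fields above, not Trinity-RFT's actual allocation logic:

```python
# Rough sanity check of the GPU/batch split implied by the yaml above.
# My own arithmetic over the config fields, not Trinity-RFT's scheduler code.

gpu_per_node = 6          # cluster.gpu_per_node
engine_num = 2            # explorer.rollout_model.engine_num
tensor_parallel_size = 1  # explorer.rollout_model.tensor_parallel_size
ngpus_trainer = 4         # algorithm.policy_loss_fn_args.ngpus_trainer

rollout_gpus = engine_num * tensor_parallel_size  # 2 cards serve vLLM rollouts
trainer_gpus = gpu_per_node - rollout_gpus        # 4 cards left for training
assert trainer_gpus == ngpus_trainer

batch_size = 2            # buffer.batch_size (tasks per explore step)
repeat_times = 8          # algorithm.repeat_times (rollouts per task)
usual_experiences = batch_size * repeat_times      # 16, matches train_batch_size_usual
expert_experiences = 16                            # train_batch_size_expert (SFT data)
total_per_step = usual_experiences + expert_experiences  # 32, matches buffer.train_batch_size

print(f"rollout GPUs: {rollout_gpus}, trainer GPUs: {trainer_gpus}")
print(f"experiences per train step: {total_per_step}")
```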
**log**
```
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:08 [__init__.py:742] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:08 [__init__.py:1815] Using max model len 11264
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:02 [vllm_model.py:48] Using vLLM v1 engine
(vLLMRolloutModel pid=9083) `torch_dtype` is deprecated! Use `dtype` instead!
(vLLMRolloutModel pid=9120) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [arg_utils.py:1208] Using ray runtime env: {'env_vars': {'TRINITY_LOG_DIR': '/root/autodl-tmp/Trinity-RFT/./qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/log', 'TRINITY_LOG_LEVEL': 'INFO', 'TRINITY_LOG_NODE_IP': '0', 'TRINITY_PLUGIN_DIRS': ''}}
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=5120.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:3400] Cudagraph is disabled under eager mode
(QueueStorage pid=13224) WARNING 10-11 19:10:11 [queue.py:241] Save experiences in /root/autodl-tmp/Trinity-RFT/qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/buffer/experience_buffer.jsonl.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:15 [__init__.py:216] Automatically detected platform cuda.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:742] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:1815] Using max model len 11264
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [arg_utils.py:1208] Using ray runtime env: {'env_vars': {'TRINITY_LOG_DIR': '/root/autodl-tmp/Trinity-RFT/./qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/log', 'TRINITY_LOG_LEVEL': 'INFO', 'TRINITY_LOG_NODE_IP': '0', 'TRINITY_PLUGIN_DIRS': ''}}
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=5120.
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:16 [core.py:654] Waiting for init message from front-end.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:16 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='for_chord', speculative_config=None, tokenizer='for_chord', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=11264, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=for_chord, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:19 [worker_base.py:595] Injected <class 'trinity.common.models.vllm_worker.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['init_process_group', 'update_weight']
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(vLLMRolloutModel pid=9120) [W1011 19:10:19.454693254 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(vLLMRolloutModel pid=9120) `torch_dtype` is deprecated! Use `dtype` instead!
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) WARNING 10-11 19:10:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) WARNING 10-11 19:10:21 [profiling.py:280] The sequence length (11264) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:21 [gpu_model_runner.py:2338] Starting to load model for_chord...
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:16 [__init__.py:216] Automatically detected platform cuda.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:17 [core.py:654] Waiting for init message from front-end.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:17 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='for_chord', speculative_config=None, tokenizer='for_chord', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=11264, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=for_chord, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [gpu_model_runner.py:2370] Loading model from scratch...
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [cuda.py:362] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 4.29it/s]
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552)
(vLLMRolloutModel pid=9083) [W1011 19:10:19.687664909 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:25 [default_loader.py:268] Loading weights took 2.97 seconds
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:19 [worker_base.py:595] Injected <class 'trinity.common.models.vllm_worker.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['init_process_group', 'update_weight']
(vLLMRolloutModel pid=9083) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) WARNING 10-11 19:10:20 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556)
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:25 [gpu_model_runner.py:2392] Model loading took 15.6269 GiB and 3.209289 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:26 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [gpu_worker.py:298] Available KV cache memory: 72.24 GiB
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) WARNING 10-11 19:10:21 [profiling.py:280] The sequence length (11264) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:21 [gpu_model_runner.py:2338] Starting to load model for_chord...
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [gpu_model_runner.py:2370] Loading model from scratch...
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [cuda.py:362] Using Flash Attention backend on V1 engine.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:25 [default_loader.py:268] Loading weights took 3.08 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [kv_cache_utils.py:864] GPU KV cache size: 1,352,640 tokens
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [kv_cache_utils.py:868] Maximum concurrency for 11,264 tokens per request: 120.09x
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [gpu_worker.py:391] Free memory on device (94.7/95.07 GiB) on startup. Desired GPU memory utilization is (0.95, 90.32 GiB). Actual usage is 15.63 GiB for weight, 2.38 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.0 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=77408563097` to fit into requested memory, or `--kv-cache-memory=82111370752` to fully utilize gpu memory. Current kv cache memory in use is 77565849497 bytes.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [core.py:218] init engine (profile, create kv cache, warmup model) took 4.94 seconds
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:26 [gpu_model_runner.py:2392] Model loading took 15.6269 GiB and 3.232434 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:31 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:26 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:31 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 84540
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:31 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
```

**workflow code**
```python
from typing import Optional, List

from trinity.common.workflows import SimpleWorkflow, Task, WORKFLOWS
from trinity.common.models.model import ModelWrapper
from trinity.common.experience import Experience
from verl.utils.dataset.vision_utils import process_image
from trinity.common.rewards.lumbar_reward_fn import LumbarRewardFn


@WORKFLOWS.register_module("LumbarWorkFlow")
class LumbarWorkFlow(SimpleWorkflow):
    def __init__(
        self,
        task: Task,
        model: ModelWrapper,
        auxiliary_models: Optional[List] = None,
        **kwargs,
    ):
        super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
        self.model = model
        self.reward_fn = LumbarRewardFn()  # load lumbar_reward

    def run(self) -> List[Experience]:
        """
        generate response -> calculate reward -> return Experience
        """
        # generate response
        responses = self.model.chat_mm(
            messages=self.messages, images=self.images, **self.rollout_args
        )
        # return experience
        for i, response in enumerate(responses):
            reward_dict = self.reward_fn(  # type: ignore [misc]
                response=response.response_text,  # type: ignore [arg-type]
                truth=self.truth,
            )
            if response.metrics is None:
                response.metrics = {}
            response.metrics.update(reward_dict)
            reward = sum(reward_dict.values())
            response.reward = reward
            response.eid.run = i + self.run_id_base
        return responses

    def resettable(self):
        return True

    def reset(self, task: Task):
        """reset task state"""
        self.task = task
        assert task.raw_task is not None
        self.raw_task = task.raw_task  # keep self.raw_task in sync with the new task
        self.system_prompt = (
            "You are a radiologist specializing in lumbar spine diseases "
            "and are good at writing structured and standardized radiology diagnostic reports based on MRI images."
        )
        self.task_desc = task.raw_task.get(task.format_args.prompt_key)
        self.truth = task.raw_task.get(task.format_args.response_key)
        self.image_key = task.format_args.image_key
        self.images = []
        if self.image_key and self.raw_task.get(self.image_key) is not None:
            self.images = [process_image(img) for img in self.raw_task[self.image_key]]  # type: ignore [index]
        self.messages = self.format_messages()
```
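
For context, `run()` above assumes the reward function returns a dict of named score components: each component is logged into `response.metrics` and the components are summed into the scalar `response.reward`. A minimal hypothetical sketch of that contract (the real `LumbarRewardFn` is my own code and does report-specific checks instead) looks like:

```python
# Hypothetical stand-in illustrating the reward-dict contract assumed by run().
# The real LumbarRewardFn is not shown here and scores reports differently.
class DummyLumbarRewardFn:
    def __call__(self, response: str, truth: str) -> dict:
        format_score = 1.0 if response.strip() else 0.0               # toy check only
        accuracy_score = 1.0 if truth and truth in response else 0.0  # toy check only
        return {"format_score": format_score, "accuracy_score": accuracy_score}
```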