Description
GPUs: H20 96GB * 6
Model: Qwen2.5VL-7B
Installation: git clone
On autodl I start training with `source /etc/network_turbo && ray start --head && trinity run --config chord.yaml`; `source /etc/network_turbo` is autodl's academic-resource acceleration command, used only to make downloading HF resources easier.
The terminal then sits for a long time at the position shown in the log below; it looks like training stops progressing for some reason, and the GPU status is shown in the screenshot below:

As shown in the yaml, I have already set the batch size as small as possible. The GPUs are split as 4 cards for training and 2 cards for inference, but every time only the inference cards actually show any memory usage.
- Is this a symptom of OOM?
- Is it caused by the way my workflow is built? The code is attached below.
- For vlm-chord training on multiple H20s, how should the GPUs be allocated? I know you use 8 * H20; could you give me an example based on that setup?
**yaml**
```yaml
project: "lumbar_chord"
name: "chord-lingshu-7B"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./qwen2.5vl_checkpoints}
algorithm:
  algorithm_type: mix_chord
  repeat_times: 8
  kl_loss_fn_args:
    kl_coef: 0.0
  sample_strategy_args:
    expert_data_ratio: 0.20
  policy_loss_fn_args:  # feel free to change, we encourage you to try out different hyperparameters
    mu_warmup_steps: 200  # 0 for chord-mu and chord-phi
    mu_decay_steps: 400  # 200 for chord-mu and 0 for chord-phi
    mu_peak: 0.5  # 0.9 for chord-mu and 0.1 for chord-phi
    mu_valley: 0.02  # 0.05 for chord-mu and 0.1 for chord-phi
    enable_phi_function: true  # false for chord-mu and true for chord-phi
    clip_range: 0.2
    use_token_level_loss_in_sft: true
    use_dynamic_bsz: true
    ppo_mini_batch_size: 32  # 16 + 16
    ppo_micro_batch_size_per_gpu: 4
    ngpus_trainer: 4
    train_batch_size_expert: 16
    train_batch_size_usual: 16  # 2 batch_size * 8 repeat_times
model:
  model_path: for_chord
  max_response_tokens: 4096
  max_model_len: 11264
cluster:
  node_num: 1
  gpu_per_node: 6  # number of GPUs
buffer:
  total_epochs: 5
  batch_size: 2
  train_batch_size: 32
  explorer_input:
    taskset:  # used to train
      name: rl_dataset
      storage_type: file
      path: muskwff/lumbar_rl
      subset_name: 'default'
      split: 'train'
      format:
        prompt_key: 'problem'
        response_key: 'answer'
        image_key: 'images'
      rollout_args:
        temperature: 1.0
        logprobs: 0
    eval_tasksets: []
    default_workflow_type: 'LumbarWorkFlow'
    default_reward_fn_type: 'LumbarRewardFn'
  trainer_input:
    experience_buffer:
      name: experience_buffer
      storage_type: queue
    auxiliary_buffers:
      sft_dataset:
        total_epochs: 25
        name: sft_dataset
        storage_type: file
        path: muskwff/lumbar_sft
        split: 'train'
        format:
          prompt_key: 'problem'
          response_key: 'answer'
          image_key: 'images'
explorer:
  eval_interval: 10
  runner_per_model: 8
  rollout_model:
    engine_num: 2
    tensor_parallel_size: 1
    gpu_memory_utilization: 0.95
    enable_prefix_caching: false
    enforce_eager: true
    dtype: bfloat16
    seed: 42
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 1200
trainer:
  save_interval: 200
  trainer_config:
    actor_rollout_ref:
      model:
        use_remove_padding: true
      actor:
        use_dynamic_bsz: true
        ppo_max_token_len_per_gpu: 10240
        ulysses_sequence_parallel_size: 4
        optim:
          lr: 1e-6
      ref:
        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}
```
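
For reference, here is a small sanity-check sketch of how I read the GPU and batch split out of this config. This is just my own arithmetic over the fields above, not Trinity-RFT's actual allocation logic:

```python
# Rough sanity check of the GPU/batch split implied by the yaml above.
# My own arithmetic over the config fields, not Trinity-RFT's scheduler code.

gpu_per_node = 6          # cluster.gpu_per_node
engine_num = 2            # explorer.rollout_model.engine_num
tensor_parallel_size = 1  # explorer.rollout_model.tensor_parallel_size
ngpus_trainer = 4         # algorithm.policy_loss_fn_args.ngpus_trainer

rollout_gpus = engine_num * tensor_parallel_size  # 2 cards serve vLLM rollouts
trainer_gpus = gpu_per_node - rollout_gpus        # 4 cards left for training
assert trainer_gpus == ngpus_trainer

batch_size = 2            # buffer.batch_size (tasks per explore step)
repeat_times = 8          # algorithm.repeat_times (rollouts per task)
usual_experiences = batch_size * repeat_times      # 16, matches train_batch_size_usual
expert_experiences = 16                            # train_batch_size_expert (SFT data)
total_per_step = usual_experiences + expert_experiences  # 32, matches buffer.train_batch_size

print(f"rollout GPUs: {rollout_gpus}, trainer GPUs: {trainer_gpus}")
print(f"experiences per train step: {total_per_step}")
```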
**log**
```
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:08 [__init__.py:742] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:08 [__init__.py:1815] Using max model len 11264
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:02 [vllm_model.py:48] Using vLLM v1 engine
(vLLMRolloutModel pid=9083) `torch_dtype` is deprecated! Use `dtype` instead!
(vLLMRolloutModel pid=9120) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [arg_utils.py:1208] Using ray runtime env: {'env_vars': {'TRINITY_LOG_DIR': '/root/autodl-tmp/Trinity-RFT/./qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/log', 'TRINITY_LOG_LEVEL': 'INFO', 'TRINITY_LOG_NODE_IP': '0', 'TRINITY_PLUGIN_DIRS': ''}}
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=5120.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:3400] Cudagraph is disabled under eager mode
(QueueStorage pid=13224) WARNING 10-11 19:10:11 [queue.py:241] Save experiences in /root/autodl-tmp/Trinity-RFT/qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/buffer/experience_buffer.jsonl.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:15 [__init__.py:216] Automatically detected platform cuda.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:742] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:08 [__init__.py:1815] Using max model len 11264
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [arg_utils.py:1208] Using ray runtime env: {'env_vars': {'TRINITY_LOG_DIR': '/root/autodl-tmp/Trinity-RFT/./qwen2.5vl_checkpoints/lumbar_chord/chord-lingshu-7B/log', 'TRINITY_LOG_LEVEL': 'INFO', 'TRINITY_LOG_NODE_IP': '0', 'TRINITY_PLUGIN_DIRS': ''}}
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=5120.
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:09 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:16 [core.py:654] Waiting for init message from front-end.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:16 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='for_chord', speculative_config=None, tokenizer='for_chord', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=11264, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=for_chord, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:19 [worker_base.py:595] Injected <class 'trinity.common.models.vllm_worker.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['init_process_group', 'update_weight']
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(vLLMRolloutModel pid=9120) [W1011 19:10:19.454693254 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(vLLMRolloutModel pid=9120) `torch_dtype` is deprecated! Use `dtype` instead!
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) WARNING 10-11 19:10:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) WARNING 10-11 19:10:21 [profiling.py:280] The sequence length (11264) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:21 [gpu_model_runner.py:2338] Starting to load model for_chord...
(vLLMRolloutModel pid=9083) INFO 10-11 19:10:16 [__init__.py:216] Automatically detected platform cuda.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:17 [core.py:654] Waiting for init message from front-end.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:17 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='for_chord', speculative_config=None, tokenizer='for_chord', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=11264, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=42, served_model_name=for_chord, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [gpu_model_runner.py:2370] Loading model from scratch...
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:22 [cuda.py:362] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 4.29it/s]
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552)
(vLLMRolloutModel pid=9083) [W1011 19:10:19.687664909 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:25 [default_loader.py:268] Loading weights took 2.97 seconds
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:19 [worker_base.py:595] Injected <class 'trinity.common.models.vllm_worker.WorkerExtension'> into <class 'vllm.v1.worker.gpu_worker.Worker'> for extended collective_rpc calls ['init_process_group', 'update_weight']
(vLLMRolloutModel pid=9083) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) WARNING 10-11 19:10:20 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556)
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:25 [gpu_model_runner.py:2392] Model loading took 15.6269 GiB and 3.209289 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:26 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [gpu_worker.py:298] Available KV cache memory: 72.24 GiB
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) WARNING 10-11 19:10:21 [profiling.py:280] The sequence length (11264) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:21 [gpu_model_runner.py:2338] Starting to load model for_chord...
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [gpu_model_runner.py:2370] Loading model from scratch...
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:22 [cuda.py:362] Using Flash Attention backend on V1 engine.
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:25 [default_loader.py:268] Loading weights took 3.08 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [kv_cache_utils.py:864] GPU KV cache size: 1,352,640 tokens
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [kv_cache_utils.py:868] Maximum concurrency for 11,264 tokens per request: 120.09x
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [gpu_worker.py:391] Free memory on device (94.7/95.07 GiB) on startup. Desired GPU memory utilization is (0.95, 90.32 GiB). Actual usage is 15.63 GiB for weight, 2.38 GiB for peak activation, 0.07 GiB for non-torch memory, and 0.0 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=77408563097` to fit into requested memory, or `--kv-cache-memory=82111370752` to fully utilize gpu memory. Current kv cache memory in use is 77565849497 bytes.
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:30 [core.py:218] init engine (profile, create kv cache, warmup model) took 4.94 seconds
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:26 [gpu_model_runner.py:2392] Model loading took 15.6269 GiB and 3.232434 seconds
(vLLMRolloutModel pid=9120) (EngineCore_DP0 pid=13552) INFO 10-11 19:10:31 [__init__.py:3400] Cudagraph is disabled under eager mode
(vLLMRolloutModel pid=9083) (EngineCore_DP0 pid=13556) INFO 10-11 19:10:26 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:31 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 84540
(vLLMRolloutModel pid=9120) INFO 10-11 19:10:31 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
```

**workflow code**
```python
from typing import Optional, List

from trinity.common.workflows import SimpleWorkflow, Task, WORKFLOWS
from trinity.common.models.model import ModelWrapper
from trinity.common.experience import Experience
from verl.utils.dataset.vision_utils import process_image
from trinity.common.rewards.lumbar_reward_fn import LumbarRewardFn


@WORKFLOWS.register_module("LumbarWorkFlow")
class LumbarWorkFlow(SimpleWorkflow):
    def __init__(
        self,
        task: Task,
        model: ModelWrapper,
        auxiliary_models: Optional[List] = None,
        **kwargs,
    ):
        super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
        self.model = model
        self.reward_fn = LumbarRewardFn()  # load lumbar_reward

    def run(self) -> List[Experience]:
        """
        generate response -> calculate reward -> return Experience
        """
        # generate response
        responses = self.model.chat_mm(
            messages=self.messages, images=self.images, **self.rollout_args
        )
        # return experience
        for i, response in enumerate(responses):
            reward_dict = self.reward_fn(  # type: ignore [misc]
                response=response.response_text,  # type: ignore [arg-type]
                truth=self.truth,
            )
            if response.metrics is None:
                response.metrics = {}
            response.metrics.update(reward_dict)
            reward = sum(reward_dict.values())
            response.reward = reward
            response.eid.run = i + self.run_id_base
        return responses

    def resettable(self):
        return True

    def reset(self, task: Task):
        """reset task state"""
        self.task = task
        assert task.raw_task is not None
        self.raw_task = task.raw_task  # keep self.raw_task in sync with the new task
        self.system_prompt = (
            "You are a radiologist specializing in lumbar spine diseases "
            "and are good at writing structured and standardized radiology diagnostic reports based on MRI images."
        )
        self.task_desc = task.raw_task.get(task.format_args.prompt_key)
        self.truth = task.raw_task.get(task.format_args.response_key)
        self.image_key = task.format_args.image_key
        self.images = []
        if self.image_key and self.raw_task.get(self.image_key) is not None:
            self.images = [process_image(img) for img in self.raw_task[self.image_key]]  # type: ignore [index]
        self.messages = self.format_messages()
```
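
For context, `run()` above assumes the reward function returns a dict of named score components: each component is logged into `response.metrics` and the components are summed into the scalar `response.reward`. A minimal hypothetical sketch of that contract (the real `LumbarRewardFn` is my own code and does report-specific checks instead) looks like:

```python
# Hypothetical stand-in illustrating the reward-dict contract assumed by run().
# The real LumbarRewardFn is not shown here and scores reports differently.
class DummyLumbarRewardFn:
    def __call__(self, response: str, truth: str) -> dict:
        format_score = 1.0 if response.strip() else 0.0               # toy check only
        accuracy_score = 1.0 if truth and truth in response else 0.0  # toy check only
        return {"format_score": format_score, "accuracy_score": accuracy_score}
```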