How to train on multiple GPUs? #334

@IamMegatron2025

Description

🐛 Describe the bug

Hello, I followed the [docs](https://github.com/allenai/olmocr/tree/main/olmocr/train) and ran

```
python -m olmocr.train.train --config olmocr/train/configs/qwen25_vl_olmocrv3_1epoch.yaml
```

Training on olmOCR-mix-0225 works normally on a single GPU, but when I modify the YAML file to:

```yaml
...
model:
  name: /xxx/Qwen/Qwen2.5-VL-7B-Instruct
  trust_remote_code: true
  torch_dtype: bfloat16
  use_flash_attention: true
  attn_implementation: flash_attention_2

  # LoRA settings (disabled by default)
  use_lora: false
  # lora_rank: 8
  # lora_alpha: 32
  # lora_dropout: 0.1
  # lora_target_modules:
  #   - q_proj
  #   - v_proj
  #   - k_proj
  #   - o_proj

  device_map: auto
...
```

it fails with this error:

```
INFO:__main__:No existing checkpoints found in output directory
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/xxx/olmocr/olmocr/train/train.py", line 591, in <module>
    main()
  File "/xxx/olmocr/olmocr/train/train.py", line 446, in main
    metrics = evaluate_model(model, eval_dataloaders, device)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/xxx/olmocr/olmocr/train/train.py", line 170, in evaluate_model
    for batch in dataloader:
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 789, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/xxx/olmocr/olmocr/train/train.py", line 86, in __call__
    "input_ids": torch.stack(batch["input_ids"]),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [2804] at entry 0 and [3334] at entry 1
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /xxx/olmocr/wandb/offline-run-20250917_165906-5wowgg47
wandb: Find logs at: wandb/offline-run-20250917_165906-5wowgg47/logs
```

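For context on the error itself: `torch.stack` requires every tensor in the batch to have the same length, so a collator that stacks raw `input_ids` fails as soon as two examples tokenize to different lengths. A minimal sketch of a padding collator that avoids this (this is a hypothetical illustration, not olmocr's actual collator; the pad values are assumptions):

```python
import torch

def collate_with_padding(batch, pad_token_id=0, label_pad=-100):
    """Pad each example to the longest sequence in the batch, then stack.

    Assumes each item is a dict of 1-D tensors: input_ids,
    attention_mask, and labels. pad_token_id and the -100 label
    pad (the value loss functions typically ignore) are assumptions.
    """
    max_len = max(item["input_ids"].size(0) for item in batch)

    def pad(t, value):
        # Right-pad a 1-D tensor to max_len with the given fill value.
        return torch.nn.functional.pad(t, (0, max_len - t.size(0)), value=value)

    return {
        "input_ids": torch.stack([pad(x["input_ids"], pad_token_id) for x in batch]),
        "attention_mask": torch.stack([pad(x["attention_mask"], 0) for x in batch]),
        "labels": torch.stack([pad(x["labels"], label_pad) for x in batch]),
    }
```

With padding in place, the `[2804]` and `[3334]` examples in the error above would both be padded to 3334 before stacking.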
So how do I train on multiple GPUs?
Any advice would be appreciated. Looking forward to your response.
@jakep-allenai
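Note that `device_map: auto` shards a single model copy across GPUs (model parallelism) rather than running data-parallel training, which is likely not what is wanted here. The usual data-parallel approach is to launch one process per GPU with `torchrun`; a hypothetical launch, assuming the training script supports distributed launchers (whether olmocr's `train.py` does is an assumption):

```shell
# One process per GPU (data parallel). device_map: auto would be
# removed from the YAML in this mode. The GPU count of 4 is an example.
torchrun --nproc_per_node=4 -m olmocr.train.train \
    --config olmocr/train/configs/qwen25_vl_olmocrv3_1epoch.yaml
```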

Labels: bug