How to train on multiple GPUs? #334

@IamMegatron2025

Description

🐛 Describe the bug

Hello, I followed the [docs](https://github.com/allenai/olmocr/tree/main/olmocr/train) and ran

```
python -m olmocr.train.train --config olmocr/train/configs/qwen25_vl_olmocrv3_1epoch.yaml
```

Training on olmOCR-mix-0225 works normally on a single GPU, but when I modify the YAML file to:

```yaml
...
model:
  name: /xxx/Qwen/Qwen2.5-VL-7B-Instruct
  trust_remote_code: true
  torch_dtype: bfloat16
  use_flash_attention: true
  attn_implementation: flash_attention_2

  # LoRA settings (disabled by default)
  use_lora: false
  # lora_rank: 8
  # lora_alpha: 32
  # lora_dropout: 0.1
  # lora_target_modules:
  #   - q_proj
  #   - v_proj
  #   - k_proj
  #   - o_proj

  device_map: auto
...
```

it fails with this error:

```
INFO:__main__:No existing checkpoints found in output directory
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/xxx/olmocr/olmocr/train/train.py", line 591, in <module>
    main()
  File "/xxx/olmocr/olmocr/train/train.py", line 446, in main
    metrics = evaluate_model(model, eval_dataloaders, device)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/xxx/olmocr/olmocr/train/train.py", line 170, in evaluate_model
    for batch in dataloader:
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 789, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/xxx/olmocr/olmocr/train/train.py", line 86, in __call__
    "input_ids": torch.stack(batch["input_ids"]),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [2804] at entry 0 and [3334] at entry 1
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /xxx/olmocr/wandb/offline-run-20250917_165906-5wowgg47
wandb: Find logs at: wandb/offline-run-20250917_165906-5wowgg47/logs
```

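For context on the error itself: `torch.stack` requires every tensor in the batch to have the same length, so a collator that stacks raw `input_ids` fails as soon as two examples tokenize to different lengths. A minimal sketch of a padding collator that avoids this (this is a hypothetical illustration, not olmocr's actual collator; the pad values are assumptions):

```python
import torch

def collate_with_padding(batch, pad_token_id=0, label_pad=-100):
    """Pad each example to the longest sequence in the batch, then stack.

    Assumes each item is a dict of 1-D tensors: input_ids,
    attention_mask, and labels. pad_token_id and the -100 label
    pad (the value loss functions typically ignore) are assumptions.
    """
    max_len = max(item["input_ids"].size(0) for item in batch)

    def pad(t, value):
        # Right-pad a 1-D tensor to max_len with the given fill value.
        return torch.nn.functional.pad(t, (0, max_len - t.size(0)), value=value)

    return {
        "input_ids": torch.stack([pad(x["input_ids"], pad_token_id) for x in batch]),
        "attention_mask": torch.stack([pad(x["attention_mask"], 0) for x in batch]),
        "labels": torch.stack([pad(x["labels"], label_pad) for x in batch]),
    }
```

With padding in place, the `[2804]` and `[3334]` examples in the error above would both be padded to 3334 before stacking.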
So how do I train on multiple GPUs?
Any advice would be appreciated. Looking forward to your response.
@jakep-allenai
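Note that `device_map: auto` shards a single model copy across GPUs (model parallelism) rather than running data-parallel training, which is likely not what is wanted here. The usual data-parallel approach is to launch one process per GPU with `torchrun`; a hypothetical launch, assuming the training script supports distributed launchers (whether olmocr's `train.py` does is an assumption):

```shell
# One process per GPU (data parallel). device_map: auto would be
# removed from the YAML in this mode. The GPU count of 4 is an example.
torchrun --nproc_per_node=4 -m olmocr.train.train \
    --config olmocr/train/configs/qwen25_vl_olmocrv3_1epoch.yaml
```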

Labels: bug