🐛 Describe the bug
Hello, I followed the [docs](https://github.com/allenai/olmocr/tree/main/olmocr/train) and ran
python -m olmocr.train.train --config olmocr/train/configs/qwen25_vl_olmocrv3_1epoch.yaml
It trains normally on olmOCR-mix-0225 on a single GPU.
But when I modify the YAML file to:
...
model:
  name: /xxx/Qwen/Qwen2.5-VL-7B-Instruct
  trust_remote_code: true
  torch_dtype: bfloat16
  use_flash_attention: true
  attn_implementation: flash_attention_2
  # LoRA settings (disabled by default)
  use_lora: false
  # lora_rank: 8
  # lora_alpha: 32
  # lora_dropout: 0.1
  # lora_target_modules:
  #   - q_proj
  #   - v_proj
  #   - k_proj
  #   - o_proj
  device_map: auto
...
it shows this error:
INFO:__main__:No existing checkpoints found in output directory
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/xxx/olmocr/olmocr/train/train.py", line 591, in <module>
main()
File "/xxx/olmocr/olmocr/train/train.py", line 446, in main
metrics = evaluate_model(model, eval_dataloaders, device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/xxx/olmocr/olmocr/train/train.py", line 170, in evaluate_model
for batch in dataloader:
File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 789, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/yyy/anaconda3/envs/olmocr/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/xxx/olmocr/olmocr/train/train.py", line 86, in __call__
"input_ids": torch.stack(batch["input_ids"]),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [2804] at entry 0 and [3334] at entry 1
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /xxx/olmocr/wandb/offline-run-20250917_165906-5wowgg47
wandb: Find logs at: wandb/offline-run-20250917_165906-5wowgg47/logs
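For what it's worth, the immediate crash looks like a collation issue: the eval batch contains input_ids of different lengths (2804 vs 3334), and torch.stack requires equal shapes. A collator that pads every sequence to the longest one in the batch, as sketched below, would avoid the shape mismatch. The class name and the feature format are my own guesses, not olmocr code, and I don't know whether padding is the intended fix here or whether device_map: auto is the real cause.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


class PaddingCollator:
    """Hypothetical sketch: pad variable-length fields before batching
    instead of stacking them directly (stacking needs equal lengths)."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        # Assumes each feature is a dict of 1-D tensors; olmocr's real format may differ.
        input_ids = [f["input_ids"] for f in features]
        labels = [f["labels"] for f in features]
        return {
            # Pad every sequence in the batch to the length of the longest one.
            "input_ids": pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id),
            "attention_mask": pad_sequence(
                [torch.ones_like(ids) for ids in input_ids], batch_first=True, padding_value=0
            ),
            # -100 is the label index ignored by the cross-entropy loss.
            "labels": pad_sequence(labels, batch_first=True, padding_value=-100),
        }
```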
So how can I train on multiple GPUs?
Can you give me some advice? Looking forward to your response.
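For context, my understanding (which may be wrong) is that device_map: auto does not give data-parallel training at all; it asks accelerate to shard the model's layers across the visible GPUs and attach dispatch hooks, which would also explain the "You shouldn't move a model that is dispatched using accelerate hooks" warning when the script later moves the model. A rough sketch of what I believe the setting translates to, with the placeholder path from the config above and assuming a transformers version that ships the Qwen2.5-VL class:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Roughly what `device_map: auto` amounts to, as far as I understand it:
# accelerate inspects free GPU memory and places different layers on different
# devices, adding hooks that shuttle activations between them. It is model
# sharding for fitting a large model, not multi-GPU data-parallel training.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/xxx/Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder path from the config above
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```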
@jakep-allenai