ValueError: Processor was not found, please check and update your model file. #8009

Closed
1 task done
mryt66 opened this issue May 10, 2025 · 3 comments
Labels
solved This problem has been already solved

Comments

@mryt66

mryt66 commented May 10, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

(OCR) PS C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory> llamafactory-cli env
[2025-05-11 00:10:57,337] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0511 00:10:59.717000 21920 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.

  • llamafactory version: 0.9.3.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.11.0
  • PyTorch version: 2.5.0+cu124 (GPU)
  • Transformers version: 4.50.0.dev0
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3060
  • GPU number: 1
  • GPU memory: 12.00GB
  • DeepSpeed version: 0.16.5
  • Bitsandbytes version: 0.45.5
  • Git commit: cef3a0b

Reproduction

What should I do?
I tried different models, but it still didn't work.
Here is my dataset_info.json:

{
  "doclaynet_internvl": {
    "file_name": "doclaynet_finetune_data.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "image"
    },
    "tags": {
      "role_tag": "from",    
      "content_tag": "value",
      "user_tag": "user",    
      "assistant_tag": "assistant"
    }
  }
}
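
For reference, each line of doclaynet_finetune_data.jsonl has to match this mapping. Below is a minimal sketch of one record (the prompt text and image path are made up for illustration):

import json

# Hypothetical record matching the dataset_info.json mapping above:
# the messages column is "conversations", the images column is "image",
# the role key is "from", the content key is "value",
# and the roles are "user" / "assistant".
record = {
    "conversations": [
        {"from": "user", "value": "<image>List the layout regions on this page."},
        {"from": "assistant", "value": "title, paragraph, table"},
    ],
    "image": ["data/doclaynet/images/page_0001.png"],
}

with open("doclaynet_finetune_data.jsonl", "a", encoding="utf-8") as fp:
    fp.write(json.dumps(record, ensure_ascii=False) + "\n")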

(OCR) PS C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory> $env:CUDA_VISIBLE_DEVICES="0"; python src/train.py `
>>     --stage sft `
>>     --do_train `
>>     --model_name_or_path OpenGVLab/InternVL3-1B-hf `
>>     --dataset doclaynet_internvl `
>>     --dataset_dir ./data `
>>     --template intern_vl `
>>     --finetuning_type lora `
>>     --lora_target all `
>>     --output_dir ./outputs/InternVL2_5-2B-doclaynet-lora `
>>     --overwrite_cache `
>>     --gradient_accumulation_steps 8 `
>>     --lr_scheduler_type cosine `
>>     --logging_steps 10 `
>>     --save_steps 100 `
>>     --learning_rate 1e-5 `
>>     --num_train_epochs 3.0 `
>>     --plot_loss `
>>     --fp16 `
>>     --overwrite_output_dir `
>>     --report_to tensorboard `
>>     --cutoff_len 2048 `
>>     --max_new_tokens 1536 `
>>     --trust_remote_code
[2025-05-11 00:08:06,935] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0511 00:08:09.340000 13184 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[INFO|2025-05-11 00:08:09] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
tokenizer_config.json: 100%|█████████████████████████████████████████████████| 6.86k/6.86k [00:00<?, ?B/s]
C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\huggingface_hub\file_download.py:144: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
vocab.json: 100%|████████████████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 5.07MB/s]
merges.txt: 100%|████████████████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 4.25MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 10.3MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████| 811/811 [00:00<?, ?B/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████| 877/877 [00:00<?, ?B/s]
chat_template.jinja: 100%|███████████████████████████████████████████████████████| 481/481 [00:00<?, ?B/s]
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,215 >> loading file vocab.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\vocab.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,215 >> loading file merges.txt from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\merges.txt
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,215 >> loading file tokenizer.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\tokenizer.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,215 >> loading file added_tokens.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\added_tokens.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,216 >> loading file special_tokens_map.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\special_tokens_map.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,216 >> loading file tokenizer_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\tokenizer_config.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:14,216 >> loading file chat_template.jinja from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-05-11 00:08:14,564 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
processor_config.json: 100%|███████████████████████████████████████████████████| 72.0/72.0 [00:00<?, ?B/s]
[INFO|processing_utils.py:816] 2025-05-11 00:08:15,330 >> loading configuration file processor_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\processor_config.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,472 >> loading file vocab.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\vocab.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,472 >> loading file merges.txt from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\merges.txt
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,472 >> loading file tokenizer.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\tokenizer.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,473 >> loading file added_tokens.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\added_tokens.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,473 >> loading file special_tokens_map.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\special_tokens_map.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,473 >> loading file tokenizer_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\tokenizer_config.json
[INFO|tokenization_utils_base.py:2060] 2025-05-11 00:08:15,473 >> loading file chat_template.jinja from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-1B-hf\snapshots\014c0583a0d4bedf29fbe2dbff4f865eb998e171\chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-05-11 00:08:15,785 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


--- DEBUG: tokenizer_module contents ---
Type of tokenizer_module: <class 'dict'>
Keys in tokenizer_module: ['tokenizer', 'processor']
Processor object in tokenizer_module: None
!!! Processor object IS NONE !!!
--- END DEBUG ---


[INFO|2025-05-11 00:08:15] llamafactory.data.template:143 >> Add <|im_end|> to stop words.
[INFO|2025-05-11 00:08:15] llamafactory.data.loader:143 >> Loading dataset doclaynet_finetune_data.jsonl...
Converting format of dataset: 100%|████████████████████████████| 694/694 [00:00<00:00, 6088.26 examples/s]
Running tokenizer on dataset:   0%|                                        | 0/694 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\train.py", line 28, in <module>
    main()
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\train.py", line 19, in main
    run_exp()
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\tuner.py", line 110, in run_exp   
    _training_function(config={"args": args, "callbacks": callbacks})
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\sft\workflow.py", line 67, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\loader.py", line 315, in get_dataset
    dataset = _get_preprocessed_dataset(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\loader.py", line 256, in _get_preprocessed_dataset
    dataset = dataset.map(
              ^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper 
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3074, in map    
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3516, in _map_single
    for i, batch in iter_outputs(shard_iterable):
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3466, in iter_outputs
    yield i, apply_function(example, i, offset=offset)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3389, in apply_function
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 99, 
in preprocess_dataset
    input_ids, labels = self._encode_data_example(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 43, 
    messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 595, in process_messages
    self._validate_input(processor, images, videos, audios)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 170, in _validate_input
    raise ValueError("Processor was not found, please check and update your model file.")
ValueError: Processor was not found, please check and update your model file.
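
For context, the guard that raises this lives in mm_plugin._validate_input. A paraphrased sketch of its shape (not the exact LLaMA-Factory source): if a sample carries media but no processor was loaded, preprocessing aborts.

# Paraphrased sketch, not verbatim LLaMA-Factory code:
def validate_input(processor, images, videos, audios):
    # tokenizer_module["processor"] came back as None (see the DEBUG dump
    # above), so any multimodal sample trips this check during dataset.map.
    if (images or videos or audios) and processor is None:
        raise ValueError("Processor was not found, please check and update your model file.")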

Others

No response

mryt66 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 10, 2025
@Kuangdd01
Collaborator

Update your transformers; we recommend building it from the latest source code.
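For example, assuming a pip-based environment: pip install git+https://github.com/huggingface/transformers.git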

Kuangdd01 added the solved (This problem has been already solved) label and removed the bug and pending labels on May 11, 2025
@mryt66
Author

mryt66 commented May 11, 2025

I get the same error even after upgrading transformers and setting $env:DISABLE_VERSION_CHECK="1".
I even tried internvl2_5 (same error).

(OCR) PS C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory> llamafactory-cli env
[WARNING|2025-05-11 12:37:22] llamafactory.extras.misc:154 >> Version checking has been disabled, may lead to unexpected behaviors.
[2025-05-11 12:37:22,556] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0511 12:37:24.895000 11504 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.

  • llamafactory version: 0.9.3.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.11.0
  • PyTorch version: 2.5.0+cu124 (GPU)
  • Transformers version: 4.52.0.dev0
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3060
  • GPU number: 1
  • GPU memory: 12.00GB
  • DeepSpeed version: 0.16.5
  • Bitsandbytes version: 0.45.5
  • Git commit: cef3a0b

command:
(OCR) PS C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory> $env:DISABLE_VERSION_CHECK="1"

$env:CUDA_VISIBLE_DEVICES="0"; python src/train.py `
    --stage sft `
    --do_train `
    --model_name_or_path OpenGVLab/InternVL3-2B `
    --dataset doclaynet_internvl `
    --dataset_dir ./data `
    --template intern_vl `
    --finetuning_type lora `
    --lora_target all `
    --output_dir ./outputs/InternVL3-2B-doclaynet-lora `
    --overwrite_cache `
    --per_device_train_batch_size 1 `
    --lr_scheduler_type cosine `
    --logging_steps 10 `
    --save_steps 100 `
    --learning_rate 1e-5 `
    --num_train_epochs 3.0 `
    --plot_loss `
    --fp16 `
    --overwrite_output_dir `
    --report_to tensorboard `
    --cutoff_len 2048 `
    --max_new_tokens 1536 `
    --trust_remote_code

[WARNING|2025-05-11 12:36:35] llamafactory.extras.misc:154 >> Version checking has been disabled, may lead to unexpected behaviors.
[2025-05-11 12:36:35,839] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
test.c
LINK : fatal error LNK1181: cannot open input file 'aio.lib'
test.c
LINK : fatal error LNK1181: cannot open input file 'cufile.lib'
W0511 12:36:38.250000 20936 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[INFO|2025-05-11 12:36:38] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,112 >> loading file vocab.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\vocab.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,112 >> loading file merges.txt from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\merges.txt
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,112 >> loading file tokenizer.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\tokenizer.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,112 >> loading file added_tokens.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\added_tokens.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,112 >> loading file special_tokens_map.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\special_tokens_map.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,113 >> loading file tokenizer_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\tokenizer_config.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:39,113 >> loading file chat_template.jinja from cache at None
[INFO|tokenization_utils_base.py:2299] 2025-05-11 12:36:39,322 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:380] 2025-05-11 12:36:39,711 >> loading configuration file preprocessor_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\preprocessor_config.json
[INFO|feature_extraction_utils.py:550] 2025-05-11 12:36:39,842 >> loading configuration file preprocessor_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\preprocessor_config.json
[INFO|configuration_utils.py:694] 2025-05-11 12:36:40,100 >> loading configuration file config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\config.json
[INFO|configuration_utils.py:694] 2025-05-11 12:36:40,373 >> loading configuration file config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\config.json
[INFO|configuration_utils.py:766] 2025-05-11 12:36:40,375 >> Model config InternVLChatConfig {
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "OpenGVLab/InternVL3-2B--configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "OpenGVLab/InternVL3-2B--modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "OpenGVLab/InternVL3-2B--modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 1536,
  "image_fold": null,
  "llm_config": {
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "eos_token_id": 151643,
    "hidden_act": "silu",
    "hidden_size": 1536,
    "initializer_range": 0.02,
    "intermediate_size": 8960,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "model_type": "qwen2",
    "moe_config": null,
    "num_attention_heads": 12,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "system_message": null,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "capacity_factor": 1.2,
    "drop_path_rate": 0.1,
    "dropout": 0.0,
    "eval_capacity_factor": 1.4,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "patch_size": 14,
    "qk_normalization": false,
    "qkv_bias": true,
    "shared_expert_intermediate_size": 3072,
    "torch_dtype": "bfloat16",
    "use_bfloat16": true,
    "use_flash_attn": true,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}

[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,641 >> loading file vocab.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\vocab.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,641 >> loading file merges.txt from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\merges.txt
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,641 >> loading file tokenizer.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\tokenizer.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,641 >> loading file added_tokens.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\added_tokens.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,641 >> loading file special_tokens_map.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\special_tokens_map.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,642 >> loading file tokenizer_config.json from cache at C:\Users\kogut\.cache\huggingface\hub\models--OpenGVLab--InternVL3-2B\snapshots\1c09cd1a952cb8a20fe59cd9cf749842b6ceeccc\tokenizer_config.json
[INFO|tokenization_utils_base.py:2023] 2025-05-11 12:36:40,642 >> loading file chat_template.jinja from cache at None
[INFO|tokenization_utils_base.py:2299] 2025-05-11 12:36:40,848 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

--- DEBUG: tokenizer_module contents ---
Type of tokenizer_module: <class 'dict'>
Keys in tokenizer_module: ['tokenizer', 'processor']
Processor object in tokenizer_module: None
!!! Processor object IS NONE !!!
--- END DEBUG ---

[INFO|2025-05-11 12:36:40] llamafactory.data.template:143 >> Add <|im_end|> to stop words.
[INFO|2025-05-11 12:36:40] llamafactory.data.loader:143 >> Loading dataset doclaynet_finetune_data.jsonl...
Converting format of dataset: 100%|████████████████████████████| 694/694 [00:00<00:00, 6308.65 examples/s]
Running tokenizer on dataset: 0%| | 0/694 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\train.py", line 28, in <module>
    main()
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\train.py", line 19, in main
    run_exp()
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\train\sft\workflow.py", line 67, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\loader.py", line 315, in get_dataset
    dataset = _get_preprocessed_dataset(
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\loader.py", line 256, in _get_preprocessed_dataset
    dataset = dataset.map(
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3074, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3516, in _map_single
    for i, batch in iter_outputs(shard_iterable):
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3466, in iter_outputs
    yield i, apply_function(example, i, offset=offset)
  File "C:\Users\kogut\PYTHONIK\OCR\OCR\Lib\site-packages\datasets\arrow_dataset.py", line 3389, in apply_function
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 99, in preprocess_dataset
    input_ids, labels = self._encode_data_example(
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 43, in _encode_data_example
    messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 595, in process_messages
    self._validate_input(processor, images, videos, audios)
  File "C:\Users\kogut\PYTHONIK\OCR\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 170, in _validate_input
    raise ValueError("Processor was not found, please check and update your model file.")
ValueError: Processor was not found, please check and update your model file.

@mryt66
Author

mryt66 commented May 11, 2025

Okay, I also had to add the "-hf" suffix, i.e. use OpenGVLab/InternVL3-2B-hf instead of OpenGVLab/InternVL3-2B.
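
A quick sanity check before training is to ask transformers for the repo's processor directly. A minimal sketch using AutoProcessor (my understanding is that the non-hf repos rely on custom remote code and don't ship a Transformers-native processor, so LLaMA-Factory ends up with processor=None):

from transformers import AutoProcessor

# Probe whether each repo ships a usable multimodal processor.
# Assumption: the "-hf" variant includes processor_config.json and loads
# cleanly, while the original repo may fail or return only a tokenizer.
for repo in ("OpenGVLab/InternVL3-2B", "OpenGVLab/InternVL3-2B-hf"):
    try:
        processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
        print(repo, "->", type(processor).__name__)
    except Exception as exc:
        print(repo, "-> failed:", exc)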
