Dreambooth LoRA Flux training last step error #10839

@PluginBOXone

Description

Describe the bug

As soon as the training is done and the script moves on to its final cleanup steps, I get this error:

Steps: 99%|█████████▉| 397/400 [07:45<00:03, 1.16s/it, loss=0.397, lr=1]
Steps: 100%|█████████▉| 398/400 [07:47<00:02, 1.20s/it, loss=0.397, lr=1]
Steps: 100%|█████████▉| 398/400 [07:47<00:02, 1.20s/it, loss=0.539, lr=1]
Steps: 100%|█████████▉| 399/400 [07:48<00:01, 1.19s/it, loss=0.539, lr=1]
Steps: 100%|█████████▉| 399/400 [07:48<00:01, 1.19s/it, loss=0.58, lr=1]
Steps: 100%|██████████| 400/400 [07:49<00:00, 1.18s/it, loss=0.58, lr=1]
Steps: 100%|██████████| 400/400 [07:49<00:00, 1.18s/it, loss=0.288, lr=1] Model weights saved in /workspace/output_model/dd304483-afdc-4398-9c46-c660d0725e70-e1/pytorch_lora_weights.safetensors
2025-02-19T21:38:48.866518894Z Traceback (most recent call last):
2025-02-19T21:38:48.866557263Z File "/workspace/./train_dreambooth_lora_flux.py", line 1935, in <module>
2025-02-19T21:38:48.867054758Z main(args)
2025-02-19T21:38:48.867072609Z File "/workspace/./train_dreambooth_lora_flux.py", line 1887, in main
2025-02-19T21:38:48.867457265Z pipeline = FluxPipeline.from_pretrained(
2025-02-19T21:38:48.867479814Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-19T21:38:48.867487574Z File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
2025-02-19T21:38:48.867554504Z return fn(*args, **kwargs)
2025-02-19T21:38:48.867601603Z ^^^^^^^^^^^^^^^^^^^
2025-02-19T21:38:48.867606703Z File "/usr/local/lib/python3.11/dist-packages/diffusers/pipelines/pipeline_utils.py", line 793, in from_pretrained
2025-02-19T21:38:48.867905410Z config_dict = cls.load_config(cached_folder, dduf_entries=dduf_entries)
2025-02-19T21:38:48.867973610Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-19T21:38:48.867992349Z File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
2025-02-19T21:38:48.868030829Z return fn(*args, **kwargs)
2025-02-19T21:38:48.868039849Z ^^^^^^^^^^^^^^^^^^^
2025-02-19T21:38:48.868053699Z File "/usr/local/lib/python3.11/dist-packages/diffusers/configuration_utils.py", line 381, in load_config
2025-02-19T21:38:48.868183318Z raise EnvironmentError(
2025-02-19T21:38:48.868199778Z OSError: Error no file named model_index.json found in directory /workspace/model/realflux1.
2025-02-19T21:38:49.009733209Z
Steps: 100%|██████████| 400/400 [07:49<00:00, 1.17s/it, loss=0.288, lr=1]
2025-02-19T21:38:50.343330576Z Traceback (most recent call last):
2025-02-19T21:38:50.343379125Z File "/usr/local/bin/accelerate", line 8, in <module>
2025-02-19T21:38:50.343568443Z sys.exit(main())
2025-02-19T21:38:50.343608783Z ^^^^^^
2025-02-19T21:38:50.343691572Z File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
2025-02-19T21:38:50.343770471Z args.func(args)
2025-02-19T21:38:50.343896080Z File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1106, in launch_command
2025-02-19T21:38:50.344200407Z simple_launcher(args)
2025-02-19T21:38:50.344262447Z File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 704, in simple_launcher
2025-02-19T21:38:50.344583144Z raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
2025-02-19T21:38:50.819696870Z ✅ FLUX LoRA training completed!

So basically the error says OSError: Error no file named model_index.json found in directory /workspace/model/realflux1, raised at line 1887 in main, where the script reloads the base model with pipeline = FluxPipeline.from_pretrained(...).

But the path exists and the model is in it. It is the same path that I started the training with, and at that point the files were found and everything worked.
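
To double-check, here is a minimal sketch of how I verify the folder (the path is the one from my training command; from_pretrained on a local pipeline folder looks for model_index.json at the top level, next to the component subfolders):

import os

model_dir = "/workspace/model/realflux1"
# A diffusers pipeline folder saved with save_pretrained() contains
# model_index.json plus subfolders like transformer/, vae/, text_encoder/.
print(sorted(os.listdir(model_dir)))
print(os.path.isfile(os.path.join(model_dir, "model_index.json")))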

It still looks like the training finished, because it saved a 98 MB .safetensors file and a log... but I have the feeling the LoRA is broken, because when I load it the inference output is corrupted.
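
To rule out corrupted weights, here is a minimal sketch that scans the saved file for NaN/Inf values (the path is taken from the "Model weights saved in ..." line of the training log above):

import torch
from safetensors.torch import load_file

state = load_file("/workspace/output_model/dd304483-afdc-4398-9c46-c660d0725e70-e1/pytorch_lora_weights.safetensors")
print(len(state), "tensors")
# Any NaN/Inf here would explain corrupted inference output.
bad = [k for k, v in state.items() if torch.isnan(v).any() or torch.isinf(v).any()]
print("tensors with NaN/Inf:", bad or "none")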

Tried inference without loading the LoRA:

Image

As soon as I load the LoRA (I even tried different prompts that have nothing to do with the LoRA):

Image

Image

Reproduction

Dreambooth training:
Using the newest train_dreambooth_lora_flux.py, launched via accelerate with these parameters:

accelerate launch ./train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path /workspace/model/realflux1 \
  --instance_data_dir /workspace/job_files/dd304483-afdc-4398-9c46-e1/clean_data \
  --output_dir /workspace/output_model/dd304483-afdc-4398-9c46-e1 \
  --instance_prompt "photo of WIXBSAHA black car" \
  --resolution 768 \
  --learning_rate 1.0 \
  --mixed_precision bf16 \
  --lr_warmup_steps 0 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler constant \
  --train_batch_size 1 \
  --max_train_steps 400 \
  --checkpointing_steps 500 \
  --num_train_epochs 10 \
  --checkpoints_total_limit 1 \
  --train_text_encoder \
  --rank 16 \
  --optimizer prodigy \
  --repeats 3 \
  --guidance_scale 1

Inference:

print("Lade FLUX Modell")
    pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16).to("cuda")
    pipe.enable_model_cpu_offload()

    generator = None

    if lora_path:
        print(f"🔄 Lade FLUX LoRA-Modell: {lora_path}")
        pipe.load_lora_weights(lora_path)
        print("✅ LoRA geladen.")
    
    if seed is not None:
        generator = torch.Generator(device="cpu").manual_seed(seed)

    image = pipe(
        prompt=prompt,
        guidance_scale=guidance_scale, #0.
        negative_prompt=negative_prompt,
        height=height,
        true_cfg_scale=true_cfg_scale,
        width=width,
        num_inference_steps=num_inference_steps,
        max_sequence_length=max_sequence_length, #256
        generator=generator
    ).images[0]

image.save("test.png")
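
After load_lora_weights one can also confirm the adapter is active and reduce its weight, to rule out an over-trained LoRA. A sketch, assuming the default adapter name default_0 that diffusers assigns when load_lora_weights is called without adapter_name:

# After pipe.load_lora_weights(lora_path):
print(pipe.get_active_adapters())                        # e.g. ['default_0']
pipe.set_adapters(["default_0"], adapter_weights=[0.5])  # halve the LoRA influence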

I also get a warning during inference:

[info] FLUX Inference
[info] Loading FLUX model
[info] Loading pipeline components...: 29%|██▊ | 2/7 [00:00<00:00, 5.91it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
[info] Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 7.37it/s]
[info] Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00, 4.16it/s]
[info] 🔄 Loading FLUX LoRA model: /workspace/lora_output_model
[info] ✅ LoRA loaded.

Specifically this one: You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers

Logs

System Info

A100 (80 GB VRAM), 120 GB RAM
pytorch:2.4.0-py3.11-cuda12.4.1
CUDA 12.4
accelerate==0.33.0

Who can help?

@sayakpaul
