SD3 ControlNet training example runs out of memory during the validation step

### Describe the bug

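With the default settings from the SD3 ControlNet example and 2 validation images, training fails with a CUDA out-of-memory error during validation on a single A100 80GB. The error is raised while the validation pipeline is being moved to the GPU (`pipeline.to(...)` in `log_validation`), before any validation image is generated.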


```
04/07/2025 21:15:15 - INFO - __main__ - ***** Running training *****    
04/07/2025 21:15:15 - INFO - __main__ -   Num examples = 10000                 
04/07/2025 21:15:15 - INFO - __main__ -   Num batches each epoch = 10000       
04/07/2025 21:15:15 - INFO - __main__ -   Num Epochs = 2                                                                                                      
04/07/2025 21:15:15 - INFO - __main__ -   Instantaneous batch size per device = 1                                                                             
04/07/2025 21:15:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4               
04/07/2025 21:15:15 - INFO - __main__ -   Gradient Accumulation steps = 4      
04/07/2025 21:15:15 - INFO - __main__ -   Total optimization steps = 4000                                                                                     
Steps:   0%|          | 5/4000 [00:21<4:38:36,  4.18s/it, loss=0.00669, lr=1e-5]
04/07/2025 21:15:36 - INFO - __main__ - Running validation...
{'controlnet', 'image_encoder', 'feature_extractor'} was not found in config. Values will be initialized to default values.
Keyword arguments {'safety_checker': None} are not expected by StableDiffusion3ControlNetPipeline and will be ignored.
Loading pipeline components...:   0%|          | 0/9 [00:00<?, ?it/s]
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-3.5-medium.
{'invert_sigmas', 'base_shift', 'base_image_seq_len', 'use_dynamic_shifting', 'shift_terminal', 'time_shift_type', 'use_exponential_sigmas', 'max_shift', 'use_karras_sigmas', 'max_image_seq_len', 'use_beta_sigmas'} was not found in config. Values will be initialized to default values.
Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-3.5-medium.                
Instantiating AutoencoderKL model under default dtype torch.float32.                                                                                          
All model checkpoint weights were used when initializing AutoencoderKL.                                                                                       
                                                                                                                                                              
All the weights of AutoencoderKL were initialized from the model checkpoint at /home/jakubdawidowicz/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-medium/snapshots/b940f670f0eda2d07fbb75229e779da1ad11eb80/vae.

If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKL for predictions without further training.
Loaded vae as AutoencoderKL from `vae` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loaded text_encoder as CLIPTextModelWithProjection from `text_encoder` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.02s/it]
Loaded text_encoder_3 as T5EncoderModel from `text_encoder_3` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loaded tokenizer_3 as T5TokenizerFast from `tokenizer_3` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loading pipeline components...:  78%|███████▊  | 7/9 [00:14<00:07,  3.66s/it]
Instantiating SD3Transformer2DModel model under default dtype torch.float32.
All model checkpoint weights were used when initializing SD3Transformer2DModel.

All the weights of SD3Transformer2DModel were initialized from the model checkpoint at /home/jakubdawidowicz/.cache/huggingface/hub/models--stabilityai--stable-diffusion-3.5-medium/snapshots/b940f670f0eda2d07fbb75229e779da1ad11eb80/transformer.
If your task is similar to the task the model of the checkpoint was trained on, you can already use SD3Transformer2DModel for predictions without further training.
Loaded transformer as SD3Transformer2DModel from `transformer` subfolder of stabilityai/stable-diffusion-3.5-medium.
Loading pipeline components...: 100%|██████████| 9/9 [00:26<00:00,  3.00s/it]
Traceback (most recent call last):
  File "/home/jakubdawidowicz/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1429, in <module>
    main(args)
  File "/home/jakubdawidowicz/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1377, in main
    image_logs = log_validation(
                 ^^^^^^^^^^^^^^^
  File "/home/jakubdawidowicz/diffusers/examples/controlnet/train_controlnet_sd3.py", line 83, in log_validation
    pipeline = pipeline.to(torch.device(accelerator.device))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/diffusers/src/diffusers/pipelines/pipeline_utils.py", line 482, in to
    module.to(device, dtype)
  File "/opt/diffusers/src/diffusers/models/modeling_utils.py", line 1351, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1343, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 930, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/opt/miniconda/envs/control/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1329, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 79.25 GiB of which 4.75 MiB is free. Including non-PyTorch memory, this process has 79.24 GiB memory in use. Of the allocated memory 76.93 GiB is allocated by PyTorch, and 1.81 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
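To make the numbers in the error message easier to compare, here is a minimal sketch that parses the `torch.OutOfMemoryError` text quoted above into MiB values. The helper names (`parse_oom`, `to_mib`) are hypothetical and not part of PyTorch or diffusers; the regular expressions only assume the message wording shown in this report.

```python
import re

# The OOM message from the traceback above (allocator hint and URL omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 54.00 MiB. "
    "GPU 0 has a total capacity of 79.25 GiB of which 4.75 MiB is free. "
    "Including non-PyTorch memory, this process has 79.24 GiB memory in use. "
    "Of the allocated memory 76.93 GiB is allocated by PyTorch, "
    "and 1.81 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical field names mapped to the phrases used in the message above.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Extract the memory figures from an OOM message as a dict of MiB values."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = to_mib(*match.groups())
    return stats


stats = parse_oom(OOM_MESSAGE)
for name, mib in stats.items():
    print(f"{name:>10}: {mib:10.2f} MiB")
print("request exceeds free memory:", stats["requested"] > stats["free"])
```

Read this way, the training process already holds almost the whole card when validation starts, so even a 54 MiB allocation for the validation pipeline fails. The reserved-but-unallocated pool is larger than the request, which is consistent with the fragmentation hint in the message, although it does not prove fragmentation is the cause.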

### Reproduction

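Run the SD3 ControlNet example with the command below. Compared with the example defaults, the dataset size (`--max_train_samples`) and the validation interval (`--validation_steps`) were reduced so the failure shows up quickly.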

```
export MODEL_DIR="stabilityai/stable-diffusion-3.5-medium"
export OUTPUT_DIR="sd3-controlnet-out"

accelerate launch train_controlnet_sd3.py \
    --pretrained_model_name_or_path=$MODEL_DIR \
    --output_dir=$OUTPUT_DIR \
    --train_data_dir="fill50k" \
    --resolution=1024 \
    --learning_rate=1e-5 \
    --max_train_samples=10000 \
    --max_train_steps=4000 \
    --checkpointing_steps=500 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --validation_steps=5 \
    --train_batch_size=1 \
    --gradient_accumulation_steps=4
```
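As a sanity check on the flags above, the sketch below recomputes the schedule that the training log reports (10000 batches per epoch, 2 epochs, total batch size 4). It assumes a single process (one A100) and assumes the script derives these values the usual way for diffusers example scripts (ceiling divisions, validation every `validation_steps` optimizer steps); the function `training_schedule` is hypothetical and not taken from `train_controlnet_sd3.py`.

```python
import math


def training_schedule(num_examples, train_batch_size, gradient_accumulation_steps,
                      max_train_steps, validation_steps, num_processes=1):
    """Recompute the schedule figures printed at the start of training.

    Assumption: one process by default, matching the single-GPU run in this report.
    """
    # Batches the dataloader yields per epoch on each process.
    batches_per_epoch = math.ceil(num_examples / (train_batch_size * num_processes))
    # Optimizer updates per epoch after gradient accumulation.
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    # Epochs needed to reach the requested number of optimizer steps.
    num_epochs = math.ceil(max_train_steps / updates_per_epoch)
    # Effective batch size across processes and accumulation.
    total_batch_size = train_batch_size * num_processes * gradient_accumulation_steps
    return {
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "num_epochs": num_epochs,
        "total_batch_size": total_batch_size,
        # Assumption: validation runs whenever the step count is a multiple
        # of validation_steps, so the first run happens at that step.
        "first_validation_step": validation_steps,
        "validation_runs": max_train_steps // validation_steps,
    }


schedule = training_schedule(
    num_examples=10000,            # --max_train_samples
    train_batch_size=1,            # --train_batch_size
    gradient_accumulation_steps=4, # --gradient_accumulation_steps
    max_train_steps=4000,          # --max_train_steps
    validation_steps=5,            # --validation_steps
)
for key, value in schedule.items():
    print(f"{key}: {value}")
```

With `--validation_steps=5` the first validation is expected at optimizer step 5, which matches the `5/4000` progress shown right before `Running validation...` in the log, so the crash happens on the very first validation pass.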

### Logs

```shell

```

### System Info

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-6.8.0-53-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.9
- PyTorch version (GPU?): 2.6.0+cu124
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.29.3
- Transformers version: 4.50.3
- Accelerate version: 1.5.2
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA A100 80GB PCIe, 81920 MiB

### Who can help?

@sayakpaul

SD3 Controlnet Train Example, run out of memory on validation step #11225

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

SD3 Controlnet Train Example, run out of memory on validation step #11225

Description

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions