[FSDP LORA] No device index is set for xpu when running FSDP+LORA #8003


Closed · 1 task done
zejun-chen opened this issue May 9, 2025 · 1 comment
Labels
invalid (This doesn't seem right)

Comments

zejun-chen commented May 9, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Hi, @hiyouga

When running LoRA fine-tuning of a Qwen3 model with FSDP on 8 GPUs, we hit the following issue: no device index is assigned to self.device, so when FSDP uses self.device to initialize the FSDP model, the device index information is missing. Below is the warning message:

home/sdp/miniforge3/envs/zejun_ccl/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:831: UserWarning: FSDP got the argument device_id xpu on rank 1, which does not have an explicit index. FSDP will use the current device 0. If this is incorrect, please explicitly call torch.xpu.set_device() before FSDP initialization or pass in the explicit device index as the device_id argument.
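
The warning itself points to two possible workarounds. A minimal sketch (assuming the launcher exports LOCAL_RANK, as accelerate/torchrun normally do) would bind each rank to its XPU device before FSDP initialization, or pass an explicit index as device_id:

# Workaround sketch, not a confirmed fix: bind each rank to its XPU device
# before the FSDP wrapper is created. Assumes LOCAL_RANK is set by the launcher.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.xpu.set_device(local_rank)  # gives FSDP an explicit device index

# Alternatively, pass the index explicitly when wrapping the model:
# model = FSDP(model, device_id=local_rank, ...)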

Reproduction

Running command:

FSDP_CONFIG_FILE=examples/accelerate/fsdp_config_4c8t.yaml
SRC_FILE=src/train.py
MODEL_CONFIG_FILE=examples/train_lora/qwen3_8b_lora_sft.yaml
LOG_FILE=saves/qwen3-8b/lora/sft/qwen3_lora_sft_fsdp_4c8t.log

echo "FSDP config file: ${FSDP_CONFIG_FILE}"
echo "LLaMA Factory src file: ${SRC_FILE}"
echo "Model config file: ${MODEL_CONFIG_FILE}"
echo "Log file: ${LOG_FILE}"

accelerate launch                      \
    --config_file ${FSDP_CONFIG_FILE}  \
    ${SRC_FILE}                        \
    ${MODEL_CONFIG_FILE} 2>&1 | tee ${LOG_FILE}

examples/accelerate/fsdp_config_4c8t.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
ipex_config:
  ipex: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16  # or fp16
num_machines: 1  # the number of nodes
num_processes: 8  # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Others

No response

zejun-chen added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 9, 2025
zejun-chen (Author) commented

Root caused

hiyouga added the invalid (This doesn't seem right) label and removed the bug and pending labels on May 16, 2025