[FSDP LORA] No device index is set for xpu when running FSDP+LORA #8003


Closed · 1 task done
zejun-chen opened this issue May 9, 2025 · 1 comment
Labels
invalid (This doesn't seem right)

Comments

zejun-chen commented May 9, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Hi, @hiyouga

When running LoRA fine-tuning of a Qwen3 model with FSDP on 8 GPUs, we hit the following issue: no device index is assigned to self.device, so when FSDP uses self.device to initialize the FSDP model, the device index information is missing. Below is the warning message:

home/sdp/miniforge3/envs/zejun_ccl/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py:831: UserWarning: FSDP got the argument device_id xpu on rank 1, which does not have an explicit index. FSDP will use the current device 0. If this is incorrect, please explicitly call torch.xpu.set_device() before FSDP initialization or pass in the explicit device index as the device_id argument.
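
The warning itself points to two possible workarounds. A minimal sketch (assuming the launcher exports LOCAL_RANK, as accelerate/torchrun normally do) would bind each rank to its XPU device before FSDP initialization, or pass an explicit index as device_id:

# Workaround sketch, not a confirmed fix: bind each rank to its XPU device
# before the FSDP wrapper is created. Assumes LOCAL_RANK is set by the launcher.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.xpu.set_device(local_rank)  # gives FSDP an explicit device index

# Alternatively, pass the index explicitly when wrapping the model:
# model = FSDP(model, device_id=local_rank, ...)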

Reproduction

Running command:

FSDP_CONFIG_FILE=examples/accelerate/fsdp_config_4c8t.yaml
SRC_FILE=src/train.py
MODEL_CONFIG_FILE=examples/train_lora/qwen3_8b_lora_sft.yaml
LOG_FILE=saves/qwen3-8b/lora/sft/qwen3_lora_sft_fsdp_4c8t.log

echo "FSDP config file: ${FSDP_CONFIG_FILE}"
echo "LLaMA Factory src file: ${SRC_FILE}"
echo "Model config file: ${MODEL_CONFIG_FILE}"
echo "Log file: ${LOG_FILE}"

accelerate launch                      \
    --config_file ${FSDP_CONFIG_FILE}  \
    ${SRC_FILE}                        \
    ${MODEL_CONFIG_FILE} 2>&1 | tee ${LOG_FILE}

examples/accelerate/fsdp_config_4c8t.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
ipex_config:
  ipex: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16  # or fp16
num_machines: 1  # the number of nodes
num_processes: 8  # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Others

No response

zejun-chen added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 9, 2025
zejun-chen (Author) commented

Root caused

hiyouga added the invalid (This doesn't seem right) label and removed the bug and pending labels on May 16, 2025