
Single-node multi-GPU training hangs #240

@sevenandseven


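Hello, I ran the following command to train on two GPUs:
GPUS_PER_NODE=2 ./tools/run_dist_launch.sh 2 ./configs/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage.sh
After printing the output below, it hangs and makes no further progress. How can I fix this? Also, how do I specify which GPU IDs to use for training?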

  • GPUS=2
  • RUN_COMMAND=./configs/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage.sh
  • '[' 2 -lt 8 ']'
  • GPUS_PER_NODE=2
  • MASTER_ADDR=127.0.0.1
  • MASTER_PORT=29500
  • NODE_RANK=0
  • let NNODES=GPUS/GPUS_PER_NODE
  • python ./tools/launch.py --nnodes 1 --node_rank 1 --master_addr 127.0.0.1 --master_port 29511 --nproc_per_node 2 ./configs/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage.sh
  • EXP_DIR=exps/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage
  • PY_ARGS=
  • python -u main.py --output_dir exps/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage --with_box_refine --two_stage
  • EXP_DIR=exps/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage
  • PY_ARGS=
  • python -u main.py --output_dir exps/r50_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage --with_box_refine --two_stage
    | distributed init (rank 2): env://
    | distributed init (rank 3): env://
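
A note on the trace above: it shows NODE_RANK=0 and MASTER_PORT=29500, yet the launch.py line passes --node_rank 1 and --master_port 29511, and the two processes report ranks 2 and 3. The sketch below assumes the launcher follows the common convention rank = nproc_per_node * node_rank + local_rank (an assumption, since the launcher source is not shown here; the function names are hypothetical). Under that assumption, with a world size of 2 the only valid ranks are 0 and 1, so ranks 2 and 3 fall outside the group, which may be why initialization never completes.

```python
def global_ranks(nnodes, node_rank, nproc_per_node):
    """Return (world_size, ranks) for one node.

    Assumes the common convention
    rank = nproc_per_node * node_rank + local_rank;
    the real launcher may differ.
    """
    world_size = nnodes * nproc_per_node
    ranks = [nproc_per_node * node_rank + local_rank
             for local_rank in range(nproc_per_node)]
    return world_size, ranks


def out_of_range(world_size, ranks):
    """Ranks that are not in [0, world_size)."""
    return [r for r in ranks if not 0 <= r < world_size]


# Values taken from the launch.py line in the trace.
world_size, ranks = global_ranks(nnodes=1, node_rank=1, nproc_per_node=2)
print(world_size, ranks, out_of_range(world_size, ranks))
# prints: 2 [2, 3] [2, 3]

# Same setup with node_rank 0, as a single-node run would normally use.
world_size, ranks = global_ranks(nnodes=1, node_rank=0, nproc_per_node=2)
print(world_size, ranks, out_of_range(world_size, ranks))
# prints: 2 [0, 1] []
```

If this reading is right, checking why node_rank is 1 on a single node would be a reasonable first step. As for choosing which cards are used, the standard CUDA environment variable CUDA_VISIBLE_DEVICES (not specific to this repository) restricts which devices a process can see, for example by prefixing the launch command with CUDA_VISIBLE_DEVICES=0,1.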
