how to use multi-gpu for training LLM #3113
Unanswered
deepankar27
asked this question in
Q&A
Replies: 1 comment
-
in command line you could try accelerate launch --multi_gpu --num_processes 2 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello Team,
I am using ml.g5.48xlarge to finetune GPT-j-6b LLM with custom dataset. I used the below config file to distribute the training but it gives me "out of memory" exception during starting of the script itself. When I looked at closely I found during initial startup of the script it takes only 1 GPU. It's not distributing the load to other available GPUs.
I am completely new to this & I am not able to figure it out where I am wrong. I tried with very small batch size as well or I am missing any very basic thing.
Python package:
accelerate>=0.12.0
deepspeed==0.8.0
transformers[torch]==4.25.1
Exception:
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 22.06 GiB total capacity; 21.38 GiB already allocated; 6.38 MiB free; 21.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Configuration of ml.g5.48xlarge:
8xNVIDIA A10G, total: 192 vCPUs.
Deepspeed config:
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Trainer code:
local_output_dir = 'logging_dir'
epochs = 2
lr = 1e-5
bf16 = True
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
local_output_dir = 'output_dir'
local_rank = True
deepspeed = 'ds_z3_bf16_config.json'
trainer = Trainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=split_dataset["train"],
eval_dataset=split_dataset["test"],
data_collator=data_collator,
)
training_args = TrainingArguments(
output_dir=local_output_dir,
per_device_train_batch_size=per_device_train_batch_size,
per_device_eval_batch_size=per_device_eval_batch_size,
fp16=False,
bf16=bf16,
learning_rate=lr,
num_train_epochs=epochs,
deepspeed=deepspeed,
gradient_checkpointing=True,
logging_dir=f"{local_output_dir}/runs",
logging_strategy="steps",
logging_steps=10,
evaluation_strategy="steps",
eval_steps=10,
save_strategy="steps",
save_steps=200,
save_total_limit=1,
load_best_model_at_end=True,
#report_to="tensorboard",
disable_tqdm=True,
remove_unused_columns=False,
local_rank=local_rank,
)
Thanks..
Beta Was this translation helpful? Give feedback.
All reactions