OOM during finetuning #77

@Shiro-LK

Description

Hi,

Thank you for sharing your repo.

I am trying to finetune a language model with MultiFiT on a custom dataset and then finetune the classifier on top of it for prediction. Unfortunately, I get an OOM after a few steps while training the classifier.
I tried to first train the LM, then close the session to free the GPU memory, and then train the classifier (loading the encoder weights, if I am not wrong in my code), but it does not help: I cannot use the same batch size as for the LM. Is this normal, or am I doing something wrong?
PS: bs = 256
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>()
      3 learn_cls_fwd.load_encoder("encoder_lm_fr_fwd")
      4 learn_cls_fwd.freeze()
----> 5 learn_cls_fwd.fit_one_cycle(3)
      6 learn_cls_fwd.save("multifit_cls_pretrained_fr")

9 frames
/usr/local/lib/python3.6/dist-packages/fastai/text/learner.py in <listcomp>(.0)
    253     def concat(self, arrs:Sequence[Sequence[Tensor]])->List[Tensor]:
    254         "Concatenate the arrs along the batch dimension."
--> 255         return [torch.cat([l[si] for l in arrs], dim=1) for si in range_of(arrs[0])]
    256
    257     def reset(self):

RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 15.90 GiB total capacity; 12.72 GiB already allocated; 599.88 MiB free; 14.61 GiB reserved in total by PyTorch)
```
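
To see how full the allocator already is when the failure happens, I print its counters right before the classifier fit (a minimal diagnostic sketch using standard `torch.cuda` calls; `gpu_mem_report` is just a helper name I made up):

```python
import torch

def gpu_mem_report(tag):
    # Hypothetical helper: print current and peak GPU allocator usage in GiB.
    gib = 1024 ** 3
    print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB")

gpu_mem_report("before classifier fit")  # called right before fit_one_cycle(3)
```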

My piece of code:

```python
# Finetune the pretrained LM on the custom corpus (forward direction)
if pretrained_lm:
  data_lm_fwd = (TextList.from_df(lm_tr.iloc[:10000], path, cols='comment_text', **fa_config)
                  .split_by_rand_pct(0.05, seed=42)   # 5% held out for validation
                  .label_for_lm()
                  .databunch(bs=bs, num_workers=4))   # bs = 256
  data_lm_fwd.save("fr_data_lm_forward")

if pretrained_lm:
  learn_fwd = exp.finetune_lm.get_learner(data_lm_fwd)
  learn_fwd.model.cuda()

  learn_fwd.lr_find()
  learn_fwd.recorder.plot()

# learn_fwd is a preconfigured fastai learner with the pretrained LM loaded
if pretrained_lm:
  learn_fwd.fit_one_cycle(2)   # train the frozen model first
  learn_fwd.unfreeze()
  for i in range(5):
    learn_fwd.fit_one_cycle(2)
    learn_fwd.save_encoder("encoder_lm_fr_fwd")   # overwrite the encoder checkpoint after each cycle
```
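
Between the two phases, instead of restarting the session, I also tried to release the GPU memory in-process (a minimal sketch using standard Python/PyTorch calls; it did not change anything for me):

```python
import gc
import torch

# Drop the LM learner and its databunch, then release PyTorch's cached
# GPU memory before building the classifier.
del learn_fwd, data_lm_fwd
gc.collect()
torch.cuda.empty_cache()
```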

```python
# Finetune the classifier on top of the saved encoder
if pretrained_cls:
  data_cls = (TextList.from_df(tr1, path, cols="comment_text", **fa_config)
      .split_from_df(col="val")           # train/validation split flagged in the dataframe
      .label_from_df(cols="toxic")
      .databunch(bs=64, num_workers=2))   # already a smaller bs than for the LM

if pretrained_cls:
  learn_cls_fwd = exp.classifier.get_learner(data_cls)  # , metrics=[AUROC]
  learn_cls_fwd.load_encoder("encoder_lm_fr_fwd")       # load the finetuned LM encoder
  learn_cls_fwd.freeze()
  learn_cls_fwd.fit_one_cycle(3)                        # OOM happens here
  learn_cls_fwd.save("multifit_cls_pretrained_fr")
```
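
Since I cannot use the same batch size, for now I just rebuild the classifier databunch with a smaller one (a trivial sketch; bs=16 is an arbitrary value I picked, not a recommendation):

```python
# Same classifier pipeline, just with a smaller batch size (hypothetical value).
data_cls = (TextList.from_df(tr1, path, cols="comment_text", **fa_config)
    .split_from_df(col="val")
    .label_from_df(cols="toxic")
    .databunch(bs=16, num_workers=2))   # reduced from 64
```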
