This repository was archived by the owner on Oct 9, 2023. It is now read-only.
ASR Task Failing due to CUDA Memory Issue - How to introduce Lightning Fabric support for Lightning Flash Tasks? #1657
Unanswered
greeshmasmenon asked this question in Q&A
Replies: 0 comments
Hi,

I am trying to finetune the Wav2vec2 model ("facebook/wav2vec2-large-960h-lv60-self") with custom data that I have.

GPU: Tesla V100-SXM2-16GB
Number of GPUs: 8 (2 nodes of 4 each)
Shape of the audio dataset (each audio segment is roughly 3 seconds long):
{ "training": [ 133328, 4 ], "validation": [ 33332, 3 ] }
The training arguments are below:
I am getting a CUDA out-of-memory error. I would like to use a distributed training approach: I have already gone down to a batch size of 1 with ACCUMULATE_GRAD_BATCHES = 1, and I don't know how to reduce the data loaded onto the GPU any further. Any advice here would be appreciated.
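For anyone landing here, a minimal sketch of memory-saving options that can be passed through the Lightning Trainer that Flash tasks train with. This is not from the thread: the availability of the "deepspeed_stage_2" strategy assumes pytorch-lightning >= 1.5 with the deepspeed package installed, and the exact Trainer argument names depend on your installed versions.

```python
# Sketch only: memory-saving Trainer options for a Flash speech task.
# Assumptions: lightning-flash with the [audio] extra, pytorch-lightning
# >= 1.5, and deepspeed installed for the sharded strategy.
import flash
from flash.audio import SpeechRecognition

model = SpeechRecognition(backbone="facebook/wav2vec2-large-960h-lv60-self")

trainer = flash.Trainer(
    max_epochs=1,
    gpus=4,
    num_nodes=2,
    precision=16,                  # mixed precision: roughly halves activation memory
    strategy="deepspeed_stage_2",  # shards optimizer state + gradients across GPUs
)

# trainer.finetune(model, datamodule=..., strategy="no_freeze")
```

ZeRO stage 2 sharding helps because with 8 GPUs the fp32 optimizer state of a ~300M-parameter model no longer has to fit on each 16 GB card in full; mixed precision additionally shrinks the activations, which dominate memory for long audio inputs.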
Also, I want to start looking at Lightning Fabric and introduce parallel training procedures to see whether that solves my problem. Given Flash's high-level interfaces, I am not sure where to start. Can someone guide me?
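As a starting point, a hedged sketch of what a Fabric-driven loop could look like, assuming lightning >= 2.0 and transformers are installed. The Hugging Face Wav2Vec2ForCTC here stands in for whatever the Flash task wraps internally, and `train_loader` is assumed to yield padded batches with "input_values" and "labels" keys; both are illustrative, not the thread's actual setup.

```python
# Sketch only: a raw-PyTorch finetuning loop driven by Lightning Fabric.
# Assumptions: lightning >= 2.0, transformers installed, and a DataLoader
# producing dicts with "input_values" and "labels".
import torch
from lightning.fabric import Fabric
from transformers import Wav2Vec2ForCTC


def finetune(train_loader, epochs=1):
    # FSDP shards parameters/gradients/optimizer state across the GPUs;
    # "16-mixed" halves activation memory relative to fp32.
    fabric = Fabric(accelerator="gpu", devices=4, num_nodes=2,
                    strategy="fsdp", precision="16-mixed")
    fabric.launch()

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-960h-lv60-self")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Fabric wraps the model/optimizer for the chosen strategy and moves
    # them to the right device; the dataloader gets a distributed sampler.
    model, optimizer = fabric.setup(model, optimizer)
    train_loader = fabric.setup_dataloaders(train_loader)

    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            out = model(input_values=batch["input_values"],
                        labels=batch["labels"])
            fabric.backward(out.loss)  # replaces loss.backward()
            optimizer.step()
```

The trade-off: Fabric gives direct control over the loop and the sharding strategy, but you lose the Flash task's built-in finetuning callbacks and data pipeline, so collation and CTC label preparation become your responsibility.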
Logs: