It would be useful to help people scope the GPU requirements so they can set expectations properly:
- how much VRAM is required for training from scratch
- how much VRAM is required for transfer learning
- some ratio of GPU model / audio volume / training time