In the current implementation, the batch size for quality tasks must be 1 during training. However, training with batch_size_per_gpu=1 seems too slow. Is there any way to work around this restriction?
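
For context, I'm guessing the batch size is fixed at 1 because quality-task samples have variable lengths; if so, would padding plus a mask allow real batching? A minimal PyTorch sketch of what I mean (VarLenDataset and pad_collate are hypothetical stand-ins, not this repo's actual classes):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for the repo's quality-task dataset: each sample
# is a variable-length (T, 16) feature tensor with a scalar quality label.
class VarLenDataset(Dataset):
    def __init__(self, n=32):
        self.xs = [torch.randn(torch.randint(5, 20, ()).item(), 16)
                   for _ in range(n)]
        self.ys = torch.randn(n)

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

def pad_collate(batch):
    xs, ys = zip(*batch)
    lengths = torch.tensor([x.size(0) for x in xs])
    padded = pad_sequence(xs, batch_first=True)        # (B, T_max, 16)
    # Boolean mask marking real (non-padded) timesteps, so the model
    # can ignore padding when pooling or attending.
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return padded, mask, torch.stack(list(ys))

loader = DataLoader(VarLenDataset(), batch_size=8, collate_fn=pad_collate)
padded, mask, ys = next(iter(loader))
print(padded.shape, mask.shape, ys.shape)
```

If padding isn't viable for some reason, would gradient accumulation at least recover the optimization effect of a larger batch, even though each forward pass would still run one sample at a time?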