Piotr, something that would be nice to have support for at some point is multiple parallel versions of the same audio with different augmentations.
As you probably know, all our current recipes in Icefall depend on CR-CTC, where we have two versions of the same audio with different spec-aug masks; in the extra loss introduced in CR-CTC, the CTC output for one copy acts as a reference for the network's CTC log-probs of the other copy. (In the SpecAug used for CR-CTC, we use num_frame_masks and max_frames_mask_fraction values 2.5 times larger than in the default setup.)
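To make the "one copy acts as a reference for the other" idea concrete, here is a minimal sketch in plain Python. It assumes the extra loss is a symmetric per-frame KL divergence between the two copies' CTC posteriors; the function names, the 0.5 factor and the averaging over frames are choices made for this sketch only, and the actual form and scaling in the Icefall recipes may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_div(ref_log_probs, log_probs):
    """KL(ref || pred) for one frame, both given as log-probabilities."""
    return sum(math.exp(r) * (r - p) for r, p in zip(ref_log_probs, log_probs))


def consistency_loss(log_probs_a, log_probs_b):
    """Symmetric frame-wise consistency term between two views (hypothetical helper).

    Each argument is a list of frames; each frame is a list of log-probs over
    the CTC vocabulary. In real training the reference side of each KL term
    would be detached from the gradient graph; plain Python has no autograd,
    so that is only noted here.
    """
    if len(log_probs_a) != len(log_probs_b):
        raise ValueError("the two views must be time-synchronized")
    total = 0.0
    for frame_a, frame_b in zip(log_probs_a, log_probs_b):
        total += kl_div(frame_b, frame_a)  # copy b is the reference for copy a
        total += kl_div(frame_a, frame_b)  # copy a is the reference for copy b
    # Normalization is an arbitrary choice for this sketch.
    return 0.5 * total / len(log_probs_a)
```

Note that the loss is computed frame by frame, which is why the two copies need to stay aligned in time for the CTC version.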
I was speaking about this with MILA's group (@mravenelli-mila) and one of them asked me whether we did different music-and-noise augmentation in the two copies. I said we don't. But I wonder how hard this would be to implement in Lhotse, and whether it can be done without too-unpleasant code changes?
In our current recipe we move the SpecAug out of Lhotse.
One of those guys from MILA also mentioned that they are working on something where they have several copies of the augmented data, but the copy that produces the "reference output" stays clean, without augmentations. This is just something to bear in mind; I'm not saying we necessarily need to support this, as I don't know how the code would be structured.
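Just to illustrate what I mean by several copies with different noise, optionally with a clean reference copy, here is a rough sketch in plain Python on raw sample lists. Nothing here is Lhotse API: `mix_at_snr`, `make_views`, the `clean_reference` flag and the SNR range are all hypothetical names and values made up for the example.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested SNR in dB; output has len(speech) samples."""
    if not speech or not noise:
        return list(speech)
    # Tile or truncate the noise so the mixture stays time-aligned with the speech.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        return list(speech)
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def make_views(speech, noises, num_views=2, clean_reference=False,
               snr_range=(10.0, 20.0), rng=None):
    """Return num_views copies of speech, each with independently sampled noise.

    If clean_reference is True, the first copy is left unaugmented so it can
    serve as the "reference output" copy.
    """
    rng = rng or random.Random()
    views = []
    for i in range(num_views):
        if clean_reference and i == 0:
            views.append(list(speech))
            continue
        noise = rng.choice(noises)          # independent noise choice per copy
        snr_db = rng.uniform(*snr_range)    # independent SNR per copy
        views.append(mix_at_snr(speech, noise, snr_db))
    return views
```

All copies keep the length of the original, so they stay synchronized in time, which matters for the CTC case.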
In MILA's work they do the same kind of thing as we do in CR-CTC but with the attention decoder. (I'm not sure if this is in addition to a CTC version). In that case there wouldn't be a natural need to keep the two copies of each utterance synchronized time-wise, because there is no concept of time in the attention-decoder outputs.