Is there any official source for this integration? I have a query but I'm not sure this is the right forum. @i-am-shreya @ControlNet As in this part of the paper:
> Lip Synchronization (LS) is another line of research that require facial region specific spatio-temporal synchronization. This downstream adaptation further elaborates the adaptation capability of MARLIN for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip [57] with MARLIN, and adjust the temporal window accordingly i.e. from 5 frames to T frames. For evaluation, we use the LRS2 [22] dataset having 45,838 train, 1,082 val, and 1,243 test videos. Following the prior literature [57, 74], we use Lip-Sync Error-Distance (LSE-D ↓), Lip-Sync Error-Confidence (LSE-C ↑) and Frechet Inception Distance (FID ↓) [38] as evaluation matrices.
Did you folks train a Wav2Lip model with a MARLIN encoder, and if yes, which of the following did you do?
- The flattened face sequences are processed by the Marlin encoder's extract_features method to produce the final face feature map.
- Only the final output of the extract_features method is used in the forward pass.
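To make the question concrete, here is a minimal sketch of what I mean by this first variant. It assumes a MARLIN-style module exposing an extract_features method that returns a single pooled face feature, plus standard Wav2Lip-style audio encoder and decoder modules; all names, signatures, and shapes here are my guesses for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn


class MarlinWav2LipA(nn.Module):
    """Option A: only the final MARLIN feature is fused with the audio embedding."""

    def __init__(self, marlin_encoder: nn.Module, audio_encoder: nn.Module,
                 decoder: nn.Module):
        super().__init__()
        self.marlin_encoder = marlin_encoder  # assumed to expose extract_features()
        self.audio_encoder = audio_encoder    # standard Wav2Lip audio encoder
        self.decoder = decoder                # CNN decoder producing the lip region

    def forward(self, face_sequences: torch.Tensor,
                audio_sequences: torch.Tensor) -> torch.Tensor:
        # face_sequences: (B, C, T, H, W) -- T frames instead of Wav2Lip's 5.
        # Assumed to return one pooled face feature of shape (B, D_face).
        face_features = self.marlin_encoder.extract_features(face_sequences)
        audio_embedding = self.audio_encoder(audio_sequences)  # assumed (B, D_audio)
        # Fuse the two embeddings and let the decoder generate the mouth region;
        # no intermediate encoder features / skip connections are used.
        fused = torch.cat([face_features, audio_embedding], dim=1)
        return self.decoder(fused)
```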
OR
- Intermediate feature storage: the extract_features method is modified to store selected intermediate outputs from the transformer blocks, so that the number of stored features matches the number of CNN decoder blocks.
- Integration with the decoder blocks: during the forward pass of the Wav2Lip model, the decoder blocks process the audio embeddings.
- At each decoder block, the corresponding intermediate feature map from face_features is concatenated with the current decoder output.
- The features are accessed in reverse order to match the original processing sequence.
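And a sketch of this second variant, assuming extract_features could be modified to also return intermediate transformer-block outputs (the return_intermediate flag below is something I invented, not part of the released MARLIN API), which are then consumed in reverse order as U-Net-style skip connections; again, purely illustrative.

```python
import torch
import torch.nn as nn


class MarlinWav2LipB(nn.Module):
    """Option B: intermediate MARLIN features feed the decoder as skip connections."""

    def __init__(self, marlin_encoder: nn.Module, audio_encoder: nn.Module,
                 decoder_blocks: nn.ModuleList, output_block: nn.Module):
        super().__init__()
        self.marlin_encoder = marlin_encoder
        self.audio_encoder = audio_encoder
        self.decoder_blocks = decoder_blocks  # one CNN block per stored face feature
        self.output_block = output_block      # final convolution producing the lips

    def forward(self, face_sequences: torch.Tensor,
                audio_sequences: torch.Tensor) -> torch.Tensor:
        audio_embedding = self.audio_encoder(audio_sequences)
        # Hypothetical modification: extract_features also returns intermediate
        # transformer-block outputs, one per decoder block.
        face_features = list(self.marlin_encoder.extract_features(
            face_sequences, return_intermediate=True))
        assert len(face_features) == len(self.decoder_blocks)

        x = audio_embedding
        for block in self.decoder_blocks:
            x = block(x)
            # Features are consumed in reverse order: the deepest encoder output
            # is concatenated with the earliest decoder block's output.
            skip = face_features.pop()
            x = torch.cat([x, skip], dim=1)
        return self.output_block(x)
```

If either of these is close to what was done for the LS experiments in the paper, could you confirm which one, or point to the code that was used?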