Hi,
We found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original (unmasked) video and text, as in retrieval fine-tuning?
Line 258 in 0a7c07f
sim_matrix_text_visual = self.get_similarity_logits(sequence_output_alm, visual_output_alm,
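To make the question concrete, here is a minimal sketch of the two options we are comparing. It is not the repository's code: `joint_loss`, `cosine_sim_matrix`, `row_cross_entropy`, and `apply_mask` are hypothetical names, and we assume the joint loss is a symmetric cross-entropy over a text-video similarity matrix with matched pairs on the diagonal, which may not match the actual implementation.

```python
import math


def cosine_sim_matrix(text_emb, video_emb):
    """Return sim[i][j], the cosine similarity between text i and video j."""
    def normalize(vec):
        # Fall back to 1.0 so an all-zero vector does not divide by zero.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    texts = [normalize(v) for v in text_emb]
    videos = [normalize(v) for v in video_emb]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def row_cross_entropy(sim):
    """Mean cross-entropy over rows, treating the diagonal as the positive pair."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(sim)


def joint_loss(text_emb, video_emb):
    """Assumed form of the loss: text-to-video plus video-to-text cross-entropy."""
    sim = cosine_sim_matrix(text_emb, video_emb)
    sim_t = [list(col) for col in zip(*sim)]
    return row_cross_entropy(sim) + row_cross_entropy(sim_t)


def apply_mask(emb, masked_dims):
    """Hypothetical stand-in for input masking: zero out the given feature dims."""
    return [[0.0 if d in masked_dims else x for d, x in enumerate(v)] for v in emb]


# Toy embeddings, invented for illustration only.
text = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
video = [[0.9, 0.1, 0.3], [0.1, 0.8, 0.0]]

# Option A: what pretraining appears to do (loss on masked inputs).
loss_masked = joint_loss(apply_mask(text, {0}), apply_mask(video, {0}))
# Option B: what retrieval fine-tuning does (loss on the original inputs).
loss_original = joint_loss(text, video)
```

Our question is why pretraining uses option A rather than option B for the joint loss.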