Thank you for sharing the DLRM implementation, which has significantly clarified the M-Falcon methodology mentioned in the paper ❤
Understanding of M-Falcon's Attention Mask
From my understanding, M-Falcon utilizes the attention mask to control the visibility of historical items for multiple targets, ensuring efficient training and inference. For example, in a decoder-only approach with a sequence length of 4, the attention mask would look like:
T, F, F, F
T, T, F, F
T, T, T, F
T, T, T, T
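For reference, here is a minimal sketch (my own illustration in PyTorch, not code from this repository) of how such a causal mask could be built, with True meaning the column position is visible to the row position:

```python
import torch

# Minimal sketch (illustration only): the standard causal mask shown above.
seq_len = 4
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```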
With M-Falcon applied to a pairwise ranking task, using a sequence length of 4 (2 history items followed by 2 target items), the attention mask is as follows:
T, F, F, F
T, T, F, F
T, T, T, F
T, T, F, T
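To make this concrete, here is a hypothetical sketch (my own construction, not the repository's code; num_history and num_targets are illustrative names) of how this pairwise M-Falcon mask could be built:

```python
import torch

# Hypothetical sketch: history positions stay causal; each target attends to
# the full history and to itself, but not to the other target.
num_history, num_targets = 2, 2
seq_len = num_history + num_targets

mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
# Replace the target-vs-target block with an identity block so the targets
# cannot see each other.
mask[num_history:, num_history:] = torch.eye(num_targets).bool()
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True, False,  True]])
```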
Issue with Listwise Ranking Attention Mask
However, in the context of a listwise ranking task, I would expect the attention mask, again with a sequence length of 4 (2 history items followed by 2 target items), to be:
T, F, F, F
T, T, F, F
T, T, T, T
T, T, T, T
This configuration allows all target items to see each other, which is essential for effective listwise ranking.
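Again as a hypothetical sketch (my own construction), the listwise variant differs only in the target-vs-target block, which becomes fully visible:

```python
import torch

# Hypothetical sketch of the listwise mask: history stays causal, while every
# target attends to the full history and to all targets (including itself).
num_history, num_targets = 2, 2
seq_len = num_history + num_targets

mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
mask[num_history:, num_history:] = True  # targets can see each other
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True,  True],
#         [ True,  True,  True,  True]])
```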
Observed Behavior in Current Implementation
In the current DLRM implementation, it appears that the default (causal) attention mask is used instead of the M-Falcon-style mask I would expect for listwise ranking. Under the default mask, target items placed earlier in the sequence (e.g., higher-scored retrieval candidates) cannot attend to those placed later (lower-scored ones), which might inadvertently hurt the performance of the ranking task.
Inquiry
Is there a specific reason why the default attention mask is used for listwise ranking instead of the M-Falcon-designed mask that allows all target items to see each other? If this is unintended behavior, I wanted to bring it to your attention in case it affects the performance of listwise ranking tasks.
Thank you once again for your excellent work and for providing such a valuable resource to the community!