add attention_sink.py #6579
Conversation
This PR adds `KVCacheWithAttentionSink`, which is required for `AttentionSink`. It keeps the first `sink_size` tokens as attention sinks and maintains a sliding window of `window_size` for new tokens. Note: I am implementing and verifying `AttentionSink` in eager mode first, so the current implementation may still have some minor errors or performance issues. For example, it does not support the case where dynamic shape is disabled. I will leave these problems to resolve when we are ready to deploy `AttentionSink` to edge. Differential Revision: [D65235798](https://our.internmc.facebook.com/intern/diff/D65235798/) [ghstack-poisoned]
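For readers unfamiliar with the eviction policy, here is a minimal eager-mode sketch of what "keep the first `sink_size` tokens plus a sliding window of the most recent `window_size` tokens" means. The class name, tensor layout, and `update` signature are illustrative assumptions, not the actual `KVCacheWithAttentionSink` API from this PR; in particular, the sketch omits the positional re-encoding (e.g., RoPE shifting) that a real attention-sink cache must handle when tokens are evicted.

```python
# Hypothetical sketch of the attention-sink eviction policy, not the PR's API.
import torch


class SlidingWindowKVCacheWithSink:
    """Keeps the first `sink_size` tokens forever and a sliding window of
    the most recent `window_size` tokens for everything after them."""

    def __init__(self, sink_size: int, window_size: int):
        self.sink_size = sink_size
        self.window_size = window_size
        self.k_cache = None  # assumed layout: (batch, n_heads, seq_len, head_dim)
        self.v_cache = None

    def update(self, k: torch.Tensor, v: torch.Tensor):
        # Append the new tokens along the sequence dimension.
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=2)
            self.v_cache = torch.cat([self.v_cache, v], dim=2)

        # Once the cache overflows sink_size + window_size, evict the oldest
        # non-sink tokens: keep the sinks, then the most recent window.
        max_len = self.sink_size + self.window_size
        cur_len = self.k_cache.size(2)
        if cur_len > max_len:
            sink_k = self.k_cache[:, :, : self.sink_size]
            sink_v = self.v_cache[:, :, : self.sink_size]
            recent_k = self.k_cache[:, :, cur_len - self.window_size :]
            recent_v = self.v_cache[:, :, cur_len - self.window_size :]
            self.k_cache = torch.cat([sink_k, recent_k], dim=2)
            self.v_cache = torch.cat([sink_v, recent_v], dim=2)
        return self.k_cache, self.v_cache


# Usage: after many single-token decode steps, the cache length stays bounded.
cache = SlidingWindowKVCacheWithSink(sink_size=4, window_size=8)
for _ in range(20):
    k = torch.randn(1, 2, 1, 16)  # one new token per step
    v = torch.randn(1, 2, 1, 16)
    k_all, v_all = cache.update(k, v)
assert k_all.size(2) == 4 + 8  # sinks + sliding window
```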
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6579
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9105c8f with merge base c726a9b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D65235798
Merged commit 8d30fc1 into gh/helunwencser/70/base:

add KVCacheWithAttentionSink

Pull Request resolved: #6579
ghstack-source-id: 255715047
@exported-using-ghexport
Differential Revision: [D65235798](https://our.internmc.facebook.com/intern/diff/D65235798/)
Co-authored-by: Lunwen He <lwhecser@gmail.com>