Feature Request: Integrate HiP Attention for Extended Context Length #11910
MubarakHAlketbi started this conversation in Ideas
Description:
This issue requests the integration of HiP (Hierarchically Pruned) Attention into Ollama to enable significantly extended context lengths for supported models. HiP Attention is a training-free method that reduces the attention cost of Transformer models to sub-quadratic in sequence length, making it possible to handle very long contexts with far fewer computational resources.
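For intuition, here is a minimal, self-contained sketch of block-sparse top-k attention in PyTorch. It is not HiP's actual algorithm (HiP selects keys with a hierarchical coarse-to-fine search inside fused CUDA kernels, and handles causal masking), but it shows why pruning the key set per query block brings the cost below quadratic. All names and parameters here are illustrative.

```python
import torch

def topk_block_sparse_attention(q, k, v, block=64, keep=8):
    """Toy block-sparse attention: each query block attends only to its
    top-`keep` key blocks, scored via mean-pooled block summaries.
    Illustrative only; omits causal masking and HiP's hierarchical refinement."""
    T, d = q.shape
    nb = T // block
    qb = q[: nb * block].reshape(nb, block, d)
    kb = k[: nb * block].reshape(nb, block, d)
    vb = v[: nb * block].reshape(nb, block, d)

    # Coarse scores between block summaries: O(nb^2), not O(T^2).
    q_rep, k_rep = qb.mean(1), kb.mean(1)                # (nb, d)
    coarse = q_rep @ k_rep.T                             # (nb, nb)
    topk = coarse.topk(min(keep, nb), dim=-1).indices    # (nb, keep)

    out = torch.zeros_like(qb)
    scale = d ** -0.5
    for i in range(nb):
        ksel = kb[topk[i]].reshape(-1, d)                # (keep*block, d)
        vsel = vb[topk[i]].reshape(-1, d)
        attn = torch.softmax(qb[i] @ ksel.T * scale, dim=-1)
        out[i] = attn @ vsel
    return out.reshape(nb * block, d)

q = k = v = torch.randn(4096, 64)
print(topk_block_sparse_attention(q, k, v).shape)  # torch.Size([4096, 64])
```

With `block=64` and `keep=8`, each query block touches a fixed 512 keys regardless of total sequence length, which is the general shape of the saving HiP targets.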
Motivation:
Ollama users frequently need to process long documents, codebases, or conversations, and current context-length limits can hinder the effectiveness of LLMs in these scenarios. Integrating HiP Attention would provide the following benefits:
- Significantly longer usable context windows for long documents, codebases, and extended conversations.
- Sub-quadratic attention cost, lowering the compute and memory needed at long context lengths.
- No retraining or fine-tuning of models, since the method is training-free.
Proposed Implementation (Suggestions):
Integrate the HiP Attention Library: The core implementation is available at DeepAuto-AI/hip-attention, which provides Python bindings and CUDA kernels for HiP Attention. Installation instructions (building from source or using Docker) are provided in the repository.
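As a sketch of what the integration point might look like from Python, the following assumes an entry point named `hip_attention` with a FlashAttention-style call; the actual module path, tensor layout, and parameter names must be taken from the repository's README, so treat everything below as an assumption rather than the library's documented API:

```python
import torch

# Hypothetical import and signature -- the real entry point is defined by
# DeepAuto-AI/hip-attention; check its README before depending on this.
from hip import hip_attention

# (batch * heads, sequence length, head dim), fp16 on GPU -- layout assumed.
q = torch.randn(32, 65536, 128, device="cuda", dtype=torch.float16)
k = torch.randn(32, 65536, 128, device="cuda", dtype=torch.float16)
v = torch.randn(32, 65536, 128, device="cuda", dtype=torch.float16)

# mask_k (name assumed) would bound how many keys each query attends to,
# which is where the sub-quadratic cost comes from.
out = hip_attention(q, k, v, mask_k=512)
```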
Model Compatibility: Determine which models within Ollama's supported model set are compatible with HiP Attention. The initial focus could be on models commonly used for long-context tasks.
Configuration Options: Expose configuration options to users, allowing them to:
- Enable or disable HiP Attention per model or per request.
- Select variants such as KV-cache offloading (e.g., the `ainl-hip-offload` version, or future versions with this feature).
Performance Testing: Thoroughly test the integration to confirm the expected performance gains and stability, especially with very long contexts. Compare latency, throughput, and memory use with and without HiP Attention enabled.
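To make that comparison concrete, a benchmark could drive Ollama's existing `/api/generate` endpoint and toggle the new behavior through the request options. The `hip_attention` option below is hypothetical (it does not exist in Ollama today and is shown only to illustrate the proposal); the endpoint and `num_ctx` are real, and the model name can be any model pulled locally:

```python
import time
import requests

def time_generation(options):
    """Time one non-streaming generation against a local Ollama server."""
    t0 = time.time()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",                            # any locally pulled model
            "prompt": "Summarize: " + "lorem ipsum " * 2000,  # long-ish prompt
            "stream": False,
            "options": options,
        },
        timeout=600,
    )
    r.raise_for_status()
    return time.time() - t0

baseline = time_generation({"num_ctx": 32768})
# "hip_attention" is a hypothetical option name for this proposal.
with_hip = time_generation({"num_ctx": 32768, "hip_attention": True})
print(f"baseline: {baseline:.1f}s, with HiP: {with_hip:.1f}s")
```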
Licensing Considerations: HiP Attention is currently under the FSL-1.1-MIT license, which is free for non-commercial use but transitions to MIT after two years. Ensure that Ollama's usage complies with this license. This is a crucial point to address.
Relevant Links:
- HiP Attention implementation: https://github.com/DeepAuto-AI/hip-attention
Conclusion:
Adding support for HiP Attention would be a significant enhancement to Ollama, enabling users to work with much longer contexts more efficiently. This would open up new possibilities for using LLMs in various applications that require processing large amounts of text. I believe this feature would be highly valuable to the Ollama community.