FlashDeBERTa is an optimized implementation of DeBERTa that uses flash attention kernels for the model's disentangled attention mechanism. It significantly reduces memory usage and latency, especially for long sequences. The project lets you load and run original DeBERTa checkpoints on tens of thousands of tokens without retraining, while maintaining the original accuracy.
DeBERTa remains one of the top-performing models for the following tasks:
- Named Entity Recognition: It serves as the main backbone for models such as GLiNER, an efficient architecture for zero-shot information extraction.
- Text Classification: DeBERTa is highly effective for supervised classification and serves as the backbone of zero-shot classifiers such as GLiClass.
- Reranking: The model offers competitive performance compared to other reranking models, making it a valuable component in many RAG systems.
Warning
This project is under active development and may contain bugs. Please open an issue if you run into problems or have suggestions for improvements.
First, install the package:
```bash
pip install flashdeberta -U
```
Then import the appropriate model heads for your use case and initialize the model from pretrained checkpoints:
```python
from flashdeberta import FlashDebertaV2Model  # FlashDebertaV2ForSequenceClassification, FlashDebertaV2ForTokenClassification, etc.
from transformers import AutoTokenizer
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = FlashDebertaV2Model.from_pretrained("microsoft/deberta-v3-base").to('cuda')

# Tokenize input text
input_text = "Hello world!"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to('cuda')

# Model inference
outputs = model(input_ids)
```
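The task-specific heads follow the same pattern. Below is a minimal sketch using the sequence-classification head mentioned in the import comment above; the checkpoint and `num_labels=2` are illustrative, and because the base checkpoint has no trained classification head, you would use a fine-tuned checkpoint for real predictions:

```python
import torch
from transformers import AutoTokenizer
from flashdeberta import FlashDebertaV2ForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Note: the base checkpoint has no trained classification head, so the head
# weights are randomly initialized here; swap in a fine-tuned checkpoint in practice.
model = FlashDebertaV2ForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
).to('cuda')

inputs = tokenizer("FlashDeBERTa makes long inputs practical.", return_tensors="pt").to('cuda')
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())
```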
To switch to the eager attention implementation, initialize the model as follows:
```python
model = FlashDebertaV2Model.from_pretrained("microsoft/deberta-v3-base", _attn_implementation='eager').to('cuda')
```
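As a quick sanity check (a sketch, not part of the library beyond the API shown above), you can run the same input through both backends and compare the hidden states; they should agree up to floating-point and kernel-level differences:

```python
import torch
from transformers import AutoTokenizer
from flashdeberta import FlashDebertaV2Model

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
inputs = tokenizer("Same input, two attention backends.", return_tensors="pt").to('cuda')

flash_model = FlashDebertaV2Model.from_pretrained("microsoft/deberta-v3-base").to('cuda').eval()
eager_model = FlashDebertaV2Model.from_pretrained(
    "microsoft/deberta-v3-base", _attn_implementation='eager'
).to('cuda').eval()

with torch.no_grad():
    flash_hidden = flash_model(**inputs).last_hidden_state
    eager_hidden = eager_model(**inputs).last_hidden_state

# Differences should be small (numerical/kernel-level only).
print((flash_hidden - eager_hidden).abs().max())
```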
While the content-to-position and position-to-content biases still require quadratic memory, our flash attention implementation reduces overall memory requirements to nearly linear. This efficiency is particularly impactful for longer sequences: starting from 512 tokens, FlashDeBERTa achieves more than a 50% speed-up, and at 4k tokens it is over 5 times faster than the naive (eager) implementation.
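The exact numbers depend on your GPU, batch size, and model size. A rough way to reproduce the trend is to time both backends over increasing sequence lengths; the script below is a sketch with assumed settings (batch size 1, random token IDs, 10 timed iterations):

```python
import time
import torch
from flashdeberta import FlashDebertaV2Model

def mean_latency(model, seq_len, iters=10):
    # Random token IDs are enough for timing; the values do not affect kernel cost.
    input_ids = torch.randint(0, 1000, (1, seq_len), device='cuda')
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        for _ in range(3):  # warm-up
            model(input_ids, attention_mask=attention_mask)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(input_ids, attention_mask=attention_mask)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

flash_model = FlashDebertaV2Model.from_pretrained("microsoft/deberta-v3-base").to('cuda').eval()
eager_model = FlashDebertaV2Model.from_pretrained(
    "microsoft/deberta-v3-base", _attn_implementation='eager'
).to('cuda').eval()

# Eager attention at long lengths may need substantially more GPU memory.
for seq_len in (512, 1024, 2048, 4096):
    print(f"{seq_len} tokens: flash {mean_latency(flash_model, seq_len):.4f}s, "
          f"eager {mean_latency(eager_model, seq_len):.4f}s")
```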
Planned work:
- Implement backward kernels.
- Train DeBERTa models on 8,192-token sequences using high-quality data.
- Integrate FlashDeBERTa into GLiNER and GLiClass.
- Train multi-modal DeBERTa models.