v0.2.1

@yzhangcs yzhangcs released this 23 Apr 17:08
· 312 commits to main since this release
a670dff

Highlights

🚀 Performance Boost for DeltaNet

(Gated) DeltaNet models now run about 1.1x faster in training. The optimization targets the fused LayerNormGated layer, with most of the gain coming from small head dimensions.

The benchmarks below are for 1B-parameter models, tested on 4k-token sequences in varlen mode on a single H100 GPU:

Model                TPS (K tokens/s)
Transformer++        53.8
DeltaNet (before)    48.6
DeltaNet (after)     54.0
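
As a quick check of the quoted 1.1x figure, the ratios can be recomputed from the table. This is only arithmetic on the numbers above; the dictionary and the `speedup` helper are illustrative names, not part of the library.

```python
# Throughput figures copied from the table above (K tokens/s).
tps = {
    "Transformer++": 53.8,
    "DeltaNet (before)": 48.6,
    "DeltaNet (after)": 54.0,
}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after/before."""
    return after / before


deltanet_speedup = speedup(tps["DeltaNet (before)"], tps["DeltaNet (after)"])
vs_transformer = speedup(tps["Transformer++"], tps["DeltaNet (after)"])

print(f"DeltaNet after/before: {deltanet_speedup:.2f}x")              # 1.11x
print(f"DeltaNet (after) vs Transformer++: {vs_transformer:.3f}x")    # 1.004x
```

In this setup the optimized DeltaNet moves from roughly 10% behind Transformer++ to roughly on par with it.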

These numbers were obtained by running:

python -m benchmarks.benchmark_training_throughput \
  --name delta_net \
  --batch_size 1 \
  --seq_len 32768 \
  --context_len 4096 \
  --varlen \
  --steps 512
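
The flags are consistent with the "4k sequences in varlen mode" description if one assumes that `--varlen` packs sequences of `--context_len` tokens into a single row of `--seq_len` tokens. Under that assumption (the benchmark script itself is not shown here), the cumulative sequence boundaries would look like the sketch below; `packed_boundaries` is a hypothetical helper, not a function from the repository.

```python
from itertools import accumulate


def packed_boundaries(batch_size: int, seq_len: int, context_len: int) -> list[int]:
    """Cumulative start/end offsets of equal-length sequences packed into
    batch_size * seq_len tokens (assumed packing scheme, for illustration)."""
    total = batch_size * seq_len
    if total % context_len != 0:
        raise ValueError("total tokens must be a multiple of context_len")
    num_seqs = total // context_len
    return list(accumulate([context_len] * num_seqs, initial=0))


# Values taken from the command above.
cu_seqlens = packed_boundaries(batch_size=1, seq_len=32768, context_len=4096)

print(len(cu_seqlens) - 1)  # 8 packed sequences of 4096 tokens each
print(cu_seqlens)           # [0, 4096, 8192, ..., 32768]
```

Each training step would then process 32,768 tokens made up of eight independent 4k-token sequences.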

What's Changed

New Contributors

Full Changelog: v0.2.0...v0.2.1