v0.2.1
Highlights
🚀 Performance Boost for DeltaNet
We've achieved a notable performance improvement for (Gated) DeltaNet models. The optimization focused on the fused LayerNormGated layer, particularly for small head dims, and results in a 1.1x training-throughput speedup.
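As background, (Gated) DeltaNet applies a gated normalization to each head's output before the output projection. The snippet below is a minimal, unfused PyTorch sketch of that computation; the shapes, the SiLU (swish) gate, and the function name are illustrative assumptions for clarity, not the library's fused Triton kernel or its actual API.

```python
# Illustrative, unfused reference of a gated layer norm: normalize each
# head's output, then scale it by a SiLU-activated gate. The fused
# LayerNormGated kernel performs these steps in a single Triton kernel;
# shapes and the head dim of 64 here are assumptions, not from the release.
import torch
import torch.nn.functional as F

def gated_layer_norm_ref(x: torch.Tensor, g: torch.Tensor,
                         weight: torch.Tensor, bias: torch.Tensor,
                         eps: float = 1e-5) -> torch.Tensor:
    """x, g: [batch, seq_len, num_heads, head_dim]; weight, bias: [head_dim]."""
    y = F.layer_norm(x, x.shape[-1:], weight, bias, eps)  # per-head normalization
    return y * F.silu(g)                                  # SiLU-gated output

batch, seq_len, num_heads, head_dim = 2, 16, 8, 64        # small head dim
x = torch.randn(batch, seq_len, num_heads, head_dim)
g = torch.randn_like(x)
out = gated_layer_norm_ref(x, g, torch.ones(head_dim), torch.zeros(head_dim))
print(out.shape)  # torch.Size([2, 16, 8, 64])
```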
Below are the benchmarks for 1B-parameter models, tested on 4K-token sequences in varlen mode on a single H100 GPU:
| Model | TPS (K tokens/s) |
|---|---|
| Transformer++ | 53.8 |
| DeltaNet (before) | 48.6 |
| DeltaNet (after) | 54.0 |
obtained by running:
```shell
python -m benchmarks.benchmark_training_throughput \
  --name delta_net \
  --batch_size 1 \
  --seq_len 32768 \
  --context_len 4096 \
  --varlen \
  --steps 512
```

What's Changed
- [Gated DeltaNet] optimize UT transform by @sustcsonglin in #349
- [RWKV] remove duplicate params from autotune key list by @jihaoh98 in #359
- Fix some arg passing by @yibozhong in #358
- [RWKV7] Update RWKV7 to follow official initialization by @zhiyuan1i in #365
- Remove all `NT: constexpr` by @sustcsonglin in #364
- [Misc.] Use `logger.info` instead of `print` in `fla.utils.py` by @zhiyuan1i in #366
- [RWKV] Prevent initialization when loading pretrained weights by @zhiyuan1i in #369
- [Norm] Optimize speed for small headdim by @yzhangcs in #368
- [GroupNorm] Optimized speed for small headdims by @yzhangcs in #371
- [LayerNormGated] Fix arg bugs during autotuning by @yzhangcs in #372
New Contributors
- @jihaoh98 made their first contribution in #359
- @yibozhong made their first contribution in #358
Full Changelog: v0.2.0...v0.2.1