v0.2.1
Highlights
🚀 Performance Boost for DeltaNet
We've achieved a notable performance improvement for (Gated) DeltaNet models. The optimization focused on the fused LayerNormGated layer, particularly for small head dims, and results in a 1.1x training-throughput speedup.
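As background, (Gated) DeltaNet applies a gated normalization to each head's output before the output projection. The snippet below is a minimal, unfused PyTorch sketch of that computation; the shapes, the SiLU (swish) gate, and the function name are illustrative assumptions for clarity, not the library's fused Triton kernel or its actual API.

```python
# Illustrative, unfused reference of a gated layer norm: normalize each
# head's output, then scale it by a SiLU-activated gate. The fused
# LayerNormGated kernel performs these steps in a single Triton kernel;
# shapes and the head dim of 64 here are assumptions, not from the release.
import torch
import torch.nn.functional as F

def gated_layer_norm_ref(x: torch.Tensor, g: torch.Tensor,
                         weight: torch.Tensor, bias: torch.Tensor,
                         eps: float = 1e-5) -> torch.Tensor:
    """x, g: [batch, seq_len, num_heads, head_dim]; weight, bias: [head_dim]."""
    y = F.layer_norm(x, x.shape[-1:], weight, bias, eps)  # per-head normalization
    return y * F.silu(g)                                  # SiLU-gated output

batch, seq_len, num_heads, head_dim = 2, 16, 8, 64        # small head dim
x = torch.randn(batch, seq_len, num_heads, head_dim)
g = torch.randn_like(x)
out = gated_layer_norm_ref(x, g, torch.ones(head_dim), torch.zeros(head_dim))
print(out.shape)  # torch.Size([2, 16, 8, 64])
```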
Below are the benchmarks for 1B-parameter models, tested on 4K-token sequences in varlen mode on a single H100 GPU:
| Model | TPS (K tokens/s) |
|---|---|
| Transformer++ | 53.8 |
| DeltaNet (before) | 48.6 |
| DeltaNet (after) | 54.0 |
obtained by running:
```shell
python -m benchmarks.benchmark_training_throughput \
  --name delta_net \
  --batch_size 1 \
  --seq_len 32768 \
  --context_len 4096 \
  --varlen \
  --steps 512
```

What's Changed
- [Gated DeltaNet] optimize UT transform by @sustcsonglin in #349
- [RWKV] remove duplicate params from autotune key list by @jihaoh98 in #359
- Fix some arg passing by @yibozhong in #358
- [RWKV7] Update RWKV7 to follow official initialization by @zhiyuan1i in #365
- Remove all `NT: constexpr` by @sustcsonglin in #364
- [Misc.] Use `logger.info` instead of `print` in `fla.utils.py` by @zhiyuan1i in #366
- [RWKV] Prevent initialization when loading pretrained weights by @zhiyuan1i in #369
- [Norm] Optimize speed for small headdim by @yzhangcs in #368
- [GroupNorm] Optimized speed for small headdims by @yzhangcs in #371
- [LayerNormGated] Fix arg bugs during autotuning by @yzhangcs in #372
New Contributors
- @jihaoh98 made their first contribution in #359
- @yibozhong made their first contribution in #358
Full Changelog: v0.2.0...v0.2.1