💬 Discord
This repository aims to implement SOTA efficient token/channel mixers. Any technology related to non-vanilla Transformers is welcome. If you are interested in this repository, please join our Discord. The roadmap below splits the work into token mixers and channel mixers; a minimal interface sketch follows the list.
- Token Mixers
  - Linear Attention
  - Linear RNN
  - Long Convolution
- Channel Mixers
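
For orientation, this split mirrors the standard decomposition of a Transformer block: a token mixer exchanges information across sequence positions, while a channel mixer works across feature dimensions. The sketch below shows how a block composes the two; the class and argument names are illustrative, not the actual xmixers API.

```python
# Illustrative token/channel mixer decomposition; names here are
# hypothetical, not the actual xmixers classes.
import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    """One pre-norm residual block: token mixer followed by channel mixer."""

    def __init__(self, embed_dim: int, token_mixer: nn.Module, channel_mixer: nn.Module):
        super().__init__()
        self.token_norm = nn.LayerNorm(embed_dim)
        self.token_mixer = token_mixer      # e.g. linear attention, linear RNN, long conv
        self.channel_norm = nn.LayerNorm(embed_dim)
        self.channel_mixer = channel_mixer  # e.g. an MLP or GLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mix across the sequence dimension, then across the feature
        # dimension, each with a residual connection.
        x = x + self.token_mixer(self.token_norm(x))
        x = x + self.channel_mixer(self.channel_norm(x))
        return x
```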
- Pad the embedding dimension.
- Hybrid models (hybrid, transformer, hybrid).
- Update `use_bias` to `use_offset`.
- Mpa
- T6
- Mla
- Hgrn2
  - varlen support (remaining)
- Hgrn2-scalar-decay
  - varlen support (remaining)
- Linear Transformer (see the linear attention sketch after this list)
- Llama
- Tnl
- Deltanet
  - Vector Decay Deltanet
  - Scalar Decay Deltanet
- TTT
  - Inference
- GSA
- Titan
- NSA
  - Inference
- Alibi
- GPT
- Stick-breaking
- Forgetting Transformer
- Fsq kv
- Fsq kv mpa
- Tnl
- Hgrn2
- Hgrn2-scalar-decay
- Linear Transformer
- Lightnet
- Mamba2 (need to remove the extra residual).
- GPT
  - Doreamonzzz/xmixers_gpt_120m_50b
- LLaMA
  - Doreamonzzz/xmixers_llama_120m_50b
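
Many of the entries above (Linear Transformer, Tnl, Hgrn2, Deltanet, Lightnet) are linear-attention variants that differ mainly in how the recurrent state is updated and decayed. Below is a minimal reference sketch of the shared recurrence with an optional scalar decay; it omits feature maps, normalization, and the multi-head layout, and is not the repository's implementation.

```python
# Minimal causal linear attention with an optional scalar decay; a
# reference sketch for intuition, not an optimized kernel.
import torch


def causal_linear_attention(q, k, v, decay: float = 1.0):
    """q, k, v: (batch, seq_len, dim) -> output (batch, seq_len, dim).

    Maintains the running state S_t = decay * S_{t-1} + k_t v_t^T and
    reads out o_t = q_t S_t, so the cost is linear in sequence length.
    """
    b, n, d = q.shape
    state = torch.zeros(b, d, d, dtype=q.dtype, device=q.device)
    outputs = []
    for t in range(n):
        kt = k[:, t].unsqueeze(-1)           # (b, d, 1)
        vt = v[:, t].unsqueeze(-2)           # (b, 1, d)
        state = decay * state + kt @ vt      # rank-1 state update
        outputs.append(q[:, t].unsqueeze(-2) @ state)  # (b, 1, d)
    return torch.cat(outputs, dim=1)
```

Scalar-decay variants replace the constant `decay` with a learned value, vector-decay variants use a per-channel decay, and delta-rule variants such as Deltanet replace the purely additive update with an error-correcting one.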
- Add `use_cache` for all modules (see the cache sketch after this list).
- Add an initial state for linear attention.
- Update attention-mask handling for attention, mpa, tpa, and mla.
- Update weight initialization for every model.
- Remove the bias from LayerNorm, since it raised NaN errors (see the norm sketch after this list).
  - linear attention
    - hgru2
    - hgru3
  - vanilla attention
    - attention
- linear attention
  - Add causal masking.
- Update `_initialize_weights` and `_init_weights`.
  - linear attention
  - vanilla attention
  - long conv
- Update the cache for token mixers (see the cache sketch after this list).
  - linear attention
    - hgru3
    - tnl attention
  - long conv
    - gtu
  - vanilla attention
    - attention
    - flex attention
    - mpa
    - n_attention
- linear attention
  - Add special init.
  - Add causal masking.
- Clear `next_decoder_cache`.
- Add varlen support for softmax attention (see the varlen sketch after this list).
  - LLaMA
  - GPT
- Add data types for classes and functions.
  - long_conv_1d_op
  - Gtu
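
On the cache items above: for softmax attention, `use_cache` means storing the growing key/value tensors, while for linear attention the cache is a fixed-size recurrent state, and the initial-state item amounts to choosing the state for the first token. A hypothetical single-step decoding sketch (the function name and signature are illustrative, not the xmixers API):

```python
# Hypothetical decode-step sketch for a cached linear-attention state.
import torch


def linear_attention_decode_step(q_t, k_t, v_t, state=None, decay: float = 1.0):
    """One decoding step. q_t, k_t, v_t: (batch, dim); state: (batch, dim, dim).

    With use_cache=True the caller keeps `state` between steps; passing
    state=None creates the initial (zero) state for the first token.
    """
    if state is None:
        b, d = q_t.shape
        state = torch.zeros(b, d, d, dtype=q_t.dtype, device=q_t.device)
    # Same rank-1 update as the full-sequence recurrence, one token at a time.
    state = decay * state + k_t.unsqueeze(-1) @ v_t.unsqueeze(-2)
    o_t = (q_t.unsqueeze(-2) @ state).squeeze(-2)  # (batch, dim)
    return o_t, state
```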
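On the LayerNorm item: the fix keeps the learned scale and drops the additive bias that caused the NaN errors. Written out, a bias-free LayerNorm looks like the sketch below; recent PyTorch also accepts `nn.LayerNorm(dim, bias=False)` directly.

```python
# Bias-free LayerNorm: normalize, then apply a learned scale only.
import torch
import torch.nn as nn


class BiasFreeLayerNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # scale only, no shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.weight * (x - mean) / torch.sqrt(var + self.eps)
```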
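On the varlen items: variable-length sequences are packed into a single token dimension and delimited by cumulative sequence lengths, so no compute is spent on padding; this follows the flash-attn varlen convention. A pure-PyTorch sketch that loops per sequence (real kernels fuse this loop); the function name is illustrative:

```python
# Pure-PyTorch sketch of varlen causal attention over packed sequences;
# illustrative only, real implementations use a fused kernel.
import torch
import torch.nn.functional as F


def varlen_causal_attention(q, k, v, cu_seqlens):
    """q, k, v: (total_tokens, num_heads, head_dim).
    cu_seqlens: (num_seqs + 1,) cumulative lengths, e.g. [0, 5, 12] packs
    two sequences of lengths 5 and 7.
    """
    outputs = []
    for i in range(len(cu_seqlens) - 1):
        s, e = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        # Move heads to the batch position so SDPA sees (heads, len, head_dim).
        qi, ki, vi = (t[s:e].transpose(0, 1) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=True)
        outputs.append(oi.transpose(0, 1))  # back to (len, heads, head_dim)
    return torch.cat(outputs, dim=0)
```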
Commit messages use one of the following tags:

- [Feature Add]
- [Bug Fix]
- [Benchmark Add]
- [Document Add]
- [README Add]