Xmixers: A collection of SOTA efficient token/channel mixers

💬 Discord

Introduction

This repository aims to implement SOTA efficient token/channel mixers. Contributions of any techniques that go beyond the vanilla Transformer are welcome. If you are interested in this repository, please join our Discord.

Roadmap

  • Token Mixers
    • Linear Attention
    • Linear RNN
    • Long Convolution
  • Channel Mixers
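The roadmap above centers on linear attention and related mixers. As a rough illustration of why linear attention counts as "efficient", here is a minimal, framework-free sketch (not this repository's implementation; the function name and plain-list interface are invented for illustration). It computes causal attention with a running outer-product state in O(T) time, assuming queries and keys have already passed through a non-negative feature map:

```python
def linear_attention(q, k, v):
    """Causal linear attention over sequences of feature vectors.

    q, k: lists of non-negative feature vectors (post feature map);
    v: list of value vectors. Keeps a running sum of k_s v_s^T instead
    of materializing a T x T attention matrix.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of outer products
    norm = [0.0] * d_k                         # running sum of keys
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            norm[i] += k_t[i]
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        z = sum(q_t[i] * norm[i] for i in range(d_k)) or 1.0
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) / z
                    for j in range(d_v)])
    return out
```

With one-hot queries/keys each position simply retrieves the matching value, which makes the recurrence easy to sanity-check by hand.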

ToDo

  • Pad embed_dim;
    • hybrid, transformer, hybrid;
  • Update use_bias to use_offset;

Finished Model

  • Mpa
  • T6
  • Mla
  • Hgrn2
    • left varlen
  • Hgrn2-scalar-decay
    • left varlen
  • Linear Transformer
  • Llama
  • Tnl
  • Deltanet
  • Vector Decay Deltanet
  • Scalar Decay Deltanet
  • TTT
    • Inference
  • GSA
  • Titan
  • NSA
    • Inference
  • Alibi
  • GPT
  • Stick-breaking
  • Forgetting Transformer
  • Fsq kv
  • Fsq kv mpa

Change to xopes

  • Tnl
  • Hgrn2
  • Hgrn2-scalar-decay
  • Linear Transformer
  • Lightnet
  • Mamba2 (needs the extra residual removed).

Pretrained weights

  • GPT
    • Doreamonzzz/xmixers_gpt_120m_50b
  • LLaMA
    • Doreamonzzz/xmixers_llama_120m_50b

ToDo

  • Add use_cache for all modules.
  • Add an initial state for linear attention.
  • Update attention-mask handling for attention, mpa, tpa, mla.
  • Update weight initialization for every model.
  • Remove the bias from LayerNorm, since it can raise NaN errors.
    • linear attention
      • hgru2
      • hgru3
    • vanilla attention
      • attention
  • Add causal masking.
  • Update _initialize_weights, _init_weights.
    • linear attention
    • vanilla attention
    • long conv
  • Update cache handling for token mixers.
    • linear attention
      • hgru3
      • linear attention
      • tnl attention
    • long conv
      • gtu
    • vanilla attention
      • attention
      • flex attention
      • mpa
      • n_attention
  • Add special init.
  • Clear next_decoder_cache.
  • Add varlen for softmax attn.
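Several items above mention "varlen", i.e. processing variable-length sequences without padding. A minimal sketch of the idea (a hypothetical helper, not this repository's API): sequences are packed back to back and delimited by cumulative sequence lengths, in the style of FlashAttention's varlen interfaces, and each position attends causally only within its own segment.

```python
import math

def varlen_causal_attention(q, k, v, cu_seqlens):
    """Causal softmax attention over packed variable-length sequences.

    q, k, v: vectors for all sequences concatenated back to back.
    cu_seqlens: cumulative boundaries, e.g. [0, 3, 5] for lengths 3 and 2.
    """
    d = len(q[0])
    out = [None] * len(q)
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for t in range(start, end):
            # Attend only to positions start..t of the current sequence.
            scores = [sum(q[t][i] * k[s][i] for i in range(d)) / math.sqrt(d)
                      for s in range(start, t + 1)]
            m = max(scores)                       # numerically stable softmax
            w = [math.exp(x - m) for x in scores]
            z = sum(w)
            out[t] = [sum(w[a] * v[start + a][j] for a in range(len(w))) / z
                      for j in range(d)]
    return out
```

The point of the packed layout is that no compute is spent on padding tokens, and the boundaries guarantee no attention leaks across sequences.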

Model

  • LLaMA.
  • GPT.

Basic

  • Add type annotations for classes and functions.

Ops

  • long_conv_1d_op.

Token Mixers

  • Gtu.

Note

[Feature Add]
[Bug Fix]
[Benchmark Add]
[Document Add]
[README Add]
