
Add transformer example with RoPE and MoE-like mechanisms #3078


Open · wants to merge 23 commits into base: master

Conversation

@Cydral (Contributor) commented May 29, 2025

This PR introduces a new example demonstrating:

  1. Rotary Positional Embeddings (RoPE) implementation (see the sketch after this list)
  2. Experimental Mixture-of-Experts (MoE) layer showing how to:
    • Extend Dlib's capabilities without modifying the core library
    • Achieve better results than basic feed-forward layers
    • Note: the current MoE implementation is simplified (2 experts) but provides a working template for expansion
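For readers unfamiliar with RoPE, here is a minimal, self-contained sketch of the rotation it applies to a single query/key head vector. It is illustrative only; the function name, pairing convention, and layout are simplified assumptions and are not taken from the example code in this PR.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies rotary positional embeddings to one head
// vector x of even dimension, at sequence position pos.  Each adjacent pair
// (x[i], x[i+1]), i even, is rotated by the angle pos * theta^(-i/dim).
void apply_rope(std::vector<float>& x, std::size_t pos, float theta = 10000.0f)
{
    const std::size_t dim = x.size();
    for (std::size_t i = 0; i + 1 < dim; i += 2)
    {
        const float freq  = std::pow(theta, -static_cast<float>(i) / static_cast<float>(dim));
        const float angle = static_cast<float>(pos) * freq;
        const float c = std::cos(angle), s = std::sin(angle);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

Because each rotation depends only on the absolute position and the pair index, the dot product between two rotated vectors depends only on their relative distance, which is the property RoPE exploits inside attention.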

Key features:

  • Complete text processing pipeline (train/generate/verify modes)
  • Reconstruction capability
  • Memory-efficient sliding window training (see the sketch after this list)
  • Custom BPE tokenizer integration
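As a rough illustration of the sliding-window training listed above, training pairs can be cut from the token stream as fixed-length windows shifted by a stride, with the following token as the target. The function and parameter names below are hypothetical and not the PR's actual code.

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: build (window start, next token) training pairs from a
// long token stream.  Only indices into the stream are stored, so memory use
// grows with the number of samples rather than window_size * samples.
std::vector<std::pair<std::size_t, int>> make_sliding_window_samples(
    const std::vector<int>& tokens, std::size_t window_size, std::size_t stride)
{
    std::vector<std::pair<std::size_t, int>> samples;
    for (std::size_t start = 0; start + window_size < tokens.size(); start += stride)
        samples.emplace_back(start, tokens[start + window_size]);
    return samples;
}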

The MoE layer serves as both:

  • A practical performance improvement over standard FFN layers (the soft 2-expert mixing is sketched below)
  • An educational example of extending Dlib's neural network components
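Conceptually, a 2-expert soft combination of this kind boils down to blending two feed-forward outputs with softmax gate weights. The standalone sketch below uses hypothetical names and std::function experts for brevity; it is not the dlib layer defined in the example.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: blend the outputs of two expert feed-forward
// transforms using softmax weights computed from the token's gating logits.
std::vector<float> soft_moe_forward(
    const std::vector<float>& x,
    const std::function<std::vector<float>(const std::vector<float>&)>& expert0,
    const std::function<std::vector<float>(const std::vector<float>&)>& expert1,
    float gate_logit0, float gate_logit1)   // outputs of a learned gating projection of x
{
    // softmax over the two gate logits
    const float m  = std::max(gate_logit0, gate_logit1);
    const float e0 = std::exp(gate_logit0 - m);
    const float e1 = std::exp(gate_logit1 - m);
    const float w0 = e0 / (e0 + e1);
    const float w1 = e1 / (e0 + e1);

    // soft combination: every token uses both experts, weighted by the gate
    const std::vector<float> y0 = expert0(x);
    const std::vector<float> y1 = expert1(x);
    std::vector<float> y(y0.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = w0 * y0[i] + w1 * y1[i];
    return y;
}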

Implementation references:

@pfeatherstone (Contributor)

Have you looked at https://arxiv.org/pdf/2202.08906 and https://arxiv.org/pdf/2308.00951?

@Cydral (Contributor, Author) commented Jun 3, 2025

> Have you looked at https://arxiv.org/pdf/2202.08906 and https://arxiv.org/pdf/2308.00951?

Yes, I'm familiar with these mechanisms, although a faithful implementation would be greatly facilitated by support for dynamic network building, which is not typically the approach or philosophy behind Dlib. The example I provided is actually closer to a Soft MoE mechanism than to a standard MoE.

That said, thanks to the most recent additions to Dlib, we now have quite a few tools available to build more "modern" networks, at least in line with recent publications. For example, for those interested in integrating causal attention in image processing, I've also published a fully functional and performant ViT-like architecture using Dlib (and a pre-computed model to test). I'm still experimenting with a few specific architectural patterns, and I hope to include a ViT example in the coming weeks.

Back to the MoE topic, I have the intuition that we might get close to dynamic networks via a sort of "network-in-a-network" approach—or more precisely, networks within a layer. I'm currently evaluating this possibility, and if results are promising, I’ll be sure to share an example as well.

@Cydral (Contributor, Author) commented Jun 3, 2025

Of course, the shared example is just that: an example. For a more robust MoE implementation, we would likely need to add things like a Gaussian noise layer to improve the distribution of tokens across experts (ideally deactivated or made transparent during inference), implement a top-n ranking mechanism, and so on. But again, embedding such logic directly within a layer would significantly simplify the whole process.
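For reference, the noisy gating plus top-n selection mentioned above is usually done along these lines, in the spirit of the sparsely gated MoE literature. The function name and parameters are hypothetical; the noise is only active during training, and k is assumed to be at most the number of experts.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch of noisy top-k gating: Gaussian noise is added to the
// gate logits during training to spread tokens across experts, then only the
// k best experts keep a non-zero weight, renormalized with a softmax.
std::vector<float> noisy_top_k_gate(std::vector<float> logits, std::size_t k,
                                    bool is_training, std::mt19937& rng)
{
    if (is_training)
    {
        std::normal_distribution<float> noise(0.0f, 1.0f);
        for (auto& l : logits)
            l += noise(rng);
    }

    // indices of the k largest (possibly noisy) logits
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });

    // softmax restricted to the selected experts; all others get weight 0
    std::vector<float> weights(logits.size(), 0.0f);
    const float m = logits[idx[0]];
    float denom = 0.0f;
    for (std::size_t i = 0; i < k; ++i)
        denom += std::exp(logits[idx[i]] - m);
    for (std::size_t i = 0; i < k; ++i)
        weights[idx[i]] = std::exp(logits[idx[i]] - m) / denom;
    return weights;
}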

@pfeatherstone (Contributor)

A couple of years ago I did think about creating a new API that was dynamic, but then gave up, and I now do everything in PyTorch and onnxruntime. I think the template-based network building is no longer fit for purpose. I found it could take up to 20 minutes to compile a YOLO model.

@davisking (Owner) commented Jun 5, 2025

> A couple of years ago I did think about creating a new API that was dynamic, but then gave up, and I now do everything in PyTorch and onnxruntime. I think the template-based network building is no longer fit for purpose. I found it could take up to 20 minutes to compile a YOLO model.

Yeah, I regret not making it all runtime classes instead of the template thing :| Oh well. I also use PyTorch for DNN stuff. Still use dlib for many other things though.

@Cydral (Contributor, Author) commented Jun 7, 2025

Despite these constraints, with the recent modifications it is now possible to build networks in Dlib following the most "modern" architectures, whether for text sequence processing (LM) or image processing (ViT). I remember several people doubting that we could even implement causal attention in Dlib, yet I managed to produce, even with a templated structure, a functional and relatively performant building block, paving the way for bringing in new, up-to-date models. I will also share a ViT example soon (a model is already available as an example). The key is simply allowing ourselves not to put every layer directly in Dlib, which somewhat simplifies future developments.

@pfeatherstone (Contributor)

What would be super awesome is the ability to provide an attention predicate, like the new flex_attention() API in PyTorch. We could add a class template parameter to the attention layer which would be used when computing the attention scores. For example:

struct causal_block_mask
{
    // Allow query position q_idx to attend only to itself and earlier positions.
    bool operator()(size_t /*b*/, size_t /*h*/, size_t q_idx, size_t kv_idx) const
    {
        return q_idx >= kv_idx;
    }
};

Then use causal_block_mask in the attention layer template parameters.
Another example:

template<size_t window_size>
struct sliding_window_block_mask
{
    // Allow attention only within a window of +/- window_size/2 around the query position.
    bool operator()(size_t /*b*/, size_t /*h*/, size_t q_idx, size_t kv_idx) const
    {
        constexpr size_t hlen = window_size / 2;
        const size_t dist = q_idx > kv_idx ? q_idx - kv_idx : kv_idx - q_idx;
        return dist <= hlen;
    }
};

Hopefully this could be used to improve performance as well by not wasting compute.
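To make the idea concrete, here is a minimal sketch of how such a predicate could be consumed when building the raw attention scores. The Scores type with rows(), cols() and operator()(q, k) is a hypothetical assumption, not an existing dlib API.

#include <cstddef>
#include <limits>

// Illustration only: set the attention scores that the predicate disallows
// to -inf before the softmax, so masked positions contribute nothing.
template <typename Scores, typename MaskPredicate>
void apply_block_mask(Scores& scores, std::size_t b, std::size_t h, MaskPredicate mask)
{
    for (std::size_t q = 0; q < scores.rows(); ++q)
        for (std::size_t k = 0; k < scores.cols(); ++k)
            if (!mask(b, h, q, k))
                scores(q, k) = -std::numeric_limits<float>::infinity();
}

// e.g. apply_block_mask(scores, batch, head, causal_block_mask{});
//      apply_block_mask(scores, batch, head, sliding_window_block_mask<128>{});

Note that this dense form only masks scores; the compute savings would come from evaluating the same predicate at block granularity and skipping fully masked blocks, which is what PyTorch's flex_attention does with its block mask.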
