
Add transformer example with RoPE and MoE-like mechanisms #3078


Open · wants to merge 23 commits into base: master

Conversation

@Cydral (Contributor) commented May 29, 2025

This PR introduces a new example demonstrating:

  1. Rotary Positional Embeddings (RoPE) implementation (see the sketch after this list)
  2. Experimental Mixture-of-Experts (MoE) layer showing how to:
    • Extend Dlib's capabilities without modifying the core library
    • Achieve better results than basic feed-forward layers
    • Note: the current MoE implementation is simplified (2 experts) but provides a working template for expansion
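For readers unfamiliar with RoPE, here is a minimal, self-contained sketch of the rotation it applies to a single query/key head vector. It is illustrative only; the function name, pairing convention, and layout are simplified assumptions and are not taken from the example code in this PR.

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies rotary positional embeddings to one head
// vector x of even dimension, at sequence position pos.  Each adjacent pair
// (x[i], x[i+1]), i even, is rotated by the angle pos * theta^(-i/dim).
void apply_rope(std::vector<float>& x, std::size_t pos, float theta = 10000.0f)
{
    const std::size_t dim = x.size();
    for (std::size_t i = 0; i + 1 < dim; i += 2)
    {
        const float freq  = std::pow(theta, -static_cast<float>(i) / static_cast<float>(dim));
        const float angle = static_cast<float>(pos) * freq;
        const float c = std::cos(angle), s = std::sin(angle);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

Because each rotation depends only on the absolute position and the pair index, the dot product between two rotated vectors depends only on their relative distance, which is the property RoPE exploits inside attention.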

Key features:

  • Complete text processing pipeline (train/generate/verify modes)
  • Reconstruction capability
  • Memory-efficient sliding window training (see the sketch after this list)
  • Custom BPE tokenizer integration
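As a rough illustration of the sliding-window training listed above, training pairs can be cut from the token stream as fixed-length windows shifted by a stride, with the following token as the target. The function and parameter names below are hypothetical and not the PR's actual code.

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: build (window start, next token) training pairs from a
// long token stream.  Only indices into the stream are stored, so memory use
// grows with the number of samples rather than window_size * samples.
std::vector<std::pair<std::size_t, int>> make_sliding_window_samples(
    const std::vector<int>& tokens, std::size_t window_size, std::size_t stride)
{
    std::vector<std::pair<std::size_t, int>> samples;
    for (std::size_t start = 0; start + window_size < tokens.size(); start += stride)
        samples.emplace_back(start, tokens[start + window_size]);
    return samples;
}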

The MoE layer serves as both:

  • A practical performance improvement over standard FFN layers (the soft 2-expert mixing is sketched below)
  • An educational example of extending Dlib's neural network components
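Conceptually, a 2-expert soft combination of this kind boils down to blending two feed-forward outputs with softmax gate weights. The standalone sketch below uses hypothetical names and std::function experts for brevity; it is not the dlib layer defined in the example.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: blend the outputs of two expert feed-forward
// transforms using softmax weights computed from the token's gating logits.
std::vector<float> soft_moe_forward(
    const std::vector<float>& x,
    const std::function<std::vector<float>(const std::vector<float>&)>& expert0,
    const std::function<std::vector<float>(const std::vector<float>&)>& expert1,
    float gate_logit0, float gate_logit1)   // outputs of a learned gating projection of x
{
    // softmax over the two gate logits
    const float m  = std::max(gate_logit0, gate_logit1);
    const float e0 = std::exp(gate_logit0 - m);
    const float e1 = std::exp(gate_logit1 - m);
    const float w0 = e0 / (e0 + e1);
    const float w1 = e1 / (e0 + e1);

    // soft combination: every token uses both experts, weighted by the gate
    const std::vector<float> y0 = expert0(x);
    const std::vector<float> y1 = expert1(x);
    std::vector<float> y(y0.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = w0 * y0[i] + w1 * y1[i];
    return y;
}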

Implementation references:

@pfeatherstone (Contributor)

Have you looked at https://arxiv.org/pdf/2202.08906 and https://arxiv.org/pdf/2308.00951?

@Cydral (Contributor, Author) commented Jun 3, 2025

> Have you looked at https://arxiv.org/pdf/2202.08906 and https://arxiv.org/pdf/2308.00951?

Yes, I'm familiar with these mechanisms, although a faithful implementation would be greatly facilitated by support for dynamic network building, which is not typically the approach or philosophy behind Dlib. The example I provided is actually closer to a Soft MoE mechanism than to a standard MoE.

That said, thanks to the most recent additions to Dlib, we now have quite a few tools available to build more "modern" networks, at least in line with recent publications. For example, for those interested in integrating causal attention in image processing, I've also published a fully functional and performant ViT-like architecture using Dlib (and a pre-computed model to test). I'm still experimenting with a few specific architectural patterns, and I hope to include a ViT example in the coming weeks.

Back to the MoE topic, I have the intuition that we might get close to dynamic networks via a sort of "network-in-a-network" approach—or more precisely, networks within a layer. I'm currently evaluating this possibility, and if results are promising, I’ll be sure to share an example as well.

@Cydral (Contributor, Author) commented Jun 3, 2025

Of course, the shared example is just that: an example. For a more robust MoE implementation, we would likely need to add things like a Gaussian noise layer to improve the distribution of tokens across experts (ideally deactivated or made transparent during inference), implement a top-n ranking mechanism, and so on. But again, embedding such logic directly within a layer would significantly simplify the whole process.
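For reference, the noisy gating plus top-n selection mentioned above is usually done along these lines, in the spirit of the sparsely gated MoE literature. The function name and parameters are hypothetical; the noise is only active during training, and k is assumed to be at most the number of experts.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch of noisy top-k gating: Gaussian noise is added to the
// gate logits during training to spread tokens across experts, then only the
// k best experts keep a non-zero weight, renormalized with a softmax.
std::vector<float> noisy_top_k_gate(std::vector<float> logits, std::size_t k,
                                    bool is_training, std::mt19937& rng)
{
    if (is_training)
    {
        std::normal_distribution<float> noise(0.0f, 1.0f);
        for (auto& l : logits)
            l += noise(rng);
    }

    // indices of the k largest (possibly noisy) logits
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });

    // softmax restricted to the selected experts; all others get weight 0
    std::vector<float> weights(logits.size(), 0.0f);
    const float m = logits[idx[0]];
    float denom = 0.0f;
    for (std::size_t i = 0; i < k; ++i)
        denom += std::exp(logits[idx[i]] - m);
    for (std::size_t i = 0; i < k; ++i)
        weights[idx[i]] = std::exp(logits[idx[i]] - m) / denom;
    return weights;
}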

@pfeatherstone (Contributor)

A couple of years ago I did think about creating a new API that was dynamic, but then gave up, and I now do everything in PyTorch and onnxruntime. I think the template-based network building is no longer fit for purpose. I found it could take up to 20 minutes to compile a YOLO model.

@davisking (Owner) commented Jun 5, 2025

> A couple of years ago I did think about creating a new API that was dynamic, but then gave up, and I now do everything in PyTorch and onnxruntime. I think the template-based network building is no longer fit for purpose. I found it could take up to 20 minutes to compile a YOLO model.

Yeah, I regret not making it all runtime classes instead of the template thing :| Oh well. I also use PyTorch for DNN stuff. Still use dlib for many other things though.

@Cydral (Contributor, Author) commented Jun 7, 2025

Despite these constraints, with the recent modifications it is now possible to build networks in Dlib following the most "modern" architectures, whether for text sequence processing (LM) or image processing (ViT). I remember several people doubting that we could even implement causal attention in Dlib, yet I managed to produce, even with a templated structure, a functional and relatively performant building block, paving the way for bringing in new, up-to-date models. I will also share a ViT example soon (a model is already available as an example). The key is simply allowing ourselves not to put every layer directly in Dlib, which somewhat simplifies future developments.

@pfeatherstone (Contributor)

What would be super awesome is the ability to provide an attention predicate, like the new flex_attention() API in PyTorch. We could add a class template parameter to the attention layer which would be used when computing the attention scores. For example:

struct causal_block_mask
{
    // Allow query position q_idx to attend only to itself and earlier positions.
    bool operator()(size_t /*b*/, size_t /*h*/, size_t q_idx, size_t kv_idx) const
    {
        return q_idx >= kv_idx;
    }
};

Then use causal_block_mask in the attention layer template parameters.
Another example:

template<size_t window_size>
struct sliding_window_block_mask
{
    // Allow attention only within a window of +/- window_size/2 around the query position.
    bool operator()(size_t /*b*/, size_t /*h*/, size_t q_idx, size_t kv_idx) const
    {
        constexpr size_t hlen = window_size / 2;
        const size_t dist = q_idx > kv_idx ? q_idx - kv_idx : kv_idx - q_idx;
        return dist <= hlen;
    }
};

Hopefully this could be used to improve performance as well by not wasting compute.
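To make the idea concrete, here is a minimal sketch of how such a predicate could be consumed when building the raw attention scores. The Scores type with rows(), cols() and operator()(q, k) is a hypothetical assumption, not an existing dlib API.

#include <cstddef>
#include <limits>

// Illustration only: set the attention scores that the predicate disallows
// to -inf before the softmax, so masked positions contribute nothing.
template <typename Scores, typename MaskPredicate>
void apply_block_mask(Scores& scores, std::size_t b, std::size_t h, MaskPredicate mask)
{
    for (std::size_t q = 0; q < scores.rows(); ++q)
        for (std::size_t k = 0; k < scores.cols(); ++k)
            if (!mask(b, h, q, k))
                scores(q, k) = -std::numeric_limits<float>::infinity();
}

// e.g. apply_block_mask(scores, batch, head, causal_block_mask{});
//      apply_block_mask(scores, batch, head, sliding_window_block_mask<128>{});

Note that this dense form only masks scores; the compute savings would come from evaluating the same predicate at block granularity and skipping fully masked blocks, which is what PyTorch's flex_attention does with its block mask.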
