Add transformer example with RoPE and MoE-like mechanisms #3078
Conversation
…des an optimized linear transformation for multi-dimensional inputs.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…tion-free tokenization
Have you looked at https://arxiv.org/pdf/2202.08906 and https://arxiv.org/pdf/2308.00951?
Yes, I'm familiar with these mechanisms, although a strict implementation would be greatly facilitated by support for dynamic network building, which is not typically the approach or philosophy behind Dlib. The example I provided is actually closer to a Soft MoE mechanism than to a standard MoE.

That said, thanks to the most recent additions to Dlib, we now have quite a few tools available to build more "modern" networks, at least in line with recent publications. For example, for those interested in integrating causal attention in image processing, I've also published a fully functional and performant ViT-like architecture using Dlib (and a pre-computed model to test). I'm still experimenting with a few specific architectural patterns, and I hope to include a ViT example in the coming weeks.

Back to the MoE topic: I have the intuition that we might get close to dynamic networks via a sort of "network-in-a-network" approach, or more precisely, networks within a layer. I'm currently evaluating this possibility, and if the results are promising, I'll be sure to share an example as well.
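For readers less familiar with the Soft MoE idea mentioned above, here is a minimal, hedged sketch of the soft blending it relies on: every expert processes the token and the outputs are combined with softmax gate weights, instead of hard-routing each token to a single expert. This is plain dlib::matrix code for illustration only; the names soft_moe_blend, expert_w and gate_w are made up for the sketch and are not part of dlib or of this PR.

```cpp
#include <dlib/matrix.h>
#include <cmath>
#include <vector>

using namespace dlib;

// Every expert processes the token; the outputs are blended with softmax
// gate weights instead of hard-routing the token to a single expert.
matrix<float,0,1> soft_moe_blend(
    const matrix<float,0,1>& token,              // input feature vector
    const std::vector<matrix<float>>& expert_w,  // one weight matrix per expert
    const matrix<float>& gate_w                  // gating matrix: num_experts x dims
)
{
    // Gating logits and a numerically stable softmax over experts.
    matrix<float,0,1> logits = gate_w * token;
    const float m = max(logits);
    matrix<float,0,1> probs(logits.nr());
    for (long i = 0; i < logits.nr(); ++i)
        probs(i) = std::exp(logits(i) - m);
    probs /= sum(probs);

    // Weighted sum of all expert outputs (no expert is skipped).
    matrix<float,0,1> out = zeros_matrix<float>(expert_w[0].nr(), 1);
    for (size_t e = 0; e < expert_w.size(); ++e)
        out += probs(static_cast<long>(e)) * (expert_w[e] * token);
    return out;
}
```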
Of course, the shared example is just that: an example. For a more robust MoE implementation, we would likely need to add things like a Gaussian noise layer to improve the distribution across experts (ideally deactivated or made transparent during inference), implement a top-n ranking mechanism, and so on. But again, embedding such logic directly within a layer would significantly simplify the whole process.
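As a hedged sketch of the noisy top-n gating described above (Gaussian noise on the gate logits during training only, then keep the n best experts and renormalize their weights): the function noisy_top_n_gate and its parameters below are illustrative names, not part of the PR or of dlib.

```cpp
#include <dlib/matrix.h>
#include <dlib/rand.h>
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

using namespace dlib;

// Returns (expert index, mixture weight) pairs for the selected experts.
std::vector<std::pair<long,float>> noisy_top_n_gate(
    matrix<float,0,1> logits,   // raw gating logits, one per expert
    long top_n,                 // number of experts to keep
    bool is_training,           // noise only during training, transparent at inference
    dlib::rand& rnd,
    float noise_std = 1.0f
)
{
    // 1) Perturb the logits with Gaussian noise to spread load across experts.
    if (is_training)
        for (long i = 0; i < logits.nr(); ++i)
            logits(i) += noise_std * static_cast<float>(rnd.get_random_gaussian());

    // 2) Rank experts by perturbed logit and keep the top n.
    std::vector<std::pair<long,float>> ranked;
    for (long i = 0; i < logits.nr(); ++i)
        ranked.emplace_back(i, logits(i));
    const long n = std::min<long>(top_n, static_cast<long>(ranked.size()));
    std::partial_sort(ranked.begin(), ranked.begin() + n, ranked.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    ranked.resize(n);

    // 3) Softmax over the kept logits so the selected weights sum to 1.
    const float m = ranked[0].second;
    float z = 0;
    for (auto& p : ranked) { p.second = std::exp(p.second - m); z += p.second; }
    for (auto& p : ranked) p.second /= z;
    return ranked;
}
```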
I did think, a couple of years ago, about creating a new API that was dynamic, but then gave up, and now I do everything in PyTorch and onnxruntime. I think the template-based network building is no longer fit for purpose: I found it could take up to 20 minutes to compile a YOLO model.
Yeah, I regret not making it all runtime classes instead of the template thing :| Oh well. I use pytorch for DNN stuff as well. Still use dlib for many other things though.
Despite these constraints, with the recent modifications it is now possible to build networks in Dlib following the most "modern" architectures, whether for text sequence processing (LMs) or image processing (ViT). I remember several people doubting that we could even implement causal attention in Dlib, yet I managed to produce, even with a templated structure, a functional and relatively performant building block, thus paving the way for bringing in new, up-to-date models. I will also share a ViT example soon (a model is already available as an example). The key is simply allowing ourselves not to put every layer directly in Dlib, so as to somewhat simplify future developments.
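To make the causal-attention point concrete, here is the masking step at its heart, written as a small dlib::matrix function rather than the templated layers the actual example uses: position i may only attend to positions j <= i, which is equivalent to setting the upper-triangular scores to -inf before a row-wise softmax. causal_softmax is an illustrative name, not dlib API.

```cpp
#include <dlib/matrix.h>
#include <algorithm>
#include <cmath>

using namespace dlib;

// scores: seq_len x seq_len attention logits (Q*K^T / sqrt(d)).
matrix<float> causal_softmax(matrix<float> scores)
{
    for (long i = 0; i < scores.nr(); ++i)
    {
        // Numerically stable softmax over the allowed columns 0..i only.
        float m = scores(i, 0);
        for (long j = 1; j <= i; ++j) m = std::max(m, scores(i, j));
        float z = 0;
        for (long j = 0; j <= i; ++j) { scores(i, j) = std::exp(scores(i, j) - m); z += scores(i, j); }
        for (long j = 0; j <= i; ++j) scores(i, j) /= z;

        // Masked (future) positions contribute nothing: exp(-inf) == 0.
        for (long j = i + 1; j < scores.nc(); ++j) scores(i, j) = 0;
    }
    return scores;
}
```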
What would be super awesome is the ability to provide an attention predicate like the new
Then use
Hopefully this could be used to improve performance as well by not wasting compute.
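Purely as a hypothetical sketch of what such an attention predicate could look like (nothing below is existing dlib API, and attention_predicate and masked_scores are invented names): the caller passes a callable that decides which (query, key) pairs are worth computing, so masked pairs are skipped entirely instead of being scored and then zeroed out by the softmax.

```cpp
#include <dlib/matrix.h>
#include <cmath>
#include <functional>

using namespace dlib;

using attention_predicate = std::function<bool(long query_pos, long key_pos)>;

// Only score the (i, j) pairs the predicate allows; everything else stays zero.
matrix<float> masked_scores(
    const matrix<float>& Q,   // seq_len x d
    const matrix<float>& K,   // seq_len x d
    const attention_predicate& allow
)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(Q.nc()));
    matrix<float> scores = zeros_matrix<float>(Q.nr(), K.nr());
    for (long i = 0; i < Q.nr(); ++i)
        for (long j = 0; j < K.nr(); ++j)
            if (allow(i, j))   // skip disallowed pairs entirely
                scores(i, j) = dot(rowm(Q, i), rowm(K, j)) * scale;
    return scores;
}

// Example: causal attention expressed as a predicate.
// auto causal = [](long i, long j) { return j <= i; };
```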
This PR introduces a new example demonstrating:
Key features:
The MoE layer serves as both:
Implementation references: