Add RMS Normalization Layer #2999
Conversation
Adding a complementary test for consistency with the tests covering the other layers.
Fix dangling pointer issue in the CUDA implementation of rms_normalize_gradient. Key changes include:
Looks good I think, except for the cuda pathway. Try building and running the tests locally though with cuda. When I do that I get this build error
I.e. just do
and make sure you see it say it's going to use cuda. Presumably it will since you are using cuda with dlib already though.
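The exact commands were in the original comment's code blocks, which weren't captured here. Configuring dlib's test suite with CUDA typically looks roughly like this (the paths and the wording of the CMake status message are from memory, so treat them as assumptions):

```shell
cd dlib/test
mkdir build && cd build
cmake .. -DDLIB_USE_CUDA=1       # DLIB_USE_CUDA is dlib's CUDA switch
# The configure output should include a status line saying CUDA will be used.
cmake --build . --config Release
```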
Damn! There's also an update to be made in the test for this new layer. I'll try and have a look tomorrow.
The implementation now seems to be correct, and the compilation tests pass on the various platforms.
The tests that run on github don't compile any GPU code since github doesn't have (or didn't anyway, maybe we can get some now, not sure) GPU machines to test on. So you have to test it locally yourself. Like
And then
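Again, the concrete commands were stripped from this transcript; after building, one would typically run the test driver directly, something like the following (the `--test_dnn` switch name follows dlib's tester-naming convention but is my assumption, not verified against the current test driver):

```shell
cd dlib/test/build
./dtest --test_dnn    # runs the dnn tests, which include the test_layer() checks
```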
Probably just need the other fixes @arrufat has over at #3001.
Thank you for your feedback, Davis. I had indeed tested the implementation using an external program that allowed me to compare the CPU and CUDA executions against each other and against the theoretical expected results. The CPU version was tested against the expected theoretical output (as given in the dnn.cpp test), and I hadn't initially noticed any issues with the expected results.
These tests passed in my external program. However, I've now been able to reproduce the gradient error you mentioned when running:
Given that I can now reproduce this issue, I will continue to investigate the discrepancy between my initial tests and the test_layer function results. I appreciate your pointing this out, and I'll work on resolving this inconsistency.
I've reviewed the discussion regarding the layer_norm issue, and I acknowledge that the problem might indeed stem from poorly managed concurrency. However, at this point, I'm still uncertain about the exact source of the error...

In light of this, I'm in the process of completely rewriting the layer. While the current implementation correctly handles batches, it does so via channels rather than treating each matrix in the tensor individually. I'll be posting updated versions addressing this aspect soon.

Simultaneously, to investigate the CUDA-related issue, I've initially adopted a simpler approach to writing the kernels, essentially mirroring the CPU code. This should help in isolating any problems specifically related to gradient calculation. Additionally, I'll be updating the test program in dnn.cpp to reflect these changes and provide more comprehensive testing.
It's getting late, I'll continue searching tomorrow. |
Is the plan to slowly merge all the transformer stuff?
That's exactly the point... I already have a technically functional implementation, at least on simple examples, and I'm gradually transforming the new layers to try and make them as efficient as possible and in line with Dlib practices. It's a time-consuming job... but I'm making progress.
Awesome! Are you going to implement flash attention somehow?
I'm currently working on implementing the basic mechanism, specifically the multihead attention. |
I'm currently reviewing the automated checks and recompilations under GitHub to ensure everything is working correctly. I've performed a general update of the various implementations related to the rms_norm_ layer. After testing in my own programs, everything now seems to be working correctly, both under CPU and GPU architectures. |
I believe the pull request is now ready for @davis to review and merge, pending his final approval. |
Nice, this is great. Thanks for the PR :) |
This PR introduces a new RMS (Root Mean Square) Normalization layer to Dlib. RMS Normalization is a variant of Layer Normalization that has shown promising results in various deep learning tasks, particularly in Natural Language Processing.
Key changes:
This new layer provides an alternative to the existing layer_norm_, offering potential performance benefits and improved training stability in certain scenarios.
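Concretely, RMS normalization rescales by the root mean square instead of subtracting the mean and dividing by the standard deviation as layer normalization does (the epsilon placement shown below is the common convention from the RMSNorm paper and may differ slightly from the code):

```latex
\operatorname{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \varepsilon},
\qquad
y_i = \gamma_i \,\frac{x_i}{\operatorname{RMS}(x)}
```

Dropping the mean-centering and the bias term is what makes it cheaper than layer normalization while keeping the re-scaling invariance that stabilizes training.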
Usage Example:
For a comprehensive example of how to use this new RMS Normalization layer in a Transformer-based architecture, please refer to the ERNIE project: https://github.com/Cydral/ERNIE
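The usage example itself wasn't captured in this transcript. As a rough pseudocode sketch only — the `rms_norm` alias name is inferred from dlib's `layer_norm`/`layer_norm_` naming convention and this PR's `rms_norm_`, and has not been verified against the merged API — the layer would compose like dlib's other normalization layers:

```cpp
// Pseudocode sketch, not compiled against dlib.
using net_type = loss_multiclass_log<
    fc<10,
    relu<rms_norm<          // normalize activations, learn per-channel gamma
    fc<128,
    input<matrix<float>>
    >>>>>;
```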