Add RMS Normalization Layer #2999

Merged
merged 9 commits into davisking:master from rms-norm-layer on Sep 7, 2024

Conversation

@Cydral (Contributor) commented Aug 25, 2024

This PR introduces a new RMS (Root Mean Square) Normalization layer to Dlib. RMS Normalization is a variant of Layer Normalization that has shown promising results in various deep learning tasks, particularly in Natural Language Processing.

Key changes:

  • Add rms_norm_ class implementing the RMS Normalization layer
  • Implement rms_normalize and rms_normalize_gradient utility functions
  • Add CPU and CUDA implementations for RMS Normalization
  • Include unit tests for the new layer

This new layer provides an alternative to the existing layer_norm_, offering potential performance benefits and improved training stability in certain scenarios.
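The key computational difference from the existing layer_norm_ is that RMS normalization skips the mean subtraction and simply rescales each vector by its root mean square. A minimal standalone sketch of the idea (illustrative only, not the actual dlib API):

// Illustrative sketch: y_i = gamma_i * x_i / sqrt(mean(x^2) + eps).
// Unlike layer normalization, no mean is subtracted and no variance is used.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> rms_normalize_sketch(
    const std::vector<float>& x,
    const std::vector<float>& gamma,  // assumed to have the same size as x
    float eps = 1e-5f
)
{
    float mean_sq = 0;
    for (float v : x)
        mean_sq += v*v;
    mean_sq /= x.size();
    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = gamma[i] * x[i] * inv_rms;
    return y;
}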

Usage Example:
For a comprehensive example of how to use this new RMS Normalization layer in a Transformer-based architecture, please refer to the ERNIE project: https://github.com/Cydral/ERNIE
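For a quick look, here is a minimal usage sketch, assuming the PR also adds the usual rms_norm alias (add_layer<rms_norm_, SUBNET>) in the same way layer_norm wraps layer_norm_; check the merged headers for the exact name:

#include <dlib/dnn.h>
using namespace dlib;

// Hypothetical toy classifier that applies RMS normalization between two
// fully connected layers, mirroring how layer_norm is typically used.
using net_type = loss_multiclass_log<
    fc<10,
    relu<rms_norm<
    fc<128,
    input<matrix<float>>
    >>>>>;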

@Cydral (Contributor, Author) left a comment

Adding a complementary test to stay consistent with the other checks.

@Cydral (Contributor, Author) commented Aug 29, 2024

Fix dangling pointer issue in CUDA implementation of rms_normalize_gradient:

Key changes include:

  1. Moving the allocation of the dscale tensor to the rms_norm_ class as a private member.
  2. Allocating memory for dscale in the setup method of rms_norm_.
  3. Using the member variable dscale in the rms_normalize_gradient function instead of a local declaration.
  4. Consistently passing dscale through all relevant functions to ensure an identical interface between the CPU and CUDA implementations.
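A rough sketch of that pattern, using the names from this discussion (the actual PR code differs in detail and goes through dlib's usual layer interface):

#include <dlib/dnn.h>

// Sketch of the fix described above: the dscale scratch tensor used by
// rms_normalize_gradient is owned by the layer object instead of being a
// function-local temporary, so its storage safely outlives the (possibly
// asynchronous) CUDA work, and the CPU and CUDA paths share the same
// 8-argument interface.
class rms_norm_sketch
{
public:
    void setup_scratch(const dlib::tensor& x)
    {
        // Allocate once during setup rather than inside the gradient function.
        dscale.set_size(x.num_samples());
    }

    void backward_sketch(
        double eps,
        const dlib::tensor& gradient_input,
        const dlib::tensor& scale,
        const dlib::tensor& x,
        const dlib::tensor& gamma,
        dlib::tensor& src_grad,
        dlib::tensor& gamma_grad
    )
    {
        // The member dscale is passed through, matching the 8-argument call
        // shown in the test snippet later in this thread; the real layer
        // dispatches to either the cpu:: or cuda:: implementation.
        dlib::cpu::rms_normalize_gradient(eps, gradient_input, scale, x, gamma,
                                          src_grad, gamma_grad, dscale);
    }

private:
    dlib::resizable_tensor dscale;  // gradient w.r.t. the per-sample scale values
};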

@davisking (Owner) commented Aug 29, 2024

Looks good I think, except for the cuda pathway. Try building and running the tests locally with cuda though. When I do that I get this build error:

/home/davis/source/dlib/dlib/test/dnn.cpp: In function ‘void {anonymous}::test_rms_normalize()’:
/home/davis/source/dlib/dlib/test/dnn.cpp:718:107: error: too few arguments to function ‘void dlib::cpu::rms_normalize_gradient(double, const dlib::tensor&, const dlib::tensor&, const dlib::tensor&, const dlib::tensor&, dlib::tensor&, dlib::tensor&, dlib::tensor&)’
         cpu::rms_normalize_gradient(eps, gradient_input, scale_cpu, x, gamma, src_grad_cpu, gamma_grad_cpu);

I.e. just do

cd dlib/test
mkdir build
cd build
cmake ..
make -j6 dtest && ./dtest --test_dnn -d

and make sure you see it say it's going to use cuda. Presumably it will since you are using cuda with dlib already though.

@Cydral (Contributor, Author) commented Aug 29, 2024

Damn! There's also an update to be made in the test for this new layer. I'll try and have a look tomorrow.

@Cydral (Contributor, Author) commented Aug 31, 2024

The implementation now seems to be correct, and the compilation tests have passed on the various platforms.

@davisking (Owner) commented:

The tests that run on github don't compile any GPU code since github doesn't have (or didn't anyway, maybe we can get some now, not sure) GPU machines to test on. So you have to test it locally yourself. Like

diff --git a/dlib/test/dnn.cpp b/dlib/test/dnn.cpp
index f8c18e57..2f24084a 100644
--- a/dlib/test/dnn.cpp
+++ b/dlib/test/dnn.cpp
@@ -2079,12 +2079,12 @@ namespace
             auto res = test_layer(l);
             DLIB_TEST_MSG(res, res);
         }
-        {
-            print_spinner();
-            layer_norm_ l;
-            auto res = test_layer(l);
-            DLIB_TEST_MSG(res, res);
-        }
+        // {
+        //     print_spinner();
+        //     layer_norm_ l;
+        //     auto res = test_layer(l);
+        //     DLIB_TEST_MSG(res, res);
+        // }
         {  
             print_spinner();  
             rms_norm_ l;  

And then make -j6 dtest && ./dtest --test_dnn -d. For me I still see

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! TEST FAILED: test_dnn !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Failure message from test: 

Error occurred at line 2092.
Error occurred in file /home/davis/source/dlib/dlib/test/dnn.cpp.
Failing expression was res.
Gradient error in parameter #0.  Relative error: 0.208587
expected derivative: -2.85984
output derivative:   -2.26331
iteration:           0

Probably just need the other fixes @arrufat has over at #3001.

@Cydral (Contributor, Author) commented Sep 1, 2024

Thank you for your feedback, Davis. I had indeed tested the implementation with an external program that compares the CPU and CUDA executions against each other and against the theoretically expected results. The CPU version was checked against the expected output (as given in the dnn.cpp test), and I hadn't initially noticed any issues with the results.

// Forward pass: check the CUDA output and scale against the CPU reference (y_cpu, scale_cpu).
resizable_tensor y_cuda(x);
resizable_tensor scale_cuda(x.num_samples());
cuda::rms_normalize(eps, y_cuda, scale_cuda, x, gamma);
DLIB_TEST_MSG(max(abs(mat(y_cpu) - mat(y_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(y_cpu) - mat(y_cuda))) < 1e-5");
DLIB_TEST_MSG(max(abs(mat(scale_cpu) - mat(scale_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(scale_cpu) - mat(scale_cuda))) < 1e-5");

// Backward pass: check that the CPU and CUDA gradient paths agree.
resizable_tensor gradient_input(x);
resizable_tensor src_grad_cpu(x), gamma_grad_cpu(1, x.k(), x.nr(), x.nc());
resizable_tensor src_grad_cuda(x), gamma_grad_cuda(1, x.k(), x.nr(), x.nc());
resizable_tensor dscale_cpu(x.num_samples()), dscale_cuda(x.num_samples());
rnd.fill_gaussian(gradient_input);
src_grad_cpu = 0;
src_grad_cuda = 0;
cpu::rms_normalize_gradient(eps, gradient_input, scale_cpu, x, gamma, src_grad_cpu, gamma_grad_cpu, dscale_cpu);
cuda::rms_normalize_gradient(eps, gradient_input, scale_cuda, x, gamma, src_grad_cuda, gamma_grad_cuda, dscale_cuda);
DLIB_TEST_MSG(max(abs(mat(src_grad_cpu) - mat(src_grad_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(src_grad_cpu) - mat(src_grad_cuda))) < 1e-5");
DLIB_TEST_MSG(max(abs(mat(gamma_grad_cpu) - mat(gamma_grad_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(gamma_grad_cpu) - mat(gamma_grad_cuda))) < 1e-5");

These tests passed in my external program. However, I've now been able to reproduce the gradient error you mentioned when running:

layer_norm_ l;
auto res = test_layer(l);
DLIB_TEST_MSG(res, res);
test failed: Gradient error in parameter #0.  Relative error: 0.203608
expected derivative: -3.19285
output derivative:   -2.54276
iteration:           0

Given that I can now reproduce this issue, I will continue to investigate the discrepancy between my initial tests and the test_layer function results. I appreciate your pointing this out, and I'll work on resolving this inconsistency.

@davisking (Owner) commented:

Sweet, no worries :D Definitely look at the change @arrufat just made, since it looks like this PR is based on the layer norm code and probably ends up with the same kind of bug the layer norm code had, which @arrufat just fixed :D

@Cydral (Contributor, Author) commented Sep 1, 2024

I've reviewed the discussion regarding the layer_norm issue, and I acknowledge that the problem might indeed stem from poorly managed concurrency. However, at this point, I'm still uncertain about the exact source of the error...

In light of this, I'm in the process of completely rewriting the layer. While the current implementation correctly handles batches, it does so via channels rather than treating each matrix in the tensor individually. I'll be posting updated versions addressing this aspect soon.

Simultaneously, to investigate the CUDA-related issue, I've initially adopted a simpler approach to writing the kernels, essentially mirroring the CPU code. This should help in isolating any problems specifically related to gradient calculation.

Additionally, I'll be updating the test program in dnn.cpp to reflect these changes and provide more comprehensive testing.

@Cydral (Contributor, Author) commented Sep 1, 2024

It's getting late, I'll continue searching tomorrow.

@pfeatherstone (Contributor) commented:

Is the plan to slowly merge all the transformer stuff?

@Cydral (Contributor, Author) commented Sep 2, 2024

That's exactly the point... I already have a technically functional implementation, at least on simple examples, and I'm gradually transforming the new layers to try and make them as efficient as possible and in line with Dlib practices. It's a time-consuming job... but I'm making progress.

@pfeatherstone (Contributor) commented:

Awesome! Are you going to implement flash attention somehow?

@Cydral (Contributor, Author) commented Sep 2, 2024

I'm currently working on implementing the basic mechanism, specifically the multihead attention.

@Cydral (Contributor, Author) commented Sep 7, 2024

I'm currently reviewing the automated checks and rebuilds on GitHub to make sure everything passes. I've done a general update of the various implementations related to the rms_norm_ layer, and after testing in my own programs everything now seems to work correctly on both CPU and GPU architectures.

There were some concurrency issues, particularly in the GPU part, related to the gradient update, and I took the opportunity to simplify the calculations a bit. Also, the normalization values are no longer saved in the layer: they are intermediate variables that are recomputed in the forward pass (the same applies to the derivative in the backward pass). Only the learned gamma weights remain in the layer's parameter storage. For now there is one gamma value per channel, but it would be interesting to study the impact of applying a gamma matrix per channel to allow for better and more independent local adaptation.
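For illustration, a standalone sketch of the forward pass as described (per-sample RMS recomputed on the fly, one learned gamma per channel); this is an interpretation of the description above rather than the literal merged code:

#include <cmath>
#include <cstddef>
#include <vector>

// x and y are laid out as num_samples * k * nr * nc (row-major); gamma holds
// one value per channel. The per-sample 1/RMS factor is an intermediate value
// recomputed on every call, never stored in the layer.
void rms_norm_forward_sketch(
    const std::vector<float>& x,
    const std::vector<float>& gamma,
    std::vector<float>& y,
    std::size_t num_samples, std::size_t k, std::size_t nr, std::size_t nc,
    float eps = 1e-5f)
{
    const std::size_t sample_size = k * nr * nc;
    const std::size_t plane = nr * nc;
    y.resize(x.size());
    for (std::size_t n = 0; n < num_samples; ++n)
    {
        // Mean of squares over the whole sample, then its inverse RMS.
        float mean_sq = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
        {
            const float v = x[n*sample_size + i];
            mean_sq += v*v;
        }
        mean_sq /= sample_size;
        const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

        // One learned gamma per channel rescales the normalized values.
        for (std::size_t c = 0; c < k; ++c)
            for (std::size_t i = 0; i < plane; ++i)
                y[n*sample_size + c*plane + i] = x[n*sample_size + c*plane + i] * inv_rms * gamma[c];
    }
}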

@Cydral (Contributor, Author) commented Sep 7, 2024

I believe the pull request is now ready for @davis to review and merge, pending his final approval.

@davisking (Owner) commented:

Nice, this is great. Thanks for the PR :)

@davisking merged commit fafdac3 into davisking:master on Sep 7, 2024
10 checks passed
@Cydral deleted the rms-norm-layer branch on November 4, 2024 08:41