Add RMS Normalization Layer #2999

Merged
merged 9 commits into davisking:master from rms-norm-layer on Sep 7, 2024

Conversation

@Cydral (Contributor) commented Aug 25, 2024

This PR introduces a new RMS (Root Mean Square) Normalization layer to Dlib. RMS Normalization is a variant of Layer Normalization that has shown promising results in various deep learning tasks, particularly in Natural Language Processing.

Key changes:

  • Add rms_norm_ class implementing the RMS Normalization layer
  • Implement rms_normalize and rms_normalize_gradient utility functions
  • Add CPU and CUDA implementations for RMS Normalization
  • Include unit tests for the new layer

This new layer provides an alternative to the existing layer_norm_, offering potential performance benefits and improved training stability in certain scenarios.
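The key computational difference from the existing layer_norm_ is that RMS normalization skips the mean subtraction and simply rescales each vector by its root mean square. A minimal standalone sketch of the idea (illustrative only, not the actual dlib API):

// Illustrative sketch: y_i = gamma_i * x_i / sqrt(mean(x^2) + eps).
// Unlike layer normalization, no mean is subtracted and no variance is used.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> rms_normalize_sketch(
    const std::vector<float>& x,
    const std::vector<float>& gamma,  // assumed to have the same size as x
    float eps = 1e-5f
)
{
    float mean_sq = 0;
    for (float v : x)
        mean_sq += v*v;
    mean_sq /= x.size();
    const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = gamma[i] * x[i] * inv_rms;
    return y;
}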

Usage Example:
For a comprehensive example of how to use this new RMS Normalization layer in a Transformer-based architecture, please refer to the ERNIE project: https://github.com/Cydral/ERNIE
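For a quick look, here is a minimal usage sketch, assuming the PR also adds the usual rms_norm alias (add_layer<rms_norm_, SUBNET>) in the same way layer_norm wraps layer_norm_; check the merged headers for the exact name:

#include <dlib/dnn.h>
using namespace dlib;

// Hypothetical toy classifier that applies RMS normalization between two
// fully connected layers, mirroring how layer_norm is typically used.
using net_type = loss_multiclass_log<
    fc<10,
    relu<rms_norm<
    fc<128,
    input<matrix<float>>
    >>>>>;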

@Cydral (Contributor, Author) left a comment

Adding a complementary test to stay consistent with the other checks.

@Cydral (Contributor, Author) commented Aug 29, 2024

Fix dangling pointer issue in CUDA implementation of rms_normalize_gradient:

Key changes include:

  1. Moving the allocation of the dscale tensor to the rms_norm_ class as a private member.
  2. Allocating memory for dscale in the setup method of rms_norm_.
  3. Using the member variable dscale in the rms_normalize_gradient function instead of a local declaration.
  4. Consistently passing dscale through all relevant functions to ensure an identical interface between the CPU and CUDA implementations.
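A rough sketch of that pattern, using the names from this discussion (the actual PR code differs in detail and goes through dlib's usual layer interface):

#include <dlib/dnn.h>

// Sketch of the fix described above: the dscale scratch tensor used by
// rms_normalize_gradient is owned by the layer object instead of being a
// function-local temporary, so its storage safely outlives the (possibly
// asynchronous) CUDA work, and the CPU and CUDA paths share the same
// 8-argument interface.
class rms_norm_sketch
{
public:
    void setup_scratch(const dlib::tensor& x)
    {
        // Allocate once during setup rather than inside the gradient function.
        dscale.set_size(x.num_samples());
    }

    void backward_sketch(
        double eps,
        const dlib::tensor& gradient_input,
        const dlib::tensor& scale,
        const dlib::tensor& x,
        const dlib::tensor& gamma,
        dlib::tensor& src_grad,
        dlib::tensor& gamma_grad
    )
    {
        // The member dscale is passed through, matching the 8-argument call
        // shown in the test snippet later in this thread; the real layer
        // dispatches to either the cpu:: or cuda:: implementation.
        dlib::cpu::rms_normalize_gradient(eps, gradient_input, scale, x, gamma,
                                          src_grad, gamma_grad, dscale);
    }

private:
    dlib::resizable_tensor dscale;  // gradient w.r.t. the per-sample scale values
};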

@davisking (Owner) commented Aug 29, 2024

Looks good I think, except for the cuda pathway. Try building and running the tests locally with cuda though. When I do that I get this build error:

/home/davis/source/dlib/dlib/test/dnn.cpp: In function ‘void {anonymous}::test_rms_normalize()’:
/home/davis/source/dlib/dlib/test/dnn.cpp:718:107: error: too few arguments to function ‘void dlib::cpu::rms_normalize_gradient(double, const dlib::tensor&, const dlib::tensor&, const dlib::tensor&, const dlib::tensor&, dlib::tensor&, dlib::tensor&, dlib::tensor&)’
         cpu::rms_normalize_gradient(eps, gradient_input, scale_cpu, x, gamma, src_grad_cpu, gamma_grad_cpu);

I.e. just do

cd dlib/test
mkdir build
cd build
cmake ..
make -j6 dtest && ./dtest --test_dnn -d

and make sure you see it say it's going to use cuda. Presumably it will since you are using cuda with dlib already though.

@Cydral (Contributor, Author) commented Aug 29, 2024

Damn! There's also an update to be made in the test for this new layer. I'll try and have a look tomorrow.

@Cydral (Contributor, Author) commented Aug 31, 2024

The implementation now seems to be correct, and the compilation tests have passed on the various platforms.

@davisking (Owner) commented:

The tests that run on github don't compile any GPU code since github doesn't have (or didn't anyway, maybe we can get some now, not sure) GPU machines to test on. So you have to test it locally yourself. Like

diff --git a/dlib/test/dnn.cpp b/dlib/test/dnn.cpp
index f8c18e57..2f24084a 100644
--- a/dlib/test/dnn.cpp
+++ b/dlib/test/dnn.cpp
@@ -2079,12 +2079,12 @@ namespace
             auto res = test_layer(l);
             DLIB_TEST_MSG(res, res);
         }
-        {
-            print_spinner();
-            layer_norm_ l;
-            auto res = test_layer(l);
-            DLIB_TEST_MSG(res, res);
-        }
+        // {
+        //     print_spinner();
+        //     layer_norm_ l;
+        //     auto res = test_layer(l);
+        //     DLIB_TEST_MSG(res, res);
+        // }
         {  
             print_spinner();  
             rms_norm_ l;  

And then make -j6 dtest && ./dtest --test_dnn -d. For me I still see

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! TEST FAILED: test_dnn !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Failure message from test: 

Error occurred at line 2092.
Error occurred in file /home/davis/source/dlib/dlib/test/dnn.cpp.
Failing expression was res.
Gradient error in parameter #0.  Relative error: 0.208587
expected derivative: -2.85984
output derivative:   -2.26331
iteration:           0

Probably just need the other fixes @arrufat has over at #3001.

@Cydral (Contributor, Author) commented Sep 1, 2024

Thank you for your feedback, Davis. I had indeed tested the implementation with an external program that compares the CPU and CUDA executions against each other and against the theoretically expected results. The CPU version was checked against the expected output (as given in the dnn.cpp test), and I hadn't initially noticed any issues with the results.

// Forward pass: check the CUDA output and scale against the CPU reference (y_cpu, scale_cpu).
resizable_tensor y_cuda(x);
resizable_tensor scale_cuda(x.num_samples());
cuda::rms_normalize(eps, y_cuda, scale_cuda, x, gamma);
DLIB_TEST_MSG(max(abs(mat(y_cpu) - mat(y_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(y_cpu) - mat(y_cuda))) < 1e-5");
DLIB_TEST_MSG(max(abs(mat(scale_cpu) - mat(scale_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(scale_cpu) - mat(scale_cuda))) < 1e-5");

// Backward pass: check that the CPU and CUDA gradient paths agree.
resizable_tensor gradient_input(x);
resizable_tensor src_grad_cpu(x), gamma_grad_cpu(1, x.k(), x.nr(), x.nc());
resizable_tensor src_grad_cuda(x), gamma_grad_cuda(1, x.k(), x.nr(), x.nc());
resizable_tensor dscale_cpu(x.num_samples()), dscale_cuda(x.num_samples());
rnd.fill_gaussian(gradient_input);
src_grad_cpu = 0;
src_grad_cuda = 0;
cpu::rms_normalize_gradient(eps, gradient_input, scale_cpu, x, gamma, src_grad_cpu, gamma_grad_cpu, dscale_cpu);
cuda::rms_normalize_gradient(eps, gradient_input, scale_cuda, x, gamma, src_grad_cuda, gamma_grad_cuda, dscale_cuda);
DLIB_TEST_MSG(max(abs(mat(src_grad_cpu) - mat(src_grad_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(src_grad_cpu) - mat(src_grad_cuda))) < 1e-5");
DLIB_TEST_MSG(max(abs(mat(gamma_grad_cpu) - mat(gamma_grad_cuda))) < 1e-5, "rms_norm layer, max(abs(mat(gamma_grad_cpu) - mat(gamma_grad_cuda))) < 1e-5");

These tests passed in my external program. However, I've now been able to reproduce the gradient error you mentioned when running:

layer_norm_ l;
auto res = test_layer(l);
DLIB_TEST_MSG(res, res);
test failed: Gradient error in parameter #0.  Relative error: 0.203608
expected derivative: -3.19285
output derivative:   -2.54276
iteration:           0

Given that I can now reproduce this issue, I will continue to investigate the discrepancy between my initial tests and the test_layer function results. I appreciate your pointing this out, and I'll work on resolving this inconsistency.

@davisking (Owner) commented:

Sweet, no worries :D Definitely look at the change @arrufat just made, since it looks like this PR is based on the layer norm code and probably ends up with the same kind of bug the layer norm code had, which @arrufat just fixed :D

@Cydral (Contributor, Author) commented Sep 1, 2024

I've reviewed the discussion regarding the layer_norm issue, and I acknowledge that the problem might indeed stem from poorly managed concurrency. However, at this point, I'm still uncertain about the exact source of the error...

In light of this, I'm in the process of completely rewriting the layer. While the current implementation correctly handles batches, it does so via channels rather than treating each matrix in the tensor individually. I'll be posting updated versions addressing this aspect soon.

Simultaneously, to investigate the CUDA-related issue, I've initially adopted a simpler approach to writing the kernels, essentially mirroring the CPU code. This should help in isolating any problems specifically related to gradient calculation.

Additionally, I'll be updating the test program in dnn.cpp to reflect these changes and provide more comprehensive testing.

@Cydral (Contributor, Author) commented Sep 1, 2024

It's getting late, I'll continue searching tomorrow.

@pfeatherstone (Contributor) commented:

Is the plan to slowly merge all the transformer stuff?

@Cydral (Contributor, Author) commented Sep 2, 2024

That's exactly the point... I already have a technically functional implementation, at least on simple examples, and I'm gradually transforming the new layers to try and make them as efficient as possible and in line with Dlib practices. It's a time-consuming job... but I'm making progress.

@pfeatherstone (Contributor) commented:

Awesome! Are you going to implement flash attention somehow?

@Cydral (Contributor, Author) commented Sep 2, 2024

I'm currently working on implementing the basic mechanism, specifically the multihead attention.

@Cydral (Contributor, Author) commented Sep 7, 2024

I'm currently reviewing the automated checks and rebuilds on GitHub to make sure everything passes. I've done a general update of the various implementations related to the rms_norm_ layer, and after testing in my own programs everything now seems to work correctly on both CPU and GPU architectures.

There were some concurrency issues, particularly in the GPU part, related to the gradient update, and I took the opportunity to simplify the calculations a bit. Also, the normalization values are no longer saved in the layer: they are intermediate variables that are recomputed in the forward pass (the same applies to the derivative in the backward pass). Only the learned gamma weights remain in the layer's parameter storage. For now there is one gamma value per channel, but it would be interesting to study the impact of applying a gamma matrix per channel to allow for better and more independent local adaptation.
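For illustration, a standalone sketch of the forward pass as described (per-sample RMS recomputed on the fly, one learned gamma per channel); this is an interpretation of the description above rather than the literal merged code:

#include <cmath>
#include <cstddef>
#include <vector>

// x and y are laid out as num_samples * k * nr * nc (row-major); gamma holds
// one value per channel. The per-sample 1/RMS factor is an intermediate value
// recomputed on every call, never stored in the layer.
void rms_norm_forward_sketch(
    const std::vector<float>& x,
    const std::vector<float>& gamma,
    std::vector<float>& y,
    std::size_t num_samples, std::size_t k, std::size_t nr, std::size_t nc,
    float eps = 1e-5f)
{
    const std::size_t sample_size = k * nr * nc;
    const std::size_t plane = nr * nc;
    y.resize(x.size());
    for (std::size_t n = 0; n < num_samples; ++n)
    {
        // Mean of squares over the whole sample, then its inverse RMS.
        float mean_sq = 0;
        for (std::size_t i = 0; i < sample_size; ++i)
        {
            const float v = x[n*sample_size + i];
            mean_sq += v*v;
        }
        mean_sq /= sample_size;
        const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

        // One learned gamma per channel rescales the normalized values.
        for (std::size_t c = 0; c < k; ++c)
            for (std::size_t i = 0; i < plane; ++i)
                y[n*sample_size + c*plane + i] = x[n*sample_size + c*plane + i] * inv_rms * gamma[c];
    }
}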

@Cydral (Contributor, Author) commented Sep 7, 2024

I believe the pull request is now ready for @davis to review and merge, pending his final approval.

@davisking (Owner) commented:

Nice, this is great. Thanks for the PR :)

@davisking merged commit fafdac3 into davisking:master on Sep 7, 2024
10 checks passed
@Cydral deleted the rms-norm-layer branch on November 4, 2024 08:41