
Support for runtime CPU/CUDA selection #3060


Open · kSkip wants to merge 27 commits into master

Conversation

kSkip
Contributor

@kSkip kSkip commented Feb 26, 2025

This PR adds support for selecting either the CPU or CUDA implementation of the dnn module at runtime. As discussed in #1852, the #ifdef DLIB_USE_CUDA directives in tensor_tools.cpp just need to be replaced with a runtime branch like

if (use_cuda())
    cuda::function();
else
    cpu::function();

The thread-local variable returned by use_cuda() is initialized to be consistent with the current behavior: if dlib is built with DLIB_USE_CUDA=True, use_cuda() will return true by default.
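For illustration, a minimal sketch of that flag (placement and names are illustrative, not necessarily the exact code in this PR):

namespace dlib
{
    namespace
    {
        // Defaults to the compile-time behavior: prefer CUDA when it was built in.
        thread_local bool use_cuda_flag =
#ifdef DLIB_USE_CUDA
            true;
#else
            false;
#endif
    }

    bool use_cuda() { return use_cuda_flag; }
    void set_use_cuda(bool flag) { use_cuda_flag = flag; }
}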

I found while making these changes that it was necessary to keep the preprocessor directives, because the cuda namespace cannot be referenced when CUDA is not linked. Additionally, the pooling and convolution classes in tensor_tools.h have instances of both implementations as members. I couldn't find a cleaner way (with limited time) to handle those cases.

@davisking
Owner

Nice, let me test this a bit and look it over.

@kSkip
Contributor Author

kSkip commented Apr 19, 2025

Hey @davisking! Do you have any feedback, or did you get a chance to test this code?

@davisking davisking requested a review from Copilot April 19, 2025 23:03
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces runtime support for selecting between CPU and CUDA implementations in the dnn module by adding new functions (use_cuda() and set_use_cuda(bool)) and wrapping key tensor operation calls with runtime branches.

  • In test code (dlib/test/dnn.cpp), additional calls to set_use_cuda(true) and set_use_cuda(false) have been added to drive the runtime behavior.
  • In dlib/cuda/tensor_tools.h, the preprocessor‐guarded calls have been refactored so that calls to CUDA functions are conditionally executed based on the current runtime flag while still preserving compatibility when CUDA is not linked.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File                       Description
dlib/test/dnn.cpp          Added calls to set_use_cuda(true)/set_use_cuda(false) to force the runtime branch for CUDA tests.
dlib/cuda/tensor_tools.h   Refactored tensor tool functions to select between cuda_impl and cpu_impl dynamically at runtime.
Comments suppressed due to low confidence (2)

dlib/test/dnn.cpp:5029

  • Consider adding an inline comment or an explicit assertion to verify that the call to set_use_cuda(true) correctly configures the runtime behavior for CUDA functions, and that a corresponding call to set_use_cuda(false) properly resets the state after the GPU-specific tests.
set_use_cuda(true);

dlib/cuda/tensor_tools.h:24

  • Consider adding unit tests specifically targeting the new runtime selection API (use_cuda() and set_use_cuda(bool)) to ensure they return the expected values in different scenarios, thereby validating that the dynamic dispatch in tensor tool functions works as intended.
bool use_cuda();
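A hypothetical sketch combining both suggestions above, using dlib's existing test macros (illustrative only, not code from this PR):

set_use_cuda(true);
DLIB_TEST(use_cuda() == true);
// ... run the CUDA-backed checks here ...
set_use_cuda(false);
DLIB_TEST(use_cuda() == false);
// ... run the CPU-backed checks and compare the results ...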

tabudz and others added 12 commits April 19, 2025 19:06
Throw an error when image width or height is 0.

Fixes mozilla/mozjpeg#140, closes davisking#7.
`throw()` has been deprecated since C++11 and was removed in C++20; `noexcept` is the accepted alternative.
…isking#3063)

If the extra field was larger than the space the user provided with
inflateGetHeader(), and if multiple calls of inflate() delivered
the extra header data, then there could be a buffer overflow of the
provided space. This commit assures that the provided space is not
exceeded.
Solve build error when compiling with GCC 15.
…g#3056)

* Add new BPE_Tokenizer class to Dlib

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests

* Update

* Update

* Last update: optimize BPE tokenizer encoding with parallel paragraph processing

* Use of “in-memory” files to avoid leaving any traces on disk during the test.

* Add one DLIB_TEST_MSG() test per encoded/decoded string

* Add bpe_tokenizer_abstract.h for documentation and integration
@davisking
Owner

Yeah sorry. I know I'm super late to review this. Looking at it now a bit. Just pushed a merge commit with master into the PR too.

Comment on lines +24 to +25
bool use_cuda();
void set_use_cuda(bool flag);
Owner


Add any new public functions to the docs (the _abstract.h files). These should be documented next to set_dnn_prefer_smallest_algorithms(), so over in dlib/dnn/core_abstract.h.
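For example, something along these lines, following the existing _abstract.h style (the wording of the ensures clauses is just a sketch):

    bool use_cuda (
    );
    /*!
        ensures
            - returns true if dlib will dispatch tensor operations to the CUDA
              implementations, and false if it will use the CPU implementations.
    !*/

    void set_use_cuda (
        bool flag
    );
    /*!
        ensures
            - #use_cuda() == flag
    !*/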

Comment on lines 108 to 114
if (use_cuda())
{
    cuda::dot_prods(add_to, out, lhs, rhs);
}
else
{
    out = sum_cols(pointwise_multiply(mat(lhs), mat(rhs)));
#endif
Owner


So this PR is a good idea. But can you change these so they are simpler? The preprocessor macro logic is now pretty complicated and makes the code hard to read.

But it could be done like this instead:

    void dot_prods (
        resizable_tensor& out, 
        const tensor& lhs, 
        const tensor& rhs
    )    
    {    
        IF_DLIB_USE_CUDA(
            cuda::dot_prods(out, lhs, rhs);
        ) 

        IF_DLIB_NOT_USE_CUDA(
            out = sum_cols(pointwise_multiply(mat(lhs), mat(rhs))); 
        )    
    }   

Which would be a pretty trivial mechanical change from what dlib has now and doesn't really change the readability of the code. And inside IF_DLIB_USE_CUDA you just have an if (use_cuda()) wrapped inside a #ifdef DLIB_USE_CUDA.

That is, define these macros:

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(expression) if (use_cuda()) { expression }
#else
#define IF_DLIB_USE_CUDA(expression) 
#endif

#ifdef DLIB_USE_CUDA
#define IF_DLIB_NOT_USE_CUDA(expression) if (!use_cuda()) { expression }
#else
#define IF_DLIB_NOT_USE_CUDA(expression) expression
#endif

and then any code that looks like this on master:

#ifdef DLIB_USE_CUDA
    foo;
    bar;
    whatever;
#else
    foo_cpu;
    whatever2;
#endif   

Just turns into:

IF_DLIB_USE_CUDA(
    foo;
    bar;
    whatever;
)

IF_DLIB_NOT_USE_CUDA(
    foo_cpu;
    whatever2;
)

@davisking
Owner

davisking commented Apr 19, 2025

Yeah sorry for the late review. Just been real busy with other things. Every day I see this in my email and think about how late I've been, but it hasn't made it up my todo list until today. Haven't had a lot of free time lately 😢

Anyway, change it around to use the macros I suggested. Then the PR will be a trivial and obviously correct mechanical change to dlib's existing code and I'll merge it right in :D

@kSkip
Contributor Author

kSkip commented May 6, 2025

Thanks! I also had difficulty finding the time to fix this.

I defined variadic macros because IF_DLIB_USE_CUDA(expression) implies a single argument and the preprocessor splits on all the commas in the wrapped code. I didn't test this with MSVC, but it should work as expected.
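For reference, the variadic form looks roughly like this (a sketch of the idea rather than the exact macros in the PR):

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(...) if (use_cuda()) { __VA_ARGS__ }
#define IF_DLIB_NOT_USE_CUDA(...) if (!use_cuda()) { __VA_ARGS__ }
#else
#define IF_DLIB_USE_CUDA(...)
#define IF_DLIB_NOT_USE_CUDA(...) __VA_ARGS__
#endif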

@davisking davisking requested a review from Copilot May 24, 2025 17:14
Cydral and others added 6 commits May 24, 2025 13:14
…king#3076)

* Implementation of linear_ layer for neural networks. This layer provides an optimized linear transformation for multi-dimensional inputs.

* Minor change

* Update dlib/dnn/layers.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Add reshape_to and flatten layers to Dlib's DNN module

* Missing update to "visitors.h"

* format fixing for reshape_to

* Update dlib/test/dnn.cpp

---------

Co-authored-by: Davis E. King <davis685@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

Adds runtime selection between CPU and CUDA implementations in the DNN module, replacing compile-time #ifdef DLIB_USE_CUDA branches with a use_cuda() flag.

  • Introduce use_cuda() and set_use_cuda(bool) APIs in core_abstract.h.
  • Update tests in dlib/test/dnn.cpp to toggle CUDA on/off at runtime.
  • Refactor dlib/cuda/tensor_tools.h to hold both CPU and CUDA instances and dispatch via IF_DLIB_USE_CUDA/IF_DLIB_NOT_USE_CUDA macros.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File                       Description
dlib/test/dnn.cpp          Add set_use_cuda(true/false) calls around CUDA-CPU tests
dlib/dnn/core_abstract.h   Declare use_cuda() and set_use_cuda(bool) with docs
dlib/cuda/tensor_tools.h   Define dispatch macros, duplicate flag declarations, and split each op into CPU/CUDA branches
Comments suppressed due to low confidence (2)

dlib/test/dnn.cpp:5029

  • The new runtime flag behavior is only exercised in test_multm_prev(). Consider adding tests in other modules to validate that use_cuda() correctly switches implementations for all runtime-dispatched operations.
set_use_cuda(true);

dlib/cuda/tensor_tools.h:36

  • use_cuda() and set_use_cuda() are already declared in dlib/dnn/core_abstract.h. Remove these duplicate declarations and include the core header instead to avoid ODR issues.
bool use_cuda();

Owner

@davisking davisking left a comment


Ah I just realized this needs more work. I've been going over it for the last hour and a bit. I was updating the unit tests to work with and without a GPU. But I just realized that even when use_cuda()==false it's still using cuda. You can see this by running this for example:

CUDA_VISIBLE_DEVICES= ./dnn_face_recognition_ex ../faces/bald_guys.jpg 

with this change:

--- a/examples/dnn_face_recognition_ex.cpp
+++ b/examples/dnn_face_recognition_ex.cpp
@@ -80,6 +80,7 @@ std::vector<matrix<rgb_pixel>> jitter_image(
 
 int main(int argc, char** argv) try
 {
+    dlib::set_use_cuda(false);
     if (argc != 2)
     {
         cout << "Run this example by invoking it like this: " << endl;

You will get the error shown below, since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory. So the tensor objects would need to be updated too, in a similar way to the other code, so that they don't actually do any allocations when cuda is disabled.

Error while calling cudaGetDevice(&the_device_id) in file /home/davis/source/dlib/dlib/cuda/gpu_data.cpp:204. code: 100, reason: no CUDA-capable device is detected

That CUDA_VISIBLE_DEVICES= tells the cuda runtime to pretend there isn't any GPU attached to the machine. That's really what you want to work here. So that someone can run this stuff on a non-GPU machine even though it's compiled with cuda.

Anyway, I pushed the code I had. If you update the tensors to also use use_cuda() then it should be good to go, unless I'm forgetting some other cuda tooling in there. But I think that's it. The key test is if prefixing programs with CUDA_VISIBLE_DEVICES= results in them running without error even though they are built with cuda support :D

I updated the unit tests in a way that should work with CUDA_VISIBLE_DEVICES=, but like I was saying, I went to test that after doing so and was reminded about the tensor allocation stuff :(

@kSkip
Contributor Author

kSkip commented May 25, 2025

Since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory.

Oh, good catch. I didn't check the behavior with CUDA_VISIBLE_DEVICES=. It seems like this can be accomplished by changing only the gpu_data class, unless I am missing something.

If we conditionally allocate the CUDA memory, we should also support the following scenario.

resizable_tensor data;

set_use_cuda(false);
data.set_size(batch_size, channels, width, height);
float* host_ptr = data.host();

// do something with host_ptr

set_use_cuda(true);
float* device_ptr = data.device();

// do something with device_ptr

Here we must check if the device buffer needs to be allocated when the data currency is checked.
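A rough sketch of that lazy allocation inside the currency check (the member names and the CHECK_CUDA error-checking macro here are placeholders, not dlib's actual gpu_data internals):

// Called whenever a device pointer is requested; part of a gpu_data-like class
// with members: float* data_device; float* data_host; size_t data_size; bool device_current;
float* device()
{
    if (data_device == nullptr)
    {
        // set_size() ran while use_cuda() was false, so no cudaMalloc has
        // happened yet; allocate the device buffer now that it is needed.
        CHECK_CUDA(cudaMalloc(&data_device, data_size*sizeof(float)));
        device_current = false;  // the new buffer doesn't hold the host data yet
    }
    if (!device_current)
    {
        // Usual currency check: refresh the device copy from the host copy.
        CHECK_CUDA(cudaMemcpy(data_device, data_host, data_size*sizeof(float),
                              cudaMemcpyHostToDevice));
        device_current = true;
    }
    return data_device;
}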

@kSkip
Contributor Author

kSkip commented May 25, 2025

Also, do you still want to add the do { } while(0) to the macro?
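(That is, wrapping the expansion in the usual statement-like idiom, roughly:

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(...) do { if (use_cuda()) { __VA_ARGS__ } } while(0)
#else
#define IF_DLIB_USE_CUDA(...) do { } while(0)
#endif

so each use expands to a single statement, at the cost of requiring a trailing semicolon at every call site.)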

@davisking
Owner

Since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory.

Oh, good catch. I didn't check the behavior with CUDA_VISIBLE_DEVICES=. It seems like this can be accomplished by changing only the gpu_data class, unless I am missing something.

If we conditionally allocate the CUDA memory, we should also support the following scenario.

resizable_tensor data;

set_use_cuda(false);
data.set_size(batch_size, channels, width, height);
float* host_ptr = data.host();

// do something with host_ptr

set_use_cuda(true);
float* device_ptr = data.device();

// do something with device_ptr

Here we must check if the device buffer needs to be allocated when the data currency is checked.

Yep, that would make it all work smoothly unless I'm forgetting another part of it as well. But I think that's the only bit remaining.

@davisking
Owner

Also, do you still want to add the do { } while(0) to the macro?

Na, it's good how it is.
