
Support for runtime CPU/CUDA selection #3060


Open · kSkip wants to merge 27 commits into master

Conversation

kSkip
Contributor

@kSkip kSkip commented Feb 26, 2025

This PR adds support for selecting either the CPU or CUDA implementation of the dnn module at runtime. As discussed in #1852, the #ifdef DLIB_USE_CUDA directives in tensor_tools.cpp just need to be replaced with a runtime branch like

if (use_cuda())
    cuda::function();
else
    cpu::function();

The thread-local variable returned by use_cuda() is initialized to be consistent with the current behavior: if dlib is built with DLIB_USE_CUDA=True, use_cuda() will return true by default.
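For illustration, a minimal sketch of that flag (placement and names are illustrative, not necessarily the exact code in this PR):

namespace dlib
{
    namespace
    {
        // Defaults to the compile-time behavior: prefer CUDA when it was built in.
        thread_local bool use_cuda_flag =
#ifdef DLIB_USE_CUDA
            true;
#else
            false;
#endif
    }

    bool use_cuda() { return use_cuda_flag; }
    void set_use_cuda(bool flag) { use_cuda_flag = flag; }
}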

I found while making these changes that it was necessary to keep the preprocessor directives, because the cuda namespace cannot be referenced when CUDA is not linked. Additionally, the pooling and convolution classes in tensor_tools.h have instances of both implementations as members. I couldn't find a cleaner way (with limited time) to handle those cases.

@davisking
Owner

Nice, let me test this a bit and look it over.

@kSkip
Contributor Author

kSkip commented Apr 19, 2025

Hey @davisking! Do you have any feedback, or did you get a chance to test this code?

@davisking davisking requested a review from Copilot April 19, 2025 23:03
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces runtime support for selecting between CPU and CUDA implementations in the dnn module by adding new functions (use_cuda() and set_use_cuda(bool)) and wrapping key tensor operation calls with runtime branches.

  • In test code (dlib/test/dnn.cpp), additional calls to set_use_cuda(true) and set_use_cuda(false) have been added to drive the runtime behavior.
  • In dlib/cuda/tensor_tools.h, the preprocessor‐guarded calls have been refactored so that calls to CUDA functions are conditionally executed based on the current runtime flag while still preserving compatibility when CUDA is not linked.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File                       Description
dlib/test/dnn.cpp          Added calls to set_use_cuda(true)/set_use_cuda(false) to force the runtime branch for CUDA tests.
dlib/cuda/tensor_tools.h   Refactored tensor tool functions to select between cuda_impl and cpu_impl dynamically at runtime.
Comments suppressed due to low confidence (2)

dlib/test/dnn.cpp:5029

  • Consider adding an inline comment or an explicit assertion to verify that the call to set_use_cuda(true) correctly configures the runtime behavior for CUDA functions, and that a corresponding call to set_use_cuda(false) properly resets the state after the GPU-specific tests.
set_use_cuda(true);

dlib/cuda/tensor_tools.h:24

  • Consider adding unit tests specifically targeting the new runtime selection API (use_cuda() and set_use_cuda(bool)) to ensure they return the expected values in different scenarios, thereby validating that the dynamic dispatch in tensor tool functions works as intended.
bool use_cuda();
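A hypothetical sketch combining both suggestions above, using dlib's existing test macros (illustrative only, not code from this PR):

set_use_cuda(true);
DLIB_TEST(use_cuda() == true);
// ... run the CUDA-backed checks here ...
set_use_cuda(false);
DLIB_TEST(use_cuda() == false);
// ... run the CPU-backed checks and compare the results ...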

tabudz and others added 12 commits April 19, 2025 19:06
Throw an error when image width or height is 0.

Fixes mozilla/mozjpeg#140, closes davisking#7.
`throw()` has been deprecated since C++11 and was removed in C++20; `noexcept` is the accepted alternative.
…isking#3063)

If the extra field was larger than the space the user provided with
inflateGetHeader(), and if multiple calls of inflate() delivered
the extra header data, then there could be a buffer overflow of the
provided space. This commit assures that the provided space is not
exceeded.
Solve build error when compiling with GCC 15.
…g#3056)

* Add new BPE_Tokenizer class to Dlib

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests

* Update

* Update

* Last update: optimize BPE tokenizer encoding with parallel paragraph processing

* Use of “in-memory” files to avoid leaving any traces on disk during the test.

* Add one DLIB_TEST_MSG() test per encoded/decoded string

* Add bpe_tokenizer_abstract.h for documentation and integration
@davisking
Owner

Yeah sorry. I know I'm super late to review this. Looking at it now a bit. Just pushed a merge commit with master into the PR too.

Comment on lines +24 to +25
bool use_cuda();
void set_use_cuda(bool flag);
Owner


Add any new public functions to the docs (the _abstract.h files). These should be documented next to set_dnn_prefer_smallest_algorithms(), so over in dlib/dnn/core_abstract.h.
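For example, something along these lines, following the existing _abstract.h style (the wording of the ensures clauses is just a sketch):

    bool use_cuda (
    );
    /*!
        ensures
            - returns true if dlib will dispatch tensor operations to the CUDA
              implementations, and false if it will use the CPU implementations.
    !*/

    void set_use_cuda (
        bool flag
    );
    /*!
        ensures
            - #use_cuda() == flag
    !*/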

Comment on lines 108 to 114
if (use_cuda())
{
    cuda::dot_prods(add_to, out, lhs, rhs);
}
else
{
    out = sum_cols(pointwise_multiply(mat(lhs), mat(rhs)));
#endif
Owner


So this PR is a good idea. But can you change these so they are simpler? The preprocessor macro logic is now pretty complicated and makes the code hard to read.

But it could be done like this instead:

    void dot_prods (
        resizable_tensor& out, 
        const tensor& lhs, 
        const tensor& rhs
    )    
    {    
        IF_DLIB_USE_CUDA(
            cuda::dot_prods(out, lhs, rhs);
        ) 

        IF_DLIB_NOT_USE_CUDA(
            out = sum_cols(pointwise_multiply(mat(lhs), mat(rhs))); 
        )    
    }   

Which would be a pretty trivial mechanical change from what dlib has now and doesn't really change the readability of the code. And inside IF_DLIB_USE_CUDA you just have an if (use_cuda()) wrapped inside a #ifdef DLIB_USE_CUDA.

That is, define these macros:

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(expression) if (use_cuda()) { expression }
#else
#define IF_DLIB_USE_CUDA(expression) 
#endif

#ifdef DLIB_USE_CUDA
#define IF_DLIB_NOT_USE_CUDA(expression) if (!use_cuda()) { expression }
#else
#define IF_DLIB_NOT_USE_CUDA(expression) expression
#endif

and then any code that looks like this on master:

#ifdef DLIB_USE_CUDA
    foo;
    bar;
    whatever;
#else
    foo_cpu;
    whatever2;
#endif   

Just turns into:

IF_DLIB_USE_CUDA(
    foo;
    bar;
    whatever;
)

IF_DLIB_NOT_USE_CUDA(
    foo_cpu;
    whatever2;
)

@davisking
Owner

davisking commented Apr 19, 2025

Yeah sorry for the late review. Just been real busy with other things. Every day I see this in my email and think about how late I've been, but it hasn't made it up my todo list until today. Haven't had a lot of free time lately 😢

Anyway, change it around to use the macros I suggested. Then the PR will be a trivial and obviously correct mechanical change to dlib's existing code and I'll merge it right in :D

@kSkip
Contributor Author

kSkip commented May 6, 2025

Thanks! I also had difficulty finding the time to fix this.

I defined variadic macros because IF_DLIB_USE_CUDA(expression) implies a single argument and the preprocessor splits on all the commas in the wrapped code. I didn't test this with MSVC, but it should work as expected.
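For reference, the variadic form looks roughly like this (a sketch of the idea rather than the exact macros in the PR):

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(...) if (use_cuda()) { __VA_ARGS__ }
#define IF_DLIB_NOT_USE_CUDA(...) if (!use_cuda()) { __VA_ARGS__ }
#else
#define IF_DLIB_USE_CUDA(...)
#define IF_DLIB_NOT_USE_CUDA(...) __VA_ARGS__
#endif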

@davisking davisking requested a review from Copilot May 24, 2025 17:14
Cydral and others added 6 commits May 24, 2025 13:14
…king#3076)

* Implementation of linear_ layer for neural networks. This layer provides an optimized linear transformation for multi-dimensional inputs.

* Minor change

* Update dlib/dnn/layers.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Add reshape_to and flatten layers to Dlib's DNN module

* Missing update to "visitors.h"

* format fixing for reshape_to

* Update dlib/test/dnn.cpp

---------

Co-authored-by: Davis E. King <davis685@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

Adds runtime selection between CPU and CUDA implementations in the DNN module, replacing compile-time #ifdef DLIB_USE_CUDA branches with a use_cuda() flag.

  • Introduce use_cuda() and set_use_cuda(bool) APIs in core_abstract.h.
  • Update tests in dlib/test/dnn.cpp to toggle CUDA on/off at runtime.
  • Refactor dlib/cuda/tensor_tools.h to hold both CPU and CUDA instances and dispatch via IF_DLIB_USE_CUDA/IF_DLIB_NOT_USE_CUDA macros.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File                       Description
dlib/test/dnn.cpp          Add set_use_cuda(true/false) calls around CUDA-CPU tests
dlib/dnn/core_abstract.h   Declare use_cuda() and set_use_cuda(bool) with docs
dlib/cuda/tensor_tools.h   Define dispatch macros, duplicate flag declarations, and split each op into CPU/CUDA branches
Comments suppressed due to low confidence (2)

dlib/test/dnn.cpp:5029

  • The new runtime flag behavior is only exercised in test_multm_prev(). Consider adding tests in other modules to validate that use_cuda() correctly switches implementations for all runtime-dispatched operations.
set_use_cuda(true);

dlib/cuda/tensor_tools.h:36

  • use_cuda() and set_use_cuda() are already declared in dlib/dnn/core_abstract.h. Remove these duplicate declarations and include the core header instead to avoid ODR issues.
bool use_cuda();

Owner

@davisking davisking left a comment


Ah I just realized this needs more work. I've been going over it for the last hour and a bit. I was updating the unit tests to work with and without a GPU. But I just realized that even when use_cuda()==false it's still using cuda. You can see this by running this for example:

CUDA_VISIBLE_DEVICES= ./dnn_face_recognition_ex ../faces/bald_guys.jpg 

with this change:

--- a/examples/dnn_face_recognition_ex.cpp
+++ b/examples/dnn_face_recognition_ex.cpp
@@ -80,6 +80,7 @@ std::vector<matrix<rgb_pixel>> jitter_image(
 
 int main(int argc, char** argv) try
 {
+    dlib::set_use_cuda(false);
     if (argc != 2)
     {
         cout << "Run this example by invoking it like this: " << endl;

You will get the error shown below, since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory. So the tensor objects would need to be updated too, in a similar way to the other code, so that they don't actually do any allocations when cuda is disabled.

Error while calling cudaGetDevice(&the_device_id) in file /home/davis/source/dlib/dlib/cuda/gpu_data.cpp:204. code: 100, reason: no CUDA-capable device is detected

That CUDA_VISIBLE_DEVICES= tells the cuda runtime to pretend there isn't any GPU attached to the machine. That's really what you want to work here. So that someone can run this stuff on a non-GPU machine even though it's compiled with cuda.

Anyway, I pushed the code I had. If you update the tensors to also use use_cuda() then it should be good to go, unless I'm forgetting some other cuda tooling in there. But I think that's it. The key test is if prefixing programs with CUDA_VISIBLE_DEVICES= results in them running without error even though they are built with cuda support :D

I updated the unit tests in a way that should work with CUDA_VISIBLE_DEVICES=, but like I was saying, I went to test that after doing so and was reminded about the tensor allocation stuff :(

@kSkip
Contributor Author

kSkip commented May 25, 2025

Since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory.

Oh, good catch. I didn't check the behavior with CUDA_VISIBLE_DEVICES=. It seems like this can be accomplished by changing only the gpu_data class, unless I am missing something.

If we conditionally allocate the CUDA memory, we should also support the following scenario.

resizable_tensor data;

set_use_cuda(false);
data.set_size(batch_size, channels, width, height);
float* host_ptr = data.host();

// do something with host_ptr

set_use_cuda(true);
float* device_ptr = data.device();

// do something with device_ptr

Here we must check if the device buffer needs to be allocated when the data currency is checked.
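A rough sketch of that lazy allocation inside the currency check (the member names and the CHECK_CUDA error-checking macro here are placeholders, not dlib's actual gpu_data internals):

// Called whenever a device pointer is requested; part of a gpu_data-like class
// with members: float* data_device; float* data_host; size_t data_size; bool device_current;
float* device()
{
    if (data_device == nullptr)
    {
        // set_size() ran while use_cuda() was false, so no cudaMalloc has
        // happened yet; allocate the device buffer now that it is needed.
        CHECK_CUDA(cudaMalloc(&data_device, data_size*sizeof(float)));
        device_current = false;  // the new buffer doesn't hold the host data yet
    }
    if (!device_current)
    {
        // Usual currency check: refresh the device copy from the host copy.
        CHECK_CUDA(cudaMemcpy(data_device, data_host, data_size*sizeof(float),
                              cudaMemcpyHostToDevice));
        device_current = true;
    }
    return data_device;
}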

@kSkip
Contributor Author

kSkip commented May 25, 2025

Also, do you still want to add the do { } while(0) to the macro?
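(That is, wrapping the expansion in the usual statement-like idiom, roughly:

#ifdef DLIB_USE_CUDA
#define IF_DLIB_USE_CUDA(...) do { if (use_cuda()) { __VA_ARGS__ } } while(0)
#else
#define IF_DLIB_USE_CUDA(...) do { } while(0)
#endif

so each use expands to a single statement, at the cost of requiring a trailing semicolon at every call site.)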

@davisking
Owner

Since the resizable_tensor objects are all still calling cudaMalloc and allocating GPU memory.

Oh, good catch. I didn't check the behavior with CUDA_VISIBLE_DEVICES=. It seems like this can be accomplished by changing only the gpu_data class, unless I am missing something.

If we conditionally allocate the CUDA memory, we should also support the following scenario.

resizable_tensor data;

set_use_cuda(false);
data.set_size(batch_size, channels, width, height);
float* host_ptr = data.host();

// do something with host_ptr

set_use_cuda(true);
float* device_ptr = data.device();

// do something with device_ptr

Here we must check if the device buffer needs to be allocated when the data currency is checked.

Yep, that would make it all work smoothly unless I'm forgetting another part of it as well. But I think that's the only bit remaining.

@davisking
Owner

Also, do you still want to add the do { } while(0) to the macro?

Na, it's good how it is.
