GPU support - Design feedback requested #310

Draft: wants to merge 1 commit into development

Conversation

cicero-ai

Hi Lorenzo,

OK, I've submitted the PR as requested, but again this code is nowhere close to being ready to merge. However, I could use some design guidance if possible. I'm quite confident I have this right, but not 100% certain. It turns out GPU acceleration is rather intricate, with various design considerations and trade-offs.

To walk through how a typical run will work: first, as you can see within /src/gpu/models/logistic_regression/gradient_descent.rs, there's the TryFrom<> trait implementation.
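
Roughly, the conversion has this kind of shape (a simplified sketch for illustration, with placeholder types and fields, not the actual code in the PR):

use std::convert::TryFrom;

// Placeholder standing in for smartcore's real LogisticRegressionParameters;
// only here to keep the sketch self-contained.
struct LogisticRegressionParameters {
    max_iter: usize,
}

// Placeholder GPU-side module produced by the conversion.
struct GpuLogisticRegression {
    num_classes: usize,
}

impl<'a> TryFrom<(&'a LogisticRegressionParameters, usize)> for GpuLogisticRegression {
    type Error = ();

    // Succeeds only when this parameter combination has a GPU kernel;
    // otherwise fit() silently falls back to the existing CPU path.
    fn try_from((params, k): (&'a LogisticRegressionParameters, usize)) -> Result<Self, Self::Error> {
        if params.max_iter == 0 {
            return Err(());
        }
        Ok(GpuLogisticRegression { num_classes: k })
    }
}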

Then, within the fit() function of LogisticRegression, I will add something like:

#[cfg(feature = "gpu")]
if let Ok(gpu_module) = (&params, k).try_into() {
    // Add some cool way to check # rows / samples threshold, and if matrix large enough, run it on GPU
    let mut gpu_worker = GpuWorker::new(gpu_module, X.into(), &Y);
    return gpu_worker.run().into();
}

There will be similar small code blocks for each supported algorithm, and those are really the only modifications to the actual smartcore code base. The API is going to remain exactly the same.

As long as the TryFrom<> implementation converts whatever we throw at it into a Box (indicating the algorithm supports GPU acceleration), and the number of rows / samples is large enough, a GpuWorker struct is simply instantiated; it handles everything and returns the exact same result the smartcore crate expects.
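
In other words, the calling contract is small. As a rough illustration (placeholder types and elided method bodies, not the real GpuWorker):

// Placeholder for whatever GPU-side module the TryFrom<> conversion produced.
struct GpuModule;

// Simplified stand-in for the real GpuWorker: it owns the converted module
// plus the input data, and run() hands back a result in the same shape the
// CPU code path would have produced.
struct GpuWorker {
    module: Box<GpuModule>,
    x: Vec<Vec<f64>>,
    y: Vec<f64>,
}

impl GpuWorker {
    fn new(module: Box<GpuModule>, x: Vec<Vec<f64>>, y: Vec<f64>) -> Self {
        GpuWorker { module, x, y }
    }

    // Upload the data, dispatch the compute passes, read the coefficients
    // back. The body is elided here; the point is only the in/out contract.
    fn run(&mut self) -> Vec<f64> {
        let _ = (&self.module, &self.y);
        vec![0.0; self.x.first().map_or(0, |row| row.len())]
    }
}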

I think that should work, right?

Then there's the GpuStation struct, which I think I got right... it's a global static that caches / lazy-loads the various GPU resources, which I believe I have grouped together correctly. It only supports one GPU device for now; I figured I'd get one device working first, then enhance it with multi-device support later if needed.
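
The caching part is the standard lazily initialised global pattern, something along these lines (a minimal sketch with made-up fields, assuming std::sync::OnceLock, not the exact layout in the PR):

use std::sync::OnceLock;

// Stand-in field; the real struct groups the device/queue handles and the
// cached pipelines behind the one global.
struct GpuStation {
    device_name: String,
}

static GPU_STATION: OnceLock<GpuStation> = OnceLock::new();

fn gpu_station() -> &'static GpuStation {
    // The first caller pays the device-initialisation cost; every later
    // call just returns the cached reference. One device only for now.
    GPU_STATION.get_or_init(|| GpuStation {
        device_name: "default-adapter".to_string(),
    })
}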

One somewhat unfortunate thing is that there's no real way to keep the model architectures and solvers separate like in the smartcore design. That separation just isn't going to work for GPU acceleration.

Then there are just a bunch of other little design trade-offs throughout. For example, I don't like that GpuAlgorithm enum in there, as it seems like a dirty, lost orphan enum to me, but it's actually kind of required. We need some type of hashable unique identifier for each algorithm, since it's used as the hash map key of the workgroups property of GpuStation. And when you start playing around trying to put a trait object inside a hash map key wrapped in a mutex, things get really messy really quickly.
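
To show why the enum is needed, here's the gist (simplified, with made-up variant names and a placeholder pipeline type):

use std::collections::HashMap;
use std::sync::Mutex;

// A plain enum derives Eq + Hash for free, so it can be the HashMap key
// behind a Mutex; a boxed trait object in that position can't, without a
// lot of extra machinery.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum GpuAlgorithm {
    LogisticRegressionGradientDescent,
    LinearRegression,
}

struct GpuStation {
    // One cached, compiled workgroup / pipeline per algorithm.
    workgroups: Mutex<HashMap<GpuAlgorithm, Vec<u8>>>,
}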

Anyway, I think I'm on the right track, but if I could get some extra eyes on this it would be greatly appreciated. Absolutely no rush, as again I'm going to pivot off this for a couple weeks and get the next version of my NLU engine out. I will be back to finish this though.

Cheers,
Matt

@Mec-iS
Collaborator

Mec-iS commented Jun 20, 2025

Thank you!

Sure, let's take the time we need. We also need to consider everything from the user-experience perspective, as you mentioned.

If we can ensure that those checks have zero overhead and we document them correctly, it should not be a problem to have the method fork into a GPU-specific routine.

@cicero-ai
Author

Sounds good. Yes, the core tenets of my approach are a transparent GPU acceleration layer, minimal changes to the code base, zero API changes, and zero changes required to developer code. Existing devs will just have to enable the new "gpu" feature, run their existing code, and get a performance boost. Optionally, they can set the threshold if they desire; nothing else.

Overhead for the checks is nearly non-existent. Checking the # of rows / samples against the threshold is O(1), and the TryFrom<> conversion costs slightly more, but not much. I'm not worried about that part; I'm more worried about the architecture of the GPU layer in and of itself.
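
To make that concrete (the constant and the signature here are just illustrative, not actual values from the PR):

// Default is made up for the example; the real threshold would be configurable.
const DEFAULT_GPU_THRESHOLD: usize = 50_000;

fn should_use_gpu(num_rows: usize, user_threshold: Option<usize>) -> bool {
    // A single integer comparison per fit() call, so effectively free.
    num_rows >= user_threshold.unwrap_or(DEFAULT_GPU_THRESHOLD)
}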

Anyway, I'm back to my NLU engine and will swing back to this in a couple of weeks. If I don't hear from you before then, I'll drop a message here to hopefully restart the discussion before I jump back in, to ensure I'm on the right track.

Question: is this feature beneficial and worthwhile for the crate? Essentially, what happened is I had three choices: burn out my laptop's CPU, bring Python into my Rust pipeline, or just add GPU acceleration to smartcore and keep everything in Rust. I went with the last option.

Cheers,
Matt
