GPU support - Design feedback requested #310
Hi Lorenzo,
Ok, PR submitted as requested, but again this code is nowhere close to ready to merge. However, I could use some design guidance if possible. I'm fairly confident I have this right, but not 100% certain. It turns out GPU acceleration is rather intricate, with various design considerations and trade-offs.
To walk through how a typical run will work: first, as you can see in /src/gpu/models/logistic_regression/gradient_descent.rs, there's the TryFrom<> trait implementation.
Then within the fit() function of LogisticRegression I will add something like:
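(Rough sketch only, to show the shape of it; `GpuRunnable`, `GpuWorker` and `GPU_MIN_ROWS` are placeholder names rather than the exact code on the branch:)

```rust
// Inside LogisticRegression::fit(), before the existing CPU solver runs.
// Placeholder names throughout: GpuRunnable is the trait sketched further down,
// GpuWorker drives the device, GPU_MIN_ROWS is the "worth shipping to the GPU" cutoff.
if let Ok(gpu_algorithm) = Box::<dyn GpuRunnable>::try_from(&parameters) {
    if x.shape().0 >= GPU_MIN_ROWS {
        // Large enough problem: let the GPU worker run the whole fit and hand
        // back the exact same result type the CPU path would produce.
        return GpuWorker::new(gpu_algorithm).fit(x, y, parameters);
    }
}
// Otherwise fall through to the existing CPU implementation below.
```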
There will be a similar small code block for each supported algorithm, and those are really the only modifications to the actual smartcore code base. The public API is going to remain exactly the same.
As long as the TryFrom<> implementation converts whatever we throw at it into a Box (meaning the algorithm supports GPU acceleration), and the number of rows / samples is large enough, a GpuWorker struct is simply instantiated; it handles everything and returns exactly the same result the smartcore crate expects.
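To put that contract in code, it's roughly the following (a self-contained sketch with made-up names; `GpuRunnable`, `LrSolver`, `LrParams` and `GpuGradientDescent` stand in for the real trait and smartcore parameter types, so don't read too much into the details):

```rust
use std::convert::TryFrom;

/// Stand-in for whatever trait GpuWorker drives; the real one lives under /src/gpu/.
pub trait GpuRunnable {
    /// Run the whole fit on the GPU and return the flattened coefficients.
    fn run(&self, x: &[f32], y: &[f32]) -> Vec<f32>;
}

/// Placeholder parameter types (the real ones are smartcore's).
pub enum LrSolver {
    GradientDescent,
    Lbfgs,
}

pub struct LrParams {
    pub solver: LrSolver,
    pub max_iter: usize,
}

/// Placeholder for the struct in /src/gpu/models/logistic_regression/gradient_descent.rs.
pub struct GpuGradientDescent {
    max_iter: usize,
}

impl GpuRunnable for GpuGradientDescent {
    fn run(&self, _x: &[f32], _y: &[f32]) -> Vec<f32> {
        // ... upload buffers, dispatch the compute shader, read coefficients back ...
        Vec::new()
    }
}

// The contract: conversion succeeds only when this algorithm / solver combination
// has a GPU implementation; an Err just means the caller stays on the CPU path.
impl TryFrom<&LrParams> for Box<dyn GpuRunnable> {
    type Error = ();

    fn try_from(params: &LrParams) -> Result<Self, Self::Error> {
        match params.solver {
            LrSolver::GradientDescent => Ok(Box::new(GpuGradientDescent {
                max_iter: params.max_iter,
            })),
            LrSolver::Lbfgs => Err(()),
        }
    }
}
```

The important part is only that a successful conversion is the signal that a GPU path exists for that solver.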
I think that should work, right?
Then there's the GpuStation struct, which I think I got right... it's a global static that caches / lazy-loads the various GPU resources, which I believe I've grouped together correctly. It only supports one GPU device for now; I figured I'd get one working first, then add multi-device support later if needed.
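The lazy global looks roughly like this (just a sketch; `OnceLock` is one way to do it, and the fields are placeholders for the real device / queue handles):

```rust
use std::sync::{Mutex, OnceLock};

// Rough shape only; the real struct holds the actual GPU handles (adapter, device,
// queue, buffer pools) plus the per-algorithm workgroup cache shown further down.
pub struct GpuStation {
    pub device_index: usize, // single device for now
    pub adapter_name: String,
}

static GPU_STATION: OnceLock<Mutex<GpuStation>> = OnceLock::new();

// First call pays for device discovery / initialisation; every later fit() call
// just borrows the same cached resources.
pub fn gpu_station() -> &'static Mutex<GpuStation> {
    GPU_STATION.get_or_init(|| {
        Mutex::new(GpuStation {
            device_index: 0,
            adapter_name: String::from("default adapter"),
        })
    })
}
```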
One somewhat unfortunate thing is that there's no real way to keep the model architectures and solvers separate like in the smartcore design. That separation just isn't going to work for GPU acceleration.
Then there are just a bunch of other little design trade-offs throughout. For example, I don't like that GpuAlgorithm enum in there, as it seems like a dirty, lost orphan enum to me, but it's actually kind of required. We need some kind of hashable unique identifier for each algorithm, because it's used as the hash map key for the workgroups property of GpuStation. And when you start trying to use a trait object as a hash map key wrapped in a mutex, things get really messy really quickly.
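To make that concrete, the reason the enum earns its keep is roughly this (sketch with placeholder names; the point is just that a fieldless enum derives `Eq + Hash` for free, while a boxed trait object key would need hand-written `Hash`/`PartialEq` over `dyn Trait`):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// One variant per GPU-capable algorithm; the derives are what make it usable
// as a HashMap key behind a Mutex without any ceremony.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum GpuAlgorithm {
    LogisticRegressionGd,
    LinearRegressionGd,
}

// Placeholder for whatever compiled shader / pipeline state gets cached per algorithm.
pub struct Workgroup {
    pub label: &'static str,
}

// Shape of the workgroups lookup on GpuStation: compile on first request, reuse after.
pub fn get_or_compile(
    cache: &Mutex<HashMap<GpuAlgorithm, Workgroup>>,
    algo: GpuAlgorithm,
) -> &'static str {
    let mut map = cache.lock().unwrap();
    map.entry(algo)
        .or_insert_with(|| Workgroup { label: "compiled on first use" })
        .label
}
```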
Anyway, I think I'm on the right track, but if I could get some extra eyes on this it would be greatly appreciated. Absolutely no rush, as again I'm going to pivot off this for a couple weeks and get the next version of my NLU engine out. I will be back to finish this though.
Cheers,
Matt