Very rough experiments on running random forest pixel classification on the GPU
This project does pixel classification (not training yet) completely on the GPU: it computes the pixel features, runs the random forest on those features, and classifies each pixel.
1. In Python, I created a scikit Random Forest and trained it on features that are Gaussian smoothings of the original image
    1. Export the trees in the forest via `export_graphviz` (I saved a forest in `10_feats_tree/`)
        - This is just so we don't have to implement the training ourselves, and so we can compare results with scikit
2. In Rust, we generate the Gaussian smoothing kernels (`gaussian_blur.rs`) matching the ones used for the Random Forest training
    1. Then, we upload those kernels to the GPU (`KernelsInBuffSlot::new`) as GPU buffers for now, but maybe we should use textures instead? Would sampling be faster?
3. Parse the output of step 1.1 in `src/decisiton_tree.rs`
4. Generate a compute shader that applies every kernel to every pixel and stores each "feature" in a compute shader variable called `feature_<feature_index>`
    1. Use the parsed trees from step 1.1 to produce a bunch of `if`/`else` statements at the end of the compute shader via `forest.write_wgsl()`, which classify the pixel (see the sketch after this list)
5. Send an image over to the GPU, run the compute shader on it, then copy the results back
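To make step 4.1 a bit more concrete, here is a minimal Rust sketch of how a parsed tree could be flattened into nested WGSL `if`/`else` statements. The `Node` layout and the `feature_<i>` / `class` variable names are assumptions for illustration, not the actual code in `src/decisiton_tree.rs` or `forest.write_wgsl()`, and combining the votes of several trees is not shown:

```rust
// Minimal sketch: turn one parsed decision tree into nested WGSL if/else statements.

enum Node {
    Leaf {
        class: u32,
    },
    Split {
        feature_index: usize,
        threshold: f32,
        left: Box<Node>,
        right: Box<Node>,
    },
}

fn tree_to_wgsl(node: &Node, indent: usize, out: &mut String) {
    let pad = "    ".repeat(indent);
    match node {
        Node::Leaf { class } => {
            // a leaf just assigns the predicted class for this tree
            out.push_str(&format!("{pad}class = {class}u;\n"));
        }
        Node::Split { feature_index, threshold, left, right } => {
            // scikit-learn sends `feature <= threshold` to the left child
            out.push_str(&format!("{pad}if (feature_{feature_index} <= {threshold}) {{\n"));
            tree_to_wgsl(left, indent + 1, out);
            out.push_str(&format!("{pad}}} else {{\n"));
            tree_to_wgsl(right, indent + 1, out);
            out.push_str(&format!("{pad}}}\n"));
        }
    }
}

fn main() {
    // A tiny one-split tree, thresholding feature_0
    let tree = Node::Split {
        feature_index: 0,
        threshold: 0.5,
        left: Box::new(Node::Leaf { class: 0 }),
        right: Box::new(Node::Leaf { class: 1 }),
    };
    let mut wgsl = String::new();
    tree_to_wgsl(&tree, 1, &mut wgsl);
    println!("{wgsl}");
}
```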
So we load the first pixel, compute all features for the first pixel, then classify the first pixel; then we load the second pixel, compute all of its features, classify it, and so on. I'd like to believe we get better cache usage computing features like this instead of "all pixels for the first feature, then all pixels for the second feature, and so on", because that order requires loading the raw data pixels as many times as there are features (I still don't know what the caching effects of fetching the halos are, though). The sketch below contrasts the two traversal orders.
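A toy CPU-side sketch of the two orders, with a box filter standing in for the Gaussian smoothing and a dummy `classify` (both hypothetical stand-ins); the real work happens in the generated WGSL shader:

```rust
// Per-pixel vs. per-feature traversal over a 1D "image".

fn smooth_at(image: &[f32], i: usize, radius: usize) -> f32 {
    // box filter standing in for a Gaussian smoothing at pixel i
    let lo = i.saturating_sub(radius);
    let hi = (i + radius + 1).min(image.len());
    image[lo..hi].iter().sum::<f32>() / (hi - lo) as f32
}

fn classify(features: &[f32]) -> u32 {
    // placeholder for the forest's if/else cascade
    (features[0] > 0.5) as u32
}

// Per-pixel order: every feature reads the same neighbourhood of `image`
// while it is (hopefully) still in cache.
fn per_pixel(image: &[f32], radii: &[usize]) -> Vec<u32> {
    (0..image.len())
        .map(|i| {
            let features: Vec<f32> = radii.iter().map(|&r| smooth_at(image, i, r)).collect();
            classify(&features)
        })
        .collect()
}

// Per-feature order: the raw image is traversed once per feature,
// i.e. radii.len() times in total.
fn per_feature(image: &[f32], radii: &[usize]) -> Vec<u32> {
    let feature_maps: Vec<Vec<f32>> = radii
        .iter()
        .map(|&r| (0..image.len()).map(|i| smooth_at(image, i, r)).collect())
        .collect();
    (0..image.len())
        .map(|i| {
            let features: Vec<f32> = feature_maps.iter().map(|m| m[i]).collect();
            classify(&features)
        })
        .collect()
}

fn main() {
    let image = vec![0.1_f32, 0.9, 0.3, 0.7, 0.5];
    let radii = [1_usize, 2];
    // both orders compute exactly the same result; only the memory access pattern differs
    assert_eq!(per_pixel(&image, &radii), per_feature(&image, &radii));
}
```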
With this approach we are parallelizing at the pixel level. Also, a GPU can produce 4K real-time graphics these days; that's 3840 × 2160 pixels (about 8.3 million) every 16 ms, which surely we could use somehow, but:
- There is the monstrous caveat that when gaming, one is not moving all those pixels into and out of the GPU
- But we also wouldn't need to keep moving data if we could render our results directly to the screen.
- Also, you can run real-time applications "in software", say via LLVMPIPE, and the rendered frames are somehow shipped over to the GPU pretty quickly
Moving data into and out of the GPU takes the majority of the processing time and feels kinda wrong (see the comments about LLVMPIPE above)
Right now I just take timestamps wherever I think it makes sense (roughly as in the sketch below)
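Concretely, that amounts to coarse wall-clock timing around each phase, something along these lines; the phase functions here are hypothetical placeholders, not this repo's actual API:

```rust
use std::time::Instant;

// Hypothetical phase functions; the real upload/dispatch/readback code has
// different names. This only shows where the timestamps are taken.
fn upload_image_and_kernels() {}
fn dispatch_compute_shader_and_wait() {}
fn copy_results_back() {}

fn main() {
    let t0 = Instant::now();
    upload_image_and_kernels();
    let t1 = Instant::now();
    dispatch_compute_shader_and_wait();
    let t2 = Instant::now();
    copy_results_back();
    let t3 = Instant::now();

    println!("upload:   {:?}", t1 - t0);
    println!("compute:  {:?}", t2 - t1);
    println!("readback: {:?}", t3 - t2);
}
```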
Hessian of Gaussian, Structure Tensor eigenvalues, etc. all still need an implementation
No separable kernels, for example (i.e. we do one 2D convolution per smoothing instead of two cheaper 1D passes). Also, maybe it would be faster to compute the kernel values on the fly instead of reading them out of a buffer or texture (see the sketch below)
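For reference, here is a Rust sketch of both options for a 1D Gaussian: precomputing a kernel once versus evaluating the weight on the fly at each tap. The 3σ truncation and sum-to-one normalization are assumptions for illustration, not necessarily what `gaussian_blur.rs` does:

```rust
// Precompute a normalized 1D Gaussian kernel, truncated at 3*sigma.
fn gaussian_kernel_1d(sigma: f32) -> Vec<f32> {
    let radius = (3.0 * sigma).ceil() as i32;
    let mut kernel: Vec<f32> = (-radius..=radius)
        .map(|x| (-(x as f32).powi(2) / (2.0 * sigma * sigma)).exp())
        .collect();
    let sum: f32 = kernel.iter().sum();
    for w in &mut kernel {
        *w /= sum; // normalize so the blur preserves overall brightness
    }
    kernel
}

// The same weight computed on the fly (unnormalized); in a shader this trades
// a buffer/texture fetch for one exp() per tap.
fn gaussian_weight(x: i32, sigma: f32) -> f32 {
    (-(x as f32).powi(2) / (2.0 * sigma * sigma)).exp()
}

fn main() {
    let sigma = 1.5;
    let kernel = gaussian_kernel_1d(sigma);
    let radius = (kernel.len() / 2) as i32;
    // Both paths give the same weights up to the normalization factor.
    let sum: f32 = (-radius..=radius).map(|x| gaussian_weight(x, sigma)).sum();
    for (i, w) in kernel.iter().enumerate() {
        let x = i as i32 - radius;
        assert!((w - gaussian_weight(x, sigma) / sum).abs() < 1e-6);
    }
    println!("kernel taps: {}", kernel.len());
}
```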
How do we even test for that? Maybe just firing multiple processes would give us an idea of how much bandwidth/compute we're wasting
We know that transfer speeds are faster in this case, as we'd expect
The `wgpu` library (and maybe the WebGPU spec itself) is far more high-level than something like Vulkan, and it lacks some features that could help with performance. In particular, it only has one submission queue (where you put work for the GPU to do), and this might hinder our ability to upload more data while the previous batch is still crunching, but I'm not sure.
There is a limit to how many buffers you're allowed to have in a shader, but separating the kernels into independent buffers or putting them all into a single giant buffer made no difference
I think the `wgpu` library is internally locked/synchronized anyway, and this made no difference in throughput
No difference between tile sizes, except that small tiles would give better latency, since the user could see gradual progress