What's the meaning of "Partial GPU support"? #999
Replies: 1 comment
-
It is a good question; I don't yet understand the bottlenecks and performance tuning trade-offs of this either.

When tuning the processor and thread values for a run, I found that specifying more threads than the actually available processor resources seemed to slow the operation down significantly. It is not clear to me whether setting the "processors & threads" values to match the available CPU processors & threads is optimal, or whether a value equal only to the physical CPU core count (without multi-threading beyond it), or even fewer than the physical core count, would be better. I don't yet know the answer to tuning this.

Also, in my small tests so far, with a few hours of audio spread across a few different files, it seemed like one could conceivably parallelize the processing of different files onto distinct CPU cores. And if there were a way to dedicate a given GPU to a given "main" whisper.cpp process's CUBLAS use, one could run a couple of GPUs in parallel across processes; maybe it is even somehow possible (???) to use more than one GPU in parallel for a single "main" process with CUBLAS or possibly OpenCL.

The CPU activity shows every one of the several cores I allocated to a single whisper.cpp "main" process saturated at roughly 95-100% load for essentially the whole run time, so it is clearly CPU intensive almost all the time, even when I assume it is also using CUBLAS to exercise the GPU. My guess is that the CPU and maybe the RAM bandwidth are the most heavily loaded, and the GPU's VRAM bandwidth / compute cores less so, though I don't have RAM or GPU statistics to verify that impression.

Decoding was the largest and slowest stage reported for me, with the reported encode time amounting to roughly 50-80% of the decode time, e.g.:

whisper_print_timings: load time = 3917.91 ms

So with 10 threads + CUBLAS it was running slightly faster than "real time" with the large model:

system_info: n_threads = 10 / 10 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | COREML = 0 |
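To make the multi-process idea above concrete, here is a rough, untested sketch. It assumes two GPUs, a CUBLAS-enabled build of the `main` example, and placeholder model/audio file names; `CUDA_VISIBLE_DEVICES` is the standard CUDA mechanism for pinning a process to a single GPU:

```sh
# Two independent whisper.cpp processes, one pinned to each GPU, each working on its own file.
# Split the CPU threads between them so the total stays at or below the physical core count
# (assumed here: 10 cores, so 5 threads per process).
CUDA_VISIBLE_DEVICES=0 ./main -m models/ggml-large.bin -t 5 -f part1.wav > part1.txt 2>&1 &
CUDA_VISIBLE_DEVICES=1 ./main -m models/ggml-large.bin -t 5 -f part2.wav > part2.txt 2>&1 &
wait
```

Whether this actually helps will depend on where the real bottleneck is (CPU, RAM bandwidth, or GPU), which I haven't measured.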
-
Hello,
I set up my environment to support CUDA and compiled with the CUDA option turned on.
Everything works fine.
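For reference, a minimal build sketch; the exact option name can differ between whisper.cpp versions, and `WHISPER_CUBLAS` is assumed here:

```sh
# Build the whisper.cpp examples with cuBLAS enabled
# (option name is an assumption; check the README for your version)
make clean
WHISPER_CUBLAS=1 make -j

# or, with CMake:
# cmake -B build -DWHISPER_CUBLAS=ON
# cmake --build build -j
```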
Just curious: what is the meaning of the "partial support"?
When I ran inference on an audio file, the decode time was much faster than CPU-only inference, but it still seems limited by the CPU cores.
E.g. I have 8 CPU cores and can allocate at most 8 threads for 1 decoder, or n threads for m processors (n*m=8).
But if I set the number of threads above the CPU core count, decoding becomes very slow. The bottleneck seems to be the CPU.
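For illustration, a sketch of the configurations being compared, assuming the stock `main` example with its `-t/--threads` and `-p/--processors` options and placeholder model/audio paths:

```sh
# 8 threads, 1 processor: t*p = 8, matching the physical core count
./main -m models/ggml-large.bin -f samples/audio.wav -t 8 -p 1

# 4 threads on each of 2 parallel processors: still t*p = 8
./main -m models/ggml-large.bin -f samples/audio.wav -t 4 -p 2

# Oversubscribed: 16 threads on an 8-core machine -- the case that decodes very slowly
./main -m models/ggml-large.bin -f samples/audio.wav -t 16 -p 1
```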
I expected that by using CUDA on the GPU I would get a big benefit from concurrent decoding, and a large speedup as well.
Any ideas?