What's the meaning of "Partial GPU support"? #999
Replies: 1 comment
-
It is a good question; I don't yet understand the bottlenecks and performance tuning trade-offs of this either.

When tuning the processor and thread values for a run, I found that specifying more threads than the actually available processor resources seemed to slow the operation down significantly. It is not clear to me whether setting the "processors & threads" values to match the available CPU processors & threads is optimal, or whether a value equal only to the physical CPU core count (without multi-threading beyond it), or even fewer than the physical core count, would be better. I don't yet know the answer to tuning this.

Also, in my small tests so far, with a few hours of audio spread across a few different files, it seemed like one could conceivably parallelize the processing of different files onto distinct CPU cores. And if there were a way to dedicate a given GPU to a given "main" whisper.cpp process's CUBLAS use, one could run a couple of GPUs in parallel across processes; maybe it is even somehow possible (???) to use more than one GPU in parallel for a single "main" process with CUBLAS or possibly OpenCL.

The CPU activity shows every one of the several cores I allocated to a single whisper.cpp "main" process saturated at roughly 95-100% load for essentially the whole run time, so it is clearly CPU intensive almost all the time, even when I assume it is also using CUBLAS to exercise the GPU. My guess is that the CPU and maybe the RAM bandwidth are the most heavily loaded, and the GPU's VRAM bandwidth / compute cores less so, though I don't have RAM or GPU statistics to verify that impression.

Decoding was the largest and slowest stage reported for me, with the reported encode time amounting to roughly 50-80% of the decode time, e.g.:

whisper_print_timings: load time = 3917.91 ms

So with 10 threads + CUBLAS it was running slightly faster than "real time" with the large model:

system_info: n_threads = 10 / 10 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | COREML = 0 |
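To make the multi-process idea above concrete, here is a rough, untested sketch. It assumes two GPUs, a CUBLAS-enabled build of the `main` example, and placeholder model/audio file names; `CUDA_VISIBLE_DEVICES` is the standard CUDA mechanism for pinning a process to a single GPU:

```sh
# Two independent whisper.cpp processes, one pinned to each GPU, each working on its own file.
# Split the CPU threads between them so the total stays at or below the physical core count
# (assumed here: 10 cores, so 5 threads per process).
CUDA_VISIBLE_DEVICES=0 ./main -m models/ggml-large.bin -t 5 -f part1.wav > part1.txt 2>&1 &
CUDA_VISIBLE_DEVICES=1 ./main -m models/ggml-large.bin -t 5 -f part2.wav > part2.txt 2>&1 &
wait
```

Whether this actually helps will depend on where the real bottleneck is (CPU, RAM bandwidth, or GPU), which I haven't measured.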
-
Hello,
I set up my environment to support CUDA and compiled with the CUDA option turned on.
Everything works fine.
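For reference, a minimal build sketch; the exact option name can differ between whisper.cpp versions, and `WHISPER_CUBLAS` is assumed here:

```sh
# Build the whisper.cpp examples with cuBLAS enabled
# (option name is an assumption; check the README for your version)
make clean
WHISPER_CUBLAS=1 make -j

# or, with CMake:
# cmake -B build -DWHISPER_CUBLAS=ON
# cmake --build build -j
```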
Just curious: what is the meaning of the "partial support"?
When I ran inference on an audio file, the decode time was much faster than CPU-only inference, but it still seems limited by the CPU cores.
E.g. I have 8 CPU cores and can allocate at most 8 threads for 1 decoder, or n threads for m processors (n*m=8).
But if I set the number of threads above the CPU core count, decoding becomes very slow. The bottleneck seems to be the CPU.
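For illustration, a sketch of the configurations being compared, assuming the stock `main` example with its `-t/--threads` and `-p/--processors` options and placeholder model/audio paths:

```sh
# 8 threads, 1 processor: t*p = 8, matching the physical core count
./main -m models/ggml-large.bin -f samples/audio.wav -t 8 -p 1

# 4 threads on each of 2 parallel processors: still t*p = 8
./main -m models/ggml-large.bin -f samples/audio.wav -t 4 -p 2

# Oversubscribed: 16 threads on an 8-core machine -- the case that decodes very slowly
./main -m models/ggml-large.bin -f samples/audio.wav -t 16 -p 1
```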
I expected that by using CUDA on the GPU I would get a big benefit from concurrent decoding, and a large speedup as well.
Any ideas?