hi @rampa3 - I'm totally in favor of this, and help on this would be much appreciated. If you open up the PR, we can work out the details along the way.
I have been experimenting with producing CPU-only builds of the Python LocalAI backends using the CPU-only version of Torch (`+cpu`) from Torch's own repository (`--extra-index-url https://download.pytorch.org/whl/cpu`). This prevents pulling CUDA dependencies into CPU builds, which is what Torch from PyPI does, since the PyPI wheel is just a copy of the CUDA Torch. In my opinion, at least some of these backends might be worth building officially. I don't want to just outright start working on a PR for something no one but me wants, so I am opening this discussion first. (A minimal install sketch is below.)
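For reference, a minimal sketch of the install approach (the version pin shown is illustrative, not necessarily what my branch uses):

```sh
# Pull the CPU-only (+cpu) Torch wheel from PyTorch's own index instead of
# the CUDA copy that PyPI serves. Pinning the +cpu local version tag makes
# sure pip cannot fall back to the CUDA wheel from PyPI.
# (Version number is illustrative.)
pip install "torch==2.4.1+cpu" --extra-index-url https://download.pytorch.org/whl/cpu
```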
So far, I have these results producing pure-CPU builds (exllama2 is omitted, as it is a CUDA-only backend):

- `rfdetr-base` model from the gallery is not valid (the backend has an official build, but it is being built against CUDA Torch from PyPI; the image is very big for no reason for CPU usage)

For the non-working or untested ones, I plan to return to them over time when I have a while, attempting to get them all working.
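For anyone verifying their own builds, a quick sanity check (assuming a Python environment inside the built image) that the CPU wheel actually landed:

```sh
# torch.version.cuda is None and CUDA is unavailable on CPU-only wheels;
# the version string carries the +cpu local tag.
python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
# expected on a CPU-only build: e.g. "2.4.1+cpu None False"
```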
If anyone else wants to try working on these as well, my experimental (and probably a bit messy) CPU branch is available here (it might not always be up to date): https://github.com/rampa3/LocalAI/tree/python_backends_cpu_build_tweaks
I look forward to further discussion about the possibility of official CPU builds of the Python backends.