
Conversation


@rampa3 rampa3 commented Aug 7, 2025

Description

This PR provides tweaks and additions for building CPU builds of Python LocalAI backends. Based on discussion #5980.

That entails these changes:

  • Patching libbackend to use more specific conditions for creating Intel ARC builds (a572ddf) - mainly important for building from source without Docker, since the originally used way of detecting Intel ARC is prone to misfire, for example when building on Arch Linux with Blender installed, because Blender's hard dependency on Intel oneAPI occupies the checked path /opt/intel/
  • Patching all requirements-cpu.txt files (except for exllama2, which is a CUDA-only backend; for vLLM the build process is also patched) to use CPU Torch from the PyTorch repository (416f212, 9e65421, 09a32ed, 131a590, b50cdd2, 7986a67, 2644b31, 6938d6d, 3c09e79, 5d4aad5), since PyPI only carries the CUDA release
  • Fixing the bark requirements install order to prevent CUDA from being pulled into non-CUDA builds (416f212)
  • Checking whether an XPU is really available before using it in diffusers and transformers (704753d, rampa3@076fd9c) - mainly important when running builds made from source without Docker, for the same reason as above: the original Intel ARC detection misfires, for example on Arch Linux with Blender installed, because Blender's hard dependency on Intel oneAPI occupies the checked path /opt/intel/ (see the sketch after this list)
  • Patching diffusers to enforce stricter CPU mode switching, add CPU optimizations, and prevent deadlocks (float16/bfloat16 are not usable on CPU) and OOMs (6ba3b94, e16f605, 64d4b70, a25ff94)
  • Patching faster-whisper to add CPU switching logic, including a float type switch (1d94f2d)
  • Patching transformers to load libraries on CPU (rampa3@076fd9c) and to only use float32 while running on CPU (c77851d)
  • Updating the vLLM CPU build process (5d4aad5) and adding logic for it to the main Makefile (fd5656b)
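
A minimal sketch of the device/dtype guard pattern referenced in the list above (hypothetical helper name; the actual diffusers/transformers changes differ in detail): verify that an XPU is really exposed by the runtime instead of inferring it from /opt/intel/, and force float32 on CPU, since float16/bfloat16 are not usable there.

import torch

def select_device_and_dtype(requested_dtype):
    # Only pick XPU when the runtime actually exposes one, instead of
    # guessing from the presence of /opt/intel/ on the host.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu", requested_dtype
    if torch.cuda.is_available():
        return "cuda", requested_dtype
    # Half precision deadlocks or is unsupported on CPU, so always
    # fall back to float32 there, regardless of the request.
    return "cpu", torch.float32

device, dtype = select_device_and_dtype(torch.float16)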

Notes for Reviewers

The current state of the CPU builds (based on testing from a few days ago) is:

  • Working (tested for coherent output) backends:
    • bark - slow, but yields usable output
    • coqui - framework - speed depends on the model
    • diffusers - SD MODELS ONLY, decent speed with SD models
      • float16 and bfloat16 computation cause a deadlock if attempted on CPU - code modifications preventing float16/bfloat16 from even being attempted on CPU would be needed
      • attempting float32 on Flux causes an OOM very quickly; Lumina was not attempted after what Flux did - Flux and Lumina would need to be blocked on CPU
    • chatterbox - very fast considering that it runs on a CPU
    • rerankers - work pretty fast on CPU
    • transformers - framework - works; usability depends on the chosen model
  • Probably working (I wasn't able to test functionality; I just know the server starts and listens):
    • kokoro - not sure how to use the backend the right way; I kept bumping into issues trying to load a model
    • rfdetr - when I attempted testing, the backend claimed that the rfdetr-base model from the gallery is not a valid model (it has an official build, but that one is built against CUDA Torch from PyPI, so the image is needlessly large for CPU usage)
  • Not working backends:
    • faster-whisper - does not output over gRPC - the for loop that is supposed to iterate over the output segments never iterates (see the sketch after this list)
      • rigging the backend to dump the text into a file for debugging instead produces a full and correct transcript of the supplied speech
      • manually producing a dummy test segment and appending it to the result array does return it correctly over gRPC
      • the faster-whisper → gRPC hand-off of the segments fails, and I have no idea why
    • vllm - attempted to resurrect CPU-only vLLM builds from source; the vLLM build process does not install the built library correctly - needs more debugging
      • building vLLM from source, even after updating the build code, produces a .egg package for some reason, not a wheel - Python 3.10 under uv probably cannot see it; according to the uv docs, eggs are deprecated and not supported
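
One possible lead on the faster-whisper hand-off failure (an assumption, not something confirmed here): transcribe() in faster-whisper returns a lazy generator of segments, so if that generator is consumed once (for example by a debug dump) or created in a different scope than the one building the gRPC reply, a later loop over it yields nothing. A minimal sketch that materializes the segments up front would rule that out:

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="float32")
segments, info = model.transcribe("speech.wav")

# transcribe() returns a generator; materialize it once so the segments can
# be both logged for debugging and iterated again for the gRPC response.
segments = list(segments)

for segment in segments:
    print(segment.start, segment.end, segment.text)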

Kokoro and rfdetr still need to be tested. For faster-whisper, I am not sure why it refuses to work, and I need to hear from someone running a GPU build whether it is an issue with the CPU build or not; the situation is weird, as the only change is the use of CPU Torch, and the transcription itself works if dumped into a file instead of being sent over gRPC. vLLM will require some extra work to get the build process to install it in a way compatible with modern Python and uv. For these reasons, the PR will stay a draft until the backends are tested and the issues with faster-whisper and vllm are resolved in some way.

Signed commits

  • Yes, I signed my commits.

rampa3 added 20 commits August 7, 2025 12:03
…dencies when not needed & use CPU Torch for CPU build

Signed-off-by: rampa3 <[email protected]>
…ild type - bare metal build tweak

Signed-off-by: rampa3 <[email protected]>
…request contains - deadlock/OOM prevention

Signed-off-by: rampa3 <[email protected]>

netlify bot commented Aug 7, 2025

Deploy Preview for localai ready!

Name Link
🔨 Latest commit 515ab68
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/68971449dd710400082dd7d1
😎 Deploy Preview https://deploy-preview-5990--localai.netlify.app

@rampa3 rampa3 changed the title Add Python backends CPU builds draft feat(Python backends): Add Python backends CPU builds Aug 7, 2025
rampa3 added 2 commits August 7, 2025 15:37
This reverts commit 39f32b0.

I made an error when applying the CPU requirements - "+cpu" applies only to pinned versions.

Signed-off-by: rampa3 <[email protected]>
@@ -0,0 +1,64 @@
ARG BASE_IMAGE=ubuntu:22.04
Owner

Why have a separate Dockerfile? When build-type is empty we already treat it as a CPU build.

Author

@rampa3 rampa3 Aug 7, 2025

Why have a separate Dockerfile? When build-type is empty we already treat it as a CPU build.

This Dockerfile is supposed to build vLLM from source, as PyPI, just like with Torch, only has a CUDA release. The aim behind the CPU builds is to use CPU-specific builds of libraries wherever we don't need any other changes: for example, just installing Torch from PyPI on a CPU image adds more than 4 GB of NVIDIA CUDA dependencies that the package pulls in. (This is clearly visible on the master CI build of Kitten TTS right now - the TTS itself is not GPU accelerated, but since one of its libs wants Torch, you get more than 5 GB of extra dependencies in Torch + CUDA. Just for fun, I built it locally with edited requirements to preinstall CPU Torch, and the image size fell to 1.16 GB.) That is why I went and blanket-added an extra index pointing to the CPU releases of Torch everywhere.

With vLLM it is a bit more complicated - to get a CPU release, it has to be built from source. We have a part of install.sh for that, but that part never runs with the normal Dockerfile, as at some point the FROM_SOURCE argument was removed. Since the build also has its own specific deps, I made an extra Dockerfile that installs the build deps according to the vLLM docs on building the CPU version. It builds successfully, but for some reason crashes on init when called by LocalAI, and I have no idea how to properly get the whole stacktrace - gRPC returns only part of it. This is one of the reasons why the PR is a draft rather than a regular PR - I want to try to get this CPU build working.
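
For reference, the extra index pointing to CPU Torch mentioned above typically amounts to a couple of lines at the top of a requirements-cpu.txt, roughly like this sketch (illustrative contents, not the exact file from the PR):

# CPU-only Torch wheels live on PyTorch's own index, not on PyPI
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchaudio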

Author

@rampa3 rampa3 Aug 7, 2025

In the end, that Dockerfile could become .vllm instead of .vllmcpu - every GPU other than NVIDIA needs to be built from source. But to start with, I focused on CPU, as that is the only platform I can reliably test.

(two screenshots attached)

Author

Should I rename the file in preparation for potential addition of ROCm and XPU parts into it?

Owner

I get the point of building vLLM for CPU, but I've just run a diff manually locally against the two Dockerfiles (Dockerfile.python and Dockerfile.vllmcpu) and I don't see notable differences. My point is more that I think we can still use the same Dockerfile and handle the installation bits directly in the make/install of the backend, unless I am missing something?

--- backend/Dockerfile.vllmcpu	2025-08-08 16:43:25.145194390 +0200
+++ backend/Dockerfile.python	2025-08-08 16:43:15.812600946 +0200
@@ -1,11 +1,9 @@
 ARG BASE_IMAGE=ubuntu:22.04
 
 FROM ${BASE_IMAGE} AS builder
-ARG BACKEND=vllm
+ARG BACKEND=rerankers
 ARG BUILD_TYPE
 ENV BUILD_TYPE=${BUILD_TYPE}
-ARG FROM_SOURCE=true
-ENV FROM_SOURCE=${FROM_SOURCE}
 ARG CUDA_MAJOR_VERSION
 ARG CUDA_MINOR_VERSION
 ARG SKIP_DRIVERS=false
@@ -30,20 +28,81 @@ RUN apt-get update && \
         curl python3-pip \
         python-is-python3 \
         python3-dev llvm \
-        python3-venv make \
-        wget \
-        gcc-12 g++-12 \
-        libtcmalloc-minimal4 \
-        libnuma-dev \
-        ffmpeg \
-        libsm6 libxext6 \
-        libgl1 \
-        jq lsof && \
+        python3-venv make && \
     apt-get clean && \
-    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
     rm -rf /var/lib/apt/lists/* && \
     pip install --upgrade pip
 
+
+# Cuda
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# HipBLAS requirements
+ENV PATH=/opt/rocm/bin:${PATH}
+
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
+        wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list && \
+        apt-get update && \
+        apt-get install -y \
+            vulkan-sdk && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/arm64/cuda-keyring_1.1-1_all.deb
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig \
+    ; fi
 # Install uv as a system package
 RUN curl -LsSf https://astral.sh/uv/install.sh | UV_INSTALL_DIR=/usr/bin sh
 ENV PATH="/root/.cargo/bin:${PATH}"
@@ -60,5 +119,5 @@ COPY python/common/ /${BACKEND}/common
 RUN cd /${BACKEND} && make
 
 FROM scratch
-ARG BACKEND=vllm
-COPY --from=builder /${BACKEND}/ /
+ARG BACKEND=rerankers
+COPY --from=builder /${BACKEND}/ /

Author

@rampa3 rampa3 Aug 11, 2025

Well, the dependency block I am talking about consists of dependencies from APT, as listed in the vLLM docs. That means requirements-cpu.txt is not the way to handle them. They are GCC, the C++ libraries required to compile vLLM, and a few tools vLLM uses in its makefiles. Here is the block from the vLLM docs that dictates the extra dependencies:

sudo apt-get update  -y
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

We already have some, but we are missing these:

  • python3-venv
  • make
  • wget
  • gcc-12
  • g++-12
  • libtcmalloc-minimal4
  • libnuma-dev
  • ffmpeg
  • libsm6
  • libxext6
  • libgl1
  • jq
  • lsof

all of which have to be installed as APT packages, since vLLM is compiled from C++ code. Normally we just pull Python dependencies, as even CPU Torch already comes pre-compiled for its C++ parts, but vLLM for CPU is compiled fully from scratch, so unless we decide not to ship CPU vLLM, we have to provide these somehow.

I can see if I can make it work with install.sh; since it is just a shell script and the builder runs as root, it should work. The only thing is, if I put them there, those building from custom Dockerfiles won't thank me - people build, for example, on Arch builders, and putting it there limits the build platform to Debian and derivative distros only (without manual intervention). The Dockerfile was chosen not only as an experimentation shortcut (only the fact that it was a separate file was a testing shortcut), but also to keep the backend source directory platform agnostic.

Owner

Fair enough, I think it's OK to put it in the Dockerfile.python builder. Especially because at the end of the day that container is used only for building, so in the worst case we would have to copy the libraries to the final backend during the packaging phase.
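
A rough sketch of how those build packages could be folded into the shared Dockerfile.python builder, gated so that only the vLLM CPU build installs them (the condition and variable names are assumptions based on the existing build arguments, not a final implementation):

# Extra APT build dependencies for compiling vLLM for CPU, per the vLLM docs;
# skipped for every other backend and build type.
RUN if [ "${BACKEND}" = "vllm" ] && [ "${BUILD_TYPE}" = "" ]; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            wget gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev \
            ffmpeg libsm6 libxext6 libgl1 jq lsof && \
        update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
            --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* \
    ; fi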

Owner

my suggestion here probably would be to do this step-by-step for each backend, or at least treat vLLM separately to not make this PR go stale.

Author

my suggestion here probably would be to do this step-by-step for each backend, or at least treat vLLM separately to not make this PR go stale.

I agree - I think splitting it backend by backend will be the best way. I will prepare per-backend branches and PRs for the ready ones ASAP. For the working ones I just have to figure out the CI; the rest will be opened whenever I get a moment to sit down and finish them. The last few weeks have been a bit busy, as I am in the middle of the autumn term for my bachelor finals. With that, I will be closing this one then?

Owner

Yes, sounds good to me, we can follow up on the other PRs. Thanks! (And good luck with your finals!)

@mudler
Owner

mudler commented Aug 7, 2025

I think the changes are going in the right direction; however, we need to update the CI workflow in https://github.com/mudler/LocalAI/blob/master/.github/workflows/backend.yml and the backend gallery index https://github.com/mudler/LocalAI/blob/master/backend/index.yaml accordingly, so that the CPU variants are present where missing.

@mudler mudler changed the title draft feat(Python backends): Add Python backends CPU builds feat(Python backends): Add Python backends CPU builds Aug 7, 2025
@rampa3
Author

rampa3 commented Aug 7, 2025

I think the changes are going in the right direction; however, we need to update the CI workflow in https://github.com/mudler/LocalAI/blob/master/.github/workflows/backend.yml and the backend gallery index https://github.com/mudler/LocalAI/blob/master/backend/index.yaml accordingly, so that the CPU variants are present where missing.

Of course, though before doing all the CI and gallery work, I want to have all backends yielding usable output first, unless we decide to drop any. At the moment, I need to finish testing rfdetr (it reported the rfdetr-base model from the gallery as invalid during testing) and kokoro (I attempted to test it before full integration was done, so I was a bit stuck on how it is supposed to be used), and attempt to fix faster-whisper and vllm. The faster-whisper one is especially interesting, as both parts work by themselves, but it refuses to pass the transcription through the gRPC part of the backend. vllm is only about finding a way to get the stacktrace out of it and implementing a fix for the problem.
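
On getting the full stacktrace out of the vLLM backend: since gRPC propagates only a shortened error string, one generic option (a sketch with hypothetical names, not LocalAI's actual servicer code) is to catch and print the complete traceback inside the backend process itself, so it ends up in the backend's own log:

import traceback

class BackendServicer:  # stand-in for the real gRPC servicer class
    def LoadModel(self, request, context):
        try:
            return self._do_load(request)  # hypothetical helper doing the real work
        except Exception:
            # Dump the full traceback to the backend's own log before gRPC
            # collapses it into a short error message.
            print(traceback.format_exc(), flush=True)
            raise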

@rampa3 rampa3 closed this Aug 20, 2025