
Conversation


@rampa3 rampa3 commented Aug 7, 2025

Description

This PR provides tweaks and additions for building CPU builds of Python LocalAI backends. Based on discussion #5980.

That entails these changes:

  • Patching libbackend to use more specific conditions for creating Intel ARC builds (a572ddf) - mainly important for building from source without Docker, since the originally used way of detecting Intel ARC is prone to misfire, for example when building on Arch Linux with Blender installed, because Blender's hard dependency on Intel oneAPI occupies the checked path /opt/intel/
  • Patching all requirements-cpu.txt files (except for exllama2, which is a CUDA-only backend; for vLLM the build process is also patched) to use CPU Torch from the PyTorch repository (416f212, 9e65421, 09a32ed, 131a590, b50cdd2, 7986a67, 2644b31, 6938d6d, 3c09e79, 5d4aad5), since PyPI only carries the CUDA release
  • Fixing the bark requirements install order to prevent CUDA from being pulled into non-CUDA builds (416f212)
  • Checking whether an XPU is really available before using it in diffusers and transformers (704753d, rampa3@076fd9c) - mainly important when running builds made from source without Docker, for the same reason as above: the original Intel ARC detection misfires, for example on Arch Linux with Blender installed, because Blender's hard dependency on Intel oneAPI occupies the checked path /opt/intel/ (see the sketch after this list)
  • Patching diffusers to enforce stricter CPU mode switching, add CPU optimizations, and prevent deadlocks (float16/bfloat16 are not usable on CPU) and OOMs (6ba3b94, e16f605, 64d4b70, a25ff94)
  • Patching faster-whisper to add CPU switching logic, including a float type switch (1d94f2d)
  • Patching transformers to load libraries on CPU (rampa3@076fd9c) and to only use float32 while running on CPU (c77851d)
  • Updating the vLLM CPU build process (5d4aad5) and adding logic for it to the main Makefile (fd5656b)
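
A minimal sketch of the device/dtype guard pattern referenced in the list above (hypothetical helper name; the actual diffusers/transformers changes differ in detail): verify that an XPU is really exposed by the runtime instead of inferring it from /opt/intel/, and force float32 on CPU, since float16/bfloat16 are not usable there.

import torch

def select_device_and_dtype(requested_dtype):
    # Only pick XPU when the runtime actually exposes one, instead of
    # guessing from the presence of /opt/intel/ on the host.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu", requested_dtype
    if torch.cuda.is_available():
        return "cuda", requested_dtype
    # Half precision deadlocks or is unsupported on CPU, so always
    # fall back to float32 there, regardless of the request.
    return "cpu", torch.float32

device, dtype = select_device_and_dtype(torch.float16)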

Notes for Reviewers

The current state of the CPU builds (based on testing from a few days ago) is:

  • Working (tested for coherent output) backends:
    • bark - slow, but yields usable output
    • coqui - framework - speed depends on the model
    • diffusers - SD MODELS ONLY, decent speed with SD models
      • float16 and bfloat16 computation cause a deadlock if attempted on CPU - code modifications preventing float16/bfloat16 from even being attempted on CPU would be needed
      • attempting float32 on Flux causes an OOM very quickly; Lumina was not attempted after what Flux did - Flux and Lumina would need to be blocked on CPU
    • chatterbox - very fast considering that it runs on a CPU
    • rerankers - work pretty fast on CPU
    • transformers - framework - works; usability depends on the chosen model
  • Probably working (I wasn't able to test functionality; I just know the server starts and listens):
    • kokoro - not sure how to use the backend the right way; I kept bumping into issues trying to load a model
    • rfdetr - when I attempted testing, the backend claimed that the rfdetr-base model from the gallery is not a valid model (it has an official build, but that one is built against CUDA Torch from PyPI, so the image is needlessly large for CPU usage)
  • Not working backends:
    • faster-whisper - does not output over gRPC - the for loop that is supposed to iterate over the output segments never iterates (see the sketch after this list)
      • rigging the backend to dump the text into a file for debugging instead produces a full and correct transcript of the supplied speech
      • manually producing a dummy test segment and appending it to the result array does return it correctly over gRPC
      • the faster-whisper → gRPC hand-off of the segments fails, and I have no idea why
    • vllm - attempted to resurrect CPU-only vLLM builds from source; the vLLM build process does not install the built library correctly - needs more debugging
      • building vLLM from source, even after updating the build code, produces a .egg package for some reason, not a wheel - Python 3.10 under uv probably cannot see it; according to the uv docs, eggs are deprecated and not supported
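
One possible lead on the faster-whisper hand-off failure (an assumption, not something confirmed here): transcribe() in faster-whisper returns a lazy generator of segments, so if that generator is consumed once (for example by a debug dump) or created in a different scope than the one building the gRPC reply, a later loop over it yields nothing. A minimal sketch that materializes the segments up front would rule that out:

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="float32")
segments, info = model.transcribe("speech.wav")

# transcribe() returns a generator; materialize it once so the segments can
# be both logged for debugging and iterated again for the gRPC response.
segments = list(segments)

for segment in segments:
    print(segment.start, segment.end, segment.text)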

Kokoro and rfdetr still need to be tested. For faster-whisper, I am not sure why it refuses to work, and I need to hear from someone running a GPU build whether it is an issue with the CPU build or not; the situation is weird, as the only change is the use of CPU Torch, and the transcription itself works if dumped into a file instead of being sent over gRPC. vLLM will require some extra work to get the build process to install it in a way compatible with modern Python and uv. For these reasons, the PR will stay a draft until the backends are tested and the issues with faster-whisper and vllm are resolved in some way.

Signed commits

  • Yes, I signed my commits.

rampa3 added 20 commits August 7, 2025 12:03
…dencies when not needed & use CPU Torch for CPU build

Signed-off-by: rampa3 <[email protected]>
…ild type - bare metal build tweak

Signed-off-by: rampa3 <[email protected]>
…request contains - deadlock/OOM prevention

Signed-off-by: rampa3 <[email protected]>

netlify bot commented Aug 7, 2025

Deploy Preview for localai ready!

Name Link
🔨 Latest commit 515ab68
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/68971449dd710400082dd7d1
😎 Deploy Preview https://deploy-preview-5990--localai.netlify.app

@rampa3 rampa3 changed the title Add Python backends CPU builds draft feat(Python backends): Add Python backends CPU builds Aug 7, 2025
rampa3 added 2 commits August 7, 2025 15:37
This reverts commit 39f32b0.

I made an error when applying the CPU requirements - "+cpu" applies only to pinned versions.

Signed-off-by: rampa3 <[email protected]>
@@ -0,0 +1,64 @@
ARG BASE_IMAGE=ubuntu:22.04
Owner

Why have a separate Dockerfile? When build-type is empty we already treat it as a CPU build.

Author

@rampa3 rampa3 Aug 7, 2025

Why have a separate Dockerfile? When build-type is empty we already treat it as a CPU build.

This Dockerfile is supposed to build vLLM from source, as PyPI, just like with Torch, only has a CUDA release. The aim behind the CPU builds is to use CPU-specific builds of libraries wherever we don't need any other changes: for example, just installing Torch from PyPI on a CPU image adds more than 4 GB of NVIDIA CUDA dependencies that the package pulls in. (This is clearly visible on the master CI build of Kitten TTS right now - the TTS itself is not GPU accelerated, but since one of its libs wants Torch, you get more than 5 GB of extra dependencies in Torch + CUDA. Just for fun, I built it locally with edited requirements to preinstall CPU Torch, and the image size fell to 1.16 GB.) That is why I went and blanket-added an extra index pointing to the CPU releases of Torch everywhere.

With vLLM it is a bit more complicated - to get a CPU release, it has to be built from source. We have a part of install.sh for that, but that part never runs with the normal Dockerfile, as at some point the FROM_SOURCE argument was removed. Since the build also has its own specific deps, I made an extra Dockerfile that installs the build deps according to the vLLM docs on building the CPU version. It builds successfully, but for some reason crashes on init when called by LocalAI, and I have no idea how to properly get the whole stacktrace - gRPC returns only part of it. This is one of the reasons why the PR is a draft rather than a regular PR - I want to try to get this CPU build working.
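
For reference, the extra index pointing to CPU Torch mentioned above typically amounts to a couple of lines at the top of a requirements-cpu.txt, roughly like this sketch (illustrative contents, not the exact file from the PR):

# CPU-only Torch wheels live on PyTorch's own index, not on PyPI
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchaudio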

Author

@rampa3 rampa3 Aug 7, 2025

In the end, that Dockerfile could become .vllm instead of .vllmcpu - every GPU other than NVIDIA needs to be built from source. But to start with, I focused on CPU, as that is the only platform I can reliably test.

(two screenshots attached)

Author

Should I rename the file in preparation for potential addition of ROCm and XPU parts into it?

Owner

I get the point of building vLLM for CPU, but I've just run a diff manually locally against the two Dockerfiles (Dockerfile.python and Dockerfile.vllmcpu) and I don't see notable differences. My point is more that I think we can still use the same Dockerfile and handle the installation bits directly in the make/install of the backend, unless I am missing something?

--- backend/Dockerfile.vllmcpu	2025-08-08 16:43:25.145194390 +0200
+++ backend/Dockerfile.python	2025-08-08 16:43:15.812600946 +0200
@@ -1,11 +1,9 @@
 ARG BASE_IMAGE=ubuntu:22.04
 
 FROM ${BASE_IMAGE} AS builder
-ARG BACKEND=vllm
+ARG BACKEND=rerankers
 ARG BUILD_TYPE
 ENV BUILD_TYPE=${BUILD_TYPE}
-ARG FROM_SOURCE=true
-ENV FROM_SOURCE=${FROM_SOURCE}
 ARG CUDA_MAJOR_VERSION
 ARG CUDA_MINOR_VERSION
 ARG SKIP_DRIVERS=false
@@ -30,20 +28,81 @@ RUN apt-get update && \
         curl python3-pip \
         python-is-python3 \
         python3-dev llvm \
-        python3-venv make \
-        wget \
-        gcc-12 g++-12 \
-        libtcmalloc-minimal4 \
-        libnuma-dev \
-        ffmpeg \
-        libsm6 libxext6 \
-        libgl1 \
-        jq lsof && \
+        python3-venv make && \
     apt-get clean && \
-    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
     rm -rf /var/lib/apt/lists/* && \
     pip install --upgrade pip
 
+
+# Cuda
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# HipBLAS requirements
+ENV PATH=/opt/rocm/bin:${PATH}
+
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
+        wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list && \
+        apt-get update && \
+        apt-get install -y \
+            vulkan-sdk && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/arm64/cuda-keyring_1.1-1_all.deb
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig \
+    ; fi
 # Install uv as a system package
 RUN curl -LsSf https://astral.sh/uv/install.sh | UV_INSTALL_DIR=/usr/bin sh
 ENV PATH="/root/.cargo/bin:${PATH}"
@@ -60,5 +119,5 @@ COPY python/common/ /${BACKEND}/common
 RUN cd /${BACKEND} && make
 
 FROM scratch
-ARG BACKEND=vllm
-COPY --from=builder /${BACKEND}/ /
+ARG BACKEND=rerankers
+COPY --from=builder /${BACKEND}/ /

Author

@rampa3 rampa3 Aug 11, 2025

Well, the dependency block I am talking about consists of dependencies from APT, as listed in the vLLM docs. That means requirements-cpu.txt is not the way to handle them. They are GCC, the C++ libraries required to compile vLLM, and a few tools vLLM uses in its makefiles. Here is the block from the vLLM docs that dictates the extra dependencies:

sudo apt-get update  -y
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

We already have some, but we are missing these:

  • python3-venv
  • make
  • wget
  • gcc-12
  • g++-12
  • libtcmalloc-minimal4
  • libnuma-dev
  • ffmpeg
  • libsm6
  • libxext6
  • libgl1
  • jq
  • lsof

all of which have to be installed as APT packages, since vLLM is compiled from C++ code. Normally we just pull Python dependencies, as even CPU Torch already comes pre-compiled for its C++ parts, but vLLM for CPU is compiled fully from scratch, so unless we decide not to ship CPU vLLM, we have to provide these somehow.

I can see if I can make it work with install.sh; since it is just a shell script and the builder runs as root, it should work. The only thing is, if I put them there, those building from custom Dockerfiles won't thank me - people build, for example, on Arch builders, and putting it there limits the build platform to Debian and derivative distros only (without manual intervention). The Dockerfile was chosen not only as an experimentation shortcut (only the fact that it was a separate file was a testing shortcut), but also to keep the backend source directory platform agnostic.

Owner

Fair enough, I think it's OK to put it in the Dockerfile.python builder. Especially because at the end of the day that container is used only for building, so in the worst case we would have to copy the libraries to the final backend during the packaging phase.
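
A rough sketch of how those build packages could be folded into the shared Dockerfile.python builder, gated so that only the vLLM CPU build installs them (the condition and variable names are assumptions based on the existing build arguments, not a final implementation):

# Extra APT build dependencies for compiling vLLM for CPU, per the vLLM docs;
# skipped for every other backend and build type.
RUN if [ "${BACKEND}" = "vllm" ] && [ "${BUILD_TYPE}" = "" ]; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            wget gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev \
            ffmpeg libsm6 libxext6 libgl1 jq lsof && \
        update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
            --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* \
    ; fi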

Owner

my suggestion here probably would be to do this step-by-step for each backend, or at least treat vLLM separately to not make this PR go stale.

Author

my suggestion here probably would be to do this step-by-step for each backend, or at least treat vLLM separately to not make this PR go stale.

I agree - I think splitting it backend by backend will be the best way. I will prepare per-backend branches and PRs for the ready ones ASAP. For the working ones I just have to figure out the CI; the rest will be opened whenever I get a moment to sit down and finish them. The last few weeks have been a bit busy, as I am in the middle of the autumn term for my bachelor finals. With that, I will be closing this one then?

Owner

Yes, sounds good to me, we can follow up on the other PRs. Thanks! (And good luck with your finals!)

@mudler
Owner

mudler commented Aug 7, 2025

I think the changes are going in the right direction; however, we need to update the CI workflow in https://github.com/mudler/LocalAI/blob/master/.github/workflows/backend.yml and the backend gallery index https://github.com/mudler/LocalAI/blob/master/backend/index.yaml accordingly, so that the CPU variants are present where missing.

@mudler mudler changed the title draft feat(Python backends): Add Python backends CPU builds feat(Python backends): Add Python backends CPU builds Aug 7, 2025
@rampa3
Author

rampa3 commented Aug 7, 2025

I think the changes are going in the right direction; however, we need to update the CI workflow in https://github.com/mudler/LocalAI/blob/master/.github/workflows/backend.yml and the backend gallery index https://github.com/mudler/LocalAI/blob/master/backend/index.yaml accordingly, so that the CPU variants are present where missing.

Of course, though before doing all the CI and gallery work, I want to have all backends yielding usable output first, unless we decide to drop any. At the moment, I need to finish testing rfdetr (it reported the rfdetr-base model from the gallery as invalid during testing) and kokoro (I attempted to test it before full integration was done, so I was a bit stuck on how it is supposed to be used), and attempt to fix faster-whisper and vllm. The faster-whisper one is especially interesting, as both parts work by themselves, but it refuses to pass the transcription through the gRPC part of the backend. vllm is only about finding a way to get the stacktrace out of it and implementing a fix for the problem.
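
On getting the full stacktrace out of the vLLM backend: since gRPC propagates only a shortened error string, one generic option (a sketch with hypothetical names, not LocalAI's actual servicer code) is to catch and print the complete traceback inside the backend process itself, so it ends up in the backend's own log:

import traceback

class BackendServicer:  # stand-in for the real gRPC servicer class
    def LoadModel(self, request, context):
        try:
            return self._do_load(request)  # hypothetical helper doing the real work
        except Exception:
            # Dump the full traceback to the backend's own log before gRPC
            # collapses it into a short error message.
            print(traceback.format_exc(), flush=True)
            raise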

@rampa3 rampa3 closed this Aug 20, 2025