lxd8s allows LXD to be run inside Kubernetes. This is achieved by using
Firecracker to create a minimal, Alpine-based OS to host LXD.
LXD clustering is supported (compatible with a Kubernetes StatefulSet).
A Helm chart is provided; see my charts repo. Check the default `values.yaml`, as there are quite a number of adjustable options. It's recommended that you install smarter-device-manager in your cluster so that the container has access to `/dev/kvm`.
As well as deploying to Kubernetes, running in plain Docker is supported for testing; a Docker Compose file is provided for this purpose.
Since LXD needs fairly extensive control over cgroups and other kernel-provided isolation tools, it's impractical to run LXD directly inside a container without a significant number of hacks and compromises. Because Firecracker provides a lightweight KVM-based VM, LXD can have complete access to the features it requires. The Kubernetes container also only needs access to `/dev/kvm` (along with `CAP_NET_ADMIN`); no `privileged: true` required!
In Firecracker 0.25.0 (specifically this commit), a check for `KVM_CAP_XCRS` was added, even though this capability doesn't appear to actually be used. Since some older hardware (Intel Nehalem?) doesn't seem to support it, lxd8s will stay on v0.24.x.
A custom kernel is built, optimised for Firecracker and LXD. Currently the 5.10.x LTS branch is used.
The kernel will sometimes disable LRO (Large Receive Offload) on network interfaces. Due to what seems to be an ambiguous spec, this causes a panic in `virtio_net` (used for networking between host and guest). A simple patch in `patches/kernel-virtnet-no-lro-config.patch` makes this operation a no-op to prevent the panic.
The below assumes a checkout of a kernel tree for the desired version.

- Copy `lxd8s.config` to `arch/x86/configs`
- Run `make defconfig lxd8s.config`
- Make changes (e.g. with `make xconfig`)
- Copy `.config` to `.config.lxd8s`
- Run `make defconfig` to overwrite `.config` with the default config
- Run `scripts/diffconfig -m .config .config.lxd8s > lxd8s.config` to generate the updated config fragment
The main Kubernetes container's purpose is simply to configure some networking, pass configuration to the VM and start Firecracker. See `entrypoint.sh` for all of the details. In summary:
- The TUN device is created (`/dev/net/tun`) - this is required for the host side of networking with the VM
- Networking for the container is configured, including forwarding traffic destined for the pod to the VM's internet interface and making use of a separate LXD-private interface utilising kubelan
- LXD configuration is prepared - this includes the cluster cert and preseed files for cluster initialisation
- Config for the VM is added to a tar archive which will be overlaid atop the root filesystem on init
- If LXD data and storage volumes have not been provided, temporary 4GiB and 16GiB sparse images will be created (a sketch of this and the overlay tar follows the list)
- The VM is started with vmmd, attaching:
  - Drives for the read-only rootfs, LXD data, overlay tar and LXD storage
  - Internet and LXD private network interfaces
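For illustration, here's a minimal Go sketch of two of those steps: creating the temporary sparse backing images and building the config overlay tar. The real logic lives in `entrypoint.sh` (shell); the file names, sizes and archive contents below are assumptions, not the actual paths used.

```go
package main

import (
	"archive/tar"
	"os"
)

// createSparseImage makes a file that reports the given size but only
// allocates blocks as they are written (the equivalent of `truncate -s`).
func createSparseImage(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	return f.Truncate(size)
}

// buildOverlayTar packs VM configuration files into a tar archive that the
// guest unpacks over its root filesystem on boot.
func buildOverlayTar(path string, files map[string][]byte) error {
	out, err := os.Create(path)
	if err != nil {
		return err
	}
	defer out.Close()

	tw := tar.NewWriter(out)
	defer tw.Close()
	for name, data := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(data); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// 4GiB LXD data and 16GiB storage images when no volumes were provided
	// (paths are hypothetical).
	_ = createSparseImage("/tmp/lxd-data.img", 4<<30)
	_ = createSparseImage("/tmp/lxd-storage.img", 16<<30)
	_ = buildOverlayTar("/tmp/overlay.tar", map[string][]byte{
		"etc/lxd/preseed.yaml": []byte("# cluster preseed goes here\n"),
	})
}
```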
The userspace for the LXD VM is Alpine-based (currently 3.14), although the boot process is custom. In the `Dockerfile`, one of the builders is named `rootfs`, and the contents of this image are later compressed into a SquashFS image. The Docker Alpine images are missing many packages needed for a booting Linux system (e.g. an init), so these are all installed. A number of custom OpenRC scripts are included to perform tasks such as applying the overlay tar, initialising LXD with a preseed, formatting the LXD data volume, etc.
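The actual OpenRC scripts are shell, but purely to illustrate the overlay-tar step, unpacking the archive (attached to the VM as a raw drive; device name assumed) over the root filesystem boils down to something like:

```go
package main

import (
	"archive/tar"
	"io"
	"os"
	"path/filepath"
)

// applyOverlay unpacks a tar archive over the root filesystem, e.g. to drop
// config files into /etc before LXD starts.
func applyOverlay(src, root string) error {
	f, err := os.Open(src) // e.g. /dev/vdc - an assumption
	if err != nil {
		return err
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		dest := filepath.Join(root, hdr.Name)
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(dest, os.FileMode(hdr.Mode)); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
				return err
			}
			out, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				return err
			}
			out.Close()
		}
	}
}

func main() {
	_ = applyOverlay("/dev/vdc", "/")
}
```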
Since the VM is ephemeral (its lifecycle being tied to the host Kubernetes pod), there's no need to have a writable rootfs. A custom initramfs (embedded into the kernel) uses OverlayFS to add a tmpfs over `/` for scratch files.
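Conceptually, the arrangement looks like the Go sketch below: mount the read-only SquashFS, put a tmpfs-backed upper layer over it with OverlayFS, and use the result as the new root. The real implementation is the initramfs's init, not Go, and the device name, mount points and tmpfs size here are assumptions.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// mountRoot shows the OverlayFS arrangement: a read-only SquashFS lower layer
// with a tmpfs-backed upper layer for scratch files.
func mountRoot() error {
	for _, d := range []string{"/rom", "/scratch", "/newroot"} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	// The SquashFS rootfs arrives as a virtio drive (device name assumed).
	if err := unix.Mount("/dev/vda", "/rom", "squashfs", unix.MS_RDONLY, ""); err != nil {
		return err
	}
	// tmpfs holds anything written at runtime; it vanishes with the VM.
	if err := unix.Mount("tmpfs", "/scratch", "tmpfs", 0, "size=256m"); err != nil {
		return err
	}
	for _, d := range []string{"/scratch/upper", "/scratch/work"} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	// OverlayFS combines the two; /newroot becomes / after switch_root.
	return unix.Mount("overlay", "/newroot", "overlay", 0,
		"lowerdir=/rom,upperdir=/scratch/upper,workdir=/scratch/work")
}

func main() {
	if err := mountRoot(); err != nil {
		panic(err)
	}
}
```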
Currently, LXD is in Alpine's `testing` repo, which is the most bleeding-edge. In order to keep the system stable, LXD is built from source in a separate Alpine 3.14 builder.
A number of custom daemons were created in the `go-daemons/` directory.
vmmd is a small program to manage Firecracker. On its own, Firecracker has no real CLI, only a REST API. Using firecracker-go-sdk, a simple daemon configures and starts Firecracker, attaching the serial port to the console. It also supports specifying VM vCPU and memory allocation as a percentage of the host's, which is handy when Kubernetes nodes don't have the same hardware configuration.
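The core of this, using firecracker-go-sdk, looks roughly like the sketch below. The paths, drive IDs, kernel args and percentage values are placeholders, serial-console attachment is omitted, and exact SDK field names vary slightly between versions; see `go-daemons/` for the real implementation.

```go
package main

import (
	"context"
	"log"
	"runtime"

	firecracker "github.com/firecracker-microvm/firecracker-go-sdk"
	"github.com/firecracker-microvm/firecracker-go-sdk/client/models"
)

// hostMemMiB would normally be parsed from /proc/meminfo; hard-coded for brevity.
func hostMemMiB() int64 { return 16 * 1024 }

func main() {
	ctx := context.Background()

	// Size the VM as a percentage of the host's resources, so the same
	// settings work across differently-sized nodes.
	cpuPct, memPct := int64(50), int64(50)
	vcpus := int64(runtime.NumCPU()) * cpuPct / 100
	if vcpus < 1 {
		vcpus = 1
	}
	memMiB := hostMemMiB() * memPct / 100

	cfg := firecracker.Config{
		SocketPath:      "/tmp/firecracker.sock",
		KernelImagePath: "/opt/lxd8s/vmlinux",
		KernelArgs:      "console=ttyS0 reboot=k panic=1",
		Drives: []models.Drive{
			{
				DriveID:      firecracker.String("rootfs"),
				PathOnHost:   firecracker.String("/opt/lxd8s/rootfs.img"),
				IsRootDevice: firecracker.Bool(true),
				IsReadOnly:   firecracker.Bool(true),
			},
			{
				DriveID:      firecracker.String("lxd-data"),
				PathOnHost:   firecracker.String("/run/lxd-data.img"),
				IsRootDevice: firecracker.Bool(false),
				IsReadOnly:   firecracker.Bool(false),
			},
		},
		NetworkInterfaces: []firecracker.NetworkInterface{
			{StaticConfiguration: &firecracker.StaticNetworkConfiguration{HostDevName: "tap0"}},
		},
		MachineCfg: models.MachineConfiguration{
			VcpuCount:  firecracker.Int64(vcpus),
			MemSizeMib: firecracker.Int64(memMiB),
			HtEnabled:  firecracker.Bool(false),
		},
	}

	m, err := firecracker.NewMachine(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := m.Start(ctx); err != nil {
		log.Fatal(err)
	}
	// Block until the VM exits; the container (and pod) lives and dies with it.
	if err := m.Wait(ctx); err != nil {
		log.Fatal(err)
	}
}
```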
livenessd is responsible for reporting the health status of each LXD node. It exposes a health check endpoint that Kubernetes can use to check whether a node is online. This endpoint attempts to list the members of the LXD cluster, which will only succeed if LXD is running and cluster consensus has been reached. It has some special cases, such as allowing health checks to pass if the current node is in the lower half of replicas to be started; without this, startup would deadlock (Kubernetes waits for the health check to pass, but it won't pass until a majority of cluster nodes are online).
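As an illustration of that logic (the real daemon lives in `go-daemons/`; the endpoint path, port and helper names here are hypothetical), a health handler might look like:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// listClusterMembers is a stand-in for the LXD API call; it should only
// succeed once LXD is up and the cluster has reached consensus.
func listClusterMembers() ([]string, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

// ordinalFromHostname extracts the StatefulSet ordinal, e.g. "lxd8s-2" -> 2.
func ordinalFromHostname() int {
	host, _ := os.Hostname()
	parts := strings.Split(host, "-")
	n, _ := strconv.Atoi(parts[len(parts)-1])
	return n
}

func healthz(replicas int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if _, err := listClusterMembers(); err != nil {
			// Special case: let the lower half of the replicas report healthy
			// before consensus exists, otherwise a fresh cluster deadlocks
			// (Kubernetes won't start further pods until this one is ready,
			// but readiness needs a majority of nodes to be online).
			if ordinalFromHostname() < replicas/2 {
				w.WriteHeader(http.StatusOK)
				return
			}
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.HandleFunc("/healthz", healthz(3))
	_ = http.ListenAndServe(":8080", nil)
}
```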
Additionally, livenessd accounts for overprovisioning of memory by shutting down the longest-running containers when free memory drops below a certain threshold.
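That guard boils down to something like the following sketch; the threshold, polling interval and the container-stopping call are placeholders rather than the actual values and API used.

```go
package main

import (
	"bufio"
	"os"
	"strconv"
	"strings"
	"time"
)

// memAvailableKiB reads MemAvailable from /proc/meminfo.
func memAvailableKiB() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}

// stopLongestRunningContainer is a stand-in for the LXD API calls that would
// pick the container with the earliest start time and shut it down.
func stopLongestRunningContainer() error { return nil }

func main() {
	const thresholdKiB = 512 * 1024 // hypothetical 512MiB floor
	for range time.Tick(30 * time.Second) {
		free, err := memAvailableKiB()
		if err != nil {
			continue
		}
		if free < thresholdKiB {
			_ = stopLongestRunningContainer()
		}
	}
}
```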