Why a Unikernel for LLM Serving

cllm Team
Tags: architecture, unikernel, systems

Most LLM serving stacks are an onion. At the centre is the math: matrix multiplies, attention, layer norms, a sampler. Around that sits an inference engine — llama.cpp, vLLM, TensorRT-LLM. Around that sits a Python interpreter or a C++ binary. Around that sits a CUDA userspace. Around that sits a Linux kernel. Around that sits an init system. Around that sits a container runtime. Around that sits an orchestrator. By the time a token comes out of the model and hits the network, it has crossed eight or ten layers of software that exist mostly to maintain each other.

cllm asks a small question with a big consequence: what if we delete almost all of that?

The cost of the layers below

This is not a complaint about modern operating systems. Linux is excellent. Container runtimes solved real problems. Python made ML accessible to a generation of engineers who would not have shown up otherwise. The point is narrower. For inference serving — where the steady-state job is “do as much math as possible, hand the answer back over the network, and do not be interrupted” — most of the layers below the inference engine are paying overhead, not earning it.

Concretely, on a busy Linux box serving LLM inference, the kernel is still doing all the work a kernel does: scheduling, context switches, syscall interposition, virtual memory translation, copy-to-user, signal delivery, page reclaim, network stack traversals. Each one is fast individually. Each one is a hit against the latency budget. The container runtime adds its own slice. The Python interpreter, if you are running vLLM or similar, adds its own. None of these are reckless choices; they are reasonable trade-offs for a general-purpose machine. But the machine we want is not general-purpose. The machine we want runs one program.

A unikernel collapses the stack. There is no kernel/userspace boundary because there is no userspace. There is no scheduler dispatching between unrelated processes because there is only one process. There is no syscall trap because there is no syscall ABI to defend. The Multiboot loader jumps to the entry point of a 32-bit ELF, boot.S sets up a stack (Multiboot does not provide one), and that ELF spends every subsequent cycle on the work it was built for.
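The contract with a Multiboot loader is small enough to show in full. The sketch below follows the Multiboot 1 specification: the image carries a three-word header (magic, flags, checksum) that must sum to zero modulo 2^32, sit within the first 8 KiB of the image, and be 32-bit aligned. The loader then enters the kernel with a second magic value in EAX. The helper names here are ours for illustration, not identifiers from the cllm source, and in cllm the real header would live in boot.S.

```c
#include <stdint.h>

/* Constants from the Multiboot 1 specification. */
#define MULTIBOOT_HEADER_MAGIC     0x1BADB002u /* stored in the image for the loader to find */
#define MULTIBOOT_BOOTLOADER_MAGIC 0x2BADB002u /* left in EAX when the kernel is entered */
#define MULTIBOOT_FLAG_PAGE_ALIGN  (1u << 0)   /* align boot modules on 4 KiB boundaries */
#define MULTIBOOT_FLAG_MEMORY_INFO (1u << 1)   /* ask the loader for memory information */

/* The mandatory first three fields of the header. */
struct multiboot_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;
};

/* The checksum makes magic + flags + checksum wrap to zero in 32 bits. */
static inline uint32_t multiboot_checksum(uint32_t flags)
{
    return (uint32_t)(0u - (MULTIBOOT_HEADER_MAGIC + flags));
}

/* Hypothetical helper: build a header for the given flags. */
static inline struct multiboot_header make_multiboot_header(uint32_t flags)
{
    struct multiboot_header h = { MULTIBOOT_HEADER_MAGIC, flags, multiboot_checksum(flags) };
    return h;
}

/* The same check a loader performs before it accepts the image. */
static inline int multiboot_header_valid(const struct multiboot_header *h)
{
    return h->magic == MULTIBOOT_HEADER_MAGIC &&
           (uint32_t)(h->magic + h->flags + h->checksum) == 0u;
}

/* Hypothetical helper: the kernel's first sanity check on the value it was handed in EAX. */
static inline int booted_by_multiboot(uint32_t eax)
{
    return eax == MULTIBOOT_BOOTLOADER_MAGIC;
}
```

Everything else about the machine state at entry is the kernel's problem, including the stack pointer, which the specification leaves undefined.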

That is the trade-off. You give up multi-tenancy, generality, and the comfort of standard tooling. You get a binary that boots in milliseconds, has a memory footprint measured in single-digit megabytes, and has no software layers between the NIC and the inference path other than the ones you wrote.

What is actually in cllm

The repository today ships nine files of substance. The README is explicit about what each one does:

  • boot.S is the Multiboot entry point. It sets up the stack, initializes the serial port, and jumps to the C entry point in kernel.c.
  • kernel.c is the kernel main. It brings up the VGA terminal, the serial I/O, and enters the main loop.
  • memory.c manages a statically allocated 4 MB heap arena with malloc and free. There is no virtual memory manager. Multiboot hands over control in 32-bit protected mode with paging disabled, so none is required to run.
  • string.c implements a libc subset — snprintf, memcpy, memset, strncmp — enough to keep the rest of the kernel honest and small.
  • network.c walks the PCI bus, claims an Intel e1000 NIC, and pushes raw frames. It is a NIC driver living in the same address space as the HTTP server above it.
  • http.c and api.c implement an HTTP/1.1 subset and route requests to handlers.
  • api_v1.c exposes a llama.cpp-shaped v1 API: /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, plus /tokenize and /detokenize.
  • llm.c is the seam where the inference engine will plug in.
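To make the memory.c description concrete, here is a minimal sketch of a malloc/free pair over a static 4 MB arena. It is not the cllm implementation: the first-fit free list, the 16-byte alignment, and the names arena_alloc and arena_free are all our assumptions for illustration. It does show how little machinery an allocator needs when there is no virtual memory underneath it.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (4u * 1024u * 1024u)  /* 4 MB, the size named in the post */
#define ALIGNMENT  16u                   /* assumed payload alignment */

/* Per-block header; blocks sit back to back in address order. */
struct block {
    size_t size;          /* payload bytes that follow the header */
    int free;             /* 1 if the block is available */
    struct block *next;   /* next block by address, or NULL */
};

static _Alignas(16) unsigned char arena[ARENA_SIZE];
static struct block *head = NULL;

static size_t align_up(size_t n)
{
    return (n + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* Header size rounded up so that every payload stays aligned. */
#define HEADER_SIZE (align_up(sizeof(struct block)))

/* Hypothetical name: first-fit allocation out of the static arena. */
void *arena_alloc(size_t n)
{
    if (n == 0 || n > ARENA_SIZE)
        return NULL;
    n = align_up(n);

    if (!head) {                          /* lazy init: one big free block */
        head = (struct block *)arena;
        head->size = ARENA_SIZE - HEADER_SIZE;
        head->free = 1;
        head->next = NULL;
    }

    for (struct block *b = head; b; b = b->next) {
        if (!b->free || b->size < n)
            continue;
        /* Split only when the remainder can hold a header plus a minimal payload. */
        if (b->size >= n + HEADER_SIZE + ALIGNMENT) {
            struct block *rest = (struct block *)((unsigned char *)b + HEADER_SIZE + n);
            rest->size = b->size - n - HEADER_SIZE;
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return (unsigned char *)b + HEADER_SIZE;
    }
    return NULL;                          /* arena exhausted */
}

/* Hypothetical name: mark the block free, then merge adjacent free blocks. */
void arena_free(void *p)
{
    if (!p)
        return;
    struct block *b = (struct block *)((unsigned char *)p - HEADER_SIZE);
    b->free = 1;

    for (struct block *c = head; c; c = c->next) {
        while (c->free && c->next && c->next->free) {
            c->size += HEADER_SIZE + c->next->size;
            c->next = c->next->next;
        }
    }
}
```

A linear scan on every free is a poor choice for a general-purpose allocator, but with a single program and a small number of long-lived allocations it is a reasonable starting point.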

That is the whole stack. There is no filesystem, no process model, no signal handling, no userspace. It is small enough that one engineer can hold the entire control flow in their head, which we think is the right magnitude for a piece of infrastructure that lives between a network card and a model.
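The routing layer described for api.c and api_v1.c can be sketched as a static table plus a linear scan. This is an illustration, not the repository's code: the HTTP methods per endpoint, the 501 status returned by the stub, the JSON bodies, and every function name are assumptions on our part. It uses hosted snprintf and strcmp for brevity, where the kernel would use its own string.c.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A handler writes a response body into out and returns an HTTP status code. */
typedef int (*handler_fn)(char *out, size_t cap);

struct route {
    const char *method;
    const char *path;
    handler_fn handler;
};

/* Assumed stub behaviour: report that the engine is not wired in yet. */
static int stub_not_implemented(char *out, size_t cap)
{
    snprintf(out, cap, "{\"error\":\"not implemented\"}");
    return 501;
}

/* Endpoints from the post; the methods are our assumption. */
static const struct route routes[] = {
    { "POST", "/v1/completions",      stub_not_implemented },
    { "POST", "/v1/chat/completions", stub_not_implemented },
    { "POST", "/v1/embeddings",       stub_not_implemented },
    { "GET",  "/v1/models",           stub_not_implemented },
    { "POST", "/tokenize",            stub_not_implemented },
    { "POST", "/detokenize",          stub_not_implemented },
};

/* Hypothetical dispatcher: exact match on path, then on method. */
int dispatch(const char *method, const char *path, char *out, size_t cap)
{
    int path_known = 0;

    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(routes[i].path, path) != 0)
            continue;
        path_known = 1;
        if (strcmp(routes[i].method, method) == 0)
            return routes[i].handler(out, cap);
    }

    if (path_known) {
        snprintf(out, cap, "{\"error\":\"method not allowed\"}");
        return 405;
    }
    snprintf(out, cap, "{\"error\":\"not found\"}");
    return 404;
}
```

With six routes a linear scan is cheaper than anything cleverer, and swapping a stub for a real handler is a one-line change in the table.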

What this trades away

We want to be specific about what we lose. The READMEs of unikernel projects historically over-claim. cllm’s does not, and this post should not either.

We lose hardware portability. The current target is 32-bit x86 (i386, booted via Multiboot). x86_64 is on the wishlist; ARM64 and RISC-V are mentioned in the architecture doc as future considerations. None of these are wired up. Today, “cllm on your laptop” means QEMU emulating an i386.

We lose multi-tenancy and OS-level concurrency. The HTTP server processes one request at a time. There is no thread pool because there is no thread abstraction. Concurrent inference, when it lands, will happen at the model-execution level, not at the OS level.

We lose the standard debugging toolchain. There is no strace, no perf, no gdb attaching to a running process. We get make run-debug, which boots the kernel paused on :1234 waiting for a GDB attach over QEMU’s gdbstub. It is enough — but only because the kernel is small. The day cllm is too big to hold in your head is the day this strategy stops working.

We lose, today, the inference itself. The README is explicit. The kernel ships and serves HTTP; the llama.cpp engine is not yet integrated. The v1 endpoints exist, parse requests, and route to handlers, but the handlers return stubs. Phase 2 of the specification is the engine integration. Phase 3 is GPU support. Phase 4 is vLLM-derived optimizations.

What this buys

In return: a single ELF that boots in milliseconds, has no host OS underneath it to maintain, has no userspace to attack, and has exactly the code you can read in the repository running on the hardware. That is a meaningful property for an inference appliance — a box you ship to a customer or stand up at the edge — where boot time, attack surface, and image size matter more than portability across desktop operating systems.

It is also a meaningful property for a research vehicle. The day you want to test “what if the network packet from the NIC mapped straight into the inference engine’s input buffer with no copy,” there is no kernel boundary in the way. You write it. The day you want to test “what if we pinned the model weights to a specific physical address range and never paged them,” there is no virtual memory manager negotiating with you. You write that, too.
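As a thought experiment, the zero-copy idea can be sketched against the e1000's legacy receive descriptor, whose 16-byte layout and DD status bit come from Intel's 8254x documentation. Everything else below is hypothetical: the ring length, the buffer size, the function names, and the notion that the engine owns the receive buffers. The sketch assumes identity-mapped memory, so a pointer value doubles as the DMA address. A real driver would also mark the ring volatile and write the NIC's tail register when it returns a descriptor.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_RING_LEN        8u     /* assumed ring length */
#define RX_BUF_SIZE        2048u  /* assumed per-frame buffer size */
#define E1000_RXD_STAT_DD  0x01u  /* descriptor done: the NIC has written a frame */

/* Legacy e1000 receive descriptor (16 bytes). */
struct e1000_rx_desc {
    uint64_t addr;      /* address the NIC DMA-writes the frame to */
    uint16_t length;    /* bytes written by the NIC */
    uint16_t checksum;
    uint8_t  status;
    uint8_t  errors;
    uint16_t special;
};

/* Buffers owned by the (hypothetical) inference engine, not by the driver. */
static unsigned char engine_input[RX_RING_LEN][RX_BUF_SIZE];
static struct e1000_rx_desc rx_ring[RX_RING_LEN];
static unsigned rx_next;

/* Point every descriptor straight at an engine buffer. */
void rx_ring_init(void)
{
    for (unsigned i = 0; i < RX_RING_LEN; i++) {
        rx_ring[i].addr = (uint64_t)(uintptr_t)engine_input[i];
        rx_ring[i].length = 0;
        rx_ring[i].status = 0;
    }
    rx_next = 0;
}

/* Return the frame in place, or NULL if the NIC has not filled the next slot. */
const unsigned char *rx_poll(uint16_t *len)
{
    struct e1000_rx_desc *d = &rx_ring[rx_next];

    if (!(d->status & E1000_RXD_STAT_DD))
        return NULL;
    *len = d->length;
    return (const unsigned char *)(uintptr_t)d->addr;  /* no memcpy anywhere */
}

/* Hand the slot back only once the consumer has finished reading it. */
void rx_release(void)
{
    rx_ring[rx_next].status = 0;
    rx_next = (rx_next + 1u) % RX_RING_LEN;
}
```

The point is the absence of a copy: the pointer rx_poll returns is the same memory the NIC wrote, and nothing in the stack stands between the two.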

Why C

Because the inference engine we are targeting, llama.cpp, exposes a C API (the implementation behind it is C++, built on the ggml tensor library), so a C kernel can call into it directly. Because the i386 codegen story for C is older than most engineers reading this. Because we want one language across the boot path, the drivers, the network stack, and the HTTP server, with a plain C ABI at the seam to the future inference path. The repository has a small amount of Zig in support files, but the kernel proper is C, and that is the surface we expect contributors to read.

The honest framing

cllm is not a faster llama.cpp. It is not a smaller vLLM. It is the substrate we wished existed when we tried to put inference at the edge: a Multiboot ELF with a NIC driver, an HTTP server, and a slot waiting for a model. If you want a production inference server today, run llama.cpp or vLLM. If you want a research vehicle for ring-0 inference, or you are building an appliance and the host OS is a liability rather than an asset, read the source.

The full specification, including the four-phase roadmap and the GPU backend analysis, lives at docs.cognisoc.com/cllm. The source is at github.com/cognisoc/cllm. Run make run and you are looking at serial output in under a second.