// about
An LLM server with nothing else on the box.
Inference servers today run on top of a kernel, an init system, a container runtime, a Python interpreter, a CUDA userspace, and a stack of frameworks. Every layer claims a slice of the latency budget. cllm asks what happens when you delete almost all of them.
The thesis
A unikernel collapses the operating system and the application into one binary. There is no kernel/userspace boundary. There is no scheduler dispatching between unrelated processes. There is no syscall trap. The Multiboot loader drops a 32-bit ELF at its entry point in protected mode; the ELF sets up its own stack and spends every subsequent cycle on the work it was built for.
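To make the handoff concrete: a Multiboot 1 loader finds the image by scanning for a small header whose three leading fields must sum to zero, and it leaves a magic value in EAX when it jumps to the entry point. The sketch below follows the Multiboot 1 specification, not the contents of boot.S; the function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Constants defined by the Multiboot 1 specification. */
#define MULTIBOOT_HEADER_MAGIC     0x1BADB002u /* the loader scans the image for this */
#define MULTIBOOT_BOOTLOADER_MAGIC 0x2BADB002u /* the loader leaves this in EAX at entry */

/* The three mandatory leading fields of a Multiboot 1 header. */
struct multiboot_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;
};

/* Hypothetical helper: build a header for the given flags.
 * The checksum is chosen so magic + flags + checksum wraps to 0 (mod 2^32). */
struct multiboot_header make_multiboot_header(uint32_t flags)
{
    struct multiboot_header h;
    h.magic = MULTIBOOT_HEADER_MAGIC;
    h.flags = flags;
    h.checksum = (uint32_t)(0u - (MULTIBOOT_HEADER_MAGIC + flags));
    return h;
}

/* Hypothetical helper: the same validity check a loader applies. */
int multiboot_header_valid(const struct multiboot_header *h)
{
    return h->magic == MULTIBOOT_HEADER_MAGIC &&
           (uint32_t)(h->magic + h->flags + h->checksum) == 0u;
}

/* Hypothetical helper: what an entry point could check before trusting
 * the boot information pointer the loader passes in EBX. */
int booted_by_multiboot(uint32_t eax_at_entry)
{
    return eax_at_entry == MULTIBOOT_BOOTLOADER_MAGIC;
}
```

In a real image the header is usually emitted as three words in assembly near the start of the binary; computing it in C here just makes the checksum rule easy to verify on a host.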
For most workloads that trade-off is wrong. For LLM serving, where the steady-state job is "do as much math as possible without being interrupted," it is at least worth checking. cllm is the experiment.
What the kernel actually contains
The repository today ships nine files of substance. boot.S is the Multiboot entry point. kernel.c brings up serial and VGA and enters the main loop. memory.c manages a statically allocated 4 MB heap with malloc and free. string.c implements a tiny libc subset (snprintf, memcpy, memset, strncmp), enough to keep the rest of the kernel honest. network.c walks the PCI bus, claims an Intel e1000 NIC, and pushes raw frames. http.c and api.c implement an HTTP/1.1 subset and route requests. api_v1.c exposes a llama.cpp-shaped v1 surface. llm.c is the seam where the inference engine will plug in.
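As an illustration of what "walks the PCI bus" involves, here is a minimal sketch of the legacy PCI configuration mechanism: an address is written to I/O port 0xCF8 and the register is read back from 0xCFC. This is not the code in network.c. The function names are hypothetical, the port access is abstracted behind a callback so the logic can be exercised on a host, and the device ID 0x100E (the 82540EM that QEMU's e1000 model emulates) is an assumption about which e1000 variant is targeted.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u  /* I/O port: selects a configuration register */
#define PCI_CONFIG_DATA    0xCFCu  /* I/O port: reads or writes the selected register */

#define PCI_VENDOR_INTEL   0x8086u
#define PCI_DEVICE_82540EM 0x100Eu /* assumed target: the model QEMU's e1000 emulates */

/* Build the 32-bit value written to PCI_CONFIG_ADDRESS.
 * Bit 31 enables the access; the offset is rounded down to a dword. */
uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1Fu) << 11)
         | ((uint32_t)(func & 0x07u) << 8)
         | ((uint32_t)offset & 0xFCu);
}

/* Register 0 holds the device ID in the upper 16 bits and the vendor ID
 * in the lower 16 bits. An empty slot reads back as 0xFFFFFFFF. */
int pci_is_e1000(uint32_t id_register)
{
    uint16_t vendor = (uint16_t)(id_register & 0xFFFFu);
    uint16_t device = (uint16_t)(id_register >> 16);
    return vendor == PCI_VENDOR_INTEL && device == PCI_DEVICE_82540EM;
}

/* Hypothetical callback: performs the outl/inl pair for one config address. */
typedef uint32_t (*pci_read32_fn)(uint32_t config_address);

/* Scan function 0 of every bus/device slot and return the config address
 * of the first matching NIC, or 0 if none is found. */
uint32_t pci_find_e1000(pci_read32_fn read32)
{
    for (unsigned bus = 0; bus < 256; bus++) {
        for (unsigned dev = 0; dev < 32; dev++) {
            uint32_t addr = pci_config_address((uint8_t)bus, (uint8_t)dev, 0, 0);
            if (pci_is_e1000(read32(addr)))
                return addr;
        }
    }
    return 0;
}
```

On bare metal the callback would wrap the two port instructions; keeping it as a parameter is purely a testing convenience for this sketch.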
That is the whole stack. There is no virtual memory manager, no filesystem, no process model, no signal handling, no userspace. It is small enough that one engineer can hold the entire control flow in their head.
What is not in the kernel yet
The README and roadmap are explicit, and we will be explicit too. The llama.cpp inference engine is not integrated yet. GPU/CUDA passthrough is not implemented. Streaming token generation is on the roadmap. vLLM-derived transformer optimizations are also on the roadmap. The v1 endpoints exist, parse requests, and route to handlers, but the handlers return stubs until the inference path lands.
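The "route to handlers, return stubs" arrangement can be pictured as a small static route table. The following is a sketch under stated assumptions, not the code in api.c or api_v1.c: the paths, the handler names, the response struct, and the choice of a 501 status for the stub are all hypothetical. On a host it uses the standard snprintf; in the kernel the equivalent would come from string.c.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical response container, sized for short JSON bodies. */
struct http_response {
    int status;
    char body[128];
};

typedef void (*api_handler)(struct http_response *res);

/* Stub handler: the route exists, but inference is not wired in yet.
 * Returning 501 is an assumption made for this sketch. */
static void stub_completions(struct http_response *res)
{
    res->status = 501;
    snprintf(res->body, sizeof res->body,
             "{\"error\":\"inference engine not integrated\"}");
}

static void handle_health(struct http_response *res)
{
    res->status = 200;
    snprintf(res->body, sizeof res->body, "{\"status\":\"ok\"}");
}

struct route {
    const char *method;
    const char *path;
    api_handler handler;
};

/* Hypothetical routes, loosely modelled on a llama.cpp-style surface. */
static const struct route routes[] = {
    { "GET",  "/health",         handle_health },
    { "POST", "/v1/completions", stub_completions },
};

/* Linear search over the table; unmatched requests get a 404. */
void api_dispatch(const char *method, const char *path, struct http_response *res)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(method, routes[i].method) == 0 &&
            strcmp(path, routes[i].path) == 0) {
            routes[i].handler(res);
            return;
        }
    }
    res->status = 404;
    snprintf(res->body, sizeof res->body, "{\"error\":\"not found\"}");
}
```

With this shape, landing the inference path means replacing the body of one handler; the table and the dispatcher stay as they are.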
What this means in practice: cllm today is a working substrate. It boots, it serves HTTP, it does so without an operating system, and it has a slot waiting for inference code. Phase 2 of the specification is to compile llama.cpp into that slot. Phase 3 is GPU support. Phase 4 is the vLLM optimizations.
Who this is for
We are talking to three audiences:
- Latency-sensitive LLM operators who care about tail behavior and have already optimized everything above the kernel.
- Edge inference teams shipping inference appliances that need to boot fast, run small, and be inspectable end to end.
- Kernel and performance engineers who want a tractable place to apply ring-0 tricks to ML workloads without negotiating with the host OS.
If you write Python and want a managed endpoint, cllm is not the right tool. It will not be the right tool. We are building the other end of the spectrum on purpose.
Why C
Because the inference engine we are integrating is C, the kernel target is i386 with predictable codegen, and we want a single language across the boot path, the drivers, the network stack, and the HTTP server. The repository has a small amount of Zig in support files, but the kernel proper is C. No exceptions, no runtime, no surprises at the ABI boundary.
The honest limitations
- Single-threaded request handling. The HTTP server processes one request at a time.
- 4 MB heap arena. That is enough for the kernel and supporting structures; the model loader will need to grow the arena before real weights land.
- One NIC, one architecture. Intel e1000 over PCI, 32-bit x86 (i386). Other NICs and architectures are future work.
- No persistent storage. Models are baked into the binary or loaded at boot; there is no filesystem to write back to.
- HTTP/1.1 only. No TLS, no HTTP/2, no streaming responses yet.
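To give a feel for what the 4 MB heap arena in the list above implies, here is a minimal first-fit allocator over a fixed static buffer. It is a sketch, not the implementation in memory.c: the names arena_alloc and arena_free are hypothetical (chosen so the block does not collide with the host's malloc), and the strategy of splitting on allocation and coalescing only forward on free is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE (4u * 1024u * 1024u) /* 4 MB, fixed at link time */
#define ALIGNMENT 8u

/* Header stored in front of every block, free or in use. */
struct block {
    size_t size;         /* payload bytes that follow the header */
    int is_free;
    struct block *next;  /* next block in address order */
};

/* Header size rounded up so payloads stay 8-byte aligned on 32-bit targets too. */
#define HEADER_SIZE ((sizeof(struct block) + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1))

static uint64_t heap_storage[HEAP_SIZE / sizeof(uint64_t)]; /* 8-byte aligned */
static struct block *heap_head;

static size_t align_up(size_t n)
{
    return (n + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

static void heap_init(void)
{
    heap_head = (struct block *)heap_storage;
    heap_head->size = HEAP_SIZE - HEADER_SIZE;
    heap_head->is_free = 1;
    heap_head->next = NULL;
}

/* First fit: take the first free block that is large enough, splitting off
 * the remainder when it can hold a header plus a minimal payload. */
void *arena_alloc(size_t n)
{
    if (n == 0 || n > HEAP_SIZE)
        return NULL;
    if (!heap_head)
        heap_init();
    n = align_up(n);
    for (struct block *b = heap_head; b; b = b->next) {
        if (!b->is_free || b->size < n)
            continue;
        if (b->size >= n + HEADER_SIZE + ALIGNMENT) {
            struct block *rest = (struct block *)((unsigned char *)b + HEADER_SIZE + n);
            rest->size = b->size - n - HEADER_SIZE;
            rest->is_free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->is_free = 0;
        return (unsigned char *)b + HEADER_SIZE;
    }
    return NULL; /* arena exhausted: there is no OS to ask for more */
}

/* Mark the block free and merge it with any free blocks directly after it. */
void arena_free(void *p)
{
    if (!p)
        return;
    struct block *b = (struct block *)((unsigned char *)p - HEADER_SIZE);
    b->is_free = 1;
    while (b->next && b->next->is_free) {
        b->size += HEADER_SIZE + b->next->size;
        b->next = b->next->next;
    }
}
```

The NULL return on exhaustion is the point of the sketch: with a fixed arena and nothing underneath it, loading real weights means enlarging the buffer itself, which is why the model loader has to grow the arena first.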
How to engage
The source lives at github.com/cognisoc/cllm. The build prerequisites are gcc with -m32 support, make, and qemu-system-i386. make run boots the kernel with serial on stdio in under a second. make run-debug pauses on :1234 for a GDB attach. Documentation is at docs.cognisoc.com/cllm.
If you want to follow what we are doing, the blog has the engineering notes and the RSS feed is the canonical way to subscribe.