The kernel is the application
x86 · Multiboot · QEMU + bare-metal
// no OS. no syscalls. no scheduler.
// the kernel is the application.
cllm boots via Multiboot, brings up serial, VGA, an Intel e1000 NIC, and an HTTP/1.1 server with a llama.cpp-shaped REST API — all in a single 32-bit ELF. The inference engine is being wired in next.
arena
4 MB
statically allocated kernel heap
api
v1
llama.cpp-shaped HTTP endpoints
target
i386
Multiboot ELF, QEMU + bare metal
$ make run
Building release version...
Booting kernel in QEMU (serial on stdio)...
[boot] multiboot header ok, esp=0x7ffe0
[serial] COM1 9600 8N1 up
[mem] heap arena 0x100000..0x500000 (4096 KiB)
[pci] bus 0 dev 3: Intel e1000 (8086:100e) claimed
[net] e1000 ring up, mac=52:54:00:12:34:56
[http] listening on 0.0.0.0:80
[api] v1 routes mounted: /v1/completions /v1/chat/completions /v1/embeddings /v1/models
[llm] inference engine: stub (llama.cpp integration pending)
// representative serial output. exact lines depend on build flags.
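// how "9600 8N1" maps onto the UART
A minimal sketch of the arithmetic behind the [serial] line, assuming a standard 16550-compatible UART at COM1 with the usual 115200 baud base. The function names are illustrative and are not taken from boot.S or kernel.c.

```c
#include <stdint.h>

/* A PC-compatible 16550 UART divides a 115200 baud base rate. */
#define UART_BASE_BAUD   115200u

/* Line Control Register fields on the 16550. */
#define LCR_WORD_8BITS   0x03u  /* bits 1:0 = 11 -> 8 data bits      */
#define LCR_STOP_1       0x00u  /* bit 2 = 0     -> 1 stop bit       */
#define LCR_PARITY_NONE  0x00u  /* bit 3 = 0     -> parity disabled  */

/* Divisor latch value for a requested baud rate (hypothetical helper). */
uint16_t uart_divisor(uint32_t baud) {
    return (uint16_t)(UART_BASE_BAUD / baud);
}

/* LCR byte for 8 data bits, no parity, 1 stop bit (hypothetical helper). */
uint8_t uart_lcr_8n1(void) {
    return (uint8_t)(LCR_WORD_8BITS | LCR_PARITY_NONE | LCR_STOP_1);
}
```

For 9600 baud the divisor is 12 and the line control byte is 0x03; a real driver would write both through port I/O, which is omitted here.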
// what is in the binary
cllm boots via Multiboot and goes straight into serving HTTP. There is no Linux, no init, no syscalls: boot.S hands control to kernel.c and everything runs in ring 0, including the inference loop once the engine is wired in.
network.c walks the PCI bus, claims the Intel e1000 device, and pushes raw frames. The HTTP server lives directly above the NIC; no kernel networking subsystem in between.
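A sketch of how a PCI walk can recognise the e1000, assuming the legacy configuration mechanism (address port 0xCF8, data port 0xCFC). The helper names are hypothetical and not taken from network.c; the actual port I/O is left out so the logic can be read on its own.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u   /* I/O port: address register */
#define PCI_CONFIG_DATA    0xCFCu   /* I/O port: data register    */

#define E1000_VENDOR_ID    0x8086u  /* Intel                      */
#define E1000_DEVICE_ID    0x100Eu  /* 82540EM, as in 8086:100e   */

/* Build the 32-bit value written to PCI_CONFIG_ADDRESS to select
 * one dword of a function's configuration space. */
uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset) {
    return 0x80000000u                        /* enable bit              */
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev  & 0x1Fu) << 11)   /* 32 devices per bus      */
         | ((uint32_t)(func & 0x07u) << 8)    /* 8 functions per device  */
         | (uint32_t)(offset & 0xFCu);        /* dword-aligned register  */
}

/* Register 0 holds the device ID in the high half and the vendor ID
 * in the low half. Returns nonzero for an 8086:100e device. */
int pci_is_e1000(uint32_t id_reg) {
    return (id_reg & 0xFFFFu) == E1000_VENDOR_ID
        && (id_reg >> 16)     == E1000_DEVICE_ID;
}
```

Bus 0, device 3, register 0 selects address 0x80001800, which matches the slot shown in the sample boot log.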
api_v1.c exposes /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /tokenize and /detokenize. Clients that already speak the llama.cpp HTTP shape work without changes.
memory.c manages a statically allocated 4 MB arena with malloc/free. string.c implements just enough of libc (snprintf, memcpy, memset, strncmp) to keep the kernel honest and small.
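One way a malloc/free pair over a static arena can look: a first-fit free list with block splitting and forward merging. This is a sketch under assumptions, not the allocator in memory.c; the names arena_alloc and arena_free, the 8-byte alignment, and the header layout are all illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (4u * 1024u * 1024u)  /* 4 MB, matching the stated arena */
#define ALIGNMENT  8u                    /* assumed payload alignment       */

/* Header placed in front of every block; padded to a multiple of 8. */
typedef struct block {
    _Alignas(8) size_t size;   /* payload size in bytes              */
    int free;                  /* nonzero when the block is unused   */
    struct block *next;        /* next block in address order        */
} block_t;

static _Alignas(16) uint8_t arena[ARENA_SIZE];
static block_t *head = NULL;

static size_t align_up(size_t n) {
    return (n + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

void *arena_alloc(size_t n) {
    if (n == 0 || n > ARENA_SIZE) return NULL;
    n = align_up(n);
    if (!head) {                          /* lazy init: one big free block */
        head = (block_t *)arena;
        head->size = ARENA_SIZE - sizeof(block_t);
        head->free = 1;
        head->next = NULL;
    }
    for (block_t *b = head; b; b = b->next) {
        if (!b->free || b->size < n) continue;
        /* Split when the remainder can hold a header plus a minimal payload. */
        if (b->size >= n + sizeof(block_t) + ALIGNMENT) {
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + n);
            rest->size = b->size - n - sizeof(block_t);
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return b + 1;                     /* payload starts after the header */
    }
    return NULL;                          /* arena exhausted */
}

void arena_free(void *p) {
    if (!p) return;
    block_t *b = (block_t *)p - 1;
    b->free = 1;
    /* Merge with any free blocks that directly follow (no backward merge). */
    while (b->next && b->next->free) {
        b->size += sizeof(block_t) + b->next->size;
        b->next = b->next->next;
    }
}
```

Because the arena is a static array it lives in .bss, so it costs no space in the ELF image itself.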
A single 32-bit Multiboot ELF runs identically in QEMU and on bare-metal x86 hardware. Serial-on-stdio for headless boots, optional VGA terminal for interactive ones.
make run-debug builds with symbols and pauses QEMU on :1234 waiting for a GDB attach. Stepping through ring-0 inference code is as cheap as a target remote.
// layout
Eight layers, top to bottom. Every layer has a single owner; there is no dynamic linking, no module loader, no plugin surface.
+-----------------------------------------------------------+
| QEMU / Bare Metal (x86, Multiboot) |
+-----------------------------------------------------------+
| boot.S Multiboot entry, stack, serial init |
| kernel.c Kernel main, VGA terminal, serial I/O |
| memory.c Heap allocator (malloc/free) |
| string.c libc subset (snprintf, memcpy, ...) |
| network.c PCI enumeration + e1000 NIC driver |
| http.c / api.c HTTP server, request routing |
| api_v1.c llama.cpp-compatible REST API |
| llm.c Model loading and inference interface |
+-----------------------------------------------------------+
// api
The v1 surface mirrors llama.cpp so existing clients work without modification. Routing is implemented in api.c with handlers in api_v1.c.
| Method | Path | Handler | Status |
|---|---|---|---|
| POST | /v1/completions | handle_v1_completions | wired |
| POST | /v1/chat/completions | handle_v1_chat_completions | wired |
| POST | /v1/embeddings | handle_v1_embeddings | wired |
| GET | /v1/models | handle_v1_models | wired |
| POST | /tokenize | handle_tokenize | wired |
| POST | /detokenize | handle_detokenize | wired |
// "wired" = the route exists and parses; inference returns a stub until the llama.cpp engine is integrated.
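A sketch of how a table like this can drive routing. The handler names come from the table; the handler signature, the stub JSON body, the status-code return convention, and the dispatch function are assumptions made for illustration and may not match api.c or api_v1.c.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { HTTP_GET, HTTP_POST } http_method_t;

/* Hypothetical handler shape: writes a response body, returns an HTTP status. */
typedef int (*handler_fn)(const char *body, char *out, size_t out_len);

/* Placeholder response used until an inference engine is present. */
static int stub(char *out, size_t n) {
    snprintf(out, n, "{\"stub\":true}");
    return 200;
}

int handle_v1_completions(const char *b, char *o, size_t n)      { (void)b; return stub(o, n); }
int handle_v1_chat_completions(const char *b, char *o, size_t n) { (void)b; return stub(o, n); }
int handle_v1_embeddings(const char *b, char *o, size_t n)       { (void)b; return stub(o, n); }
int handle_v1_models(const char *b, char *o, size_t n)           { (void)b; return stub(o, n); }
int handle_tokenize(const char *b, char *o, size_t n)            { (void)b; return stub(o, n); }
int handle_detokenize(const char *b, char *o, size_t n)          { (void)b; return stub(o, n); }

typedef struct {
    http_method_t method;
    const char *path;
    handler_fn handler;
} route_t;

static const route_t routes[] = {
    { HTTP_POST, "/v1/completions",      handle_v1_completions },
    { HTTP_POST, "/v1/chat/completions", handle_v1_chat_completions },
    { HTTP_POST, "/v1/embeddings",       handle_v1_embeddings },
    { HTTP_GET,  "/v1/models",           handle_v1_models },
    { HTTP_POST, "/tokenize",            handle_tokenize },
    { HTTP_POST, "/detokenize",          handle_detokenize },
};

/* Linear scan: 404 for an unknown path, 405 for a known path with the
 * wrong method, otherwise whatever the handler returns. */
int dispatch(http_method_t m, const char *path, const char *body,
             char *out, size_t out_len) {
    int path_seen = 0;
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(routes[i].path, path) != 0) continue;
        path_seen = 1;
        if (routes[i].method == m)
            return routes[i].handler(body, out, out_len);
    }
    return path_seen ? 405 : 404;
}
```

A linear scan over six entries is cheaper than any cleverer structure and keeps the table readable next to the documentation.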
// roadmap
Lifted directly from the README. We do not overstate what ships.
// honest comparisons
We compare against the projects whose API surface or design we explicitly borrow from. Pick the right tool for the workload.
llama.cpp runs on a host OS and is the reference inference engine. cllm exposes the same v1 HTTP surface and is designed to embed the engine in the unikernel.
vLLM is a Python serving framework for GPU clusters. cllm targets the opposite end: a single C ELF on i386 with optimizations ported from the vLLM playbook over time.
// faq
A Multiboot ELF that boots in QEMU or on bare-metal x86, brings up serial and VGA, walks the PCI bus, drives an Intel e1000 NIC, and answers HTTP/1.1 requests through a llama.cpp-shaped v1 API. The inference engine itself is on the roadmap; the kernel and the network stack ship now.
The API surface mirrors llama.cpp, so the natural target is anything llama.cpp loads (GGUF-format weights for Llama-family architectures, Mistral, Qwen, Gemma, Phi, and so on). Until the llama.cpp inference path is wired into the kernel, no specific model has been benchmarked end-to-end.
QEMU is the development and primary supported target. The kernel is Multiboot-compliant, so any Multiboot loader will boot it — that includes GRUB on bare metal. Other hypervisors are unverified.
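What "Multiboot-compliant" means in practice: the loader scans the start of the image for a three-word header whose fields sum to zero modulo 2^32. The sketch below follows the Multiboot 1 header layout; the struct and function names are illustrative, and the flags cllm actually sets in boot.S are not stated here.

```c
#include <stdint.h>

#define MULTIBOOT_HEADER_MAGIC  0x1BADB002u  /* identifies the header       */
#define MULTIBOOT_PAGE_ALIGN    (1u << 0)    /* align modules on 4 KiB      */
#define MULTIBOOT_MEMORY_INFO   (1u << 1)    /* ask loader for a memory map */

typedef struct {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;   /* chosen so magic + flags + checksum == 0 */
} multiboot_header_t;

/* Build a header for the given flags (hypothetical helper). */
multiboot_header_t multiboot_make_header(uint32_t flags) {
    multiboot_header_t h;
    h.magic    = MULTIBOOT_HEADER_MAGIC;
    h.flags    = flags;
    h.checksum = (uint32_t)0 - (h.magic + h.flags);  /* wraps modulo 2^32 */
    return h;
}

/* The check a loader performs before accepting the image. */
int multiboot_header_valid(const multiboot_header_t *h) {
    return h->magic == MULTIBOOT_HEADER_MAGIC
        && (uint32_t)(h->magic + h->flags + h->checksum) == 0;
}
```

With both flags above set, the checksum works out to 0xE4524FFB.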
GPU support is roadmap, not shipped. The documentation includes an analysis of how a CUDA ggml backend could be integrated into a unikernel (PCIe access, PTX kernel embedding, host/device memory management), but that work is not yet in the build.
Raw Ethernet frames over the e1000 driver, with a minimal IPv4 + TCP implementation and an HTTP/1.1 subset on top — enough to route the v1 endpoints. There is no socket API in the traditional sense; the server is the packet processing loop.
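Any minimal IPv4 + TCP implementation has to compute the Internet checksum (RFC 1071) for every header it emits or accepts. A sketch of that routine follows; the function name is illustrative and is not taken from network.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of 16-bit big-endian words, then inverted.
 * The result is returned as a host-order value; the caller stores it
 * into the header high byte first. The 32-bit accumulator is wide
 * enough for any single IPv4 packet (at most 64 KiB). */
uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)                                  /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                         /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Running the same function over a received header with its checksum field intact yields 0 when the header is undamaged, so one routine covers both send and receive.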
Install gcc with -m32 support, make, and qemu-system-i386. Clone the repo, run make run, and serial output appears on your terminal. Ctrl-A X exits QEMU. There are debug, VGA, and GDB-attached variants as well.
No. The infrastructure scaffolding (kernel, drivers, HTTP) is in place; the inference engine is not. Treat cllm as a working substrate that the llama.cpp integration will land on top of.
A unikernel removes the kernel/userspace boundary, the scheduler, and every page of the host OS that is not part of the inference path. The target is a single tens-of-kilobytes ELF that boots in milliseconds and spends every cycle on math.
Clone the repo, run make run, and serial output appears on your terminal in under a second.