cllm Source

x86 · Multiboot · QEMU + bare-metal

A bare-metal C unikernel for serving large language models.

// no OS. no syscalls. no scheduler.
// the kernel is the application.

cllm boots via Multiboot, brings up serial, VGA, an Intel e1000 NIC, and an HTTP/1.1 server with a llama.cpp-shaped REST API — all in a single 32-bit ELF. The inference engine is being wired in next.

arena

4 MB

statically allocated kernel heap

api

v1

llama.cpp-shaped HTTP endpoints

target

i386

Multiboot ELF, QEMU + bare metal

qemu :: serial0
$ make run
Building release version...
Booting kernel in QEMU (serial on stdio)...
[boot] multiboot header ok, esp=0x7ffe0
[serial] COM1 9600 8N1 up
[mem]    heap arena 0x100000..0x500000 (4096 KiB)
[pci]    bus 0 dev 3: Intel e1000 (8086:100e) claimed
[net]    e1000 ring up, mac=52:54:00:12:34:56
[http]   listening on 0.0.0.0:80
[api]    v1 routes mounted: /v1/completions /v1/chat/completions /v1/embeddings /v1/models
[llm]    inference engine: stub (llama.cpp integration pending)

// representative serial output. exact lines depend on build flags.
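The `[serial] COM1 9600 8N1 up` line implies a 16550-style UART setup. A minimal sketch of the two values that setup needs, assuming the standard 115200 baud base rate; `uart_divisor` and `uart_lcr_no_parity` are hypothetical names, not taken from the cllm source, and the port writes themselves are left out so the snippet compiles on a host:

```c
#include <stdint.h>

#define COM1_BASE      0x3F8u    /* conventional COM1 I/O port base */
#define UART_MAX_BAUD  115200u   /* 16550 rate when the divisor is 1 */

/* Divisor latch value for a requested baud rate; 0 means "not representable". */
uint16_t uart_divisor(uint32_t baud) {
    if (baud == 0 || baud > UART_MAX_BAUD || UART_MAX_BAUD % baud != 0)
        return 0;
    return (uint16_t)(UART_MAX_BAUD / baud);
}

/* Line Control Register value: 5..8 data bits, no parity, 1 or 2 stop bits. */
uint8_t uart_lcr_no_parity(unsigned data_bits, unsigned stop_bits) {
    uint8_t lcr = (uint8_t)((data_bits - 5u) & 0x3u);  /* bits 0-1: word length */
    if (stop_bits == 2u)
        lcr |= 0x04u;                                   /* bit 2: second stop bit */
    return lcr;
}
```

For 9600 8N1 this gives a divisor of 12 and an LCR value of 0x03, which a driver would write to the divisor latch and to `COM1_BASE + 3` respectively.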

// what is in the binary

Six pieces, one ELF.

Unikernel

The kernel is the application

cllm boots via Multiboot and goes straight into serving HTTP. There is no Linux, no init, no syscalls: boot.S hands control to kernel.c, and everything above it, including the currently stubbed inference path, runs in ring 0.

Network

PCI enumeration + e1000 NIC driver

network.c walks the PCI bus, claims the Intel e1000 device, and pushes raw frames. The HTTP server lives directly above the NIC; no kernel networking subsystem in between.
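A sketch of the address arithmetic behind a PCI bus walk, using the legacy 0xCF8/0xCFC configuration mechanism. The function names are hypothetical and whether network.c uses this exact mechanism is an assumption; the 8086:100e ID pair is the one shown in the serial log above.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u   /* write the address word here ...   */
#define PCI_CONFIG_DATA    0xCFCu   /* ... then read the register here   */
#define E1000_VENDOR_ID    0x8086u  /* Intel                             */
#define E1000_DEVICE_ID    0x100Eu  /* the e1000 model QEMU emulates     */

/* Build the 32-bit word written to PCI_CONFIG_ADDRESS. */
uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset) {
    return 0x80000000u                          /* enable bit               */
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev  & 0x1Fu) << 11)     /* 32 devices per bus       */
         | ((uint32_t)(func & 0x07u) << 8)      /* 8 functions per device   */
         | (uint32_t)(offset & 0xFCu);          /* dword-aligned register   */
}

/* Register 0 holds the device ID in the high half, vendor ID in the low half. */
int pci_is_e1000(uint32_t id_reg) {
    return (id_reg & 0xFFFFu) == E1000_VENDOR_ID
        && (id_reg >> 16)     == E1000_DEVICE_ID;
}
```

A walk would loop bus, device, and function, write each address to `PCI_CONFIG_ADDRESS`, read `PCI_CONFIG_DATA`, and stop at the first match. An all-ones read (0xFFFFFFFF) means the slot is empty.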

API

llama.cpp-compatible v1 surface

api_v1.c exposes /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /tokenize and /detokenize. Clients that already speak the llama.cpp HTTP shape work without changes.

libc

Custom 4 MB heap and string subset

memory.c manages a statically allocated 4 MB arena with malloc/free; string.c implements just enough of libc (snprintf, memcpy, memset, strncmp) to keep the kernel honest and small.
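One way a malloc/free pair over a static arena can look: a first-fit free list with block splitting and forward coalescing. This is a sketch under stated assumptions, not memory.c itself; the allocation strategy, the 16-byte alignment, and the `arena_malloc`/`arena_free` names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (4u * 1024u * 1024u)   /* 4 MB, matching the figure above */
#define ALIGN      16u                    /* assumed payload alignment       */

typedef struct block {
    size_t        size;   /* payload bytes, excluding this header */
    int           free;
    struct block *next;   /* next block in address order          */
} block_t;

#define HDR ((sizeof(block_t) + ALIGN - 1u) & ~(size_t)(ALIGN - 1u))

static _Alignas(16) unsigned char arena[ARENA_SIZE];
static block_t *head;

void *arena_malloc(size_t n) {
    if (n == 0 || n > ARENA_SIZE) return NULL;
    n = (n + ALIGN - 1u) & ~(size_t)(ALIGN - 1u);
    if (!head) {                          /* lazy init: one big free block */
        head = (block_t *)arena;
        head->size = ARENA_SIZE - HDR;
        head->free = 1;
        head->next = NULL;
    }
    for (block_t *b = head; b; b = b->next) {
        if (!b->free || b->size < n) continue;
        if (b->size >= n + HDR + ALIGN) { /* split off the unused tail */
            block_t *rest = (block_t *)((unsigned char *)b + HDR + n);
            rest->size = b->size - n - HDR;
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return (unsigned char *)b + HDR;
    }
    return NULL;                          /* arena exhausted */
}

void arena_free(void *p) {
    if (!p) return;
    ((block_t *)((unsigned char *)p - HDR))->free = 1;
    for (block_t *b = head; b; b = b->next)   /* merge adjacent free blocks */
        while (b->free && b->next && b->next->free) {
            b->size += HDR + b->next->size;
            b->next  = b->next->next;
        }
}
```

The design trades speed for size: every call is a linear scan, which is acceptable when the heap holds a handful of long-lived buffers and there is no second thread to contend with.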

Boot

Multiboot ELF — QEMU or bare metal

A single 32-bit Multiboot ELF runs identically in QEMU and on bare-metal x86 hardware. Serial-on-stdio for headless boots, optional VGA terminal for interactive ones.
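What makes an ELF "Multiboot" is a 12-byte header near the start of the image: a magic number, a flags word, and a checksum that makes the three sum to zero modulo 2^32. The sketch below expresses that rule in C. The struct and function names are hypothetical, and which flags boot.S actually sets is an assumption.

```c
#include <stdint.h>

#define MULTIBOOT_HEADER_MAGIC 0x1BADB002u  /* Multiboot 1 header magic       */
#define MB_FLAG_PAGE_ALIGN     0x1u         /* align loaded modules on 4 KiB  */
#define MB_FLAG_MEMINFO        0x2u         /* ask the loader for memory info */

struct multiboot_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;   /* chosen so magic + flags + checksum == 0 (mod 2^32) */
};

/* Build a header for the given flags. */
struct multiboot_header multiboot_header_make(uint32_t flags) {
    struct multiboot_header h;
    h.magic    = MULTIBOOT_HEADER_MAGIC;
    h.flags    = flags;
    h.checksum = 0u - (h.magic + h.flags);   /* unsigned wraparound is intended */
    return h;
}

/* The same check a Multiboot loader applies before jumping to the kernel. */
int multiboot_header_ok(const struct multiboot_header *h) {
    return h->magic == MULTIBOOT_HEADER_MAGIC
        && (uint32_t)(h->magic + h->flags + h->checksum) == 0u;
}
```

In a real kernel these three words are usually emitted from assembly or a linker-placed section so they land within the first 8 KiB of the image, where the loader searches for them.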

Debug

GDB-attached boots in one command

make run-debug builds with symbols and pauses QEMU on :1234 waiting for a GDB attach. Stepping through ring-0 code is as cheap as target remote :1234.

// layout

From Multiboot entry to HTTP response.

Eight layers, nine source files, top to bottom. Every layer has a single owner; there is no dynamic linking, no module loader, no plugin surface.

+-----------------------------------------------------------+
|  QEMU / Bare Metal  (x86, Multiboot)                      |
+-----------------------------------------------------------+
|  boot.S             Multiboot entry, stack, serial init   |
|  kernel.c           Kernel main, VGA terminal, serial I/O |
|  memory.c           Heap allocator (malloc/free)          |
|  string.c           libc subset (snprintf, memcpy, ...)   |
|  network.c          PCI enumeration + e1000 NIC driver    |
|  http.c / api.c     HTTP server, request routing          |
|  api_v1.c           llama.cpp-compatible REST API         |
|  llm.c              Model loading and inference interface |
+-----------------------------------------------------------+

// api

HTTP endpoints, in ring 0.

The v1 surface mirrors llama.cpp so existing clients work without modification. Routing is implemented in api.c with handlers in api_v1.c.

Method  Path                   Handler                     Status
POST    /v1/completions        handle_v1_completions       wired
POST    /v1/chat/completions   handle_v1_chat_completions  wired
POST    /v1/embeddings         handle_v1_embeddings        wired
GET     /v1/models             handle_v1_models            wired
POST    /tokenize              handle_tokenize             wired
POST    /detokenize            handle_detokenize           wired

// "wired" = the route exists and parses; inference returns a stub until the llama.cpp engine is integrated.
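The table above maps naturally onto a static route array. A minimal sketch, assuming exact-match routing on method and path; the handler names come from the table, but the handler signature, the stub JSON body, and `route_lookup` are hypothetical and not taken from api.c.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handler shape: write a response body, return its length. */
typedef int (*route_handler_t)(char *out, size_t out_len);

/* Every handler is a stub here, mirroring the "wired" status above. */
#define STUB_HANDLER(name)                                          \
    static int name(char *out, size_t out_len) {                    \
        return snprintf(out, out_len,                               \
                        "{\"handler\":\"%s\",\"status\":\"stub\"}", \
                        #name);                                     \
    }

STUB_HANDLER(handle_v1_completions)
STUB_HANDLER(handle_v1_chat_completions)
STUB_HANDLER(handle_v1_embeddings)
STUB_HANDLER(handle_v1_models)
STUB_HANDLER(handle_tokenize)
STUB_HANDLER(handle_detokenize)

struct route {
    const char     *method;
    const char     *path;
    route_handler_t handler;
};

static const struct route routes[] = {
    { "POST", "/v1/completions",      handle_v1_completions      },
    { "POST", "/v1/chat/completions", handle_v1_chat_completions },
    { "POST", "/v1/embeddings",       handle_v1_embeddings       },
    { "GET",  "/v1/models",           handle_v1_models           },
    { "POST", "/tokenize",            handle_tokenize            },
    { "POST", "/detokenize",          handle_detokenize          },
};

/* Exact match on method and path; NULL means the server should answer 404. */
route_handler_t route_lookup(const char *method, const char *path) {
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (strcmp(routes[i].method, method) == 0 &&
            strcmp(routes[i].path, path) == 0)
            return routes[i].handler;
    return NULL;
}
```

A linear scan over six entries costs less than hashing would, and the table stays readable next to the documentation it mirrors.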

// roadmap

What is done, what is next.

Lifted directly from the README. We do not overstate what ships.

  • done Multiboot kernel with VGA + serial output
  • done Custom libc subset (malloc, snprintf, string ops)
  • done PCI enumeration and Intel e1000 NIC driver
  • done HTTP/1.1 server with REST routing
  • done llama.cpp-compatible v1 API skeleton
  • todo Integrate llama.cpp inference engine into the kernel
  • todo GPU passthrough (CUDA backend)
  • todo Streaming token generation
  • todo vLLM-derived optimizations for transformer serving

// honest comparisons

Where cllm sits in the ecosystem.

We compare against the projects whose API surface or design we explicitly borrow from. Pick the right tool for the workload.

// faq

Questions engineers ask first.

+ What is actually running today?

A Multiboot ELF that boots in QEMU or on bare-metal x86, brings up serial and VGA, walks the PCI bus, drives an Intel e1000 NIC, and answers HTTP/1.1 requests through a llama.cpp-shaped v1 API. The inference engine itself is on the roadmap; the kernel and the network stack ship now.

+ Which models will cllm support?

The API surface mirrors llama.cpp, so the natural target is anything llama.cpp loads (GGUF-format weights for Llama-family architectures, Mistral, Qwen, Gemma, Phi, and so on). Until the llama.cpp inference path is wired into the kernel, no specific model has been benchmarked end-to-end.
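For orientation, a GGUF file starts with the four ASCII bytes "GGUF" followed by a little-endian 32-bit version. A hedged sketch of the first check a loader would make on a weight blob; `gguf_version` is a hypothetical helper, nothing like it is claimed to exist in llm.c today, and it avoids libc calls outside the string.c subset listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the GGUF version, or 0 if the buffer does not start with a GGUF header. */
uint32_t gguf_version(const uint8_t *buf, size_t len) {
    static const uint8_t magic[4] = { 'G', 'G', 'U', 'F' };
    if (buf == NULL || len < 8)
        return 0;
    for (size_t i = 0; i < 4; i++)        /* compare the magic byte by byte */
        if (buf[i] != magic[i])
            return 0;
    /* Decode the version explicitly so host endianness does not matter. */
    return (uint32_t)buf[4]
         | ((uint32_t)buf[5] << 8)
         | ((uint32_t)buf[6] << 16)
         | ((uint32_t)buf[7] << 24);
}
```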

+ Which hypervisors are supported?

QEMU is the development and primary supported target. The kernel is Multiboot-compliant, so any Multiboot loader will boot it — that includes GRUB on bare metal. Other hypervisors are unverified.

+ What about GPUs?

GPU support is roadmap, not shipped. The documentation includes an analysis of how a CUDA ggml backend could be integrated into a unikernel (PCIe access, PTX kernel embedding, host/device memory management), but that work is not yet in the build.

+ What is the network stack?

Raw Ethernet frames over the e1000 driver, with a minimal IPv4 + TCP implementation and an HTTP/1.1 subset on top — enough to route the v1 endpoints. There is no socket API in the traditional sense; the server is the packet processing loop.
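Any IPv4 implementation, however minimal, has to compute the Internet checksum (RFC 1071) for IPv4 headers, and TCP reuses the same sum. A self-contained sketch follows; `inet_checksum` is a hypothetical name and is not claimed to match the function in network.c.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, complemented (RFC 1071).
 * The result is the value whose high byte goes first on the wire. */
uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                          /* odd trailing byte: pad with zero */
        sum += (uint32_t)data[i] << 8;

    while (sum >> 16)                     /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

To build a header, zero the checksum field, run the function over the header bytes, and store the result. To verify a received header, run it over the bytes as received; a valid header yields 0.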

+ How do I build and run it?

Install gcc with -m32 support, make, and qemu-system-i386. Clone the repo and run make run; serial output appears on your terminal. Press Ctrl-A, then X, to exit QEMU. Debug, VGA, and GDB-attached variants are also available.

+ Is it production-ready?

No. The infrastructure scaffolding (kernel, drivers, HTTP) is in place; the inference engine is not. Treat cllm as a working substrate that the llama.cpp integration will land on top of.

+ Why a unikernel instead of just a small Linux?

A unikernel removes the kernel/userspace boundary, the scheduler, and every page of the host OS that is not part of the inference path. The target is a single tens-of-kilobytes ELF that boots in milliseconds and spends every cycle on math.

Boot it, break it, read the source.

Clone the repo, run make run, and serial output appears on your terminal in under a second.