Multiboot, PCI, e1000, and an HTTP Server in Ring 0

cllm Team
kernel, networking, engineering

This post traces the boot path of cllm from the first kernel instruction that runs after the bootloader hands off to the moment an HTTP request hits a handler. The point is not to teach Multiboot or PCI from scratch (there is excellent existing material for that) but to show how the pieces compose in our specific repository and what we chose to leave out.

The code we will reference is boot.S, kernel.c, memory.c, string.c, network.c, http.c, api.c, and api_v1.c. Eight files. That is the entire kernel side of the stack.

Stage 0: the Multiboot handoff

A Multiboot-compliant kernel is an ELF with a small header in the first 8 KiB that the bootloader scans for. The header tells GRUB (or QEMU’s built-in loader) “I am a Multiboot kernel, load me at this address, set up flat 32-bit protected mode, and jump here.” That is what boot.S declares. It also reserves a stack — a few KiB of uninitialized memory aligned to 16 bytes — and points esp at it before transferring control to C.
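To make the header concrete, here is a minimal sketch of the three mandatory Multiboot 1 fields, written in C rather than the assembly boot.S uses. The magic value and the checksum rule come from the Multiboot specification; the flag choice and the helper name make_multiboot_header are assumptions for illustration, not what the repository necessarily emits.

```c
#include <stdint.h>

/* Constants defined by the Multiboot 1 specification. */
#define MULTIBOOT_MAGIC      0x1BADB002u
#define MULTIBOOT_PAGE_ALIGN (1u << 0)  /* align loaded modules on 4 KiB pages */
#define MULTIBOOT_MEM_INFO   (1u << 1)  /* ask the loader for memory information */

/* The three mandatory fields the bootloader scans for.
 * The header must be 4-byte aligned and lie in the first 8 KiB of the image. */
struct multiboot_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;
};

/* The checksum is chosen so that magic + flags + checksum == 0 (mod 2^32). */
static uint32_t multiboot_checksum(uint32_t flags)
{
    return (uint32_t)0 - (MULTIBOOT_MAGIC + flags);
}

/* Hypothetical helper: the flag set below is an assumption for illustration. */
static struct multiboot_header make_multiboot_header(void)
{
    uint32_t flags = MULTIBOOT_PAGE_ALIGN | MULTIBOOT_MEM_INFO;
    struct multiboot_header h = { MULTIBOOT_MAGIC, flags, multiboot_checksum(flags) };
    return h;
}
```

The bootloader validates the header by summing the three words and checking for zero, which is why the checksum is stored as a negation rather than computed at load time.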

The boot assembly does one more useful thing before handing off: it initializes the COM1 serial port. The reason is the same reason every kernel does this: if the C code crashes before bringing up VGA, you still want to see the panic. Serial output is cheap, requires no driver state beyond a couple of port I/O writes, and works headlessly under QEMU. make run pipes the serial port directly to your terminal via -serial stdio.
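For readers who have not seen it, the conventional 16550 UART bring-up looks roughly like the sketch below. It is written in C with a recording stand-in for outb so it can run on a host; in the kernel each write would be a single port I/O instruction. The baud rate and register values are the commonly used ones and are an assumption here, not a transcription of boot.S.

```c
#include <stddef.h>
#include <stdint.h>

#define COM1 0x3F8  /* standard I/O base of the first serial port */

/* Host-side stand-in: a real kernel would issue an `outb` instruction.
 * Here each write is recorded so the sequence can be inspected. */
struct port_write { uint16_t port; uint8_t value; };
static struct port_write port_log[16];
static size_t port_log_len;

static void outb(uint16_t port, uint8_t value)
{
    if (port_log_len < sizeof port_log / sizeof port_log[0]) {
        port_log[port_log_len].port = port;
        port_log[port_log_len].value = value;
        port_log_len++;
    }
}

/* Assumed settings: 38400 baud, 8 data bits, no parity, 1 stop bit. */
static void serial_init(void)
{
    outb(COM1 + 1, 0x00); /* disable UART interrupts */
    outb(COM1 + 3, 0x80); /* set DLAB to expose the baud divisor */
    outb(COM1 + 0, 0x03); /* divisor low byte: 115200 / 3 = 38400 baud */
    outb(COM1 + 1, 0x00); /* divisor high byte */
    outb(COM1 + 3, 0x03); /* clear DLAB, select 8N1 */
    outb(COM1 + 2, 0xC7); /* enable and clear FIFOs, 14-byte threshold */
    outb(COM1 + 4, 0x0B); /* assert DTR and RTS, enable OUT2 */
}
```

Seven port writes and no driver state is what makes serial the first thing worth bringing up.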

Once boot.S jumps into C, we are in kernel.c. The kernel main brings up the VGA terminal (the 80x25 text mode framebuffer at physical address 0xb8000), prints a banner over serial, and starts wiring up the rest of the stack.
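The VGA text framebuffer is an array of 16-bit cells, one per character position. A sketch of how a cell is composed and placed follows; the function names are hypothetical, and the framebuffer pointer is a parameter so the code can be exercised on a host instead of writing to 0xb8000 directly.

```c
#include <stddef.h>
#include <stdint.h>

enum { VGA_COLS = 80, VGA_ROWS = 25 };

/* Low byte is the character, high byte is the attribute:
 * background colour in bits 4-7, foreground colour in bits 0-3. */
static uint16_t vga_cell(char c, uint8_t fg, uint8_t bg)
{
    uint8_t attr = (uint8_t)((bg << 4) | (fg & 0x0F));
    return (uint16_t)(((uint16_t)attr << 8) | (uint8_t)c);
}

/* In the kernel, fb would be (volatile uint16_t *)0xb8000. */
static void vga_put_at(volatile uint16_t *fb, size_t row, size_t col,
                       char c, uint8_t fg, uint8_t bg)
{
    if (row < VGA_ROWS && col < VGA_COLS)
        fb[row * VGA_COLS + col] = vga_cell(c, fg, bg);
}
```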

Stage 1: a heap, and just enough libc

Before we can do anything interesting, we need malloc. memory.c defines a statically allocated 4 MB region, a fixed slab inside the kernel’s .bss section, and implements a simple bump-plus-freelist allocator over it. It is not high-performance. It does not have to be. The HTTP server and the network stack make a small number of large allocations at startup and a small number of small allocations per request; that is the entire workload. The day the inference engine lands, we will revisit the arena size and the allocator strategy together.
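One plausible shape for a bump-plus-freelist allocator is sketched below. It is not the code in memory.c: the arena is shrunk to 64 KiB, and the header layout, first-fit policy, and the names kmalloc and kfree are assumptions chosen to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE ((size_t)64 * 1024)  /* the post describes 4 MB; smaller here */
#define ALIGNMENT  ((size_t)16)

struct block_hdr {
    size_t size;             /* payload size in bytes, already aligned */
    struct block_hdr *next;  /* next free block; only meaningful while free */
};

/* Header size rounded up so payloads stay 16-byte aligned. */
#define HDR_SIZE ((sizeof(struct block_hdr) + ALIGNMENT - 1) & ~(ALIGNMENT - 1))

static _Alignas(16) uint8_t arena[ARENA_SIZE];
static size_t bump_offset;           /* first never-used byte of the arena */
static struct block_hdr *free_list;  /* singly linked list of freed blocks */

void *kmalloc(size_t n)
{
    if (n == 0 || n > ARENA_SIZE - HDR_SIZE)
        return NULL;
    n = (n + ALIGNMENT - 1) & ~(ALIGNMENT - 1);

    /* First fit: reuse a freed block that is large enough. */
    for (struct block_hdr **link = &free_list; *link; link = &(*link)->next) {
        if ((*link)->size >= n) {
            struct block_hdr *b = *link;
            *link = b->next;
            return (uint8_t *)b + HDR_SIZE;
        }
    }

    /* Otherwise carve a new block off the end of the used region. */
    if (bump_offset > ARENA_SIZE - HDR_SIZE - n)
        return NULL;
    struct block_hdr *b = (struct block_hdr *)(arena + bump_offset);
    b->size = n;
    b->next = NULL;
    bump_offset += HDR_SIZE + n;
    return (uint8_t *)b + HDR_SIZE;
}

void kfree(void *p)
{
    if (!p)
        return;
    struct block_hdr *b = (struct block_hdr *)((uint8_t *)p - HDR_SIZE);
    b->next = free_list;  /* push onto the free list; no coalescing */
    free_list = b;
}
```

This sketch never splits or coalesces blocks, which is acceptable for the allocation pattern described above and would be the first thing to change under a heavier workload.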

string.c provides the libc subset the rest of the kernel needs: snprintf, memcpy, memset, strncmp, strlen, and a handful of related helpers. None of them are anything you have not seen before. The deliberate choice is what is not in string.c: no printf to stdout (we have kprintf to serial), no malloc (that’s memory.c), no thread-local state (we have no threads). Keeping the libc subset to “the smallest set that compiles the rest of the kernel” is a discipline that has paid off every time we have tried to skip it.

Stage 2: PCI enumeration

network.c opens with PCI bus enumeration. The PCI configuration space is accessed via two I/O ports: write a 32-bit address to 0xCF8, then read or write a 32-bit value at 0xCFC. Walking the bus is a triply-nested loop over bus number, device number, and function number, reading the vendor and device IDs at each slot, and recording any slot whose vendor ID is not 0xFFFF (the “no device here” sentinel).
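The address written to 0xCF8 has a fixed bit layout, and the enumeration loop is short enough to show in full. The sketch below takes the config-space read as a function pointer so it can run against a fake bus on a host; in the kernel that function would write the address to 0xCF8 and read 0xCFC. The names pci_find and fake_pci_read32, and the slot the fake device sits in, are assumptions for illustration.

```c
#include <stdint.h>

/* Configuration mechanism #1 address: enable bit, bus, device, function,
 * and a dword-aligned register offset. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1F) << 11)
         | ((uint32_t)(func & 0x07) << 8)
         | (uint32_t)(offset & 0xFC);
}

/* In the kernel: write `address` to port 0xCF8, then read port 0xCFC. */
typedef uint32_t (*pci_read32_fn)(uint32_t address);

struct pci_location { uint8_t bus, dev, func; };

/* Walk every bus/device/function and stop at the first matching ID pair.
 * Register 0 holds the vendor ID in bits 0-15 and the device ID in bits 16-31. */
static int pci_find(pci_read32_fn read32, uint16_t vendor, uint16_t device,
                    struct pci_location *out)
{
    for (unsigned bus = 0; bus < 256; bus++)
        for (unsigned dev = 0; dev < 32; dev++)
            for (unsigned func = 0; func < 8; func++) {
                uint32_t reg0 = read32(pci_config_address((uint8_t)bus, (uint8_t)dev,
                                                          (uint8_t)func, 0));
                if ((reg0 & 0xFFFF) == 0xFFFF)
                    continue;  /* vendor 0xFFFF: nothing in this slot */
                if ((reg0 & 0xFFFF) == vendor && (reg0 >> 16) == device) {
                    out->bus = (uint8_t)bus;
                    out->dev = (uint8_t)dev;
                    out->func = (uint8_t)func;
                    return 1;
                }
            }
    return 0;
}

/* Host-side fake bus: one device (vendor 0x8086, device 0x100e) at a
 * hypothetical slot 00:03.0, and empty slots everywhere else. */
static uint32_t fake_pci_read32(uint32_t address)
{
    return address == pci_config_address(0, 3, 0, 0) ? 0x100E8086u : 0xFFFFFFFFu;
}
```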

What we look for is one specific device: an Intel e1000 NIC, which in QEMU has vendor 0x8086 and device 0x100e. When we find it, we read its first BAR (base address register 0) to learn where its memory-mapped registers live, claim the device by enabling bus mastering in its command register, and hand off to the e1000 driver.
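Both steps are bit manipulation on values read from configuration space. The following sketch shows how a BAR value is decoded and how the command register value is updated; the offsets and bit positions are from the PCI specification, while the function names are hypothetical and the register reads and writes themselves are left out.

```c
#include <stdint.h>

#define PCI_COMMAND_OFFSET     0x04       /* 16-bit command register */
#define PCI_BAR0_OFFSET        0x10       /* first base address register */
#define PCI_COMMAND_BUS_MASTER (1u << 2)  /* lets the device initiate DMA */

/* Bit 0 of a BAR distinguishes I/O space (1) from memory space (0). */
static int bar_is_mmio(uint32_t bar)
{
    return (bar & 0x1u) == 0;
}

/* For a 32-bit memory BAR the low four bits are type and prefetch flags;
 * masking them off leaves the physical base address. */
static uint32_t bar_mmio_base(uint32_t bar)
{
    return bar & ~0xFu;
}

/* Returns the command value to write back; existing bits are preserved. */
static uint16_t pci_command_enable_bus_master(uint16_t command)
{
    return (uint16_t)(command | PCI_COMMAND_BUS_MASTER);
}
```

Without the bus-master bit the NIC can still be programmed through its registers, but it cannot read or write the descriptor rings in memory, so no packets would move.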

We do not enumerate the full PCI topology and we do not handle PCI-to-PCI bridges. The QEMU default machine has a single PCI bus with the e1000 sitting at a known slot, so there is no reason to do the extra work today. If a contributor needs that, the data structures are sketched for it; we just have not filled them in.

Stage 3: the e1000 driver

The Intel e1000 is one of the most documented NICs on the planet. Intel published the datasheet, and almost every educational kernel uses it as the example NIC because the descriptor ring model is simple, the register layout is stable, and QEMU emulates it well.

Our driver sets up a receive ring and a transmit ring, both with descriptors that point at buffers we allocated from the kernel heap. We program the device’s MAC address, enable receive and transmit, and we are done. The interesting part is what we do not do: we do not implement interrupts. The kernel polls the receive ring in its main loop. This is a perfectly reasonable choice for an inference server, where the latency budget is dominated by model execution, not by NIC interrupt latency, and where we are running on a single core anyway.
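A polled receive path can be sketched as follows. The descriptor layout and the DD and EOP status bits follow Intel's legacy receive descriptor format; the ring size, the rx_ring structure, and the tail field standing in for the RDT register write are assumptions so the loop can run on a host without hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_RING_SIZE       8     /* illustrative; real rings are larger */
#define E1000_RXD_STAT_DD  0x01  /* descriptor done: hardware filled the buffer */
#define E1000_RXD_STAT_EOP 0x02  /* end of packet: last descriptor of a frame */

/* Legacy receive descriptor, 16 bytes. */
struct e1000_rx_desc {
    uint64_t addr;      /* physical address of the receive buffer */
    uint16_t length;    /* bytes written by hardware */
    uint16_t checksum;
    uint8_t  status;
    uint8_t  errors;
    uint16_t special;
};

struct rx_ring {
    struct e1000_rx_desc desc[RX_RING_SIZE];
    uint32_t next;  /* next descriptor software expects hardware to complete */
    uint32_t tail;  /* stand-in for the value written to the RDT register */
};

typedef void (*frame_handler_fn)(uint64_t buffer_addr, uint16_t length);

/* Drain completed descriptors; returns how many were consumed.
 * Called from the main loop instead of from an interrupt handler. */
static unsigned rx_poll(struct rx_ring *r, frame_handler_fn handler)
{
    unsigned consumed = 0;
    while (consumed < RX_RING_SIZE && (r->desc[r->next].status & E1000_RXD_STAT_DD)) {
        struct e1000_rx_desc *d = &r->desc[r->next];
        if ((d->status & E1000_RXD_STAT_EOP) && d->errors == 0 && handler)
            handler(d->addr, d->length);
        d->status = 0;      /* hand the descriptor back to hardware */
        r->tail = r->next;  /* a real driver writes this index to RDT */
        r->next = (r->next + 1) % RX_RING_SIZE;
        consumed++;
    }
    return consumed;
}
```

In a real driver the descriptor array is shared with the device over DMA, so it needs volatile access and a physically contiguous, 16-byte-aligned allocation.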

The driver pushes raw Ethernet frames up to a minimal IPv4 + TCP implementation, which in turn delivers payload bytes to the HTTP parser.

Stage 4: HTTP/1.1, the boring parts

http.c parses HTTP/1.1 requests. We support GET and POST, the request line, header parsing into a small fixed-size table, and a Content-Length-bounded body. No chunked transfer encoding yet. No HTTP/2. No TLS. The parser is iterative — no allocator activity on the hot path beyond what the underlying TCP buffer already does.
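As an illustration of allocation-free parsing, the sketch below handles only the request line and returns pointers into the caller's buffer. It is not the parser in http.c: the struct, the return convention, and the use of the host's string.h (memchr is not among the helpers the post lists for string.c) are all assumptions.

```c
#include <stddef.h>
#include <string.h>

enum http_method { HTTP_GET, HTTP_POST, HTTP_UNSUPPORTED };

struct http_request_line {
    enum http_method method;
    const char *path;  /* points into the caller's buffer, not NUL-terminated */
    size_t path_len;
};

/* Returns the number of bytes consumed (including CRLF),
 * 0 if the line is not complete yet, or -1 if it is malformed. */
static int parse_request_line(const char *buf, size_t len, struct http_request_line *out)
{
    /* Locate the terminating CRLF. */
    size_t eol = 0;
    while (eol + 1 < len && !(buf[eol] == '\r' && buf[eol + 1] == '\n'))
        eol++;
    if (eol + 1 >= len)
        return 0;

    /* METHOD SP PATH SP VERSION */
    const char *sp1 = memchr(buf, ' ', eol);
    if (!sp1)
        return -1;
    const char *path = sp1 + 1;
    const char *sp2 = memchr(path, ' ', (size_t)(buf + eol - path));
    if (!sp2 || sp2 == path)
        return -1;
    size_t version_len = (size_t)(buf + eol - (sp2 + 1));
    if (version_len != 8 || memcmp(sp2 + 1, "HTTP/1.1", 8) != 0)
        return -1;

    size_t method_len = (size_t)(sp1 - buf);
    if (method_len == 3 && memcmp(buf, "GET", 3) == 0)
        out->method = HTTP_GET;
    else if (method_len == 4 && memcmp(buf, "POST", 4) == 0)
        out->method = HTTP_POST;
    else
        out->method = HTTP_UNSUPPORTED;

    out->path = path;
    out->path_len = (size_t)(sp2 - path);
    return (int)(eol + 2);
}
```

Returning 0 for an incomplete line lets the caller keep accumulating TCP payload and retry, which fits a polled main loop.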

api.c is the router. It matches the request path against a small table of routes and dispatches to a handler. The current routing table maps the llama.cpp-shaped v1 endpoints — /v1/completions, /v1/chat/completions, /v1/embeddings, /v1/models, /tokenize, /detokenize — to handlers in api_v1.c.
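A table-driven router of this kind can be sketched in a few lines. The route paths below are taken from the list above; the handler signature, the HTTP method assigned to each route, and the 404 versus 405 distinction are assumptions, and the handlers are empty stubs.

```c
#include <stddef.h>
#include <string.h>

enum http_method { HTTP_GET, HTTP_POST };

/* Hypothetical handler signature: takes the request body, returns a status. */
typedef int (*handler_fn)(const char *body, size_t body_len);

struct route {
    enum http_method method;
    const char *path;
    handler_fn handler;
};

/* Stub handlers standing in for the ones in api_v1.c. */
static int handle_completions(const char *b, size_t n) { (void)b; (void)n; return 200; }
static int handle_models(const char *b, size_t n)      { (void)b; (void)n; return 200; }
static int handle_tokenize(const char *b, size_t n)    { (void)b; (void)n; return 200; }

/* The method assigned to each path is an assumption. */
static const struct route routes[] = {
    { HTTP_POST, "/v1/completions", handle_completions },
    { HTTP_GET,  "/v1/models",      handle_models },
    { HTTP_POST, "/tokenize",       handle_tokenize },
};

/* Exact-match lookup. On failure returns NULL and sets *status to 405 when
 * the path exists under another method, or 404 when it does not exist. */
static handler_fn route_lookup(enum http_method method, const char *path,
                               size_t path_len, int *status)
{
    int path_seen = 0;
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strlen(routes[i].path) != path_len ||
            memcmp(routes[i].path, path, path_len) != 0)
            continue;
        path_seen = 1;
        if (routes[i].method == method) {
            *status = 200;
            return routes[i].handler;
        }
    }
    *status = path_seen ? 405 : 404;
    return NULL;
}
```

A linear scan is adequate for a table this small; the path length is passed explicitly because the parser hands over a slice of the request buffer rather than a NUL-terminated string.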

api_v1.c is, today, the most aspirational file in the kernel. The handlers parse the request body (using a small JSON parser in json.c), validate parameters, and call into llm.c. llm.c is the seam where the inference engine will live. Until the llama.cpp integration lands, the seam returns stub responses that are JSON-valid but not actually inferred.

What we deliberately did not build

We did not build a process model. There is one address space, one thread of control, one main loop. We did not build a filesystem. Models will be either baked into the binary at link time or loaded over HTTP at startup; we do not need persistent storage in the kernel. We did not build interrupts. The main loop polls. We did not build dynamic linking. There are no .so files because there is no userspace for them to live in. We did not build a syscall ABI. There is no userspace to call out from.

Every one of those omissions is a piece of complexity we did not have to write, debug, or carry forward. Some of them will come back when we need them. Most of them, we suspect, will not.

What is next

The next milestone is plugging an inference engine into llm.c. The README and the specification document are explicit about the order: integrate llama.cpp, then add a CUDA backend via the unikernel-flavoured ggml interface sketched in the GPU backend analysis, then port specific optimizations from vLLM.

If you want to read the code, github.com/cognisoc/cllm is the canonical source. The full architectural documentation is at docs.cognisoc.com/cllm. If you have read this far, make run is already faster than reading another paragraph.