The Four-Phase Roadmap: How cllm Becomes a Useful Inference Server

cllm Team
roadmap, architecture, engineering

The cllm specification document is short. Four sentences, four phases. It is short on purpose, because the project does not yet have the credibility to make long promises, but it is also short because the phases are genuinely ordered, and each one is well-defined enough to know when it is done. This post walks through each phase, what it actually requires, and where we are.

Phase 1: a working C unikernel in QEMU targeting x86

This is the phase that ships today. It is also the phase that gets the least attention in conversations because, on its own, it is not commercially useful: a kernel that serves HTTP but does not infer anything is not a product.

That said, “kernel that serves HTTP” is much more substantial than it sounds. It is a Multiboot entry that brings up serial and VGA. It is a 4 MB heap arena with malloc and free. It is a libc subset large enough to write the rest of the kernel against. It is PCI enumeration. It is an Intel e1000 NIC driver. It is a minimal IPv4 + TCP stack. It is an HTTP/1.1 parser and a request router. It is a llama.cpp-shaped v1 API skeleton, with handlers that parse JSON request bodies and validate parameters.
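To make one of those pieces concrete, here is a minimal sketch of the address arithmetic behind PCI enumeration. It is not taken from the cllm source: the function names are hypothetical, and the constants are the standard legacy PCI configuration ports plus the vendor and device IDs of the 82540EM that QEMU exposes as its e1000 device. The port I/O itself is left out so the sketch compiles on any host.

```c
#include <stdint.h>

/* Legacy PCI configuration mechanism #1: write an address to port 0xCF8,
   then read the selected 32-bit register from port 0xCFC. */
#define PCI_CONFIG_ADDRESS 0x0CF8
#define PCI_CONFIG_DATA    0x0CFC

/* IDs of the Intel 82540EM, the NIC QEMU emulates for "-device e1000". */
#define E1000_VENDOR_ID 0x8086u
#define E1000_DEVICE_ID 0x100Eu

/* Build the value written to PCI_CONFIG_ADDRESS to select one register.
   (Hypothetical helper name; the bit layout is the standard one.) */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t offset)
{
    return 0x80000000u                        /* enable bit */
         | ((uint32_t)bus << 16)              /* bus number, 8 bits */
         | ((uint32_t)(dev & 0x1Fu) << 11)    /* device number, 5 bits */
         | ((uint32_t)(func & 0x07u) << 8)    /* function number, 3 bits */
         | (uint32_t)(offset & 0xFCu);        /* dword-aligned register */
}

/* Register 0 holds the vendor ID in the low 16 bits and the device ID
   in the high 16 bits. */
static int pci_is_e1000(uint32_t reg0)
{
    return (reg0 & 0xFFFFu) == E1000_VENDOR_ID
        && (reg0 >> 16) == E1000_DEVICE_ID;
}
```

An enumeration loop would walk every bus, device, and function, write `pci_config_address(bus, dev, func, 0)` to the address port, read the data port, and hand the result to a check like `pci_is_e1000`.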

What it is not is the inference engine itself. The handlers return stubs. The roadmap in the README is explicit about this, and we will be explicit about it everywhere we can.
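As an illustration of what "routes to handlers, handlers return stubs" can look like, here is a small sketch of a route table. Everything in it is an assumption made for the example: the handler signature, the endpoint paths, the 501 status, and the JSON error body are hypothetical and are not taken from the cllm repository.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handler signature: writes a response body into resp and
   returns an HTTP status code. */
typedef int (*handler_fn)(const char *req_body, char *resp, size_t resp_cap);

struct route {
    const char *method;
    const char *path;
    handler_fn  handler;
};

/* Stub handler: the route exists, but no engine is attached behind it. */
static int stub_not_implemented(const char *req_body, char *resp, size_t resp_cap)
{
    static const char msg[] = "{\"error\":\"inference engine not integrated\"}";
    (void)req_body;                 /* a real handler would parse this JSON */
    if (resp_cap < sizeof msg)
        return 500;                 /* response buffer too small */
    memcpy(resp, msg, sizeof msg);  /* copies the terminating NUL as well */
    return 501;
}

/* Assumed endpoint paths, chosen only for illustration. */
static const struct route routes[] = {
    { "POST", "/v1/completions",      stub_not_implemented },
    { "POST", "/v1/chat/completions", stub_not_implemented },
};

/* Linear scan over the table; returns 404 when nothing matches. */
static int dispatch(const char *method, const char *path,
                    const char *body, char *resp, size_t resp_cap)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(routes[i].method, method) == 0 &&
            strcmp(routes[i].path, path) == 0)
            return routes[i].handler(body, resp, resp_cap);
    }
    return 404;
}
```

The point of the shape is that Phase 2 only has to swap the function pointers: the parser, router, and table stay as they are.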

Phase 1 was the right place to start because the substrate has to exist before the engine has somewhere to live. If you want to integrate llama.cpp into a unikernel, you need a unikernel that boots, has memory, has a network, and has a routable API surface. We now have that.

Phase 2: compile llama.cpp into the unikernel

This is the phase that is in active design. The repository already includes the llama.cpp headers in llama.cpp/, the llm.c seam in the kernel, and the JSON-based configuration scaffolding described in the specification. What is missing is the integration itself.

Integrating llama.cpp into a unikernel is not the same as linking a library. llama.cpp depends on the standard C/C++ library, file I/O, threading, and (on most platforms) a heap allocator that knows how to scale into the multi-GB range. None of those exist in cllm out of the box.

The plan, taken from the llama-integration.md design document in the repository, is:

  1. Extract the core llama.cpp components. Remove or replace system dependencies that are not available in the unikernel environment.
  2. Replace standard malloc/free with the unikernel’s heap allocator. This will mean growing the arena from 4 MB to whatever the smallest interesting model needs.
  3. Replace file I/O with direct memory loading of model data. Models will be either baked into the kernel binary at link time or loaded over HTTP at startup. There is no filesystem.
  4. Replace console output with serial/VGA terminal output.
  5. Either implement threading support in the unikernel or use single-threaded mode for initial integration. We are starting with the single-threaded path and will revisit threads when the model execution path is benchmarked.
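Step 3 above can be sketched as a cursor over a model image that is already in memory, offering the same read and seek operations a loader would otherwise get from a file. This is a sketch under stated assumptions, not the design in llama-integration.md: the `mem_file` type and its functions are hypothetical names. The only external fact used is that GGUF files begin with the four ASCII bytes "GGUF".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FILE*: a read cursor over a model image that
   was linked into the kernel or received over HTTP. */
struct mem_file {
    const uint8_t *base;
    size_t         size;
    size_t         pos;
};

static void mem_open(struct mem_file *f, const void *base, size_t size)
{
    f->base = base;
    f->size = size;
    f->pos  = 0;
}

/* fread-like: copies up to n bytes and returns how many were copied. */
static size_t mem_read(struct mem_file *f, void *dst, size_t n)
{
    size_t left = f->size - f->pos;
    if (n > left)
        n = left;
    memcpy(dst, f->base + f->pos, n);
    f->pos += n;
    return n;
}

/* fseek(SEEK_SET)-like: returns 0 on success, -1 if pos is out of range. */
static int mem_seek(struct mem_file *f, size_t pos)
{
    if (pos > f->size)
        return -1;
    f->pos = pos;
    return 0;
}

/* Checks the 4-byte magic at the start of the image. */
static int mem_is_gguf(struct mem_file *f)
{
    uint8_t magic[4];
    if (mem_seek(f, 0) != 0 || mem_read(f, magic, sizeof magic) != sizeof magic)
        return 0;
    return memcmp(magic, "GGUF", 4) == 0;
}
```

Because nothing is copied at open time, tensor data can later be referenced in place from the image rather than duplicated into the heap, which matters when the arena is small.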

The challenges section of the same document is honest about the constraints. System dependencies are extensive. The 4 MB heap is not adequate for real models. The performance work is its own phase, not a side effect of integration. GPU support is a separate phase, intentionally.
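The gap between a 4 MB heap and a real model is easy to put in numbers. The sketch below assumes ggml's block layouts for two of its quantization formats (32 weights per block, stored in 18 bytes for Q4_0 and 34 bytes for Q8_0) and uses a 1-billion-parameter model purely as an illustrative size. The source does not say which model counts as the smallest interesting one.

```c
#include <stdint.h>

#define CLLM_HEAP_BYTES (4u * 1024u * 1024u)   /* the Phase 1 arena: 4 MB */

/* ggml block layouts: 32 weights per block plus a 16-bit scale.
   Q4_0: 2 + 16 bytes of packed 4-bit values = 18 bytes per block.
   Q8_0: 2 + 32 bytes of 8-bit values        = 34 bytes per block. */
enum { QK = 32, Q4_0_BLOCK_BYTES = 18, Q8_0_BLOCK_BYTES = 34 };

/* Bytes needed to hold n_weights in a block-quantized format. */
static uint64_t quantized_bytes(uint64_t n_weights, uint64_t block_bytes)
{
    uint64_t blocks = (n_weights + QK - 1) / QK;   /* round up to whole blocks */
    return blocks * block_bytes;
}

/* Whether a weight buffer of this size fits in the Phase 1 heap. */
static int fits_in_heap(uint64_t bytes)
{
    return bytes <= CLLM_HEAP_BYTES;
}
```

For 1,000,000,000 weights, `quantized_bytes` gives 562,500,000 bytes at Q4_0 and 1,062,500,000 bytes at Q8_0, against a heap of 4,194,304 bytes. Run the other way, 4 MB holds about 7.4 million Q4_0 weights, before any space is set aside for the KV cache or activations.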

The first working benchmark we will publish is “smallest interesting GGUF model running in cllm, in QEMU, on CPU, with serial output.” That is not a competitive number. It is a credibility number. It says: the substrate did its job and the engine is on it.

Phase 3: GPU support, starting with CUDA

GPU support in a unikernel is a meaningfully harder problem than CPU inference, and the design document is correspondingly more detailed than the others.

The summary is: llama.cpp uses ggml as its computational backend, ggml has a CUDA backend, and the CUDA backend depends on the CUDA userspace driver, which depends on a host OS. None of that runs on cllm.

The integration plan, sketched in gpu-backend.md, is to write a “unikernel CUDA driver interface” — a minimal interface that handles device discovery via PCIe, GPU memory allocation, host-to-device and device-to-host memory copies, and PTX kernel loading. Above that interface, we wire a unikernel-specific ggml backend that mirrors the upstream CUDA backend’s contract. Above that, llama.cpp does not need to know it is running on a unikernel — it just sees a ggml backend that says “yes, I am CUDA, here is your tensor.”
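One way to picture that driver interface is as a table of function pointers covering the four responsibilities listed above. The sketch below is hypothetical throughout: the type and function names are invented for illustration and are not taken from gpu-backend.md, and the "device" is a small host-memory fake, included only so the interface can be exercised without hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t gpu_addr_t;   /* device address, opaque to callers; 0 = failure */

/* Hypothetical driver interface, one entry per responsibility. */
struct gpu_driver_ops {
    int        (*discover)(void *ctx);                       /* PCIe discovery */
    gpu_addr_t (*mem_alloc)(void *ctx, size_t bytes);
    int        (*copy_h2d)(void *ctx, gpu_addr_t dst, const void *src, size_t n);
    int        (*copy_d2h)(void *ctx, void *dst, gpu_addr_t src, size_t n);
    int        (*load_ptx)(void *ctx, const char *ptx, size_t len);
};

/* What a ggml-style backend might do to place a tensor on the device. */
static int gpu_upload(const struct gpu_driver_ops *ops, void *ctx,
                      const void *src, size_t n, gpu_addr_t *out)
{
    gpu_addr_t dst = ops->mem_alloc(ctx, n);
    if (dst == 0 || ops->copy_h2d(ctx, dst, src, n) != 0)
        return -1;
    *out = dst;
    return 0;
}

/* Host-memory fake: 256 bytes of "device memory" with bump allocation. */
struct fake_gpu { uint8_t mem[256]; size_t used; };

static int fake_discover(void *ctx) { (void)ctx; return 0; }

static gpu_addr_t fake_alloc(void *ctx, size_t bytes)
{
    struct fake_gpu *g = ctx;
    gpu_addr_t addr;
    if (bytes > sizeof g->mem - g->used)
        return 0;
    addr = g->used + 1;            /* offset by 1 so 0 stays "failure" */
    g->used += bytes;
    return addr;
}

static int fake_h2d(void *ctx, gpu_addr_t dst, const void *src, size_t n)
{
    struct fake_gpu *g = ctx;
    if (dst == 0 || (dst - 1) + n > g->used)
        return -1;                 /* outside allocated device memory */
    memcpy(g->mem + (dst - 1), src, n);
    return 0;
}

static int fake_d2h(void *ctx, void *dst, gpu_addr_t src, size_t n)
{
    struct fake_gpu *g = ctx;
    if (src == 0 || (src - 1) + n > g->used)
        return -1;
    memcpy(dst, g->mem + (src - 1), n);
    return 0;
}

static int fake_load_ptx(void *ctx, const char *ptx, size_t len)
{
    (void)ctx; (void)ptx; (void)len;
    return -1;                     /* the fake cannot run kernels */
}

static const struct gpu_driver_ops fake_ops = {
    fake_discover, fake_alloc, fake_h2d, fake_d2h, fake_load_ptx
};
```

Keeping the interface this narrow means the layers above it can be developed and tested against a fake like this one while the real PCIe and firmware work is still in progress.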

The hardware question is the part that is not yet pinned down. The design document targets x86 + NVIDIA, supports the Maxwell-through-Ada-Lovelace generations, and lists the quantization formats the CUDA backend supports (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, F16, F32, plus some IQ formats). What we do not have is a specific GPU we have run cllm against. That will come during Phase 3 implementation, not before.

We want to be careful here. There are interesting and unsolved problems in the unikernel GPU world — PCIe access from a unikernel running on QEMU is meaningfully different from PCIe access from a unikernel running on bare metal, and bare-metal access to a real NVIDIA GPU brings its own driver and firmware questions. Phase 3 is an ambitious phase. We expect to revise the design document several times as we hit the actual hardware.

Phase 4: vLLM-derived optimizations

The fourth phase is the throughput phase. vLLM is a widely used reference for high-throughput transformer serving, and it earns that reputation through specific implementation choices: PagedAttention for the KV cache, continuous batching for request scheduling, tensor parallelism for cross-GPU sharding, and a deep set of CUDA-level optimizations for the math kernels.

Not all of those translate cleanly into the cllm context. PagedAttention assumes an allocator that hands out fixed-size KV-cache blocks behind page-table-style indirection; cllm’s heap is flat. Tensor parallelism assumes multiple GPUs and a fabric between them; cllm does not yet have a story for either. Continuous batching, on the other hand, is mostly a scheduler change, and a scheduler in cllm is just a piece of the main loop we have not written yet.
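To show why continuous batching is mostly a scheduler change, here is a minimal sketch of one main-loop iteration: admit waiting requests into free slots, decode one token for every active slot, and retire requests the moment they finish. It is an illustration under stated assumptions, not cllm code: the slot count, the struct names, and the use of a countdown in place of a real forward pass are all hypothetical.

```c
#include <stddef.h>

#define MAX_SLOTS 2   /* hypothetical batch width */

struct request {
    int id;
    int tokens_left;  /* tokens still to generate; must start at 1 or more */
};

struct scheduler {
    struct request        slots[MAX_SLOTS];
    int                   active[MAX_SLOTS];  /* 1 if the slot holds a request */
    const struct request *queue;              /* pending requests, arrival order */
    size_t                queue_len;
    size_t                next;               /* index of next request to admit */
    int                   finished;
};

/* One iteration of the main loop. Returns the number of tokens decoded,
   which is 0 once the queue is drained and every slot is idle. */
static int sched_step(struct scheduler *s)
{
    int decoded = 0;
    for (int i = 0; i < MAX_SLOTS; i++) {
        /* Admit: fill a free slot right away instead of waiting for the
           whole batch to drain. */
        if (!s->active[i] && s->next < s->queue_len) {
            s->slots[i] = s->queue[s->next++];
            s->active[i] = 1;
        }
        if (!s->active[i])
            continue;

        /* Decode: stand-in for one forward pass producing one token. */
        s->slots[i].tokens_left--;
        decoded++;

        /* Retire: free the slot as soon as the request completes. */
        if (s->slots[i].tokens_left == 0) {
            s->active[i] = 0;
            s->finished++;
        }
    }
    return decoded;
}
```

With two slots and three requests needing 3, 1, and 2 tokens, this loop finishes in 3 steps, because the third request takes over the second slot as soon as the 1-token request retires. Draining the first batch before admitting the third request would take 5 steps.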

The fourth-phase commitment is therefore narrower than “port vLLM.” It is “identify which vLLM optimizations are tractable in the cllm context, and bring those into the bare-metal inference path.” Continuous batching is the first candidate. Speculative decoding is the second. Kernel fusion for the small set of operators that dominate the inference time is the third.

We will publish Phase 4 numbers when there are Phase 4 numbers. Until then, the right way to read the roadmap is: Phase 1 ships, Phase 2 is in active design, Phase 3 is sketched, Phase 4 is a stated direction.

What is honest about each phase

Every roadmap is a promise. We want ours to be readable as a promise.

Phase 1 is shipped. The kernel boots. The HTTP server answers requests. The v1 endpoints route to handlers. The handlers return stubs.

Phase 2 is in design. The integration plan is documented. No model has been inferred end-to-end in cllm yet.

Phase 3 is sketched. The CUDA backend design exists; no GPU has been touched by cllm code yet.

Phase 4 is a direction. The vLLM playbook is well understood; the porting work is ahead of us.

If you read the README and the specification and arrive at “the kernel works; the inference engine does not yet” — you have read it correctly. The whole project is at github.com/cognisoc/cllm. The documentation is at docs.cognisoc.com/cllm. The roadmap is real and we will report against it.