← back home · compare
cllm vs vLLM
Python inference server for GPU clusters
vLLM is a production GPU serving framework. cllm is a research-grade C unikernel that intends to absorb specific vLLM optimizations (continuous batching, kernel fusion) into the bare-metal path.
| Feature | cllm | vLLM | Advantage |
|---|---|---|---|
| Maturity | Substrate ships; inference is roadmap | Production-ready, widely deployed | vLLM |
| Language | C kernel, small Zig support | Python wrapping CUDA/C++ kernels | Comparable |
| Host operating system | None (unikernel) | Linux (CUDA userspace required) | Comparable |
| GPU support | Roadmap (CUDA design analysis) | CUDA, plus AMD/Intel via vLLM backends | vLLM |
| Continuous batching | Roadmap — port from vLLM playbook | Core feature, mature | vLLM |
| PagedAttention | Not planned in current scope | Native | vLLM |
| Memory footprint | Single-digit MB today | GB-class with model + Python + CUDA | cllm |
| Deployment unit | One Multiboot ELF | Container with Python runtime | cllm |
| Target hardware | x86 i386 (QEMU + bare-metal) | x86_64 Linux + GPU | vLLM |
| OpenAI-compatible API | llama.cpp-shaped v1 surface | Native OpenAI-compatible server | vLLM |
Pick cllm when
- ▸You want every cycle and every page accounted for, end to end
- ▸You are building inference appliances rather than fleet-scale GPU serving
- ▸You care about deterministic boot and a single-ELF deployment story
- ▸You want to experiment with ring-0 inference paths without a host kernel in the way
Pick vLLM when
- ▸You need to serve a transformer in production on a GPU cluster today
- ▸PagedAttention, continuous batching, and tensor parallelism are required, not aspirational
- ▸You run on x86_64 Linux with NVIDIA GPUs and a Python toolchain you already trust
- ▸You need OpenAI-compatible serving, structured outputs, and the broader vLLM feature surface
Still deciding?
cllm and vLLM solve different layers of the same problem. Reading the source for both is the fastest way to know which one belongs in your stack.