Skip to content
cllm Source

← back home · compare

cllm vs vLLM

Python inference server for GPU clusters

vLLM is a production GPU serving framework. cllm is a research-grade C unikernel that intends to absorb specific vLLM optimizations (continuous batching, kernel fusion) into the bare-metal path.

Feature cllm vLLM Advantage
Maturity Substrate ships; inference is roadmap Production-ready, widely deployed vLLM
Language C kernel, small Zig support Python wrapping CUDA/C++ kernels Comparable
Host operating system None (unikernel) Linux (CUDA userspace required) Comparable
GPU support Roadmap (CUDA design analysis) CUDA, plus AMD/Intel via vLLM backends vLLM
Continuous batching Roadmap — port from vLLM playbook Core feature, mature vLLM
PagedAttention Not planned in current scope Native vLLM
Memory footprint Single-digit MB today GB-class with model + Python + CUDA cllm
Deployment unit One Multiboot ELF Container with Python runtime cllm
Target hardware x86 i386 (QEMU + bare-metal) x86_64 Linux + GPU vLLM
OpenAI-compatible API llama.cpp-shaped v1 surface Native OpenAI-compatible server vLLM

Pick cllm when

  • You want every cycle and every page accounted for, end to end
  • You are building inference appliances rather than fleet-scale GPU serving
  • You care about deterministic boot and a single-ELF deployment story
  • You want to experiment with ring-0 inference paths without a host kernel in the way

Pick vLLM when

  • You need to serve a transformer in production on a GPU cluster today
  • PagedAttention, continuous batching, and tensor parallelism are required, not aspirational
  • You run on x86_64 Linux with NVIDIA GPUs and a Python toolchain you already trust
  • You need OpenAI-compatible serving, structured outputs, and the broader vLLM feature surface

Still deciding?

cllm and vLLM solve different layers of the same problem. Reading the source for both is the fastest way to know which one belongs in your stack.