

Software Application · updated Wed Jun 10 2026

llama.cpp

The on-device / local inference engine: the consumer-hardware counterpart to vllm’s datacenter serving. llama.cpp is an open-source C/C++ library with no dependencies, created by Georgi Gerganov in March 2023 and now “the de-facto standard core of almost all local inference tools, including Ollama and LM Studio.” Where vllm optimizes GPU-cluster throughput (continuous-batching + PagedAttention), llama.cpp exists to run a model on the machine in front of you. It was originally CPU-only; GPU and NPU support were added later.

GGUF + quantization

llama.cpp’s lever is quantization. Its GGUF (“GGML Universal File”) format, introduced in August 2023, packs 2-bit through 8-bit integer types (plus fp16/fp32) into a single binary tuned for fast loading. That is what lets a multi-billion-parameter model fit in consumer RAM/VRAM. llama.cpp is co-developed with GGML, a general-purpose tensor library that targets x86 and ARM CPUs as well as Metal, CUDA, and Vulkan, and that breadth is what makes “run anywhere” real.
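As a rough illustration of how block-wise integer quantization shrinks weights, the Python sketch below quantizes one block of 32 floats to signed 8-bit integers that share a single fp16 scale. The block size, the function names (quantize_block, dequantize_block, pack_block), and the byte layout are assumptions chosen for illustration, loosely modelled on simple 8-bit block schemes. This is not llama.cpp’s code and not the GGUF on-disk layout.

```python
import math
import struct

BLOCK_SIZE = 32  # weights per block (an assumption for this sketch)


def quantize_block(values):
    """Map one block of floats to int8 values plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the scale."""
    return [scale * q for q in quants]


def pack_block(scale, quants):
    """Serialize as a 2-byte fp16 scale followed by 32 signed bytes."""
    return struct.pack("<e32b", scale, *quants)


# Hypothetical weights, generated deterministically for the example.
weights = [0.1 * math.sin(i) for i in range(BLOCK_SIZE)]

scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
packed = pack_block(scale, quants)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
bits_per_weight = len(packed) * 8 / BLOCK_SIZE

print(f"packed block: {len(packed)} bytes, {bits_per_weight} bits per weight")
print(f"max reconstruction error: {max_error:.6f} (scale = {scale:.6f})")
```

Under these assumptions a block takes 34 bytes for 32 weights, or 8.5 bits per weight against 16 for fp16. Lower-bit types follow the same idea and accept more reconstruction error in exchange for a smaller footprint.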

Why it matters to the wiki

llama.cpp represents the second serving regime. The founding sources and vllm describe datacenter inference (maximize GPU utilization across many concurrent users); llama.cpp shows the single-user / edge regime, where the binding constraint is fitting the model into limited memory at all, a problem solved by quantization rather than batching. It makes the spoke’s pipeline concrete at both ends of the hardware spectrum: pack-the-GPU (continuous-batching, vllm) vs. fit-on-the-laptop (GGUF quantization, llama.cpp). It also grounds quantization in a flagship real-world deployment rather than presenting it only as a technique.
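The “fit in memory at all” constraint can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and a hypothetical 8 GiB memory budget, and it counts weights only; the KV cache, activations, and runtime overhead are ignored, and real quantized formats add a small per-block overhead for scales. The helper names (weight_bytes, fits) are illustrative.

```python
GIB = 2**30


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given bit width."""
    return n_params * bits_per_weight / 8


def fits(n_params, bits_per_weight, budget_bytes):
    """True if the weights alone fit inside the memory budget."""
    return weight_bytes(n_params, bits_per_weight) <= budget_bytes


N_PARAMS = 7_000_000_000  # hypothetical model size
BUDGET_BYTES = 8 * GIB    # hypothetical laptop memory budget

report = {}
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    size = weight_bytes(N_PARAMS, bits)
    report[label] = (size / GIB, fits(N_PARAMS, bits, BUDGET_BYTES))
    print(f"{label:>5}: {size / GIB:6.2f} GiB, fits in 8 GiB: {report[label][1]}")
```

In this toy setting the fp16 weights exceed the budget while the 8-bit and smaller variants fit, which is the sense in which quantization, not batching, decides whether the edge regime is possible at all.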

quantization · vllm · kv-cache · llm-inference · continuous-batching