Benchmarking Open-Source LLM Inference Servers: vLLM, TGI, and Ollama

A practical comparison framework for benchmarking vLLM, TGI, and Ollama, and what each is actually optimized for.

By perf-test.com Editorial · AI-assisted
Tags: vllm, tgi, ollama, benchmarking

vLLM, Hugging Face’s TGI (Text Generation Inference), and Ollama all serve LLMs, but they are optimized for meaningfully different use cases. A benchmark that pits them against each other without accounting for that difference produces a comparison that doesn’t answer a useful question.

What each is actually optimized for

  • vLLM — optimized for high-throughput production serving, with PagedAttention (efficient KV cache management, covered in this site’s continuous batching article) and continuous batching as core design features. Targets server/datacenter deployment serving many concurrent requests efficiently.
  • TGI — Hugging Face’s production serving solution, also supporting continuous batching and a broad range of model architectures, with strong integration into the Hugging Face ecosystem (model hub, tooling) as a notable differentiator.
  • Ollama — optimized for ease of local/single-user deployment (a developer’s laptop, a small self-hosted setup), prioritizing simple setup and broad model format support over maximum multi-request concurrent throughput. Built on llama.cpp under the hood, which itself is heavily optimized for running efficiently on consumer/CPU hardware, not necessarily for datacenter-scale concurrent serving.

Why a naive single-request latency benchmark misleads

If you benchmark “time to generate 100 tokens for one request,” all three may perform comparably on the same model and hardware. A single request with no concurrency doesn’t exercise continuous batching or concurrent-request scheduling, which are what differentiate vLLM and TGI from Ollama in production serving. The differences show up under concurrent load, so measure throughput (tokens/sec) and p95 latency across a sweep of concurrency levels (the methodology covered in this site’s LLM API load testing article) rather than reporting a single-request number.
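
A concurrency sweep of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a full load-testing tool: `send_request`, `run_level`, `sweep`, and `fake_backend` are hypothetical names introduced here, and `fake_backend` is a stand-in that sleeps for 10 ms instead of calling a real server. Latency is timed from the moment a request gets a concurrency slot, so queueing inside the harness is not counted.

```python
import asyncio
import math
import time
from typing import Awaitable, Callable, Dict, List

# Hypothetical contract: an async callable that sends one request and
# returns the number of tokens generated. Swap in a real client here.
RequestFn = Callable[[], Awaitable[int]]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def run_level(
    send_request: RequestFn, concurrency: int, total_requests: int
) -> Dict[str, float]:
    """Issue total_requests with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies: List[float] = []
    tokens = 0

    async def one_request() -> None:
        nonlocal tokens
        async with semaphore:
            start = time.perf_counter()
            generated = await send_request()
            latencies.append(time.perf_counter() - start)
            tokens += generated

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(total_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "tokens_per_sec": tokens / wall,
        "p95_latency_s": percentile(latencies, 95),
    }


async def sweep(
    send_request: RequestFn, levels: List[int], requests_per_level: int
) -> List[Dict[str, float]]:
    """Run each concurrency level in turn and collect one row per level."""
    return [await run_level(send_request, c, requests_per_level) for c in levels]


async def fake_backend() -> int:
    """Stand-in for a real server (assumption): 10 ms per request, 100 tokens."""
    await asyncio.sleep(0.01)
    return 100


if __name__ == "__main__":
    for row in asyncio.run(sweep(fake_backend, [1, 4, 16], 32)):
        print(row)
```

Because the fake backend handles every request independently, its throughput scales almost linearly with concurrency. A real server will flatten out at some point, and locating that point for each tool is the purpose of the sweep.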

Hardware and deployment context matters enormously

Ollama is commonly run on consumer GPUs or even CPU-only setups; vLLM and TGI are commonly run on datacenter GPUs (A100s, H100s, and similar) with deployment patterns assuming dedicated, often multi-GPU infrastructure. A benchmark comparing all three on identical hardware answers a narrower, more academic question than one comparing each in its typical deployment context. Both comparisons can be useful, but know which one you’re running and what question it answers.

Quantization support varies

llama.cpp/Ollama has historically had particularly strong support for a wide range of quantization formats optimized for CPU and consumer GPU inference (covered in this site’s quantization article). Quantization support in vLLM and TGI has matured significantly, but the specific formats supported and the maturity of the kernel-level optimizations can differ. If quantized serving is a requirement, check current format support directly rather than assuming parity across all three.

A fair benchmarking checklist

  1. Match the deployment context to each tool’s actual target use case (don’t run Ollama’s intended local-single-user scenario against vLLM’s intended high-concurrency datacenter scenario and call it a fair comparison).
  2. Sweep concurrency, not just single-request latency.
  3. Hold the model and quantization format constant across tools where possible, to isolate the serving engine’s contribution specifically.
  4. Report both throughput and latency at each concurrency level, not just one or the other.
  5. Be explicit about hardware used — results don’t transfer across meaningfully different GPU/CPU configurations.
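
Parts of this checklist can be enforced mechanically when results are recorded. The sketch below is one possible shape, with every name (`BenchmarkRun`, `checklist_warnings`, the field names) invented for illustration. It covers items 2 through 5: requiring both metric fields addresses item 4, and the function flags the rest. Item 1 is a judgment call that code cannot check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkRun:
    """One measurement: a tool at one concurrency level (hypothetical schema)."""
    tool: str
    model: str
    quantization: str
    hardware: str
    concurrency: int
    tokens_per_sec: float   # throughput and latency are both required fields,
    p95_latency_s: float    # so a run cannot be recorded with only one of them


def checklist_warnings(runs: List[BenchmarkRun]) -> List[str]:
    """Return human-readable warnings for checklist items 2, 3, and 5."""
    warnings: List[str] = []

    # Item 3: model and quantization should be held constant across tools.
    if len({(r.model, r.quantization) for r in runs}) > 1:
        warnings.append("model or quantization differs across runs")

    # Item 2: each tool needs more than one concurrency level to form a sweep.
    for tool in sorted({r.tool for r in runs}):
        levels = {r.concurrency for r in runs if r.tool == tool}
        if len(levels) < 2:
            warnings.append(f"{tool}: only one concurrency level measured")

    # Item 5: hardware must be stated for every run.
    if any(not r.hardware.strip() for r in runs):
        warnings.append("hardware not recorded for every run")

    return warnings
```

Calling `checklist_warnings` on the full list of runs before publishing a comparison returns an empty list when these three items are satisfied.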

A practical decision guide, not just a benchmark exercise

For production, multi-user serving at scale: vLLM or TGI, chosen based on ecosystem fit and specific model support needs. For local development, prototyping, or single-user self-hosted use: Ollama’s simplicity is usually the more practical win. A raw concurrent-throughput benchmark would likely favor the other two, but that throughput isn’t the requirement for this use case.

Takeaway: vLLM, TGI, and Ollama solve different problems — benchmark each in the deployment context it’s actually designed for, and sweep concurrency rather than relying on single-request latency, or the comparison won’t reflect what actually differentiates them.
