GPU Utilization for LLM Model Serving: What to Actually Measure
Why GPU utilization percentage alone is a misleading metric for LLM serving, and what to measure instead to understand real efficiency.
“GPU utilization” as reported by nvidia-smi or similar tools is one of the most commonly misread metrics in model serving. A GPU can show 100% utilization while doing far less useful work than it could, because the metric measures whether any kernel is running, not how efficiently that kernel uses the hardware’s compute capacity.
What GPU utilization percentage actually measures
The standard utilization metric reports the percentage of time over a sampling window during which at least one GPU kernel was executing. It says nothing about whether that kernel occupies a small or large fraction of the available compute units (SMs), or whether it is compute-bound or memory-bandwidth-bound. A GPU can report 100% utilization while running a memory-bound, poorly batched operation that leaves most of its compute throughput unused.
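To make the gap concrete, here is a minimal sketch that computes both views of the same sampling window. The kernel timeline, the `sm_fraction` field, and both function names are hypothetical illustrations, not output from nvidia-smi or any real profiler, and the kernels are assumed not to overlap.

```python
def kernel_active_fraction(kernels, window_s):
    """Share of the window during which a kernel was running.

    kernels: list of (start_s, end_s, sm_fraction) tuples, assumed
    non-overlapping and inside the window (a simplification).
    """
    busy_s = sum(end - start for start, end, _ in kernels)
    return busy_s / window_s


def sm_weighted_fraction(kernels, window_s):
    """Time-averaged share of SMs that were actually occupied."""
    occupied = sum((end - start) * sm for start, end, sm in kernels)
    return occupied / window_s


# Hypothetical 1-second window: two back-to-back kernels, each on 10% of SMs.
window = [(0.0, 0.5, 0.10), (0.5, 1.0, 0.10)]
print(f"reported-style utilization: {kernel_active_fraction(window, 1.0):.0%}")
print(f"SM-weighted usage:          {sm_weighted_fraction(window, 1.0):.0%}")
```

In this made-up window the first measure comes out at 100% and the second at 10%, which is the kind of mismatch the utilization percentage hides.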
Memory bandwidth vs compute-bound: the more useful distinction
LLM inference’s decode phase (generating one token at a time) is typically memory-bandwidth-bound, not compute-bound — each decode step does relatively little arithmetic per byte of model weight and KV cache read from memory, meaning the bottleneck is often how fast data moves from GPU memory to compute units, not how fast the compute units themselves can do arithmetic. Prefill (processing the full input prompt at once) is generally more compute-bound, since it processes many tokens in parallel per forward pass. Understanding which phase dominates your workload changes what actually helps: more compute (a faster GPU generation) helps prefill-heavy workloads more; more memory bandwidth helps decode-heavy workloads more.
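A rough roofline-style check illustrates the distinction. The sketch below assumes a dense model costing about 2 FLOPs per parameter per token, with weights read once per forward pass, and it ignores KV cache and activation traffic. The hardware figures and function names are hypothetical, chosen only to show the shape of the calculation.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and weights read once per
    pass, so the parameter count cancels. KV cache and activations are
    ignored (a simplification).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass, bytes_per_param, peak_flops, mem_bandwidth):
    """Compare arithmetic intensity with the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    ai = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if ai >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth.
PEAK, BW = 1e15, 2e12
print("decode, 1 token, 16-bit weights:", classify(1, 2, PEAK, BW))
print("prefill, 2048 tokens, 16-bit weights:", classify(2048, 2, PEAK, BW))
```

Under these assumptions a single-token decode step sits far below the ridge point, while a 2048-token prefill pass sits above it.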
Why batching is the real lever for GPU efficiency
Decode is memory-bandwidth-bound for a single request, but within a batch the model weights are read from memory once per step and reused across every request’s decode step. Increasing batch size therefore raises the amount of useful compute done per byte of weights read. This is the underlying hardware reason continuous batching (covered in this site’s dedicated article) improves throughput so significantly: it amortizes the memory-bandwidth cost of reading model weights across more simultaneous useful work.
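The amortization effect can be sketched with a toy step-time model: each decode step takes whichever is longer, reading the weights once or doing the arithmetic for the whole batch. The model size and hardware numbers are hypothetical, and per-request KV cache reads are left out, so real throughput flattens earlier than this suggests.

```python
def decode_step_time(batch, params, bytes_per_param, peak_flops, mem_bandwidth):
    """Lower bound on one decode step under a simple roofline assumption."""
    weight_read_s = params * bytes_per_param / mem_bandwidth  # once per step
    compute_s = 2.0 * params * batch / peak_flops  # ~2 FLOPs/param/token
    return max(weight_read_s, compute_s)


def decode_tokens_per_second(batch, params, bytes_per_param, peak_flops, mem_bandwidth):
    """Each step emits one token per sequence in the batch."""
    step_s = decode_step_time(batch, params, bytes_per_param, peak_flops, mem_bandwidth)
    return batch / step_s


# Hypothetical setup: 7e9 parameters, 16-bit weights, 1e15 FLOP/s, 2e12 bytes/s.
CFG = dict(params=7e9, bytes_per_param=2, peak_flops=1e15, mem_bandwidth=2e12)
for batch in (1, 8, 64, 512, 1024):
    print(batch, round(decode_tokens_per_second(batch, **CFG)))
```

In this model throughput grows roughly linearly with batch size while the step is dominated by the weight read, then levels off once the arithmetic takes longer than the read.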
Metrics worth tracking instead of (or alongside) raw utilization percentage
- GPU memory bandwidth utilization: a more meaningful efficiency signal for decode-heavy LLM serving than compute utilization percentage alone.
- Tokens generated per GPU-second: a directly business-relevant efficiency metric, comparable across different batch sizes and configurations.
- Batch size achieved in practice: under continuous batching, the actual average batch size over time shows whether you are reaching the throughput efficiency the serving engine is theoretically capable of, or leaving capacity on the table because of insufficient request volume or overly conservative maximum-batch-size limits.
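The last two metrics can be derived from per-step records if your serving engine exposes them. The sketch below assumes a hypothetical log of `(step_duration_s, batch_size)` pairs where each step emits one token per sequence in the batch; the record shape and function names are illustrative, not any engine's actual API.

```python
def tokens_per_gpu_second(steps, wall_clock_s, num_gpus=1):
    """Tokens generated divided by provisioned GPU time.

    wall_clock_s covers the whole observation window, idle gaps included,
    because idle GPUs still cost GPU-seconds.
    """
    tokens = sum(batch for _, batch in steps)
    return tokens / (wall_clock_s * num_gpus)


def time_weighted_batch_size(steps):
    """Average batch size, weighting each step by how long it ran."""
    total_s = sum(duration for duration, _ in steps)
    return sum(duration * batch for duration, batch in steps) / total_s


# Hypothetical decode log: (step_duration_s, batch_size).
steps = [(0.01, 4), (0.01, 8), (0.02, 2)]
print(tokens_per_gpu_second(steps, wall_clock_s=0.05))
print(time_weighted_batch_size(steps))
```

Weighting by duration matters because a few long steps at a small batch size can dominate GPU time even when most steps run at a large one.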
Multi-GPU and tensor parallelism considerations
For models too large for a single GPU’s memory, tensor or pipeline parallelism splits the model across multiple GPUs. This introduces inter-GPU communication overhead (similar in spirit to the coherency penalty covered in this site’s Universal Scalability Law article) that can itself become a bottleneck if the interconnect (NVLink, InfiniBand) isn’t fast enough relative to the compute being parallelized. As a result, “more GPUs” doesn’t always translate to proportionally more throughput.
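A toy model shows why scaling falls short of linear. It assumes compute divides evenly across GPUs, communication is not overlapped with compute, and the bytes exchanged per step do not depend on the GPU count. All three are simplifications, and every number and name here is hypothetical.

```python
def tp_step_time(single_gpu_compute_s, num_gpus, comm_bytes, link_bandwidth):
    """Step time with compute split across GPUs plus a communication cost."""
    compute_s = single_gpu_compute_s / num_gpus
    # A single GPU has nothing to synchronize with.
    comm_s = 0.0 if num_gpus == 1 else comm_bytes / link_bandwidth
    return compute_s + comm_s


def scaling_efficiency(single_gpu_compute_s, num_gpus, comm_bytes, link_bandwidth):
    """Achieved speedup divided by the ideal speedup (num_gpus)."""
    t1 = tp_step_time(single_gpu_compute_s, 1, comm_bytes, link_bandwidth)
    tn = tp_step_time(single_gpu_compute_s, num_gpus, comm_bytes, link_bandwidth)
    return (t1 / tn) / num_gpus


# Hypothetical step: 8 ms of compute on one GPU, 1e8 bytes exchanged per step.
for link_bw in (1e11, 1e12):  # bytes/s, slower vs faster interconnect
    print(link_bw, round(scaling_efficiency(0.008, 4, 1e8, link_bw), 3))
```

With the slower link in this example, four GPUs deliver roughly two thirds of the ideal speedup; the faster link recovers most of the gap.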
Quantization’s effect on the memory-bandwidth bottleneck
Since decode is memory-bandwidth-bound, reducing model weight precision (quantization, covered in this site’s dedicated article) directly reduces the bytes that need to move per decode step — often a more effective lever for decode throughput than adding raw compute capacity, precisely because of where the actual bottleneck lies.
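As a back-of-the-envelope illustration, the minimum time to stream the weights once per decode step scales directly with bits per weight. The sketch assumes the step is purely weight-read-bound and ignores dequantization overhead and KV cache traffic; the model size and bandwidth are hypothetical.

```python
def min_weight_read_time(params, bits_per_weight, mem_bandwidth):
    """Lower bound on a decode step: time to read every weight once."""
    weight_bytes = params * bits_per_weight / 8
    return weight_bytes / mem_bandwidth


# Hypothetical setup: 7e9 parameters, 2e12 bytes/s memory bandwidth.
PARAMS, BANDWIDTH = 7e9, 2e12
for bits in (16, 8, 4):
    step_ms = min_weight_read_time(PARAMS, bits, BANDWIDTH) * 1e3
    print(f"{bits}-bit weights: at least {step_ms:.2f} ms per decode step")
```

Halving the precision halves this lower bound, which no amount of extra compute would do while the step remains bandwidth-limited.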
Takeaway: raw GPU utilization percentage is close to meaningless for LLM serving efficiency on its own — understanding whether your workload is memory-bandwidth-bound (typically decode) or compute-bound (typically prefill) determines which optimizations (batching, quantization, more compute, faster interconnect) will actually move the needle.