Training a large language model gets most of the attention. But if you've ever tried to serve one in production — handling real traffic, real latency requirements, and real cost constraints — you know that inference is where most of the hard engineering lives. Over the past couple of years, researchers and systems engineers have made remarkable progress in closing the gap between what these models can do and what it actually costs to run them. Having worked on large-scale distributed systems and followed the ML research space closely, I find this intersection of systems thinking and machine learning to be one of the most exciting areas in software engineering right now.
The Core Problem: Autoregressive Generation is Sequential
To understand why LLM serving is hard, you need to understand how text generation works. Models generate tokens one at a time, where each new token depends on all the tokens that came before it. This autoregressive nature means you can't trivially parallelize generation across a single sequence the way you can parallelize training across samples in a batch.
This creates two distinct phases with very different compute profiles:
- Prefill: Processing the entire input prompt in parallel. Compute-bound and GPU-efficient.
- Decode: Generating output tokens one by one. Memory-bandwidth-bound and notoriously GPU-inefficient.
PREFILL (parallel, compute-bound)      DECODE (sequential, memory-bound)
┌────────────────────────────────┐
│  [P1] [P2] [P3] [P4] [P5]      │──▶ [T1] ──▶ [T2] ──▶ [T3] ──▶ ...
│  all prompt tokens at once     │    one new token generated per step
└────────────────────────────────┘
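To make the two phases concrete, here's a toy Python sketch of the generation loop. The "model" is a stand-in function and the cache is just a list of ints (a real KV cache holds per-layer key and value tensors), but the control flow, one parallel prefill followed by strictly sequential decode steps, mirrors real inference.

```python
# Toy sketch of the two inference phases. toy_next_token is a stand-in
# for a forward pass; the real operation is memory-bandwidth-bound
# because the model weights are streamed from HBM for every token.

def toy_next_token(cache: list[int]) -> int:
    # Derive the next token from all cached states (stand-in math).
    return sum(cache) % 50257  # 50257 happens to be GPT-2's vocab size

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    # PREFILL: process every prompt token in one parallel pass,
    # populating the cache for all positions at once (compute-bound).
    cache = list(prompt)

    # DECODE: one token per step, each depending on the full cache.
    out = []
    for _ in range(max_new_tokens):     # inherently sequential
        tok = toy_next_token(cache)
        out.append(tok)
        cache.append(tok)               # cache grows by one per step
    return out

print(generate([1, 2, 3], 4))           # → [6, 12, 24, 48]
```

Note that nothing in the decode loop can be parallelized across steps: token T2 cannot be computed until T1 is in the cache. That is the constraint every technique below works around.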
Naively serving requests — waiting for one to finish before starting the next — wastes enormous amounts of GPU capacity. The challenge is to keep the GPU busy without sacrificing latency.
Continuous Batching
The first major insight is that you don't need to wait for an entire batch to finish before admitting new requests. Traditional static batching holds a fixed set of requests together until all of them complete. Since requests have wildly different output lengths, this means fast requests sit idle waiting for slow ones.
Continuous batching — also called in-flight batching — solves this by treating generation as a stream of iterations. At each step, finished sequences are immediately evicted from the batch and new requests are slotted in. The GPU is kept continuously busy, and short requests don't pay a penalty for being grouped with long ones. This single change dramatically improves throughput and is now standard in production inference engines.
Static Batching:
  Step:     1    2    3    4    5    6
  Slot A:  [████████][  idle  ][  idle  ]  ← done early, slot wasted
  Slot B:  [████████████████████████████]
  Slot C:  [████████████████][   idle   ]  ← done mid-way, slot wasted

Continuous Batching:
  Step:     1    2    3    4    5    6
  Slot A:  [████████][ Req D ██████████ ]  ← D admitted immediately
  Slot B:  [████████████████████████████]
  Slot C:  [████████████████][  Req E   ]  ← E admitted immediately
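The iteration-level scheduling loop can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration (the function names, the request dicts, the slot bookkeeping), not any engine's actual API, but it shows the one structural change: admission and eviction happen every step, not every batch.

```python
# Minimal continuous-batching scheduler sketch (hypothetical API).
from collections import deque

def continuous_batching(requests, batch_size, step_fn, is_done):
    """Run decode iterations, evicting finished sequences and
    admitting waiting requests at every step."""
    waiting = deque(requests)
    active = {}                  # slot id -> request state
    finished = []
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots. This per-iteration
        # admission is the key difference from static batching, which
        # waits for the entire batch to drain first.
        while waiting and len(active) < batch_size:
            slot = next(i for i in range(batch_size) if i not in active)
            active[slot] = waiting.popleft()
        # One decode iteration over the whole batch.
        for slot, req in list(active.items()):
            step_fn(req)
            if is_done(req):
                finished.append(active.pop(slot))  # evict immediately
        steps += 1
    return finished, steps

# Toy workload: five requests with output lengths 2, 6, 4, 3, 3.
reqs = [{"id": i, "tokens": 0, "target": t}
        for i, t in enumerate([2, 6, 4, 3, 3])]
done, steps = continuous_batching(
    reqs, batch_size=3,
    step_fn=lambda r: r.update(tokens=r["tokens"] + 1),
    is_done=lambda r: r["tokens"] >= r["target"])
print(steps)   # → 7
```

On this toy workload the continuous scheduler finishes all five requests in 7 decode steps; a static scheduler with the same three slots would take 9 (6 for the first batch, bounded by its longest request, then 3 for the second).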
PagedAttention and KV Cache Management
For every token the model has processed, the attention keys and values at each layer are stored in a key-value (KV) cache so they don't have to be recomputed on subsequent steps. For long sequences or large batches, this cache becomes the dominant consumer of GPU memory — and the way it's managed determines how many requests you can serve concurrently.
The naive approach allocates a contiguous block of memory for each request upfront, sized for the maximum possible sequence length. This leads to severe fragmentation: memory is reserved but unused, and the system can serve far fewer concurrent requests than the hardware should theoretically allow.
PagedAttention, introduced with vLLM, borrows an idea from OS virtual memory. Instead of contiguous allocation, it partitions the KV cache into fixed-size blocks that can be stored non-contiguously and fetched via a block table during attention computation. Requests share physical memory pages when their prefixes are identical — common in applications where many users share the same system prompt. The result is dramatically higher memory utilization and much larger effective batch sizes.
Naive (contiguous pre-allocation):
  ┌──────────────────────────┐
  │ Req A [████████░░░░░░░░] │
  │ Req B [█████░░░░░░░░░░░] │
  │ Req C [██░░░░░░░░░░░░░░] │
  └──────────────────────────┘
  ░ = reserved but unused memory

PagedAttention (non-contiguous blocks):
  Physical memory:
  ┌──────┬──────┬──────┬──────┬──────┐
  │ A[0] │ B[0] │ A[1] │ C[0] │ B[1] │
  ├──────┼──────┼──────┼──────┼──────┤
  │ A[2] │ free │ C[1] │ free │ free │
  └──────┴──────┴──────┴──────┴──────┘
  Block table maps logical blocks → physical blocks
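The block-table idea can be sketched as a small allocator. This is a toy in the spirit of PagedAttention, not vLLM's implementation: the class name, methods, and block size are invented, and real PagedAttention also handles prefix sharing with copy-on-write reference counts, which this sketch omits.

```python
# Toy KV-block allocator sketch (hypothetical names throughout).
class BlockAllocator:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size          # tokens per KV block
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> tokens stored so far

    def append_token(self, req):
        """Reserve space for one more token, allocating a new physical
        block only when the last one is full. No upfront reservation
        for the maximum sequence length, hence no internal waste."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # last block full, or none yet
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop(0))
        self.lengths[req] = n + 1

    def release(self, req):
        # A finished request's blocks return to the pool immediately
        # and can be handed to any other request, in any order.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=6)
for _ in range(9):
    alloc.append_token("A")      # 9 tokens at 4/block -> 3 blocks
print(alloc.tables["A"])         # → [0, 1, 2]
alloc.release("A")
print(len(alloc.free))           # → 6
```

The block table (`tables` here) is exactly the indirection the attention kernel consults at compute time: logical block *i* of a request may live anywhere in physical memory.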
Speculative Decoding
The decode phase is memory-bandwidth-bound: the GPU spends most of its time loading model weights for each token, not doing useful compute. Speculative decoding exploits this headroom.
The idea is to use a small, fast draft model to speculatively generate several candidate tokens ahead of the main model. The large model then verifies all candidates in a single forward pass — taking advantage of the fact that verification is parallelizable even if generation is not. If the large model agrees with the draft, all tokens are accepted; if it disagrees at some position, the drafts from that point are discarded, and the verification pass itself supplies the correct token there.
When speculative decoding works well — which depends on the draft model's ability to predict what the larger model would generate — it can reduce end-to-end latency significantly without sacrificing output quality. Recent work like Nightjar has pushed this further by dynamically selecting the speculative window length based on real-time batch conditions, achieving up to 20% lower latency compared to static speculative decoding.
Without Speculative Decoding:
  Large Model: ──[T1]──[T2]──[T3]──[T4]──▶   (4 serial forward passes)

With Speculative Decoding:
  Draft Model: ──[d1]──[d2]──[d3]──[d4]──▶   (4 fast serial passes)
                  │      │      │      │
  Large Model: ──[ verify d1, d2, d3, d4 ]──▶ (1 parallel forward pass)
                  ✓      ✓      ✗
                                └─▶ reject d3, d4; the same pass yields
                                    the correct token at d3's position

Net result: 3 tokens (2 accepted + 1 corrected) from 1 large-model pass
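One speculative step can be sketched as below. This is a greedy-acceptance toy: production systems use a rejection-sampling rule that provably preserves the large model's output distribution, and `draft_fn` / `target_fn` here are deterministic stand-ins invented for illustration, not real model calls.

```python
# Greedy-acceptance sketch of one speculative decoding step.
def speculative_step(ctx, draft_fn, target_fn, k):
    # 1) Draft model proposes k tokens autoregressively (cheap passes).
    drafts = []
    for _ in range(k):
        drafts.append(draft_fn(ctx + drafts))

    # 2) Target model verifies all k positions in ONE parallel pass:
    #    for each i, what would it generate after ctx + drafts[:i]?
    verified = [target_fn(ctx + drafts[:i]) for i in range(k)]

    # 3) Accept the longest agreeing prefix; at the first mismatch the
    #    verification pass has already supplied the corrected token.
    accepted = []
    for d, v in zip(drafts, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)   # bonus corrected token, then stop
            break
    return ctx + accepted

# Toy models: the draft agrees with the target on short contexts only.
target = lambda c: sum(c) % 10
draft  = lambda c: sum(c) % 10 if len(c) < 5 else (sum(c) + 1) % 10
print(speculative_step([1, 2], draft, target, k=4))
# → [1, 2, 3, 6, 2, 4]  (3 drafts accepted + 1 corrected token)
```

In the worst case (first draft rejected) you still make one token of progress per large-model pass, so speculation never slows decoding down in token terms; the cost is the extra draft-model work and wasted verification compute.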
Why This Matters Beyond LLMs
What I find intellectually satisfying about this space is that none of these ideas are entirely new. Continuous batching is a scheduling problem. PagedAttention is virtual memory. Speculative execution has been in CPUs for decades. What's happened is that the scale and economics of LLM deployment have made it worth rigorously applying classical systems thinking to a new domain.
The engineers who will do the most interesting work at this intersection aren't those who know only one side — they're the ones who understand both the model and the system it runs on. That's a skill set worth building.
Later.