vLLM Architecture Deep Dive

How requests flow through the V1 engine, from API call to token generation

What vLLM Actually Does

vLLM is a high-throughput LLM inference engine. When you send a prompt to a model like Llama or Mistral, vLLM orchestrates the entire process: scheduling requests, managing GPU memory, and running optimized kernels to generate tokens as fast as possible.

Think of vLLM like an air traffic control system. Multiple planes (requests) arrive simultaneously, each needing runway time (GPU compute) and gate space (memory). The controller must land planes efficiently without collisions.

Engine

The control tower. Receives requests and coordinates all other components.

Scheduler

The runway controller. Decides which requests run and when, based on memory constraints.

KV Cache Manager

The gate manager. Allocates and recycles GPU memory blocks for attention state.

Worker

The pilot. Executes actual model computation on the GPU with optimized Triton kernels.

Request Lifecycle

When you send a prompt to vLLM, here is the path your request takes through the system. Each step is a handoff between components, designed for maximum throughput.

API

AsyncLLM

EngineCore

Scheduler

Worker

Click "Next Step" to trace a request through vLLM

The V1 Architecture Directory

The V1 engine lives in vllm/v1/. It was a ground-up rewrite of the original vLLM engine for better performance and simpler scheduling logic.

vllm/v1/ The V1 engine root

engine/ Request handling and coordination

async_llm.py Async API frontend

core.py Main engine loop

detokenizer.py Token-to-text conversion

core/sched/ Scheduling decisions

scheduler.py Main scheduling logic

core/ KV cache and memory management

kv_cache_manager.py Block allocation for KV cache

block_pool.py Prefix caching and block reuse

attention/ops/ Triton attention kernels

triton_prefill_attention.py Prefill phase attention kernel

triton_decode_attention.py Decode phase attention kernel

The Scheduling Algorithm

The V1 scheduler has no concept of separate "prefill" or "decode" phases. Instead, it treats every request uniformly: each has num_computed_tokens and a target num_tokens_with_spec. The scheduler assigns compute budget to help each request catch up.

CODE

# vllm/v1/core/sched/scheduler.py
def schedule(self) -> SchedulerOutput:
    # Each request just has num_computed_tokens and
    # num_tokens_with_spec. At each step, the scheduler
    # tries to assign tokens so that each request's
    # num_computed_tokens can catch up.
    scheduled_new_reqs: list[Request] = []
    scheduled_running_reqs: list[Request] = []
    preempted_reqs: list[Request] = []
    token_budget = self.max_num_scheduled_tokens

PLAIN ENGLISH

This is the main scheduling function, called every step.

The unified approach: no prefill/decode phases. Each request just tracks how many tokens it has computed vs. how many it needs.

The scheduler fills tokens until the budget runs out.

Three buckets: new requests entering the system, requests currently running, and requests that got preempted (paused to free memory).

The token budget caps how many tokens can be processed in one step across all requests.

💡

Why unified scheduling matters

By removing the prefill/decode distinction, vLLM V1 can handle chunked prefills, prefix caching, and speculative decoding with the same scheduling logic. One algorithm covers all cases.

Scheduling Policies

The scheduler supports multiple scheduling policies to control which requests get compute time first.

CODE

# vllm/v1/core/sched/scheduler.py
# Scheduling constraints.
self.max_num_running_reqs = self.scheduler_config.max_num_seqs
self.max_num_scheduled_tokens = (
    self.scheduler_config.max_num_scheduled_tokens
    if self.scheduler_config.max_num_scheduled_tokens
    else self.scheduler_config.max_num_batched_tokens
)
self.policy = SchedulingPolicy(self.scheduler_config.policy)

PLAIN ENGLISH

Two hard limits control scheduling:

1. Maximum concurrent requests (how many planes can be in the air)

2. Maximum tokens per step (total runway capacity per time slice)

The token budget falls back to batched_tokens if not explicitly set.

The policy (FCFS or priority) determines the queue ordering.

Check Your Understanding

What distinguishes the V1 scheduler from traditional LLM schedulers?

A new request arrives but GPU memory is almost full. Which component decides whether to accept or preempt?

Why does the unified scheduling approach enable speculative decoding and chunked prefills?

Memory Management & KV Cache

How vLLM manages GPU memory with block-based allocation and paged attention

The Memory Bottleneck

LLM inference is memory-bound, not compute-bound. The GPU has enormous arithmetic power but limited memory bandwidth. The KV cache - storing key and value tensors from previous tokens - dominates GPU memory usage during inference.

Think of it like a library with limited shelf space. Each book (token) needs a reserved spot on the shelves (memory blocks). Without clever management, you run out of shelves long before you run out of reading capacity.

80G

GPU Memory (A100: 80GB HBM2e)

Total available memory. Model weights take ~14GB for a 7B model in FP16. The rest is for KV cache.

KV Cache per Token

Each token stores a key and value vector per layer per head. For Llama-70B: ~1.25MB per token across all layers.

BLK

Block Size (typically 16 tokens)

Memory is allocated in fixed-size blocks, not per-token. This reduces fragmentation and enables sharing.

Block-Based Memory Allocation

vLLM pioneered paged attention - the key insight that made high-throughput LLM serving possible. Instead of reserving maximum-length contiguous memory for each request, vLLM allocates memory in small blocks that can be scattered across GPU memory.

Request arrives

Scheduler asks KV Cache Manager for blocks based on prompt length

Blocks allocated from free pool

Manager pulls blocks from FreeKVCacheBlockQueue - no contiguous requirement

Tokens fill blocks

As tokens are generated, they fill the current block. When full, a new block is allocated

Request completes

Blocks return to the free pool (or stay cached for prefix reuse)

💡

Why blocks beat contiguous allocation

Without paging, a request expecting 2048 tokens must reserve all 2048 tokens of memory upfront - even if it only generates 50. With blocks, memory grows on-demand. This single change improved memory utilization from ~50% to ~95% in practice.

KV Cache Data Structure

The KVCacheBlocks dataclass is the interface between the Scheduler and the KV Cache Manager. It hides internal memory details from scheduling logic.

CODE

# vllm/v1/core/kv_cache_manager.py
@dataclass
class KVCacheBlocks:
    blocks: tuple[Sequence[KVCacheBlock], ...]
    # blocks[i][j] refers to the i-th kv_cache_group
    # and the j-th block of tokens.

    def __add__(self, other) -> "KVCacheBlocks":
        return KVCacheBlocks(
            tuple(list(itertools.chain(blk1, blk2))
                  for blk1, blk2 in zip(self.blocks, other.blocks))
        )

PLAIN ENGLISH

This is the memory allocation receipt - it tells the scheduler what blocks were allocated.

The blocks field is a tuple of sequences. First dimension is the cache group (for multi-head attention), second is the block index within that group.

The __add__ method lets you combine two allocations (e.g., extending a request that needs more blocks) by chaining their block lists together.

Prefix Caching with Block Hashing

When two requests share the same prefix (e.g., the same system prompt), vLLM can reuse the computed KV cache blocks instead of recomputing them. This is done via content-addressable hashing of block contents.

CODE

# vllm/v1/core/block_pool.py
class BlockHashToBlockMap:
    """Cache of blocks used for prefix caching.
    Maps hash directly to a block or multiple blocks
    (i.e. {block_hash: KVCacheBlocks})
    A cached block is a full block with a block hash
    that can be used for prefix caching.
    The cached block may be used by running requests
    or in the free_block_queue that could
    potentially be evicted."""

PLAIN ENGLISH

This class is the "library card catalog" - it maps block content fingerprints to actual memory blocks.

When a block is full (all token slots used), it gets a hash computed from its token content.

If another request needs the same prefix, we look up the hash and find the existing block - no recomputation needed.

Cached blocks can be in two states: actively used by a request, or sitting in the free queue (evictable but still available for reuse).

Memory Access in Paged Attention

The attention kernel must handle non-contiguous blocks. For each query token, it looks up which physical blocks hold the relevant keys and values using a page table (Req_to_tokens mapping).

Query

Page Table

HBM

GPU Memory

Compute Unit

Click "Next Step" to see paged attention memory access

Check Your Understanding

A request might generate anywhere from 10 to 2000 tokens. Why does block-based allocation help here?

Two users send requests with the same 500-token system prompt. How does prefix caching help?

During decode attention, how does the kernel find where block 5 of a request is stored in GPU memory?

Introduction to Triton for Attention

Why Triton exists, how it models GPU parallelism, and reading vLLM's attention kernels

Why Custom Kernels?

Standard PyTorch operations execute one operation at a time: matrix multiply, then softmax, then another multiply. Each operation reads from and writes to GPU HBM. The data movement between operations is the bottleneck, not the math.

Triton lets you fuse multiple operations into one kernel. Data stays in fast on-chip SRAM between operations instead of making round-trips to slow HBM.

PyTorch Default

Q*K -> HBM -> softmax -> HBM -> *V -> HBM. Three round trips to slow memory.

Triton Fused

Load Q,K,V -> SRAM -> compute attention in tiles -> write output once. One HBM round trip.

Result

2-4x speedup for attention. The math is identical - only the data movement changes.

The Triton Programming Model

Triton operates at the block level. You write code that processes a tile of data, and the GPU launches thousands of these program instances in parallel. Think of it like a factory assembly line where each worker handles one box of parts, not individual screws.

tl.program_id(axis)

Each instance gets a unique ID. This tells it which tile of the output it is responsible for.

tl.arange(0, BLOCK_SIZE)

Generate offset indices within the tile. These map to the elements this instance processes.

tl.load(ptr + offsets, mask=...)

Load a tile of data from HBM into registers. The mask handles boundary conditions.

Compute (tl.dot, tl.sum, etc.)

Process the tile entirely in fast registers/SRAM. This is where the actual math happens.

tl.store(ptr + offsets, result)

Write the computed tile back to HBM. Ideally, this is the only write to slow memory.

GPU Memory Hierarchy

Understanding where data lives is the key to writing fast Triton kernels. The hierarchy has three levels, each with dramatically different speed and capacity.

vLLM Prefill Attention Kernel

The prefill kernel processes all prompt tokens at once. Each program instance handles one tile of query positions for one attention head. It iterates over all key/value positions in tiles, computing attention scores with online softmax.

CODE

# vllm/v1/attention/ops/triton_prefill_attention.py
@triton.jit
def _fwd_kernel(Q, K, V, sm_scale, B_Start_Loc, B_Seqlen, Out,
    stride_qbs, stride_qh, stride_kbs, stride_kh, ...):
    cur_batch = tl.program_id(0)
    cur_head = tl.program_id(1)
    start_m = tl.program_id(2)
    cur_kv_head = cur_head // kv_group_num
    cur_batch_seq_len = tl.load(B_Seqlen + cur_batch)
    # initialize pointer to m and l
    m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")
    l_i = tl.zeros([BLOCK_M], dtype=tl.float32)
    acc = tl.zeros([BLOCK_M, BLOCK_DMODEL], dtype=tl.float32)

PLAIN ENGLISH

This Triton kernel runs one instance per (batch, head, query-tile) combination.

@triton.jit compiles this Python function into GPU machine code.

Three program IDs give us our position: which batch, which attention head, which query tile (start_m).

For grouped-query attention: multiple query heads share one KV head.

Load this batch item's sequence length from memory.

Online softmax state: m_i tracks the running maximum (starts at -infinity), l_i tracks the running sum of exponentials.

acc is the output accumulator - the weighted sum of values. All three live in registers throughout.

The Core Attention Loop

This is where the magic happens. The kernel iterates over tiles of keys, computing attention scores and accumulating the weighted values - all without writing intermediate results to HBM.

CODE

# The tiled attention loop (prefill kernel)
for start_n in range(start_n_limit, end_n_limit, BLOCK_N):
    k = tl.load(k_ptrs + start_n * stride_kbs,
                mask=(pos_k < cur_batch_seq_len))
    qk = tl.dot(q, k)
    qk = tl.where(mask, qk * sm_scale, -1.0e8)
    m_ij = tl.maximum(m_i, tl.max(qk, 1))
    qk -= m_ij[:, None]
    p = tl.math.exp2(qk)
    l_ij = tl.sum(p, 1)
    alpha = tl.math.exp2(m_i - m_ij)
    l_i = l_i * alpha + l_ij
    acc = acc * alpha[:, None]
    v = tl.load(v_ptrs + start_n * stride_vbs, ...)
    acc += tl.dot(p, v)
    m_i = m_ij

PLAIN ENGLISH

Iterate over key/value positions in tiles of BLOCK_N.

Load a tile of keys from HBM (this is the main memory read).

Compute QK dot product - attention scores for this tile.

Apply causal mask and scale. Masked positions get -inf (zero after softmax).

Online softmax: find new maximum across this tile.

Subtract maximum for numerical stability, then exponentiate (using exp2 for speed).

Sum the exponentials for this tile.

Correction factor: rescale previous results because the maximum changed.

Update running sum and rescale the output accumulator.

Load the corresponding value tile and add weighted contribution to output.

Update the running maximum for next iteration.

🔍

Why exp2 instead of exp?

GPUs have a dedicated hardware instruction for 2^x (exp2) that is faster than e^x (exp). The kernel uses sm_scale * log2(e) to convert between bases. Same result, fewer cycles.

Check Your Understanding

Why is a fused Triton attention kernel faster than separate PyTorch operations (matmul + softmax + matmul)?

In the prefill kernel, what do program_id(0), program_id(1), and program_id(2) represent?

What problem does online softmax solve in tiled attention?

Optimizing Triton Attention Kernels

Identifying bottlenecks, tuning block sizes, and understanding vLLM's decode attention strategy

The Roofline: Memory vs. Compute Bound

Every kernel operation sits somewhere on the roofline. Prefill attention is often compute-bound (large matrix multiplies). Decode attention is almost always memory-bound (loading KV cache for a single query token). Different optimizations apply to each.

Prefill (Compute-Bound)

Many query tokens attend to many keys simultaneously. Large tl.dot operations saturate tensor cores. Optimize by maximizing tile occupancy.

Decode (Memory-Bound)

One new query token reads the entire KV cache. The bottleneck is loading K,V from HBM. Optimize by reducing memory traffic and using split-KV parallelism.

💡

The decode dilemma

During decode, arithmetic intensity is extremely low: one query vector dot-producted with thousands of key vectors. The GPU finishes the math faster than memory can feed it data. This is why split-KV parallelism exists - it keeps more compute units busy by splitting the KV sequence across multiple program instances.

Split-KV: Parallelizing Decode Attention

vLLM's decode kernel uses a two-stage approach. Stage 1 splits the KV sequence across NUM_KV_SPLITS parallel instances. Stage 2 merges their results. Think of it like parallel reading groups: each group reads a chapter of the book, then they combine their notes.

Query Token

Split Logic

Parallel Splits

Merge (Stage 2)

Click "Next Step" to see split-KV decode attention

Decode Stage 1: The Split-KV Kernel

Each split instance processes a slice of the KV sequence. The key insight: it uses paged memory access to handle non-contiguous blocks while iterating over its assigned KV range.

CODE

# vllm/v1/attention/ops/triton_decode_attention.py
@triton.jit
def _fwd_kernel_stage1(Q, K_Buffer, V_Buffer, sm_scale,
    Req_to_tokens, B_Seqlen, Att_Out, ...):
    cur_batch = tl.program_id(0)
    cur_head = tl.program_id(1)
    split_kv_id = tl.program_id(2)

    kv_len_per_split = tl.cdiv(cur_batch_seq_len, NUM_KV_SPLITS)
    split_kv_start = kv_len_per_split * split_kv_id
    split_kv_end = tl.minimum(split_kv_start + kv_len_per_split,
                              cur_batch_seq_len)

PLAIN ENGLISH

Stage 1 of decode attention - each instance handles one split of the KV sequence.

Three axes of parallelism: across batches, across heads, and across KV splits.

program_id(2) is the split index - this instance handles one chunk of the sequence.

Calculate the range of KV positions this split is responsible for.

Ceiling division ensures all positions are covered even if sequence length is not evenly divisible.

Clamp the end to avoid reading past the actual sequence length.

Paged KV Access in the Kernel

The decode kernel handles paged memory by computing physical addresses from logical positions. This is the critical path where page table lookups happen at GPU speed.

CODE

# Inside the decode attention loop
for start_n in range(split_kv_start, split_kv_end, BLOCK_N):
    offs_n = start_n + tl.arange(0, BLOCK_N)
    kv_page_number = tl.load(
        Req_to_tokens + stride_req_to_tokens_b * cur_batch_req_idx
        + offs_n // PAGE_SIZE,
        mask=offs_n < split_kv_end)
    kv_loc = kv_page_number * PAGE_SIZE + offs_n % PAGE_SIZE
    k = tl.load(K_Buffer + kv_loc[:, None] * stride_buf_kbs
        + cur_kv_head * stride_buf_kh + offs_d[None, :],
        mask=(offs_n[:, None] < split_kv_end) & mask_d)
    qk = tl.sum(q[None, :] * k, 1)
    qk *= sm_scale

PLAIN ENGLISH

Loop over this split's assigned KV range in tiles of BLOCK_N positions.

Generate position indices for this tile.

PAGE TABLE LOOKUP: divide position by page size to get which page (block) it is in. Load the physical page number from the request's page table.

Compute the actual physical memory location: page_number * PAGE_SIZE gives the page start, plus the offset within the page (position % PAGE_SIZE).

Load key vectors from their scattered physical locations using computed addresses.

Compute attention score: element-wise multiply of query with each key, then sum. For decode, query is a single vector (not a matrix like in prefill).

Scale by 1/sqrt(d_head) for stable softmax.

Optimization: Block Size Tuning

The BLOCK_N and BLOCK_DMODEL parameters control tile sizes. Choosing the right values is a trade-off between occupancy and per-instance efficiency.

Small Tiles (BLOCK_N=32)

High occupancy (many instances fit on SM), good for short sequences. But more loop iterations means more overhead per tile.

Medium Tiles (BLOCK_N=64)

Balance between occupancy and per-tile efficiency. Often the sweet spot for decode attention.

Large Tiles (BLOCK_N=128)

Better memory coalescing and amortized overhead. Best for long sequences in prefill. But may reduce occupancy.

🔧

Practical tuning tip

vLLM uses tl.constexpr for BLOCK_N, meaning the compiler generates specialized code for each block size. You can benchmark different values by modifying the kernel launch configuration. The optimal choice depends on your GPU (A100 vs H100), sequence length, and batch size.

FP8 KV Cache: Halving Memory Traffic

vLLM's decode kernel supports FP8 quantized KV cache. Since decode is memory-bound, halving the bytes read from HBM nearly doubles effective performance.

CODE

# FP8 dequantization in decode kernel
ks = tl.load(k_scale)
vs = tl.load(v_scale)
for start_n in range(split_kv_start, split_kv_end, BLOCK_N):
    k = tl.load(K_Buffer + offs_buf_k, mask=...)
    if k.dtype.is_fp8():
        k = (k.to(tl.float32) * ks).to(q.dtype)
    qk = tl.sum(q[None, :] * k, 1)

PLAIN ENGLISH

Load the dequantization scale factors (one for keys, one for values). These are scalars shared across the tensor.

Normal tile loop - same structure as before.

Load key data from the buffer. If stored as FP8, each element is only 1 byte instead of 2 bytes.

Check if the data is FP8 at compile time (constexpr branch - zero runtime cost).

Dequantize: cast to float32, multiply by scale, then cast to query precision. This restores the original value range.

Continue with normal attention computation using the dequantized keys.

Check Your Understanding

Your decode attention kernel achieves only 30% of peak FLOPS. What is the most likely bottleneck?

Why does split-KV help decode performance if the bottleneck is memory bandwidth?

You switch from FP16 to FP8 KV cache. Where does the performance gain come from?

Quantization & Low-Precision Kernels

Compressing model weights and KV cache to fit larger models and serve faster

Why Quantize?

Quantization solves two problems at once: it shrinks model size so larger models fit on fewer GPUs, and it reduces memory traffic so inference runs faster. A 70B parameter model in FP16 needs ~140GB of memory. In INT4, it fits in ~35GB - a single GPU.

Think of it like compressing a high-resolution photo to JPEG. You lose some subtle details, but the image is recognizable and loads much faster. The art is choosing compression levels that preserve quality where it matters.

16b

FP16 / BF16 (Baseline)

Full precision. 2 bytes per parameter. Best accuracy but most memory.

FP8 / INT8

1 byte per parameter. ~0.1% accuracy loss. 2x memory reduction and bandwidth improvement.

INT4 (GPTQ, AWQ)

0.5 bytes per parameter. 4x memory reduction. Requires careful calibration to preserve accuracy.

Quantization Granularity

The scale factor determines accuracy. Finer granularity preserves more information but adds overhead. vLLM supports all three strategies.

Per-Tensor Quantization

One scale for the entire weight matrix. Fastest dequantization but lowest accuracy - outlier values stretch the range for all other values.

Per-Channel Quantization

One scale per output channel. Good balance - each row/column gets its own range. Used by SmoothQuant.

Group-wise Quantization (128 elements)

One scale per group of 128 values. Best accuracy preservation - used by GPTQ and AWQ. Slightly more dequantization work.

💡

The outlier problem

LLMs have activation outliers - a few values 10-100x larger than the rest. Per-tensor quantization wastes most of its range on these outliers, crushing the majority of values into a few quantization levels. Group-wise quantization isolates outliers to their local group.

vLLM Quantization Methods

vLLM supports a comprehensive set of quantization methods, each with different trade-offs between accuracy, speed, and memory savings.

vllm/model_executor/layers/quantization/ All quantization implementations

fp8.py FP8 weight and activation quantization (fastest, minimal accuracy loss)

auto_gptq.py GPTQ: 4-bit weight quantization with calibration data

awq.py AWQ: activation-aware 4-bit quantization

awq_marlin.py AWQ with Marlin kernel backend (optimized W4A16 GEMM)

kv_cache.py KV cache quantization (FP8 cache for memory savings)

fbgemm_fp8.py FBGEMM FP8 backend (Meta-optimized kernels)

FP8 Quantization in Practice

FP8 is the sweet spot for inference: minimal accuracy degradation, hardware-accelerated on H100/MI300X, and transparent to the model architecture. vLLM implements FP8 for both linear layers and KV cache.

CODE

# vllm/model_executor/layers/quantization/fp8.py
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.layers.quantization.kv_cache import (
    BaseKVCacheMethod)
from vllm.model_executor.kernels.linear import (
    init_fp8_linear_kernel)
from vllm.model_executor.kernels.linear.scaled_mm import (
    CutlassFP8ScaledMMLinearKernel,
    MarlinFP8ScaledMMLinearKernel)

PLAIN ENGLISH

FP8 quantization integrates with vLLM's base config system and supports both linear layers and KV cache.

QuantizationConfig defines the interface: how to quantize weights, how to create quantized methods for layers.

BaseKVCacheMethod handles FP8 quantization of the KV cache specifically (what we saw in Module 4's decode kernel).

The actual matrix multiplication uses optimized backends:

CutlassFP8 for NVIDIA GPUs (Cutlass library), Marlin for further-optimized 4-bit/8-bit GEMM kernels.

How Quantized Inference Works

The quantized forward pass adds a dequantization step before (or fused into) each linear operation. The key optimization is fusing dequantization with the GEMM so there is no extra memory traffic.

Input (FP16)

Quantized Weights

Fused GEMM

Output (FP16)

Click "Next Step" to see quantized inference flow

Check Your Understanding

A model is quantized from FP16 to INT4. The accuracy drops by only 0.5%. Where does the speed improvement come from?

Your model has a few attention heads where one activation value is 50x larger than the rest. What quantization strategy minimizes accuracy loss?

Why does vLLM fuse dequantization with the GEMM operation instead of dequantizing first, then doing a standard GEMM?

Distributed Inference & Advanced Topics

Scaling across GPUs, continuous batching, speculative decoding, and profiling strategies

Scaling Beyond One GPU

Large models (70B+ parameters) exceed single-GPU memory even with quantization. vLLM supports splitting models across multiple GPUs using tensor parallelism and pipeline parallelism.

Think of it like a relay race vs. parallel swimmers. Tensor parallelism is parallel swimmers: everyone works on the same lap simultaneously and syncs at the wall. Pipeline parallelism is a relay: each swimmer handles one leg sequentially.

Tensor Parallelism

Split weight matrices column/row-wise across GPUs. Each GPU computes partial attention/FFN. Requires all-reduce after each layer.

Pipeline Parallelism

Assign layer ranges to GPUs. Lower communication overhead but harder to keep all GPUs busy. Used for very large models.

Data Parallelism

Each GPU has a full model copy and processes different requests. Simple but requires enough memory per GPU for the full model.

Communication Primitives

Tensor parallelism requires GPUs to exchange partial results after each layer. vLLM uses all-reduce for this synchronization.

CODE

# vllm/distributed/communication_op.py
def tensor_model_parallel_all_reduce(
    input_: torch.Tensor) -> torch.Tensor:
    """All-reduce the input tensor across
       model parallel group."""
    return get_tp_group().all_reduce(input_)

def tensor_model_parallel_all_gather(
    input_: torch.Tensor, dim: int = -1
) -> torch.Tensor:
    """All-gather the input tensor across
       model parallel group."""
    return get_tp_group().all_gather(input_, dim)

PLAIN ENGLISH

This file defines the communication primitives used in distributed inference.

All-reduce: every GPU sends its partial result and receives the sum of all partials. After attention QKV projection, each GPU has a slice of the output - all-reduce combines them.

get_tp_group() returns the tensor-parallel communication group (the set of GPUs sharing one model).

All-gather: each GPU contributes its slice, everyone receives the full concatenated tensor. Used when a layer needs the complete hidden state (e.g., before FFN in some parallelization schemes).

Tensor Parallel Attention

In tensor-parallel attention, each GPU handles a subset of attention heads. After computing attention for its heads, results are combined via all-reduce before passing to the next layer.

Input Hidden

GPU 0 (Heads 0-15)

GPU 1 (Heads 16-31)

All-Reduce

Click "Next Step" to see tensor-parallel attention

Continuous Batching

Continuous batching is vLLM's scheduling superpower. Instead of waiting for all requests in a batch to finish before starting new ones, the scheduler inserts and removes requests at every step.

Speculative Decoding

Speculative decoding is vLLM's approach to reducing latency. A small draft model guesses the next several tokens, then the large model verifies them in parallel. When guesses are correct, you get multiple tokens from one expensive forward pass.

Draft model generates K tokens

A small model (e.g., 1B params) quickly predicts tokens t+1, t+2, ..., t+K. This is cheap because the model is small.

Target model verifies all K tokens in one pass

The large model (70B) runs one forward pass on all K draft tokens simultaneously. It computes what it would have generated at each position.

Accept matching tokens, reject the rest

If the draft matches the target at positions 1-3 but diverges at position 4, we keep tokens 1-3 (3 tokens for one pass) and resample position 4.

Net gain: multiple tokens per target pass

If the draft model is 80% accurate, we get ~3-4 tokens per target forward pass instead of 1. Latency drops proportionally.

🔧

Connection to the V1 scheduler

Speculative decoding fits naturally into V1's unified scheduling: draft tokens increase num_tokens_with_spec, and the scheduler handles them identically to regular tokens. Rejected draft tokens simply don't increment num_computed_tokens.

Profiling & Debugging Strategies

Optimizing kernels requires measurement. Here are the key tools and what they reveal about vLLM performance.

nsys (NVIDIA Nsight Systems)

Timeline view of all GPU activity. Shows kernel launch gaps, communication overhead, and scheduling inefficiency. Start here for system-level issues.

ncu (Nsight Compute)

Deep per-kernel analysis. Shows memory throughput, compute utilization, occupancy, and roofline position. Use for optimizing individual Triton kernels.

Triton Profiler

Built into Triton. Shows auto-tuning results, compiled PTX, and tile scheduling. Use when debugging Triton-specific issues like register pressure.

nsys profile python -m vllm.entrypoints.openai.api_server Profile a full vLLM server run

ncu --set full python benchmark.py Deep-dive into a single kernel

TRITON_PRINT_AUTOTUNING=1 See which kernel configurations Triton selects

What vLLM Actually Does

Engine

Scheduler

KV Cache Manager

Worker

Request Lifecycle

The V1 Architecture Directory

The Scheduling Algorithm

Scheduling Policies

Check Your Understanding

What distinguishes the V1 scheduler from traditional LLM schedulers?

A new request arrives but GPU memory is almost full. Which component decides whether to accept or preempt?

Why does the unified scheduling approach enable speculative decoding and chunked prefills?

The Memory Bottleneck

Block-Based Memory Allocation

KV Cache Data Structure

Prefix Caching with Block Hashing

Memory Access in Paged Attention

Check Your Understanding

A request might generate anywhere from 10 to 2000 tokens. Why does block-based allocation help here?

Two users send requests with the same 500-token system prompt. How does prefix caching help?

During decode attention, how does the kernel find where block 5 of a request is stored in GPU memory?

Why Custom Kernels?

PyTorch Default

Triton Fused

Result

The Triton Programming Model

GPU Memory Hierarchy

vLLM Prefill Attention Kernel

The Core Attention Loop

Check Your Understanding

Why is a fused Triton attention kernel faster than separate PyTorch operations (matmul + softmax + matmul)?

In the prefill kernel, what do program_id(0), program_id(1), and program_id(2) represent?

What problem does online softmax solve in tiled attention?

The Roofline: Memory vs. Compute Bound

Prefill (Compute-Bound)

Decode (Memory-Bound)

Split-KV: Parallelizing Decode Attention

Decode Stage 1: The Split-KV Kernel

Paged KV Access in the Kernel

Optimization: Block Size Tuning

Small Tiles (BLOCK_N=32)

Medium Tiles (BLOCK_N=64)

Large Tiles (BLOCK_N=128)

FP8 KV Cache: Halving Memory Traffic

Check Your Understanding

Your decode attention kernel achieves only 30% of peak FLOPS. What is the most likely bottleneck?

Why does split-KV help decode performance if the bottleneck is memory bandwidth?

You switch from FP16 to FP8 KV cache. Where does the performance gain come from?

Why Quantize?

Quantization Granularity

vLLM Quantization Methods

FP8 Quantization in Practice

How Quantized Inference Works

Check Your Understanding

A model is quantized from FP16 to INT4. The accuracy drops by only 0.5%. Where does the speed improvement come from?

Your model has a few attention heads where one activation value is 50x larger than the rest. What quantization strategy minimizes accuracy loss?

Why does vLLM fuse dequantization with the GEMM operation instead of dequantizing first, then doing a standard GEMM?

Scaling Beyond One GPU

Tensor Parallelism

Pipeline Parallelism

Data Parallelism

Communication Primitives

Tensor Parallel Attention

Continuous Batching

Speculative Decoding

Profiling & Debugging Strategies

nsys (NVIDIA Nsight Systems)

ncu (Nsight Compute)

Triton Profiler

Check Your Understanding

In 2-way tensor parallel attention, GPU 0 computes heads 0-15 and GPU 1 computes heads 16-31. How do they produce the final output?

Requests A, B, C are decoding. B finishes and new request E arrives. What happens in continuous batching?

You suspect your vLLM deployment has communication overhead between GPUs. Which profiling tool do you start with?