vLLM Architecture Deep Dive
How requests flow through the V1 engine, from API call to token generation
What vLLM Actually Does
vLLM is a high-throughput LLM inference engine. When you send a prompt to a model like Llama or Mistral, vLLM orchestrates the entire process: scheduling requests, managing GPU memory, and running optimized kernels to generate tokens as fast as possible.
Think of vLLM like an air traffic control system. Multiple planes (requests) arrive simultaneously, each needing runway time (GPU compute) and gate space (memory). The controller must land planes efficiently without collisions.
Engine
The control tower. Receives requests and coordinates all other components.
Scheduler
The runway controller. Decides which requests run and when, based on memory constraints.
KV Cache Manager
The gate manager. Allocates and recycles GPU memory blocks for attention state.
Worker
The pilot. Executes actual model computation on the GPU with optimized Triton kernels.
Request Lifecycle
When you send a prompt to vLLM, here is the path your request takes through the system. Each step is a handoff between components, designed for maximum throughput.
The V1 Architecture Directory
The V1 engine lives in vllm/v1/. It was a ground-up rewrite of the original vLLM engine for better performance and simpler scheduling logic.
The Scheduling Algorithm
The V1 scheduler has no concept of separate "prefill" or "decode" phases. Instead, it treats every request uniformly: each has num_computed_tokens and a target num_tokens_with_spec. The scheduler assigns compute budget to help each request catch up.
# vllm/v1/core/sched/scheduler.py
def schedule(self) -> SchedulerOutput:
# Each request just has num_computed_tokens and
# num_tokens_with_spec. At each step, the scheduler
# tries to assign tokens so that each request's
# num_computed_tokens can catch up.
scheduled_new_reqs: list[Request] = []
scheduled_running_reqs: list[Request] = []
preempted_reqs: list[Request] = []
token_budget = self.max_num_scheduled_tokens
This is the main scheduling function, called every step.
The unified approach: no prefill/decode phases. Each request just tracks how many tokens it has computed vs. how many it needs.
The scheduler fills tokens until the budget runs out.
Three buckets: new requests entering the system, requests currently running, and requests that got preempted (paused to free memory).
The token budget caps how many tokens can be processed in one step across all requests.
By removing the prefill/decode distinction, vLLM V1 can handle chunked prefills, prefix caching, and speculative decoding with the same scheduling logic. One algorithm covers all cases.
Scheduling Policies
The scheduler supports multiple scheduling policies to control which requests get compute time first.
# vllm/v1/core/sched/scheduler.py
# Scheduling constraints.
self.max_num_running_reqs = self.scheduler_config.max_num_seqs
self.max_num_scheduled_tokens = (
self.scheduler_config.max_num_scheduled_tokens
if self.scheduler_config.max_num_scheduled_tokens
else self.scheduler_config.max_num_batched_tokens
)
self.policy = SchedulingPolicy(self.scheduler_config.policy)
Two hard limits control scheduling:
1. Maximum concurrent requests (how many planes can be in the air)
2. Maximum tokens per step (total runway capacity per time slice)
The token budget falls back to batched_tokens if not explicitly set.
The policy (FCFS or priority) determines the queue ordering.
Check Your Understanding
What distinguishes the V1 scheduler from traditional LLM schedulers?
A new request arrives but GPU memory is almost full. Which component decides whether to accept or preempt?
Why does the unified scheduling approach enable speculative decoding and chunked prefills?
Memory Management & KV Cache
How vLLM manages GPU memory with block-based allocation and paged attention
The Memory Bottleneck
LLM inference is memory-bound, not compute-bound. The GPU has enormous arithmetic power but limited memory bandwidth. The KV cache - storing key and value tensors from previous tokens - dominates GPU memory usage during inference.
Think of it like a library with limited shelf space. Each book (token) needs a reserved spot on the shelves (memory blocks). Without clever management, you run out of shelves long before you run out of reading capacity.
Total available memory. Model weights take ~14GB for a 7B model in FP16. The rest is for KV cache.
Each token stores a key and value vector per layer per head. For Llama-70B: ~1.25MB per token across all layers.
Memory is allocated in fixed-size blocks, not per-token. This reduces fragmentation and enables sharing.
Block-Based Memory Allocation
vLLM pioneered paged attention - the key insight that made high-throughput LLM serving possible. Instead of reserving maximum-length contiguous memory for each request, vLLM allocates memory in small blocks that can be scattered across GPU memory.
Scheduler asks KV Cache Manager for blocks based on prompt length
Manager pulls blocks from FreeKVCacheBlockQueue - no contiguous requirement
As tokens are generated, they fill the current block. When full, a new block is allocated
Blocks return to the free pool (or stay cached for prefix reuse)
Without paging, a request expecting 2048 tokens must reserve all 2048 tokens of memory upfront - even if it only generates 50. With blocks, memory grows on-demand. This single change improved memory utilization from ~50% to ~95% in practice.
KV Cache Data Structure
The KVCacheBlocks dataclass is the interface between the Scheduler and the KV Cache Manager. It hides internal memory details from scheduling logic.
# vllm/v1/core/kv_cache_manager.py
@dataclass
class KVCacheBlocks:
blocks: tuple[Sequence[KVCacheBlock], ...]
# blocks[i][j] refers to the i-th kv_cache_group
# and the j-th block of tokens.
def __add__(self, other) -> "KVCacheBlocks":
return KVCacheBlocks(
tuple(list(itertools.chain(blk1, blk2))
for blk1, blk2 in zip(self.blocks, other.blocks))
)
This is the memory allocation receipt - it tells the scheduler what blocks were allocated.
The blocks field is a tuple of sequences. First dimension is the cache group (for multi-head attention), second is the block index within that group.
The __add__ method lets you combine two allocations (e.g., extending a request that needs more blocks) by chaining their block lists together.
Prefix Caching with Block Hashing
When two requests share the same prefix (e.g., the same system prompt), vLLM can reuse the computed KV cache blocks instead of recomputing them. This is done via content-addressable hashing of block contents.
# vllm/v1/core/block_pool.py
class BlockHashToBlockMap:
"""Cache of blocks used for prefix caching.
Maps hash directly to a block or multiple blocks
(i.e. {block_hash: KVCacheBlocks})
A cached block is a full block with a block hash
that can be used for prefix caching.
The cached block may be used by running requests
or in the free_block_queue that could
potentially be evicted."""
This class is the "library card catalog" - it maps block content fingerprints to actual memory blocks.
When a block is full (all token slots used), it gets a hash computed from its token content.
If another request needs the same prefix, we look up the hash and find the existing block - no recomputation needed.
Cached blocks can be in two states: actively used by a request, or sitting in the free queue (evictable but still available for reuse).
Memory Access in Paged Attention
The attention kernel must handle non-contiguous blocks. For each query token, it looks up which physical blocks hold the relevant keys and values using a page table (Req_to_tokens mapping).
Check Your Understanding
A request might generate anywhere from 10 to 2000 tokens. Why does block-based allocation help here?
Two users send requests with the same 500-token system prompt. How does prefix caching help?
During decode attention, how does the kernel find where block 5 of a request is stored in GPU memory?
Introduction to Triton for Attention
Why Triton exists, how it models GPU parallelism, and reading vLLM's attention kernels
Why Custom Kernels?
Standard PyTorch operations execute one operation at a time: matrix multiply, then softmax, then another multiply. Each operation reads from and writes to GPU HBM. The data movement between operations is the bottleneck, not the math.
Triton lets you fuse multiple operations into one kernel. Data stays in fast on-chip SRAM between operations instead of making round-trips to slow HBM.
PyTorch Default
Q*K -> HBM -> softmax -> HBM -> *V -> HBM. Three round trips to slow memory.
Triton Fused
Load Q,K,V -> SRAM -> compute attention in tiles -> write output once. One HBM round trip.
Result
2-4x speedup for attention. The math is identical - only the data movement changes.
The Triton Programming Model
Triton operates at the block level. You write code that processes a tile of data, and the GPU launches thousands of these program instances in parallel. Think of it like a factory assembly line where each worker handles one box of parts, not individual screws.
Each instance gets a unique ID. This tells it which tile of the output it is responsible for.
Generate offset indices within the tile. These map to the elements this instance processes.
Load a tile of data from HBM into registers. The mask handles boundary conditions.
Process the tile entirely in fast registers/SRAM. This is where the actual math happens.
Write the computed tile back to HBM. Ideally, this is the only write to slow memory.
GPU Memory Hierarchy
Understanding where data lives is the key to writing fast Triton kernels. The hierarchy has three levels, each with dramatically different speed and capacity.
vLLM Prefill Attention Kernel
The prefill kernel processes all prompt tokens at once. Each program instance handles one tile of query positions for one attention head. It iterates over all key/value positions in tiles, computing attention scores with online softmax.
# vllm/v1/attention/ops/triton_prefill_attention.py
@triton.jit
def _fwd_kernel(Q, K, V, sm_scale, B_Start_Loc, B_Seqlen, Out,
stride_qbs, stride_qh, stride_kbs, stride_kh, ...):
cur_batch = tl.program_id(0)
cur_head = tl.program_id(1)
start_m = tl.program_id(2)
cur_kv_head = cur_head // kv_group_num
cur_batch_seq_len = tl.load(B_Seqlen + cur_batch)
# initialize pointer to m and l
m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf")
l_i = tl.zeros([BLOCK_M], dtype=tl.float32)
acc = tl.zeros([BLOCK_M, BLOCK_DMODEL], dtype=tl.float32)
This Triton kernel runs one instance per (batch, head, query-tile) combination.
@triton.jit compiles this Python function into GPU machine code.
Three program IDs give us our position: which batch, which attention head, which query tile (start_m).
For grouped-query attention: multiple query heads share one KV head.
Load this batch item's sequence length from memory.
Online softmax state: m_i tracks the running maximum (starts at -infinity), l_i tracks the running sum of exponentials.
acc is the output accumulator - the weighted sum of values. All three live in registers throughout.
The Core Attention Loop
This is where the magic happens. The kernel iterates over tiles of keys, computing attention scores and accumulating the weighted values - all without writing intermediate results to HBM.
# The tiled attention loop (prefill kernel)
for start_n in range(start_n_limit, end_n_limit, BLOCK_N):
k = tl.load(k_ptrs + start_n * stride_kbs,
mask=(pos_k < cur_batch_seq_len))
qk = tl.dot(q, k)
qk = tl.where(mask, qk * sm_scale, -1.0e8)
m_ij = tl.maximum(m_i, tl.max(qk, 1))
qk -= m_ij[:, None]
p = tl.math.exp2(qk)
l_ij = tl.sum(p, 1)
alpha = tl.math.exp2(m_i - m_ij)
l_i = l_i * alpha + l_ij
acc = acc * alpha[:, None]
v = tl.load(v_ptrs + start_n * stride_vbs, ...)
acc += tl.dot(p, v)
m_i = m_ij
Iterate over key/value positions in tiles of BLOCK_N.
Load a tile of keys from HBM (this is the main memory read).
Compute QK dot product - attention scores for this tile.
Apply causal mask and scale. Masked positions get -inf (zero after softmax).
Online softmax: find new maximum across this tile.
Subtract maximum for numerical stability, then exponentiate (using exp2 for speed).
Sum the exponentials for this tile.
Correction factor: rescale previous results because the maximum changed.
Update running sum and rescale the output accumulator.
Load the corresponding value tile and add weighted contribution to output.
Update the running maximum for next iteration.
GPUs have a dedicated hardware instruction for 2^x (exp2) that is faster than e^x (exp). The kernel uses sm_scale * log2(e) to convert between bases. Same result, fewer cycles.
Check Your Understanding
Why is a fused Triton attention kernel faster than separate PyTorch operations (matmul + softmax + matmul)?
In the prefill kernel, what do program_id(0), program_id(1), and program_id(2) represent?
What problem does online softmax solve in tiled attention?
Optimizing Triton Attention Kernels
Identifying bottlenecks, tuning block sizes, and understanding vLLM's decode attention strategy
The Roofline: Memory vs. Compute Bound
Every kernel operation sits somewhere on the roofline. Prefill attention is often compute-bound (large matrix multiplies). Decode attention is almost always memory-bound (loading KV cache for a single query token). Different optimizations apply to each.
Prefill (Compute-Bound)
Many query tokens attend to many keys simultaneously. Large tl.dot operations saturate tensor cores. Optimize by maximizing tile occupancy.
Decode (Memory-Bound)
One new query token reads the entire KV cache. The bottleneck is loading K,V from HBM. Optimize by reducing memory traffic and using split-KV parallelism.
During decode, arithmetic intensity is extremely low: one query vector dot-producted with thousands of key vectors. The GPU finishes the math faster than memory can feed it data. This is why split-KV parallelism exists - it keeps more compute units busy by splitting the KV sequence across multiple program instances.
Split-KV: Parallelizing Decode Attention
vLLM's decode kernel uses a two-stage approach. Stage 1 splits the KV sequence across NUM_KV_SPLITS parallel instances. Stage 2 merges their results. Think of it like parallel reading groups: each group reads a chapter of the book, then they combine their notes.
Decode Stage 1: The Split-KV Kernel
Each split instance processes a slice of the KV sequence. The key insight: it uses paged memory access to handle non-contiguous blocks while iterating over its assigned KV range.
# vllm/v1/attention/ops/triton_decode_attention.py
@triton.jit
def _fwd_kernel_stage1(Q, K_Buffer, V_Buffer, sm_scale,
Req_to_tokens, B_Seqlen, Att_Out, ...):
cur_batch = tl.program_id(0)
cur_head = tl.program_id(1)
split_kv_id = tl.program_id(2)
kv_len_per_split = tl.cdiv(cur_batch_seq_len, NUM_KV_SPLITS)
split_kv_start = kv_len_per_split * split_kv_id
split_kv_end = tl.minimum(split_kv_start + kv_len_per_split,
cur_batch_seq_len)
Stage 1 of decode attention - each instance handles one split of the KV sequence.
Three axes of parallelism: across batches, across heads, and across KV splits.
program_id(2) is the split index - this instance handles one chunk of the sequence.
Calculate the range of KV positions this split is responsible for.
Ceiling division ensures all positions are covered even if sequence length is not evenly divisible.
Clamp the end to avoid reading past the actual sequence length.
Paged KV Access in the Kernel
The decode kernel handles paged memory by computing physical addresses from logical positions. This is the critical path where page table lookups happen at GPU speed.
# Inside the decode attention loop
for start_n in range(split_kv_start, split_kv_end, BLOCK_N):
offs_n = start_n + tl.arange(0, BLOCK_N)
kv_page_number = tl.load(
Req_to_tokens + stride_req_to_tokens_b * cur_batch_req_idx
+ offs_n // PAGE_SIZE,
mask=offs_n < split_kv_end)
kv_loc = kv_page_number * PAGE_SIZE + offs_n % PAGE_SIZE
k = tl.load(K_Buffer + kv_loc[:, None] * stride_buf_kbs
+ cur_kv_head * stride_buf_kh + offs_d[None, :],
mask=(offs_n[:, None] < split_kv_end) & mask_d)
qk = tl.sum(q[None, :] * k, 1)
qk *= sm_scale
Loop over this split's assigned KV range in tiles of BLOCK_N positions.
Generate position indices for this tile.
PAGE TABLE LOOKUP: divide position by page size to get which page (block) it is in. Load the physical page number from the request's page table.
Compute the actual physical memory location: page_number * PAGE_SIZE gives the page start, plus the offset within the page (position % PAGE_SIZE).
Load key vectors from their scattered physical locations using computed addresses.
Compute attention score: element-wise multiply of query with each key, then sum. For decode, query is a single vector (not a matrix like in prefill).
Scale by 1/sqrt(d_head) for stable softmax.
Optimization: Block Size Tuning
The BLOCK_N and BLOCK_DMODEL parameters control tile sizes. Choosing the right values is a trade-off between occupancy and per-instance efficiency.
Small Tiles (BLOCK_N=32)
High occupancy (many instances fit on SM), good for short sequences. But more loop iterations means more overhead per tile.
Medium Tiles (BLOCK_N=64)
Balance between occupancy and per-tile efficiency. Often the sweet spot for decode attention.
Large Tiles (BLOCK_N=128)
Better memory coalescing and amortized overhead. Best for long sequences in prefill. But may reduce occupancy.
vLLM uses tl.constexpr for BLOCK_N, meaning the compiler generates specialized code for each block size. You can benchmark different values by modifying the kernel launch configuration. The optimal choice depends on your GPU (A100 vs H100), sequence length, and batch size.
FP8 KV Cache: Halving Memory Traffic
vLLM's decode kernel supports FP8 quantized KV cache. Since decode is memory-bound, halving the bytes read from HBM nearly doubles effective performance.
# FP8 dequantization in decode kernel
ks = tl.load(k_scale)
vs = tl.load(v_scale)
for start_n in range(split_kv_start, split_kv_end, BLOCK_N):
k = tl.load(K_Buffer + offs_buf_k, mask=...)
if k.dtype.is_fp8():
k = (k.to(tl.float32) * ks).to(q.dtype)
qk = tl.sum(q[None, :] * k, 1)
Load the dequantization scale factors (one for keys, one for values). These are scalars shared across the tensor.
Normal tile loop - same structure as before.
Load key data from the buffer. If stored as FP8, each element is only 1 byte instead of 2 bytes.
Check if the data is FP8 at compile time (constexpr branch - zero runtime cost).
Dequantize: cast to float32, multiply by scale, then cast to query precision. This restores the original value range.
Continue with normal attention computation using the dequantized keys.
Check Your Understanding
Your decode attention kernel achieves only 30% of peak FLOPS. What is the most likely bottleneck?
Why does split-KV help decode performance if the bottleneck is memory bandwidth?
You switch from FP16 to FP8 KV cache. Where does the performance gain come from?
Quantization & Low-Precision Kernels
Compressing model weights and KV cache to fit larger models and serve faster
Why Quantize?
Quantization solves two problems at once: it shrinks model size so larger models fit on fewer GPUs, and it reduces memory traffic so inference runs faster. A 70B parameter model in FP16 needs ~140GB of memory. In INT4, it fits in ~35GB - a single GPU.
Think of it like compressing a high-resolution photo to JPEG. You lose some subtle details, but the image is recognizable and loads much faster. The art is choosing compression levels that preserve quality where it matters.
Full precision. 2 bytes per parameter. Best accuracy but most memory.
1 byte per parameter. ~0.1% accuracy loss. 2x memory reduction and bandwidth improvement.
0.5 bytes per parameter. 4x memory reduction. Requires careful calibration to preserve accuracy.
Quantization Granularity
The scale factor determines accuracy. Finer granularity preserves more information but adds overhead. vLLM supports all three strategies.
One scale for the entire weight matrix. Fastest dequantization but lowest accuracy - outlier values stretch the range for all other values.
One scale per output channel. Good balance - each row/column gets its own range. Used by SmoothQuant.
One scale per group of 128 values. Best accuracy preservation - used by GPTQ and AWQ. Slightly more dequantization work.
LLMs have activation outliers - a few values 10-100x larger than the rest. Per-tensor quantization wastes most of its range on these outliers, crushing the majority of values into a few quantization levels. Group-wise quantization isolates outliers to their local group.
vLLM Quantization Methods
vLLM supports a comprehensive set of quantization methods, each with different trade-offs between accuracy, speed, and memory savings.
FP8 Quantization in Practice
FP8 is the sweet spot for inference: minimal accuracy degradation, hardware-accelerated on H100/MI300X, and transparent to the model architecture. vLLM implements FP8 for both linear layers and KV cache.
# vllm/model_executor/layers/quantization/fp8.py
from vllm.model_executor.layers.quantization.base_config import (
QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.layers.quantization.kv_cache import (
BaseKVCacheMethod)
from vllm.model_executor.kernels.linear import (
init_fp8_linear_kernel)
from vllm.model_executor.kernels.linear.scaled_mm import (
CutlassFP8ScaledMMLinearKernel,
MarlinFP8ScaledMMLinearKernel)
FP8 quantization integrates with vLLM's base config system and supports both linear layers and KV cache.
QuantizationConfig defines the interface: how to quantize weights, how to create quantized methods for layers.
BaseKVCacheMethod handles FP8 quantization of the KV cache specifically (what we saw in Module 4's decode kernel).
The actual matrix multiplication uses optimized backends:
CutlassFP8 for NVIDIA GPUs (Cutlass library), Marlin for further-optimized 4-bit/8-bit GEMM kernels.
How Quantized Inference Works
The quantized forward pass adds a dequantization step before (or fused into) each linear operation. The key optimization is fusing dequantization with the GEMM so there is no extra memory traffic.
Check Your Understanding
A model is quantized from FP16 to INT4. The accuracy drops by only 0.5%. Where does the speed improvement come from?
Your model has a few attention heads where one activation value is 50x larger than the rest. What quantization strategy minimizes accuracy loss?
Why does vLLM fuse dequantization with the GEMM operation instead of dequantizing first, then doing a standard GEMM?
Distributed Inference & Advanced Topics
Scaling across GPUs, continuous batching, speculative decoding, and profiling strategies
Scaling Beyond One GPU
Large models (70B+ parameters) exceed single-GPU memory even with quantization. vLLM supports splitting models across multiple GPUs using tensor parallelism and pipeline parallelism.
Think of it like a relay race vs. parallel swimmers. Tensor parallelism is parallel swimmers: everyone works on the same lap simultaneously and syncs at the wall. Pipeline parallelism is a relay: each swimmer handles one leg sequentially.
Tensor Parallelism
Split weight matrices column/row-wise across GPUs. Each GPU computes partial attention/FFN. Requires all-reduce after each layer.
Pipeline Parallelism
Assign layer ranges to GPUs. Lower communication overhead but harder to keep all GPUs busy. Used for very large models.
Data Parallelism
Each GPU has a full model copy and processes different requests. Simple but requires enough memory per GPU for the full model.
Communication Primitives
Tensor parallelism requires GPUs to exchange partial results after each layer. vLLM uses all-reduce for this synchronization.
# vllm/distributed/communication_op.py
def tensor_model_parallel_all_reduce(
input_: torch.Tensor) -> torch.Tensor:
"""All-reduce the input tensor across
model parallel group."""
return get_tp_group().all_reduce(input_)
def tensor_model_parallel_all_gather(
input_: torch.Tensor, dim: int = -1
) -> torch.Tensor:
"""All-gather the input tensor across
model parallel group."""
return get_tp_group().all_gather(input_, dim)
This file defines the communication primitives used in distributed inference.
All-reduce: every GPU sends its partial result and receives the sum of all partials. After attention QKV projection, each GPU has a slice of the output - all-reduce combines them.
get_tp_group() returns the tensor-parallel communication group (the set of GPUs sharing one model).
All-gather: each GPU contributes its slice, everyone receives the full concatenated tensor. Used when a layer needs the complete hidden state (e.g., before FFN in some parallelization schemes).
Tensor Parallel Attention
In tensor-parallel attention, each GPU handles a subset of attention heads. After computing attention for its heads, results are combined via all-reduce before passing to the next layer.
Continuous Batching
Continuous batching is vLLM's scheduling superpower. Instead of waiting for all requests in a batch to finish before starting new ones, the scheduler inserts and removes requests at every step.
Speculative Decoding
Speculative decoding is vLLM's approach to reducing latency. A small draft model guesses the next several tokens, then the large model verifies them in parallel. When guesses are correct, you get multiple tokens from one expensive forward pass.
A small model (e.g., 1B params) quickly predicts tokens t+1, t+2, ..., t+K. This is cheap because the model is small.
The large model (70B) runs one forward pass on all K draft tokens simultaneously. It computes what it would have generated at each position.
If the draft matches the target at positions 1-3 but diverges at position 4, we keep tokens 1-3 (3 tokens for one pass) and resample position 4.
If the draft model is 80% accurate, we get ~3-4 tokens per target forward pass instead of 1. Latency drops proportionally.
Speculative decoding fits naturally into V1's unified scheduling: draft tokens increase num_tokens_with_spec, and the scheduler handles them identically to regular tokens. Rejected draft tokens simply don't increment num_computed_tokens.
Profiling & Debugging Strategies
Optimizing kernels requires measurement. Here are the key tools and what they reveal about vLLM performance.
nsys (NVIDIA Nsight Systems)
Timeline view of all GPU activity. Shows kernel launch gaps, communication overhead, and scheduling inefficiency. Start here for system-level issues.
ncu (Nsight Compute)
Deep per-kernel analysis. Shows memory throughput, compute utilization, occupancy, and roofline position. Use for optimizing individual Triton kernels.
Triton Profiler
Built into Triton. Shows auto-tuning results, compiled PTX, and tile scheduling. Use when debugging Triton-specific issues like register pressure.
nsys profile python -m vllm.entrypoints.openai.api_server
Profile a full vLLM server run
ncu --set full python benchmark.py
Deep-dive into a single kernel
TRITON_PRINT_AUTOTUNING=1
See which kernel configurations Triton selects