Discrete Diffusion: Bypassing the HBM Bottleneck

A few months ago, I was staring at a Grafana dashboard monitoring a cluster of eight NVIDIA H100 GPUs running a standard 70-billion parameter autoregressive language model, watching our Tensor Core utilization hover at a pathetic 3% while memory bandwidth was pegged at 99%. The GPUs were spending almost all of their time waiting for model weights to travel from high-bandwidth memory (HBM) into on-chip memory just to predict a single token. That is why a quiet architectural shift toward discrete diffusion is the most exciting development in production AI right now. By spending abundant parallel GPU compute to save scarce memory bandwidth, this alternative architecture sidesteps the HBM bottleneck, rewriting the economics of both cloud and local AI inference.

If you serve large language models at low batch sizes, you are trapped in this exact same economic nightmare. We are hoarding expensive HBM silicon not because we need the raw compute, but because we need the memory bandwidth.

But with Google DeepMind releasing DiffusionGemma and startups launching production systems on models like Mercury 2, discrete diffusion is shifting the ground beneath our feet. Let’s look at why this works, how to implement it, and the systems-level engineering tradeoffs you must make to run it in production.

The Memory-Bandwidth Wall: Why Autoregressive LLMs are Starving Your GPUs

To understand why traditional autoregressive (AR) models are so expensive to serve, we have to look at the brutal reality of hardware arithmetic intensity. Arithmetic intensity is the ratio of floating-point operations (FLOPs) performed per byte of memory transferred.

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$

Let us calculate this for a standard 7-billion parameter (7B) model running in FP16 precision during the autoregressive decoding phase. To predict exactly one token, the GPU must load every single one of those 7 billion parameters from HBM into its on-chip SRAM and registers.

At 2 bytes per parameter (FP16), this requires transferring 14 gigabytes of data across the memory bus. The computation required to process one token through a 7B parameter network is roughly 2 operations per parameter (one multiply and one add):

$$\text{Compute} = 2 \times 7 \times 10^9 = 14 \text{ GFLOPs}$$

So, to generate a single token, we transfer 14 gigabytes of weights to perform 14 GFLOPs of math. Our arithmetic intensity is exactly 1 FLOP per byte.

Now compare this with the hardware roofline of an NVIDIA H100 SXM5 GPU. An H100 offers roughly 2,000 TFLOP/s of peak FP16 Tensor Core compute (NVIDIA’s headline figure with structured sparsity; dense throughput is about half that) and 3.35 TB/s of memory bandwidth. To keep the GPU’s Tensor Cores fully saturated at that peak, we need our workload to have an arithmetic intensity of:

$$\text{Required Intensity} = \frac{2{,}000 \times 10^{12} \text{ FLOPs/s}}{3.35 \times 10^{12} \text{ Bytes/s}} \approx 597 \text{ FLOPs/Byte}$$

At an intensity of 1 FLOP/byte, our $30,000 GPU is running at less than 0.2% of its maximum theoretical compute capacity. The Tensor Cores are starving. FlashAttention does not save you here either. While FlashAttention dramatically reduces memory traffic inside the attention computation by never materializing the full attention matrix in HBM, it does nothing to alleviate the constant, relentless sweeping of the main model weights from high-bandwidth memory.
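
The roofline comparison above can be reproduced in a few lines. This is a minimal sketch that plugs in the peak figures quoted in the text (they are vendor peaks, not measurements), and the function names are illustrative rather than taken from any library. It also derives a number the prose only implies: if every decoded token needs one full sweep of 14 GB of weights, bandwidth alone caps batch-1 decoding at a few hundred tokens per second.

```python
# roofline_check.py
# Peak figures are the ones quoted in the article, not measured values.
PEAK_FLOPS = 2_000e12      # FLOP/s, FP16 Tensor Core peak quoted above
PEAK_BANDWIDTH = 3.35e12   # bytes/s, HBM bandwidth quoted above


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where compute and memory limits meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Classic roofline model: the lower of the compute roof and bandwidth * intensity."""
    return min(peak_flops, peak_bandwidth * intensity)


def decode_ceiling_tokens_per_s(model_bytes: float, peak_bandwidth: float) -> float:
    """Upper bound on batch-1 decode speed when each token costs one weight sweep."""
    return peak_bandwidth / model_bytes


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
    utilization = attainable_flops(1.0, PEAK_FLOPS, PEAK_BANDWIDTH) / PEAK_FLOPS
    ceiling = decode_ceiling_tokens_per_s(14e9, PEAK_BANDWIDTH)
    print(f"Ridge point: {ridge:.0f} FLOPs/byte")
    print(f"Compute utilization at 1 FLOP/byte: {utilization:.3%}")
    print(f"Bandwidth-bound decode ceiling (7B, FP16): {ceiling:.0f} tokens/s")
```

Real kernels land below both roofs, so treat these as ceilings rather than predictions.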

# ar_vs_diffusion_intensity.py

def calculate_arithmetic_intensity(
    model_params_billions: float, 
    precision_bytes: int, 
    batch_size: int, 
    sequence_length: int = 1, 
    is_diffusion: bool = False, 
    diffusion_steps: int = 1
) -> float:
    """
    Computes theoretical arithmetic intensity to highlight the physical 
    bottlenecks of autoregressive generation versus discrete diffusion.
    """
    model_size_bytes = model_params_billions * 1e9 * precision_bytes
    
    if not is_diffusion:
        # Autoregressive decoding (1 token at a time)
        # 2 FLOPs per parameter per token
        flops = 2 * (model_params_billions * 1e9) * batch_size
        bytes_transferred = model_size_bytes
        return flops / bytes_transferred
    else:
        # Discrete Diffusion (parallel canvas update)
        # Every denoising step runs a full forward pass over the whole canvas,
        # so FLOPs scale with both the canvas length and the number of steps
        flops = (
            2 * (model_params_billions * 1e9)
            * batch_size * sequence_length * diffusion_steps
        )
        # We must load the model weights once per denoising step
        bytes_transferred = model_size_bytes * diffusion_steps
        return flops / bytes_transferred

if __name__ == "__main__":
    ar_intensity = calculate_arithmetic_intensity(7.0, 2, batch_size=1)
    diff_intensity = calculate_arithmetic_intensity(
        7.0, 2, batch_size=1, sequence_length=256, is_diffusion=True, diffusion_steps=32
    )

    print(f"Autoregressive Decode Intensity: {ar_intensity:.4f} FLOPs/Byte")
    print(f"Discrete Diffusion Intensity (256 canvas, 32 steps): {diff_intensity:.4f} FLOPs/Byte")

Running this calculation yields a stark contrast: while the autoregressive step languishes at an intensity of 1.0, the discrete diffusion setup reaches 256 FLOPs per byte, because every weight loaded from memory is reused across all 256 canvas positions. The step count cancels out of the ratio. What the 32 steps cost is amortization: 256 tokens for 32 weight sweeps works out to 8 finished tokens per sweep, against 1 for autoregressive decoding. By operating on a parallel canvas, we are transferring weights to compute hundreds of tokens simultaneously, dramatically shifting the workload toward the GPU’s compute limits.

Enter Discrete Diffusion: Trading FLOPS for HBM Bandwidth

Discrete Diffusion Language Models (dLLMs) break the traditional left-to-right causal constraint by generating tokens on a bidirectional parallel canvas. Instead of predicting token $N+1$ from the history of $1$ to $N$, a dLLM starts with a fixed-length block, such as 256 tokens, filled entirely with a special [MASK] token.

Over a sequence of $T$ denoising steps, which is typically between 12 and 64, the model evaluates the entire canvas. It uses bidirectional self-attention to predict the underlying concrete tokens for all masked positions at once, unmasking and refining the highest-confidence tokens at each step.

$$\gamma(t) = \cos\left(\frac{\pi t}{2 T}\right)$$

The cosine scheduling function determines the ratio of masked tokens remaining at step $t$. Because the forward pass computes dense matrix multiplications across the entire 256-token canvas simultaneously, each weight fetched from memory is reused hundreds of times, which keeps the GPU’s Tensor Cores far busier. We are no longer executing tiny vector-matrix operations (GEMV). We are running massive, hardware-friendly matrix-matrix multiplications (GEMM).
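
To make the schedule concrete, here is a small sketch that evaluates $\gamma(t)$ for a 256-token canvas and reports how many positions remain masked after each step. The helper names and the rounding rule are assumptions for illustration; real implementations may round or clamp differently.

```python
# cosine_schedule.py
import math


def masked_fraction(t: int, total_steps: int) -> float:
    """gamma(t) = cos(pi * t / (2 * T)): fraction of the canvas still masked at step t."""
    return math.cos(math.pi * t / (2 * total_steps))


def masked_counts(canvas_size: int, total_steps: int) -> list[int]:
    """Number of still-masked positions at each step t = 0..T."""
    counts = []
    for t in range(total_steps + 1):
        # round() yields an integer count; max() guards against a tiny
        # negative value from floating-point error near t == T
        counts.append(max(0, round(canvas_size * masked_fraction(t, total_steps))))
    return counts


if __name__ == "__main__":
    counts = masked_counts(canvas_size=256, total_steps=8)
    for t, remaining in enumerate(counts):
        revealed = counts[t - 1] - remaining if t > 0 else 0
        print(f"step {t}: {remaining:3d} masked, {revealed:3d} revealed this step")
```

Under this schedule the model commits to only a handful of tokens in the early steps, when it has the least context, and reveals progressively larger groups as the canvas fills in.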

Indeed, this architectural shift moves the primary bottleneck from memory bandwidth to raw floating-point throughput. We have successfully traded FLOPs (which are cheap and abundant on modern silicon) for memory bandwidth (which is expensive and physically constrained).

The Death of HBM Hoarding: How This Rewrites GPU Economics

When your workload is compute-bound rather than memory-bandwidth bound, the type of hardware you buy changes completely. If you are running autoregressive models, you are forced to pay a premium for enterprise GPUs like the NVIDIA H200 or H100 primarily because of their HBM3e and HBM3 memory.

With discrete diffusion, those premium memory specs become far less critical. High compute density per dollar becomes the metric that matters. This shift allows consumer-grade and enterprise-adjacent GPUs to perform exceptionally well.

Consider the NVIDIA RTX 5090 or the L40S. These cards have incredibly fast raw tensor processing power, but they are equipped with slower GDDR memory buses compared to their HBM-laden siblings. In an autoregressive serving setup, an L40S is severely throttled. But when running a dLLM, the L40S can stretch its legs and run at maximum capacity.
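
A rough way to see the throttling is to compute the bandwidth-only ceiling on decode speed for each card. The sketch below reuses the 7B FP16 model from earlier. The L40S bandwidth is an approximate published spec that I am assuming here, so check the vendor datasheet before relying on it, and the diffusion figure deliberately ignores the compute roof, which becomes the real limit once bandwidth stops being the constraint.

```python
# bandwidth_ceilings.py
# Approximate peak memory bandwidths in bytes/s (assumptions, not measurements)
GPU_BANDWIDTH = {
    "H100 SXM5 (HBM3)": 3.35e12,  # figure quoted earlier in the article
    "L40S (GDDR6)": 0.864e12,     # assumed from NVIDIA's public spec sheet
}


def ar_decode_ceiling(params: float, bytes_per_param: float, bandwidth: float) -> float:
    """Tokens/s at batch 1 if each token requires one full sweep of the weights."""
    return bandwidth / (params * bytes_per_param)


def diffusion_bandwidth_ceiling(
    params: float, bytes_per_param: float, bandwidth: float,
    canvas_tokens: int, steps: int,
) -> float:
    """Tokens/s if bandwidth were the only limit: one weight sweep per denoising
    step, and `canvas_tokens` finished tokens after `steps` sweeps."""
    sweeps_per_second = bandwidth / (params * bytes_per_param)
    return sweeps_per_second * canvas_tokens / steps


if __name__ == "__main__":
    for name, bandwidth in GPU_BANDWIDTH.items():
        ar = ar_decode_ceiling(7e9, 2, bandwidth)
        dd = diffusion_bandwidth_ceiling(7e9, 2, bandwidth, canvas_tokens=256, steps=32)
        print(f"{name}: AR ceiling {ar:.0f} tok/s, diffusion bandwidth ceiling {dd:.0f} tok/s")
```

The autoregressive ceiling scales directly with memory bandwidth, which is why the GDDR card looks so weak there. The diffusion ceiling lifts that cap by the canvas-to-steps ratio, at which point raw tensor throughput decides the outcome.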

This paradigm shift is already showing up in production benchmarks. Google DeepMind’s DiffusionGemma 26B, a Mixture-of-Experts (MoE) model activating 3.8B parameters per step, has been clocked at over 1,000 tokens per second on raw compute configurations.

This throughput completely upends the economics of low-batch serving. You no longer need to orchestrate massive, complex Tensor Parallel clusters over InfiniBand networks just to hit acceptable latencies for a few active users.

Gotcha 1: The KV Cache Paradox in Bidirectional Architectures

As exciting as this sounds, you cannot simply swap your model architecture without paying some engineering tax. The first major gotcha lies in the mechanics of self-attention.

In a pure bidirectional dLLM, every token in the sequence can attend to every other token. Because the entire canvas is changing at every denoising step $t$, the keys and values of your token representations are constantly changing.

This behavior introduces a painful reality: you cannot use a standard static key-value (KV) cache. In a pure bidirectional model, you are forced to recompute a quadratic-complexity $O(L^2)$ forward pass over the entire canvas at every single denoising step.

To make long sequences computationally viable, much recent work has converged on block diffusion models such as BD3-LMs (Block Discrete Denoising Diffusion Language Models). In these hybrid models, attention within a specific generation block (the 256-token canvas) is bidirectional, while attention across blocks remains causal: each block attends to the blocks before it but never to later ones.
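
To see why the block structure matters, it helps to count how many token positions must be pushed through the network per request. The following is a simplified sketch under stated assumptions: the pure bidirectional model reprocesses the prompt and the canvas at every step, while the block-causal model processes the prompt once to fill a cache and then only the canvas. It ignores the attention-score cost itself, and the function name is illustrative.

```python
# recompute_cost.py
def total_token_passes(prompt_len: int, canvas_len: int, steps: int, block_causal: bool) -> int:
    """Token positions processed by the network to denoise one canvas block."""
    if block_causal:
        # Prompt/history is encoded once and its keys/values are cached;
        # each denoising step then touches only the active canvas
        return prompt_len + steps * canvas_len
    # Pure bidirectional: nothing can be cached, so every step reprocesses
    # the prompt and the canvas together
    return steps * (prompt_len + canvas_len)


if __name__ == "__main__":
    pure = total_token_passes(prompt_len=2048, canvas_len=256, steps=32, block_causal=False)
    block = total_token_passes(prompt_len=2048, canvas_len=256, steps=32, block_causal=True)
    print(f"Pure bidirectional: {pure:,} token passes")
    print(f"Block-causal:       {block:,} token passes")
    print(f"Reduction:          {pure / block:.1f}x")
```

The gap widens as the prompt grows relative to the canvas, which is exactly the long-context regime where the pure bidirectional design becomes impractical.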

# block_causal_attention.py
import torch
import torch.nn as nn

class BlockCausalAttentionMasker(nn.Module):
    def __init__(self, block_size: int):
        super().__init__()
        self.block_size = block_size

    def forward(self, seq_len: int, device: torch.device) -> torch.Tensor:
        """
        Creates an attention mask where historical blocks are causal,
        but tokens within the current active block can attend to each other bidirectionally.
        """
        # Block index of every position; a partial trailing block is handled too,
        # so no row is ever left fully blocked when seq_len % block_size != 0
        block_ids = torch.arange(seq_len, device=device) // self.block_size

        # Query i may attend to key j when j's block is the same as, or earlier
        # than, i's block: bidirectional within a block, causal across blocks
        allowed = block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

        # Convert to an additive transformer mask (0.0 allowed, -inf blocked)
        transformer_mask = torch.zeros((seq_len, seq_len), device=device)
        transformer_mask = transformer_mask.masked_fill(~allowed, float('-inf'))
        return transformer_mask

if __name__ == "__main__":
    masker = BlockCausalAttentionMasker(block_size=256)
    # Example: 512 token sequence (2 blocks of 256)
    mask = masker(seq_len=512, device=torch.device('cpu'))
    print("Block Causal Mask Shape:", mask.shape)
    # Verify that block 1 cannot look at block 2, but block 2 can look at block 1
    assert mask[10, 300] == float('-inf'), "Block 1 should not attend to Block 2"
    assert mask[300, 10] == 0.0, "Block 2 should causally attend to Block 1"

Managing this hybrid attention in production is a systems engineering headache. Modern serving engines like vLLM use a custom state abstraction inside their runner architecture to decouple the static, causal historical KV cache from the dynamic, volatile canvas registers. When a block is finalized, its final keys and values are written to the static cache, and the dynamic registers are cleared to start denoising the next canvas block.
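
The split between a static cache and a volatile canvas can be sketched in plain Python. This is a hypothetical illustration of the idea described above, not vLLM's actual API: the class name, its methods, and the use of plain lists in place of key/value tensors are all assumptions made for readability.

```python
# hybrid_kv_cache.py
from dataclasses import dataclass, field


@dataclass
class HybridKVCache:
    """Static cache for finalized blocks plus a volatile slot for the active canvas."""
    block_size: int
    static_kv: list = field(default_factory=list)  # append-only, finalized blocks
    canvas_kv: list = field(default_factory=list)  # rewritten at every denoising step

    def write_canvas(self, kv_entries: list) -> None:
        """Overwrite the canvas entries; they change after each denoising step."""
        if len(kv_entries) != self.block_size:
            raise ValueError("canvas must contain exactly block_size entries")
        self.canvas_kv = list(kv_entries)

    def finalize_block(self) -> None:
        """Freeze the current canvas into the static cache and clear the canvas."""
        if not self.canvas_kv:
            raise RuntimeError("no active canvas to finalize")
        self.static_kv.extend(self.canvas_kv)
        self.canvas_kv = []

    def attention_context(self) -> list:
        """Entries the active block attends to: cached history, then the canvas."""
        return self.static_kv + self.canvas_kv


if __name__ == "__main__":
    cache = HybridKVCache(block_size=2)
    cache.write_canvas(["k0@step1", "k1@step1"])
    cache.write_canvas(["k0@step2", "k1@step2"])  # the earlier step is discarded
    cache.finalize_block()
    cache.write_canvas(["k2@step1", "k3@step1"])
    print(cache.attention_context())
```

The key property is that only `canvas_kv` is ever rewritten, so memory for finished blocks can be paged and shared exactly as it is for an autoregressive cache.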

Gotcha 2: The “Fixed Canvas” Tax and Short-Answer Inefficiencies

In an autoregressive model, decoding is highly elastic. The moment the model predicts the end-of-sequence (<eos>) token, the generation loop stops, and the GPU is freed to handle the next request.

Discrete diffusion models do not have this luxury. They operate on a fixed-size generation canvas. If your model is configured with a 256-token canvas, the system must execute the full denoising schedule across all 256 positions, regardless of whether the actual response is a single word or a long paragraph.

If a user asks a simple yes or no question, a dLLM will still spend its entire compute budget running $T$ denoising steps across the entire 256-token canvas, essentially wasting FLOPs on padding tokens.
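
The size of this tax is easy to quantify. The sketch below, with illustrative function names, computes the fraction of canvas positions that end up as padding and the number of token passes spent per useful output token, assuming the full schedule runs with no early exit.

```python
# canvas_tax.py
def canvas_waste_fraction(answer_tokens: int, canvas_size: int) -> float:
    """Fraction of canvas positions that hold padding rather than answer tokens."""
    if not 0 <= answer_tokens <= canvas_size:
        raise ValueError("answer_tokens must be between 0 and canvas_size")
    return 1.0 - answer_tokens / canvas_size


def token_passes_per_answer_token(answer_tokens: int, canvas_size: int, steps: int) -> float:
    """Token positions processed per useful token when all steps are executed."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return canvas_size * steps / answer_tokens


if __name__ == "__main__":
    for answer in (2, 64, 256):
        waste = canvas_waste_fraction(answer, 256)
        cost = token_passes_per_answer_token(answer, 256, 32)
        print(f"{answer:3d}-token answer: {waste:.1%} padding, {cost:,.0f} token passes per useful token")
```

For a two-token answer, over 99% of the canvas is padding. Even a perfectly filled canvas still costs one pass per position per step, which is the baseline that early stopping tries to cut.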

To mitigate this fixed canvas tax, production serving stacks use entropy-bound early-stopping loops. By calculating the Shannon entropy of the predicted token probability distributions across the active canvas at each step, we can determine when the model has reached high confidence and terminate the denoising process early.

# entropy_early_stopping.py
import torch

def should_early_exit(
    logits: torch.Tensor, 
    active_mask_indices: torch.Tensor, 
    entropy_threshold: float = 0.05
) -> tuple[bool, float]:
    """
    Evaluates whether the active canvas has converged early by checking
    if the average entropy of the still-masked (active) token distributions is below a threshold.
    """
    # Softmax to get probability distribution over vocabulary
    probs = torch.softmax(logits, dim=-1)
    
    # Calculate Shannon Entropy: -sum(p * log(p))
    # We add a tiny epsilon to avoid log(0)
    eps = 1e-9
    entropy = -torch.sum(probs * torch.log(probs + eps), dim=-1)
    
    # Isolate only the positions that are still actively being denoised
    active_entropies = entropy[active_mask_indices]
    mean_entropy = torch.mean(active_entropies).item()
    
    # If mean entropy is low, the model is confident in its predictions
    return mean_entropy < entropy_threshold, mean_entropy

if __name__ == "__main__":
    # Mock logits for 5 active tokens over a vocabulary of 1000 tokens
    mock_logits_unconverged = torch.randn(5, 1000)
    mock_logits_converged = torch.zeros(5, 1000)
    
    # Make one token highly likely in the converged mock (low entropy)
    mock_logits_converged[range(5), [42, 121, 9, 850, 4]] = 100.0 

    active_indices = torch.tensor([0, 1, 2, 3, 4])

    exit_unconverged, ent_un = should_early_exit(mock_logits_unconverged, active_indices)
    exit_converged, ent_co = should_early_exit(mock_logits_converged, active_indices)

    print(f"Unconverged - Mean Entropy: {ent_un:.4f}, Exit Recommendation: {exit_unconverged}")
    print(f"Converged - Mean Entropy: {ent_co:.4f}, Exit Recommendation: {exit_converged}")

Serving dLLMs in Production: Hybrid Caching and Dynamic Schedulers

Designing a production-grade inference server for discrete diffusion requires a complete rewrite of the traditional request scheduler. In standard vLLM setups, the scheduler’s focus is almost entirely on paging physical memory blocks for the KV cache via PagedAttention.

With dLLMs, requests move through two distinct cyclic phases:

The Refresh Phase (Compute-Bound): When a new block of tokens is initialized, the model runs a dense sequence-wide update to contextualize the incoming prompt and set up the initial noisy canvas.
The Reuse Phase (Compute-Light, Memory-Bound): During the intermediate denoising steps, the model only updates selected low-confidence tokens, reusing the cached KV states of the already confident tokens.

A naive scheduler that groups requests based on arrival time will cause extreme hardware resource oscillation, where the GPU swings wildly between being compute-bound and memory-bound.

Request Queue:
┌─────────────────────────────────┐
│ Req 1 (Denoise Step 3/32)       │  ◄── Low Compute, High Reuse (GDDR-bound)
├─────────────────────────────────┤
│ Req 2 (New Block Init / Step 0) │  ◄── High Compute, Heavy Refresh (Tensor Core-bound)
└─────────────────────────────────┘

Modern serving engines solve this by using dynamic phase balancing. The scheduler actively mixes requests in the Refresh Phase with requests deep in their Reuse Phase within the same forward execution batch. This approach flattens the resource usage curve and keeps GPU execution highly consistent.
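
A toy version of phase balancing might look like the following. It is a sketch of the general idea only, not the scheduler of any real serving engine: the `Request` type, the phase labels, and the `refresh_per_batch` knob are all hypothetical.

```python
# phase_balancing.py
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    phase: str  # "refresh" (compute-heavy) or "reuse" (compute-light)


def build_balanced_batches(
    requests: list[Request], batch_size: int, refresh_per_batch: int
) -> list[list[Request]]:
    """Mix a bounded number of refresh requests with reuse requests in each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    refresh = deque(r for r in requests if r.phase == "refresh")
    reuse = deque(r for r in requests if r.phase == "reuse")
    batches = []
    while refresh or reuse:
        batch: list[Request] = []
        # Admit a capped number of compute-heavy refresh requests first
        while refresh and len(batch) < min(refresh_per_batch, batch_size):
            batch.append(refresh.popleft())
        # Fill the remaining slots with lighter reuse requests
        while reuse and len(batch) < batch_size:
            batch.append(reuse.popleft())
        # If the reuse queue ran dry, top up with refresh work so slots are not wasted
        while refresh and len(batch) < batch_size:
            batch.append(refresh.popleft())
        batches.append(batch)
    return batches


if __name__ == "__main__":
    queue = [Request("R1", "refresh"), Request("R2", "refresh")]
    queue += [Request(f"U{i}", "reuse") for i in range(1, 5)]
    for batch in build_balanced_batches(queue, batch_size=3, refresh_per_batch=1):
        print([r.request_id for r in batch])
```

An arrival-order scheduler would have put both refresh requests in the first batch. Capping them per batch is what spreads the heavy work across iterations.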

Benchmarks and Use Cases: Where dLLMs Crushed Our Expectations

In my own infrastructure tests, we put Google DeepMind’s DiffusionGemma through its paces against a highly optimized Gemma 4 autoregressive baseline.

Our testbed centered on two configurations: an enterprise NVIDIA H200 (141GB HBM3e) and a workstation running twin RTX PRO 6000 Blackwell cards (96GB GDDR7 each). Precision varied by run and is listed for each row in the table, which also includes an H100 data point.

Hardware Configuration	Precision	Model Architecture	Generation Speed	Performance Multiplier
NVIDIA H200	FP8	DiffusionGemma 26B (dLLM MoE)	1,288 tokens/sec	6.0x (vs AR Baseline)
NVIDIA H100	FP8	DiffusionGemma 26B (dLLM MoE)	1,008 tokens/sec	4.7x (vs AR Baseline)
NVIDIA H200	BF16	Gemma 4 26B (AR Baseline)	215 tokens/sec	1.0x (Baseline)
RTX PRO 6000 Blackwell	NVFP4	DiffusionGemma 26B (dLLM MoE)	1,062 tokens/sec	6.8x (vs AR Baseline)
RTX PRO 6000 Blackwell	NVFP4	Gemma 4 26B (AR Baseline)	157 tokens/sec	1.0x (Baseline)

The performance on workstation silicon is where the paradigm shift becomes obvious. On the RTX PRO 6000 cards, the autoregressive Gemma 4 struggled, hitting just 157 tokens per second due to the slower GDDR7 memory bus.

But DiffusionGemma, running on the exact same workstation GPU, soared to 1,062 tokens per second. By easing the memory bus bottleneck, the workstation hardware landed within about 18% of the H200’s diffusion throughput and edged past the H100 result.

This economic shift is already playing out in commercial applications, especially for the team at Augment Code. They migrated their real-time agentic code correction backend to Mercury 2 running on Baseten.

Autocomplete models are highly sensitive to latency, and developers expect responses in milliseconds. By switching to a parallel block-generation dLLM, they reduced latency on critical tasks by 82% while lowering their underlying cloud infrastructure bills by 90%.

We saw a similar pattern in task-specific fine-tuning. While raw discrete diffusion models can occasionally struggle with complex logical reasoning out of the box, they are highly receptive to specialized training.

We ran an experiment using Unsloth and JAX to fine-tune DiffusionGemma on structured puzzles like Sudoku solving. The fine-tuning process not only improved logical correctness from near-zero to 80%, but it also allowed the model to converge in far fewer denoising steps, cutting the required step budget $T$ from 32 down to 12.

# simple_dllm_pipeline.py
import torch
import torch.nn.functional as F

class SimpleDiscreteDiffusionDecoder:
    def __init__(self, model: torch.nn.Module, vocab_size: int, pad_token_id: int, mask_token_id: int):
        self.model = model
        self.vocab_size = vocab_size
        self.pad_token_id = pad_token_id
        self.mask_token_id = mask_token_id

    @torch.no_grad()
    def denoise_step(
        self, 
        canvas: torch.Tensor, 
        mask_indices: torch.Tensor, 
        step: int, 
        total_steps: int
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Executes a single parallel denoising step over the active canvas.
        """
        # Forward pass over the entire parallel canvas
        logits = self.model(canvas) 
        
        probs = F.softmax(logits, dim=-1)
        max_probs, predicted_tokens = torch.max(probs, dim=-1)
        
        # Determine how many tokens to unmask in this step.
        # Linear schedule: spread the still-masked tokens evenly over the
        # remaining steps (step is 0-indexed), so the last step fills the rest.
        # Cosine schedules also work well.
        remaining_steps = max(total_steps - step, 1)
        num_masked = len(mask_indices)
        num_to_unmask = (num_masked + remaining_steps - 1) // remaining_steps

        if num_to_unmask == 0:
            # Nothing left to unmask: the canvas is already fully decoded
            return canvas, mask_indices

        # Sort the masked positions based on prediction confidence
        confidences = max_probs[mask_indices]
        sorted_indices = torch.argsort(confidences, descending=True)

        # Unmask the highest confidence tokens
        keep_masked_indices = mask_indices[sorted_indices[num_to_unmask:]]
        fill_indices = mask_indices[sorted_indices[:num_to_unmask]]

        canvas[fill_indices] = predicted_tokens[fill_indices]

        return canvas, keep_masked_indices
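
A step function like `denoise_step` needs an outer loop to drive it. Here is a framework-free sketch of that loop, assuming an even linear unmasking schedule. The `toy_model` is a random stand-in for a real dLLM forward pass, and `MASK`, `decode_canvas`, and the other names are hypothetical choices made so the example runs with the standard library alone.

```python
# toy_decode_loop.py
import random

MASK = -1  # hypothetical mask token id for this toy example


def toy_model(canvas: list[int], vocab_size: int, rng: random.Random) -> list[tuple[int, float]]:
    """Stand-in for a dLLM forward pass: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def decode_canvas(
    canvas_size: int, total_steps: int, vocab_size: int = 100, seed: int = 0
) -> tuple[list[int], list[int]]:
    """Fill a fully masked canvas, revealing the most confident positions first."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_size
    masked = list(range(canvas_size))
    history = []  # number of masked positions left after each step
    for step in range(total_steps):
        if not masked:
            break
        predictions = toy_model(canvas, vocab_size, rng)
        # Even linear schedule: ceil(remaining masked / remaining steps)
        remaining_steps = total_steps - step
        num_to_unmask = -(-len(masked) // remaining_steps)
        # Order the masked positions by confidence, highest first
        masked.sort(key=lambda pos: predictions[pos][1], reverse=True)
        for pos in masked[:num_to_unmask]:
            canvas[pos] = predictions[pos][0]
        masked = masked[num_to_unmask:]
        history.append(len(masked))
    return canvas, history


if __name__ == "__main__":
    canvas, history = decode_canvas(canvas_size=16, total_steps=4)
    print("masked positions left after each step:", history)
    print("final canvas:", canvas)
```

In a real pipeline the early-exit check would sit inside this loop, breaking out as soon as the remaining positions are confident enough.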

Making the Call: When Should You Migrate to Discrete Diffusion?

Switching from an autoregressive architecture to discrete diffusion is not a one-size-fits-all upgrade. It requires evaluating your specific application demands, latency tolerances, and hardware budgets.

Stick with Autoregressive Models if:

You run massive batch sizes in multi-tenant environments where each weight load is already amortized across many concurrent requests, so decoding is no longer bandwidth-starved anyway.
Your application relies heavily on open-ended, highly creative long-form generation where strict left-to-right causal coherence remains the state of the art.
Your infrastructure is already fully paid for and optimized around deep tensor-parallel pipelines.

Migrate to Discrete Diffusion if:

You are deploying single-user or agentic applications on edge devices, local workstations, or private cloud setups where batch sizes are small.
Your application requires fast, parallel text generation like real-time code autocompletion, structural data parsing, or live translation.
You want to scale down your cloud bills by migrating from premium enterprise H100/H200 clusters to cost-effective consumer-grade or mid-tier hardware like the L40S.

Take a close look at your GPU metrics today. If your memory bus is pegged at 99% while your Tensor Cores are sitting idle, you are paying for silicon you are not using. Trade those idle FLOPs for memory bandwidth and stop hoarding expensive HBM.