Week 04 · March 2026

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

March 28, 2026 · by Satish K C · 15 min read
Deep Learning · Efficiency · LLMs · Optimization

The Paper

"Mamba: Linear-Time Sequence Modeling with Selective State Spaces" was submitted to arXiv on December 1, 2023 (arXiv:2312.00752) by Albert Gu (Carnegie Mellon University) and Tri Dao (Princeton University / Together AI), and was accepted at ICLR 2024. Within two years of publication, the paper accumulated over 5,000 citations - an extraordinary reception for a paper that directly challenged the architectural consensus built around Transformers.

The central claim is precise: the computational bottleneck of Transformers on long sequences is not a tuning problem or an engineering problem. It is a structural one. Transformer attention scales as O(L²) in both time and memory with sequence length L. Mamba proposes an alternative architecture based on Selective State Space Models (SSMs) that achieves O(L) complexity while matching or exceeding Transformer quality across language, DNA, and audio benchmarks.

Read the Paper on arXiv →

The Problem Before This Paper

State Space Models are not new. The mathematical framework - systems that map sequences through a latent continuous-time state - has been studied in signal processing for decades. What changed in 2021-2022 was the discovery that SSMs could be made competitive with Transformers on sequence modeling tasks through careful structural design. Work including S4 (Gu et al., 2021), H3 (Fu et al., 2023), and Hyena (Poli et al., 2023) established that structured SSMs could achieve strong results on long-range dependency benchmarks while maintaining linear complexity.

But all of these models shared a fundamental constraint: they were Linear Time-Invariant (LTI) systems. The state transition matrices A, B, and C were fixed across the entire input sequence - the same transformation applied to every token regardless of content. This design choice is what enables efficient computation via convolutions, but it comes with a hard architectural ceiling.

Specifically, LTI SSMs cannot perform content-based reasoning. They cannot selectively ignore irrelevant context, hold certain information in state indefinitely while forgetting other inputs, or dynamically adjust how information flows based on what the tokens actually contain. On tasks that require this - selective copying, induction heads, long-context recall - LTI SSMs failed almost completely, falling close to random-chance performance. Transformers, by contrast, handle these tasks via attention, which explicitly scores token-to-token compatibility at every step.

Gu and Dao's diagnosis: the core limitation of prior SSMs was not the linear recurrence framework. It was the time-invariance constraint. The solution was to make the parameters input-dependent.

What They Built

The foundation is the continuous-time SSM. Given an input sequence x(t), the model maintains a hidden state h(t) and produces output y(t) through the following system:

h'(t) = A h(t) + B x(t)
y(t) = C h(t)

In prior SSMs, A, B, and C are fixed matrices learned during training. For discrete-time computation on tokens, this continuous system is discretized using zero-order hold (ZOH) with a learnable timescale parameter Delta:

A_bar = exp(Delta * A)
B_bar = (Delta * A)^{-1} (exp(Delta * A) - I) * Delta * B

h_t = A_bar * h_{t-1} + B_bar * x_t
y_t = C * h_t
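As a concrete sketch, here is the ZOH discretization and the resulting LTI recurrence for a diagonal A (the diagonal structure used by S4 and Mamba makes the matrix exponential and inverse elementwise). This is an illustrative NumPy reconstruction of the equations above, not the paper's implementation:

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal state matrix A.

    Returns (A_bar, B_bar) for the recurrence h_t = A_bar * h_{t-1} + B_bar * x_t.
    Elementwise, (Delta*A)^{-1} (exp(Delta*A) - I) * Delta * B simplifies to
    (exp(delta*A) - 1) / A * B, since one factor of Delta cancels.
    """
    A_bar = np.exp(delta * A_diag)            # shape (N,)
    B_bar = (A_bar - 1.0) / A_diag * B        # shape (N,)
    return A_bar, B_bar

def ssm_lti(x, A_diag, B, C, delta):
    """Run the discrete LTI recurrence over a scalar input sequence x (shape (L,))."""
    A_bar, B_bar = discretize_zoh(A_diag, B, delta)
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t           # state update
        ys.append(C @ h)                      # readout
    return np.array(ys)
```

Note that A_bar, B_bar, and C are computed once and reused for every step: that is exactly the time-invariance the next section removes.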

The selective mechanism in Mamba is a direct extension: Delta, B, and C are made functions of the input x_t. These parameters are no longer learned constants - they are outputs of linear projections applied to the current token. This single change transforms the architecture from a fixed filter into a content-aware sequential model:

B_t = Linear_B(x_t)
C_t = Linear_C(x_t)
Delta_t = softplus( Linear_Delta(x_t) + parameter )

h_t = A_bar_t * h_{t-1} + B_bar_t * x_t
y_t = C_t * h_t

Delta functions as a gating mechanism. When Delta_t is large, the discretized A_bar = exp(Delta_t * A) approaches zero (A has negative eigenvalues), so the model discards the previous state and focuses on the current input. When Delta_t is small, A_bar approaches the identity and the state carries forward with minimal update. B and C control which information gets written into and read from the state at each step, conditioned on the current token's content.
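Putting the selective equations together, a single-channel sequential sketch looks like the following. The projection weights (w_B, w_C, w_delta, delta_bias) are illustrative stand-ins, not the paper's exact parameterization; Mamba runs one such scan per channel of a D-dimensional stream:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A_diag, w_B, w_C, w_delta, delta_bias):
    """Sequential selective scan over a scalar input sequence x (shape (L,)).

    A_diag, w_B, w_C: shape (N,); w_delta, delta_bias: scalars.
    Unlike the LTI case, Delta, B, and C are recomputed from each token.
    """
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:
        delta_t = softplus(w_delta * x_t + delta_bias)  # input-dependent timescale
        B_t = w_B * x_t                                 # input-dependent write matrix
        C_t = w_C * x_t                                 # input-dependent read matrix
        A_bar = np.exp(delta_t * A_diag)                # large delta -> A_bar near 0 (reset)
        B_bar = (A_bar - 1.0) / A_diag * B_t            # ZOH, elementwise for diagonal A
        h = A_bar * h + B_bar * x_t
        ys.append(C_t @ h)
    return np.array(ys)
```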

The challenge this creates is computational. Standard SSMs exploit their LTI structure to compute the entire output via a global convolution in O(L log L) time using the Fast Fourier Transform. Making parameters input-dependent breaks this - a content-aware recurrence cannot be expressed as a fixed convolution. Without a workaround, selective SSMs would be forced back into sequential computation: still O(L), but serialized token by token, forfeiting the parallelism that makes modern accelerators fast.

The solution is a hardware-aware parallel scan algorithm. Gu and Dao observe that the recurrence h_t = A_bar_t * h_{t-1} + B_bar_t * x_t is a first-order linear recurrence, and these can be computed in parallel via prefix-sum scan operations. Critically, the bottleneck on GPUs is not computation but memory bandwidth - moving data between HBM (high bandwidth memory) and SRAM (on-chip cache) is the slow step. Their implementation uses kernel fusion to keep all intermediate states and parameters in SRAM during computation, performing discretization and the recurrence entirely without reading from or writing to HBM. The result is a CUDA kernel that achieves the theoretical linear complexity with practical throughput that competes with FlashAttention.
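The scan trick rests on one fact: composing two affine maps h → a·h + b is associative. A toy divide-and-conquer version (pure Python, not the fused CUDA kernel) shows the structure; with h_0 = 0, the b-component of the t-th scanned element is exactly h_t:

```python
import numpy as np

def combine(e1, e2):
    """Compose two affine maps h -> a*h + b, with e1 applied first.

    Associativity of this operator is what permits a parallel prefix scan
    over the recurrence h_t = a_t * h_{t-1} + b_t.
    """
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(elems):
    """Divide-and-conquer inclusive scan.

    The two halves can be scanned independently (in parallel on real
    hardware); the right half is then shifted by the left half's running
    total. Mamba's kernel fuses this pattern with discretization so
    intermediates stay in SRAM.
    """
    if len(elems) == 1:
        return list(elems)
    mid = len(elems) // 2
    left = parallel_scan(elems[:mid])
    right = parallel_scan(elems[mid:])
    return left + [combine(left[-1], e) for e in right]
```

Checking this against the plain sequential recurrence confirms the two produce identical states, which is the correctness argument behind the kernel.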

The complete Mamba block is minimal: a selective SSM layer with input and output projections, paired with a gated MLP-style branch, and wrapped with layer normalization and a residual connection. There is no attention mechanism, no key-value cache, and no MLP sublayer separate from the block itself. Stacking these blocks end-to-end produces the full model.
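A minimal skeleton of the block's dataflow, using RMSNorm as a stand-in for the normalization and omitting the depthwise convolution the real block also applies before the SSM; all weight names here are hypothetical:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def mamba_block(x, W_in, W_out, ssm_fn):
    """One block: norm -> expand -> silu(SSM branch) * silu(gate) -> project -> residual.

    x: (L, D); W_in: (D, 2*D_inner); W_out: (D_inner, D).
    ssm_fn: the selective SSM, mapping (L, D_inner) -> (L, D_inner).
    Hedged sketch - the real block adds a depthwise conv and learned norm scales.
    """
    z = rmsnorm(x)
    u = z @ W_in                              # expand to 2 * D_inner
    u_ssm, gate = np.split(u, 2, axis=-1)     # SSM branch and gate branch
    y = ssm_fn(silu(u_ssm)) * silu(gate)      # gated MLP-style combination
    return x + y @ W_out                      # project back and add residual
```

Stacking calls to `mamba_block` with independent weights is all it takes to build the full model, which is the minimality the paragraph above describes.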

Key Findings

The paper benchmarks Mamba against Transformer baselines at matched parameter counts. All language modeling experiments use The Pile dataset. The baseline series is Pythia (EleutherAI), which uses the same data, tokenizer, and training setup, making comparisons direct.

Hardware specs for language model experiments: 8x A100-80GB GPUs. Mamba models were trained using bf16 precision with no gradient checkpointing required at the sequence lengths tested.

Why This Matters for AI and Automation

The practical impact of Mamba is not just benchmark numbers. It is the argument that Transformer attention is not the only viable path to general-purpose sequence modeling at scale. For practitioners, this opens up concrete options: constant memory per generated token (no key-value cache to grow), linear-time processing of long contexts, and throughput that does not collapse as sequences lengthen.

My Take

The selection mechanism is the paper's genuinely novel contribution. The insight that Delta acts as a continuous gate - controlling whether the model focuses on new input or carries forward existing state - is clean and well-motivated. Prior SSMs were powerful but fundamentally passive: they filtered sequences through fixed learned dynamics. Mamba's dynamics respond to content, which is what makes it competitive on the tasks that previously separated SSMs from Transformers.

The hardware-aware implementation is underrated in discussions of this paper. The theoretical O(L) complexity of selective SSMs had been understood before Mamba - the open question was whether a practical GPU implementation could match Transformer throughput, which had years of optimization work behind it. Gu and Dao's parallel scan kernel in SRAM is what closed that gap. Without it, the paper's benchmark numbers would look very different. This is a recurring pattern in deep learning: algorithmic and systems contributions are inseparable, but the systems work rarely gets proportional credit.

The limitations are real but manageable. On short sequences (under 2K tokens), Transformer attention is still faster and the quality gap is narrow. Mamba's clear advantage emerges at longer context lengths. The paper's language benchmarks also use greedy evaluation and character-level tasks where Mamba's linear scan is particularly well-suited - it is an open question how selective SSMs perform at truly large scale (70B+ parameters) and on multi-step reasoning benchmarks where attention's direct token-to-token routing may still hold structural advantages.

That said, the follow-on literature has moved faster on this than almost any architecture paper in recent memory. RWKV-v6, Griffin, Jamba, and Mamba-2 all appeared within 12 months. That pace is a reliable signal that the core idea here is genuine and extensible. Whether SSMs ultimately displace Transformers or converge with them in hybrid forms, Mamba defined the architecture conversation for 2024.

Discussion question: Mamba-2 (Dao and Gu, 2024) established a formal connection between selective SSMs and a restricted variant of linear attention, suggesting the two frameworks are more similar than the original paper implied. Hybrid architectures like Griffin and Jamba interleave Mamba and attention layers rather than replacing one with the other. Does that convergence suggest that full-attention replacement is the wrong goal - and that the real contribution of Mamba is establishing SSMs as a first-class component in hybrid sequence models rather than a stand-alone Transformer alternative?
