Week 02 · March 2026

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

March 15, 2026 · by Satish K C · 13 min read
LLMs · Quantization · Efficiency

The Paper

"The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" was published on February 27, 2024 by a ten-person team at Microsoft Research: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The paper sits at the intersection of model quantization, hardware efficiency, and large-scale language modeling - and its central claim is striking: a large language model where every single weight is restricted to one of three values, {-1, 0, +1}, can match a full-precision (FP16) model of the same size on both perplexity and downstream task performance, while delivering dramatic improvements in latency, memory, throughput, and energy consumption.

The name comes from information theory: representing three possible values per weight requires log2(3) bits - approximately 1.58, the theoretical minimum for a ternary code. The paper argues this is not just a compression trick but a fundamentally new and more efficient way to design and deploy LLMs at scale.
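As a quick sanity check on that figure (a hypothetical one-liner of my own, not from the paper):

```python
import math

# Bits needed to distinguish three weight values {-1, 0, +1}
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.4f}")  # 1.5850
```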

Read the Paper on arXiv →

The Problem Before This Paper

Modern large language models are extraordinarily expensive to serve. A 70-billion parameter model stored in FP16 (16-bit floating point) requires roughly 140 GB of GPU memory just to hold the weights - before accounting for the KV cache, activations, or optimizer states during inference. Deploying such a model commercially requires multiple high-end GPUs running continuously, with energy costs that scale directly with the number of floating-point multiply-accumulate (MAC) operations performed per token.
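The arithmetic behind that 140 GB figure is simple bits-per-weight accounting; a small illustrative sketch (the precision list and formatting are my own, not from the paper):

```python
params = 70e9  # parameter count of a 70B model

# Weight-only memory in GB at various precisions (bits per weight);
# ignores KV cache, activations, and runtime overhead.
footprint_gb = {bits: params * bits / 8 / 1e9
                for bits in (16, 8, 4, 1.58)}

for bits, gb in footprint_gb.items():
    print(f"{bits:>5} bits: {gb:6.1f} GB")
# 16 bits -> 140.0 GB, matching the FP16 figure above
```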

Post-training quantization (PTQ) had emerged as a partial solution. Methods like GPTQ, AWQ, and LLM.int8() compress already-trained models from FP16 down to 8-bit or 4-bit integers, reducing memory footprint significantly. But PTQ carries a fundamental trade-off: because the model was never trained under quantization constraints, accuracy degrades as bit width shrinks, and the degradation becomes severe well before reaching ternary or binary precision.

The key insight from Ma et al. was to approach this differently: instead of training a full-precision model and then quantizing it, train the model natively in low precision from scratch. This is called Quantization-Aware Training (QAT). The model learns to represent knowledge within the constraints of ternary weights, rather than having those constraints imposed after the fact.

What They Built

BitNet b1.58 is a Transformer architecture where every weight matrix in every linear layer is quantized to ternary values {-1, 0, +1} during training. Activations are quantized to 8-bit integers. The architecture is designed to be a drop-in replacement for standard LLaMA-style models, adopting RMSNorm, SwiGLU activations, Rotary Position Embeddings (RoPE), and no bias terms - the same design choices used in most modern open-weight LLMs.

The core quantization mechanism is the absmean quantization function. For a weight matrix W of shape n x m, the quantized version is computed as:

W̃ = RoundClip( W / (γ + ε), -1, 1 )
where γ = (1/nm) Σ |Wij|

Each weight matrix is scaled by its mean absolute value (γ), then each element is rounded to the nearest integer and clipped to the range [-1, 1]. The result is that every weight becomes exactly -1, 0, or +1. The ε term prevents division by zero. Crucially, this quantization happens at training time using straight-through estimators for the gradient, allowing the model to learn weight values that are useful in their quantized form.
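A minimal NumPy sketch of the absmean forward pass, under my reading of the formula above (the paper trains with straight-through estimators for the backward pass, which this inference-only sketch omits):

```python
import numpy as np

def absmean_quantize(W: np.ndarray, eps: float = 1e-6):
    """Ternarize W: scale by its mean absolute value (gamma),
    then round and clip every entry into {-1, 0, +1}."""
    gamma = np.mean(np.abs(W))                       # γ = (1/nm) Σ |W_ij|
    W_q = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_q, gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
W_q, gamma = absmean_quantize(W)
print(np.unique(W_q))  # only values from {-1, 0, +1}
```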

Activations are handled separately. Input activations are quantized to 8-bit integers per token, scaled to the range [-Qb, Qb], where Qb = 2^(b-1) = 128 for b = 8 bits:

x̃ = Clip( x * Qb / γx, -Qb, Qb )
where γx = max(|x|)
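A matching sketch for the per-token activation quantization (my own illustrative code; I clip the upper bound to Qb - 1 so every value fits in a signed 8-bit integer, a practical detail not spelled out in the formula above):

```python
import numpy as np

def quantize_activations(x: np.ndarray, bits: int = 8, eps: float = 1e-6):
    """Per-token absmax quantization to signed integers.
    Each row (token) is scaled by its own gamma = max |x|."""
    Qb = 2 ** (bits - 1)                                  # 128 for 8 bits
    gamma = np.max(np.abs(x), axis=-1, keepdims=True)     # γ_x per token
    x_q = np.clip(np.round(x * Qb / (gamma + eps)), -Qb, Qb - 1)
    return x_q.astype(np.int32), gamma

x = np.array([[0.5, -2.0, 1.0], [0.1, 0.2, -0.05]])
x_q, scale = quantize_activations(x)
print(x_q)
```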

This per-token activation scaling, combined with ternary weights, means that the dominant computation in each linear layer - the matrix multiply - reduces to additions and subtractions only. There are no floating-point multiplications in the weight computation. On custom hardware designed for integer arithmetic, this is where the energy and latency gains materialize most dramatically.
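To make the "additions and subtractions only" point concrete, here is a deliberately naive sketch (not the paper's kernel, and not fast in Python - the real gains require integer-arithmetic hardware):

```python
import numpy as np

def ternary_matvec(W_q: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with ternary weights using add/subtract only:
    per output row, add inputs where w = +1, subtract where w = -1,
    skip where w = 0. No multiplications by weight values occur."""
    out = np.empty(W_q.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_q):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W_q = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([3.0, 5.0, 2.0])
print(ternary_matvec(W_q, x))  # [1. 7.] - same as W_q @ x
```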

The paper also highlights a structural advantage unique to ternary (vs. binary) quantization: the presence of zero as a valid weight value. A zero weight means the corresponding input feature is completely ignored by that connection. This gives the model an explicit feature filtering mechanism - learned sparsity built directly into the weight representation. Binary {-1, +1} models lack this and must encode "ignore this feature" indirectly through cancellation, which is less expressive.

Key Findings

The authors evaluated BitNet b1.58 against LLaMA, the standard full-precision baseline, across model sizes from 700M to 3.9B parameters, trained on 100B tokens.

Perplexity (WikiText2, 100B tokens): BitNet b1.58 crosses over at 3B - 9.91 vs. LLaMA 3B's 10.04 - and the 3.9B model improves further to 9.62, outperforming LLaMA 3B with less memory and lower latency.

Zero-shot accuracy (ARC-Easy, ARC-Challenge, HellaSwag, and others at 3B): BitNet b1.58 achieves an average of 50.2% vs. LLaMA's 49.7% - matching and marginally exceeding full precision.

Latency: the 3B BitNet b1.58 is 2.71x faster than LLaMA 3B, and the advantage grows with scale, reaching 4.1x for the 70B model.

Memory: 3.55x less GPU memory than LLaMA at 3B, widening to 7.16x at 70B.

Energy (matrix multiplication on 7nm chips): BitNet b1.58 consumes 71.4x less energy than FP16 for matrix multiplication operations, the dominant cost in transformer inference.

Throughput (70B model on two A100 GPUs): BitNet b1.58 supports an 11x larger batch size than LLaMA 70B, translating into 8.9x higher token throughput.

At longer training horizons, the results hold. Trained to 2 trillion tokens following StableLM-3B's data recipe, BitNet b1.58 achieves 74.34% average accuracy across Winogrande, PIQA, SciQ, LAMBADA, and ARC-easy, compared to StableLM-3B's 73.22% - matching or exceeding full-precision models even at scale.

Why This Matters for AI and Automation

The implications of BitNet b1.58 are not incremental. If ternary-weight LLMs reach production at scale, they reshape the economics of AI deployment across the entire stack.

My Take

What strikes me most about this paper is how cleanly it separates two things that the field had been conflating: representational capacity and numerical precision. The conventional assumption was that more bits per weight meant a more capable model - that FP16 was better than INT8, and INT8 was better than INT4, and so on down the line. BitNet b1.58 challenges this directly. It does not compress a high-precision model; it trains a model that never needs high precision in the first place.

The introduction of zero as a weight value is more significant than it might initially appear. In a binary {-1, +1} network, every weight contributes to every computation - there is no way to say "ignore this input feature." In BitNet b1.58, the model can learn to set weights to zero and effectively prune connections during training. This is not post-hoc sparsification; it is learned sparsity built into the quantization scheme itself.

The main open question is whether these results hold at frontier scale. The paper's largest experiments top out at 3.9B parameters with 100B training tokens - orders of magnitude smaller than the models currently at the frontier (GPT-4, Claude 3, Gemini Ultra). Whether ternary quantization maintains parity at 100B+ parameters with trillion-token training runs is the critical unknown. The authors suggest it should based on the scaling trends they observe, but it has not been demonstrated empirically at that scale yet.

The second constraint is the hardware gap. The energy and throughput numbers that make this paper exciting are largely theoretical at this point - they assume hardware designed for ternary arithmetic that does not yet exist in production. The 71.4x energy figure is a projection based on operation counts on 7nm chips, not a benchmark on deployed silicon. Realizing the full gains requires either the major chip manufacturers or specialized AI hardware startups to build for this paradigm. That investment requires confidence that the model quality holds at scale - a chicken-and-egg problem.

Even with those caveats, this paper is one of the more important efficiency results of the past few years. It proves the concept at a meaningful scale, with reproducible benchmarks, and open-sources the comparison against a well-understood baseline (LLaMA). Whether it becomes the dominant paradigm depends on what happens at 70B+ parameter scales and what the hardware ecosystem builds toward.

Discussion question: BitNet b1.58 demonstrates compelling efficiency gains at the 3B scale, but the frontier sits at 100B+ parameters with multimodal capabilities. Do you think ternary quantization will hold up at that scale, or is there a fundamental trade-off between extreme weight compression and the kind of nuanced reasoning that frontier models need? And which comes first - the model results at scale, or the custom hardware investment to make it worthwhile?
