The Paper
"DeepSeek-OCR 2: Visual Causal Flow" was submitted to arXiv on January 28, 2026 (arXiv:2601.20552v1) by Haoran Wei, Yaofeng Sun, and Yukun Li from DeepSeek AI. The paper is licensed under CC BY 4.0 and the model weights are publicly available at github.com/deepseek-ai/DeepSeek-OCR-2. As of publication it is not yet accepted to a conference venue, but sits in the Computer Vision and Pattern Recognition (cs.CV) category.
The core claim: raster-scan token ordering - the default processing order in virtually all vision encoders - is architecturally mismatched with how semantic information is distributed in documents. The paper introduces DeepEncoder V2, which replaces CLIP ViT with a language model used as a vision encoder, equipped with learned causal flow queries that dynamically reorder attention over visual tokens based on image semantics. On OmniDocBench v1.5, the system reaches 91.09% overall performance, a 3.73-percentage-point improvement over DeepSeek-OCR v1, and outperforms Gemini-3 Pro at an equal visual token budget of 1,120.
The Problem Before This Paper
Document OCR is not just a character recognition problem. A production-grade document understanding system needs to handle multi-column layouts, tables with merged cells, embedded formulas, reading order across figure captions and footnotes, and handwritten annotations alongside typeset text. These requirements make the vision encoder - not the language decoder - the critical component.
The dominant architecture for multimodal document understanding encodes images with CLIP ViT (approximately 300M parameters), projects the visual tokens via a learned adapter, and feeds them as a prefix into an LLM decoder. This approach has two structural problems for document OCR specifically.
First, all visual tokens are generated in fixed raster-scan order: left-to-right, top-to-bottom, row by row. This ordering carries no semantic information. A dense table header, a page number in the margin, and a mathematical formula in the body are interleaved in exactly the same way regardless of their semantic relationships or their relevance to the decoding task. CLIP ViT processes these tokens with bidirectional attention, so it can in principle relate any two patches, but the sequence passed to the decoder carries no learned notion of which regions should be attended to first.
Second, prior work - notably BLIP-2 (Li et al., 2023) and its Q-Former - addressed visual token compression by introducing 32 learnable query tokens that cross-attend to the full set of ViT output tokens. This creates an information bottleneck: compressing hundreds of patch tokens into 32 fixed queries inevitably loses fine-grained spatial structure, which is fatal for formula recognition, table cell alignment, and reading order recovery. The Q-Former approach reduces the number of tokens passed to the LLM but does so with no mechanism for preserving layout-critical information.
The paper's diagnosis: the bottleneck is not the number of visual tokens. It is the absence of a mechanism that learns which tokens to surface first, and in what order, relative to the semantic content of the image.
What They Built
The architecture is an encoder-decoder model. The encoder - DeepEncoder V2 - processes the image and produces a compact sequence of causal tokens. The decoder is a 3B-parameter Mixture-of-Experts LLM with approximately 500M active parameters per forward pass, based on DeepSeek-MoE.
Vision Tokenizer. The first stage of the encoder is a convolutional tokenizer built on SAM-base with two additional convolutional layers added on top. Total parameter count is approximately 80M. The tokenizer performs 16x spatial compression and reduces the feature dimensionality from 1024 to 896. The role of this component is analogous to patch projection in ViT: it converts the raw image pixels into a dense grid of patch representations, referred to as visual tokens.
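The geometry implied by these numbers is easy to verify. A minimal sketch (the function name is hypothetical, not from the paper) computes how many visual tokens the tokenizer emits for a given input resolution, assuming the stated 16x spatial compression and the 896-dimensional output features:

```python
def tokenizer_output_shape(width: int, height: int,
                           stride: int = 16, dim: int = 896):
    """Return (num_visual_tokens, feature_dim) for an input image,
    given 16x spatial compression and 896-dim output features."""
    grid_w, grid_h = width // stride, height // stride
    return grid_w * grid_h, dim

print(tokenizer_output_shape(1024, 1024))  # -> (4096, 896)
```

At the 1024x1024 global view this gives a 64x64 grid, i.e. 4,096 visual tokens entering the LLM encoder.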
Language Model as Vision Encoder. The visual tokens produced by the tokenizer are then processed by Qwen2-0.5B (500M parameters), a language model used as the encoder rather than CLIP ViT. This is the key architectural shift. Unlike CLIP ViT - which was trained on image-text contrastive objectives and generates a fixed-size output - Qwen2-0.5B operates over the full variable-length sequence of visual tokens and learns to produce compact causal flow query outputs from them.
The attention mask within this LLM encoder uses a dual-stream design:
Attention mask structure (m visual tokens, n causal query tokens):
[ ones(m x m) | zeros(m x n) ]
[ ones(n x m) | tril(n x n) ]
Visual tokens (the first m positions) use fully bidirectional attention among themselves: every visual token can attend to every other visual token. Causal flow queries (the last n positions) attend to all visual tokens, and attend causally among themselves: each query sees earlier queries and itself, but no future queries. Crucially, visual tokens cannot attend to causal query positions: this prevents information from flowing backward from the queries into the visual token representations, enforcing a clean separation between the two streams.
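The block structure above can be sketched directly. This is a minimal NumPy construction of the dual-stream mask as described (1 = may attend, 0 = masked); the function name is illustrative, not from the paper:

```python
import numpy as np

def dual_stream_mask(m: int, n: int) -> np.ndarray:
    """Encoder attention mask: m visual tokens with bidirectional
    attention among themselves, n causal flow queries with causal
    attention over each other and full attention to the visual tokens.
    Visual tokens never attend to query positions."""
    mask = np.zeros((m + n, m + n), dtype=np.int8)
    mask[:m, :m] = 1                                         # visual -> visual: bidirectional
    mask[m:, :m] = 1                                         # query  -> visual: full
    mask[m:, m:] = np.tril(np.ones((n, n), dtype=np.int8))   # query  -> query: causal
    return mask
```

For example, `dual_stream_mask(4, 3)` leaves the top-right 4x3 block all zeros (visual tokens blind to queries) while the bottom-right 3x3 block is lower triangular.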
Only the last n token outputs - the causal flow queries - are passed to the MoE decoder. The visual tokens are discarded after the encoder pass.
Causal Flow Query Count. The number of queries is determined by the image resolution:
num_queries = (W x H) / (16^2 x 16)
At the global view resolution of 1024x1024, this yields 256 queries. For local crops at 768x768, the formula yields 144 queries per crop. The system processes one global view and up to six local crops, giving a total token range of [256, 1120] visual tokens passed to the decoder. The upper bound of 1,120 is intentionally set to match Gemini-3 Pro's maximum visual token budget, enabling a direct controlled comparison.
Global view (1024x1024): 256 queries
Local crop (768x768): 144 queries / crop
Max crops k: k in [0, 6]
Total token range: [256, 256 + 144*6] = [256, 1120]
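The token-budget arithmetic above can be reproduced directly from the paper's formula. A small sketch (function names are hypothetical):

```python
def num_queries(w: int, h: int) -> int:
    """num_queries = (W x H) / (16^2 x 16), per the paper's formula."""
    return (w * h) // (16**2 * 16)

def total_tokens(num_crops: int) -> int:
    """One 1024x1024 global view plus up to six 768x768 local crops."""
    assert 0 <= num_crops <= 6
    return num_queries(1024, 1024) + num_crops * num_queries(768, 768)

print(num_queries(1024, 1024))  # -> 256
print(num_queries(768, 768))    # -> 144
print(total_tokens(6))          # -> 1120
```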
Training Pipeline. The model is trained in three stages on approximately 100M image-text pairs across 160 A100 GPUs (20 nodes x 8 GPUs each). The training data mixture is 80% OCR-specific content, sampled at a 3:1:1 ratio across text, formulas, and tables.
Stage 1 pretrains the full encoder (vision tokenizer + Qwen2-0.5B) at resolutions 768x768 and 1024x1024, using AdamW with cosine decay from 1e-4 to 1e-6, for 40,000 iterations with 8K sequence packing and a batch size of 640.
Stage 2 jointly optimizes the LLM encoder and the MoE decoder while keeping the vision tokenizer frozen. A 4-stage pipeline parallel setup distributes the vision tokenizer, the LLM encoder, and six decoder layers across 40 data-parallel replicas. Learning rate decays from 5e-5 to 1e-6 over 15,000 iterations with a global batch size of 1,280.
Stage 3 freezes DeepEncoder V2 entirely and continues training only the decoder for 20,000 iterations from 1e-6 to 5e-8. Freezing the encoder at this stage more than doubles training throughput by eliminating the encoder's backward pass.
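The reported learning-rate ranges are consistent with a standard cosine decay schedule. A minimal sketch of that schedule (an assumption matching the stated endpoints, not the authors' exact implementation):

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float, lr_min: float) -> float:
    """Cosine decay from lr_max (step 0) to lr_min (final step)."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Stage 1: 1e-4 -> 1e-6 over 40,000 iterations
print(cosine_lr(0, 40_000, 1e-4, 1e-6))       # starts at lr_max
print(cosine_lr(40_000, 40_000, 1e-4, 1e-6))  # ends at lr_min
```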
Key Findings
- Semantic token ordering improves formula and reading order recovery. FormulaCDM improves by 6.17 percentage points (84.14% to 90.31%) and R-orderEdit drops from 0.085 to 0.057 - a 33% reduction in reading order error. These are the two metrics most sensitive to the ordering of visual attention.
- LLM encoders outperform CLIP at matched parameter budgets. Replacing CLIP ViT (300M params) with Qwen2-0.5B (500M params) delivers consistent gains across all nine document categories on OmniDocBench v1.5, with the most pronounced gains in formula-dense and layout-complex categories.
- Causal query compression avoids the Q-Former bottleneck. Rather than reducing visual information to a fixed number of learned query slots (as in BLIP-2), causal flow queries are resolution-adaptive. At higher resolutions, more queries are allocated, preserving spatial granularity proportional to document complexity.
- Cascaded 1D causal reasoners approximate 2D spatial understanding. The paper frames DeepEncoder V2 as two cascaded causal reasoners - the LLM encoder and the LLM decoder - operating on a linearized 2D image. This is presented as a tractable path toward 2D layout reasoning without requiring native 2D positional encodings or autoregressive 2D decoders.
- Production repetition rate drops meaningfully. Online user logs show the rate falling from 6.25% to 4.17% - a failure mode that is difficult to isolate in offline benchmarks but directly attributable to poor layout encoding in the predecessor.
Results
All results are evaluated on OmniDocBench v1.5, which covers nine document categories: PPT, Academic Paper, Book, Colorful Textbook, Exam Paper, Magazine, Newspaper, Note, and Research Report. Metrics include TextEdit distance, FormulaEdit distance, FormulaCDM (formula structure recognition), TableTEDs strict and loose, and R-orderEdit (reading order).
- Overall OmniDocBench v1.5: 91.09% (vs 87.36% for DeepSeek-OCR v1, +3.73 pp)
- TextEdit: 0.048 (vs 0.073, -34% error)
- FormulaEdit: 0.198 (vs 0.236, -16% error)
- FormulaCDM: 90.31% (vs 84.14%, +6.17 pp)
- TableTEDs strict: 87.75% (vs 85.25%, +2.50 pp)
- TableTEDs loose: 92.06% (vs 89.01%, +3.05 pp)
- R-orderEdit: 0.057 (vs 0.085, -33% error)
At the equal visual token budget of 1,120 tokens, DeepSeek-OCR 2 achieves an overall edit distance of 0.100 versus Gemini-3 Pro's 0.115 - a 13% relative reduction in transcription error.
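The relative error reductions quoted in this section follow directly from the reported edit distances. A one-line helper (name hypothetical) reproduces them:

```python
def rel_reduction(old: float, new: float) -> float:
    """Relative reduction in an error metric: (old - new) / old."""
    return (old - new) / old

print(round(rel_reduction(0.073, 0.048), 2))  # TextEdit:     0.34
print(round(rel_reduction(0.085, 0.057), 2))  # R-orderEdit:  0.33
print(round(rel_reduction(0.115, 0.100), 2))  # vs Gemini-3:  0.13
```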
The weakest category across both models is Newspaper, with a text edit distance of 0.139 for DeepSeek-OCR 2. The paper attributes this to the training data distribution: only approximately 250k newspaper-format samples are present in the OCR 2.0 dataset, compared to substantially higher counts for academic and exam paper categories.
On production traffic, the repetition rate - measured on online user logs and on a separate PDF pretraining data evaluation set - drops from 6.25% to 4.17% (online) and from 3.69% to 2.88% (PDF pretraining data). These figures are significant because repetition failures are a direct user-facing quality issue that offline benchmarks underweight.
Training hardware: 160 A100-40GB GPUs across all three stages. No gradient checkpointing is reported. The three-stage pipeline reaches full convergence at 75,000 total iterations across stages.
Why This Matters for AI and Automation
Document understanding is a prerequisite for almost every enterprise automation pipeline that touches unstructured data. Contracts, invoices, research papers, regulatory filings, and technical manuals all require OCR as a first step before any downstream reasoning can occur. The quality ceiling on that OCR step bounds the quality of everything downstream.
- Formula-heavy document automation. FormulaCDM improving from 84.14% to 90.31% is directly relevant to any pipeline processing academic papers, patent filings, or technical specifications. At sub-90% formula recognition, downstream reasoning over quantitative claims is unreliable. DeepSeek-OCR 2 pushes past that threshold.
- Reading order recovery for multi-column layouts. A 33% reduction in R-orderEdit means fewer pipeline failures when processing magazine-style layouts, legal documents with side annotations, or academic papers with footnotes and margin figures. This is the step that most OCR systems handle worst, and the one that most directly breaks RAG pipelines operating on PDFs.
- Competitive performance at controlled token cost. The 1,120-token budget comparison with Gemini-3 Pro establishes that a specialized, open-weights document model outperforms a frontier multimodal model on this task class - at a token count that is directly controllable in production serving configurations.
- LLM-as-encoder pattern is generalizable. The core design - using a pretrained language model as the vision encoder rather than a contrastively trained ViT - is not OCR-specific. The paper's discussion section explicitly identifies this as a direction toward a unified omni-modal encoder, where modality-specific query embeddings differentiate input streams through shared weight matrices. This is a meaningful architectural signal for teams building multimodal systems beyond document understanding.
- Open weights reduce vendor dependency. With model weights at github.com/deepseek-ai/DeepSeek-OCR-2, teams can self-host inference and fine-tune on domain-specific document types - a significant operational advantage over API-only alternatives for document-intensive applications.
- Repetition rate as a production metric. The paper's inclusion of production repetition rates alongside benchmark scores is methodologically noteworthy. It reflects a commitment to measuring failure modes that matter in deployment, not just those that are easy to quantify offline. More research papers should track and report this class of metric.
My Take
The most interesting decision in this paper is the choice to use a language model as the vision encoder rather than CLIP. On the surface this sounds expensive - Qwen2-0.5B is larger than CLIP ViT-L/14 and was not designed for visual feature extraction. But the paper's argument is structural: CLIP was trained to align global image representations with text captions, which produces good semantic embeddings but poor spatial encodings. A language model trained to predict next tokens from sequences is, by construction, trained to encode positional and relational structure in sequences. Applying that to visual token sequences - even ones that were not part of language pretraining - turns out to transfer better to document layout understanding than CLIP's contrastive representations do.
The causal flow query mechanism is elegant but the paper does not provide strong ablations isolating its contribution from the LLM encoder substitution itself. Both changes happen together in DeepEncoder V2, and the benchmark numbers do not decompose how much of the 3.73% overall improvement comes from the encoder replacement versus the causal ordering of the query outputs. This is a gap in the analysis that would be worth closing in follow-on work.
The Newspaper category gap - text edit distance of 0.139 even after all improvements - is an honest concession. Newspaper layouts involve non-standard column widths, narrow gutters, mixed fonts, and irregular article boundaries. 250k training samples is not enough to learn these patterns well, and the paper's attribution of this to data distribution rather than architectural limitation seems correct. It also highlights a recurring issue in document AI: the categories where the architecture works worst are often the ones that appear in the highest volume in real enterprise deployments.
The discussion of cascading two 1D causal reasoners as a path to 2D reasoning is speculative but worth taking seriously. The argument is that documents have been designed by humans to be read linearly - reading order is not arbitrary but reflects the author's intended information flow. A model that learns to sequence visual attention in a causally consistent manner is therefore recovering something real about the structure of the document, not just performing a compression trick. Whether this generalizes to non-document visual understanding tasks is the genuinely open question.
Discussion question: DeepSeek-OCR 2 replaces CLIP ViT with Qwen2-0.5B as the vision encoder and gets better document OCR. But CLIP was trained on image-text alignment across general visual domains, while Qwen2-0.5B was trained entirely on text. Does the performance gain reflect a genuine advantage of causal LLM encoders for structured document understanding, or does it primarily reflect the larger parameter count and the fact that document layouts share more structural properties with text sequences than with natural image distributions - and if so, what happens to this advantage when the input is a natural scene rather than a typeset document?