The Big Idea
IBM Research released Granite 4.0 3B Vision on March 27, 2026 - a 3B-parameter vision-language model built for one specific job: pulling structured data out of enterprise documents. Not general visual QA. Not image captioning. Extraction: charts to CSV, tables to JSON, invoices to key-value pairs.
The model sits on Granite 4.0 Micro (a 3B LLM), uses rank-256 LoRA adapters across all attention and MLP layers, and introduces a deepstack architecture that injects visual features into eight different LLM layer depths. It is Apache 2.0 licensed, vLLM-deployable, and integrates directly with IBM's open-source Docling document pipeline.
What makes it immediately relevant for automation practitioners is the interface: task routing via prompt tags. Send <chart2csv> with a chart image and get structured CSV back. Send <tables_json> with a document scan and get a JSON object. The extraction pipeline collapses from a multi-step OCR-plus-parser chain into a single API call.
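To make "a single API call" concrete, here is a minimal sketch of packaging one image plus a task tag as an OpenAI-style chat request. The model identifier is an assumed placeholder, not from IBM's documentation; only the tag convention comes from the text above.

```python
import base64
import json

MODEL_ID = "ibm-granite/granite-4.0-3b-vision"  # assumed repo id, for illustration

def build_extraction_request(task_tag: str, image_bytes: bytes) -> dict:
    """Package one image plus a task tag as a chat-completions payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": task_tag},
            ],
        }],
        "temperature": 0.0,  # extraction wants deterministic output
    }

payload = build_extraction_request("<chart2csv>", b"\x89PNG...")
print(json.dumps(payload["messages"][0]["content"][1]))
```

The same payload shape works for every task: swap the tag, keep the plumbing.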
Before vs After
The problem it solves is one that anyone who has built document processing automation will recognize immediately. The old approach was fragile from the start.
Traditional OCR Pipeline
- OCR engine extracts raw text (Tesseract, Textract, Form Recognizer)
- Custom regex and heuristics parse field positions
- Separate parser per document layout and version
- Charts are completely unreadable - treated as images
- Complex nested tables break silently with wrong cell alignment
- Manual QA pass required on every batch
- Parser maintenance cost grows as document formats change
Granite 4.0 3B Vision
- Single model handles charts, tables, and key-value pairs
- Tag-based task routing - no prompt engineering required
- Structured output (CSV, JSON, HTML) returned directly
- Charts converted to executable Python code or data tables
- Table structure preserved via TEDS-optimized training
- 85.5% zero-shot KVP accuracy on VAREX benchmark
- Apache 2.0 - no licensing restrictions for commercial use
How It Works
The architecture has four distinct components that work in sequence. Understanding each one explains why IBM made specific trade-off decisions.
Vision Encoder - SigLIP2. Input images are tiled into 384x384 patches, each encoded independently by the SigLIP2 vision encoder (google/siglip2-so400m-patch16-384). Tiling preserves resolution across large documents without forcing a single global resize that would destroy fine-grained text detail in tables or small chart labels.
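Back-of-the-envelope arithmetic shows why tiling matters for document scans. This sketch assumes square 384x384 tiles with simple ceiling division; the actual tiler may pad or resize at the edges.

```python
import math

TILE = 384  # tile edge length, from the encoder's input resolution

def tile_grid(width: int, height: int) -> tuple[int, int]:
    """Number of tiles along each axis for an input image."""
    return math.ceil(width / TILE), math.ceil(height / TILE)

# A letter-size page scanned at 150 dpi is roughly 1275x1650 pixels:
cols, rows = tile_grid(1275, 1650)
print(cols, rows, cols * rows)  # 4 5 20 -> a 4x5 grid of 20 tiles
```

Each tile keeps its native resolution, so a 6-point axis label survives encoding instead of being averaged away by a global downscale.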
Window Q-Former Projectors. After encoding, each 4x4 patch window is compressed to 2x2 tokens via cross-attention. This 4x reduction cuts the visual token budget fed to the LLM from an otherwise prohibitive count to something tractable without losing structural information. The window-based design preserves local spatial relationships - critical for table cell alignment.
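The token budget works out as follows, using only numbers stated above: SigLIP2 at 384px with 16px patches (the `patch16-384` in the encoder name) yields a 24x24 patch grid per tile, and each 4x4 window compresses to 2x2 tokens.

```python
# Visual token budget per tile, before and after the window Q-Former.
PATCHES_PER_SIDE = 384 // 16                 # 24 patches per side
TOKENS_PER_TILE_RAW = PATCHES_PER_SIDE ** 2  # 576 visual tokens per tile

WINDOWS_PER_SIDE = PATCHES_PER_SIDE // 4           # 6 windows of 4x4 patches
TOKENS_PER_TILE = WINDOWS_PER_SIDE ** 2 * (2 * 2)  # 6*6 windows * 4 tokens each

print(TOKENS_PER_TILE_RAW, TOKENS_PER_TILE)  # 576 144
# A 20-tile page drops from 11,520 to 2,880 visual tokens:
print(20 * TOKENS_PER_TILE_RAW, 20 * TOKENS_PER_TILE)
```

At 576 tokens per tile, a multi-tile page would swamp a 3B LLM's context; at 144, it stays tractable.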
LayerDeepstack and SpatialDeepstack. Most VLMs inject visual features once, at the input boundary of the LLM. Granite 4.0 3B Vision injects them at 8 distinct points:
- LayerDeepstack (4 points): features from 4 encoder depths projected to 4 different LLM layers. Deepest (most semantically abstract) features feed the earliest LLM layers, giving the model global semantic context before text generation begins.
- SpatialDeepstack (4 points): deepest encoder features split into 4 spatial groups, each injected at a separate later LLM layer. This preserves spatial detail - column positions, row boundaries, axis labels - where single-injection approaches lose it.
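The injection schedule can be sketched as a mapping from feature sources to LLM layers. The specific encoder depths and layer indices below are illustrative assumptions; only the 4 + 4 structure and the early-semantic / late-spatial ordering come from the text.

```python
N_LLM_LAYERS = 32  # assumed depth of the 3B base model, for illustration

# LayerDeepstack: 4 encoder depths -> 4 early LLM layers, deepest first.
layer_deepstack = {
    "encoder_depth_27": 0,  # most abstract features, earliest LLM layer
    "encoder_depth_20": 2,
    "encoder_depth_13": 4,
    "encoder_depth_6": 6,
}

# SpatialDeepstack: deepest encoder features split into 4 spatial groups,
# each injected at a separate later LLM layer.
spatial_deepstack = {f"spatial_group_{g}": 16 + 4 * g for g in range(4)}

schedule = {**layer_deepstack, **spatial_deepstack}
assert len(schedule) == 8  # eight distinct injection points in total
```

The pattern to notice: semantic context lands before generation ramps up, spatial detail lands where the model is resolving layout-specific ambiguity.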
LoRA on Granite 4.0 Micro. Rank-256 LoRA adapters are applied across all self-attention projections and MLP layers of the 3B base model. This enables a useful deployment trick: in the native LoRA runtime mode (vLLM), the base model handles text-only requests without any vision overhead. The adapter loads only when an image is present. One deployment serves both workloads.
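For a sense of the adapter's size, a rank-r LoRA on a weight of shape d_out x d_in adds two factor matrices, A (r x d_in) and B (d_out x r). The hidden size below is an assumed placeholder; only rank 256 comes from the text.

```python
RANK = 256      # from the model description
HIDDEN = 2560   # assumed hidden size of the 3B base, for illustration

def lora_params(d_out: int, d_in: int, r: int = RANK) -> int:
    """A rank-r adapter adds r*(d_in + d_out) weights per adapted matrix."""
    return r * (d_in + d_out)

# One square attention projection (e.g. a q_proj of HIDDEN x HIDDEN):
per_proj = lora_params(HIDDEN, HIDDEN)
print(f"{per_proj:,} adapter params per {HIDDEN}x{HIDDEN} projection")
```

Because the adapter is a separate, swappable set of weights, the vLLM runtime can skip it entirely for text-only requests - which is exactly the dual-workload deployment trick described above.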
Tag-based task routing. The model was fine-tuned to associate specific prompt tags with specific output formats. Send <chart2csv> and it outputs a CSV. Send <tables_json> and it outputs a JSON object with table dimensions and cell contents. For KVP extraction, you pass a JSON schema in the prompt and the model returns structured JSON matching those fields, returning null for any field it cannot locate.
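A sketch of the KVP round trip under those rules. The tag name `<kvp>` and the schema wire format here are assumptions; the null-for-missing-fields convention is from the text.

```python
import json

def build_kvp_prompt(fields: list[str]) -> str:
    """Embed a requested-field schema in the prompt (format assumed)."""
    schema = {f: "string or null" for f in fields}
    return "<kvp>\n" + json.dumps(schema)

def normalize_kvp(raw: str, fields: list[str]) -> dict:
    """Parse the model's reply and ensure every requested field is present."""
    data = json.loads(raw)
    return {f: data.get(f) for f in fields}  # missing -> None, like null

fields = ["invoice_number", "total_amount", "due_date"]
reply = '{"invoice_number": "INV-1042", "total_amount": "$1,870.00"}'
print(normalize_kvp(reply, fields))
# {'invoice_number': 'INV-1042', 'total_amount': '$1,870.00', 'due_date': None}
```

The caller always gets every requested key back, so downstream code never branches on field absence - only on null.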
Key Findings
- KVP extraction (VAREX benchmark): 85.5% exact-match accuracy zero-shot, ranking 3rd among 2-4B parameter models as of March 2026. Competitive with models twice the size for structured field extraction.
- Chart extraction (ChartNet test set, GPT-4o as judge): outperforms competitive small VLMs on both chart2csv and chart2summary tasks. The chart2code output produces executable Python that recreates the original visualization.
- Table extraction (TEDS metric): strong performance on cropped table (TableVQA-Extract, OmniDocBench-tables, PubTablesV2) and full-page document settings. TEDS measures tree-edit distance between predicted and ground-truth table structure - a strict metric that penalizes cell merges and column misalignment.
- Single model, 7 tasks: chart2csv, chart2code, chart2summary, tables_json, tables_html, tables_otsl, KVP - all routed by prompt tag, no task-specific model per pipeline.
- LoRA merge option: calling model.merge_lora_adapters() permanently fuses the adapter weights for faster inference in dedicated vision workloads.
- Training infrastructure: IBM Blue Vela supercomputing cluster, 32 NVIDIA H100 GPUs, approximately 200 hours. Released March 27, 2026 under Apache 2.0.
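The single-model, seven-task design maps naturally onto a small client-side router. This is a simplified sketch of dispatching each tag's raw output to an appropriate parser; the `<kvp>` tag name and the pass-through parsers for code, summary, HTML, and OTSL are assumptions.

```python
import csv
import io
import json

PARSERS = {
    "<chart2csv>": lambda s: list(csv.reader(io.StringIO(s))),
    "<chart2code>": lambda s: s,     # Python source, returned as-is
    "<chart2summary>": lambda s: s,  # free-text summary
    "<tables_json>": json.loads,
    "<tables_html>": lambda s: s,    # HTML table markup
    "<tables_otsl>": lambda s: s,    # OTSL table tokens
    "<kvp>": json.loads,             # tag name assumed
}

def parse_output(tag: str, raw: str):
    """Route raw model output to the parser for that task tag."""
    if tag not in PARSERS:
        raise ValueError(f"unknown task tag: {tag}")
    return PARSERS[tag](raw)

rows = parse_output("<chart2csv>", "year,revenue\n2024,1.2\n2025,1.9")
print(rows)  # [['year', 'revenue'], ['2024', '1.2'], ['2025', '1.9']]
```

Because each tag commits to one output format, the parser table is closed and small - the opposite of parsing free-form VLM responses.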
Why This Matters for AI and Automation Practitioners
Document extraction sits at the front of a surprisingly large percentage of enterprise automation pipelines. Invoices, contracts, financial reports, regulatory filings, research PDFs - anything that arrives as a scanned image or a visually complex PDF has historically required either expensive cloud OCR services or brittle custom parsers.
For automation practitioners specifically, here is what is actionable:
- Docling integration: IBM's open-source Docling pipeline (github.com/DS4SD/docling) has native support for Granite 4.0 3B Vision. If you are already using Docling for PDF processing, this is a near-drop-in upgrade for structured extraction tasks.
- vLLM-compatible: the model runs on vLLM with OpenAI-compatible API - meaning it fits into any existing LLM infrastructure that already routes to vLLM endpoints. No separate serving stack required.
- Apache 2.0 licensing: unlike Azure Form Recognizer or AWS Textract, which charge per page, you own the deployment. High-volume document pipelines see the cost difference at scale.
- English-only limitation: the model is trained on English documents. Multilingual enterprise pipelines (contracts in Spanish, German, Japanese) are not covered and will need a different approach.
- Output validation still required: IBM's own documentation recommends pairing with Granite Guardian 3.2-5B for risk detection. At 85.5% accuracy, 1 in 7 KVP extractions may be wrong. For high-stakes documents (legal, financial), a validation layer is not optional.
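The validation-layer point above can be made concrete with a small sketch: auto-accept an extraction only when every required field is present and passes a format check, and route everything else to human review. The field names and regex patterns are hypothetical.

```python
import re

REQUIRED = {
    "invoice_number": re.compile(r"^INV-\d+$"),
    "total_amount": re.compile(r"^\$[\d,]+\.\d{2}$"),
}

def needs_review(extraction: dict) -> list[str]:
    """Return the fields that failed validation (empty list = auto-accept)."""
    failures = []
    for field, pattern in REQUIRED.items():
        value = extraction.get(field)
        if not isinstance(value, str) or not pattern.match(value):
            failures.append(field)
    return failures

good = {"invoice_number": "INV-1042", "total_amount": "$1,870.00"}
bad = {"invoice_number": "INV-1042", "total_amount": None}
print(needs_review(good), needs_review(bad))  # [] ['total_amount']
```

At roughly 85% field accuracy, a check like this turns a silent error rate into an explicit review queue - cheap insurance for legal or financial documents.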
My Take
The narrow specialization is both the model's strength and its honest positioning. IBM is not trying to compete with GPT-4o on visual reasoning or Gemini on multimodal QA. They are building a reliable extraction component that fits into document processing workflows the way a specialized tool should: predictably, with structured outputs, at low inference cost.
The tag-based interface is smart design. It makes the model's capabilities explicit at call time, forces the caller to declare intent, and avoids the ambiguous open-ended prompting that makes general VLMs unreliable for tasks that need deterministic structure. You cannot accidentally get a narrative response when you asked for CSV.
What I find most technically interesting is the deepstack injection architecture. The standard approach - inject visual features once at the LLM input boundary - treats all visual information as equally relevant at all processing depths. Injecting at 8 different depths, mixing semantic context early with spatial detail late, is a meaningful departure. The intuition is sound: global understanding should establish context first, and fine-grained layout detail should resolve ambiguity later. Whether this advantage generalizes beyond document tasks is worth watching.
The English-only constraint is real and will limit adoption in global enterprise settings. It is also understandable: the extraction-focused training dataset simply does not cover multilingual documents at this stage. That is a gap IBM will need to close in future releases to compete with commercial alternatives that already support 50-plus languages.
For practitioners deciding whether to adopt this now: if your pipeline processes English documents at volume and you are currently paying per-page for cloud OCR or maintaining custom parsers, this is worth a serious evaluation. If your documents are multilingual or require high-stakes accuracy beyond 90%, you need a validation layer on top or a different model.
Discussion question: Granite 4.0 3B Vision uses tag-based task routing rather than free-form instruction prompting - a design that trades flexibility for predictability. Does that constraint make it more or less useful than a general-purpose VLM like Qwen2.5-VL-3B in a production document automation pipeline, and at what document volume does a specialized extraction model start paying off versus maintaining a general one?