Article 05 · April 2026

IBM Granite 4.0 3B Vision: The Compact VLM Built for Document Extraction

April 1, 2026 · by Satish K C · 11 min read
Tags: Agents · Automation · LLMs · Vision AI

The Big Idea

IBM Research released Granite 4.0 3B Vision on March 27, 2026 - a vision-language model of roughly 4B total parameters (a 3B language model plus vision components) built for one specific job: pulling structured data out of enterprise documents. Not general visual QA. Not image captioning. Extraction: charts to CSV, tables to JSON, invoices to key-value pairs.

The model sits on Granite 4.0 Micro (a 3B LLM), uses rank-256 LoRA adapters across all attention and MLP layers, and introduces a deepstack architecture that injects visual features into eight different LLM layer depths. It is Apache 2.0 licensed, vLLM-deployable, and integrates directly with IBM's open-source Docling document pipeline.

What makes it immediately relevant for automation practitioners is the interface: task routing via prompt tags. Send <chart2csv> with a chart image and get a structured CSV back. Send <tables_json> with a document scan and get a JSON object. The extraction pipeline collapses from a multi-step OCR-plus-parser chain into a single API call.
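To make the single-call interface concrete, here is a minimal sketch of building such a request. It assumes an OpenAI-compatible chat endpoint (as vLLM exposes) and an assumed model id - check the model card for the actual chat template before relying on this shape.

```python
import base64

# Hypothetical helper: builds an OpenAI-style chat payload that routes a
# document image to one extraction task via a prompt tag. The model id,
# and the exact message shape are assumptions, not confirmed API details.
def build_extraction_request(image_bytes: bytes, tag: str) -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "ibm-granite/granite-4.0-3b-vision",  # assumed model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": tag},  # e.g. "<chart2csv>"
            ],
        }],
        "temperature": 0.0,  # extraction wants deterministic output
    }

req = build_extraction_request(b"\x89PNG...", "<chart2csv>")
```

The whole pipeline is this one payload: no OCR pass, no layout-specific parser, just an image plus a task tag.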

Key distinction: Granite 4.0 3B Vision is not a general-purpose VLM. It is a narrow, specialized extraction component - designed to replace the OCR layer in document automation pipelines, not to compete with GPT-4o on open-ended vision tasks. That narrowness is intentional and is what makes it reliable enough for production use.

Before vs After

The problem it solves is one that anyone who has built document processing automation will recognize immediately. The old approach was fragile from the start.

Traditional OCR Pipeline

  • OCR engine extracts raw text (Tesseract, Textract, Form Recognizer)
  • Custom regex and heuristics parse field positions
  • Separate parser per document layout and version
  • Charts are completely unreadable - treated as images
  • Complex nested tables break silently with wrong cell alignment
  • Manual QA pass required on every batch
  • Parser maintenance cost grows as document formats change

Granite 4.0 3B Vision

  • Single model handles charts, tables, and key-value pairs
  • Tag-based task routing - no prompt engineering required
  • Structured output (CSV, JSON, HTML) returned directly
  • Charts converted to executable Python code or data tables
  • Table structure preserved via TEDS-optimized training
  • 85.5% zero-shot KVP accuracy on VAREX benchmark
  • Apache 2.0 - no licensing restrictions for commercial use
[Diagram: Document Extraction Pipeline - Old vs New]

Old way (5+ steps, fragile): document image or scan → OCR engine (raw text dump) → custom parser per layout → manual QA to catch errors → structured output. Charts are invisible to OCR, table cells misalign, and parsers break on new layouts.

New way (1 API call): document image or scan + prompt tag (<chart2csv>, <tables_json>) → Granite 4.0 3B Vision (SigLIP2 + Q-Former + deepstack into Granite 4.0 Micro with rank-256 LoRA) → structured output (CSV / JSON / HTML / Python code). A single inference call - no parser maintenance; charts, tables, and KVP all handled.

How It Works

The architecture has four distinct components that work in sequence. Understanding each one explains why IBM made specific trade-off decisions.

[Diagram: Granite 4.0 3B Vision - Architecture Pipeline]

  • Vision encoder: SigLIP2 so400m-patch16-384, input tiled into 384x384 patches
  • Q-Former window projector: each 4x4 patch window compressed to 2x2 tokens (4x fewer visual tokens)
  • Deepstack injection at 8 points: LayerDeepstack (4 points, semantic features into early LLM layers) and SpatialDeepstack (4 points, spatial detail into later layers)
  • Language model: Granite 4.0 Micro (3B) with rank-256 LoRA on all attention and MLP layers
  • Task routing via prompt tag: <chart2csv> (CSV table), <chart2code> (Python code), <chart2summary> (NL description), <tables_json> (JSON + dimensions), <tables_html> (HTML table), <tables_otsl> (OTSL markup), plus schema-guided KVP extraction (pass a JSON schema in the prompt, get matched fields back as nested JSON)
  • Deployment modes: merged LoRA (fastest inference) or native LoRA (text and vision share one deployment)

Vision Encoder - SigLIP2. Input images are tiled into 384x384 patches, each encoded independently by the SigLIP2 vision encoder (google/siglip2-so400m-patch16-384). Tiling preserves resolution across large documents without forcing a single global resize that would destroy fine-grained text detail in tables or small chart labels.
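A toy sketch makes the tiling arithmetic concrete. It assumes simple non-overlapping ceiling-division tiling; the model's actual tiling policy (overlap, aspect-ratio handling, thumbnail tiles) may differ.

```python
import math

# Toy sketch: how many 384x384 tiles a page image produces, assuming
# plain non-overlapping tiling with ceiling division. The real tiling
# strategy may add overlap or a global thumbnail tile.
def tile_grid(width: int, height: int, patch: int = 384) -> tuple[int, int]:
    return math.ceil(width / patch), math.ceil(height / patch)

# A 300-DPI US Letter scan (2550 x 3300 px):
cols, rows = tile_grid(2550, 3300)
# cols = 7, rows = 9 -> 63 tiles, each encoded independently by SigLIP2
```

The point of tiling is visible here: a full-page scan keeps its native resolution across all 63 tiles instead of being squeezed into one 384x384 resize.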

Window Q-Former Projectors. After encoding, each 4x4 patch window is compressed to 2x2 tokens via cross-attention. This 4x reduction cuts the visual token budget fed to the LLM from an otherwise prohibitive count to something tractable without losing structural information. The window-based design preserves local spatial relationships - critical for table cell alignment.
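The token budget arithmetic can be sketched as follows. It assumes 16 px patches inside each 384 px tile (so 24x24 = 576 patches per tile, consistent with the patch16-384 encoder name) and the 4x4 → 2x2 window compression described above; the numbers are illustrative.

```python
# Sketch of the visual token budget before and after the Window Q-Former,
# assuming 16 px patches in a 384 px tile and 4x4 -> 2x2 compression.
def visual_tokens(n_tiles: int, tile_px: int = 384, patch_px: int = 16,
                  window: int = 4, out_side: int = 2) -> tuple[int, int]:
    side = tile_px // patch_px               # 24 patches per side
    raw = n_tiles * side * side              # patch tokens before compression
    factor = (window * window) // (out_side * out_side)  # 16 / 4 = 4x
    return raw, raw // factor

raw, compressed = visual_tokens(63)  # a 63-tile full-page scan
# 36288 raw patch tokens -> 9072 after compression
```

Without the 4x reduction, a single page would consume tens of thousands of visual tokens - prohibitive for a 3B LLM's context.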

LayerDeepstack and SpatialDeepstack. Most VLMs inject visual features once, at the input boundary of the LLM. Granite 4.0 3B Vision injects them at 8 distinct points: four LayerDeepstack injections feed semantic features into early LLM layers, and four SpatialDeepstack injections feed fine-grained spatial detail into later layers. Global context is established first; layout-level detail resolves ambiguity deeper in the stack.
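A toy illustration of the early/late split. The 40-layer depth and evenly spaced indices are entirely hypothetical - the real injection points are a property of the released checkpoint, not published in this summary.

```python
# Toy illustration of choosing deepstack injection depths: four early
# layers for semantic (LayerDeepstack) features and four later layers
# for spatial (SpatialDeepstack) detail. The layer count and spacing
# are made up for illustration.
def injection_points(n_layers: int = 40, per_stage: int = 4) -> dict:
    half = n_layers // 2
    early = [i * half // per_stage for i in range(per_stage)]
    late = [half + e for e in early]
    return {"layer_deepstack": early, "spatial_deepstack": late}

points = injection_points()
# {'layer_deepstack': [0, 5, 10, 15], 'spatial_deepstack': [20, 25, 30, 35]}
```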

LoRA on Granite 4.0 Micro. Rank-256 LoRA adapters are applied across all self-attention projections and MLP layers of the 3B base model. This enables a useful deployment trick: in the native LoRA runtime mode (vLLM), the base model handles text-only requests without any vision overhead. The adapter loads only when an image is present. One deployment serves both workloads.

Tag-based task routing. The model was fine-tuned to associate specific prompt tags with specific output formats. Send <chart2csv> and it outputs a CSV. Send <tables_json> and it outputs a JSON object with table dimensions and cell contents. For KVP extraction, you pass a JSON schema in the prompt and the model returns a structured JSON matching those fields - returning null for fields it cannot locate.

import json

# Chart extraction - 3 output formats from one image
chart_prompts = [
    "<chart2csv>",      # structured CSV data
    "<chart2summary>",  # natural language description
    "<chart2code>",     # executable Python to recreate the chart
]

# KVP extraction - schema-guided
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string"},
        "order_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}
prompt = f"Extract structured data. Schema:\n{json.dumps(schema)}\nReturn ONLY valid JSON."
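On the consuming side, the model's null-for-missing convention is worth handling explicitly: null means "not found", not "error". A minimal sketch, with a fabricated reply string for illustration:

```python
import json

# Sketch of consuming a schema-guided KVP reply. The model returns null
# for fields it cannot locate, so split found fields from missing ones.
# The reply string here is made up for illustration.
reply = '{"invoice_date": "2026-03-02", "order_number": null, "total_amount": 1834.50}'

def split_found(reply_json: str) -> tuple[dict, list[str]]:
    data = json.loads(reply_json)
    found = {k: v for k, v in data.items() if v is not None}
    missing = [k for k, v in data.items() if v is None]
    return found, missing

found, missing = split_found(reply)
# missing == ["order_number"] -> route to manual review or a retry pass
```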

Key Findings

At a glance: 7 extraction tasks · Apache 2.0 · native vLLM support · Docling integration

  • 85.5% - KVP exact-match accuracy on VAREX (zero-shot)
  • 4x - visual token compression via the Window Q-Former
  • 8 - vision-to-LLM injection points via deepstack

Why This Matters for AI and Automation Practitioners

Document extraction sits at the front of a surprisingly large percentage of enterprise automation pipelines. Invoices, contracts, financial reports, regulatory filings, research PDFs - anything that arrives as a scanned image or a visually complex PDF has historically required either expensive cloud OCR services or brittle custom parsers.

What changes: A 4B-parameter model on a single A100 now handles charts, tables, and key-value extraction at 85.5% accuracy without any fine-tuning. For many document types, that is production-ready out of the box. The operational cost drops from maintaining per-layout parsers to maintaining one model deployment.

For automation practitioners specifically, here is what is actionable:

  • Replace per-layout OCR-plus-parser chains with a single tagged inference call per document
  • Serve text-only and vision requests from one vLLM deployment using the native LoRA mode
  • Consume CSV, JSON, or HTML output directly - no post-hoc parsing layer for well-formed documents
  • Apache 2.0 licensing means no per-page OCR fees and no restrictions on commercial use
  • If you already run Docling, the model slots into that pipeline directly

The risk: The 85.5% KVP accuracy figure is zero-shot on VAREX. Real-world document diversity - handwritten annotations, faded scans, non-standard layouts - will push that number lower. Treat the benchmark as a ceiling for clean documents, not a floor for production pipelines.
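That gap between benchmark and production argues for a validation layer behind the model. A minimal sketch - the rules and field names are illustrative, not prescriptive:

```python
import re

# Minimal post-extraction validation sketch, per the caution above:
# required fields present, dates look like dates, totals are numeric.
# Field names and rules are hypothetical examples.
REQUIRED = {"invoice_date", "total_amount"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def needs_review(record: dict) -> list[str]:
    problems = [f for f in REQUIRED if record.get(f) is None]
    if record.get("invoice_date") and not DATE_RE.match(record["invoice_date"]):
        problems.append("invoice_date:format")
    if record.get("total_amount") is not None and not isinstance(
            record["total_amount"], (int, float)):
        problems.append("total_amount:type")
    return problems  # empty list -> record passes straight through
```

Records with a non-empty problem list go to manual review; everything else flows through untouched - which is how an 85% model still powers a reliable pipeline.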

My Take

The narrow specialization is both the model's strength and its honest positioning. IBM is not trying to compete with GPT-4o on visual reasoning or Gemini on multimodal QA. They are building a reliable extraction component that fits into document processing workflows the way a specialized tool should: predictably, with structured outputs, at low inference cost.

The tag-based interface is smart design. It makes the model's capabilities explicit at call time, forces the caller to declare intent, and avoids the ambiguous open-ended prompting that makes general VLMs unreliable for tasks that need deterministic structure. You cannot accidentally get a narrative response when you asked for CSV.

What I find most technically interesting is the deepstack injection architecture. The standard approach - inject visual features once at the LLM input boundary - treats all visual information as equally relevant at all processing depths. Injecting at 8 different depths, mixing semantic context early with spatial detail late, is a meaningful departure. The intuition is sound: global understanding should establish context first, and fine-grained layout detail should resolve ambiguity later. Whether this advantage generalizes beyond document tasks is worth watching.

The English-only constraint is real and will limit adoption in global enterprise settings. It is also understandable: the extraction-focused training dataset simply does not cover multilingual documents at this stage. That is a gap IBM will need to close in future releases to compete with commercial alternatives that already support 50-plus languages.

For practitioners deciding whether to adopt this now: if your pipeline processes English documents at volume and you are currently paying per-page for cloud OCR or maintaining custom parsers, this is worth a serious evaluation. If your documents are multilingual or require high-stakes accuracy beyond 90%, you need a validation layer on top or a different model.

Discussion question: Granite 4.0 3B Vision uses tag-based task routing rather than free-form instruction prompting - a design that trades flexibility for predictability. Does that constraint make it more or less useful than a general-purpose VLM like Qwen2.5-VL-3B in a production document automation pipeline, and at what document volume does a specialized extraction model start paying off versus maintaining a general one?
