Article 05 · April 2026

IBM Granite 4.0 3B Vision: The Compact VLM Built for Document Extraction

April 1, 2026 · by Satish K C · 11 min read
Tags: Agents · Automation · LLMs · Vision AI

The Big Idea

IBM Research released Granite 4.0 3B Vision on March 27, 2026 - a vision-language model of roughly 4B total parameters (a 3B language model plus vision components) built for one specific job: pulling structured data out of enterprise documents. Not general visual QA. Not image captioning. Extraction: charts to CSV, tables to JSON, invoices to key-value pairs.

The model sits on Granite 4.0 Micro (a 3B LLM), uses rank-256 LoRA adapters across all attention and MLP layers, and introduces a deepstack architecture that injects visual features into eight different LLM layer depths. It is Apache 2.0 licensed, vLLM-deployable, and integrates directly with IBM's open-source Docling document pipeline.

What makes it immediately relevant for automation practitioners is the interface: task routing via prompt tags. Send <chart2csv> with a chart image and get a structured CSV back. Send <tables_json> with a document scan and get a JSON object. The extraction pipeline collapses from a multi-step OCR-plus-parser chain into a single API call.
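To make the single-call interface concrete, here is a minimal sketch of building such a request. It assumes an OpenAI-compatible chat endpoint (as vLLM exposes) and an assumed model id - check the model card for the actual chat template before relying on this shape.

```python
import base64

# Hypothetical helper: builds an OpenAI-style chat payload that routes a
# document image to one extraction task via a prompt tag. The model id,
# and the exact message shape are assumptions, not confirmed API details.
def build_extraction_request(image_bytes: bytes, tag: str) -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "ibm-granite/granite-4.0-3b-vision",  # assumed model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": tag},  # e.g. "<chart2csv>"
            ],
        }],
        "temperature": 0.0,  # extraction wants deterministic output
    }

req = build_extraction_request(b"\x89PNG...", "<chart2csv>")
```

The whole pipeline is this one payload: no OCR pass, no layout-specific parser, just an image plus a task tag.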

Key distinction: Granite 4.0 3B Vision is not a general-purpose VLM. It is a narrow, specialized extraction component - designed to replace the OCR layer in document automation pipelines, not to compete with GPT-4o on open-ended vision tasks. That narrowness is intentional and is what makes it reliable enough for production use.

Before vs After

The problem it solves is one that anyone who has built document processing automation will recognize immediately. The old approach was fragile from the start.

Traditional OCR Pipeline

  • OCR engine extracts raw text (Tesseract, Textract, Form Recognizer)
  • Custom regex and heuristics parse field positions
  • Separate parser per document layout and version
  • Charts are completely unreadable - treated as images
  • Complex nested tables break silently with wrong cell alignment
  • Manual QA pass required on every batch
  • Parser maintenance cost grows as document formats change

Granite 4.0 3B Vision

  • Single model handles charts, tables, and key-value pairs
  • Tag-based task routing - no prompt engineering required
  • Structured output (CSV, JSON, HTML) returned directly
  • Charts converted to executable Python code or data tables
  • Table structure preserved via TEDS-optimized training
  • 85.5% zero-shot KVP accuracy on VAREX benchmark
  • Apache 2.0 - no licensing restrictions for commercial use
[Diagram: Document Extraction Pipeline - Old vs New]

Old way (5+ steps, fragile): document image or scan → OCR engine (raw text dump) → custom parser per layout → manual QA to catch errors → structured output. Charts are invisible to OCR, table cells misalign, and parsers break on new layouts.

New way (1 API call): document image or scan + prompt tag (<chart2csv>, <tables_json>) → Granite 4.0 3B Vision (SigLIP2 + Q-Former + deepstack into Granite 4.0 Micro with rank-256 LoRA) → structured output (CSV / JSON / HTML / Python code). A single inference call - no parser maintenance; charts, tables, and KVP all handled.

How It Works

The architecture has four distinct components that work in sequence. Understanding each one explains why IBM made specific trade-off decisions.

[Diagram: Granite 4.0 3B Vision - Architecture Pipeline]

  • Vision encoder: SigLIP2 so400m-patch16-384, input tiled into 384x384 patches
  • Q-Former window projector: each 4x4 patch window compressed to 2x2 tokens (4x fewer visual tokens)
  • Deepstack injection at 8 points: LayerDeepstack (4 points, semantic features into early LLM layers) and SpatialDeepstack (4 points, spatial detail into later layers)
  • Language model: Granite 4.0 Micro (3B) with rank-256 LoRA on all attention and MLP layers
  • Task routing via prompt tag: <chart2csv> (CSV table), <chart2code> (Python code), <chart2summary> (NL description), <tables_json> (JSON + dimensions), <tables_html> (HTML table), <tables_otsl> (OTSL markup), plus schema-guided KVP extraction (pass a JSON schema in the prompt, get matched fields back as nested JSON)
  • Deployment modes: merged LoRA (fastest inference) or native LoRA (text and vision share one deployment)

Vision Encoder - SigLIP2. Input images are tiled into 384x384 patches, each encoded independently by the SigLIP2 vision encoder (google/siglip2-so400m-patch16-384). Tiling preserves resolution across large documents without forcing a single global resize that would destroy fine-grained text detail in tables or small chart labels.
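A toy sketch makes the tiling arithmetic concrete. It assumes simple non-overlapping ceiling-division tiling; the model's actual tiling policy (overlap, aspect-ratio handling, thumbnail tiles) may differ.

```python
import math

# Toy sketch: how many 384x384 tiles a page image produces, assuming
# plain non-overlapping tiling with ceiling division. The real tiling
# strategy may add overlap or a global thumbnail tile.
def tile_grid(width: int, height: int, patch: int = 384) -> tuple[int, int]:
    return math.ceil(width / patch), math.ceil(height / patch)

# A 300-DPI US Letter scan (2550 x 3300 px):
cols, rows = tile_grid(2550, 3300)
# cols = 7, rows = 9 -> 63 tiles, each encoded independently by SigLIP2
```

The point of tiling is visible here: a full-page scan keeps its native resolution across all 63 tiles instead of being squeezed into one 384x384 resize.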

Window Q-Former Projectors. After encoding, each 4x4 patch window is compressed to 2x2 tokens via cross-attention. This 4x reduction cuts the visual token budget fed to the LLM from an otherwise prohibitive count to something tractable without losing structural information. The window-based design preserves local spatial relationships - critical for table cell alignment.
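The token budget arithmetic can be sketched as follows. It assumes 16 px patches inside each 384 px tile (so 24x24 = 576 patches per tile, consistent with the patch16-384 encoder name) and the 4x4 → 2x2 window compression described above; the numbers are illustrative.

```python
# Sketch of the visual token budget before and after the Window Q-Former,
# assuming 16 px patches in a 384 px tile and 4x4 -> 2x2 compression.
def visual_tokens(n_tiles: int, tile_px: int = 384, patch_px: int = 16,
                  window: int = 4, out_side: int = 2) -> tuple[int, int]:
    side = tile_px // patch_px               # 24 patches per side
    raw = n_tiles * side * side              # patch tokens before compression
    factor = (window * window) // (out_side * out_side)  # 16 / 4 = 4x
    return raw, raw // factor

raw, compressed = visual_tokens(63)  # a 63-tile full-page scan
# 36288 raw patch tokens -> 9072 after compression
```

Without the 4x reduction, a single page would consume tens of thousands of visual tokens - prohibitive for a 3B LLM's context.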

LayerDeepstack and SpatialDeepstack. Most VLMs inject visual features once, at the input boundary of the LLM. Granite 4.0 3B Vision injects them at 8 distinct points: four LayerDeepstack injections feed semantic features into early LLM layers, and four SpatialDeepstack injections feed fine-grained spatial detail into later layers. Global context is established first; layout-level detail resolves ambiguity deeper in the stack.
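A toy illustration of the early/late split. The 40-layer depth and evenly spaced indices are entirely hypothetical - the real injection points are a property of the released checkpoint, not published in this summary.

```python
# Toy illustration of choosing deepstack injection depths: four early
# layers for semantic (LayerDeepstack) features and four later layers
# for spatial (SpatialDeepstack) detail. The layer count and spacing
# are made up for illustration.
def injection_points(n_layers: int = 40, per_stage: int = 4) -> dict:
    half = n_layers // 2
    early = [i * half // per_stage for i in range(per_stage)]
    late = [half + e for e in early]
    return {"layer_deepstack": early, "spatial_deepstack": late}

points = injection_points()
# {'layer_deepstack': [0, 5, 10, 15], 'spatial_deepstack': [20, 25, 30, 35]}
```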

LoRA on Granite 4.0 Micro. Rank-256 LoRA adapters are applied across all self-attention projections and MLP layers of the 3B base model. This enables a useful deployment trick: in the native LoRA runtime mode (vLLM), the base model handles text-only requests without any vision overhead. The adapter loads only when an image is present. One deployment serves both workloads.

Tag-based task routing. The model was fine-tuned to associate specific prompt tags with specific output formats. Send <chart2csv> and it outputs a CSV. Send <tables_json> and it outputs a JSON object with table dimensions and cell contents. For KVP extraction, you pass a JSON schema in the prompt and the model returns a structured JSON matching those fields - returning null for fields it cannot locate.

import json

# Chart extraction - 3 output formats from one image
chart_prompts = [
    "<chart2csv>",      # structured CSV data
    "<chart2summary>",  # natural language description
    "<chart2code>",     # executable Python to recreate the chart
]

# KVP extraction - schema-guided
schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string"},
        "order_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}
prompt = f"Extract structured data. Schema:\n{json.dumps(schema)}\nReturn ONLY valid JSON."
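On the consuming side, the model's null-for-missing convention is worth handling explicitly: null means "not found", not "error". A minimal sketch, with a fabricated reply string for illustration:

```python
import json

# Sketch of consuming a schema-guided KVP reply. The model returns null
# for fields it cannot locate, so split found fields from missing ones.
# The reply string here is made up for illustration.
reply = '{"invoice_date": "2026-03-02", "order_number": null, "total_amount": 1834.50}'

def split_found(reply_json: str) -> tuple[dict, list[str]]:
    data = json.loads(reply_json)
    found = {k: v for k, v in data.items() if v is not None}
    missing = [k for k, v in data.items() if v is None]
    return found, missing

found, missing = split_found(reply)
# missing == ["order_number"] -> route to manual review or a retry pass
```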

Key Findings

At a glance: 7 extraction tasks · Apache 2.0 · native vLLM support · Docling integration

  • 85.5% - KVP exact-match accuracy on VAREX (zero-shot)
  • 4x - visual token compression via the Window Q-Former
  • 8 - vision-to-LLM injection points via deepstack

Why This Matters for AI and Automation Practitioners

Document extraction sits at the front of a surprisingly large percentage of enterprise automation pipelines. Invoices, contracts, financial reports, regulatory filings, research PDFs - anything that arrives as a scanned image or a visually complex PDF has historically required either expensive cloud OCR services or brittle custom parsers.

What changes: A 4B-parameter model on a single A100 now handles charts, tables, and key-value extraction at 85.5% accuracy without any fine-tuning. For many document types, that is production-ready out of the box. The operational cost drops from maintaining per-layout parsers to maintaining one model deployment.

For automation practitioners specifically, here is what is actionable:

  • Replace per-layout OCR-plus-parser chains with a single tagged inference call per document
  • Serve text-only and vision requests from one vLLM deployment using the native LoRA mode
  • Consume CSV, JSON, or HTML output directly - no post-hoc parsing layer for well-formed documents
  • Apache 2.0 licensing means no per-page OCR fees and no restrictions on commercial use
  • If you already run Docling, the model slots into that pipeline directly

The risk: The 85.5% KVP accuracy figure is zero-shot on VAREX. Real-world document diversity - handwritten annotations, faded scans, non-standard layouts - will push that number lower. Treat the benchmark as a ceiling for clean documents, not a floor for production pipelines.
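That gap between benchmark and production argues for a validation layer behind the model. A minimal sketch - the rules and field names are illustrative, not prescriptive:

```python
import re

# Minimal post-extraction validation sketch, per the caution above:
# required fields present, dates look like dates, totals are numeric.
# Field names and rules are hypothetical examples.
REQUIRED = {"invoice_date", "total_amount"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def needs_review(record: dict) -> list[str]:
    problems = [f for f in REQUIRED if record.get(f) is None]
    if record.get("invoice_date") and not DATE_RE.match(record["invoice_date"]):
        problems.append("invoice_date:format")
    if record.get("total_amount") is not None and not isinstance(
            record["total_amount"], (int, float)):
        problems.append("total_amount:type")
    return problems  # empty list -> record passes straight through
```

Records with a non-empty problem list go to manual review; everything else flows through untouched - which is how an 85% model still powers a reliable pipeline.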

My Take

The narrow specialization is both the model's strength and its honest positioning. IBM is not trying to compete with GPT-4o on visual reasoning or Gemini on multimodal QA. They are building a reliable extraction component that fits into document processing workflows the way a specialized tool should: predictably, with structured outputs, at low inference cost.

The tag-based interface is smart design. It makes the model's capabilities explicit at call time, forces the caller to declare intent, and avoids the ambiguous open-ended prompting that makes general VLMs unreliable for tasks that need deterministic structure. You cannot accidentally get a narrative response when you asked for CSV.

What I find most technically interesting is the deepstack injection architecture. The standard approach - inject visual features once at the LLM input boundary - treats all visual information as equally relevant at all processing depths. Injecting at 8 different depths, mixing semantic context early with spatial detail late, is a meaningful departure. The intuition is sound: global understanding should establish context first, and fine-grained layout detail should resolve ambiguity later. Whether this advantage generalizes beyond document tasks is worth watching.

The English-only constraint is real and will limit adoption in global enterprise settings. It is also understandable: the extraction-focused training dataset simply does not cover multilingual documents at this stage. That is a gap IBM will need to close in future releases to compete with commercial alternatives that already support 50-plus languages.

For practitioners deciding whether to adopt this now: if your pipeline processes English documents at volume and you are currently paying per-page for cloud OCR or maintaining custom parsers, this is worth a serious evaluation. If your documents are multilingual or require high-stakes accuracy beyond 90%, you need a validation layer on top or a different model.

Discussion question: Granite 4.0 3B Vision uses tag-based task routing rather than free-form instruction prompting - a design that trades flexibility for predictability. Does that constraint make it more or less useful than a general-purpose VLM like Qwen2.5-VL-3B in a production document automation pipeline, and at what document volume does a specialized extraction model start paying off versus maintaining a general one?
