Document Intelligence — Archive
Tools and methods for turning documents into structured, queryable knowledge
A structured index of every tool, model, and method in this vault that touches document intelligence. Three buckets: Ingestion (parse the document into structure), Retrieval (query the structure), Combined (frameworks that bundle both). Wiki links go to deeper notes.
1. Ingestion
The job decomposes into three layers (see [[pdf-parsing-landscape]]):
- Recognition - pixels → characters (OCR; required for scanned docs).
- Structure - page → hierarchy (titles, tables, reading order, bounding boxes).
- Understanding - structure → downstream artifact (Q&A pairs, fields, tagged PDFs).
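The three layers above compose into a pipeline. A minimal sketch, with each stage stubbed by a toy heuristic (real systems plug in an OCR model, a layout model, and an LLM respectively; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str   # "title" or "paragraph" in this toy
    text: str

def recognize(page_pixels: bytes) -> str:
    """Recognition: pixels -> characters. Stubbed: pretend OCR is perfect."""
    return page_pixels.decode("utf-8")

def structure(raw_text: str) -> list[Block]:
    """Structure: page -> hierarchy. Toy heuristic: ALL-CAPS lines are titles."""
    return [Block("title" if line.isupper() else "paragraph", line)
            for line in raw_text.splitlines() if line.strip()]

def understand(blocks: list[Block]) -> dict:
    """Understanding: structure -> downstream artifact (here: extracted fields)."""
    return {"title": next((b.text for b in blocks if b.kind == "title"), None),
            "body": [b.text for b in blocks if b.kind == "paragraph"]}

doc = understand(structure(recognize(b"INVOICE 42\nTotal due: $10")))
```

The point of the decomposition is that each stage can be swapped independently: a digital PDF skips `recognize` entirely (section 1.2), while VLM-direct approaches collapse all three into one model call.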
1.1 OCR + structure (specialized models)
| Tool | Size | Score (OmniDocBench v1.5) | Notes |
|---|---|---|---|
| GLM-OCR (Zhipu AI) | 0.9B | 94.62 | Two-stage layout + parallel recognition; Multi-Token Prediction · [[pdf-parsing-landscape]] |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 | The OCR layer under MinerU / RAGFlow / OmniParser / Umi-OCR · [[paddleocr-document-parsing]] |
| PaddleOCR-VL | 0.9B | 92.86 | Predecessor · [[paddleocr-document-parsing]] |
| MinerU 2.5 | 1.2B | 90.67 | Sits on PaddleOCR · [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-3B | 3.7B | 88.85 | [[pdf-parsing-landscape]] |
| dots.ocr | 3B | 88.41 | [[pdf-parsing-landscape]] |
| MonkeyOCR-3B | 3.7B | 87.13 | [[pdf-parsing-landscape]] |
| DeepSeek-OCR | 3B | 87.01 | [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-1.2B | 1.9B | 86.96 | [[pdf-parsing-landscape]] |
| Nanonets-OCR-s | 3B | 85.59 | [[pdf-parsing-landscape]] |
| MinerU2-VLM | 0.9B | 85.56 | [[pdf-parsing-landscape]] |
| Dolphin-1.5 | 0.3B | 83.21 | [[pdf-parsing-landscape]] |
| olmOCR-7B | 7B | 81.79 | [[pdf-parsing-landscape]] |
| POINTS-Reader | 3B | 80.98 | [[pdf-parsing-landscape]] |
| Mistral OCR | hosted | 78.83 | [[pdf-parsing-landscape]] |
| OCRFlux-3B | 3B | 74.82 | [[pdf-parsing-landscape]] |
| Dolphin | 0.3B | 74.67 | [[pdf-parsing-landscape]] |
| Tesseract | open | baseline | Pre-LLM OCR; still inside many pipelines · [[pdf-parsing-landscape]] |
1.2 Structure-only (digital PDFs, OCR-free)
| Tool | One-line | Notes |
|---|---|---|
| OpenDataLoader | Deterministic local + optional hybrid AI; accessibility auto-tagging (EAA / ADA / Section 508) | [[opendataloader-pdf-parser]] |
| PP-StructureV3 | Layout DAG pipeline | [[paddleocr-document-parsing]] |
| Marker | Pipeline parser | [[pdf-parsing-landscape]] |
1.3 Generalist frameworks (many formats)
| Tool | One-line | Notes |
|---|---|---|
| Kreuzberg | Rust core, 91+ formats (PDF / Office / eBooks / LaTeX / Hangul / archives), 12 polyglot bindings, MCP server | [[kreuzberg-document-intelligence]] |
1.4 General VLMs used as OCR
| Tool | Score (OmniDocBench v1.5) | Notes |
|---|---|---|
| Gemini-3 Pro | 90.33 | [[pdf-parsing-landscape]] |
| Qwen3-VL (235B) | 89.15 | [[pdf-parsing-landscape]] |
| Gemini-2.5 Pro | 88.03 | [[pdf-parsing-landscape]] |
| Qwen2.5-VL (72B) | 87.02 | [[pdf-parsing-landscape]] |
| GPT-5.2 | 85.50 | [[pdf-parsing-landscape]] |
| InternVL3.5 (241B) | 82.67 | [[pdf-parsing-landscape]] |
| InternVL3 (76B) | 80.33 | [[pdf-parsing-landscape]] |
| GPT-4o | 75.02 | [[pdf-parsing-landscape]] |
1.5 In-place augmentation (PDF stays a PDF)
| Tool | One-line | Notes |
|---|---|---|
| OCRmyPDF | Adds a hidden Tesseract OCR text layer to scanned PDFs while preserving the original visual; defaults to PDF/A; powers Paperless-ngx and Nextcloud OCR | [[ocrmypdf-searchable-pdf]] |
1.6 Visual diff (not a parser)
| Tool | One-line | Notes |
|---|---|---|
| diff-pdf | Pixel-diffs two rendered PDFs for CI regression | [[diff-pdf-visual-compare]] |
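The CI-regression gate diff-pdf implements reduces to "do the rendered rasters match?". A stdlib sketch of that idea, assuming pages have already been rasterized to byte buffers (the rasterization step itself is out of scope here):

```python
import hashlib

def pages_match(render_a: list[bytes], render_b: list[bytes]) -> bool:
    """CI gate in the diff-pdf spirit: compare rasterized pages byte-for-byte.
    Hashing keeps memory flat on large documents; any pixel change flips it."""
    if len(render_a) != len(render_b):
        return False
    return all(hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
               for a, b in zip(render_a, render_b))
```

The actual tool does this end to end and can emit a visual diff: `diff-pdf --output-diff=diff.pdf a.pdf b.pdf`.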
1.7 Benchmarks
| Benchmark | What it measures | Notes |
|---|---|---|
| OmniDocBench v1.5 | Text similarity to reference; favors specialist OCR models | [[pdf-parsing-landscape]] |
| ParseBench (LlamaIndex) | Whether parsed output is reliable for autonomous agents; favors VLMs | [[pdf-parsing-landscape]] |
| FinanceBench | Q&A over financial filings; PageIndex / Mafin 2.5 = 98.7% SOTA | [[pageindex-vectorless-rag]] |
2. Retrieval
The catalog ([[rag-techniques]]) organizes ~30 techniques into families.
| Family | Techniques | Notes |
|---|---|---|
| Foundational | Simple RAG, Reliable RAG, Choose Chunk Size, Proposition Chunking, Simple RAG (CSV / JSON) | [[rag-techniques]] |
| Query enhancement | Query Transformations · HyDE (avoid in practice) | [[rag-techniques]] |
| Context enrichment | HyPE, Contextual Chunk Headers, Relevant Segment Extraction, Semantic Chunking, Contextual Compression, Document Augmentation | [[rag-techniques]] |
| Advanced retrieval | Fusion Retrieval, Intelligent Reranking, Hierarchical Indices, Multi-faceted Filtering, Dartboard, Multi-modal Retrieval | [[rag-techniques]] |
| Iterative + adaptive | Feedback Loops, Adaptive Retrieval, Self-RAG, Corrective RAG | [[rag-techniques]] |
| Memory-augmented | MemoRAG | [[rag-techniques]] |
| Graph-based | Graph RAG, Microsoft GraphRAG, Knowledge Graph Integration, RAPTOR | [[graph-rag]] |
| Vector + graph hybrid | LightRAG | [[lightrag-algorithm]] |
| Vectorless / tree-search | PageIndex - TOC tree + LLM tree reasoning, no embeddings | [[pageindex-vectorless-rag]] |
| Explainability | Explainable Retrieval | [[rag-techniques]] |
| Capstone | Sophisticated Controllable Agent for Complex RAG Tasks | [[rag-techniques]] |
| Evaluation | DeepEval, GroUSE, Open-RAG-Eval, End-to-End RAG Evaluation | [[rag-techniques]] |
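The vectorless / tree-search row deserves one concrete illustration, since it is the only family with no embedding step at all. A hypothetical sketch of PageIndex-style retrieval: an LLM walks a table-of-contents tree instead of ranking vectors (here `score` stands in for the LLM relevance judgment, and the greedy descent is a simplification of what the real system does):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)

def tree_search(node: Node, query: str, score) -> Node:
    """Descend greedily: at each level, keep the child the judge prefers.
    A real system would let the LLM backtrack or expand several branches."""
    while node.children:
        node = max(node.children, key=lambda c: score(query, c.title))
    return node

toc = Node("10-K Filing", (1, 120), [
    Node("Risk Factors", (10, 30)),
    Node("Financial Statements", (60, 110), [
        Node("Balance Sheet", (62, 64)),
        Node("Income Statement", (65, 67)),
    ]),
])

# Toy judge: word overlap between query and section title.
overlap = lambda q, t: len(set(q.lower().split()) & set(t.lower().split()))
hit = tree_search(toc, "income statement in the financial statements", overlap)
```

The retrieval result is a page range, not a chunk: the answer "where to read" replaces the answer "which embedding was nearest".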
3. Combined (parsing + retrieval bundled)
| Tool | What it bundles | Notes |
|---|---|---|
| RAG-Anything (HKUDS) | MinerU parsing + image captioning + Graph RAG + multimodal retrieval | [[multimodal-rag]] |
| PageIndex (self-hosted) | Tree-of-contents index + LLM tree-search retrieval; vision mode skips OCR | [[pageindex-vectorless-rag]] |
| Mafin 2.5 | PageIndex + reasoning retrieval; SOTA on FinanceBench (98.7%) | [[pageindex-vectorless-rag]] |
| Kreuzberg | All-formats parsing + OCR (146 LLM providers) + tree-sitter code intelligence + MCP server | [[kreuzberg-document-intelligence]] |
| OpenKB (VectifyAI) | Karpathy's "LLM Knowledge Bases" workflow as a CLI; markitdown for short docs + PageIndex for long PDFs; compiles a wiki the LLM maintains; Obsidian-compatible | [[openkb-knowledge-base]] |
4. Code intelligence (sibling field, same primitives)
When the "document" is a codebase, the same primitives apply, with tree-sitter as the shared parsing layer:
| Tool | One-line | Notes |
|---|---|---|
| Graphify | Tree-sitter (25 langs) + NetworkX + Leiden community detection; agent-native via PreToolUse hooks | [[code-to-knowledge-graph]] |
| Kreuzberg | Tree-sitter (248 langs); extracts functions / classes / imports / docstrings + syntax-aware chunking | [[kreuzberg-document-intelligence]] |
| Joern (CPG) | Code Property Graph (AST + CFG + PDG merged) for security and deep static analysis | not yet captured |
| GitNexus | MCP-native knowledge graph engine for agents | not yet captured |
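The payoff named in the table is syntax-aware chunking: split at function and class boundaries so no chunk cuts a definition in half. A sketch of that property using the stdlib `ast` module as a stand-in for a tree-sitter grammar (this covers Python only; tree-sitter is what generalizes it across languages):

```python
import ast

def syntax_aware_chunks(source: str) -> list[dict]:
    """Chunk a module at top-level def/class boundaries, keeping each
    definition whole and carrying its name and docstring as metadata."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "doc": ast.get_docstring(node),
                "text": ast.get_source_segment(source, node),
            })
    return chunks

code = '''
def parse(blob):
    """Turn bytes into blocks."""
    return blob.split()

class Indexer:
    """Builds the retrieval index."""
'''
chunks = syntax_aware_chunks(code)
```

Each chunk is then a self-describing retrieval unit, which is why the same primitive serves both document RAG and code RAG.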
5. Wiki hub notes
- [[pdf-parsing-landscape]] - the layered map of parsers
- [[rag-techniques]] - the catalog of retrieval techniques
- [[rag-learning-path]] - how to practice them
- [[multimodal-rag]] - non-text-only RAG
- [[graph-rag]] - graph-based retrieval
- [[code-to-knowledge-graph]] - the codebase analog
- [[kreuzberg-document-intelligence]] - the cross-cutting generalist tool
- [[openkb-knowledge-base]] - the corpus-level wiki compiler that mirrors this vault's own pattern
6. Counterargument worth knowing
VectifyAI's "Do We Still Need OCR?" makes an information-theoretic case that, for visually complex documents, OCR is non-invertible: it discards layout information, which caps the quality ceiling of any pipeline built on it. The proposed alternative: VLM-direct on page images, with [[pageindex-vectorless-rag|PageIndex]] tree-indexing as the vectorless retrieval layer. Recorded in the [[pdf-parsing-landscape#When not to OCR at all|landscape note]].
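The non-invertibility point can be made concrete with a toy example (all structures here are hypothetical): two visually different pages that serialize to identical text, so no downstream stage can recover which layout produced it.

```python
# Page 1: two-column layout. Page 2: single column with a pull-quote.
page_two_column = {"columns": [["Revenue rose."], ["Costs fell."]]}
page_pull_quote = {"body": ["Revenue rose.", "Costs fell."], "pull_quote": 1}

def ocr_linearize(page: dict) -> str:
    """Toy OCR: flatten whatever structure the page has into plain text."""
    if "columns" in page:
        return " ".join(line for col in page["columns"] for line in col)
    return " ".join(page["body"])

# Equal outputs from unequal inputs: the layout information is gone,
# and no retrieval layer downstream can get it back.
assert ocr_linearize(page_two_column) == ocr_linearize(page_pull_quote)
```

This is exactly the information a VLM-direct pipeline retains by operating on the page image itself.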