Document Intelligence — Archive
Tools and methods for turning documents into structured, queryable knowledge
A structured index of every tool, model, and method in this vault that touches document intelligence. Three buckets: Ingestion (parse the document into structure), Retrieval (query the structure), Combined (frameworks that bundle both). Wiki links go to deeper notes.
1. Ingestion
The job decomposes into three layers (see [[pdf-parsing-landscape]]):
- Recognition - pixels → characters (OCR; required for scanned docs).
- Structure - page → hierarchy (titles, tables, reading order, bounding boxes).
- Understanding - structure → downstream artifact (Q&A pairs, fields, tagged PDFs).
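The three layers above compose into a pipeline. A minimal sketch, with each stage stubbed by a toy heuristic (real systems plug in an OCR model, a layout model, and an LLM respectively; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str   # "title" or "paragraph" in this toy
    text: str

def recognize(page_pixels: bytes) -> str:
    """Recognition: pixels -> characters. Stubbed: pretend OCR is perfect."""
    return page_pixels.decode("utf-8")

def structure(raw_text: str) -> list[Block]:
    """Structure: page -> hierarchy. Toy heuristic: ALL-CAPS lines are titles."""
    return [Block("title" if line.isupper() else "paragraph", line)
            for line in raw_text.splitlines() if line.strip()]

def understand(blocks: list[Block]) -> dict:
    """Understanding: structure -> downstream artifact (here: extracted fields)."""
    return {"title": next((b.text for b in blocks if b.kind == "title"), None),
            "body": [b.text for b in blocks if b.kind == "paragraph"]}

doc = understand(structure(recognize(b"INVOICE 42\nTotal due: $10")))
```

The point of the decomposition is that each stage can be swapped independently: a digital PDF skips `recognize` entirely (section 1.2), while VLM-direct approaches collapse all three into one model call.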
1.1 OCR + structure (specialized models)
| Tool | Size | Score (OmniDocBench v1.5) | Notes |
|---|---|---|---|
| GLM-OCR (Zhipu AI) | 0.9B | 94.62 | Two-stage layout + parallel recognition; Multi-Token Prediction · [[pdf-parsing-landscape]] |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 | The OCR layer under MinerU / RAGFlow / OmniParser / Umi-OCR · [[paddleocr-document-parsing]] |
| PaddleOCR-VL | 0.9B | 92.86 | Predecessor · [[paddleocr-document-parsing]] |
| MinerU 2.5 | 1.2B | 90.67 | Sits on PaddleOCR · [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-3B | 3.7B | 88.85 | [[pdf-parsing-landscape]] |
| dots.ocr | 3B | 88.41 | [[pdf-parsing-landscape]] |
| MonkeyOCR-3B | 3.7B | 87.13 | [[pdf-parsing-landscape]] |
| DeepSeek-OCR | 3B | 87.01 | [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-1.2B | 1.9B | 86.96 | [[pdf-parsing-landscape]] |
| Nanonets-OCR-s | 3B | 85.59 | [[pdf-parsing-landscape]] |
| MinerU2-VLM | 0.9B | 85.56 | [[pdf-parsing-landscape]] |
| Dolphin-1.5 | 0.3B | 83.21 | [[pdf-parsing-landscape]] |
| olmOCR-7B | 7B | 81.79 | [[pdf-parsing-landscape]] |
| POINTS-Reader | 3B | 80.98 | [[pdf-parsing-landscape]] |
| Mistral OCR | hosted | 78.83 | [[pdf-parsing-landscape]] |
| OCRFlux-3B | 3B | 74.82 | [[pdf-parsing-landscape]] |
| Dolphin | 0.3B | 74.67 | [[pdf-parsing-landscape]] |
| Tesseract | open | baseline | Pre-LLM OCR; still inside many pipelines · [[pdf-parsing-landscape]] |
1.2 Structure-only (digital PDFs, OCR-free)
| Tool | One-line | Notes |
|---|---|---|
| OpenDataLoader | Deterministic local + optional hybrid AI; accessibility auto-tagging (EAA / ADA / Section 508) | [[opendataloader-pdf-parser]] |
| PP-StructureV3 | Layout DAG pipeline | [[paddleocr-document-parsing]] |
| Marker | Pipeline parser | [[pdf-parsing-landscape]] |
1.3 Generalist frameworks (many formats)
| Tool | One-line | Notes |
|---|---|---|
| Kreuzberg | Rust core, 91+ formats (PDF / Office / eBooks / LaTeX / Hangul / archives), 12 polyglot bindings, MCP server | [[kreuzberg-document-intelligence]] |
1.4 General VLMs used as OCR
| Tool | Score (OmniDocBench v1.5) | Notes |
|---|---|---|
| Gemini-3 Pro | 90.33 | [[pdf-parsing-landscape]] |
| Qwen3-VL (235B) | 89.15 | [[pdf-parsing-landscape]] |
| Gemini-2.5 Pro | 88.03 | [[pdf-parsing-landscape]] |
| Qwen2.5-VL (72B) | 87.02 | [[pdf-parsing-landscape]] |
| GPT-5.2 | 85.50 | [[pdf-parsing-landscape]] |
| InternVL3.5 (241B) | 82.67 | [[pdf-parsing-landscape]] |
| InternVL3 (76B) | 80.33 | [[pdf-parsing-landscape]] |
| GPT-4o | 75.02 | [[pdf-parsing-landscape]] |
1.5 In-place augmentation (PDF stays a PDF)
| Tool | One-line | Notes |
|---|---|---|
| OCRmyPDF | Adds a hidden Tesseract OCR text layer to scanned PDFs while preserving the original visual; defaults to PDF/A; powers Paperless-ngx and Nextcloud OCR | [[ocrmypdf-searchable-pdf]] |
1.6 Visual diff (not a parser)
| Tool | One-line | Notes |
|---|---|---|
| diff-pdf | Pixel-diffs two rendered PDFs for CI regression | [[diff-pdf-visual-compare]] |
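The CI-regression gate diff-pdf implements reduces to "do the rendered rasters match?". A stdlib sketch of that idea, assuming pages have already been rasterized to byte buffers (the rasterization step itself is out of scope here):

```python
import hashlib

def pages_match(render_a: list[bytes], render_b: list[bytes]) -> bool:
    """CI gate in the diff-pdf spirit: compare rasterized pages byte-for-byte.
    Hashing keeps memory flat on large documents; any pixel change flips it."""
    if len(render_a) != len(render_b):
        return False
    return all(hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
               for a, b in zip(render_a, render_b))
```

The actual tool does this end to end and can emit a visual diff: `diff-pdf --output-diff=diff.pdf a.pdf b.pdf`.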
1.7 Benchmarks
| Benchmark | What it measures | Notes |
|---|---|---|
| OmniDocBench v1.5 | Text similarity to reference; favors specialist OCR models | [[pdf-parsing-landscape]] |
| ParseBench (LlamaIndex) | Whether parsed output is reliable for autonomous agents; favors VLMs | [[pdf-parsing-landscape]] |
| FinanceBench | Q&A over financial filings; PageIndex / Mafin 2.5 = 98.7% SOTA | [[pageindex-vectorless-rag]] |
2. Retrieval
The catalog ([[rag-techniques]]) organizes ~30 techniques into families.
| Family | Techniques | Notes |
|---|---|---|
| Foundational | Simple RAG, Reliable RAG, Choose Chunk Size, Proposition Chunking, Simple RAG (CSV / JSON) | [[rag-techniques]] |
| Query enhancement | Query Transformations · HyDE (avoid in practice) | [[rag-techniques]] |
| Context enrichment | HyPE, Contextual Chunk Headers, Relevant Segment Extraction, Semantic Chunking, Contextual Compression, Document Augmentation | [[rag-techniques]] |
| Advanced retrieval | Fusion Retrieval, Intelligent Reranking, Hierarchical Indices, Multi-faceted Filtering, Dartboard, Multi-modal Retrieval | [[rag-techniques]] |
| Iterative + adaptive | Feedback Loops, Adaptive Retrieval, Self-RAG, Corrective RAG | [[rag-techniques]] |
| Memory-augmented | MemoRAG | [[rag-techniques]] |
| Graph-based | Graph RAG, Microsoft GraphRAG, Knowledge Graph Integration, RAPTOR | [[graph-rag]] |
| Vector + graph hybrid | LightRAG | [[lightrag-algorithm]] |
| Vectorless / tree-search | PageIndex - TOC tree + LLM tree reasoning, no embeddings | [[pageindex-vectorless-rag]] |
| Explainability | Explainable Retrieval | [[rag-techniques]] |
| Capstone | Sophisticated Controllable Agent for Complex RAG Tasks | [[rag-techniques]] |
| Evaluation | DeepEval, GroUSE, Open-RAG-Eval, End-to-End RAG Evaluation | [[rag-techniques]] |
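The vectorless / tree-search row deserves one concrete illustration, since it is the only family with no embedding step at all. A hypothetical sketch of PageIndex-style retrieval: an LLM walks a table-of-contents tree instead of ranking vectors (here `score` stands in for the LLM relevance judgment, and the greedy descent is a simplification of what the real system does):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)

def tree_search(node: Node, query: str, score) -> Node:
    """Descend greedily: at each level, keep the child the judge prefers.
    A real system would let the LLM backtrack or expand several branches."""
    while node.children:
        node = max(node.children, key=lambda c: score(query, c.title))
    return node

toc = Node("10-K Filing", (1, 120), [
    Node("Risk Factors", (10, 30)),
    Node("Financial Statements", (60, 110), [
        Node("Balance Sheet", (62, 64)),
        Node("Income Statement", (65, 67)),
    ]),
])

# Toy judge: word overlap between query and section title.
overlap = lambda q, t: len(set(q.lower().split()) & set(t.lower().split()))
hit = tree_search(toc, "income statement in the financial statements", overlap)
```

The retrieval result is a page range, not a chunk: the answer "where to read" replaces the answer "which embedding was nearest".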
3. Combined (parsing + retrieval bundled)
| Tool | What it bundles | Notes |
|---|---|---|
| RAG-Anything (HKUDS) | MinerU parsing + image captioning + Graph RAG + multimodal retrieval | [[multimodal-rag]] |
| PageIndex (self-hosted) | Tree-of-contents index + LLM tree-search retrieval; vision mode skips OCR | [[pageindex-vectorless-rag]] |
| Mafin 2.5 | PageIndex + reasoning retrieval; SOTA on FinanceBench (98.7%) | [[pageindex-vectorless-rag]] |
| Kreuzberg | All-formats parsing + OCR (146 LLM providers) + tree-sitter code intelligence + MCP server | [[kreuzberg-document-intelligence]] |
| OpenKB (VectifyAI) | Karpathy's "LLM Knowledge Bases" workflow as a CLI; markitdown for short docs + PageIndex for long PDFs; compiles a wiki the LLM maintains; Obsidian-compatible | [[openkb-knowledge-base]] |
4. Code intelligence (sibling field, same primitives)
When the "document" is a codebase, the same primitives apply, with tree-sitter as the shared parsing layer:
| Tool | One-line | Notes |
|---|---|---|
| Graphify | Tree-sitter (25 langs) + NetworkX + Leiden community detection; agent-native via PreToolUse hooks | [[code-to-knowledge-graph]] |
| Kreuzberg | Tree-sitter (248 langs); extracts functions / classes / imports / docstrings + syntax-aware chunking | [[kreuzberg-document-intelligence]] |
| Joern (CPG) | Code Property Graph (AST + CFG + PDG merged) for security and deep static analysis | not yet captured |
| GitNexus | MCP-native knowledge graph engine for agents | not yet captured |
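The payoff named in the table is syntax-aware chunking: split at function and class boundaries so no chunk cuts a definition in half. A sketch of that property using the stdlib `ast` module as a stand-in for a tree-sitter grammar (this covers Python only; tree-sitter is what generalizes it across languages):

```python
import ast

def syntax_aware_chunks(source: str) -> list[dict]:
    """Chunk a module at top-level def/class boundaries, keeping each
    definition whole and carrying its name and docstring as metadata."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "doc": ast.get_docstring(node),
                "text": ast.get_source_segment(source, node),
            })
    return chunks

code = '''
def parse(blob):
    """Turn bytes into blocks."""
    return blob.split()

class Indexer:
    """Builds the retrieval index."""
'''
chunks = syntax_aware_chunks(code)
```

Each chunk is then a self-describing retrieval unit, which is why the same primitive serves both document RAG and code RAG.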
5. Wiki hub notes
- [[pdf-parsing-landscape]] - the layered map of parsers
- [[rag-techniques]] - the catalog of retrieval techniques
- [[rag-learning-path]] - how to practice them
- [[multimodal-rag]] - non-text-only RAG
- [[graph-rag]] - graph-based retrieval
- [[code-to-knowledge-graph]] - the codebase analog
- [[kreuzberg-document-intelligence]] - the cross-cutting generalist tool
- [[openkb-knowledge-base]] - the corpus-level wiki compiler that mirrors this vault's own pattern
6. Counterargument worth knowing
VectifyAI's "Do We Still Need OCR?" makes an information-theoretic case that, for visually complex documents, OCR is non-invertible: it discards layout information, which caps the quality ceiling of any pipeline built on it. The proposed alternative: VLM-direct on page images, with [[pageindex-vectorless-rag|PageIndex]] tree-indexing as the vectorless retrieval layer. Recorded in the [[pdf-parsing-landscape#When not to OCR at all|landscape note]].
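The non-invertibility point can be made concrete with a toy example (all structures here are hypothetical): two visually different pages that serialize to identical text, so no downstream stage can recover which layout produced it.

```python
# Page 1: two-column layout. Page 2: single column with a pull-quote.
page_two_column = {"columns": [["Revenue rose."], ["Costs fell."]]}
page_pull_quote = {"body": ["Revenue rose.", "Costs fell."], "pull_quote": 1}

def ocr_linearize(page: dict) -> str:
    """Toy OCR: flatten whatever structure the page has into plain text."""
    if "columns" in page:
        return " ".join(line for col in page["columns"] for line in col)
    return " ".join(page["body"])

# Equal outputs from unequal inputs: the layout information is gone,
# and no retrieval layer downstream can get it back.
assert ocr_linearize(page_two_column) == ocr_linearize(page_pull_quote)
```

This is exactly the information a VLM-direct pipeline retains by operating on the page image itself.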