Document Intelligence — Archive

Tools and methods for turning documents into structured, queryable knowledge

reference · May 4, 2026

A structured index of every tool, model, and method in this vault that touches document intelligence. Three buckets: Ingestion (parse the document into structure), Retrieval (query the structure), Combined (frameworks that bundle both). Wiki links go to deeper notes.


1. Ingestion

The job decomposes into three layers (see [[pdf-parsing-landscape]]):

  • Recognition - pixels → characters (OCR; required for scanned docs).
  • Structure - page → hierarchy (titles, tables, reading order, bounding boxes).
  • Understanding - structure → downstream artifact (Q&A pairs, fields, tagged PDFs).
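
The three layers compose as a plain pipeline. The sketch below is a toy illustration only: `Block`, `recognize`, `structure`, and `understand` are hypothetical names, not the API of any tool listed here, and the recognition step is stubbed with text that is already decoded rather than a real OCR call.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One structural element of a page (hypothetical type for illustration)."""
    kind: str  # "title" or "paragraph"
    text: str


def recognize(page_pixels: list[str]) -> list[str]:
    """Recognition: pixels -> characters. Stubbed: each 'pixel row' is already a text line."""
    return [line.strip() for line in page_pixels if line.strip()]


def structure(lines: list[str]) -> list[Block]:
    """Structure: lines -> typed blocks in reading order (toy rule: all-caps lines are titles)."""
    return [Block("title" if line.isupper() else "paragraph", line) for line in lines]


def understand(blocks: list[Block]) -> dict[str, list[str]]:
    """Understanding: blocks -> downstream artifact (paragraphs grouped under their title)."""
    sections: dict[str, list[str]] = {}
    current = "UNTITLED"
    for block in blocks:
        if block.kind == "title":
            current = block.text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(block.text)
    return sections


page = ["INVOICE", "  Total due: 40 EUR ", "", "TERMS", "Net 30 days"]
result = understand(structure(recognize(page)))
print(result)
```

Each function consumes only the previous layer's output, which is why the tools in the tables below can be swapped in at one layer without touching the others.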

1.1 OCR + structure (specialized models)

| Tool | Size | Score | Notes |
| --- | --- | --- | --- |
| GLM-OCR (Zhipu AI) | 0.9B | 94.62 OmniDocBench v1.5 | Two-stage layout + parallel recognition; Multi-Token Prediction · [[pdf-parsing-landscape]] |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 | The OCR layer under MinerU / RAGFlow / OmniParser / Umi-OCR · [[paddleocr-document-parsing]] |
| PaddleOCR-VL | 0.9B | 92.86 | Predecessor · [[paddleocr-document-parsing]] |
| MinerU 2.5 | 1.2B | 90.67 | Sits on PaddleOCR · [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-3B | 3.7B | 88.85 | [[pdf-parsing-landscape]] |
| dots.ocr | 3B | 88.41 | [[pdf-parsing-landscape]] |
| MonkeyOCR-3B | 3.7B | 87.13 | [[pdf-parsing-landscape]] |
| Deepseek-OCR | 3B | 87.01 | [[pdf-parsing-landscape]] |
| MonkeyOCR-pro-1.2B | 1.9B | 86.96 | [[pdf-parsing-landscape]] |
| Nanonets-OCR-s | 3B | 85.59 | [[pdf-parsing-landscape]] |
| MinerU2-VLM | 0.9B | 85.56 | [[pdf-parsing-landscape]] |
| Dolphin-1.5 | 0.3B | 83.21 | [[pdf-parsing-landscape]] |
| olmOCR-7B | 7B | 81.79 | [[pdf-parsing-landscape]] |
| POINTS-Reader | 3B | 80.98 | [[pdf-parsing-landscape]] |
| Mistral OCR | hosted | 78.83 | [[pdf-parsing-landscape]] |
| OCRFlux-3B | 3B | 74.82 | [[pdf-parsing-landscape]] |
| Dolphin | 0.3B | 74.67 | [[pdf-parsing-landscape]] |
| Tesseract | open | baseline | Pre-LLM OCR; still inside many pipelines · [[pdf-parsing-landscape]] |

1.2 Structure-only (digital PDFs, OCR-free)

| Tool | One-line | Notes |
| --- | --- | --- |
| OpenDataLoader | Deterministic local + optional hybrid AI; accessibility auto-tagging (EAA / ADA / Section 508) | [[opendataloader-pdf-parser]] |
| PP-StructureV3 | Layout DAG pipeline | [[paddleocr-document-parsing]] |
| Marker | Pipeline parser | [[pdf-parsing-landscape]] |

1.3 Generalist frameworks (many formats)

| Tool | One-line | Notes |
| --- | --- | --- |
| Kreuzberg | Rust core, 91+ formats (PDF / Office / eBooks / LaTeX / Hangul / archives), 12 polyglot bindings, MCP server | [[kreuzberg-document-intelligence]] |

1.4 General VLMs used as OCR

| Tool | Score (OmniDocBench v1.5) | Notes |
| --- | --- | --- |
| Gemini-3 Pro | 90.33 | [[pdf-parsing-landscape]] |
| Qwen3-VL (235B) | 89.15 | [[pdf-parsing-landscape]] |
| Gemini-2.5 Pro | 88.03 | [[pdf-parsing-landscape]] |
| Qwen2.5-VL (72B) | 87.02 | [[pdf-parsing-landscape]] |
| GPT-5.2 | 85.50 | [[pdf-parsing-landscape]] |
| InternVL3.5 (241B) | 82.67 | [[pdf-parsing-landscape]] |
| InternVL3 (76B) | 80.33 | [[pdf-parsing-landscape]] |
| GPT-4o | 75.02 | [[pdf-parsing-landscape]] |

1.5 In-place augmentation (PDF stays a PDF)

| Tool | One-line | Notes |
| --- | --- | --- |
| OCRmyPDF | Adds a hidden Tesseract OCR text layer to scanned PDFs while preserving the original visual; defaults to PDF/A; powers Paperless-ngx and Nextcloud OCR | [[ocrmypdf-searchable-pdf]] |
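
A typical in-place run can be scripted by building the command line and handing it to `subprocess`. This is a minimal sketch: the file names and the helper names `build_ocrmypdf_command` and `run_ocr` are made up for illustration, and actually running it assumes OCRmyPDF and Tesseract are installed and on `PATH`.

```python
import subprocess


def build_ocrmypdf_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Build an OCRmyPDF invocation that adds a text layer and writes PDF/A."""
    return [
        "ocrmypdf",
        "--language", language,   # Tesseract language pack to use
        "--skip-text",            # leave pages that already contain text untouched
        "--output-type", "pdfa",  # spelled out here; PDF/A is also the default
        src,
        dst,
    ]


def run_ocr(src: str, dst: str) -> None:
    """Run OCRmyPDF and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)


# Hypothetical file names; only the command is printed, nothing is executed.
command = build_ocrmypdf_command("scan.pdf", "scan.searchable.pdf")
print(" ".join(command))
```

Because the output is still a PDF, the result can be fed to any parser in sections 1.1 to 1.4 afterwards.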

1.6 Visual diff (not a parser)

| Tool | One-line | Notes |
| --- | --- | --- |
| diff-pdf | Pixel-diffs two rendered PDFs for CI regression | [[diff-pdf-visual-compare]] |
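
The idea behind a pixel diff as a CI gate fits in a few lines. The sketch below is not diff-pdf's implementation: it assumes pages have already been rendered to grayscale grids, and `count_changed_pixels`, `pages_match`, `tolerance`, and `max_changed` are hypothetical names chosen for illustration.

```python
Page = list[list[int]]  # a rendered page as rows of 0-255 grayscale pixels


def count_changed_pixels(before: Page, after: Page, tolerance: int = 0) -> int:
    """Count pixels whose values differ by more than `tolerance`."""
    same_shape = len(before) == len(after) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(before, after)
    )
    if not same_shape:
        raise ValueError("pages must have identical dimensions")
    return sum(
        abs(p - q) > tolerance
        for row_a, row_b in zip(before, after)
        for p, q in zip(row_a, row_b)
    )


def pages_match(before: Page, after: Page, tolerance: int = 0, max_changed: int = 0) -> bool:
    """CI-style gate: pass when at most `max_changed` pixels moved past the tolerance."""
    return count_changed_pixels(before, after, tolerance) <= max_changed


baseline = [[255, 255, 255], [255, 0, 255]]
candidate = [[255, 255, 255], [255, 0, 250]]  # one pixel slightly darker

print(count_changed_pixels(baseline, candidate))
print(pages_match(baseline, candidate, tolerance=8))
```

A small tolerance is a common way to keep anti-aliasing noise from failing a build while still catching layout regressions.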

1.7 Benchmarks

| Benchmark | What it measures | Notes |
| --- | --- | --- |
| OmniDocBench v1.5 | Text similarity to reference; favors specialist OCR models | [[pdf-parsing-landscape]] |
| ParseBench (LlamaIndex) | Whether parsed output is reliable for autonomous agents; favors VLMs | [[pdf-parsing-landscape]] |
| FinanceBench | Q&A over financial filings; PageIndex / Mafin 2.5 = 98.7% SOTA | [[pageindex-vectorless-rag]] |
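
"Text similarity to reference" can be made concrete with a normalized edit distance. This is a generic sketch of that kind of metric, not OmniDocBench's actual scoring code, which is not reproduced here; `levenshtein` and `text_similarity` are illustrative names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def text_similarity(predicted: str, reference: str) -> float:
    """1.0 for identical strings, falling toward 0.0 as more edits are needed."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, reference) / longest


print(text_similarity("Total: 40 EUR", "Total: 48 EUR"))
```

A metric of this shape rewards character-exact transcription, which is consistent with the note that it favors specialist OCR models over output that is merely usable by an agent.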

2. Retrieval

The catalog ([[rag-techniques]]) organizes ~30 techniques into families.

| Family | Techniques | Notes |
| --- | --- | --- |
| Foundational | Simple RAG, Reliable RAG, Choose Chunk Size, Proposition Chunking, Simple RAG (CSV / JSON) | [[rag-techniques]] |
| Query enhancement | Query Transformations · HyDE (avoid in practice) | [[rag-techniques]] |
| Context enrichment | HyPE, Contextual Chunk Headers, Relevant Segment Extraction, Semantic Chunking, Contextual Compression, Document Augmentation | [[rag-techniques]] |
| Advanced retrieval | Fusion Retrieval, Intelligent Reranking, Hierarchical Indices, Multi-faceted Filtering, Dartboard, Multi-modal Retrieval | [[rag-techniques]] |
| Iterative + adaptive | Feedback Loops, Adaptive Retrieval, Self-RAG, Corrective RAG | [[rag-techniques]] |
| Memory-augmented | MemoRAG | [[rag-techniques]] |
| Graph-based | Graph RAG, Microsoft GraphRAG, Knowledge Graph Integration, RAPTOR | [[graph-rag]] |
| Vector + graph hybrid | LightRAG | [[lightrag-algorithm]] |
| Vectorless / tree-search | PageIndex - TOC tree + LLM tree reasoning, no embeddings | [[pageindex-vectorless-rag]] |
| Explainability | Explainable Retrieval | [[rag-techniques]] |
| Capstone | Sophisticated Controllable Agent for Complex RAG Tasks | [[rag-techniques]] |
| Evaluation | DeepEval, GroUSE, Open-RAG-Eval, End-to-End RAG Evaluation | [[rag-techniques]] |
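
The vectorless / tree-search row is the least familiar shape in this table, so here is a toy version of the control flow: walk a table-of-contents tree from the root, pick the most relevant child at each level, and return the leaf's page range. This is not PageIndex's API. `Node`, `relevance`, and `tree_search` are hypothetical, and the keyword-overlap scorer is a stand-in for the LLM's judgment so the sketch runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a table-of-contents tree."""
    title: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)


def relevance(query: str, node: Node) -> int:
    """Stand-in for the LLM call: count query words that appear in the node title."""
    title_words = set(node.title.lower().split())
    return sum(word in title_words for word in query.lower().split())


def tree_search(query: str, root: Node) -> Node:
    """Greedy descent: at each level, follow the child judged most relevant."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: relevance(query, child))
    return node


toc = Node("Annual Report", (1, 90), [
    Node("Business Overview", (1, 30)),
    Node("Financial Statements", (31, 90), [
        Node("Balance Sheet", (31, 50)),
        Node("Cash Flow Statement", (51, 90)),
    ]),
])

hit = tree_search("cash flow in the financial statements", toc)
print(hit.title, hit.pages)
```

No embeddings or vector index appear anywhere: the only state is the tree, and retrieval quality depends entirely on how well the scorer reasons about section titles.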

3. Combined (parsing + retrieval bundled)

| Tool | What it bundles | Notes |
| --- | --- | --- |
| RAG-Anything (HKUDS) | MinerU parsing + image captioning + Graph RAG + multimodal retrieval | [[multimodal-rag]] |
| PageIndex (self-hosted) | Tree-of-contents index + LLM tree-search retrieval; vision mode skips OCR | [[pageindex-vectorless-rag]] |
| Mafin 2.5 | PageIndex + reasoning retrieval; SOTA on FinanceBench (98.7%) | [[pageindex-vectorless-rag]] |
| Kreuzberg | All-formats parsing + OCR (146 LLM providers) + tree-sitter code intelligence + MCP server | [[kreuzberg-document-intelligence]] |
| OpenKB (VectifyAI) | Karpathy's "LLM Knowledge Bases" workflow as a CLI; markitdown for short docs + PageIndex for long PDFs; compiles a wiki the LLM maintains; Obsidian-compatible | [[openkb-knowledge-base]] |

4. Code intelligence (sibling field, same primitives)

When the "document" is a codebase, the same parse-then-query primitives apply, with tree-sitter as the parser underneath Graphify and Kreuzberg:

| Tool | One-line | Notes |
| --- | --- | --- |
| Graphify | Tree-sitter (25 langs) + NetworkX + Leiden community detection; agent-native via PreToolUse hooks | [[code-to-knowledge-graph]] |
| Kreuzberg | Tree-sitter (248 langs); extracts functions / classes / imports / docstrings + syntax-aware chunking | [[kreuzberg-document-intelligence]] |
| Joern (CPG) | Code Property Graph (AST + CFG + PDG merged) for security and deep static analysis | not yet captured |
| GitNexus | MCP-native knowledge graph engine for agents | not yet captured |
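
Extracting functions, classes, imports, and docstrings from a syntax tree looks roughly like this. To stay dependency-free, the sketch swaps in Python's standard-library `ast` module for tree-sitter, so it handles Python source only; `extract_symbols` and the sample `SOURCE` are illustrative and not taken from Graphify or Kreuzberg.

```python
import ast

SOURCE = '''
import os
from pathlib import Path


class Loader:
    """Load documents from disk."""

    def read(self, path):
        """Return the file contents."""
        return Path(path).read_text()


def main():
    print(os.getcwd())
'''


def extract_symbols(source: str) -> dict:
    """Walk the syntax tree and collect imports, classes, functions, and docstrings."""
    symbols = {"imports": [], "classes": [], "functions": [], "docstrings": {}}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            symbols["imports"] += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            symbols["imports"].append(node.module or ".")  # "." for bare relative imports
        elif isinstance(node, ast.ClassDef):
            symbols["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append(node.name)
        # Classes and functions may carry a docstring; record it by symbol name.
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                symbols["docstrings"][node.name] = doc
    return symbols


symbols = extract_symbols(SOURCE)
print(symbols)
```

The collected symbols are the natural nodes of a code knowledge graph, and import or call relationships between them would supply the edges.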

5. Wiki hub notes

  • [[pdf-parsing-landscape]] - the layered map of parsers
  • [[rag-techniques]] - the catalog of retrieval techniques
  • [[rag-learning-path]] - how to practice them
  • [[multimodal-rag]] - non-text-only RAG
  • [[graph-rag]] - graph-based retrieval
  • [[code-to-knowledge-graph]] - the codebase analog
  • [[kreuzberg-document-intelligence]] - the cross-cutting generalist tool
  • [[openkb-knowledge-base]] - the corpus-level wiki compiler that mirrors this vault's own pattern

Counterargument worth knowing

VectifyAI's "Do We Still Need OCR?" makes an information-theoretic case that, for visually complex documents, OCR is non-invertible: it discards layout and visual information that cannot be recovered from the text, which caps the accuracy of any pipeline built on it. The proposed alternative is to run a VLM directly on page images, with [[pageindex-vectorless-rag|PageIndex]] tree-indexing as the vectorless retrieval layer. Recorded in the [[pdf-parsing-landscape#When not to OCR at all|landscape note]].
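
One standard way to state a bound of this kind is the data processing inequality. Whether the article frames it exactly this way is an assumption, and the symbols are introduced here for illustration: let $D$ be the page image, $T$ the OCR text, and $A$ the answer a pipeline produces from $T$ alone, so that $D \to T \to A$ forms a Markov chain.

```latex
D \to T \to A
\quad\Longrightarrow\quad
I(D; A) \;\le\; I(D; T)
```

Nothing computed from $T$ can carry more information about the page than $T$ itself does. If OCR were invertible, $T$ would preserve everything in $D$ and the bound would cost nothing; when it drops layout or visual cues, every downstream step inherits that loss.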