Your Data Science Resource Deck

The field moves fast, but fast isn't always the right direction. We hand-pick the books, tools, and paths that actually compound: slower than chasing every trend, but further in the end.

Featured Resources

LangExtract — Source-Grounded LLM Information Extraction (Google)
GitHub

A Google-maintained Python library that uses LLMs to pull structured fields out of unstructured text. Rather than letting the model hallucinate where each value came from, every extraction is mapped back to its exact character span in the source — so you can highlight, verify, and audit it instead of trusting it.

llm · information-extraction · structured-output · source-grounding
Intermediate

LangExtract is a package that extracts structured information from documents and maps it back to the exact location in the source text — then renders the result as an interactive HTML viewer. What made it succeed is how cleanly it taps into a real anxiety: when GPT gives you an answer without telling you where it came from, the first instinct is to doubt it. So you predefine the fields you want, the LLM pulls them out, and every extraction shows up alongside the evidence — the exact span of source it came from.

When to use

Reach for this when your extraction pipeline keeps shipping fields the LLM 'almost' got right, and you need every value to arrive with a verifiable receipt — exact character offsets in the source — instead of trust.

How to use

`pip install langextract`, define extraction instructions plus a few-shot example, then call `lx.extract(text=..., prompt_description=..., examples=...)` with a Gemini / OpenAI / Ollama model id. Render the result with `lx.visualize(...)` to get an interactive HTML viewer with source highlighting.
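
A minimal end-to-end sketch of that flow, following the README's documented call shape (the "medication" example fields are illustrative, and parameter names can shift between releases):

```python
import langextract as lx

# One few-shot example teaches the model the fields to extract
# (the "medication" class and "dosage" attribute are illustrative).
examples = [
    lx.data.ExampleData(
        text="Take 200 mg ibuprofen twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen",
                attributes={"dosage": "200 mg"},
            )
        ],
    )
]

result = lx.extract(
    text_or_documents="Patient was given 500 mg amoxicillin.",
    prompt_description="Extract medication names and their dosages.",
    examples=examples,
    model_id="gemini-2.5-flash",  # or an OpenAI / Ollama model id
)

# Each extraction carries character offsets back into the source text,
# which is exactly what the interactive HTML viewer highlights.
for e in result.extractions:
    print(e.extraction_class, e.extraction_text, e.char_interval)
```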

Time investment

1 hour to wire up

EdgeQuake — High-Performance GraphRAG in Rust
GitHub

LightRAG-style GraphRAG, rewritten in Rust. Documents become a knowledge graph of entities and relationships; queries traverse both vector space and graph structure.

graph-rag · knowledge-graph · rust · lightrag
Intermediate

EdgeQuake is a package that implements the LightRAG algorithm in async Rust. Six query modes cover the spectrum — naive (keyword-style vector search), local (entity-centered local graph), global (community-based, thematic), the default hybrid (local + global combined), mix (weighted combination), and bypass (skip RAG, query the LLM directly) — each picked for the speed/cost tradeoff you need. On top of that, results are easy to inspect through the built-in web visualization, and an MCP server is included so agents like Claude and Cursor can call EdgeQuake directly.

When to use

Reach for this when vector RAG plateaus on multi-hop reasoning ('how does X relate to Y through Z?') or thematic questions, and you'd rather adopt a production-shaped GraphRAG stack than glue one together yourself.

How to use

Easiest path: `curl -fsSL https://raw.githubusercontent.com/raphaelmansuy/edgequake/edgequake-main/quickstart.sh | sh`. Pick OpenAI or Ollama in the wizard, open localhost:3000, drop in a PDF, and inspect the resulting graph in the Sigma.js view.

Time investment

1 hour setup

Kreuzberg — Rust Core + 12 Language Bindings + MCP Server
GitHub

A Rust-core document intelligence framework that extracts text, OCR output, and code intelligence from 91+ file formats under one unified call. Rather than locking you into Python, it ships native bindings for languages such as Ruby, Go, Java, and Elixir, plus a CLI and an MCP server any agent host can call.

document-intelligence · ocr · pdf · rust
Intermediate

Most document-intelligence stacks force you into Python. Kreuzberg refuses: the engine is Rust, but it ships as native bindings for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/WASM/Deno) — and if none of those fit, it runs as a CLI, REST server, or MCP server that any agent host can call. The output shape is unified: every input — PDF, .docx, .epub, .hwp, .ipynb, an entire codebase — comes back as one `ExtractionResult` with `code_intelligence` and `semantic_chunking` fields, so a 'PDF for RAG' pipeline and a 'codebase for an agent' pipeline use the same call. OCR is pluggable across Tesseract (incl. WASM), PaddleOCR, EasyOCR, and VLM-OCR through 146 providers (GPT-4o, Claude, Gemini, Ollama, vLLM, llama.cpp) — stay local or route through a frontier model with one config flip. Two design quirks worth borrowing: a TOON wire format that's 30–50% smaller than JSON for the same payload, and HTML→Markdown via direct conversion instead of an AST round-trip.

When to use

Reach for this when you need to parse many document formats (not just PDF), you're working in a non-Python language and tired of being a second-class citizen, or you want one MCP server doing both document parsing and code intelligence for an agent.

How to use

Python: `pip install kreuzberg`, then `extract(file_path)`. Or skip bindings entirely — `npx skills add kreuzberg-dev/kreuzberg` installs the agent skill so Claude Code, Cursor, or Codex can call it directly.
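
As a sketch, assuming the `extract(file_path)` call quoted above (earlier Python releases exposed `extract_file` / `extract_file_sync`, so confirm the helper name against your version's docs):

```python
from kreuzberg import extract  # assumption: top-level helper as quoted above

# One unified call: PDF, .docx, .epub, .ipynb, or a whole codebase all
# come back as the same ExtractionResult shape.
result = extract("report.pdf")
print(result.content)   # extracted text / Markdown
print(result.metadata)  # document metadata
```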

Time investment

30 min to wire up

PageIndex — Vectorless, Reasoning-Based RAG
GitHub

Drop the vector DB. Build a table-of-contents tree over your PDF and let the LLM reason its way to the right section — same idea as AlphaGo's tree search.

rag · vectorless · tree-search · long-document
Advanced

On professional documents (10-K filings, contracts, technical manuals), similarity ≠ relevance — what you actually need isn't embedding-based distance, it's logical reasoning. PageIndex strips out the vector pipeline entirely: it extracts a hierarchical table-of-contents tree from the PDF, then an agent reads node summaries and walks the tree to the right section, leaving a justification at every step. The retrieval trail itself becomes evidence for *why* a particular section was chosen, so results carry a credibility log instead of just a similarity score. The cost is honest — every query takes at least one LLM call, often several, so latency and per-query cost go up. Even so, the performance is real: PageIndex hit 98.7% on FinanceBench, beating vector-based RAG.
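
To make the mechanism concrete, here is a hypothetical sketch of that tree walk (not PageIndex's actual API or tree schema, just the reasoning loop described above):

```python
def walk_toc_tree(node: dict, question: str, ask_llm) -> tuple[dict, list[str]]:
    """Descend a TOC tree: at each level the LLM reads child summaries,
    picks a branch, and its stated reason joins the audit trail.
    `node`, `ask_llm`, and the field names are hypothetical."""
    trail = [f"root: {node['title']}"]
    while node.get("children"):
        menu = "\n".join(
            f"{i}: {c['title']} - {c['summary']}"
            for i, c in enumerate(node["children"])
        )
        reply = ask_llm(
            f"Question: {question}\nSections:\n{menu}\n"
            "Reply with '<index>: <one-line justification>'."
        )
        idx, _, reason = reply.partition(":")  # crude parse, fine for a sketch
        node = node["children"][int(idx.strip())]
        trail.append(f"{node['title']} ({reason.strip()})")
    return node, trail  # leaf section plus the credibility log
```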

When to use

Reach for this when your corpus is long, structured, and audit-sensitive — financial filings, contracts, regulatory PDFs — and your vector RAG keeps surfacing the *similar* chunk instead of the *relevant* section.

How to use

Clone, set an LLM key, run `python3 run_pageindex.py --pdf_path <doc>` to build the tree. Or skip the install and hit the hosted API/MCP at pageindex.ai/developer.

Time investment

2 hours to wire up

OpenDataLoader — PDF Parser for RAG and Accessibility
GitHub

An Apache-2.0 PDF parser that extracts Markdown, HTML, or JSON-with-bboxes locally, with no LLM in the loop. For pages with scans, complex tables, or formulas, an optional hybrid mode routes through a local AI backend that adds OCR, formula extraction, and chart description.

pdf · ocr · rag · parsing
Intermediate

OpenDataLoader sits in the same slot as MinerU, Docling, or LlamaParse, with two unusual bets that change how it feels in production. First, the default mode is fully local and deterministic: `pip install`, call `convert(...)`, and get reproducible Markdown / HTML / JSON-with-bboxes from an XY-Cut++ reading-order pass — no LLM, no network. Second, the same layout engine doubles as a PDF accessibility auto-tagger (the team claims the first OSS end-to-end Tagged PDF generator), so the structure you extract for RAG is the same structure that satisfies WCAG-style remediation. The hybrid mode adds OCR (80+ languages), formula extraction, and chart description by routing pages through a local AI backend, and it's the configuration that wins the team's 200-PDF benchmark at 0.907 overall. JSON output ships per-element bounding boxes, so downstream chunks can carry source-citation rectangles without any post-hoc layout inference.

When to use

Reach for this when you want a parser that stays local by default, gives you bbox-anchored citations for free, and you don't want to choose between RAG-quality structure and accessibility tagging.

How to use

`pip install opendataloader-pdf`, then `convert(input='doc.pdf', output_dir='out')` — that's the deterministic mode. For scans, formulas, or charts, run `opendataloader-pdf-hybrid --port 5002` first and pass `--hybrid` to opt in.
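
In a script, the deterministic path is a sketch like this (argument names follow the call quoted above; confirm against the package docs for your release):

```python
from opendataloader_pdf import convert  # assumption: import path mirrors the pip name

# Deterministic local mode: no LLM, no network, reproducible output.
convert(
    input="doc.pdf",
    output_dir="out",  # Markdown / HTML / JSON with per-element bounding boxes
)
```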

Time investment

30 min to wire up

GLM-OCR — 0.9B Compact OCR VLM, #1 on OmniDocBench
GitHub

A 0.9B-parameter OCR VLM from Zhipu AI that parses documents into structured Markdown and JSON. Rather than scaling up like frontier OCR models, it pairs a layout detector with multi-token prediction to top OmniDocBench v1.5 (94.6) while staying small enough to run at the edge.

ocr · vlm · document-parsing · edge-deployment
Intermediate

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. A two-stage pipeline pairs PP-DocLayout-V3 layout analysis with parallel recognition, delivering robust, high-quality OCR across diverse document layouts at only 0.9B parameters.

When to use

Reach for this when you want a small OCR model you can actually deploy at the edge or in a high-concurrency service, and you're tired of paying frontier-VLM prices for what's structurally a deterministic task.

How to use

Pull from HuggingFace (`zai-org/GLM-OCR`) and serve with vLLM, SGLang, or Ollama. Fastest sanity check: try the demo at ocr.z.ai before wiring anything up.
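
If you go the vLLM route, the serve-then-query loop looks like this sketch; `vllm serve` exposes an OpenAI-compatible endpoint, and the prompt wording here is an assumption, so check the model card for the expected format:

```python
# First, in a shell: vllm serve zai-org/GLM-OCR
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/scanned-page.png"}},
            {"type": "text",
             "text": "Parse this page into structured Markdown."},  # assumed prompt
        ],
    }],
)
print(response.choices[0].message.content)
```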

Time investment

1-2 hours to deploy

Learning Paths
