Build a Traversable Index

Two paths past the MECE wall - document-level and entity-level - share the same move: replace flat similarity with structure you can actually walk.

8 min · May 4, 2026

Most discussions of RAG start with retrieval quality and end with the same observation: a well-written question against a thoughtfully embedded corpus still returns mediocre results. The instinct is to fix this with better embeddings, better chunkers, better rerankers. Each helps a little. None of them touch the actual problem.

The actual problem is that embedding vectors capture surface similarity, but the relationships that matter in real documents are structural.

What gets lost when you embed a document

When you chunk a document and embed each chunk, you preserve one signal: how textually close one chunk is to another in some learned semantic space. You discard everything else. The hierarchy the author wrote into the document - title, section, subsection, paragraph - flattens. The fact that chapter 3 follows from chapter 2 disappears. The fact that section 4.2 contradicts section 1.3 is invisible. All you keep is "these two passages use similar words."

For some queries this is enough. For most professional documents - regulatory filings, contracts, technical manuals, research papers - it isn't.

The MECE wall

Here is the cleanest way to see why. A well-written article is structured by some logical framework, and the most common framework is MECE: Mutually Exclusive, Collectively Exhaustive. The sub-topics under a main topic are deliberately designed not to overlap, because overlap is a sign of muddled thinking. The whole point of decomposing a problem is that each branch covers different ground.

Now ask: what does this do to vector similarity?

If the sub-categories are mutually exclusive by design, their direct embedding similarity will be low. They share a parent, but no surface vocabulary should connect them - the author worked to keep them separate. The only path between sibling sub-categories runs upward through the parent topic. An embedding cosine measure has no way to see that path. It will rank the two siblings as unrelated, even though understanding either one requires understanding both.

Cosine similarity sees siblings as strangers. The document's structure says they're family.

Vectors don't walk

Both serious responses to the MECE wall make the same move, and one observation ties them together: embeddings give you a flat similarity space; what you actually need is something you can traverse.

A vector store lets you ask "what's near this point?" That's it. You can compare two locations, but you can't walk anywhere. There's no notion of parent, child, sibling, neighbor-by-citation, member-of-the-same-cluster. Just distance.

What documents and corpora have, that vector spaces don't, is structure that can be walked. A table of contents lets you descend from chapter to section to subsection. An entity graph lets you hop from one concept along a labeled edge to another. A citation network lets you follow a reference. None of these are similarity measurements. They're traversable indexes.
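To make the distinction concrete, here is the entire question a vector index can answer, next to the moves a traversable index supports. A minimal sketch with toy data - neither structure is any particular library's API:

```python
import numpy as np

# A vector index answers exactly one question: what's near this point?
def nearest(query_vec: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]  # distances - no parents, siblings, or edges

# A traversable index answers with moves. Toy adjacency, not a real corpus:
toc = {"report": ["1 revenue", "2 costs", "3 balance sheet"]}   # descend
citations = {"paper A": ["paper B"], "paper B": ["paper C"]}    # follow

children = toc["report"]          # chapter -> sections
next_hop = citations["paper A"]   # one hop along a labeled edge
```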

So the move is uniform across both responses to RAG's structural failure: build an index or map - something traversable - that recovers the relationships embeddings flatten. The two paths differ only in what they build the index over.

Two indexes you can walk

Document-level index (PageIndex)

[[pageindex-vectorless-rag|PageIndex]] preserves the index the document already has: the author's outline. It builds a hierarchical "table of contents" tree, with summaries at each node, and lets an LLM reason its way down the tree to find the relevant section. No chunking, no embeddings, no vector database. The retrieval engine is the LLM's logical reasoning, applied to a structure that was already in the document.

The bet: most professional documents already encode their MECE structure in their outline. You just have to stop throwing it away. When the LLM reads "section 1: revenue, section 2: costs, section 3: balance sheet," it knows to descend into section 1 for a revenue question without needing the word "revenue" to pattern-match against anything. The TOC is the index. The traversal is the LLM walking the tree.
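A minimal sketch of that traversal, assuming a node schema with a title, a summary, and children, plus an `ask_llm` callable wrapping any chat-completion API - illustrative, not PageIndex's actual interface:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    title: str
    summary: str                    # written once, at index time
    children: list["Node"] = field(default_factory=list)

def retrieve(root: Node, question: str, ask_llm: Callable[[str], str]) -> Node:
    """Descend the TOC tree by reasoning, not by vector distance."""
    node = root
    while node.children:
        menu = "\n".join(f"{i}: {c.title} - {c.summary}"
                         for i, c in enumerate(node.children))
        prompt = (f"Question: {question}\nSections:\n{menu}\n"
                  "Answer with the number of the most relevant section.")
        node = node.children[int(ask_llm(prompt))]
    return node  # a leaf: the retrieval unit is a section, not a chunk
```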

PageIndex's own argument goes further: even OCR is information-lossy because converting a 2D visual document to 1D text discards spatial structure. The vision-mode variant skips OCR entirely - the LLM walks the tree, picks pages, and a VLM reads each page as an image.

What gets built: a tree. What walks it: an LLM, by reasoning.

Entity-level map (Graph RAG)

[[graph-rag|Graph RAG]] doesn't trust the document's outline to encode the relationships you need - either because the outline is weak, or because the relationships span multiple documents that don't share an outline at all. So instead of preserving structure, it constructs one: an explicit graph of entities and the relationships between them.

[[lightrag-algorithm|LightRAG]] is the canonical recipe. An LLM reads each chunk during indexing and extracts (entity, type, description) tuples and (source, target, keywords, description) relationship tuples. The output is a real knowledge graph stored alongside the vectors. At query time, retrieval blends modes: vector search to find relevant entities, then graph traversal from those entities to pull in connected ones - even ones whose embeddings would never have surfaced them. Multi-hop queries ("how does X relate to Y through Z?") work because the path actually exists in the graph. Thematic queries ("what are the major themes?") work because community detection on the graph clusters entities that don't share words but share relationships.
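A sketch of the indexing half of that recipe. The tuple shapes follow the description above; networkx as the graph store and the `extract` stub are assumptions, not LightRAG's actual code:

```python
import networkx as nx

def extract(chunk: str):
    """Stand-in for the LLM extraction pass: returns
    (entity, type, description) tuples and
    (source, target, keywords, description) tuples."""
    raise NotImplementedError

def build_graph(chunks: list[str]) -> nx.Graph:
    G = nx.Graph()
    for chunk in chunks:
        entities, relations = extract(chunk)
        for name, etype, desc in entities:
            if G.has_node(name):  # merge repeat sightings of an entity
                G.nodes[name]["description"] += " " + desc
            else:
                G.add_node(name, type=etype, description=desc)
        for src, tgt, keywords, desc in relations:
            G.add_edge(src, tgt, keywords=keywords, description=desc)
    return G
```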

What gets built: a graph. What walks it: traversal algorithms (with vectors as a coarse first lookup).

Same move, different index

The two paths look opposed, but they're instances of the same move - build something traversable - applied at different scales:

| | Document-level (PageIndex) | Entity-level (Graph RAG / LightRAG) |
| --- | --- | --- |
| What gets built | A tree (the document's TOC) | A graph (extracted entities + relationships) |
| Where the index comes from | The author wrote it | An LLM extracted it during indexing |
| What walks it | LLM logical reasoning, top-down | Graph traversal (often vector-seeded) |
| Smallest retrieval unit | Section / page | Entity / relationship |
| Best on | Long, well-structured single docs | Cross-document corpora, sparse facts |
| Worst on | Documents with no real hierarchy | Documents with strong hierarchy that's wasted |
| Vectors needed? | None | Yes, for entity disambiguation and seeding |

PageIndex trusts the author's index. Graph RAG distrusts it and builds its own map. Both reject the premise that flat embeddings on flat chunks can recover what the document originally encoded.

What this implies for picking a stack

If the corpus is a single long document with real structure - a 10-K filing, a textbook, a regulatory manual - PageIndex is the right call. The TOC tree already exists; preserve it. Mafin 2.5 hits 98.7% on FinanceBench using exactly this pattern.

If the corpus is many shorter documents with implicit cross-references - a research literature, a wiki, a knowledge base - Graph RAG / LightRAG is the right call. There's no single outline to lean on, but there are entities and relationships across documents that the LLM can extract once and traverse forever after.

If the corpus is everything you've collected and you want to keep adding to it - the [[openkb-knowledge-base|Karpathy "LLM Knowledge Bases" pattern]] - both apply. OpenKB uses PageIndex for long PDFs and an LLM-compiled wiki of cross-document concept pages for the corpus level. Document structure where it exists; entity structure where it doesn't.

If the corpus is a codebase, the same logic applies. [[code-to-knowledge-graph|Graphify]] is Graph RAG for code - tree-sitter extracts the AST, Leiden community detection finds the modular structure, the agent reads the graph instead of grepping the files.
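The shape of that pipeline, sketched with stand-ins: the stdlib `ast` module in place of tree-sitter, and networkx's Louvain in place of Leiden (both swaps are for brevity, not fidelity):

```python
import ast
import networkx as nx
from networkx.algorithms.community import louvain_communities

def call_graph(source: str) -> nx.Graph:
    """Function-level call graph from one Python source string."""
    G = nx.Graph()
    tree = ast.parse(source)
    for fn in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        G.add_node(fn.name)
        for call in (n for n in ast.walk(fn) if isinstance(n, ast.Call)):
            if isinstance(call.func, ast.Name):  # direct calls only
                G.add_edge(fn.name, call.func.id)
    return G

src = """
def load(path): return parse(open(path).read())
def parse(text): return text.split()
def report(data): print(summarize(data))
def summarize(data): return len(data)
"""
G = call_graph(src)
# Communities in the call graph = candidate modules the agent can read.
for community in louvain_communities(G, seed=0):
    print(sorted(community))
```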

What embeddings are still good for

None of this means embeddings are dead. They remain the right tool for the surface-similarity job they were designed for: finding semantically near passages, computing chunk-level relevance, providing fast first-pass retrieval that a structural layer can refine. The mistake was treating them as the entire pipeline. They're a coarse layer underneath a structural one, not a substitute for structure.
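That layering is short to write down. A sketch of vector-seeded traversal over the entity graph from the earlier sketch, assuming an `embed` function and precomputed per-entity vectors:

```python
import numpy as np
import networkx as nx

def retrieve(question: str, G: nx.Graph, embed,
             entity_vecs: dict[str, np.ndarray], k: int = 5) -> set[str]:
    """Embeddings as the coarse first pass, traversal as the structural one."""
    q = embed(question)

    def score(v: np.ndarray) -> float:
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    # 1. Flat similarity finds the entry points...
    seeds = sorted(entity_vecs, key=lambda e: -score(entity_vecs[e]))[:k]
    # 2. ...and one graph hop pulls in related entities that
    #    cosine similarity alone would never have surfaced.
    hits = set(seeds)
    for s in seeds:  # entity_vecs is keyed by graph nodes, so seeds are in G
        hits.update(G.neighbors(s))
    return hits
```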

The architectural shift in 2026 RAG is not "find a better embedding model." It's build an index you can walk - whether the index is a tree the author already wrote (PageIndex) or a graph the LLM extracts (Graph RAG / LightRAG) - and stop asking cosine similarity to do work that traversal was always supposed to do.

Related

  • [[pageindex-vectorless-rag]] - the document-level path, with the OCR-is-information-lossy argument
  • [[graph-rag]] - the entity-level path, broader pattern
  • [[lightrag-algorithm]] - the canonical entity+relationship graph RAG recipe
  • [[rag-techniques]] - the catalog of ~30 retrieval techniques, organized by goal
  • [[multimodal-rag]] - RAG-Anything: graph + multimodal applied to documents with figures and tables
  • [[openkb-knowledge-base]] - the corpus-level wiki compiler that uses PageIndex for long docs
  • [[code-to-knowledge-graph]] - the same pattern, applied to code instead of documents
  • [[document-intelligence-archive|Document Intelligence Archive]] - the full toolset index this essay's argument runs through