Build a Traversable Index

Two paths past the MECE wall - document-level and entity-level - share the same move: replace flat similarity with structure you can actually walk.

8 min · May 4, 2026

Most discussions of RAG start with retrieval quality and end with the same observation: a well-written question against a thoughtfully embedded corpus still returns mediocre results. The instinct is to fix this with better embeddings, better chunkers, better rerankers. Each helps a little. None of them touch the actual problem.

The actual problem is that embedding vectors capture surface similarity, but the relationships that matter in real documents are structural.

What gets lost when you embed a document

When you chunk a document and embed each chunk, you preserve one signal: how textually close one chunk is to another in some learned semantic space. You discard everything else. The hierarchy the author wrote into the document - title, section, subsection, paragraph - flattens. The fact that chapter 3 follows from chapter 2 disappears. The fact that section 4.2 contradicts section 1.3 is invisible. All you keep is "these two passages use similar words."

For some queries this is enough. For most professional documents - regulatory filings, contracts, technical manuals, research papers - it isn't.

The MECE wall

Here is the cleanest way to see why. A well-written article is structured by some logical framework, and the most common framework is MECE: Mutually Exclusive, Collectively Exhaustive. The sub-topics under a main topic are deliberately designed not to overlap, because overlap is a sign of muddled thinking. The whole point of decomposing a problem is that each branch covers different ground.

Now ask: what does this do to vector similarity?

If the sub-categories are mutually exclusive by design, their direct embedding similarity will be low. They share a parent, but no surface vocabulary should connect them - the author worked to keep them separate. The only path between sibling sub-categories runs upward through the parent topic. An embedding cosine measure has no way to see that path. It will rank the two siblings as unrelated, even though understanding either one requires understanding both.

Cosine similarity sees siblings as strangers. The document's structure says they're family.

Vectors don't walk

Both serious responses to the MECE wall make the same move, and one observation ties them together: embeddings give you a flat similarity space; what you actually need is something you can traverse.

A vector store lets you ask "what's near this point?" That's it. You can compare two locations, but you can't walk anywhere. There's no notion of parent, child, sibling, neighbor-by-citation, member-of-the-same-cluster. Just distance.

What documents and corpora have, that vector spaces don't, is structure that can be walked. A table of contents lets you descend from chapter to section to subsection. An entity graph lets you hop from one concept along a labeled edge to another. A citation network lets you follow a reference. None of these are similarity measurements. They're traversable indexes.
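To make the distinction concrete, here is the entire question a vector index can answer, next to the moves a traversable index supports. A minimal sketch with toy data - neither structure is any particular library's API:

```python
import numpy as np

# A vector index answers exactly one question: what's near this point?
def nearest(query_vec: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    sims = index @ query_vec / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]  # distances - no parents, siblings, or edges

# A traversable index answers with moves. Toy adjacency, not a real corpus:
toc = {"report": ["1 revenue", "2 costs", "3 balance sheet"]}   # descend
citations = {"paper A": ["paper B"], "paper B": ["paper C"]}    # follow

children = toc["report"]          # chapter -> sections
next_hop = citations["paper A"]   # one hop along a labeled edge
```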

So the move is uniform across both responses to RAG's structural failure: build an index or map - something traversable - that recovers the relationships embeddings flatten. The two paths differ only in what they build the index over.

Two indexes you can walk

Document-level index (PageIndex)

[[pageindex-vectorless-rag|PageIndex]] preserves the index the document already has: the author's outline. It builds a hierarchical "table of contents" tree, with summaries at each node, and lets an LLM reason its way down the tree to find the relevant section. No chunking, no embeddings, no vector database. The retrieval engine is the LLM's logical reasoning, applied to a structure that was already in the document.

The bet: most professional documents already encode their MECE structure in their outline. You just have to stop throwing it away. When the LLM reads "section 1: revenue, section 2: costs, section 3: balance sheet," it knows to descend into section 1 for a revenue question without needing the word "revenue" to pattern-match against anything. The TOC is the index. The traversal is the LLM walking the tree.
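A minimal sketch of that traversal, assuming a node schema with a title, a summary, and children, plus an `ask_llm` callable wrapping any chat-completion API - illustrative, not PageIndex's actual interface:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    title: str
    summary: str                    # written once, at index time
    children: list["Node"] = field(default_factory=list)

def retrieve(root: Node, question: str, ask_llm: Callable[[str], str]) -> Node:
    """Descend the TOC tree by reasoning, not by vector distance."""
    node = root
    while node.children:
        menu = "\n".join(f"{i}: {c.title} - {c.summary}"
                         for i, c in enumerate(node.children))
        prompt = (f"Question: {question}\nSections:\n{menu}\n"
                  "Answer with the number of the most relevant section.")
        node = node.children[int(ask_llm(prompt))]
    return node  # a leaf: the retrieval unit is a section, not a chunk
```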

PageIndex's own argument goes further: even OCR is information-lossy because converting a 2D visual document to 1D text discards spatial structure. The vision-mode variant skips OCR entirely - the LLM walks the tree, picks pages, and a VLM reads each page as an image.

What gets built: a tree. What walks it: an LLM, by reasoning.

Entity-level map (Graph RAG)

[[graph-rag|Graph RAG]] doesn't trust the document's outline to encode the relationships you need - either because the outline is weak, or because the relationships span multiple documents that don't share an outline at all. So instead of preserving structure, it constructs one: an explicit graph of entities and the relationships between them.

[[lightrag-algorithm|LightRAG]] is the canonical recipe. An LLM reads each chunk during indexing and extracts (entity, type, description) tuples and (source, target, keywords, description) relationship tuples. The output is a real knowledge graph stored alongside the vectors. At query time, retrieval blends modes: vector search to find relevant entities, then graph traversal from those entities to pull in connected ones - even ones whose embeddings would never have surfaced them. Multi-hop queries ("how does X relate to Y through Z?") work because the path actually exists in the graph. Thematic queries ("what are the major themes?") work because community detection on the graph clusters entities that don't share words but share relationships.
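A sketch of the indexing half of that recipe. The tuple shapes follow the description above; networkx as the graph store and the `extract` stub are assumptions, not LightRAG's actual code:

```python
import networkx as nx

def extract(chunk: str):
    """Stand-in for the LLM extraction pass: returns
    (entity, type, description) tuples and
    (source, target, keywords, description) tuples."""
    raise NotImplementedError

def build_graph(chunks: list[str]) -> nx.Graph:
    G = nx.Graph()
    for chunk in chunks:
        entities, relations = extract(chunk)
        for name, etype, desc in entities:
            if G.has_node(name):  # merge repeat sightings of an entity
                G.nodes[name]["description"] += " " + desc
            else:
                G.add_node(name, type=etype, description=desc)
        for src, tgt, keywords, desc in relations:
            G.add_edge(src, tgt, keywords=keywords, description=desc)
    return G
```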

What gets built: a graph. What walks it: traversal algorithms (with vectors as a coarse first lookup).

Same move, different index

The two paths look opposed, but they're instances of the same move - build something traversable - applied at different scales:

| | Document-level (PageIndex) | Entity-level (Graph RAG / LightRAG) |
| --- | --- | --- |
| What gets built | A tree (the document's TOC) | A graph (extracted entities + relationships) |
| Where the index comes from | The author wrote it | An LLM extracted it during indexing |
| What walks it | LLM logical reasoning, top-down | Graph traversal (often vector-seeded) |
| Smallest retrieval unit | Section / page | Entity / relationship |
| Best on | Long, well-structured single docs | Cross-document corpora, sparse facts |
| Worst on | Documents with no real hierarchy | Documents with strong hierarchy that's wasted |
| Vectors needed? | None | Yes, for entity disambiguation and seeding |

PageIndex trusts the author's index. Graph RAG distrusts it and builds its own map. Both reject the premise that flat embeddings on flat chunks can recover what the document originally encoded.

What this implies for picking a stack

If the corpus is a single long document with real structure - a 10-K filing, a textbook, a regulatory manual - PageIndex is the right call. The TOC tree already exists; preserve it. Mafin 2.5 hits 98.7% on FinanceBench using exactly this pattern.

If the corpus is many shorter documents with implicit cross-references - a research literature, a wiki, a knowledge base - Graph RAG / LightRAG is the right call. There's no single outline to lean on, but there are entities and relationships across documents that the LLM can extract once and traverse forever after.

If the corpus is everything you've collected and you want to keep adding to it - the [[openkb-knowledge-base|Karpathy "LLM Knowledge Bases" pattern]] - both apply. OpenKB uses PageIndex for long PDFs and an LLM-compiled wiki of cross-document concept pages for the corpus level. Document structure where it exists; entity structure where it doesn't.

If the corpus is a codebase, the same logic applies. [[code-to-knowledge-graph|Graphify]] is Graph RAG for code - tree-sitter extracts the AST, Leiden community detection finds the modular structure, the agent reads the graph instead of grepping the files.
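The shape of that pipeline, sketched with stand-ins: the stdlib `ast` module in place of tree-sitter, and networkx's Louvain in place of Leiden (both swaps are for brevity, not fidelity):

```python
import ast
import networkx as nx
from networkx.algorithms.community import louvain_communities

def call_graph(source: str) -> nx.Graph:
    """Function-level call graph from one Python source string."""
    G = nx.Graph()
    tree = ast.parse(source)
    for fn in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        G.add_node(fn.name)
        for call in (n for n in ast.walk(fn) if isinstance(n, ast.Call)):
            if isinstance(call.func, ast.Name):  # direct calls only
                G.add_edge(fn.name, call.func.id)
    return G

src = """
def load(path): return parse(open(path).read())
def parse(text): return text.split()
def report(data): print(summarize(data))
def summarize(data): return len(data)
"""
G = call_graph(src)
# Communities in the call graph = candidate modules the agent can read.
for community in louvain_communities(G, seed=0):
    print(sorted(community))
```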

What embeddings are still good for

None of this means embeddings are dead. They remain the right tool for the surface-similarity job they were designed for: finding semantically near passages, computing chunk-level relevance, providing fast first-pass retrieval that a structural layer can refine. The mistake was treating them as the entire pipeline. They're a coarse layer underneath a structural one, not a substitute for structure.
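That layering is short to write down. A sketch of vector-seeded traversal over the entity graph from the earlier sketch, assuming an `embed` function and precomputed per-entity vectors:

```python
import numpy as np
import networkx as nx

def retrieve(question: str, G: nx.Graph, embed,
             entity_vecs: dict[str, np.ndarray], k: int = 5) -> set[str]:
    """Embeddings as the coarse first pass, traversal as the structural one."""
    q = embed(question)

    def score(v: np.ndarray) -> float:
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    # 1. Flat similarity finds the entry points...
    seeds = sorted(entity_vecs, key=lambda e: -score(entity_vecs[e]))[:k]
    # 2. ...and one graph hop pulls in related entities that
    #    cosine similarity alone would never have surfaced.
    hits = set(seeds)
    for s in seeds:  # entity_vecs is keyed by graph nodes, so seeds are in G
        hits.update(G.neighbors(s))
    return hits
```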

The architectural shift in 2026 RAG is not "find a better embedding model." It's build an index you can walk - whether the index is a tree the author already wrote (PageIndex) or a graph the LLM extracts (Graph RAG / LightRAG) - and stop asking cosine similarity to do work that traversal was always supposed to do.

Related

  • [[pageindex-vectorless-rag]] - the document-level path, with the OCR-is-information-lossy argument
  • [[graph-rag]] - the entity-level path, broader pattern
  • [[lightrag-algorithm]] - the canonical entity+relationship graph RAG recipe
  • [[rag-techniques]] - the catalog of ~30 retrieval techniques, organized by goal
  • [[multimodal-rag]] - RAG-Anything: graph + multimodal applied to documents with figures and tables
  • [[openkb-knowledge-base]] - the corpus-level wiki compiler that uses PageIndex for long docs
  • [[code-to-knowledge-graph]] - the same pattern, applied to code instead of documents
  • [[document-intelligence-archive|Document Intelligence Archive]] - the full toolset index this essay's argument runs through