
The RAG Curriculum - Build Your Own AI Information Engine

For everyone from weekend hobbyists to people who want to understand how a real production RAG service actually works.

8 weeks · Apr 26, 2026

Automation keeps absorbing more of the work, but building the knowledge system you actually want on top of that automation takes more than a "just do it for me" prompt. Off-the-shelf RAG packages exist, but knowing which knob to turn for your documents, your questions, and your domain is something you have to learn before you can bend a package to your needs. The people who get the most out of Claude Code are the ones who already understand development deeply; the same will be true of every knowledge tool that comes next. The people who know what's underneath will use them best.

This curriculum is built so you can enter at the stage you want. Haven't written Python yet? Start at Stage 0. Already comfortable? Start at Stage 1. Already shipping with LangChain? Skip to Stage 2.

Difficulty: Beginner to intermediate · Total time: 4–14 weeks depending on where you enter

[Figure: RAG techniques overview]


Stage 0 - If you don't know Python yet

Optional · 2–4 weeks

Every RAG snippet you'll see in this curriculum is Python. If variables, loops, lists, and dicts haven't sunk in yet, LangChain code will just look like noise. This is the standard intro course - five sections, free to audit, Korean subtitles available.

Once you finish lectures 3 (Web Data) and 4 (Databases), you're ready for Stage 1 - JSON, REST APIs, and reading SQL schemas are all covered well enough there. pandas and numpy aren't taught in this course; pick them up reactively when Stage 3 code stops making sense.

Stage 1 - Get LangChain & LangGraph into your hands

Sequential · 1 week

Most of the RAG code you'll read in Stage 2 is LangChain, and the moment branching or loops enter the picture you need LangGraph. Finish LangChain Academy's official intro course inside a week, then port one small chain you've written into LangGraph as a forcing function.

Stage 2 - 22 RAG Techniques: the map

Sequential · 2 weeks

Skim every technique end-to-end. The goal isn't to run all the code - it's to understand what each method actually does, so you can tell which ones fit your problem and which don't.

Stage 3 - Apply RAG techniques to a real PDF

Sequential · 4 weeks

Pick one PDF you actually want to query - a textbook, a contract, a manual, your company wiki, doesn't matter. Take that document and walk through the nine categories below in order. Implement at least one technique per category, measure retrieval quality every time, and keep a written log of which techniques pull the right information out of your PDF.
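The measure-and-log loop doesn't need a framework at first: a handful of hand-written question → expected-chunk pairs and a hit-rate check is enough to compare techniques. A minimal sketch with made-up chunk ids and stand-in retrievers (in practice each retriever wraps one of your real pipelines):

```python
# Minimal retrieval-quality log: for each technique, record how often
# the chunk you know contains the answer appears in the top-k results.
def hit_rate(retrieve, gold, k=3):
    hits = sum(1 for question, chunk_id in gold
               if chunk_id in retrieve(question)[:k])
    return hits / len(gold)

# `gold` pairs each test question with the id of the chunk that answers it.
gold = [("what is the refund policy?", "c7"),
        ("how do I reset my password?", "c2")]

# Stand-in retrievers returning ranked chunk ids (made-up data).
baseline = lambda q: ["c1", "c7", "c9"]   # e.g. Simple RAG
reranked = lambda q: ["c7", "c2", "c1"]   # e.g. after reranking

log = {name: hit_rate(r, gold)
       for name, r in [("simple_rag", baseline), ("rerank", reranked)]}
print(log)  # {'simple_rag': 0.5, 'rerank': 1.0}
```

Keep one such number per technique per category and the written log writes itself.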

Week-by-week (4 weeks recommended)

  • Week 1 / Foundational - Anchor on Simple RAG, then compare chunk size, Proposition, and Reliable RAG. The goal is a baseline number that every later technique gets evaluated against.
  • Week 1 / Query Enhancement - Rewrite the user's question itself with Query Transformations, and find out firsthand why HyDE often underperforms. First lever to pull on top of the baseline.
  • Week 2 / Context & Content Enrichment - HyPE, Contextual Chunk Headers, Relevant Segment Extraction, Semantic Chunking, Contextual Compression. See how far you can move retrieval quality by reshaping information before it's embedded.
  • Week 2 / Advanced Retrieval - Fusion Retrieval, Intelligent Reranking, Hierarchical Indices, Multi-faceted Filtering. Usually the highest-leverage stretch - spend most of the week here.
  • Week 3 / Iterative & Adaptive - Feedback Loops, Adaptive Retrieval. Build a feel for retrieval that refines itself across multiple passes instead of trying to nail it in one shot.
  • Week 3 / Evaluation - DeepEval, GroUSE, Open-RAG-Eval. Take the baseline number from Week 1 and re-measure it under a real evaluation framework. Skip this and everything downstream runs on vibes.
  • Week 4 / Memory-Augmented + Explainability - MemoRAG and Explainable Retrieval. Skim, but build at least one structure that can show the user why a given chunk was retrieved.
  • Week 4 / Advanced Architectures - Agentic RAG, Graph RAG (Milvus / Microsoft), RAPTOR, Self RAG, Corrective RAG. The goal here is to judge which architecture fits your PDF, not to implement all seven.
  • End of Week 4 / Controllable Agent - Close on the most complex case. By the time you finish this you'll naturally see where to tune any off-the-shelf RAG package.
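To make one of these concrete: Fusion Retrieval from the Week 2 block usually comes down to Reciprocal Rank Fusion - merge the ranked lists from a keyword retriever and a vector retriever by summing 1/(k + rank) per document. A self-contained sketch with made-up ranked lists (the doc ids and k=60 default are illustrative):

```python
# Reciprocal Rank Fusion: combine ranked result lists from several
# retrievers (e.g. BM25 + dense embeddings) into a single ranking.
def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked):
            # rank is 0-based, so the top result scores 1/(k + 1)
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["d3", "d1", "d5"]   # made-up keyword-search results
vector  = ["d1", "d4", "d3"]   # made-up embedding-search results
print(rrf([keyword, vector]))  # documents found by both lists rise to the top
```

Note that d1 wins despite topping only one list - appearing high in both rankings beats a single first place, which is exactly the behavior that makes fusion robust.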

Log answer quality on the same PDF after every category. After the four weeks you should be able to point at which technique stack actually solved your corpus - that's the only way you can keep improving from here.
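Contextual Chunk Headers, from the Week 2 enrichment block, is one of the simplest techniques to hand-roll and a good first log entry: prepend document and section metadata to each chunk before embedding, so a query can match context the chunk text alone doesn't carry. A minimal sketch with made-up section data:

```python
# Contextual Chunk Headers: prepend section context to each chunk
# before it is embedded, so a query like "termination notice period"
# can match a chunk that never repeats the section title itself.
def add_headers(doc_title, sections):
    chunks = []
    for section_title, paragraphs in sections:
        header = f"Document: {doc_title} | Section: {section_title}"
        for p in paragraphs:
            chunks.append(f"{header}\n{p}")
    return chunks

sections = [("Termination", ["Either party may end the agreement "
                             "with 30 days' written notice."])]
chunks = add_headers("Service Agreement", sections)
print(chunks[0])
```

The header text is what gets embedded along with the body, so the retrievability gain costs nothing at query time.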

Stage 4 - Multimodal Graph RAG, package-form

Sequential · 1 week

Now that you've spent four weeks hand-rolling, run these two packages and the internals will start surfacing. The same RAG pipeline can be built very differently. RAG-Anything is a complete multimodal RAG service: drop in your PDFs and Word docs and get a queryable system back. Graphify is a Graph RAG layer aimed at coding agents: point it at a folder of code, PDFs, HTML, and screenshots, and the agent answers from a pre-built knowledge graph instead of touching the original files. Studying packages built with different approaches teaches you a lot - you'll see how the strengths and limits of every Stage-3 technique show up inside each one, and what separates these heavily-starred RAG packages from a stitched-together pipeline.

✓ What you'll walk away with - The minimum knowledge to ship your own

By this point you understand how a production-grade RAG package actually works, and you can rewire the whole pipeline however you want. Real services have a lot of plumbing on either side of this, of course. But knowing the core of how a knowledge system processes information is what separates you from anyone who can only call an API.


Side note - When you can't get clean information out of a PDF

The hardest part of RAG isn't the model or the retrieval - it's PDF preprocessing. Tables, equations, images, and multi-column layouts mixed together will mangle context and formatting under plain text extraction, and even the best downstream pipeline still fails on garbage input. The two videos below - both in Korean - walk through preprocessing tricks that actually work. Worth it even with subtitles.
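One of the standard tricks - restoring reading order on multi-column pages - can be sketched in pure Python. Layout-aware extractors (PyMuPDF's page.get_text("blocks"), for instance) give you text blocks with bounding boxes; plain top-to-bottom extraction interleaves the columns, while grouping by column first restores reading order. The block coordinates below are made-up sample data for a two-column page:

```python
# Each block is (x0, y0, x1, y1, text) - the shape a layout-aware
# extractor returns. Made-up sample data for a page ~600pt wide.
blocks = [
    (50, 100, 280, 120, "Left column, paragraph 1."),
    (320, 100, 560, 120, "Right column, paragraph 1."),
    (50, 140, 280, 160, "Left column, paragraph 2."),
    (320, 140, 560, 160, "Right column, paragraph 2."),
]

def naive_order(blocks):
    # Plain top-to-bottom extraction: the columns get interleaved.
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]

def column_aware_order(blocks, page_mid=300):
    # Assign each block to a column by its left edge, then read each
    # column top-to-bottom, left column first.
    ordered = sorted(blocks, key=lambda b: (b[0] >= page_mid, b[1]))
    return [b[4] for b in ordered]

print(naive_order(blocks))         # interleaves left and right columns
print(column_aware_order(blocks))  # left column first, then right
```

Real pages need a smarter column split than a fixed midpoint, but the principle - sort by layout, not by raw stream order - is the same one the preprocessing videos build on.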