Multimodal & Reasoning LLMs

From transformer intuition to training a small VLM and reading reasoning model papers

10 weeks · Apr 25, 2026

A 10-week path for engineers who already understand transformers conceptually but want to go deeper by building. You'll see attention visualized, build Llama from scratch in parallel code and video tracks, then extend into the two directions that actually matter today: multimodal models (build a VLM) and reasoning models (a visual guide before the DeepSeek-R1 paper).

Difficulty: Advanced · Total time: 10 weeks


Stage 1 — See Transformers Before You Read Them

Sequential · 1 week

Stage 2 — Llama from Scratch (parallel tracks)

Parallel · 2 weeks · pick one or do both

Track A: Code

Track B: Video

✓ Checkpoint — Rebuild Llama's attention block from memory

If you can write multi-head attention from a blank file, you're ready to go beyond text-only models.
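
For reference, here's a minimal sketch of the shape this checkpoint asks for: causal multi-head self-attention in PyTorch. It deliberately omits Llama's RoPE, grouped-query attention, and KV cache; if you can write this from a blank file, those are incremental additions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention, the core of the checkpoint exercise."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Llama-style: four bias-free projections
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Project, then split into heads: (B, T, C) -> (B, n_heads, T, d_head)
        q = self.wq(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with a causal mask (decoder-only)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        causal = torch.ones(T, T, device=x.device).triu(1).bool()
        scores = scores.masked_fill(causal, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        # Merge heads back together and project out
        return self.wo(out.transpose(1, 2).contiguous().view(B, T, C))

# e.g. MultiHeadAttention(d_model=512, n_heads=8)(torch.randn(2, 16, 512))
```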

Stage 3 — Understand Multimodal LLMs Conceptually

Sequential · 1 week

Read it once end-to-end. Sketch the architecture from memory. Then read it again.
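
If it helps to anchor your sketch, the late-fusion pattern this stage covers fits in a few lines. All names here are illustrative, not any particular library's API:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Skeleton of the common late-fusion VLM design (illustrative names)."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT emitting patch embeddings
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patches into the LLM's embedding space
        self.llm = llm                                   # a decoder-only language model

    def forward(self, pixels: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixels)       # (B, n_patches, vision_dim)
        image_tokens = self.projector(patches)      # (B, n_patches, llm_dim)
        # Fusion point: image tokens enter the LLM as if they were text embeddings
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```

If you can redraw this data flow from memory after the first read, the second read is where the details stick.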

Stage 4 — Visual Guide to Reasoning LLMs

Sequential · 1 week

A primer on the other extension of base LLMs. Read this before the DeepSeek-R1 paper, not after.
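
One idea worth holding onto before the paper: R1-style reinforcement learning scores completions with rule-based, verifiable rewards rather than a learned reward model. A toy sketch of that kind of check, assuming a \boxed{} answer convention; this is illustrative, not the paper's actual reward implementation:

```python
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based outcome reward: check the final answer programmatically
    instead of scoring with a learned reward model. The boxed-answer convention
    and this exact check are illustrative, not the paper's implementation."""
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# e.g. accuracy_reward(r"... so the total is \boxed{42}", "42") -> 1.0
```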

Stage 5 — Build a VLM from Scratch

Sequential · 3 weeks

Walk through the codebase line by line. Train it on Colab. Don't optimize — understand.

✓ Checkpoint — Train a small VLM and explain its design choices

Train your nanoVLM on a small image-text dataset. Then explain — out loud, to a friend, or in a write-up — why the vision encoder fuses where it fuses, why the projection layer has the dimensions it has, and why the loss is computed only on text tokens. If you can't explain it, you didn't really build it.
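
As a concrete version of that last design choice: computing the loss only on text tokens usually means masking image positions out of the cross-entropy via the ignore index. A minimal sketch, with illustrative names rather than nanoVLM's actual API:

```python
import torch
import torch.nn.functional as F

def text_only_loss(logits: torch.Tensor, labels: torch.Tensor,
                   image_token_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over text positions only. Image-token positions get
    label -100, which F.cross_entropy ignores, so the model is never asked
    to 'predict' an image patch. (Illustrative names, not nanoVLM's API.)"""
    labels = labels.masked_fill(image_token_mask, -100)
    # Shift so position t predicts token t+1, as in standard LM training
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```

The -100 sentinel is PyTorch's default ignore_index, which is why the masked image positions contribute nothing to the gradient. If you can explain that, the checkpoint is passed.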