Multimodal & Reasoning LLMs

From transformer intuition to training a small VLM and reading reasoning model papers

10 weeks · Apr 25, 2026

A 10-week path for engineers who already understand transformers conceptually but want to go deeper by building. You'll see attention visualized, build Llama from scratch in parallel code and video tracks, then extend into the two directions that actually matter today: multimodal models (build a VLM) and reasoning models (a visual guide before the DeepSeek-R1 paper).

Difficulty: Advanced · Total time: 10 weeks


Stage 1 — See Transformers Before You Read Them

Sequential · 1 week

Stage 2 — Llama from Scratch (parallel tracks)

Parallel · 2 weeks · pick one or do both

Track A: Code

Track B: Video

✓ Checkpoint — Rebuild Llama's attention block from memory

If you can write multi-head attention from a blank file, you're ready to go beyond text-only models.
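
For reference, here's a minimal sketch of the shape this checkpoint asks for: causal multi-head self-attention in PyTorch. It deliberately omits Llama's RoPE, grouped-query attention, and KV cache; if you can write this from a blank file, those are incremental additions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention, the core of the checkpoint exercise."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Llama-style: four bias-free projections
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Project, then split into heads: (B, T, C) -> (B, n_heads, T, d_head)
        q = self.wq(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with a causal mask (decoder-only)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        causal = torch.ones(T, T, device=x.device).triu(1).bool()
        scores = scores.masked_fill(causal, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        # Merge heads back together and project out
        return self.wo(out.transpose(1, 2).contiguous().view(B, T, C))

# e.g. MultiHeadAttention(d_model=512, n_heads=8)(torch.randn(2, 16, 512))
```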

Stage 3 — Understand Multimodal LLMs Conceptually

Sequential · 1 week

Read it once end-to-end. Sketch the architecture from memory. Then read it again.
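
If it helps to anchor your sketch, the late-fusion pattern this stage covers fits in a few lines. All names here are illustrative, not any particular library's API:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Skeleton of the common late-fusion VLM design (illustrative names)."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT emitting patch embeddings
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps patches into the LLM's embedding space
        self.llm = llm                                   # a decoder-only language model

    def forward(self, pixels: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixels)       # (B, n_patches, vision_dim)
        image_tokens = self.projector(patches)      # (B, n_patches, llm_dim)
        # Fusion point: image tokens enter the LLM as if they were text embeddings
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))
```

If you can redraw this data flow from memory after the first read, the second read is where the details stick.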

Stage 4 — Visual Guide to Reasoning LLMs

Sequential · 1 week

A primer on the other extension of base LLMs. Read this before the DeepSeek-R1 paper, not after.
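
One idea worth holding onto before the paper: R1-style reinforcement learning scores completions with rule-based, verifiable rewards rather than a learned reward model. A toy sketch of that kind of check, assuming a \boxed{} answer convention; this is illustrative, not the paper's actual reward implementation:

```python
import re

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based outcome reward: check the final answer programmatically
    instead of scoring with a learned reward model. The boxed-answer convention
    and this exact check are illustrative, not the paper's implementation."""
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# e.g. accuracy_reward(r"... so the total is \boxed{42}", "42") -> 1.0
```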

Stage 5 — Build a VLM from Scratch

Sequential · 3 weeks

Walk through the codebase line by line. Train it on Colab. Don't optimize — understand.

✓ Checkpoint — Train a small VLM and explain its design choices

Train your nanoVLM on a small image-text dataset. Then explain — out loud, to a friend, or in a write-up — why the vision encoder fuses where it fuses, why the projection layer has the dimensions it has, and why the loss is computed only on text tokens. If you can't explain it, you didn't really build it.
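
As a concrete version of that last design choice: computing the loss only on text tokens usually means masking image positions out of the cross-entropy via the ignore index. A minimal sketch, with illustrative names rather than nanoVLM's actual API:

```python
import torch
import torch.nn.functional as F

def text_only_loss(logits: torch.Tensor, labels: torch.Tensor,
                   image_token_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over text positions only. Image-token positions get
    label -100, which F.cross_entropy ignores, so the model is never asked
    to 'predict' an image patch. (Illustrative names, not nanoVLM's API.)"""
    labels = labels.masked_fill(image_token_mask, -100)
    # Shift so position t predicts token t+1, as in standard LM training
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```

The -100 sentinel is PyTorch's default ignore_index, which is why the masked image positions contribute nothing to the gradient. If you can explain that, the checkpoint is passed.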