Attention From Scratch

I’m mapping the landscape of large language models to see what’s possible for an individual, not just for massive tech teams. The exciting news is that fully open-source LLMs exist: you can inspect the weights, study the architecture, run the inference code, and even examine the training stack. As someone who loves marrying hardware, software, and math (classic HW/SW co-design), I’m inspired by the breakthroughs that pushed LLM inference forward: FlashAttention’s IO-aware fused kernels, the improved parallelism and asynchronous execution of FlashAttention-2 and FlashAttention-3, and the PagedAttention serving techniques in engines like vLLM. With today’s cloud options, renting enterprise-class GPUs for inference is also within reach. So I’ve set a personal roadmap that takes me from beginner to building production-grade inference infrastructure on top of open models like Olmo 2. I’m not aiming to train new models; I’m aiming to run them exceptionally well. Along the way, I hope to understand the algorithms behind one of the most impactful technologies of our time, and maybe even find new ways to push them further.

Repository: Attention-From-Scratch

12-week plan to build a from-scratch Transformer inference stack (prefill, decode, serving)

🚀 Start: 2025-10-14
🔄 Updated: 2025-10-22
Progress: 1/12 weeks (8%)

Week 1: Vast.AI Setup + Baseline Environment (Target: 2025-10-21, Done: 2025-10-19)
Baseline environment ready; captured reference runs on Olmo 2 (RTX 5090).

Week 2: Baseline Benchmark Harness (Target: 2025-10-28)
Define the key metrics (latency, throughput, memory) and implement a repeatable GPU benchmarking suite for the reference models.

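To make the numbers comparable week over week, the harness has to warm up the GPU and synchronize before reading the clock. A minimal sketch of the timing loop, assuming PyTorch on CUDA; `generate` and its arguments are placeholders for whatever the reference runner ends up looking like:

```python
import time
import torch

def benchmark_decode(generate, prompt_ids, n_new_tokens=128, warmup=3, iters=10):
    """Return mean per-token decode latency (ms) and tokens/sec."""
    for _ in range(warmup):                      # warm up kernels and the allocator
        generate(prompt_ids, n_new_tokens)
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(prompt_ids, n_new_tokens)
        torch.cuda.synchronize()                 # wait for all GPU work before stopping the timer
        times.append(time.perf_counter() - start)

    mean_s = sum(times) / len(times)
    return 1000.0 * mean_s / n_new_tokens, n_new_tokens / mean_s
```
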
Week 3: Inference Stack Skeleton (Target: 2025-11-04)
Start building the custom inference stack with modular orchestration and a prefill pipeline.

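In the skeleton, prefill is one batched forward pass over the whole prompt that fills the per-layer KV cache and returns logits for the last position. A minimal sketch of that interface; the `model(input_ids, past_kv=...)` signature is an assumption for illustration, not Olmo 2's actual API:

```python
from dataclasses import dataclass
import torch

@dataclass
class PrefillResult:
    last_logits: torch.Tensor   # [batch, vocab], logits for the last prompt position
    kv_cache: list              # one (K, V) tensor pair per layer

@torch.no_grad()
def prefill(model, input_ids: torch.Tensor) -> PrefillResult:
    # single forward pass over the full prompt; populates the KV cache from scratch
    logits, kv_cache = model(input_ids, past_kv=None)   # assumed call signature
    return PrefillResult(last_logits=logits[:, -1, :], kv_cache=kv_cache)
```
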
Week 4: Prefill + Decode Loop + Sampling (Target: 2025-11-11)
Implement KV cache, decode loop, and sampling strategies; validate generations and measure performance.

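Decode then feeds one token at a time back through the model, reusing the KV cache so each step stays cheap regardless of prompt length. A sketch of the loop with temperature plus top-p (nucleus) sampling, again assuming the placeholder `model(ids, past_kv=...)` interface from the prefill sketch:

```python
import torch

@torch.no_grad()
def decode(model, last_logits, kv_cache, max_new_tokens=64, temperature=0.8, top_p=0.95):
    out = []
    logits = last_logits                                      # [batch, vocab] from prefill
    for _ in range(max_new_tokens):
        probs = torch.softmax(logits / temperature, dim=-1)
        # top-p: keep the smallest set of tokens whose cumulative probability covers top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0  # drop tokens outside the nucleus
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        next_tok = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))
        out.append(next_tok)
        # single-token forward pass; the KV cache makes this O(1) in prompt length
        logits, kv_cache = model(next_tok, past_kv=kv_cache)
        logits = logits[:, -1, :]
    return torch.cat(out, dim=-1)
```
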
Week 5: FlashAttention-Style Optimization (Target: 2025-11-18)
Replace naive attention with IO-aware tiled kernels; benchmark and profile.

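The core trick is the online softmax: process K/V in tiles, keep a running max and running denominator per query row, and rescale the accumulator whenever the max changes, so the full attention matrix never has to be materialized in HBM. A pure PyTorch sketch of that math (single head, no mask) that I can diff the real kernel against:

```python
import torch

def tiled_attention(q, k, v, block=128):
    # q: [n_q, d], k/v: [n_k, d]  -- single head, no masking, for clarity
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)  # running row max
    l = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)                  # running denominator
    acc = torch.zeros_like(q)                                                        # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                                   # scores for this K/V tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                                 # tile probabilities, shifted by the new max
        correction = torch.exp(m - m_new)                        # rescale previous accumulators
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l
```

On small random inputs this should match `torch.softmax(q @ k.T * scale, dim=-1) @ v` to floating-point tolerance, so it doubles as a correctness oracle for the fused kernel.
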
Week 6: Advanced KV Cache + CUDA Graphs (Target: 2025-11-25)
Introduce a paged/slab KV cache and capture the decode path with CUDA Graphs.

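A paged cache stores K/V in fixed-size physical blocks and keeps a per-sequence block table, so memory is allocated a block at a time and freed without fragmentation (the idea behind vLLM's PagedAttention); CUDA Graphs then capture the fixed-shape decode step so its kernel launches replay without per-step CPU overhead. A sketch of just the block-table bookkeeping, with names of my own choosing:

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())     # grab a fresh physical block from the pool
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```
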
Week 7: Quantization (AWQ / GPTQ / NF4) (Target: 2025-12-02)
Implement weight-only int4 paths and evaluate quality against the fp16 baseline.

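All three methods land in roughly the same storage format: 4-bit integer weights plus a scale per small group along the input dimension, dequantized on the fly inside the matmul. A sketch of plain symmetric round-to-nearest group quantization, the baseline that AWQ/GPTQ improve on by choosing scales and clipping more carefully:

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    # w: [out_features, in_features]; in_features must be divisible by group_size
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0   # symmetric int4 range is [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)      # int8 used as the container for 4-bit values
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # expand back to [out_features, in_features] in floating point
    return (q.float() * scale).reshape(q.shape[0], -1)
```
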
Week 8: Batching, Streams, and Scheduler (Target: 2025-12-09)
Improve throughput via concurrent streams and request scheduling.

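Most of the throughput win comes from continuous batching: instead of waiting for every sequence in a batch to finish, finished requests leave after each decode step and queued ones take their slots. A sketch of that scheduling loop; `step_fn` and the `done` flag are stand-ins for the engine's real batched decode step and completion check:

```python
from collections import deque

def run_scheduler(requests, step_fn, max_batch: int = 8):
    waiting = deque(requests)          # requests not yet admitted
    active = []                        # requests currently decoding
    finished = []
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit new requests into free slots
            active.append(waiting.popleft())
        step_fn(active)                               # one decode step for the whole batch
        finished.extend(r for r in active if r.done)  # retire completed sequences
        active = [r for r in active if not r.done]
    return finished
```
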
Week 9: Parallelism & Memory Optimization (Target: 2025-12-16)
Extend to multi-GPU tensor parallelism and reduce memory pressure for long contexts.

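For the MLP blocks, the standard tensor-parallel split is column-parallel for the first projection and row-parallel for the second, so each GPU keeps its activations local and only one all-reduce is needed per block. A single-device sketch of the math, with the loop over ranks standing in for work that would really run on separate GPUs via torch.distributed:

```python
import torch

def mlp_tensor_parallel(x, w1, w2, world_size=2):
    # x: [tokens, d_model], w1: [d_model, d_ff], w2: [d_ff, d_model]
    d_ff = w1.shape[1]
    shard = d_ff // world_size
    partials = []
    for rank in range(world_size):                        # each iteration = one GPU's local work
        w1_col = w1[:, rank * shard:(rank + 1) * shard]   # column-parallel slice of the first projection
        w2_row = w2[rank * shard:(rank + 1) * shard, :]   # row-parallel slice of the second projection
        partials.append(torch.relu(x @ w1_col) @ w2_row)  # local activations never leave the rank
    return sum(partials)                                  # stands in for the single all-reduce
```
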
Week 10: Speculative & Lookahead Decoding (Target: 2025-12-23)
Implement lookahead decoding and (optionally) Medusa heads; measure speedups.

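These all share a draft-and-verify structure: cheaply guess several future tokens, then score them with the big model in a single forward pass and keep the longest agreeing prefix. A sketch of the simplest greedy variant with a small draft model (batch size 1; both model calls are assumed to return logits of shape [batch, seq, vocab]); lookahead decoding and Medusa swap the draft model for n-gram guesses or extra heads but verify the same way:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, ids, k=4):
    # 1) draft k tokens greedily with the cheap model
    draft = ids
    for _ in range(k):
        nxt = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    # 2) one target forward pass scores every drafted position at once
    target_pred = target_model(draft).argmax(dim=-1)         # [1, len(draft)]
    # 3) accept drafted tokens while they match what the target would have picked
    accepted = ids
    for i in range(ids.shape[1], draft.shape[1]):
        expected = target_pred[:, i - 1:i]                    # target's choice after a prefix of length i
        accepted = torch.cat([accepted, expected], dim=-1)    # equals the draft token when they agree
        if not torch.equal(expected, draft[:, i:i + 1]):
            break                                             # first disagreement: stop accepting
    # (a full implementation would also append the target's bonus token when every draft matches)
    return accepted
```
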
Week 11: Serving Layer (Target: 2025-12-30)
Expose the engine via a lightweight API with batching and metrics.

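The serving layer only needs to accept requests, hand them to the engine's scheduler, and report basic counters. A minimal sketch using FastAPI, with a stub engine standing in for the real one:

```python
from fastapi import FastAPI
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

class EchoEngine:
    """Stub engine so the sketch runs; the real engine plugs in behind the same call."""
    def generate(self, prompt: str, max_new_tokens: int):
        return prompt.upper(), min(max_new_tokens, len(prompt))

app = FastAPI()
engine = EchoEngine()
served_tokens = 0            # crude metrics counter

@app.post("/generate")
def generate(req: GenerateRequest):
    global served_tokens
    text, n_tokens = engine.generate(req.prompt, req.max_new_tokens)
    served_tokens += n_tokens
    return {"text": text, "generated_tokens": n_tokens, "served_total": served_tokens}
```

Assuming the file is saved as serve.py, this runs with `uvicorn serve:app`.
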
Week 12: Validation, Docs, and Release (Target: 2026-01-06)
Validate output parity against the reference implementation and publish the complete report and repo.