Attention From Scratch

I’m mapping the landscape of large language models to see what’s possible for an individual, not just for massive tech teams. The exciting news is that fully open-source LLMs exist: you can inspect the weights, study the architecture, run the inference code, and even examine the training stack. As someone who loves marrying hardware, software, and math (classic HW/SW co-design), I’m inspired by the breakthroughs that pushed LLM inference forward: FlashAttention’s IO-aware fused kernels, the improved parallelism and asynchronous execution of FlashAttention-2 and FlashAttention-3, and the PagedAttention serving techniques in engines like vLLM. With today’s cloud options, renting enterprise-class GPUs for inference is also within reach. So I’ve set a personal roadmap that takes me from beginner to building production-grade inference infrastructure on top of open models like Olmo 2. I’m not aiming to train new models; I’m aiming to run them exceptionally well. Along the way, I hope to understand the algorithms behind one of the most impactful technologies of our time, and maybe even find new ways to push them further.

Repository: Attention-From-Scratch

12-week plan to build a from-scratch Transformer inference stack (prefill, decode, serving)

🚀 Start: 2025-10-14
🔄 Updated: 2025-10-22
Progress: 1/12 weeks (8%)

Week 1: Vast.AI Setup + Baseline Environment (Target: 2025-10-21, Done: 2025-10-19)
Baseline environment ready; captured reference runs on Olmo 2 (RTX 5090).

Week 2: Baseline Benchmark Harness (Target: 2025-10-28)
Define the key metrics (latency, throughput, memory) and implement a repeatable GPU benchmarking suite for the reference models.

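To make the numbers comparable week over week, the harness has to warm up the GPU and synchronize before reading the clock. A minimal sketch of the timing loop, assuming PyTorch on CUDA; `generate` and its arguments are placeholders for whatever the reference runner ends up looking like:

```python
import time
import torch

def benchmark_decode(generate, prompt_ids, n_new_tokens=128, warmup=3, iters=10):
    """Return mean per-token decode latency (ms) and tokens/sec."""
    for _ in range(warmup):                      # warm up kernels and the allocator
        generate(prompt_ids, n_new_tokens)
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        generate(prompt_ids, n_new_tokens)
        torch.cuda.synchronize()                 # wait for all GPU work before stopping the timer
        times.append(time.perf_counter() - start)

    mean_s = sum(times) / len(times)
    return 1000.0 * mean_s / n_new_tokens, n_new_tokens / mean_s
```
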
Week 3: Inference Stack Skeleton (Target: 2025-11-04)
Start building the custom inference stack with modular orchestration and a prefill pipeline.

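In the skeleton, prefill is one batched forward pass over the whole prompt that fills the per-layer KV cache and returns logits for the last position. A minimal sketch of that interface; the `model(input_ids, past_kv=...)` signature is an assumption for illustration, not Olmo 2's actual API:

```python
from dataclasses import dataclass
import torch

@dataclass
class PrefillResult:
    last_logits: torch.Tensor   # [batch, vocab], logits for the last prompt position
    kv_cache: list              # one (K, V) tensor pair per layer

@torch.no_grad()
def prefill(model, input_ids: torch.Tensor) -> PrefillResult:
    # single forward pass over the full prompt; populates the KV cache from scratch
    logits, kv_cache = model(input_ids, past_kv=None)   # assumed call signature
    return PrefillResult(last_logits=logits[:, -1, :], kv_cache=kv_cache)
```
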
Week 4: Prefill + Decode Loop + Sampling (Target: 2025-11-11)
Implement KV cache, decode loop, and sampling strategies; validate generations and measure performance.

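Decode then feeds one token at a time back through the model, reusing the KV cache so each step stays cheap regardless of prompt length. A sketch of the loop with temperature plus top-p (nucleus) sampling, again assuming the placeholder `model(ids, past_kv=...)` interface from the prefill sketch:

```python
import torch

@torch.no_grad()
def decode(model, last_logits, kv_cache, max_new_tokens=64, temperature=0.8, top_p=0.95):
    out = []
    logits = last_logits                                      # [batch, vocab] from prefill
    for _ in range(max_new_tokens):
        probs = torch.softmax(logits / temperature, dim=-1)
        # top-p: keep the smallest set of tokens whose cumulative probability covers top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0  # drop tokens outside the nucleus
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        next_tok = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))
        out.append(next_tok)
        # single-token forward pass; the KV cache makes this O(1) in prompt length
        logits, kv_cache = model(next_tok, past_kv=kv_cache)
        logits = logits[:, -1, :]
    return torch.cat(out, dim=-1)
```
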
Week 5: FlashAttention-Style Optimization (Target: 2025-11-18)
Replace naive attention with IO-aware tiled kernels; benchmark and profile.

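The core trick is the online softmax: process K/V in tiles, keep a running max and running denominator per query row, and rescale the accumulator whenever the max changes, so the full attention matrix never has to be materialized in HBM. A pure PyTorch sketch of that math (single head, no mask) that I can diff the real kernel against:

```python
import torch

def tiled_attention(q, k, v, block=128):
    # q: [n_q, d], k/v: [n_k, d]  -- single head, no masking, for clarity
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"), dtype=q.dtype, device=q.device)  # running row max
    l = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)                  # running denominator
    acc = torch.zeros_like(q)                                                        # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                                   # scores for this K/V tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                                 # tile probabilities, shifted by the new max
        correction = torch.exp(m - m_new)                        # rescale previous accumulators
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l
```

On small random inputs this should match `torch.softmax(q @ k.T * scale, dim=-1) @ v` to floating-point tolerance, so it doubles as a correctness oracle for the fused kernel.
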
Week 6: Advanced KV Cache + CUDA Graphs (Target: 2025-11-25)
Introduce a paged/slab KV cache and capture the decode path with CUDA Graphs.

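A paged cache stores K/V in fixed-size physical blocks and keeps a per-sequence block table, so memory is allocated a block at a time and freed without fragmentation (the idea behind vLLM's PagedAttention); CUDA Graphs then capture the fixed-shape decode step so its kernel launches replay without per-step CPU overhead. A sketch of just the block-table bookkeeping, with names of my own choosing:

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())     # grab a fresh physical block from the pool
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```
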
Week 7: Quantization (AWQ / GPTQ / NF4) (Target: 2025-12-02)
Implement weight-only int4 paths and evaluate quality against the fp16 baseline.

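All three methods land in roughly the same storage format: 4-bit integer weights plus a scale per small group along the input dimension, dequantized on the fly inside the matmul. A sketch of plain symmetric round-to-nearest group quantization, the baseline that AWQ/GPTQ improve on by choosing scales and clipping more carefully:

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    # w: [out_features, in_features]; in_features must be divisible by group_size
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0   # symmetric int4 range is [-8, 7]
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)      # int8 used as the container for 4-bit values
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # expand back to [out_features, in_features] in floating point
    return (q.float() * scale).reshape(q.shape[0], -1)
```
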
Week 8: Batching, Streams, and Scheduler (Target: 2025-12-09)
Improve throughput via concurrent streams and request scheduling.

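Most of the throughput win comes from continuous batching: instead of waiting for every sequence in a batch to finish, finished requests leave after each decode step and queued ones take their slots. A sketch of that scheduling loop; `step_fn` and the `done` flag are stand-ins for the engine's real batched decode step and completion check:

```python
from collections import deque

def run_scheduler(requests, step_fn, max_batch: int = 8):
    waiting = deque(requests)          # requests not yet admitted
    active = []                        # requests currently decoding
    finished = []
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit new requests into free slots
            active.append(waiting.popleft())
        step_fn(active)                               # one decode step for the whole batch
        finished.extend(r for r in active if r.done)  # retire completed sequences
        active = [r for r in active if not r.done]
    return finished
```
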
Week 9: Parallelism & Memory Optimization (Target: 2025-12-16)
Extend to multi-GPU tensor parallelism and reduce memory pressure for long contexts.

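For the MLP blocks, the standard tensor-parallel split is column-parallel for the first projection and row-parallel for the second, so each GPU keeps its activations local and only one all-reduce is needed per block. A single-device sketch of the math, with the loop over ranks standing in for work that would really run on separate GPUs via torch.distributed:

```python
import torch

def mlp_tensor_parallel(x, w1, w2, world_size=2):
    # x: [tokens, d_model], w1: [d_model, d_ff], w2: [d_ff, d_model]
    d_ff = w1.shape[1]
    shard = d_ff // world_size
    partials = []
    for rank in range(world_size):                        # each iteration = one GPU's local work
        w1_col = w1[:, rank * shard:(rank + 1) * shard]   # column-parallel slice of the first projection
        w2_row = w2[rank * shard:(rank + 1) * shard, :]   # row-parallel slice of the second projection
        partials.append(torch.relu(x @ w1_col) @ w2_row)  # local activations never leave the rank
    return sum(partials)                                  # stands in for the single all-reduce
```
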
Week 10: Speculative & Lookahead Decoding (Target: 2025-12-23)
Implement lookahead decoding and (optionally) Medusa heads; measure speedups.

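These all share a draft-and-verify structure: cheaply guess several future tokens, then score them with the big model in a single forward pass and keep the longest agreeing prefix. A sketch of the simplest greedy variant with a small draft model (batch size 1; both model calls are assumed to return logits of shape [batch, seq, vocab]); lookahead decoding and Medusa swap the draft model for n-gram guesses or extra heads but verify the same way:

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, ids, k=4):
    # 1) draft k tokens greedily with the cheap model
    draft = ids
    for _ in range(k):
        nxt = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    # 2) one target forward pass scores every drafted position at once
    target_pred = target_model(draft).argmax(dim=-1)         # [1, len(draft)]
    # 3) accept drafted tokens while they match what the target would have picked
    accepted = ids
    for i in range(ids.shape[1], draft.shape[1]):
        expected = target_pred[:, i - 1:i]                    # target's choice after a prefix of length i
        accepted = torch.cat([accepted, expected], dim=-1)    # equals the draft token when they agree
        if not torch.equal(expected, draft[:, i:i + 1]):
            break                                             # first disagreement: stop accepting
    # (a full implementation would also append the target's bonus token when every draft matches)
    return accepted
```
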
Week 11: Serving Layer (Target: 2025-12-30)
Expose the engine via a lightweight API with batching and metrics.

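The serving layer only needs to accept requests, hand them to the engine's scheduler, and report basic counters. A minimal sketch using FastAPI, with a stub engine standing in for the real one:

```python
from fastapi import FastAPI
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

class EchoEngine:
    """Stub engine so the sketch runs; the real engine plugs in behind the same call."""
    def generate(self, prompt: str, max_new_tokens: int):
        return prompt.upper(), min(max_new_tokens, len(prompt))

app = FastAPI()
engine = EchoEngine()
served_tokens = 0            # crude metrics counter

@app.post("/generate")
def generate(req: GenerateRequest):
    global served_tokens
    text, n_tokens = engine.generate(req.prompt, req.max_new_tokens)
    served_tokens += n_tokens
    return {"text": text, "generated_tokens": n_tokens, "served_total": served_tokens}
```

Assuming the file is saved as serve.py, this runs with `uvicorn serve:app`.
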
Week 12: Validation, Docs, and Release (Target: 2026-01-06)
Validate output parity against the reference implementation and publish the complete report and repo.