Smear and backout parameters were living uninitialized lives until now.
Somewhere between model construction and the first forward pass, the smear and backout lambdas were just... vibes. Undefined. @karpathy's #686 moves their initialization into init_weights() where it belongs. The kind of fix that makes you wonder how it ever worked, and leaves you slightly nervous about what else might be floating in limbo.
Setuptools was missing, and so was everyone without a GPU.
If you tried running nanochat on CPU, you hit an import error before you hit a single token. #706 adds setuptools as a dependency — the kind of one-liner that unblocks an entire hardware class. Now the CUDA-less among us can at least watch the model crawl.
Multi-node training kept dying before it started.
Distributed training with torch.compile has a timing problem: nodes compile at different speeds, the watchdog gets impatient, NCCL throws a timeout, everyone goes home. #722 proposes a coordinated warmup phase so all nodes finish compiling before the real work begins. Still open, but if you've ever stared at a watchdog timeout at 3am, you know this matters.
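For a feel of what "coordinated warmup" means, here is a minimal sketch using only the standard library. It is not the code from #722: threads stand in for nodes, `time.sleep` stands in for compilation, and `run_node` and `simulate` are hypothetical names. In a real PyTorch job the rendezvous would more likely be a collective such as `torch.distributed.barrier()`.

```python
import threading
import time


def run_node(rank, compile_seconds, barrier, log, lock):
    """One simulated node: compile, wait for everyone, then train."""
    time.sleep(compile_seconds)  # stand-in for a torch.compile warmup pass
    with lock:
        log.append(("compiled", rank))
    barrier.wait()  # nobody proceeds until every node has finished compiling
    with lock:
        log.append(("training", rank))


def simulate(compile_times):
    """Run one thread per node and return the ordered event log."""
    barrier = threading.Barrier(len(compile_times))
    lock = threading.Lock()
    log = []
    threads = [
        threading.Thread(target=run_node, args=(rank, secs, barrier, log, lock))
        for rank, secs in enumerate(compile_times)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for event, rank in simulate([0.01, 0.05, 0.02]):
        print(event, rank)
```

However uneven the compile times, every "compiled" event lands before the first "training" event, so the slow node never leaves the fast ones waiting inside a timed collective.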
Not everyone has FA3 silicon, not everyone wants SDPA slowness.
FA3 is fast but hardware-gated. SDPA is portable but not exactly blazing. #721 opens the door to Flash Attention 2 as a middle tier — faster than the fallback, less demanding than the bleeding edge. A pragmatic addition for the GPU middle class.
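The three-tier idea can be sketched as a plain fallback chain. This is an illustration, not the selection logic in #721: `pick_attention_backend` is a hypothetical helper, and the compute-capability thresholds (FA3 on Hopper-class sm90, FA2 on Ampere-class sm80 and newer) are assumptions about typical kernel support.

```python
def pick_attention_backend(compute_capability, has_fa3, has_fa2):
    """Return the fastest attention backend this GPU can plausibly run.

    compute_capability: (major, minor) tuple, e.g. (9, 0).
    has_fa3 / has_fa2: whether the corresponding kernels are installed.
    """
    major, _minor = compute_capability
    # Assumption: FA3 kernels target Hopper-class GPUs (sm90) only.
    if has_fa3 and major == 9:
        return "fa3"
    # Assumption: FA2 needs Ampere-class hardware (sm80) or newer.
    if has_fa2 and major >= 8:
        return "fa2"
    # Portable fallback: PyTorch's built-in scaled dot product attention.
    return "sdpa"


if __name__ == "__main__":
    print(pick_attention_backend((9, 0), has_fa3=True, has_fa2=True))
    print(pick_attention_backend((8, 6), has_fa3=False, has_fa2=True))
    print(pick_attention_backend((7, 5), has_fa3=False, has_fa2=False))
```

Ordering the checks from most to least demanding keeps the fast path for those who have it while everyone else degrades one tier at a time.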
DPI claims faster convergence without the usual learning rate dance.
Warmup schedules exist because random initialization plus high learning rates equals NaN city. #707 proposes DPI — a geometric initialization scheme that allegedly lets you skip warmup entirely. Research-tagged and unmerged, but if it holds up, that's one less hyperparameter to babysit.
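For reference, this is the kind of schedule DPI claims to make unnecessary: a generic linear warmup, sketched here with a hypothetical `lr_at` helper and made-up defaults. It is neither nanochat's actual schedule nor anything from #707.

```python
def lr_at(step, base_lr=3e-4, warmup_steps=100):
    """Linear warmup: ramp from near zero up to base_lr, then hold."""
    if warmup_steps <= 0:
        # No warmup at all: the regime DPI is said to make safe.
        return base_lr
    # The fraction of warmup completed, capped at 1.0 once warmup ends.
    return base_lr * min(1.0, (step + 1) / warmup_steps)


if __name__ == "__main__":
    for step in (0, 49, 99, 500):
        print(step, lr_at(step))
```

Setting `warmup_steps=0` gives the full learning rate from the very first step, which is exactly where a poorly scaled initialization tends to blow up.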
MPS and CPU paths for the CUDA-less researcher.
autoresearch assumed CUDA or nothing. #516 is a PR adding macOS support via MPS and CPU fallbacks. Still pending, but it means your M-series laptop might actually run automated experiments instead of just warming your desk.
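A device fallback of this sort usually boils down to a short priority list. The sketch below is an assumption about the general shape, not the code in #516; `pick_device` and `detect_device` are hypothetical names, and the PyTorch import is optional so the snippet also runs where torch is not installed.

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Probe PyTorch for available backends, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to probe; report CPU for this sketch.
        return "cpu"
    # Older torch builds may lack the mps backend module, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the priority logic in a pure function makes it easy to test on any machine, whatever hardware happens to be under the desk.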
| Repo | Popularity | Stars |
|---|---|---|
| autoresearch | ★★★★★★★★★★ | 76,079 |
| nanoGPT | ★★★★★★★★☆☆ | 57,095 |
| nanochat | ★★★★★★★☆☆☆ | 52,417 |
| minGPT | ★★★☆☆☆☆☆☆☆ | 24,223 |
| llama2.c | ★★★☆☆☆☆☆☆☆ | 19,436 |
| karpathy.github.io | ★☆☆☆☆☆☆☆☆☆ | 1,332 |