karpathy — gitzette 2026-W16

6 commits · 2 PRs merged · 0 releases · 6 repos

Two merged PRs, five open experiments, and the quiet hum of a codebase that mostly worked anyway.

FEATURE

weight initialization finally remembers the lambdas it forgot

Smear and backout parameters were living uninitialized lives until now.

Somewhere between model construction and first forward pass, the smear and backout lambdas were just... vibes. Undefined. @karpathy's #686 moves their initialization into init_weights() where it belongs. The kind of fix that makes you wonder how it ever worked — and slightly nervous about what else might be floating in limbo.

merged: #686, #706

FEATURE

CPU users can now actually run the thing

Setuptools was missing, and so was everyone without a GPU.

If you tried running nanochat on CPU, you hit an import error before you hit a single token. #706 adds setuptools as a dependency — the kind of one-liner that unblocks an entire hardware class. Now the CUDA-less among us can at least watch the model crawl.

PENDING

NCCL watchdog timeouts meet their match in a compile warmup fix

Multi-node training kept dying before it started.

Distributed training with torch.compile has a timing problem: nodes compile at different speeds, the watchdog gets impatient, NCCL throws a timeout, everyone goes home. #722 proposes a coordinated warmup phase so all nodes finish compiling before the real work begins. Still open, but if you've ever stared at a watchdog timeout at 3am, you know this matters.
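
The idea of a coordinated warmup can be sketched without any GPUs at all. The snippet below is a toy simulation, not the code from #722: threads stand in for ranks, `time.sleep` stands in for compilation, and `threading.Barrier` stands in for a collective barrier such as `torch.distributed.barrier()`. The function name `run_ranks` and the timings are made up for illustration.

```python
import threading
import time


def run_ranks(compile_seconds):
    """Simulate ranks that 'compile' at different speeds, then train.

    Each thread plays one rank. The barrier is a stand-in for a collective
    barrier (an assumption, not necessarily what #722 uses).
    """
    world_size = len(compile_seconds)
    barrier = threading.Barrier(world_size)
    lock = threading.Lock()
    events = []  # (rank, phase) tuples in the order they happened

    def worker(rank, seconds):
        time.sleep(seconds)  # stand-in for a torch.compile warmup step
        with lock:
            events.append((rank, "compiled"))
        barrier.wait()  # no rank proceeds until every rank has compiled
        with lock:
            events.append((rank, "train"))

    threads = [
        threading.Thread(target=worker, args=(rank, seconds))
        for rank, seconds in enumerate(compile_seconds)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


# Three ranks with deliberately uneven compile times.
events = run_ranks([0.01, 0.05, 0.02])
```

Every "compiled" event lands before any "train" event, whatever the compile times are. In a real job the barrier is itself a collective with its own timeout, so the warmup phase would also need enough headroom for the slowest node.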

PENDING

Flash Attention 2 slots in between the fast and the fallback

Not everyone has FA3 silicon, not everyone wants SDPA slowness.

FA3 is fast but hardware-gated. SDPA is portable but not exactly blazing. #721 opens the door to Flash Attention 2 as a middle tier — faster than the fallback, less demanding than the bleeding edge. A pragmatic addition for the GPU middle class.
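
A three-tier fallback like this usually comes down to a small selection function. The sketch below is hypothetical and not taken from #721: the name `pick_attention_backend`, the availability flags, and the compute-capability thresholds (FA3 on Hopper, SM 9.x; FA2 on Ampere or newer, SM 8.0+) are assumptions for illustration.

```python
def pick_attention_backend(compute_capability, has_fa3=False, has_fa2=False):
    """Pick the fastest attention backend the hardware and install allow.

    compute_capability: (major, minor) tuple, as CUDA reports it.
    has_fa3 / has_fa2: whether the respective kernels are importable.
    The thresholds below are assumptions, not values read from nanochat.
    """
    major, _minor = compute_capability
    if has_fa3 and major == 9:  # assumed: FA3 kernels target Hopper only
        return "fa3"
    if has_fa2 and major >= 8:  # assumed: FA2 needs Ampere or newer
        return "fa2"
    return "sdpa"  # portable fallback, always available


# An Ampere-class card with both packages installed lands in the middle tier.
backend = pick_attention_backend((8, 6), has_fa3=True, has_fa2=True)
```

Keeping the decision in one pure function makes the tiering easy to test without owning all three kinds of hardware.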

PENDING

geometric initialization wants to kill your warmup schedule

DPI claims faster convergence without the usual learning rate dance.

Warmup schedules exist because random initialization plus high learning rates equals NaN city. #707 proposes DPI — a geometric initialization scheme that allegedly lets you skip warmup entirely. Research-tagged and unmerged, but if it holds up, that's one less hyperparameter to babysit.
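
For reference, this is the kind of schedule DPI claims to make unnecessary. The snippet is a generic linear warmup, not nanochat's actual schedule and not DPI itself; `lr_multiplier` and `warmup_steps` are illustrative names.

```python
def lr_multiplier(step, warmup_steps):
    """Scale factor applied to the base learning rate at a given step.

    Ramps linearly from 1/warmup_steps up to 1.0, then stays at 1.0.
    With warmup_steps <= 0 the schedule is a no-op, which is what a
    warmup-free initialization would allow.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


base_lr = 0.02  # hypothetical base learning rate
first_steps = [base_lr * lr_multiplier(s, warmup_steps=10) for s in range(3)]
```

Skipping warmup amounts to setting `warmup_steps` to zero, which is only safe if the initialization keeps early updates from blowing up.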

PENDING

macOS joins the single-GPU research party

MPS and CPU paths for the CUDA-less researcher.

autoresearch assumed CUDA or nothing. #516 proposes macOS support via MPS and CPU fallbacks. Still pending, but it means your M-series laptop might actually run automated experiments instead of just warming your desk.
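
Device fallback of this sort tends to look like the sketch below. It is a guess at the general shape, not the code in #516: `pick_device` and `detect_device` are hypothetical names, while `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are the standard PyTorch availability checks.

```python
def pick_device(cuda_ok, mps_ok):
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device():
    """Probe PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the MPS backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


device = detect_device()
```

Splitting the policy (`pick_device`) from the probing (`detect_device`) keeps the preference order testable on any machine.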

6 commits · 12 pull requests · 0 releases

commits by repo

REPO        COMMITS
nanochat    6

github stars

autoresearch          ★★★★★★★★★★   76,079
nanoGPT               ★★★★★★★★☆☆   57,095
nanochat              ★★★★★★★☆☆☆   52,417
minGPT                ★★★☆☆☆☆☆☆☆   24,223
llama2.c              ★★★☆☆☆☆☆☆☆   19,436
karpathy.github.io    ★☆☆☆☆☆☆☆☆☆    1,332

PENDING