autoresearch
Andrej Karpathy’s proof-of-concept for the “agentic scientist”: AI agents that autonomously run ML research overnight on a single GPU, iterating on nanochat (Karpathy’s minimal single-GPU LLM trainer). It is the origin of the autoresearch/iteration paradigm that autonovel credits, so the two now sit in the wiki as sibling instances of the same idea: one doing science, one doing fiction. MIT-licensed; ~87k GitHub stars.
The loop
- The agent reads program.md (human-written guidance) and proposes edits to train.py only: architecture, hyperparameters, optimizer, batch size.
- Runs training for a fixed 5-minute wall-clock budget.
- Measures validation bits-per-byte (val_bpb); lower is better.
- Keeps improvements, discards failures, and repeats: ~12 experiments/hour, ~100 per overnight run.
Humans write program.md and review the logs on waking; the experimental loop itself runs with no human in it.
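The loop above amounts to a greedy keep-if-better search under a fixed per-run budget. The following is a minimal sketch of that control flow, not the repo's actual code: `run_loop`, `LoopState`, `propose`, and `evaluate` are hypothetical names, and `evaluate` stands in for a full 5-minute training run that returns `val_bpb`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

BUDGET_MINUTES = 5  # fixed wall-clock budget per experiment (from the text)


@dataclass
class LoopState:
    best_bpb: float
    best_candidate: str
    # One entry per experiment: (candidate, measured val_bpb, was it kept?)
    log: List[Tuple[str, float, bool]] = field(default_factory=list)


def run_loop(
    propose: Callable[[str], str],     # hypothetical: agent proposes a new train.py
    evaluate: Callable[[str], float],  # hypothetical: budgeted run returning val_bpb
    baseline: str,
    n_experiments: int,
) -> LoopState:
    """Greedy loop: keep a candidate only if it lowers val_bpb, else discard it."""
    state = LoopState(best_bpb=evaluate(baseline), best_candidate=baseline)
    for _ in range(n_experiments):
        candidate = propose(state.best_candidate)
        bpb = evaluate(candidate)
        kept = bpb < state.best_bpb  # lower is better
        if kept:
            state.best_bpb, state.best_candidate = bpb, candidate
        state.log.append((candidate, bpb, kept))
    return state


def experiments_per_hour(budget_minutes: int = BUDGET_MINUTES) -> int:
    """Upper bound on throughput if runs are back to back with no overhead."""
    return 60 // budget_minutes


if __name__ == "__main__":
    # Toy stand-ins: the "source" is just a number, and the metric is best at 3.0.
    toy_propose = lambda src: str(float(src) + 0.5)
    toy_evaluate = lambda src: 1.0 + abs(float(src) - 3.0)
    result = run_loop(toy_propose, toy_evaluate, baseline="0.0", n_experiments=10)
    print(result.best_candidate, result.best_bpb)
    print(experiments_per_hour())
```

With a 5-minute budget the ceiling is 12 runs per hour, so an 8-hour night yields about 96 experiments, which matches the ~100 figure above.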
Two design choices that make it work
- Constrained action surface: agents edit only train.py; prepare.py (data/tokenization) and core infra are off-limits. Single-file diffs stay reviewable, which makes the containment posture an enabler of autonomy, not just a safety tax.
- Fixed time budget: every experiment gets the same 5 minutes, so architecturally different runs are directly comparable. The budget is the stop condition.
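One way to picture the constrained action surface is as an allowlist check applied to each proposed change before it runs. This is an illustrative sketch only; the source does not say how autoresearch enforces the restriction, and `EDITABLE` and `violations` are hypothetical names. The list of changed paths is assumed to come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set

# The only file the agent may change, per the text above.
EDITABLE: Set[str] = {"train.py"}


def violations(changed_paths: Iterable[str], editable: Set[str] = EDITABLE) -> List[str]:
    """Return every changed path that falls outside the allowed edit surface."""
    bad = []
    for path in changed_paths:
        # Normalize so "./train.py" and "train.py" compare equal.
        normalized = str(PurePosixPath(path))
        if normalized not in editable:
            bad.append(path)
    return bad


if __name__ == "__main__":
    print(violations(["train.py"]))
    print(violations(["train.py", "prepare.py"]))
```

An empty result means the diff stayed inside the single reviewable file; anything else can be rejected before any GPU time is spent.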
Why it matters here — the feedback-signal contrast
autoresearch is the cleanest instance yet of loop-engineering / agent-loops-verification (“the bottleneck is the feedback signal, not generation”), and it sharpens that thesis by contrast with autonovel:
- autoresearch has an objective, ground-truth metric (val_bpb). This is the easy case: the loop can trust its own signal for free.
- autonovel has no ground-truth metric for prose quality, so it had to engineer a feedback signal (a mechanical scan + an LLM-judge). This is the harder case.
So the same paradigm spans a spectrum: where an objective metric exists, autonomy is nearly free; where it doesn’t, the research problem becomes building a trustworthy evaluator. That is the loop thesis’s load-bearing claim made concrete across two repos.
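Part of what makes val_bpb trustworthy is that it is a deterministic function of the model's loss and the evaluation data, with no judge involved. The sketch below assumes the standard definition of bits-per-byte (summed cross-entropy in nats, converted to bits, divided by the byte count of the evaluated text); the repo's own evaluation code may differ in detail, and `bits_per_byte` is a hypothetical helper name.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    total_nll_nats: cross-entropy summed over every predicted token.
    total_bytes:    number of bytes in the text those tokens encode.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


if __name__ == "__main__":
    # Sanity check: a model that is uniform over all 256 byte values
    # pays ln(256) nats per byte, which is about 8 bits per byte.
    n_bytes = 1000
    print(bits_per_byte(n_bytes * math.log(256), n_bytes))
```

Normalizing by bytes rather than tokens keeps the score comparable across tokenizers, which suits a loop in which each experiment may change the model substantially.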
Tier
T1 — first-party project repo (Karpathy’s own code), the spoke’s convention for project source.
Self-described proof-of-concept, not a benchmarked system; recorded. freshness: volatile (active repo).
Cross-spoke
Karpathy is a research-wiki node (andrej-karpathy, the mechanizing-reasoning lineage); this is agent tooling (an autonomous-agent loop system), so it ingests here per the split, linking the bridge node rather than duplicating it.
Related
autonovel · andrej-karpathy · loop-engineering · agent-loops-verification · self-improving-agents · agent-guardrails · hermes-agent