Karpathy's Autoresearch: AI That Does AI Research While You Sleep
Andrej Karpathy released autoresearch — an open-source project where AI agents run experiments autonomously overnight. 53K stars in two weeks. Here's how it works.
Imagine this: you go to bed at midnight. An AI agent starts running experiments on your GPU. It tries changing the model architecture. Trains for 5 minutes. Checks if the result improved. Keeps the change or throws it away. Tries something else. Repeats.
You wake up at 8am. The agent has run 96 experiments. Your model is measurably better than when you went to sleep. You didn't write a single line of code.
That's autoresearch.
Who made it and why?
Andrej Karpathy — the person who built Tesla's self-driving AI, co-founded OpenAI, and taught Stanford's most popular deep learning course — released autoresearch in March 2026. It hit 53,000+ GitHub stars in its first two weeks.
The idea came from a simple observation: most of what AI researchers do is repetitive. Change a number. Train. Check the result. Repeat. What if an AI agent did all of that on its own?
Karpathy's opening line in the README sets the tone:
"Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies."
He's joking. Mostly.
How does it actually work?
The entire project is shockingly simple. Three files:
prepare.py — Downloads training data and sets up a tokenizer. You run this once and never touch it again.
train.py — Contains the full AI model, optimizer, and training loop. This is the file the AI agent edits. It can change anything: the architecture, the learning rate, the batch size, the optimizer — everything.
program.md — Instructions for the AI agent. This is a Markdown file that tells the agent what to do, like a job description for your AI researcher.
Here's the key insight: you, the human, only edit program.md. You're not writing code anymore. You're writing instructions for an AI that writes code.
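The article doesn't reproduce program.md itself, but going by the description above ("a job description for your AI researcher"), a file like it might look something like this. This is a hypothetical sketch, not the actual contents of Karpathy's file:

```markdown
# Research program (hypothetical example — not the real file)

## Goal
Minimize validation bits per byte. Each experiment gets 5 minutes of GPU time.

## Rules
- Edit train.py only. Never touch prepare.py.
- After each run, log what you changed, the resulting bpb, and keep/discard.

## Log
- baseline: (record the starting bpb here)
- (append one line per experiment)
```

The point is that the human's leverage lives entirely in this file: goals, constraints, and a running lab notebook the agent reads back before each experiment.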
The 5-minute rule
Every experiment runs for exactly 5 minutes of training time. Not 5 minutes of your time — 5 minutes of GPU time.
Why 5 minutes? Because it makes everything comparable. Whether the agent tries a tiny model or a huge one, a weird architecture or a standard one, they all get the same time budget. The metric is validation bits per byte (lower = better). The agent's job is simple: make that number go down.
At 12 experiments per hour, an eight-hour night yields roughly 96 experiments. Each one is a full "try something → measure → keep or discard" cycle.
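"Bits per byte" is ordinary cross-entropy loss converted to base 2 and normalized by the raw byte count of the validation text, which is what makes models with different tokenizers comparable. A minimal sketch of the conversion — the function name and interface here are illustrative, not from the repo:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over a validation
    set into bits per byte: divide by ln(2) to get bits, then by the
    number of raw bytes the validation text occupies."""
    return total_nll_nats / math.log(2) / total_bytes
```

A model averaging 0.9 nats of loss per byte scores about 1.30 bits per byte; lower means the model compresses (predicts) the text better.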
What makes this different from just... training a model?
Normally, a researcher would:
1. Have an idea ("what if I increase the learning rate?")
2. Edit the code
3. Run training for hours
4. Look at the results
5. Think about what to try next
6. Repeat
Autoresearch compresses this entire loop. The AI agent does steps 1-6 autonomously. It reads the code, forms a hypothesis, makes an edit, trains, evaluates, and decides whether to keep the change — all without human intervention.
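In structural terms, that loop is a greedy hill-climb over code changes. Here is a toy, self-contained sketch of it — `propose` and `evaluate` stand in for the agent's code edit and a real 5-minute training run, and none of these names come from the actual repo:

```python
import random

def autoresearch_loop(evaluate, propose, budget=96, start_metric=2.0):
    """Greedy keep-or-discard loop: propose a change, measure it,
    keep it only if the metric (lower is better) improves."""
    best_config, best_metric = {}, start_metric
    log = []
    for _ in range(budget):
        candidate = propose(best_config)       # idea + code edit
        metric = evaluate(candidate)           # train + measure
        kept = metric < best_metric            # keep or discard
        if kept:
            best_config, best_metric = candidate, metric
        log.append((candidate, metric, kept))  # the lab notebook
    return best_config, best_metric, log

# Toy usage: search for a learning rate near a fake sweet spot of 3e-4.
random.seed(0)
propose = lambda cfg: {"lr": cfg.get("lr", 1e-3) * random.choice([0.5, 1.1, 2.0])}
evaluate = lambda cfg: 1.2 + 100 * abs(cfg["lr"] - 3e-4)  # fake "bpb"
best, bpb, log = autoresearch_loop(evaluate, propose, budget=24)
```

The real system replaces `evaluate` with an actual training run and `propose` with an LLM editing train.py, but the keep-or-discard skeleton is the same.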
The agent doesn't just try random things either. It reads program.md for context about what has worked before, what hasn't, and what to try next. It's a researcher with a lab notebook.
Why is this a big deal?
It's the first practical "AI doing AI research" project. People have theorized about this for years. Karpathy made it real, simple, and open source.
It runs on a single GPU. You don't need a data center. With one NVIDIA GPU and a coding agent (Claude, Codex, whatever you prefer), you're running autonomous research.
The code is intentionally tiny. The whole project is about 500 lines across three files. There's no complex infrastructure, no distributed training, no configs. This makes it hackable — anyone can fork it and experiment.
It reframes what "programming" means. Instead of writing Python, you're writing Markdown instructions for an AI researcher. The skill shifts from "can you code this?" to "can you direct an AI to research this?"
What are the limitations?
It's narrow. Right now, autoresearch only optimizes one specific thing: a small language model trained on text data. It's not discovering new architectures from scratch or writing research papers.
The 5-minute budget means small models. You're not training GPT-4 here. The models are small enough to train in 5 minutes, which limits what the agent can explore.
It still needs human judgment. The agent optimizes a number (validation loss), but deciding whether the changes are meaningful or interesting requires a human looking at the experiment log.
It's hardware-dependent. Results on an H100 GPU won't match results on a laptop GPU. Your experiments aren't comparable to someone else's — they're only comparable to your own previous experiments.
The bigger picture
Autoresearch is a proof of concept for something much larger: AI agents that do scientific research autonomously.
Today, the agent tweaks hyperparameters and model architectures. Tomorrow, it might design entirely new training algorithms. The gap between "AI assistant that helps researchers" and "AI that does research" is closing fast.
Karpathy's contribution isn't the code — it's the demonstration that this loop works right now, with tools that already exist, on hardware you might already own.
The era of AI researching AI has a start date. It's March 2026.
Want to try it?
You'll need:
- An NVIDIA GPU (tested on H100, forks available for Mac and Windows)
- Python 3.10+
- A coding agent (Claude Code, Codex, etc.)
The repo is at github.com/karpathy/autoresearch. Clone it, run prepare.py once, point your agent at program.md, and go to sleep.
Wake up smarter.