| name | autoresearch-agent |
| description | Autonomous AI research experimentation using Karpathy's autoresearch framework. Use this skill whenever the user wants to: run autonomous ML experiments overnight, set up autoresearch on any platform (NVIDIA, Mac, AMD, Windows), modify train.py for LLM training experiments, analyze autoresearch results.tsv, understand the autoresearch architecture (GPT model, MuonAdamW optimizer, Polar Express orthogonalization), tune hyperparameters for consumer GPUs (RTX 3060/3090/4090), work with autoresearch forks (macOS, MLX, Windows RTX, AMD), or implement the autonomous experiment loop pattern in any domain. Also trigger when the user mentions "autoresearch", "autonomous research", "experiment loop", "val_bpb", "program.md as a skill", or references Karpathy's overnight experiment workflow.
|
Autoresearch Agent
You are guiding the user through Karpathy's autoresearch framework -- an autonomous AI research loop where an agent modifies a training script, trains for 5 minutes, evaluates, keeps or discards, and repeats indefinitely.
Core Concept
The human programs program.md (structured prose for the agent), NOT Python directly. The agent programs train.py. Three files matter:
| File | Lines | Role | Who Edits |
|---|
prepare.py | ~390 | Fixed infrastructure: data download, tokenizer, dataloader, evaluation | Nobody |
train.py | ~631 | GPT model, MuonAdamW optimizer, training loop, hyperparameters | AI Agent |
program.md | ~115 | Agent instructions, experiment protocol, logging format | Human |
Setup Protocol
1. Detect Hardware
Before anything else, determine the user's platform:
nvidia-smi 2>/dev/null || echo "NO_NVIDIA"
uname -s
python3 --version
which uv || echo "Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh"
2. Platform Routing
Based on hardware, choose the right path:
| Platform | Action |
|---|
| NVIDIA H100/A100 | Use upstream repo directly |
| NVIDIA RTX 3060-5090 | Use upstream with tuned-down params (see Consumer GPU Tuning) |
| macOS (MPS) | Recommend miolini/autoresearch-macos fork |
| macOS (MLX) | Recommend trevin-creator/autoresearch-mlx fork |
| Windows + RTX | Recommend jsegov/autoresearch-win-rtx fork |
| AMD (ROCm) | Recommend andyluo7/autoresearch fork |
| No GPU | Recommend ajzhanghk/autoresearch-glm (CPU, tabular ML) |
3. Install and Prepare
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
uv run prepare.py
Data caches to ~/.cache/autoresearch/ (data shards + tokenizer).
4. Verify with Baseline Run
uv run train.py
This should complete in ~5 minutes and print a summary with val_bpb. Record this as the baseline.
Consumer GPU Tuning
For GPUs with less VRAM than H100 (80GB), apply these changes to train.py. The general principle: reduce model size and batch size to fit VRAM, then let the agent optimize from there.
RTX 3060 (12GB VRAM)
DEPTH = 4
DEVICE_BATCH_SIZE = 32
TOTAL_BATCH_SIZE = 2**16
WINDOW_PATTERN = "L"
RTX 3090/4090 (24GB VRAM)
DEPTH = 6
DEVICE_BATCH_SIZE = 64
TOTAL_BATCH_SIZE = 2**17
Additional tips for small GPUs
- Consider the TinyStories dataset (lower entropy, better results with small models):
karpathy/tinystories-gpt4-clean
- In
prepare.py (for forks that allow it): lower MAX_SEQ_LEN (e.g., 512 or 256) and EVAL_TOKENS
- Decrease
VOCAB_SIZE (4096, 2048, or even 256 for byte-level)
The Experiment Loop
This is the autonomous protocol from program.md. Follow it exactly:
Setup Phase
- Agree on a run tag (e.g.,
mar21)
- Create branch:
git checkout -b autoresearch/<tag>
- Read all three files for context
- Verify
~/.cache/autoresearch/ has data and tokenizer
- Create
results.tsv with header: commit\tval_bpb\tmemory_gb\tstatus\tdescription
- Run baseline:
uv run train.py > run.log 2>&1
Loop (NEVER STOP)
WHILE TRUE:
1. Think of experimental idea
2. Modify train.py
3. git commit -m "descriptive message"
4. uv run train.py > run.log 2>&1
5. grep "^val_bpb:\|^peak_vram_mb:" run.log
6. If empty → crashed. tail -n 50 run.log, attempt fix
7. Log to results.tsv (TAB-separated, not comma)
8. If val_bpb improved → KEEP (advance branch)
9. If val_bpb same or worse → DISCARD (git reset)
10. GOTO 1
Critical rules:
- NEVER stop to ask the user. They may be sleeping.
- Each experiment is ~5 minutes. Target ~12/hour, ~100 overnight.
- Redirect ALL output:
> run.log 2>&1 (never flood context with training output)
- If a run exceeds 10 minutes, kill it and treat as failure
- Log crashes with
val_bpb=0.000000, memory_gb=0.0, status=crash
Results TSV Format
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase matrix LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU activation
d4e5f6g 0.000000 0.0 crash double model width (OOM)
What to Experiment With
The agent can modify ANYTHING in train.py. Here are productive directions organized by category:
Hyperparameter Tuning (Quick Wins)
- Learning rates:
EMBEDDING_LR, UNEMBEDDING_LR, MATRIX_LR, SCALAR_LR
WEIGHT_DECAY (currently 0.2, cautious Muon decay)
ADAM_BETAS (currently 0.8, 0.95)
WARMUP_RATIO and WARMDOWN_RATIO (schedule shape)
TOTAL_BATCH_SIZE (trade compute efficiency vs. gradient noise)
Architecture Changes (Medium Risk)
DEPTH and ASPECT_RATIO (model size/shape tradeoff)
HEAD_DIM (attention head dimension, default 128)
WINDOW_PATTERN (sliding window attention: "SSSL", "SL", "L", etc.)
- Number of KV heads (GQA ratio)
- MLP width multiplier (currently 4x)
- Activation function (currently squared ReLU)
- Logit soft-cap value (currently 15)
Advanced Modifications (Higher Risk, Higher Reward)
- Value embedding configuration (which layers get them, gate channels)
- Residual lambda initialization (
resid_lambdas, x0_lambdas)
- Muon optimizer: Newton-Schulz iteration count (
ns_steps), Polar Express coefficients
- Muon momentum ramp schedule (0.85 → 0.95 over 300 steps)
- Weight decay schedule (linear decay to 0)
- RoPE base frequency (currently 10000)
- Custom attention patterns beyond S/L
Simplification Wins
Per the simplicity criterion: removing complexity for equal or better results is highly valued. Look for:
- Unused parameters or redundant computations
- Over-parameterized components that can be simplified
- Unnecessary schedule complexity
Understanding the Architecture
GPT Model
Standard decoder-only transformer with modern enhancements:
- RMSNorm (pre-norm, not LayerNorm)
- RoPE (Rotary Position Embeddings, base 10000)
- Grouped Query Attention (n_kv_head can differ from n_head)
- Value Embeddings (ResFormer) -- alternating layers get learnable value embeddings mixed into V via sigmoid-gated mechanism
- Residual scaling -- per-layer learnable
resid_lambdas and x0_lambdas that blend residual stream with original embedding
- Sliding Window Attention -- SSSL pattern (S=half context, L=full context), last layer always full
- Logit soft-capping at 15 (tanh-based)
- Squared ReLU activation:
relu(x)^2
- Flash Attention 3 (Hopper via varunneal, fallback via kernels-community)
MuonAdamW Optimizer
Hybrid: Muon for 2D matrix params, AdamW for everything else.
Muon path:
- Nesterov momentum
- Polar Express orthogonalization (Newton-Schulz iterations, 5 coefficients)
- NorMuon variance reduction
- Cautious weight decay (only decays weights where gradient agrees with param sign)
AdamW path:
Standard bias-corrected Adam with weight decay. Used for embeddings, unembeddings, scalars.
All optimizer steps are @torch.compiled for performance.
The val_bpb Metric
Bits per byte -- vocab-size-independent evaluation:
- Sum per-token cross-entropy in nats
- Sum UTF-8 byte lengths of target tokens (special tokens excluded)
- Convert:
total_nats / (log(2) * total_bytes)
This means architectural changes (different vocab sizes, tokenizers) are fairly compared.
Analyzing Results
Read references/analysis-guide.md for the full analysis notebook walkthrough. Key analysis steps:
import pandas as pd
df = pd.read_csv("results.tsv", sep="\t")
kept = df[df["status"].str.strip().str.upper() == "KEEP"]
baseline = df.iloc[0]["val_bpb"]
best = kept["val_bpb"].min()
improvement = baseline - best
print(f"Improvement: {improvement:.6f} ({improvement/baseline*100:.2f}%)")
Notable Forks
For detailed fork information, read references/forks.md. Quick reference:
| Fork | Stars | Platform | Key Feature |
|---|
| miolini/autoresearch-macos | 1,363 | macOS/MPS | Full MPS port, PyTorch SDPA |
| trevin-creator/autoresearch-mlx | 898 | macOS/MLX | No PyTorch dependency |
| mutable-state-inc/autoresearch-at-home | 431 | Distributed | SETI@home-style swarm |
| jsegov/autoresearch-win-rtx | 307 | Windows | Consumer RTX, TinyStories |
| kousun12/darwin-derby | 35 | Any | Generalized to optimize anything |
| ncdrone/autoresearch-ANE | 45 | Apple ANE | Reverse-engineered Apple APIs |
| andyluo7/autoresearch | 10 | AMD/ROCm | AMD GPU support |
Dependencies
From pyproject.toml:
torch==2.9.1 (CUDA 12.8)
kernels>=0.11.7 (Flash Attention 3)
rustbpe>=0.1.0 (fast BPE tokenizer training)
tiktoken>=0.11.0 (tokenizer runtime)
numpy, pandas, matplotlib, pyarrow, requests
Requires: Python 3.10+, uv package manager, NVIDIA GPU with CUDA 12.8+ drivers.