一键在 Manus 中运行任何 Skill

autoresearch-agent

Autonomous AI research experimentation using Karpathy's autoresearch framework. Use this skill whenever the user wants to: run autonomous ML experiments overnight, set up autoresearch on any platform (NVIDIA, Mac, AMD, Windows), modify train.py for LLM training experiments, analyze autoresearch results.tsv, understand the autoresearch architecture (GPT model, MuonAdamW optimizer, Polar Express orthogonalization), tune hyperparameters for consumer GPUs (RTX 3060/3090/4090), work with autoresearch forks (macOS, MLX, Windows RTX, AMD), or implement the autonomous experiment loop pattern in any domain. Also trigger when the user mentions "autoresearch", "autonomous research", "experiment loop", "val_bpb", "program.md as a skill", or references Karpathy's overnight experiment workflow.

在 Manus 中运行

概览

安装命令

npx skills add https://github.com/adaptationio/autoresearch --skill autoresearch-agent

复制此命令并粘贴到 Claude Code 中以安装该技能

来源

adaptationio/autoresearch

星标0

分支0

更新时间2026年3月21日 03:04

文件资源管理器

3 个文件

SKILL.md

readonly

同仓库更多 Skills

同仓库

autoloop

adaptationio/autoresearch

Universal autonomous optimization loop. Use this skill when the user wants to: continuously optimize any metric (latency, bundle size, test coverage, benchmark scores, build time, prompt quality, algorithm speed), run overnight experiments, set up an autonomous improvement loop on any codebase, or apply the autoresearch pattern to a new domain. Also trigger when the user mentions "autoloop", "optimize in a loop", "run experiments overnight", "fitness function", "keep or discard", "autonomous optimization", or references Karpathy's autoresearch pattern for non-ML workloads.

2026-03-210

来源

adaptationio

adaptationio/autoresearch

打开 GitHub 仓库查看创作者相关仓库

安装命令

下载

在 Manus 中运行

适用职业SOC

软件开发工程师计算机与数学类职业15-1252L4

name

autoresearch-agent

description

Autoresearch Agent

You are guiding the user through Karpathy's autoresearch framework -- an autonomous AI research loop where an agent modifies a training script, trains for 5 minutes, evaluates, keeps or discards, and repeats indefinitely.

Core Concept

The human programs program.md (structured prose for the agent), NOT Python directly. The agent programs train.py. Three files matter:

File	Lines	Role	Who Edits
`prepare.py`	~390	Fixed infrastructure: data download, tokenizer, dataloader, evaluation	Nobody
`train.py`	~631	GPT model, MuonAdamW optimizer, training loop, hyperparameters	AI Agent
`program.md`	~115	Agent instructions, experiment protocol, logging format	Human

Setup Protocol

1. Detect Hardware

Before anything else, determine the user's platform:

# Check GPU
nvidia-smi 2>/dev/null || echo "NO_NVIDIA"
# Check for Mac
uname -s  # Darwin = Mac
# Check Python and uv
python3 --version
which uv || echo "Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh"

2. Platform Routing

Based on hardware, choose the right path:

Platform	Action
NVIDIA H100/A100	Use upstream repo directly
NVIDIA RTX 3060-5090	Use upstream with tuned-down params (see Consumer GPU Tuning)
macOS (MPS)	Recommend `miolini/autoresearch-macos` fork
macOS (MLX)	Recommend `trevin-creator/autoresearch-mlx` fork
Windows + RTX	Recommend `jsegov/autoresearch-win-rtx` fork
AMD (ROCm)	Recommend `andyluo7/autoresearch` fork
No GPU	Recommend `ajzhanghk/autoresearch-glm` (CPU, tabular ML)

3. Install and Prepare

# Clone
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# Install deps (creates .venv via uv)
uv sync

# One-time data prep (~2 min): downloads 10 parquet shards + trains BPE tokenizer
uv run prepare.py
# Options: --num-shards N (default 10), --download-workers N (default 8)

Data caches to ~/.cache/autoresearch/ (data shards + tokenizer).

4. Verify with Baseline Run

uv run train.py

This should complete in ~5 minutes and print a summary with val_bpb. Record this as the baseline.

Consumer GPU Tuning

For GPUs with less VRAM than H100 (80GB), apply these changes to train.py. The general principle: reduce model size and batch size to fit VRAM, then let the agent optimize from there.

RTX 3060 (12GB VRAM)

DEPTH = 4                    # down from 8
DEVICE_BATCH_SIZE = 32       # down from 128
TOTAL_BATCH_SIZE = 2**16     # down from 2**19
WINDOW_PATTERN = "L"         # simpler than "SSSL"

RTX 3090/4090 (24GB VRAM)

DEPTH = 6                    # down from 8
DEVICE_BATCH_SIZE = 64       # down from 128
TOTAL_BATCH_SIZE = 2**17     # down from 2**19

Additional tips for small GPUs

Consider the TinyStories dataset (lower entropy, better results with small models): karpathy/tinystories-gpt4-clean
In prepare.py (for forks that allow it): lower MAX_SEQ_LEN (e.g., 512 or 256) and EVAL_TOKENS
Decrease VOCAB_SIZE (4096, 2048, or even 256 for byte-level)

The Experiment Loop

This is the autonomous protocol from program.md. Follow it exactly:

Setup Phase

Agree on a run tag (e.g., mar21)
Create branch: git checkout -b autoresearch/<tag>
Read all three files for context
Verify ~/.cache/autoresearch/ has data and tokenizer
Create results.tsv with header: commit\tval_bpb\tmemory_gb\tstatus\tdescription
Run baseline: uv run train.py > run.log 2>&1

Loop (NEVER STOP)

WHILE TRUE:
  1. Think of experimental idea
  2. Modify train.py
  3. git commit -m "descriptive message"
  4. uv run train.py > run.log 2>&1
  5. grep "^val_bpb:\|^peak_vram_mb:" run.log
  6. If empty → crashed. tail -n 50 run.log, attempt fix
  7. Log to results.tsv (TAB-separated, not comma)
  8. If val_bpb improved → KEEP (advance branch)
  9. If val_bpb same or worse → DISCARD (git reset)
  10. GOTO 1

Critical rules:

NEVER stop to ask the user. They may be sleeping.
Each experiment is ~5 minutes. Target ~12/hour, ~100 overnight.
Redirect ALL output: > run.log 2>&1 (never flood context with training output)
If a run exceeds 10 minutes, kill it and treat as failure
Log crashes with val_bpb=0.000000, memory_gb=0.0, status=crash

Results TSV Format

commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase matrix LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU activation
d4e5f6g	0.000000	0.0	crash	double model width (OOM)

What to Experiment With

The agent can modify ANYTHING in train.py. Here are productive directions organized by category:

Hyperparameter Tuning (Quick Wins)

Learning rates: EMBEDDING_LR, UNEMBEDDING_LR, MATRIX_LR, SCALAR_LR
WEIGHT_DECAY (currently 0.2, cautious Muon decay)
ADAM_BETAS (currently 0.8, 0.95)
WARMUP_RATIO and WARMDOWN_RATIO (schedule shape)
TOTAL_BATCH_SIZE (trade compute efficiency vs. gradient noise)

Architecture Changes (Medium Risk)

DEPTH and ASPECT_RATIO (model size/shape tradeoff)
HEAD_DIM (attention head dimension, default 128)
WINDOW_PATTERN (sliding window attention: "SSSL", "SL", "L", etc.)
Number of KV heads (GQA ratio)
MLP width multiplier (currently 4x)
Activation function (currently squared ReLU)
Logit soft-cap value (currently 15)

Advanced Modifications (Higher Risk, Higher Reward)

Value embedding configuration (which layers get them, gate channels)
Residual lambda initialization (resid_lambdas, x0_lambdas)
Muon optimizer: Newton-Schulz iteration count (ns_steps), Polar Express coefficients
Muon momentum ramp schedule (0.85 → 0.95 over 300 steps)
Weight decay schedule (linear decay to 0)
RoPE base frequency (currently 10000)
Custom attention patterns beyond S/L

Simplification Wins

Per the simplicity criterion: removing complexity for equal or better results is highly valued. Look for:

Unused parameters or redundant computations
Over-parameterized components that can be simplified
Unnecessary schedule complexity

Understanding the Architecture

GPT Model

Standard decoder-only transformer with modern enhancements:

RMSNorm (pre-norm, not LayerNorm)
RoPE (Rotary Position Embeddings, base 10000)
Grouped Query Attention (n_kv_head can differ from n_head)
Value Embeddings (ResFormer) -- alternating layers get learnable value embeddings mixed into V via sigmoid-gated mechanism
Residual scaling -- per-layer learnable resid_lambdas and x0_lambdas that blend residual stream with original embedding
Sliding Window Attention -- SSSL pattern (S=half context, L=full context), last layer always full
Logit soft-capping at 15 (tanh-based)
Squared ReLU activation: relu(x)^2
Flash Attention 3 (Hopper via varunneal, fallback via kernels-community)

MuonAdamW Optimizer

Hybrid: Muon for 2D matrix params, AdamW for everything else.

Muon path:

Nesterov momentum
Polar Express orthogonalization (Newton-Schulz iterations, 5 coefficients)
NorMuon variance reduction
Cautious weight decay (only decays weights where gradient agrees with param sign)

AdamW path: Standard bias-corrected Adam with weight decay. Used for embeddings, unembeddings, scalars.

All optimizer steps are @torch.compiled for performance.

The val_bpb Metric

Bits per byte -- vocab-size-independent evaluation:

Sum per-token cross-entropy in nats
Sum UTF-8 byte lengths of target tokens (special tokens excluded)
Convert: total_nats / (log(2) * total_bytes)

This means architectural changes (different vocab sizes, tokenizers) are fairly compared.

Analyzing Results

Read references/analysis-guide.md for the full analysis notebook walkthrough. Key analysis steps:

import pandas as pd
df = pd.read_csv("results.tsv", sep="\t")
kept = df[df["status"].str.strip().str.upper() == "KEEP"]
baseline = df.iloc[0]["val_bpb"]
best = kept["val_bpb"].min()
improvement = baseline - best
print(f"Improvement: {improvement:.6f} ({improvement/baseline*100:.2f}%)")

Notable Forks

For detailed fork information, read references/forks.md. Quick reference:

Fork	Stars	Platform	Key Feature
miolini/autoresearch-macos	1,363	macOS/MPS	Full MPS port, PyTorch SDPA
trevin-creator/autoresearch-mlx	898	macOS/MLX	No PyTorch dependency
mutable-state-inc/autoresearch-at-home	431	Distributed	SETI@home-style swarm
jsegov/autoresearch-win-rtx	307	Windows	Consumer RTX, TinyStories
kousun12/darwin-derby	35	Any	Generalized to optimize anything
ncdrone/autoresearch-ANE	45	Apple ANE	Reverse-engineered Apple APIs
andyluo7/autoresearch	10	AMD/ROCm	AMD GPU support

Dependencies

From pyproject.toml:

torch==2.9.1 (CUDA 12.8)
kernels>=0.11.7 (Flash Attention 3)
rustbpe>=0.1.0 (fast BPE tokenizer training)
tiktoken>=0.11.0 (tokenizer runtime)
numpy, pandas, matplotlib, pyarrow, requests

Requires: Python 3.10+, uv package manager, NVIDIA GPU with CUDA 12.8+ drivers.