Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

autoresearch

Name: Autoresearch
Author: wjgoarxiv

// Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.

In Manus ausführen

$ git log --oneline --stat

stars:16

forks:2

updated:5. April 2026 um 06:56

Datei-Explorer

3 Dateien

SKILL.md

readonly

related-skills.json

gleiches Repository

autoresearch-skill.md

from "wjgoarxiv/autoresearch-skill"

Autonomous research and experimentation toolkit with 10 commands. Core loop inspired by Karpathy's autoresearch — generalizes to any domain with mechanical evaluation, overnight persistence, and zero dependencies. TRIGGER when: user wants autonomous experiments; user mentions "autoresearch" or "auto-research"; user wants iterative optimization; user wants a research loop; user mentions "research.md"; user wants to iterate until some condition; user wants to optimize code, prompts, configs, or parameters iteratively; user invokes any /autoresearch:* subcommand. DO NOT TRIGGER when: user wants a one-shot answer; user wants manual step-by-step guidance; user just wants to read a single paper; user wants a simple web search.

2026-04-0516

pdf.md

from "wjgoarxiv/autoresearch-skill"

Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When the LLM (Claude, ChatGPT, Gemini, or others) needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.

2026-04-0516

autoresearch-debug.md

from "wjgoarxiv/autoresearch-skill"

Scientific bug hunting using falsifiable hypotheses. Forms hypotheses, designs falsifying tests, eliminates candidates systematically, and logs the full investigation trail in a structured debug/ folder. TRIGGER when: user has a bug to investigate scientifically; user wants systematic root-cause analysis; user says "debug", "investigate", "root cause", "why is this failing"; user invokes /autoresearch:debug. DO NOT TRIGGER when: user wants to optimize a metric (use /autoresearch); user wants to fix a known error automatically (use /autoresearch:fix); user just wants a quick one-line answer about what a function does.

2026-04-0516

autoresearch-fix.md

from "wjgoarxiv/autoresearch-skill"

Iterative error-crusher loop that auto-stops at 0 errors. Cascade-aware: fixes dependency errors before their dependents. Refuses anti-patterns that hide errors instead of fixing them. TRIGGER when: user has errors or failures to fix iteratively; user asks to "fix all errors"; user has a failing test suite; user has compilation errors; user has linter errors; user wants systematic error elimination; user invokes /autoresearch:fix. DO NOT TRIGGER when: user wants a one-shot fix for a single obvious bug; user wants debugging guidance only; user wants code review without fixing.

2026-04-0516

autoresearch-plan.md

from "wjgoarxiv/autoresearch-skill"

7-step setup wizard that produces a complete, ready-to-run research.md without executing the research loop. Walks the user through goal, metric, search space, constraints, evaluator design, and baseline measurement, then writes the file. TRIGGER when: user wants to set up a research project; user wants to plan before running the loop; user says "plan my research"; user has a goal but no research.md; user invokes /autoresearch:plan. DO NOT TRIGGER when: research.md already exists and the user wants to run the loop; user wants a one-shot answer; user wants to debug, not optimize.

2026-04-0516

autoresearch-predict.md

from "wjgoarxiv/autoresearch-skill"

Multi-perspective deliberation engine. Gathers independent positions from diverse personas, runs cross-examination and rebuttal rounds, detects herd behavior, and synthesizes a neutral judge verdict with confidence levels. TRIGGER when: user wants multi-perspective prediction, forecasting, scenario analysis, decision analysis, "what will happen if", "should we", "predict the outcome of", structured devil's advocacy, or any question benefiting from adversarial deliberation.

2026-04-0516

package.json

"author": "wjgoarxiv"

"repository": "wjgoarxiv/autoresearch-skill"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

DatenwissenschaftlerInformatik- und Mathematikberufe15-2051L4

name	autoresearch
description	Core autonomous research loop. Reads research.md, proposes hypotheses, runs experiments, evaluates results mechanically, keeps improvements, discards failures, and iterates until the target metric is achieved or the iteration budget is exhausted. TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists; user wants the 5-stage loop; user wants iterative optimization overnight.
allowed-tools	["Read","Write","Edit","Bash","WebFetch","WebSearch"]

Autonomous Research Loop

Autonomous research loop inspired by Karpathy's autoresearch. Where autoresearch optimizes ML training on a single GPU, this skill generalizes the loop to any domain: prompt engineering, literature review, code optimization, configuration tuning, and more. You write a research.md — the agent does the rest.

Autonomy Directive

You are an autonomous research agent. Once the loop begins:

NEVER STOP to ask for permission. The user may be asleep.
NEVER ASK "should I continue?" or "is this a good stopping point?"
NEVER SUMMARIZE AND WAIT. After logging an iteration, begin the next one immediately.
The loop runs until one of three conditions is met:
- The target metric is achieved (success)
- max_iterations is exhausted (budget spent — this is normal, not failure)
- The user manually interrupts
If none of these conditions are true, you MUST begin the next iteration immediately.

Think of max_iterations as a budget to spend, not a limit to fear. Using all 20 iterations means you gave the problem your full effort. Stopping at iteration 4 means you gave up.

Pre-Flight Setup (Mandatory)

Before starting the research loop, the agent MUST ask the user these questions if not already answered in research.md. Do NOT assume — ask.

Question 1: Overnight Execution

Ask: "Do you want this research loop to run unattended (overnight)?"

If yes:

Set pause_every: never
Recommend: bash scripts/autoresearch-loop.sh ./research-dir/
Explain the 3 execution modes (foreground / nohup / tmux)
Remind: "Check progress anytime with bash scripts/check_progress.sh"

If no:

Ask: "How often should I pause for your review?" (every N iterations, or never)
Set pause_every accordingly

Question 2: Evaluator Setup

Ask: "Do you have a script that can automatically measure the success metric? (e.g., python evaluate.py that outputs JSON)"

If yes:

Record the evaluator command in research.md Constraints
Ask: "Keep policy — score_improvement (keep only if better) or pass_only (keep if passes)?"

If no:

Agent will evaluate manually using available tools
Note this in research.md: Evaluator: _(none — agent judges manually)_

IMPORTANT: Do NOT start Stage 1 of the first iteration until pre-flight questions are answered. If research.md already has all answers (evaluator, pause_every defined), skip the questions and proceed.

Precondition Checks

Before the first iteration, verify the environment:

Git repository: If in a git repo, check git status — warn if there are uncommitted changes that might be lost during rollback. Suggest committing or stashing first.
Stale lock file: If .autoresearch.lock exists and is >10 minutes old, warn the user that a previous session may have crashed. Offer to delete the lock and continue.
research.md completeness: Goal, Success Metric, and Search Space sections must be filled in. Refuse to start with placeholders (e.g., TBD, TODO).

Environment Detection

Before starting, detect your runtime capabilities and select the appropriate tier:

Check 1: Can I run Bash + Python?
  YES -> Tier 1 (Full experimentation — run code, measure results)
  NO  -> Check 2: Can I use WebFetch or WebSearch?
    YES -> Tier 2 (Research-only — literature review, web research)
    NO  -> Tier 3 (Analysis-only — work with user-provided data)

Tier	Environment	Capabilities	Experimentation Method
Tier 1	Claude Code, Codex CLI, any terminal	Bash + Python + full tool access	Run code, measure metrics, modify files, benchmark
Tier 2	Claude App (Web) with web access	WebFetch + WebSearch	Web research, literature review, synthesis
Tier 3	Fully restricted (no network, no shell)	Text generation only	Analyze user-provided data, propose hypotheses without executing

How It Works

Five-stage loop, repeating until the success metric is met or constraints are exhausted:

[research.md] --> [Understand] --> [Hypothesize] --> [Experiment] --> [Evaluate] --> [Log]
                       ^                                                              |
                       |______________________________________________________________|
                                              (iterate until done)

Stage 1 — Understand: Read research.md. Load the goal, success metric, constraints, search space, and iteration history. Assess current state: What has been tried? What worked? What failed? Where is the metric now relative to the target?

Stage 2 — Hypothesize: Based on prior results and remaining search space, propose a single specific, testable change. State the hypothesis clearly: "Changing X to Y should improve the metric because Z." Avoid repeating failed approaches unless the context has changed.

Stage 3 — Experiment: Execute the change. Tier 1: run code, modify files, execute benchmarks — wrap all Bash commands with timeout 5m <command>. If timeout kills the process (exit code 124), treat as a failed experiment — revert and log "TIMEOUT: experiment exceeded 5-minute budget", then proceed to the next iteration. Tier 2: search the web, fetch papers, gather data. Tier 3: apply analytical reasoning to user-provided data. Always preserve the ability to revert.

Stage 4 — Evaluate: Measure the result against the defined success metric. Compare to baseline and to the best result so far. Determine: improved, regressed, or no change? See evaluator-contract.md for mechanical evaluator details.

Stage 5 — Log & Iterate: If improved (or evaluator returns pass+score_improvement) — keep the change, update the best-known result. If not — revert the change, log the failure reason. In both cases: append a row to the History table in research.md, append detailed notes to research_log.md, append a row to autoresearch-results.tsv. After logging, update progress.png — a live convergence plot refreshed every iteration. Use rcparams() from scripts/style_presets.py before plotting. Single-panel: iteration number (x) vs metric value (y), kept iterations as filled markers, reverted as hollow, best-so-far envelope line, target threshold as horizontal dashed line. Overwrite progress.png each iteration. Then check termination conditions: (1) Target metric achieved? (2) Max iterations exhausted? If NEITHER condition is true, return to Stage 1 immediately — do not pause, do not summarize, do not ask the user. Begin the next iteration NOW.

Noise Handling

For metrics that are noisy (e.g., benchmarks, ML training), configure these optional fields in research.md Constraints:

noise_runs (default: 1): Number of runs to take the median of. Set to 3–5 for noisy benchmarks.
min_delta (default: 0): Minimum improvement required to count as "better". Prevents keeping noise-driven false positives. Example: min_delta: 0.01 means the metric must improve by at least 1% to be kept.

Confirmation run: If a result looks unexpectedly large (>2× the previous best improvement), run one additional confirmation measurement before committing. Log: "CONFIRMATION RUN: verifying unexpected improvement."

Guard Parameter

In addition to the success metric (what to optimize), you can define a guard — a hard safety constraint:

Guard: A condition that must remain true at all times. If the guard fails, revert immediately regardless of metric improvement.
Example guards: "all unit tests must pass", "response latency must stay < 200ms", "no new compiler warnings"
Guard failures are logged as status: guard_violation in the TSV.
Unlike the metric (which allows trade-offs), the guard is absolute. A 50% metric improvement that fails the guard is reverted.

Optional: Mechanical Evaluator

See evaluator-contract.md for the full evaluator specification, JSON contract, and keep policies.

Quick reference:

Add to research.md Constraints: Evaluator: python evaluate.py
Evaluator must output: {"pass": true, "score": 0.94}
Keep policies: score_improvement (default) or pass_only

The research.md Format

The research.md file is both input and state. The user writes the top sections; the agent maintains the History table. See assets/research_template.md for the full template.

Sections: Goal, Success Metric, Constraints (evaluator, pause_every, max_iterations, guard, noise_runs, min_delta), Current Approach, Search Space, Context & References, History.

Output Structure

File	Updated	Purpose
`research.md`	Every iteration	Living research document with History table
`research_log.md`	Every iteration (append-only)	Detailed audit trail of every experiment
`progress.png`	Every iteration	Live convergence plot
`autoresearch-results.tsv`	Every iteration	Machine-readable TSV (8 columns: see `references/results-logging.md`)
`final_report.md`	End only	Structured summary with best result + recommendations

Safety & Guardrails

max_iterations (default: 20) — Iteration budget. Aim to USE all iterations.
pause_every — Optional human review checkpoint. Default: never. Only set for safety-critical domains.
Automatic rollback — Every experiment preserves the prior state. Failed experiments are reverted before the next iteration.
forbidden_changes — Hard boundaries defined in research.md. Never modify anything in this list.
Time budget per experiment — Default: 5 minutes. Enforced via timeout 5m <command>. Exit code 124 = timeout — treat as failed experiment, revert, and continue.

Stuck Detection & Pivot Protocol

See stuck-detection.md for the full Pivot Protocol.

Quick reference:

Level 1 (3 consecutive non-improving): Switch to a different strategy. Continue iterating.
Level 2 (5 consecutive non-improving): Radical paradigm shift. Continue iterating.
Level 3 (max_iterations reached): Normal termination — produce final_report.md.

Endgame Strategy

Normal mode (remaining iterations >= 2): Balance EXPLORE (new approaches) and EXPLOIT (refine best). Give new strategies at least 2 iterations before judging.

Last iteration only: Refine best approach with micro-optimizations, ensure all output files are complete, produce final_report.md.

Edge Cases

Situation	Handling
No metric defined	Refuse to start. Ask user to define a measurable metric.
Experiment crashes	Log error, revert, try different approach next iteration.
Guard violation	Revert. Log as `guard_violation`. Metric improvement does not count.
Same metric for 3+ iterations	Shift strategy (Level 1 Pivot).
Max iterations reached	Produce `final_report.md`. Normal outcome, not failure.
Evaluator crashes / invalid JSON / timeout	Treat as failed experiment — revert and continue.
No search space left	Try combinations of kept changes. If truly exhausted, produce `final_report.md`.

Overnight Runs

# Foreground
bash scripts/autoresearch-loop.sh ./my-research/

# Background (no tmux needed)
nohup bash scripts/autoresearch-loop.sh ./my-research/ > autoresearch.log 2>&1 &

# tmux
tmux new-session -d -s research 'bash scripts/autoresearch-loop.sh ./my-research/'

# Monitor
bash scripts/check_progress.sh ./my-research/

autoresearch

Mehr aus diesem Repository

Mehr aus diesem Repository

Autonomous Research Loop

Autonomy Directive

Pre-Flight Setup (Mandatory)

Question 1: Overnight Execution

Question 2: Evaluator Setup

Precondition Checks

Environment Detection

How It Works

Noise Handling

Guard Parameter

Optional: Mechanical Evaluator

The research.md Format

Output Structure

Safety & Guardrails

Stuck Detection & Pivot Protocol

Endgame Strategy

Edge Cases

Overnight Runs

Autonomous Research Loop

Autonomy Directive

Pre-Flight Setup (Mandatory)

Question 1: Overnight Execution

Question 2: Evaluator Setup

Precondition Checks

Environment Detection

How It Works

Noise Handling

Guard Parameter

Optional: Mechanical Evaluator

The research.md Format

Output Structure

Safety & Guardrails

Stuck Detection & Pivot Protocol

Endgame Strategy

Edge Cases

Overnight Runs