investigate-autophysicist-run
// Investigates an Autophysicist workspace run. Use to understand what went wrong and could be improved in the single-agent iterative research process.
| name | investigate-autophysicist-run |
| description | Investigates an Autophysicist workspace run. Use to understand what went wrong and could be improved in the single-agent iterative research process. |
| allowed-tools | Read, Grep |
| model | opus |
Read README.md to understand how the Autophysicist research mode works (single-agent iterative loop with ephemeral sub-agents).
Given a workspace directory (under workspaces/ in the PhysicsIntern project — autophysicist runs end in _autophysicist; legacy runs may live under workspaces/autophysicist/), perform a systematic post-mortem analysis of the run. The user may provide a folder name or path; if ambiguous, list available workspaces and ask.
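The workspace lookup described above can be sketched as a small helper. This is a minimal sketch, assuming the project root is passed in explicitly; the function names `list_autophysicist_workspaces` and `resolve_workspace` are hypothetical and not part of the PhysicsIntern code base.

```python
from pathlib import Path


def list_autophysicist_workspaces(project_root):
    """Return candidate Autophysicist run directories under workspaces/."""
    base = Path(project_root) / "workspaces"
    if not base.is_dir():
        return []
    # Current convention: run directory names end in _autophysicist.
    runs = [p for p in base.iterdir()
            if p.is_dir() and p.name.endswith("_autophysicist")]
    # Legacy convention: runs may live under workspaces/autophysicist/.
    legacy = base / "autophysicist"
    if legacy.is_dir():
        runs += [p for p in legacy.iterdir() if p.is_dir()]
    return sorted(runs, key=lambda p: p.name)


def resolve_workspace(project_root, query):
    """Return every workspace matching a user-supplied folder name or path."""
    candidate = Path(query)
    if candidate.is_dir():
        return [candidate]
    return [p for p in list_autophysicist_workspaces(project_root)
            if query in p.name]
```

If `resolve_workspace` returns zero or several matches, the request is ambiguous: print the output of `list_autophysicist_workspaces` and ask the user which run to analyze.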
Check references/ in the project root for a reference document matching the problem. These files describe what a correct answer looks like and what a typical successful run looks like for known problems. The reference is written for the vanilla multi-agent pipeline, but the correct answer and key pitfalls still apply.
Key deliverables:
For any Python code you need to write to analyze the files, use the /tmp folder to write and run temporary files. Do not run directly in the command line. Instead, write a Python script that reads the relevant files, performs the analysis, and prints the results. Then run that script and read its output.
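As an illustration of this write-then-run pattern, a first script might simply report which of the key workspace files are present. This is a sketch only: the script name `/tmp/inspect_workspace.py` is a hypothetical choice, and the file names are the ones listed in this document's workspace table.

```python
"""Hypothetical analysis script, e.g. saved as /tmp/inspect_workspace.py."""
import sys
from pathlib import Path

# File and directory names taken from the workspace table in this document.
KEY_FILES = [
    "problem.yaml", "PROBLEM.md", "ANSWER.md", "PERMANENT_MEMORY.md",
    "SCRATCHPAD.md", "METRICS.md", "VERIFICATION.md", "EVENT_LOG.jsonl",
    "config.json", ".iteration", "console.log",
]
KEY_DIRS = ["logs", "computations"]


def inventory(workspace):
    """Map each key file to its size in bytes and each key directory to
    its file count; missing entries map to None."""
    ws = Path(workspace)
    report = {}
    for name in KEY_FILES:
        path = ws / name
        report[name] = path.stat().st_size if path.is_file() else None
    for name in KEY_DIRS:
        path = ws / name
        if path.is_dir():
            report[name + "/"] = sum(1 for p in path.iterdir() if p.is_file())
        else:
            report[name + "/"] = None
    return report


def print_report(workspace):
    """Print one line per key file or directory."""
    for name, value in inventory(workspace).items():
        status = "MISSING" if value is None else str(value)
        print(f"{name:24s} {status}")


if __name__ == "__main__" and len(sys.argv) > 1:
    print_report(sys.argv[1])
```

Run it as `python /tmp/inspect_workspace.py <workspace_dir>` and read the printed report. A missing ANSWER.md is worth noting immediately, since it means submit_final_answer was never called.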
An Autophysicist workspace contains these key files:
| File | Purpose |
|---|---|
| problem.yaml | Problem definition (problem text, answer_template, and possibly the true answer, which is not visible to the Manager) |
| PROBLEM.md | Problem statement in readable form |
| ANSWER.md | Final answer (present only if submit_final_answer was called) |
| PERMANENT_MEMORY.md | Append-only verified results; the Manager's canonical output |
| SCRATCHPAD.md | Rolling working notes (full history on disk; only the last N entries were visible to the Manager each iteration) |
| METRICS.md | Per-iteration token usage with YAML frontmatter summary |
| VERIFICATION.md | Formal answer evaluation only (correct/incorrect/inconclusive/skipped); no LLM diagnosis |
| EVENT_LOG.jsonl | LLM call metadata + scaffold events |
| config.json | Run configuration (model, budgets, caps) |
| .iteration | Final iteration counter |
| logs/ | Per-call logs: iter{N:03d}_{M:02d}_{agent_name}.md |
| computations/ | Code execution scripts: subagent_iter{N}_{idx}_attempt{M}.py |
| console.log | Raw console output |
Important: Unlike the vanilla pipeline, there is no RESEARCH_GRAPH.json, no RESEARCH_STATE.md, no EVIDENCE_LOG.md, no CRITIQUE_LOG.md, and no automated diagnosis in VERIFICATION.md. The entire research narrative must be reconstructed from PERMANENT_MEMORY.md, SCRATCHPAD.md, EVENT_LOG.jsonl, and the log files.
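Because the narrative has to be rebuilt from individual files, it helps to index logs/ and computations/ by iteration first. The sketch below parses the two filename patterns given in the table; the function names are hypothetical, and the regular expressions assume the patterns hold exactly as documented.

```python
import re
from collections import defaultdict
from pathlib import Path

# logs/iter{N:03d}_{M:02d}_{agent_name}.md
LOG_RE = re.compile(r"^iter(\d+)_(\d+)_(.+)\.md$")
# computations/subagent_iter{N}_{idx}_attempt{M}.py
ATTEMPT_RE = re.compile(r"^subagent_iter(\d+)_(\d+)_attempt(\d+)\.py$")


def index_logs(workspace):
    """Map iteration -> list of (call_index, agent_name), in call order."""
    by_iter = defaultdict(list)
    for path in (Path(workspace) / "logs").glob("iter*.md"):
        m = LOG_RE.match(path.name)
        if m:
            by_iter[int(m.group(1))].append((int(m.group(2)), m.group(3)))
    return {n: sorted(calls) for n, calls in sorted(by_iter.items())}


def index_attempts(workspace):
    """Map (iteration, sub-agent index) -> sorted list of attempt numbers."""
    attempts = defaultdict(list)
    for path in (Path(workspace) / "computations").glob("subagent_iter*.py"):
        m = ATTEMPT_RE.match(path.name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            attempts[key].append(int(m.group(3)))
    return {k: sorted(v) for k, v in sorted(attempts.items())}
```

Sub-agents with more than one entry in `index_attempts` needed code retries, which makes them natural starting points when reading the computation scripts.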
Manager logs (iter{N:03d}_01_manager.md): Contains the full agentic conversation for one iteration:
- <SYSTEM_PROMPT>: the Research Manager system prompt (same every iteration)
- <TOOLS>: available tool definitions
- <USER_MESSAGE>: contains iteration number, problem statement, permanent memory contents, and visible scratchpad entries
- <ROUND n="N"> blocks, each containing:
  - <LLM_RESPONSE>: the Manager's reasoning and decisions (with token counts, duration, stop reason)
  - <TOOL_CALL name="...">: the tool invocation with JSON arguments
  - <TOOL_RESULT name="..." duration="..." status="...">: the tool's response

The dispatch_subagent tool calls are especially important: the JSON arguments contain system_prompt and user_message (revealing how the Manager designed the sub-agent) and the TOOL_RESULT contains the sub-agent's response wrapped in <subagent_reasoning>, <code>, and <execution_output> tags.
Sub-agent logs (iter{N:03d}_{M:02d}_subagent_iter{N}_{idx}.md): Contains:
- <SYSTEM_PROMPT>: whatever the Manager wrote
- <USER_MESSAGE>: the task the Manager assigned
- <LLM_RESPONSE>: the sub-agent's full response

For code-execution sub-agents with retries, there may be additional log files with _retry{K} suffixed agent names.
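To review sub-agent dispatches without scrolling through whole Manager logs, the tool calls can be pulled out with a regular expression. This is a sketch under two assumptions that are not confirmed by the log format description: that each block is closed by a matching `</TOOL_CALL>` tag, and that the block body is a single JSON object. The helper names are hypothetical.

```python
import json
import re

# Assumption: every <TOOL_CALL name="..."> block ends with </TOOL_CALL>.
TOOL_CALL_RE = re.compile(
    r'<TOOL_CALL name="([^"]+)">(.*?)</TOOL_CALL>', re.DOTALL
)


def extract_tool_calls(log_text):
    """Return a list of (tool_name, parsed_args); args is None if the
    block body is not valid JSON."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(log_text):
        try:
            args = json.loads(body.strip())
        except json.JSONDecodeError:
            args = None
        calls.append((name, args))
    return calls


def dispatch_summaries(log_text):
    """Summarize each dispatch_subagent call: prompt sizes and code flag."""
    summaries = []
    for name, args in extract_tool_calls(log_text):
        if name == "dispatch_subagent" and isinstance(args, dict):
            summaries.append({
                "system_prompt_chars": len(args.get("system_prompt") or ""),
                "user_message_chars": len(args.get("user_message") or ""),
                "execute_code": bool(args.get("execute_code", False)),
            })
    return summaries
```

Very short `system_prompt` or `user_message` values are a quick signal of under-specified dispatches; the full text still has to be read to judge task design.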
After reading the problem statement in problem.yaml:
Read VERIFICATION.md to get the formal answer evaluation result. Unlike the vanilla pipeline, there is no automated diagnosis section — just the verdict (correct/incorrect/inconclusive/skipped) and, if applicable, the candidate-vs-truth comparison.
If a reference document exists in references/ for this problem, read it. The correct answer and key pitfalls apply regardless of which mode produced the run.
Read ANSWER.md if it exists. If the formal evaluation was incorrect, compare the submitted answer against the reference to identify specifically what is wrong (wrong formula, wrong coefficients, wrong functional form, missing terms, etc.).
If no ANSWER.md exists (the Manager never called submit_final_answer), note this as a primary failure — the run did not produce a final answer.
This is the most important file. Read it in full. It is the Manager's accumulated knowledge — the entire output of the research process. Analyze:
Result progression:
Verification discipline:
Self-correction chains:
Memory clarity:
Read the full scratchpad (all entries, not just the windowed view the Manager saw). This reveals:
Strategic evolution:
Stagnation detection:
Context loss:
- The Manager saw only the last N scratchpad entries each iteration (the window size is set in config.json). Did important context scroll off?

System notes:
- Look for SYSTEM NOTE: Iteration N failed with error: entries. These are injected by the scaffold when an iteration crashes. They indicate API failures, premature response endings, or other infrastructure problems.

Using EVENT_LOG.jsonl and the log files, build a per-iteration summary:
For each iteration, determine:
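- Number of sub-agents dispatched with code execution (execute_code: true)
- Number of code-execution retries (_retry entries in EVENT_LOG)
- Memory writes (write_to_permanent_memory and write_to_scratchpad)
- Whether end_turn() or submit_final_answer() was called

Present this as a timeline table, then flag anomalies: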
end_turn()
This is the core of the Autophysicist analysis. For each sub-agent dispatch (visible in Manager log <TOOL_CALL name="dispatch_subagent">):
Task design quality:
- Was the system_prompt specific enough? (e.g., "You are a quantum error correction expert" with a precise task vs. a vague "investigate this")
- Was the user_message well-scoped? (A single, concrete question/task vs. multiple interleaved questions)
- Was execute_code set appropriately? (Computational tasks should use it; pure reasoning tasks should not)

Result utilization:
Verification strategy:
Read EVENT_LOG.jsonl. Events have kind: "llm_call" or kind: "scaffold".
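A first pass over the event log can be automated before reading individual entries. The sketch below assumes one JSON object per line and uses only field names this document mentions (kind, agent, event, input_tokens, output_tokens); the function names and the shape of the returned summary are hypothetical.

```python
import json
import re
from collections import Counter

# Retry calls are identified by the _retry{K} suffix on the agent name.
RETRY_RE = re.compile(r"_retry\d+$")


def load_events(path):
    """Read a JSON Lines file, skipping blank lines."""
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def summarize(events):
    """Count scaffold events by type and total tokens by agent role."""
    scaffold_counts = Counter()
    tokens = {
        role: {"calls": 0, "input_tokens": 0, "output_tokens": 0}
        for role in ("manager", "subagent")
    }
    retry_calls = 0
    for ev in events:
        kind = ev.get("kind")
        if kind == "scaffold":
            scaffold_counts[ev.get("event", "unknown")] += 1
        elif kind == "llm_call":
            agent = ev.get("agent", "")
            role = "manager" if agent == "manager" else "subagent"
            tokens[role]["calls"] += 1
            tokens[role]["input_tokens"] += ev.get("input_tokens") or 0
            tokens[role]["output_tokens"] += ev.get("output_tokens") or 0
            if RETRY_RE.search(agent):
                retry_calls += 1
    return {
        "scaffold_counts": dict(scaffold_counts),
        "tokens": tokens,
        "retry_calls": retry_calls,
    }
```

High counts for api_retry or tool_output_truncation in `scaffold_counts` point to infrastructure problems worth checking before attributing a failure to the Manager's reasoning.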
LLM call entries (kind: "llm_call"):
- agent: "manager" - Manager rounds within an iteration. Multiple per iteration. Track round number, input_tokens, output_tokens, reasoning_tokens, duration_s, stop_reason.
- agent: "subagent_iter{N}_{M}" - Sub-agent calls. round: 0 (one-shot). Note system_prompt_chars and user_content_chars to gauge context size.
- agent: "subagent_iter{N}_{M}_retry{K}" - Code execution retry. Indicates the sub-agent's code failed at least once.

Scaffold events (kind: "scaffold"):
- event: "iteration_failed" - Iteration crashed. Read detail for the error message.
- event: "api_retry" - API call needed retry (transient provider error). Frequent retries suggest unreliable infrastructure.
- event: "tool_output_truncation" - A sub-agent's response was truncated before being returned to the Manager. The Manager may have received incomplete information.
- event: "tool_call_failure_fallback" - Tool calling broke; LLM fell back to text-only.
- event: "empty_end_turn_recovery" - Manager produced empty response; recovery attempted.
- event: "text_end_turn_recovery" - Manager ended turn via text instead of tool call; recovery attempted.
- event: "ready_conclude_recovery" - Manager signaled readiness to conclude without calling exit tool.
- event: "context_too_long_fallback" - Context exceeded provider limit.
- event: "progress_check" - Manager was reminded to wrap up after many consecutive tool calls.
- event: "forced_final_call" - Manager exhausted max rounds; forced text-only final response.
- event: "forced_exit_tool_retry" - Forced final call didn't produce an exit tool; retrying.
- event: "tool_timeout" - A tool call (likely code execution) timed out.

Token patterns from METRICS.md:
- Ratio of reasoning_tokens to answer_tokens: a very high ratio may indicate the model spending excessive time in hidden reasoning.

If the run involved execute_code=True sub-agents:
- Read scripts in computations/ to assess quality.
- Multiple attempts (_attempt1.py, _attempt2.py, _attempt3.py): what went wrong in earlier attempts?
- Timeouts (tool_timeout scaffold events)

If the Manager called submit_final_answer:
If the Manager never called submit_final_answer:
For every failure or significant inefficiency, trace it to its root cause using this framework specific to the Manager + sub-agent architecture:
- Read the system_prompt and user_message from the <TOOL_CALL> to assess sub-agent task design.
- Calling submit_final_answer before adequate verification.
- Sub-agent output truncated (tool_output_truncation events) or missing key parts.
- api_retry events, iteration_failed errors. Check if these caused lost progress.
- tool_output_truncation events causing the Manager to receive incomplete information.

Based on the above analysis, provide specific, actionable insights in these categories:
- Changes to the prompt (src/physics_intern/autophysicist/prompt.md) that would have prevented the observed failures, e.g., stronger guidance on verification protocol, better instructions for memory management, explicit anti-patterns to avoid.