investigate-run
// Investigates PhysicsIntern workspace run. Use to understand what went wrong and could be improved in the multi-agent research process.
| name | investigate-run |
| description | Investigates PhysicsIntern workspace run. Use to understand what went wrong and could be improved in the multi-agent research process. |
| allowed-tools | Read, Grep |
| model | opus |
Read README.md to understand how the multi-agent research process works.
Given a workspace directory (under workspaces/ in the PhysicsIntern project), perform a systematic post-mortem analysis of the run and its failure modes and inefficiencies. The user may provide a folder name or path; if ambiguous, list available workspaces and ask.
Check references/ in the project root for a reference document matching the problem. These files describe what a correct answer looks like and what a typical successful run looks like for known problems.
Key deliverables:
For any Python code you need to write to analyze the files, use the /tmp folder to write and run temporary files. Do not run code directly on the command line. Instead, write a Python script that reads the relevant files, performs the analysis, and prints the results. Then run that script and read its output.
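A minimal sketch of this write-then-run pattern, assuming a hypothetical script saved as /tmp/inspect_workspace.py. The script name and the helper `list_workspace_files` are illustrative and not part of PhysicsIntern. It only lists the files in a workspace with their sizes, which is a cheap way to see which artifacts a run produced before reading any of them.

```python
# Hypothetical script, saved as /tmp/inspect_workspace.py (the name is illustrative).
import sys
from pathlib import Path


def list_workspace_files(workspace: Path) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under workspace, sorted by path."""
    return sorted(
        (path.relative_to(workspace).as_posix(), path.stat().st_size)
        for path in workspace.rglob("*")
        if path.is_file()
    )


def main(workspace_arg: str) -> int:
    """Print one line per file; return 0 on success, 1 if the path is not a directory."""
    workspace = Path(workspace_arg)
    if not workspace.is_dir():
        print(f"not a directory: {workspace}")
        return 1
    for rel_path, size in list_workspace_files(workspace):
        print(f"{size:>10}  {rel_path}")
    return 0


# Only runs when a workspace path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run it as `python /tmp/inspect_workspace.py workspaces/<name>` and read the printed listing rather than exploring the directory interactively.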
A workspace contains these key files:
| File | Purpose |
|---|---|
| problem.yaml | The scientific research problem to be solved, the answer template and possibly the true answer (not visible to the agents) |
| ANSWER.md | Final formatted answer (produced by formatter agent on successful termination) |
| VERIFICATION.md | Formal answer evaluation + diagnosis (error/correction chain analysis) |
| RESEARCH_GRAPH.json | Authoritative structured state: hypotheses (with evidence + review), research_questions (with evidence), critiques, failed_approaches with explicit cross-links |
| EVENT_LOG.jsonl | Structured scaffold events (4 categories) and LLM call metadata |
| RESEARCH_STATE.md | Rendered snapshot of the research state (from ResearchState, write-only for git/audit) |
| EVIDENCE_LOG.md | Rendered snapshot of all evidence and review results (from ResearchState, write-only for git/audit) |
| CRITIQUE_LOG.md | Rendered snapshot of all critiques (from ResearchState, write-only for git/audit) |
| logs/ | Per-iteration LLM call logs (XML-tagged: SYSTEM_PROMPT, USER_MESSAGE, ROUND, LLM_RESPONSE, TOOL_CALL, TOOL_RESULT) |
| METRICS.md | Per-iteration token counts and alerts |
Important: RESEARCH_GRAPH.json is the authoritative source of truth. The .md files (RESEARCH_STATE, EVIDENCE_LOG, CRITIQUE_LOG) are rendered snapshots — useful for human reading but derived from the JSON.
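Since the JSON is the source of truth, a first look can be a status tally per collection. The sketch below assumes the four collections named in the file table are top-level keys holding either a list of entries or an ID-keyed dict, and that entries carry a `status` field. That layout is an assumption, so check it against the actual file before relying on the counts.

```python
import json
from collections import Counter
from pathlib import Path

# Collection names come from the file table; their exact JSON layout is an assumption.
COLLECTIONS = ("hypotheses", "research_questions", "critiques", "failed_approaches")


def load_graph(workspace: str) -> dict:
    """Load RESEARCH_GRAPH.json from a workspace directory."""
    path = Path(workspace) / "RESEARCH_GRAPH.json"
    return json.loads(path.read_text(encoding="utf-8"))


def entries(graph: dict, key: str) -> list[dict]:
    """Return the entries under key, whether stored as a list or as an ID-keyed dict."""
    value = graph.get(key, [])
    items = value.values() if isinstance(value, dict) else value
    return [item for item in items if isinstance(item, dict)]


def status_counts(graph: dict) -> dict[str, Counter]:
    """Count entries per status for each collection; a missing status is reported explicitly."""
    return {
        key: Counter(entry.get("status", "<no status>") for entry in entries(graph, key))
        for key in COLLECTIONS
    }
```

Printing `status_counts(load_graph(workspace))` gives a quick sense of how many entities ended up resolved, abandoned, or still open.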
VERIFICATION.md contains two sections produced at different stages:
Formal Answer Evaluation — deterministic symbolic/numerical check of ANSWER.md against ground truth (run by the engine at end of run). Frontmatter has formal_answer: correct/incorrect/inconclusive/skipped.
Diagnosis — a single LLM analysis that traces error/correction chains through the run. If the answer was correct, it focuses on errors that were made and caught (correction chains). If incorrect, it focuses on root-cause failure analysis (failure chains). Each event is classified as CAUGHT, UNCAUGHT, or PARTIAL, with agents involved, root cause, and evidence IDs.
The diagnosis is a useful starting point — read it first. Your job is to go deeper: verify the diagnosis claims against the raw data, investigate events it may have missed, and read the actual agent logs for critical moments.
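As a starting point for that verification, the formal result and a rough count of the classification labels can be pulled out mechanically. The sketch below assumes the frontmatter is a YAML-style block delimited by `---` lines at the top of the file, and that the labels appear literally in the diagnosis text. Both are assumptions about the file layout, and the label count is only a cross-check against what you read, not a substitute for reading it.

```python
import re
from collections import Counter
from typing import Optional

VALID_RESULTS = {"correct", "incorrect", "inconclusive", "skipped"}
# \b keeps CAUGHT from matching inside UNCAUGHT.
LABEL_PATTERN = re.compile(r"\b(UNCAUGHT|CAUGHT|PARTIAL)\b")


def read_formal_answer(text: str) -> Optional[str]:
    """Return the formal_answer value from a leading '---' frontmatter block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of the frontmatter block
            break
        match = re.match(r"\s*formal_answer\s*:\s*(.+)", line)
        if match:
            value = match.group(1).strip().strip("\"'")
            return value if value in VALID_RESULTS else None
    return None


def count_labels(text: str) -> Counter:
    """Count literal CAUGHT / UNCAUGHT / PARTIAL labels in the diagnosis text."""
    return Counter(LABEL_PATTERN.findall(text))
```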
After reading the problem statement in problem.yaml:
Read VERIFICATION.md to get the automated diagnosis. Note the formal answer evaluation result and the error/correction chains identified. This gives you the high-level narrative and the key events to investigate further.
Read RESEARCH_GRAPH.json (the authoritative state, not the markdown files):
Strategy:
Hypothesis integrity:
- Are any hypotheses abandoned? Are they recorded in failed_approaches?
- Check depends_on fields: are dependency chains satisfied for established results?
- Is promotion_justification filled in?

Evidence quality:
- Is the evidence field populated?
- Check type (research vs compute): is the right agent type used for each claim?
- Does approach document the methodology? Are scripts listed?
- Is reasoning substantive?
- Check confidence values (exact/approximate/partial): are they realistic?

Review integrity:
- Is there a review field with verdict: "VERIFIED"?
- Are there hypotheses with review.verdict: "REFUTED" that weren't abandoned?

Research questions:
- Are questions resolved (status: resolved) with resolved_to pointing to WH/ER IDs?
- Is evidence attached (from researcher/computer agents)?

Critique tracking:
- Is iteration_resolved set (not null)?
- For strategy-level critiques (target_id: "STRATEGY"): were they justified?

Failed approaches:
- Are there entries in failed_approaches? Do they correspond to abandoned hypotheses?

Reconstruct the full lifecycle of every entity (RQ, WH, ER) from RESEARCH_GRAPH.json and EVENT_LOG.jsonl. Present this as a structured per-entity timeline so the user can visualize how the research unfolded.
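Several of the hypothesis and review checks from the audit above are mechanical and could be scripted before building the timeline. The sketch below is one way to do it under stated assumptions: `hypotheses` is a list of dicts (or an ID-keyed dict), each entry has an `id`, established results are recognizable by the `ER-` prefix, and `review` is a dict with a `verdict`. None of this layout is confirmed here, so treat any finding as a pointer to read the entry, not as a verdict.

```python
def iter_hypotheses(graph: dict):
    """Yield (id, entry) pairs whether hypotheses is a list or an ID-keyed dict (layout assumed)."""
    raw = graph.get("hypotheses", [])
    if isinstance(raw, dict):
        pairs = list(raw.items())
    else:
        pairs = [(None, item) for item in raw]
    for key, item in pairs:
        if isinstance(item, dict):
            yield str(item.get("id", key or "<no id>")), item


def audit_hypotheses(graph: dict) -> list[tuple[str, str]]:
    """Return (id, finding) pairs for three mechanical integrity checks."""
    findings = []
    for hyp_id, hyp in iter_hypotheses(graph):
        review = hyp.get("review")
        verdict = review.get("verdict") if isinstance(review, dict) else None
        is_established = hyp_id.startswith("ER-")  # assumption: ER- prefix marks established results
        if is_established and verdict != "VERIFIED":
            findings.append((hyp_id, "established without a VERIFIED review"))
        if is_established and not hyp.get("promotion_justification"):
            findings.append((hyp_id, "established with an empty promotion_justification"))
        if verdict == "REFUTED" and hyp.get("status") != "abandoned":
            findings.append((hyp_id, "REFUTED review but not abandoned"))
    return findings
```

An empty result does not prove the run was clean; it only means these three patterns did not occur in the final snapshot.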
Data sources:
- RESEARCH_GRAPH.json: the final snapshot of all entities with their fields (iteration_created, iteration_modified, iteration_resolved, status, evidence, review, resolved_to, depends_on, promotion_justification, etc.)
- EVENT_LOG.jsonl: timestamped events that record when mutations happened (add_research_question, add_hypothesis, promote_hypothesis, abandon_hypothesis, abandon_research_question, resolve_critique, file_critique, er_demotion_safety)

Entity numbering convention: RQ, WH, and ER share a single counter. When an RQ is explored and the result is formulated as a hypothesis, the number carries through: RQ-001 → WH-001 → ER-001. The from_rq field on add_hypothesis events and the resolved_to field on RQs confirm these links.
For entities without the full RQ→WH→ER chain (e.g., WH created directly without an RQ, or RQ that was abandoned), show only the relevant stages.
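Because the three ID families share one counter, the stages of each entity can be grouped by number. The sketch below assumes IDs have the form `RQ-001`, `WH-001`, `ER-001` with a three-digit zero-padded number, as in the examples above. The helper names are illustrative, and the IDs would be collected from both RESEARCH_GRAPH.json and EVENT_LOG.jsonl.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"(RQ|WH|ER)-(\d+)")
STAGE_ORDER = {"RQ": 0, "WH": 1, "ER": 2}


def group_chains(ids) -> dict[int, list[str]]:
    """Map each shared counter value to the stages seen for it, ordered RQ, WH, ER."""
    chains = defaultdict(set)
    for entity_id in ids:
        match = ID_PATTERN.fullmatch(entity_id)
        if match:  # silently skip anything that is not an RQ/WH/ER ID (e.g. CRIT-NNN)
            chains[int(match.group(2))].add(match.group(1))
    return {
        number: sorted(stages, key=STAGE_ORDER.get)
        for number, stages in sorted(chains.items())
    }


def format_chain(number: int, stages: list[str]) -> str:
    """Render only the stages that exist, e.g. 'RQ-001 → WH-001 → ER-001'."""
    return " → ".join(f"{stage}-{number:03d}" for stage in stages)
```

A chain that shows only `WH` was created without an RQ, and one that shows only `RQ` was never turned into a hypothesis; both are cases where only the relevant stages are shown.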
For each Research Question (RQ-NNN):
- When it was created (iteration_created) and the question posed
- Status: open, resolved, or abandoned
- Evidence gathered (evidence field) and by which agent type (research/compute)
- What it resolved to (resolved_to list of WH/ER IDs), when (iteration_resolved), and why (resolution_reason)

For each Working Hypothesis (WH-NNN):
- When it was created (iteration_created) and the claim statement
- Which RQ it came from (resolved_to on RQ-NNN, or from_rq in the add_hypothesis event)
- Was it created directly, without an RQ (no from_rq on add_hypothesis)?
- Dependencies (depends_on): are they satisfied (all dependencies established)?
- When was it modified (iteration_modified)?

For each Established Result (ER-NNN):
- When it was promoted (iteration_modified) and the promotion_justification
- Demotions (er_demotion_safety events in EVENT_LOG.jsonl)
- Dependencies (depends_on): verify the full chain is established

For critiques (CRIT-NNN):
- When it was filed (iteration_filed), severity, target entity or STRATEGY
- When it was resolved (iteration_resolved), resolution text, was it substantive?

For failed approaches:
- Link each failed_approaches entry to the hypothesis that triggered it

After presenting the per-entity timeline, explicitly flag any of these anomalies:
- An ER whose review.verdict is not VERIFIED
- Demotions (er_demotion_safety events)
- A resolved RQ whose resolved_to is empty
- Strategy consistency (check the strategy field in RESEARCH_GRAPH.json against entity statuses)

Read EVENT_LOG.jsonl. Events fall into 4 categories: call_reliability, state_invariants, loop_control, output_normalization.
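A tally by category and by event name is a reasonable first pass over the log. The sketch below assumes each line is a JSON object whose category and event name live under `category` and `event`. Those two key names are hypothetical (the source only confirms `kind: "llm_call"` for LLM call entries), so they are parameters; adjust them after inspecting a few lines of the real file.

```python
import json
from collections import Counter

KNOWN_CATEGORIES = {"call_reliability", "state_invariants", "loop_control", "output_normalization"}


def tally_events(lines, category_key="category", name_key="event"):
    """Count scaffold events per category and per name, skipping blanks, bad JSON and llm_call entries."""
    by_category, by_name = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupted line
        if not isinstance(record, dict) or record.get("kind") == "llm_call":
            continue
        by_category[record.get(category_key, "<unknown>")] += 1
        by_name[record.get(name_key, "<unknown>")] += 1
    return by_category, by_name


def unexpected_categories(by_category: Counter) -> set:
    """Categories outside the four documented ones; usually a sign the key names are wrong."""
    return set(by_category) - KNOWN_CATEGORIES
```

If `unexpected_categories` returns `{"<unknown>"}` for every record, the assumed key names do not match the file and should be corrected before reading anything into the counts.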
State mutations (state_invariants category; the research narrative):
- add_hypothesis: new WH created; check if from_rq and depends_on are noted
- promote_hypothesis: WH→ER promotion; check timing relative to VERIFIED review
- abandon_hypothesis: check if dependents are noted and handled
- resolve_critique: critique resolution; check if resolution text is meaningful
- file_critique: new critique filed; check severity and target
- add_research_question / abandon_research_question: RQ lifecycle tracking
- append_note: research notes added by orchestrator

Validation checks (state_invariants category):
- er_demotion_safety: ER was demoted back to WH due to REFUTED review (1-2 is healthy; 5+ suggests a loop)
- phantom_labels: references to non-existent hypotheses
- stale_unverified_labels: labels promoted/demoted based on review status
- critique_resolution_consistency: resolved critiques that shouldn't be

Loop control events (process health):
- forced_critic: critic was forced because it hadn't run recently
- termination_blocked: orchestrator tried to terminate but was blocked (read the blocker text)
- dispatch_failure: agent dispatch failed (transient error)
- compute_enrichment: prior failure context injected into compute task
- explore_result_suppressed: evidence result was dropped (no evidence or missing target)
- agent_failure_max_tokens: agent hit token limit
- agent_failure_max_rounds: agent exhausted tool-use rounds
- max_tokens_no_retry: one-shot agent hit max_tokens
- no_critiques_filed: critic found nothing to critique (healthy if late in run)
- status_field_exit: run ended via status field check

Call reliability events (LLM interaction health):
- api_retry: API call needed retry (transient errors)
- tool_call_failure_fallback: tool-calling broke, fell back to text-only
- empty_end_turn_recovery: agent produced empty response, recovery attempted
- progress_check: agent was reminded to wrap up after many consecutive execute_python calls
- forced_final_call: agent exhausted rounds, forced text-only final response

LLM call entries (kind: "llm_call"):
- agent, model, input_tokens, output_tokens, duration, round (for agentic calls)

The automated diagnosis (Step 1) identifies the key error/correction chains. Now verify and deepen those claims by reading the actual agent conversations.
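Agent logs can be long, so it may help to build an outline of where each tagged section starts before reading. The sketch below uses the tag names listed for logs/ in the file table and only records tag positions, so it does not depend on how sections nest. That tags may carry attributes and that closing tags use the `</TAG>` form are assumptions about the log format.

```python
import re
from collections import Counter

# Tag names from the logs/ row of the file table.
LOG_TAGS = ("SYSTEM_PROMPT", "USER_MESSAGE", "ROUND", "LLM_RESPONSE", "TOOL_CALL", "TOOL_RESULT")
# Matches <TAG>, </TAG> and <TAG attr="..."> (attributes are an assumption, tolerated if present).
TAG_PATTERN = re.compile(r"<(/?)(" + "|".join(LOG_TAGS) + r")\b[^>]*>")


def tag_outline(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, 'open' or 'close', tag name) for every structural tag, in file order."""
    outline = []
    for match in TAG_PATTERN.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        kind = "close" if match.group(1) else "open"
        outline.append((line_number, kind, match.group(2)))
    return outline


def count_open_tags(text: str) -> Counter:
    """Count opening tags per name, e.g. how many rounds and tool calls a log contains."""
    return Counter(tag for _, kind, tag in tag_outline(text) if kind == "open")
```

The line numbers in the outline make it straightforward to jump to a specific round or tool result with Read instead of scanning the whole file.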
For each event flagged in the diagnosis (CAUGHT, UNCAUGHT, or PARTIAL):
- Read the relevant files in logs/ (e.g., iter003_01_orchestrator.md for iteration 3). Logs use ALL_CAPS XML tags (`<SYSTEM_PROMPT>`, `<ROUND>`, `<LLM_RESPONSE>`, `<TOOL_CALL>`, `<TOOL_RESULT>`, `<USER_MESSAGE>`) to separate log structure from prompt content.

Also check for issues the diagnosis may have missed:
- Check METRICS.md for token usage anomalies (context bloat, max_tokens hits)

For every failure or significant inefficiency, trace it to its root cause: which agent made the mistake, and why? Use the diagnosis chains as your starting point: the automated analysis identifies the "what"; your job is to determine the "why" by reading the actual agent reasoning.
Focus on the questions most relevant to the specific failures found. Skip those that are clearly irrelevant.
Based on the above analysis, list specific insights for improving the multi-agent research process. These can be categorized into: