Name: Recursive Improve
Author: kayba-ai

Stars: 196
Forks: 19
Updated: March 30, 2026, 08:11

Repository: kayba-ai/recursive-improve

name: recursive-improve
description: End-to-end agent improvement pipeline. Analyzes raw execution traces, extracts insights, manages a skillbook, gathers domain context, defines metrics, builds a rubric, creates a prioritized action plan, presents it for review, and implements approved fixes. Trigger when the user says "improve my agent", "run the improvement pipeline", "apply insights", "/recursive-improve", or when eval/traces/ contains trace files.

recursive-improve: Agent Improvement Pipeline

End-to-end pipeline: trace analysis → skill extraction → domain context → metrics → rubric → action plan → review → fixes.

Prerequisites

Traces must exist in eval/traces/. If they don't:

Ask the user for their traces directory
Copy .json, .md, and .toon files into eval/traces/

Skip condition: If eval/stage1_insights_summary.md already exists (from a prior run or from recursive-improve analyze), skip Stages 0 and 1 — go directly to Stage 2.

Stage 0: Trace Analysis

Analyze raw execution traces to extract learnings. This stage adapts ACE's recursive reflector methodology — a structured 6-phase strategy that moves from data discovery through verified deep-dives to synthesized, evidence-backed insights.

Inputs

eval/traces/ — raw trace files (.json, .md, .toon)

Phase 1: Discover

Map the data shape and inventory. Do NOT judge outcomes yet — just catalog what you have.

Read 2-3 trace files. Identify:
- Top-level keys and message schema (3 levels deep)
- Message format: role, content, tool_calls, turn_idx, etc.
- Total trace count and per-trace message counts

Search for agent operating rules, policy, or instructions embedded in the traces; these are often in large strings (>500 chars). Check:
- role: "system" messages
- info.environment_info.policy or similar fields
- Large embedded strings in any field

Build an inventory table:

File                  Messages  Has system prompt?  Has tool calls?
trace_001.json        42        yes                 yes
trace_002.json        18        yes                 no
...

Record discovered rules/policy verbatim — understanding what the agent was supposed to do is essential for evaluating what it actually did.

Phase 2: Derive Evaluation Criteria

Based on your discovery (schema, rules, patterns), define specific evaluation criteria to apply to every trace during the survey phase.

For each criterion, state:

What to look for
What a violation looks like

Example criteria (adapt to what you discovered):

"Agent must verify customer identity before account changes" → violation: account change without prior verification tool call
"Agent must not hallucinate policy details" → violation: agent states a policy that contradicts the embedded rules

Phase 3: Survey

Read ALL traces (if ≤ 20) or a stratified sample (if > 20, target ~15 or 30%, whichever is larger — sample by outcome, length, and complexity).

For each trace, record:

What was requested
What the agent did (key decisions, tool calls, reasoning)
How it ended (success / failure / partial)
Evaluation criteria results (pass / fail / not applicable per criterion)

Process in batches of ~3 traces at a time to manage context.
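The sampling rule and the batching advice above can be sketched in a few lines of Python. The function names are hypothetical, and rounding the 30% figure up is an assumption; the text only says "~15 or 30%, whichever is larger".

```python
def survey_sample_size(total_traces: int) -> int:
    """Traces to read: all of them up to 20, else the larger of 15 and 30%."""
    if total_traces <= 20:
        return total_traces
    # Integer ceiling of 30% (rounding up is an assumption, not stated in the text).
    thirty_percent = (total_traces * 30 + 99) // 100
    return max(15, thirty_percent)


def batches(items: list, size: int = 3):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 40 traces this gives a sample of 15; with 100 traces it gives 30. Choosing which traces fill the sample (by outcome, length, and complexity) is still a judgment call and is not modelled here.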

Phase 4: Categorize

Review all survey summaries. Group by task type and outcome.

Select 2-3 deep-dive targets, prioritizing:

Divergent outcomes — same task type, one succeeded, one failed. What made the difference?
Longest/most complex traces with mistakes — most decision points, most learning potential
Most common failure pattern — highest impact to fix
Confident-but-wrong — traces where the agent's stated reasoning seems worth cross-checking against the data it received
Rule/criteria violations that appeared across traces (even successful ones)

Group targets by root cause — max 2 deep-dives per root cause. Prioritize breadth over depth.

Skip short, simple, routine traces — they rarely yield learnings.

Phase 5: Deep-dive

For each deep-dive target, re-read the FULL raw trace — not your survey summary. Deep-dives that analyze summaries of summaries produce shallow, unverified conclusions.

Two passes per target:

Pass 1 — Verification: Separate what the agent claimed from what data it received. For each key claim or conclusion:

What did the agent claim?
What does the actual data/tool response show?
Does it comply with the discovered rules?
List any incorrect claims: what was claimed, what data shows, impact

This catches "confident but wrong" errors — where the agent proceeds without hesitation based on incorrect reasoning — that behavioral analysis alone misses.

Pass 2 — Root cause analysis: Given the verification findings and the full trace:

What should the agent do differently?
What is the root cause (not just the symptom)?
Is this a missing instruction, a wrong instruction, a code limitation, or a reasoning failure?

For divergent outcomes: compare success and failure traces side by side. What specifically made the difference?

Phase 6: Synthesize

Combine ALL survey summaries with ALL deep-dive results. Do not omit deep-dive findings — they contain your best evidence.

Produce a list of atomic learnings. For each:

Learning: one specific, actionable insight (one concept only)
Atomicity score (0.0–1.0): base 1.0, deduct 0.15 per "and/also/plus", 0.20 per vague term, 0.05 per word over 15
Evidence: cite specific trace details (file name, message index, exact data)
Severity: high (directly causes wrong outcomes), medium (degrades quality), low (minor inefficiency)
Category: code_fix | prompt_fix | process_fix

Verification findings are high-severity — when the agent's reasoning contradicted the data it received, this directly causes wrong outcomes regardless of correct procedure.
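As a rough illustration of the atomicity deductions listed above, the following sketch scores a learning mechanically. The vague-term list is an assumption (borrowed from the Stage 1 rejection filter), as are the clamping to 0.0–1.0 and the simple whitespace word count; a reviewer's judgment may reasonably differ from this number.

```python
CONNECTIVES = {"and", "also", "plus"}
# Assumed list: the scoring rule does not enumerate its vague terms.
VAGUE_TERMS = {"appropriate", "proper", "various"}


def atomicity_score(learning: str) -> float:
    """Base 1.0, minus 0.15 per connective, 0.20 per vague term, 0.05 per word over 15."""
    words = [w.strip(".,;:!?\"'()").lower() for w in learning.split()]
    score = 1.0
    score -= 0.15 * sum(w in CONNECTIVES for w in words)
    score -= 0.20 * sum(w in VAGUE_TERMS for w in words)
    score -= 0.05 * max(0, len(words) - 15)
    # Clamp to the documented 0.0-1.0 range and round for display.
    return round(min(1.0, max(0.0, score)), 2)
```

For example, "Verify customer identity before making account changes" has no deductions and scores 1.0, while "Verify identity and log the appropriate result" loses 0.15 for "and" and 0.20 for "appropriate".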

Output

Write to eval/stage0_trace_analysis.md:

# Trace Analysis

## Discovery
### Trace Format
### Schema
### Agent Rules
### Inventory

## Evaluation Criteria
1. [criterion]: [violation description]
...

## Survey
### [trace_file.json]
- Requested: ...
- Agent did: ...
- Outcome: success/failure/partial
- Criteria: ...

## Categories
### Success patterns
### Failure patterns
### Partial completions

## Deep-dive Targets
### [Target 1: description]
#### Verification findings
#### Root cause analysis
### [Target 2: description]
...

## Extracted Learnings
| # | Learning | Atomicity | Evidence | Severity | Category |
|---|----------|-----------|----------|----------|----------|
| 1 | ...      | 0.92      | ...      | high     | prompt_fix |
| 2 | ...      | 0.87      | ...      | medium   | code_fix   |

Stage 1: Skill Management

Transform raw learnings from Stage 0 into a structured skillbook with quality gates.

Inputs

eval/stage0_trace_analysis.md — extracted learnings from Stage 0
eval/skillbook.json — existing skillbook from a prior improvement cycle (if it exists, load and update it; if not, start fresh)

Step 1: Quality gate — Atomicity

For each learning from Stage 0, verify the atomicity score:

Score	Level	Action
0.95–1.00	Excellent	Accept as-is
0.85–0.94	Good	Accept, minor tightening optional
0.70–0.84	Fair	Split into multiple atomic learnings
0.40–0.69	Poor	Must split before proceeding
< 0.40	Rejected	Discard — too vague or compound

Splitting example:

Compound: "Tool X worked in 4 steps with 95% accuracy" (0.55)
Split into: "Use Tool X for task type Y" (0.95) + "Tool X completes in ~4 steps" (0.92) + "Expect 95% accuracy from Tool X" (0.90)
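The quality gate is a straightforward band lookup. A minimal sketch, with a hypothetical function name and the table's actions collapsed to three labels (accept, split, discard):

```python
def atomicity_gate(score: float) -> tuple[str, str]:
    """Map an atomicity score to (level, action) using the bands in the table."""
    if score >= 0.95:
        return "Excellent", "accept"
    if score >= 0.85:
        return "Good", "accept"
    if score >= 0.70:
        return "Fair", "split"
    if score >= 0.40:
        return "Poor", "split"
    return "Rejected", "discard"
```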

Step 2: Format as imperative commands

Every skill must be an imperative command, not an observation.

BAD: "The agent accurately answers factual questions" (observation)
GOOD: "Answer factual questions directly and concisely" (imperative)
BAD: "Missing verification step caused errors" (observation)
GOOD: "Verify customer identity before making account changes" (imperative)

Step 3: Deduplication

If eval/skillbook.json exists, load it. For each new learning, check whether any existing skill has >70% semantic overlap.

Semantic duplicates (use UPDATE, not ADD):

Existing skill	Duplicate (don't add)
"Answer directly"	"Use direct answers"
"Break into steps"	"Decompose into parts"
"Verify calculations"	"Double-check results"

Step 4: Determine operations

For each learning, select the operation:

Situation	Operation
New error pattern or missing capability	ADD new skill
Existing skill needs refinement	UPDATE with improved content
Existing skill contributed to success in traces	TAG as helpful
Existing skill caused or contributed to error	TAG as harmful
Strategies contradict each other	REMOVE one or UPDATE to resolve
Skill tagged harmful 3+ times	REMOVE
No actionable insight	SKIP

Default to UPDATE over ADD when a similar skill exists.
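The ADD, TAG, and REMOVE operations can be sketched against an in-memory skillbook. This assumes the eval/skillbook.json layout shown under Outputs (skills, sections, next_id). The helper names are hypothetical, the ID pattern is copied from the example entry, and REMOVE is modelled as a hard delete even though the status field suggests a soft delete may be intended.

```python
def add_skill(book: dict, section: str, content: str, evidence: str, justification: str) -> str:
    """ADD: create a new active skill and register it under its section."""
    skill_id = f"section-{book['next_id']:05d}"  # pattern taken from the example entry (assumption)
    book["skills"][skill_id] = {
        "id": skill_id, "section": section, "content": content,
        "evidence": evidence, "justification": justification,
        "helpful": 0, "harmful": 0, "status": "active",
    }
    book["sections"].setdefault(section, []).append(skill_id)
    book["next_id"] += 1
    return skill_id


def remove_skill(book: dict, skill_id: str) -> None:
    """REMOVE: modelled here as a hard delete (assumption)."""
    skill = book["skills"].pop(skill_id)
    book["sections"][skill["section"]].remove(skill_id)


def tag_skill(book: dict, skill_id: str, tag: str) -> None:
    """TAG: bump the helpful or harmful counter; remove after 3 harmful tags."""
    if tag not in ("helpful", "harmful"):
        raise ValueError(f"unknown tag: {tag}")
    skill = book["skills"][skill_id]
    skill[tag] += 1
    if skill["harmful"] >= 3:
        remove_skill(book, skill_id)
```

UPDATE is omitted because it only overwrites the content, evidence, and justification fields of an existing entry.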

Step 5: Rejection filter

Reject any skill that contains:

Meta-commentary (not actionable): "be careful", "consider", "think about", "remember", "make sure"
Observations (not commands): "the agent", "the model" — write commands to follow, not descriptions of behavior
Vague terms: "appropriate", "proper", "various" — too vague to act on
Overgeneralizations: "always", "never" without specific context
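A first-pass version of this filter can be automated with whole-phrase matching. The sketch below uses only the phrases listed above; the function name is hypothetical. It cannot tell whether "always" or "never" comes with specific context, so those hits are better treated as prompts for manual review than as automatic rejections.

```python
import re

REJECT_PHRASES = {
    "meta-commentary": ["be careful", "consider", "think about", "remember", "make sure"],
    "observation": ["the agent", "the model"],
    "vague term": ["appropriate", "proper", "various"],
    "overgeneralization": ["always", "never"],  # needs a manual context check
}


def rejection_reasons(skill: str) -> list[str]:
    """Return one entry per rejected phrase found in the skill text."""
    text = skill.lower()
    reasons = []
    for reason, phrases in REJECT_PHRASES.items():
        for phrase in phrases:
            # Word boundaries avoid flagging e.g. "properly" for "proper".
            if re.search(rf"\b{re.escape(phrase)}\b", text):
                reasons.append(f"{reason}: {phrase!r}")
    return reasons
```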

Step 6: Skillbook size management

If the skillbook exceeds 50 skills:

Prioritize UPDATE over ADD
Merge skills with >70% overlap
Remove lowest-performing skills (most harmful tags, least helpful tags)

Outputs

eval/skillbook.json:

{
  "skills": {
    "section-00001": {
      "id": "section-00001",
      "section": "error_handling",
      "content": "Verify customer identity before making account changes",
      "evidence": "In trace_003.json, agent changed account without verification (msg 12)",
      "justification": "Prevents unauthorized account modifications",
      "helpful": 0,
      "harmful": 0,
      "status": "active"
    }
  },
  "sections": {
    "error_handling": ["section-00001"]
  },
  "next_id": 2
}

eval/stage1_insights_summary.md:

# Insights Summary

Generated by: recursive-improve (Stage 1)
Total insights: N

---

## Insight: {skill_id} — {section}

**Status:** active
**Helpful/Harmful:** 0/0

**Content:**
{imperative skill text}

**Evidence:**
{specific trace evidence}

**Justification:**
{why this improves the agent}

---

Write both files, then proceed to Stage 2.

Stage 2: Domain Context Gathering

Understand the agent's world — what it does, what tools it has, and what "success" looks like.

0. Detect trace format

Read 1 trace file from eval/traces/ and identify the framework:

Signal	Framework
`info.agent_info.implementation`, `simulation.messages[]` with `role`/`tool_calls`/`turn_idx`	tau2-bench
`runs[].steps[]` with `type: "tool"`, `lc_kwargs`	LangChain / LangSmith
`events[]` with `event_type`, `span_id`, `parent_id`	LlamaIndex
`choices[].message.tool_calls[]` at top level	Raw OpenAI API logs
`trace.spans[]` with `attributes`, `trace_id`	OpenTelemetry / Arize / Langfuse

Record the detected format. If unrecognized, note top-level keys and proceed best-effort.
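The signal table can be turned into a best-effort detector. This is a sketch, not a definitive classifier: it checks only one or two of the listed keys per framework, the helper names are hypothetical, and real traces may nest these fields differently.

```python
def dig(obj, *keys):
    """Follow nested dict keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def first_item(value) -> dict:
    """First element of a non-empty list if it is a dict, else an empty dict."""
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]
    return {}


def detect_format(trace: dict) -> str:
    """Guess the trace framework from the signals in the table."""
    if dig(trace, "info", "agent_info", "implementation") is not None \
            or isinstance(dig(trace, "simulation", "messages"), list):
        return "tau2-bench"
    if "steps" in first_item(trace.get("runs")):
        return "LangChain / LangSmith"
    if "event_type" in first_item(trace.get("events")):
        return "LlamaIndex"
    if "message" in first_item(trace.get("choices")):
        return "Raw OpenAI API logs"
    if isinstance(dig(trace, "trace", "spans"), list):
        return "OpenTelemetry / Arize / Langfuse"
    # Unrecognized: report top-level keys so the run can proceed best-effort.
    return "unrecognized (top-level keys: " + ", ".join(sorted(trace)) + ")"
```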

1. Detect architecture

Read 2-3 traces. Determine single-agent vs multi-agent:

Single agent: one conversation thread, tool calls from one identity
Multi-agent: multiple agent_info entries, routing tool calls (transfer_to_*, delegate_to_*), distinct system prompts per agent

If multi-agent: document each agent separately and note routing logic.

2. Find the system prompt

Fallback chain — stop at first hit:

Config files — grep for: system_prompt, system_message, instructions, AGENT_INSTRUCTION, SYSTEM_PROMPT
Source code — search for prompt template strings, f-strings building system messages
Trace extraction — check info.environment_info.policy, first role: "system" message, raw_data fields
Not found — record SYSTEM_PROMPT_STATUS: NOT_FOUND

Record both content and source location.

3. Extract tool definitions

Pass 1 — Source code: Search for @tool, @is_tool, function schema arrays, tools=[]. For each: name, params, return type, side effects (READ/WRITE/GENERIC), unvalidated rules.

Pass 2 — Traces: Read ALL traces (if ≤ 20) or stratified sample. Extract every unique tool_calls[].name and role: "tool" response. Record one example input/output per tool.

Reconcile: Tools in source but not traces = "available but unused". Tools in traces but not source = investigate.
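The reconcile step is a pair of set differences. A minimal sketch, assuming messages are dicts whose tool_calls entries carry a name field as described in Pass 2 (function names are hypothetical):

```python
def tool_names_in_traces(messages: list[dict]) -> set[str]:
    """Collect every unique tool_calls[].name across a list of messages."""
    names = set()
    for message in messages:
        for call in message.get("tool_calls") or []:
            if "name" in call:
                names.add(call["name"])
    return names


def reconcile_tools(source_tools: set[str], trace_tools: set[str]) -> dict[str, set[str]]:
    """Split tools into used, available-but-unused, and needs-investigation."""
    return {
        "used": source_tools & trace_tools,
        "available_but_unused": source_tools - trace_tools,
        "investigate": trace_tools - source_tools,
    }
```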

4. Find domain documentation

READMEs, policy files, inline comments, test files describing expected behavior.

5. Catalogue behavior patterns

Trace selection — stratified sampling (if > 20 traces):

2+ per unique termination_reason
Shortest, longest, 2 median by message count
Lowest and highest by tool call count
3+ of each pass/fail outcome
Target: ~15 traces or 30%, whichever is larger

For each trace, document: function call frequency, tool call sequences, success patterns, failure patterns, error patterns, policy violations, user feedback signals.

6. Write findings

Write to eval/stage2_domain_context.md:

# Domain Context

## Trace Format
## Architecture
## Agent Purpose
## System Prompt
## Tools
## Domain Rules
## Behavior Patterns
### Success patterns
### Failure patterns
### Policy violation patterns
### Error patterns
### User feedback signals

Stage 3: Metrics and Programmatic Analysis

Define metrics from insights, implement as code, run, review, iterate.

Step 0: Run built-in eval first

Before writing custom detectors, run the built-in eval:

recursive-improve eval eval/traces --branch main

Review the generic metrics (loops, give-ups, errors, recovery, clean success). Only write custom detectors in eval/compute_baselines.py for domain-specific metrics the built-in detectors cannot capture.

Inputs

eval/stage1_insights_summary.md
eval/stage2_domain_context.md

Read both before starting.

Process

This stage is iterative, capped at 3 iterations. A metric set is "clean" when:

No small-sample metrics in priority set (denominator ≥ 5 for priority ranking)
No unexplained 0%/100% extremes
No redundant pairs (>70% denominator overlap)
Script runs without errors

Step 1: Define metrics

For each insight, identify observable trace signals. Classify by detector pattern:

Recovery detectors — consecutive calls: first error, next success
Loop detectors — N+ consecutive calls to same function (stuck agent)
Give-up detectors — regex for abandonment phrases ("I'm unable to", "cannot complete")
Error classifiers — match outputs against domain-specific error patterns
Over-exploration detectors — ratio of explore vs action calls exceeding threshold
Ground-truth comparison — agent claims value vs preceding tool response (regex extraction: dollar amounts, IDs, flight numbers, compare against JSON fields)
Ordering/sequencing detectors — tool call A before B when B should come first
Clean success — threads with no errors and no other tags

Validate each detector against 2-3 traces where you know ground truth before coding at scale.
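Three of the detector patterns above are sketched below for illustration. The input shape is an assumption: each call is a dict with a name and a boolean error flag, and assistant text is passed as plain strings. The loop threshold of 3 is also an assumption, since the text only says "N+".

```python
import re

# Abandonment phrases taken from the examples in the list above.
GIVE_UP = re.compile(r"I'm unable to|cannot complete", re.IGNORECASE)


def has_loop(calls: list[dict], n: int = 3) -> bool:
    """True if the same function is called n or more times in a row."""
    run, previous = 0, None
    for call in calls:
        run = run + 1 if call["name"] == previous else 1
        previous = call["name"]
        if run >= n:
            return True
    return False


def count_recoveries(calls: list[dict]) -> int:
    """Count consecutive pairs where a failed call is followed by a successful one."""
    return sum(1 for a, b in zip(calls, calls[1:]) if a["error"] and not b["error"])


def gave_up(assistant_texts: list[str]) -> bool:
    """True if any assistant message contains an abandonment phrase."""
    return any(GIVE_UP.search(text) for text in assistant_texts)
```

Each function returns a raw signal; turning it into a metric still requires choosing a numerator and denominator.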

Step 2: Implement and run

Write eval/compute_baselines.py with:

CLI args: --traces-dir (required), --output (default: eval/baseline_metrics.json)
load_traces(traces_dir) — loads all JSON trace files
tag_thread(thread) — combines all detectors
One measurement function per metric (numerator / denominator)
compute_all_baselines(traces_dir) — runs all, returns dict
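A skeleton matching that layout might look like the following. It is a starting point under stated assumptions: the single error detector and the messages/role/error fields are placeholders for whatever schema Stage 2 discovered, and clean_success_rate stands in for the real per-insight metrics.

```python
import argparse
import json
from pathlib import Path


def load_traces(traces_dir):
    """Load every *.json file in traces_dir as one trace."""
    paths = sorted(Path(traces_dir).glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def tag_thread(thread):
    """Combine all detectors into a set of tags (one placeholder detector here)."""
    tags = set()
    messages = thread.get("messages", [])  # assumed schema
    if any(m.get("role") == "tool" and m.get("error") for m in messages):
        tags.add("error")
    return tags or {"clean_success"}


def clean_success_rate(threads):
    """Example measurement function: numerator / denominator."""
    denominator = len(threads)
    numerator = sum(1 for t in threads if tag_thread(t) == {"clean_success"})
    value = numerator / denominator if denominator else None
    return {"numerator": numerator, "denominator": denominator, "value": value}


def compute_all_baselines(traces_dir):
    """Run every metric and return a dict keyed by metric name."""
    threads = load_traces(traces_dir)
    return {"clean_success_rate": clean_success_rate(threads)}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Compute baseline metrics")
    parser.add_argument("--traces-dir", required=True)
    parser.add_argument("--output", default="eval/baseline_metrics.json")
    args = parser.parse_args(argv)
    results = compute_all_baselines(args.traces_dir)
    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Calling main() under the usual `if __name__ == "__main__":` guard turns this into a command-line script. Storing the numerator and denominator next to each value keeps the small-sample checks in Step 3 easy to apply.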

Run it:

python eval/compute_baselines.py --traces-dir eval/traces --output eval/baseline_metrics.json

Then store the baseline as a benchmark run (for the dashboard):

recursive-improve store-baseline

Step 3: Review and iterate

Check A — Script health. Errors or null values? Fix, re-run (doesn't count toward cap).

Check B — Small-sample guard. Denominator ≥ 5 → full confidence. 1-4 → "directional-only". 0 → broken or genuinely absent.

Check C — Extreme-value triage. For any metric at 0% or 100%: if a plausible non-extreme case exists, treat the detector as broken and fix it. Otherwise, write a justification for the extreme value. Add "at_ceiling": true or "at_floor": true for already-optimal metrics.

Check E — Coverage. Every Stage 1 insight needs a metric. Try hard before classifying as unmeasurable. Only valid unmeasurable reasons: qualitative-only, insufficient-data, needs-ground-truth.
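Checks B and C lend themselves to small helpers that run over each metric's numerator and denominator. The names below are hypothetical, and the labels simply restate the thresholds given above.

```python
def sample_confidence(denominator: int) -> str:
    """Check B: classify a metric by the size of its denominator."""
    if denominator >= 5:
        return "full"
    if denominator >= 1:
        return "directional-only"
    return "broken-or-absent"


def needs_extreme_triage(numerator: int, denominator: int) -> bool:
    """Check C: flag metrics sitting at exactly 0% or 100%."""
    if denominator == 0:
        return False  # already covered by Check B
    return numerator == 0 or numerator == denominator
```

A flagged metric still needs the human step: decide whether the detector is broken or the extreme is justified.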

Design principles

One metric per insight. Fewer metrics than insights = too conservative.
Express as ratio or percentage. No absolute counts.
Prefer per-event denominators over per-thread.
Build a metric for EVERY insight. "Unmeasurable" is a last resort.

Outputs

eval/compute_baselines.py
eval/baseline_metrics.json
eval/benchmark_results.json (stored via recursive-improve store-baseline)

Stage 4: Rubric Definition

Organize metrics into a tiered evaluation rubric.

Inputs

eval/baseline_metrics.json
eval/compute_baselines.py
eval/stage1_insights_summary.md
eval/stage2_domain_context.md

Process

1. Quantitative redundancy check

For every metric pair, check:

Denominator overlap > 70%
Same skill set
Logical subsumption

For each candidate pair: explicit decision (keep both / merge / drop) with reasoning.

Target: 5-7 metrics after redundancy resolution.

2. Tier each metric

Q1: Can a SINGLE skill/instruction change directly move this? → LEADING
Q2: Does it require MULTIPLE skills adopted together? → LAGGING
Q3: Does it require domain reasoning beyond instructions? → QUALITY

If ambiguous, pick the lower tier. Record which question determined the tier.

3. Flag low-confidence baselines

Denominator < 5 → marked with **Confidence: low** (n=X), excluded from priority sorting.

4. Set direction

Up or down. Ceiling guard: 100% baseline → "↑ maintain", never bare "↑". Same for 0% floor.

5. Map insights to metrics

Every insight → Mapped / Indirectly mapped / Qualitative-only. Report coverage counts.

6. Invalidation notes

Per metric: "What would make this tier assignment wrong?"

7. Write the rubric

Write to eval/baseline_metrics.md with summary table, tier definitions, metric details, redundancy analysis, insight coverage.

Stage 5: Action Plan

Triage each insight into discard/code-fix/prompt-fix and produce a prioritized plan.

Inputs

eval/stage1_insights_summary.md
eval/stage2_domain_context.md
eval/baseline_metrics.md
eval/baseline_metrics.json
eval/compute_baselines.py

Process

1. Triage each insight

1a. Validity check — real recurring problem or one-off noise?

1b. "Already handled" verification — grep codebase for key terms, read existing system prompt. Partially covered → keep as strengthening fix. Fully covered AND baseline ≥ 95% → discard.

1c. Code-vs-prompt decision:

For each insight, judge the best fix type on its own merits:

Signal	Fix type
Agent has the information but reasons incorrectly about it	PROMPT FIX — clarify reasoning guidance
Agent lacks instructions for a scenario it hasn't seen	PROMPT FIX — add instructions for the new case
Agent has the instruction but ignores or violates it	CODE FIX — enforce via validation, guardrails, or forced sequencing
Agent lacks a tool, schema, validation, or infrastructure capability	CODE FIX — add or modify code
Agent has partial info and a prompt workaround exists, but a code fix would be more robust	CODE FIX — prefer the durable solution
Fix involves both new instructions and supporting code	BOTH — implement both, note the dependency

Choose the approach that most directly and durably solves the root cause. There is no default — evaluate each case independently.

2. Consolidate related insights

Merge when ALL hold: same target behavior, overlapping fix text (>50%), fixing one fixes >80% of the other.

3. Write specific recommendations

Discards: one sentence why.
Code fixes: file, function, specific change.
Prompt fixes: exact instruction text, where it goes, why this wording.

4. Assess risk per fix

Risk	Definition
None	Additive, no existing behavior affected
Low	Targets currently-failing behavior
Medium	Modifies behavior where some cases already work
High	Rewrites behavior that mostly works

Medium/High: add one-sentence mitigation.

5. Handle qualitative-only insights

Still produce fixes. Use confidence = 0.5 and estimate impact from severity. Note that improvement should be verified via manual trace review.

6. Link to metrics

Each fix → which metric(s) would move.

7. Prioritize

Priority Score = Impact × Confidence × Tier Bonus ÷ Risk Factor

Impact = estimated gap closure
Confidence: n≥20 → 1.0, 10-19 → 0.8, 5-9 → 0.6, <5 → 0.3
Tier Bonus: leading → 1.5x, lagging/quality → 1.0x
Risk Factor: None/Low → 1.0, Medium → 1.5, High → 2.0

After scoring, promote prerequisites even if their standalone score is lower.
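The scoring formula translates directly into code. In this sketch the function names are hypothetical, impact is assumed to be a fraction between 0 and 1, and the optional override exists for qualitative-only insights, which use a fixed confidence of 0.5.

```python
TIER_BONUS = {"leading": 1.5, "lagging": 1.0, "quality": 1.0}
RISK_FACTOR = {"none": 1.0, "low": 1.0, "medium": 1.5, "high": 2.0}


def confidence_from_n(n: int) -> float:
    """Confidence bands by sample size."""
    if n >= 20:
        return 1.0
    if n >= 10:
        return 0.8
    if n >= 5:
        return 0.6
    return 0.3


def priority_score(impact: float, n: int, tier: str, risk: str, confidence=None) -> float:
    """Priority Score = Impact x Confidence x Tier Bonus / Risk Factor."""
    conf = confidence_from_n(n) if confidence is None else confidence
    return impact * conf * TIER_BONUS[tier] / RISK_FACTOR[risk]
```

For instance, a leading-tier fix with impact 0.4, n = 12, and medium risk scores 0.4 × 0.8 × 1.5 ÷ 1.5, or about 0.32. Prerequisite promotion happens after scoring and is not captured by the formula.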

Output

Write to eval/action_plan.md with summary, implementation priority table, per-fix entries, consolidated prompt skills, monitor items.

Stage 6: Human-In-The-Loop Gate

Present the action plan for informed approval.

Inputs

eval/action_plan.md
eval/baseline_metrics.md
eval/baseline_metrics.json
eval/stage1_insights_summary.md

Process

1. Executive summary

Counts: total insights, distinct after dedup, prompt/code/discards, discard reasons.

2. Top 3 highest-impact changes

For each: before/after behavior from actual traces, target metric delta, risk rating.

3. Full prioritized fix list

Table: priority, name, type, target metrics, risk, effort (Low/Medium/High).

4. "What we are NOT fixing and why"

Every discard with: ID, name, reason, what would change your mind.

5. Flag small-sample items

Call out metrics with denominator < 5.

6. Traceability chain

Per fix: insight → metric → fix → expected improvement.

7. Collect decision

Explain the branch workflow: approved fixes will be applied on a dedicated branch (ri/improve-<YYYYMMDD-HHMMSS>), not directly on the current branch. This means:

The user's current branch stays untouched
All changes can be reviewed with git diff main...ri/improve-<timestamp>
The user can open a PR, get CI feedback, and merge when ready
Easy to discard with git branch -D if the fixes aren't right

Three options:

[A] Approve all — create branch, implement all fixes
[B] Approve with modifications — walk through each fix individually (approve/skip/modify), then create branch and implement
[C] Reject — collect feedback, re-run Stage 5

If [B]: update eval/action_plan.md with modifications. Write eval/stage6_decision.md.

Rules

Do NOT auto-approve
Do NOT proceed to fixes until clear approval is recorded
Always present small-sample warnings
Always present the "not fixing" section
Always explain the branch workflow before collecting the decision

Output

eval/stage6_decision.md
eval/action_plan.md (updated if modifications)

Stage 7: Fix Implementation

Implement every approved fix from the action plan.

Inputs

eval/action_plan.md
eval/stage6_decision.md (if exists)
eval/baseline_metrics.json

Pre-flight: Create Improvement Branch

All fixes are applied on a dedicated branch, keeping the user's current branch clean.

git status — verify clean working tree. If there are uncommitted changes, ask the user to commit or stash them first before proceeding.
Record the current branch name (e.g. main) as the base branch.

Create and switch to the improvement branch:

git checkout -b ri/improve-$(date +%Y%m%d-%H%M%S)

Record the branch name in eval/changes_log.md.

Pre-flight: HITL Modification Check

If eval/stage6_decision.md exists, identify user-modified items. Tag them [HITL-MODIFIED] in the changes log.

Pre-flight: Conflict Scan

Build file_path → [fix IDs] map. Flag co-located and overlapping fixes. Plan sequential application in priority order.
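The conflict scan amounts to inverting the fix list into a per-file index. A minimal sketch, assuming each fix is a dict with hypothetical id, priority, and files fields, where a lower priority number means the fix is applied earlier:

```python
from collections import defaultdict


def conflict_scan(fixes: list[dict]) -> tuple[dict, dict]:
    """Return (file_path -> fix IDs in application order, co-located subset)."""
    by_file = defaultdict(list)
    # Sort first so each file's list is already in priority order.
    for fix in sorted(fixes, key=lambda f: f["priority"]):
        for path in fix["files"]:
            by_file[path].append(fix["id"])
    co_located = {path: ids for path, ids in by_file.items() if len(ids) > 1}
    return dict(by_file), co_located
```

Files that appear in the second dict are the ones where fixes must be applied one at a time and re-read between edits.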

Process

For each non-discarded fix in priority order:

Understand — read the recommendation and referenced files
Implement — minimal, targeted change. Code fixes: find file, make the change. Prompt fixes: find system prompt, add instruction. No refactoring beyond what's needed.
Log — append to eval/changes_log.md: type, verdict, files modified, before/after snippets, linked metrics, conflict notes
Handle uncertainty — if unsure, log as NEEDS REVIEW with what's unclear and continue

Post-Fix: Next Steps

After all fixes are applied on the improvement branch:

Ensure eval/benchmark_results.json is present (it was created by store-baseline in Stage 3). If missing, run: recursive-improve store-baseline
Commit all changes on the improvement branch with a descriptive message. Include eval/benchmark_results.json in the commit.
Stay on the improvement branch (do NOT switch back to the base branch).

Rules

Do NOT modify trace files
Do NOT make changes beyond what was recommended
Do NOT run compute_baselines.py (baselines reflect old traces)
Do NOT apply fixes directly on the user's current branch — always use the improvement branch
Make minimal, targeted changes

Output

eval/changes_log.md
The improvement branch with all code/prompt changes

name

recursive-improve

description

recursive-improve: Agent Improvement Pipeline

End-to-end pipeline: trace analysis → skill extraction → domain context → metrics → rubric → action plan → review → fixes.

Prerequisites

Traces must exist in eval/traces/. If they don't:

Ask the user for their traces directory
Copy .json, .md, and .toon files into eval/traces/

Skip condition: If eval/stage1_insights_summary.md already exists (from a prior run or from recursive-improve analyze), skip Stages 0 and 1 — go directly to Stage 2.

Stage 0: Trace Analysis

Inputs

eval/traces/ — raw trace files (.json, .md, .toon)

Phase 1: Discover

Map the data shape and inventory. Do NOT judge outcomes yet — just catalog what you have.

Read 2-3 trace files. Identify:
- Top-level keys and message schema (3 levels deep)
- Message format: role, content, tool_calls, turn_idx, etc.
- Total trace count and per-trace message counts
Search for agent operating rules, policy, or instructions embedded in the traces — these are often in large strings (>500 chars). Check:
- role: "system" messages
- info.environment_info.policy or similar fields
- Large embedded strings in any field

Build an inventory table:

File                  Messages  Has system prompt?  Has tool calls?
trace_001.json        42        yes                 yes
trace_002.json        18        yes                 no
...

Record discovered rules/policy verbatim — understanding what the agent was supposed to do is essential for evaluating what it actually did.

Phase 2: Derive Evaluation Criteria

Based on your discovery (schema, rules, patterns), define specific evaluation criteria to apply to every trace during the survey phase.

For each criterion, state:

What to look for
What a violation looks like

Example criteria (adapt to what you discovered):

"Agent must verify customer identity before account changes" → violation: account change without prior verification tool call
"Agent must not hallucinate policy details" → violation: agent states a policy that contradicts the embedded rules

Phase 3: Survey

Read ALL traces (if ≤ 20) or a stratified sample (if > 20, target ~15 or 30%, whichever is larger — sample by outcome, length, and complexity).

For each trace, record:

What was requested
What the agent did (key decisions, tool calls, reasoning)
How it ended (success / failure / partial)
Evaluation criteria results (pass / fail / not applicable per criterion)

Process in batches of ~3 traces at a time to manage context.

Phase 4: Categorize

Review all survey summaries. Group by task type and outcome.

Select 2-3 deep-dive targets, prioritizing:

Divergent outcomes — same task type, one succeeded, one failed. What made the difference?
Longest/most complex traces with mistakes — most decision points, most learning potential
Most common failure pattern — highest impact to fix
Confident-but-wrong — traces where the agent's stated reasoning seems worth cross-checking against the data it received
Rule/criteria violations that appeared across traces (even successful ones)

Group targets by root cause — max 2 deep-dives per root cause. Prioritize breadth over depth.

Skip short, simple, routine traces — they rarely yield learnings.

Phase 5: Deep-dive

For each deep-dive target, re-read the FULL raw trace — not your survey summary. Deep-dives that analyze summaries of summaries produce shallow, unverified conclusions.

Two passes per target:

Pass 1 — Verification: Separate what the agent claimed from what data it received. For each key claim or conclusion:

What did the agent claim?
What does the actual data/tool response show?
Does it comply with the discovered rules?
List any incorrect claims: what was claimed, what data shows, impact

This catches "confident but wrong" errors — where the agent proceeds without hesitation based on incorrect reasoning — that behavioral analysis alone misses.

Pass 2 — Root cause analysis: Given the verification findings and the full trace:

What should the agent do differently?
What is the root cause (not just the symptom)?
Is this a missing instruction, a wrong instruction, a code limitation, or a reasoning failure?

For divergent outcomes: compare success and failure traces side by side. What specifically made the difference?

Phase 6: Synthesize

Combine ALL survey summaries with ALL deep-dive results. Do not omit deep-dive findings — they contain your best evidence.

Produce a list of atomic learnings. For each:

Learning: one specific, actionable insight (one concept only)
Atomicity score (0.0–1.0): base 1.0, deduct 0.15 per "and/also/plus", 0.20 per vague term, 0.05 per word over 15
Evidence: cite specific trace details (file name, message index, exact data)
Severity: high (directly causes wrong outcomes), medium (degrades quality), low (minor inefficiency)
Category: code_fix | prompt_fix | process_fix

Verification findings are high-severity — when the agent's reasoning contradicted the data it received, this directly causes wrong outcomes regardless of correct procedure.

Output

Write to eval/stage0_trace_analysis.md:

# Trace Analysis

## Discovery
### Trace Format
### Schema
### Agent Rules
### Inventory

## Evaluation Criteria
1. [criterion]: [violation description]
...

## Survey
### [trace_file.json]
- Requested: ...
- Agent did: ...
- Outcome: success/failure/partial
- Criteria: ...

## Categories
### Success patterns
### Failure patterns
### Partial completions

## Deep-dive Targets
### [Target 1: description]
#### Verification findings
#### Root cause analysis
### [Target 2: description]
...

## Extracted Learnings
| # | Learning | Atomicity | Evidence | Severity | Category |
|---|----------|-----------|----------|----------|----------|
| 1 | ...      | 0.92      | ...      | high     | prompt_fix |
| 2 | ...      | 0.87      | ...      | medium   | code_fix   |

Stage 1: Skill Management

Transform raw learnings from Stage 0 into a structured skillbook with quality gates.

Inputs

eval/stage0_trace_analysis.md — extracted learnings from Stage 0
eval/skillbook.json — existing skillbook from a prior improvement cycle (if it exists, load and update it; if not, start fresh)

Step 1: Quality gate — Atomicity

For each learning from Stage 0, verify the atomicity score:

Score	Level	Action
0.95–1.00	Excellent	Accept as-is
0.85–0.94	Good	Accept, minor tightening optional
0.70–0.84	Fair	Split into multiple atomic learnings
0.40–0.69	Poor	Must split before proceeding
< 0.40	Rejected	Discard — too vague or compound

Splitting example:

Compound: "Tool X worked in 4 steps with 95% accuracy" (0.55)
Split into: "Use Tool X for task type Y" (0.95) + "Tool X completes in ~4 steps" (0.92) + "Expect 95% accuracy from Tool X" (0.90)
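
The score table above maps directly onto a threshold lookup. The sketch below assumes each band's lower bound is inclusive, since the printed ranges leave small gaps (for example between 0.94 and 0.95); `atomicity_action` is a hypothetical helper name.

```python
def atomicity_action(score: float) -> tuple:
    """Map an atomicity score to the (level, action) pair from the quality gate."""
    # Assumption: lower bounds are inclusive, so a score such as 0.945
    # (between the printed ranges) falls into "Good".
    if score >= 0.95:
        return ("Excellent", "accept")
    if score >= 0.85:
        return ("Good", "accept")
    if score >= 0.70:
        return ("Fair", "split")
    if score >= 0.40:
        return ("Poor", "split")
    return ("Rejected", "discard")
```

Applied to the splitting example, the compound learning at 0.55 lands in "Poor" and must be split, while each of the three resulting learnings (0.90 to 0.95) is accepted.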

Step 2: Format as imperative commands

Every skill must be an imperative command, not an observation.

BAD: "The agent accurately answers factual questions" (observation)
GOOD: "Answer factual questions directly and concisely" (imperative)
BAD: "Missing verification step caused errors" (observation)
GOOD: "Verify customer identity before making account changes" (imperative)

Step 3: Deduplication

If eval/skillbook.json exists, load it. For each new learning, check whether any existing skill has >70% semantic overlap.

Semantic duplicates (use UPDATE, not ADD):

Existing skill	Duplicate (don't add)
"Answer directly"	"Use direct answers"
"Break into steps"	"Decompose into parts"
"Verify calculations"	"Double-check results"

Step 4: Determine operations

For each learning, select the operation:

Situation	Operation
New error pattern or missing capability	ADD new skill
Existing skill needs refinement	UPDATE with improved content
Existing skill contributed to success in traces	TAG as helpful
Existing skill caused or contributed to error	TAG as harmful
Strategies contradict each other	REMOVE one or UPDATE to resolve
Skill tagged harmful 3+ times	REMOVE
No actionable insight	SKIP

Default to UPDATE over ADD when a similar skill exists.

Step 5: Rejection filter

Reject any skill that contains:

Meta-commentary (not actionable): "be careful", "consider", "think about", "remember", "make sure"
Observations (not commands): "the agent", "the model" — write commands to follow, not descriptions of behavior
Vague terms: "appropriate", "proper", "various" — too vague to act on
Overgeneralizations: "always", "never" without specific context
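
A first pass of the rejection filter can be automated with plain phrase matching. This is a rough sketch under stated assumptions: it matches only the phrases listed above, and it cannot judge whether "always" or "never" comes with specific context, so that category should be read as "flag for manual review". `REJECTION_PATTERNS` and `rejection_reasons` are hypothetical names.

```python
import re

# Phrase lists copied from the rejection filter; matching is case-insensitive
# and limited to whole words.
REJECTION_PATTERNS = {
    "meta-commentary": r"\b(be careful|consider|think about|remember|make sure)\b",
    "observation": r"\bthe (agent|model)\b",
    "vague term": r"\b(appropriate|proper|various)\b",
    # Cannot detect "without specific context"; treat a hit as a review flag.
    "overgeneralization": r"\b(always|never)\b",
}


def rejection_reasons(skill: str) -> list:
    """Return the names of every rejection category the skill text triggers."""
    text = skill.lower()
    return [name for name, pattern in REJECTION_PATTERNS.items()
            if re.search(pattern, text)]
```

An empty list means the skill passed this mechanical check. It does not prove the skill is a well-formed imperative.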

Step 6: Skillbook size management

If the skillbook exceeds 50 skills:

Prioritize UPDATE over ADD
Merge skills with >70% overlap
Remove lowest-performing skills (most harmful tags, fewest helpful tags)

Outputs

eval/skillbook.json:

{
  "skills": {
    "section-00001": {
      "id": "section-00001",
      "section": "error_handling",
      "content": "Verify customer identity before making account changes",
      "evidence": "In trace_003.json, agent changed account without verification (msg 12)",
      "justification": "Prevents unauthorized account modifications",
      "helpful": 0,
      "harmful": 0,
      "status": "active"
    }
  },
  "sections": {
    "error_handling": ["section-00001"]
  },
  "next_id": 2
}
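
The ADD, TAG, and REMOVE operations from Step 4 can be sketched against this structure. The helper names (`add_skill`, `tag_skill`, `remove_skill`) are hypothetical, and the sketch assumes that IDs always follow the literal `section-00001` pattern shown in the example and that REMOVE deletes the entry outright rather than changing its `status`.

```python
def add_skill(book, section, content, evidence, justification):
    """ADD: create a skill under the next free ID and index it by section."""
    # Assumption: IDs follow the "section-00001" pattern from the example.
    skill_id = f"section-{book['next_id']:05d}"
    book["skills"][skill_id] = {
        "id": skill_id,
        "section": section,
        "content": content,
        "evidence": evidence,
        "justification": justification,
        "helpful": 0,
        "harmful": 0,
        "status": "active",
    }
    book["sections"].setdefault(section, []).append(skill_id)
    book["next_id"] += 1
    return skill_id


def remove_skill(book, skill_id):
    """REMOVE: delete the skill and its entry in the section index."""
    skill = book["skills"].pop(skill_id)
    book["sections"][skill["section"]].remove(skill_id)


def tag_skill(book, skill_id, tag):
    """TAG: bump the helpful or harmful counter; REMOVE at 3+ harmful tags."""
    if tag not in ("helpful", "harmful"):
        raise ValueError(f"unknown tag: {tag}")
    skill = book["skills"][skill_id]
    skill[tag] += 1
    if skill["harmful"] >= 3:
        remove_skill(book, skill_id)
```

Starting from `{"skills": {}, "sections": {}, "next_id": 1}`, one `add_skill` call yields a skillbook with the same shape as the example.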

eval/stage1_insights_summary.md:

# Insights Summary

Generated by: recursive-improve (Stage 1)
Total insights: N

---

## Insight: {skill_id} — {section}

**Status:** active
**Helpful/Harmful:** 0/0

**Content:**
{imperative skill text}

**Evidence:**
{specific trace evidence}

**Justification:**
{why this improves the agent}

---

Write both files, then proceed to Stage 2.

Stage 2: Domain Context Gathering

Understand the agent's world — what it does, what tools it has, and what "success" looks like.

0. Detect trace format

Read 1 trace file from eval/traces/ and identify the framework:

Signal	Framework
`info.agent_info.implementation`, `simulation.messages[]` with `role`/`tool_calls`/`turn_idx`	tau2-bench
`runs[].steps[]` with `type: "tool"`, `lc_kwargs`	LangChain / LangSmith
`events[]` with `event_type`, `span_id`, `parent_id`	LlamaIndex
`choices[].message.tool_calls[]` at top level	Raw OpenAI API logs
`trace.spans[]` with `attributes`, `trace_id`	OpenTelemetry / Arize / Langfuse

Record the detected format. If unrecognized, note top-level keys and proceed best-effort.
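
The signal table can be approximated with a check on top-level keys. This is a simplified sketch: it looks only at the outermost keys of one parsed trace and does not verify the nested signals (such as `turn_idx` or `lc_kwargs`), so a match should be confirmed by reading the file. `detect_trace_format` is a hypothetical helper name.

```python
def detect_trace_format(trace: dict) -> str:
    """Guess the framework from the top-level keys of one parsed trace."""
    info = trace.get("info")
    simulation = trace.get("simulation")
    if (isinstance(info, dict) and "agent_info" in info) or (
        isinstance(simulation, dict) and "messages" in simulation
    ):
        return "tau2-bench"
    if "runs" in trace:
        return "LangChain / LangSmith"
    if "events" in trace:
        return "LlamaIndex"
    if "choices" in trace:
        return "Raw OpenAI API logs"
    if isinstance(trace.get("trace"), dict) and "spans" in trace["trace"]:
        return "OpenTelemetry / Arize / Langfuse"
    # Unrecognized: report the top-level keys so the run can proceed best-effort.
    return "unrecognized (top-level keys: " + ", ".join(sorted(trace)) + ")"
```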

1. Detect architecture

Read 2-3 traces. Determine single-agent vs multi-agent:

Single agent: one conversation thread, tool calls from one identity
Multi-agent: multiple agent_info entries, routing tool calls (transfer_to_*, delegate_to_*), distinct system prompts per agent

If multi-agent: document each agent separately and note routing logic.

2. Find the system prompt

Fallback chain — stop at first hit:

Config files — grep for: system_prompt, system_message, instructions, AGENT_INSTRUCTION, SYSTEM_PROMPT
Source code — search for prompt template strings, f-strings building system messages
Trace extraction — check info.environment_info.policy, first role: "system" message, raw_data fields
Not found — record SYSTEM_PROMPT_STATUS: NOT_FOUND

Record both content and source location.

3. Extract tool definitions

Pass 1 — Source code: Search for @tool, @is_tool, function schema arrays, tools=[]. For each: name, params, return type, side effects (READ/WRITE/GENERIC), unvalidated rules.

Pass 2 — Traces: Read ALL traces (if ≤ 20) or stratified sample. Extract every unique tool_calls[].name and role: "tool" response. Record one example input/output per tool.

Reconcile: Tools in source but not traces = "available but unused". Tools in traces but not source = investigate.

4. Find domain documentation

READMEs, policy files, inline comments, test files describing expected behavior.

5. Catalogue behavior patterns

Trace selection — stratified sampling (if > 20 traces):

2+ per unique termination_reason
Shortest, longest, 2 median by message count
Lowest and highest by tool call count
3+ of each pass/fail outcome
Target: ~15 traces or 30% of all traces, whichever is larger
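
Two of the sampling rules above are simple enough to sketch: the overall target size and the shortest/longest/median pick by message count. The sketch reads "30%" as 30% of all traces, and both function names and the `counts` mapping (trace name to message count) are hypothetical.

```python
import math


def target_sample_size(n_traces: int) -> int:
    """Larger of 15 traces or 30% of the set, never more than the set itself."""
    return min(n_traces, max(15, math.ceil(n_traces * 3 / 10)))


def pick_by_message_count(counts: dict) -> list:
    """Return the shortest, the longest, and two median traces by message count."""
    ordered = sorted(counts, key=counts.get)
    mid = len(ordered) // 2
    picks = [ordered[0], ordered[-1]] + ordered[max(0, mid - 1): mid + 1]
    # Drop repeats (possible in very small sets) while keeping order.
    return list(dict.fromkeys(picks))
```

With 100 traces the target is 30. With 30 traces the 15-trace floor applies.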

For each trace, document: function call frequency, tool call sequences, success patterns, failure patterns, error patterns, policy violations, user feedback signals.

6. Write findings

Write to eval/stage2_domain_context.md:

# Domain Context

## Trace Format
## Architecture
## Agent Purpose
## System Prompt
## Tools
## Domain Rules
## Behavior Patterns
### Success patterns
### Failure patterns
### Policy violation patterns
### Error patterns
### User feedback signals

Stage 3: Metrics and Programmatic Analysis

Define metrics from insights, implement as code, run, review, iterate.

Step 0: Run built-in eval first

Before writing custom detectors, run the built-in eval:

recursive-improve eval eval/traces --branch main

Inputs

eval/stage1_insights_summary.md
eval/stage2_domain_context.md

Read both before starting.

Process

This stage is iterative, capped at 3 iterations. A metric set is "clean" when:

No small-sample metrics in priority set (denominator ≥ 5 for priority ranking)
No unexplained 0%/100% extremes
No redundant pairs (>70% denominator overlap)
Script runs without errors

Step 1: Define metrics

For each insight, identify observable trace signals. Classify by detector pattern:

Recovery detectors — consecutive calls: first error, next success
Loop detectors — N+ consecutive calls to same function (stuck agent)
Give-up detectors — regex for abandonment phrases ("I'm unable to", "cannot complete")
Error classifiers — match outputs against domain-specific error patterns
Over-exploration detectors — ratio of explore vs action calls exceeding threshold
Ground-truth comparison — agent claims value vs preceding tool response (regex extraction: dollar amounts, IDs, flight numbers, compare against JSON fields)
Ordering/sequencing detectors — tool call A before B when B should come first
Clean success — threads with no errors and no other tags

Validate each detector against 2-3 traces where you know ground truth before coding at scale.
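
Three of the detector patterns above (loop, recovery, give-up) can be sketched over a normalized call list. The input shape is an assumption: each call is a dict with hypothetical `name` and `error` keys, which real traces would need to be mapped into first. The give-up regex uses only the two phrases quoted above.

```python
import re

# The two abandonment phrases quoted in the detector list; extend per domain.
GIVE_UP = re.compile(r"I'm unable to|cannot complete", re.IGNORECASE)


def has_loop(calls, n=3):
    """Loop detector: True if the same function is called n+ times in a row."""
    run, previous = 0, None
    for call in calls:
        run = run + 1 if call["name"] == previous else 1
        previous = call["name"]
        if run >= n:
            return True
    return False


def count_recoveries(calls):
    """Recovery detector: count adjacent pairs where an error is followed by a success."""
    return sum(1 for first, second in zip(calls, calls[1:])
               if first["error"] and not second["error"])


def gave_up(messages):
    """Give-up detector: True if any agent message contains an abandonment phrase."""
    return any(GIVE_UP.search(message) for message in messages)
```

Checking each function against 2-3 traces with known ground truth, as described above, is what catches a wrong threshold (for example `n=3` flagging legitimate pagination calls) before the detector runs at scale.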

Step 2: Implement and run

Write eval/compute_baselines.py with:

CLI args: --traces-dir (required), --output (default: eval/baseline_metrics.json)
load_traces(traces_dir) — loads all JSON trace files
tag_thread(thread) — combines all detectors
One measurement function per metric (numerator / denominator)
compute_all_baselines(traces_dir) — runs all, returns dict
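
A minimal skeleton with these pieces might look like the following. It is a sketch, not the script the pipeline generates: the `calls` and `error` fields are a hypothetical trace shape, only two toy metrics are included, and the `ratio` and `main` helpers are assumptions.

```python
import argparse
import json
from pathlib import Path


def load_traces(traces_dir):
    """Load every *.json file in traces_dir as one trace."""
    return [json.loads(p.read_text()) for p in sorted(Path(traces_dir).glob("*.json"))]


def tag_thread(thread):
    """Combine all detectors into a set of tags (one toy detector here)."""
    calls = thread.get("calls", [])  # hypothetical field name
    tags = set()
    if any(call.get("error") for call in calls):
        tags.add("tool_error")
    return tags or {"clean_success"}


def ratio(numerator, denominator):
    """Keep numerator and denominator so small samples stay visible."""
    rate = numerator / denominator if denominator else None
    return {"numerator": numerator, "denominator": denominator, "rate": rate}


def tool_error_rate(traces):
    """Per-event metric: errored tool calls / all tool calls."""
    calls = [call for trace in traces for call in trace.get("calls", [])]
    return ratio(sum(1 for call in calls if call.get("error")), len(calls))


def clean_success_rate(traces):
    """Per-thread metric: threads with no error tags / all threads."""
    clean = sum(1 for trace in traces if tag_thread(trace) == {"clean_success"})
    return ratio(clean, len(traces))


def compute_all_baselines(traces_dir):
    traces = load_traces(traces_dir)
    return {
        "tool_error_rate": tool_error_rate(traces),
        "clean_success_rate": clean_success_rate(traces),
    }


def main(argv=None):
    parser = argparse.ArgumentParser(description="Compute baseline metrics from traces")
    parser.add_argument("--traces-dir", required=True)
    parser.add_argument("--output", default="eval/baseline_metrics.json")
    args = parser.parse_args(argv)
    output = Path(args.output)
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(json.dumps(compute_all_baselines(args.traces_dir), indent=2))
```

Calling `main()` under the usual `if __name__ == "__main__":` guard turns this into the command-line script. Storing the denominator next to each rate lets the small-sample check read it straight from the JSON output.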

Run it:

python eval/compute_baselines.py --traces-dir eval/traces --output eval/baseline_metrics.json

Then store the baseline as a benchmark run (for the dashboard):

recursive-improve store-baseline

Step 3: Review and iterate

Check A — Script health. Errors or null values? Fix, re-run (doesn't count toward cap).

Check B — Small-sample guard. Denominator ≥ 5 → full confidence. 1-4 → "directional-only". 0 → broken or genuinely absent.
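
The small-sample guard reduces to a three-way classification on the denominator. In this sketch the function name and the returned labels other than "directional-only" are hypothetical.

```python
def sample_confidence(denominator: int) -> str:
    """Classify a metric by the size of its denominator."""
    if denominator >= 5:
        return "full"
    if denominator >= 1:
        return "directional-only"
    # Zero events: the detector is broken or the behavior never occurs.
    return "broken-or-absent"
```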

Design principles

One metric per insight. Fewer metrics than insights = too conservative.
Express as ratio or percentage. No absolute counts.
Prefer per-event denominators over per-thread.
Build a metric for EVERY insight. "Unmeasurable" is a last resort.

Outputs

eval/compute_baselines.py
eval/baseline_metrics.json
eval/benchmark_results.json (stored via recursive-improve store-baseline)

Stage 4: Rubric Definition

Organize metrics into a tiered evaluation rubric.

Inputs

eval/baseline_metrics.json
eval/compute_baselines.py
eval/stage1_insights_summary.md
eval/stage2_domain_context.md

Process

1. Quantitative redundancy check

For every metric pair, check:

Denominator overlap > 70%
Same skill set
Logical subsumption

For each candidate pair: explicit decision (keep both / merge / drop) with reasoning.

Target: 5-7 metrics after redundancy resolution.
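
The denominator-overlap check can be run mechanically once each metric records which events or threads make up its denominator. The source does not define how overlap is computed, so this sketch assumes Jaccard overlap (shared items divided by all items in either set). The function names and the `denominators` mapping are hypothetical.

```python
from itertools import combinations


def denominator_overlap(a, b) -> float:
    """Jaccard overlap of two denominators (an assumed definition)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundant_pairs(denominators: dict, threshold: float = 0.70) -> list:
    """Return metric pairs whose denominators overlap by more than the threshold."""
    return [
        (first, second)
        for first, second in combinations(sorted(denominators), 2)
        if denominator_overlap(denominators[first], denominators[second]) > threshold
    ]
```

The result is only a list of candidates. Each pair still needs the explicit keep both / merge / drop decision with reasoning.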

2. Tier each metric

Q1: Can a SINGLE skill/instruction change directly move this? → LEADING
Q2: Does it require MULTIPLE skills adopted together? → LAGGING
Q3: Does it require domain reasoning beyond instructions? → QUALITY

If ambiguous, pick the lower tier. Record which question determined the tier.

3. Flag low-confidence baselines

Denominator < 5 → marked with **Confidence: low** (n=X), excluded from priority sorting.

4. Set direction

Up or down. Ceiling guard: a 100% baseline gets "↑ maintain", never a bare "↑". The same applies at the 0% floor: use "↓ maintain", never a bare "↓".

5. Map insights to metrics

Every insight → Mapped / Indirectly mapped / Qualitative-only. Report coverage counts.

6. Invalidation notes

Per metric: "What would make this tier assignment wrong?"

7. Write the rubric

Write to eval/baseline_metrics.md with summary table, tier definitions, metric details, redundancy analysis, insight coverage.

Stage 5: Action Plan

Triage each insight into discard/code-fix/prompt-fix and produce a prioritized plan.

Inputs

eval/stage1_insights_summary.md
eval/stage2_domain_context.md
eval/baseline_metrics.md
eval/baseline_metrics.json
eval/compute_baselines.py

Process

1. Triage each insight

1a. Validity check — real recurring problem or one-off noise?

1b. "Already handled" verification — grep codebase for key terms, read existing system prompt. Partially covered → keep as strengthening fix. Fully covered AND baseline ≥ 95% → discard.

1c. Code-vs-prompt decision:

For each insight, judge the best fix type on its own merits:

Signal	Fix type
Agent has the information but reasons incorrectly about it	PROMPT FIX — clarify reasoning guidance
Agent lacks instructions for a scenario it hasn't seen	PROMPT FIX — add instructions for the new case
Agent has the instruction but ignores or violates it	CODE FIX — enforce via validation, guardrails, or forced sequencing
Agent lacks a tool, schema, validation, or infrastructure capability	CODE FIX — add or modify code
Agent has partial info and a prompt workaround exists, but a code fix would be more robust	CODE FIX — prefer the durable solution
Fix involves both new instructions and supporting code	BOTH — implement both, note the dependency

Choose the approach that most directly and durably solves the root cause. There is no default — evaluate each case independently.

2. Consolidate related insights

Merge when ALL hold: same target behavior, overlapping fix text (>50%), fixing one fixes >80% of the other.

3. Write specific recommendations

Discards: one sentence why.
Code fixes: file, function, specific change.
Prompt fixes: exact instruction text, where it goes, why this wording.

4. Assess risk per fix

Risk	Definition
None	Additive, no existing behavior affected
Low	Targets currently-failing behavior
Medium	Modifies behavior where some cases already work
High	Rewrites behavior that mostly works

Medium/High: add one-sentence mitigation.

5. Handle qualitative-only insights

Still produce fixes. Use confidence = 0.5 and estimate impact from severity. Note that improvement should be verified via manual trace review.

6. Link to metrics

Each fix → which metric(s) would move.

7. Prioritize

Priority Score = Impact × Confidence × Tier Bonus ÷ Risk Factor

Impact = estimated gap closure
Confidence: n≥20 → 1.0, 10-19 → 0.8, 5-9 → 0.6, <5 → 0.3
Tier Bonus: leading → 1.5x, lagging/quality → 1.0x
Risk Factor: None/Low → 1.0, Medium → 1.5, High → 2.0

After scoring, promote prerequisites even if their standalone score is lower.
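
The priority formula and its lookup values can be written out as a short function. This sketch assumes Impact is expressed as a fraction between 0 and 1, and it folds in the fixed confidence of 0.5 for qualitative-only insights from step 5. The function and constant names are hypothetical. Prerequisite promotion is a judgment step and is not modeled.

```python
TIER_BONUS = {"leading": 1.5, "lagging": 1.0, "quality": 1.0}
RISK_FACTOR = {"none": 1.0, "low": 1.0, "medium": 1.5, "high": 2.0}


def confidence_from_n(n: int) -> float:
    """Confidence from the metric's sample size n."""
    if n >= 20:
        return 1.0
    if n >= 10:
        return 0.8
    if n >= 5:
        return 0.6
    return 0.3


def priority_score(impact, n, tier, risk, qualitative_only=False):
    """Priority Score = Impact x Confidence x Tier Bonus / Risk Factor."""
    # Qualitative-only insights have no denominator, so they use 0.5.
    confidence = 0.5 if qualitative_only else confidence_from_n(n)
    return impact * confidence * TIER_BONUS[tier] / RISK_FACTOR[risk]
```

For example, a leading-tier fix with impact 0.4, n = 25, and medium risk scores 0.4 × 1.0 × 1.5 ÷ 1.5 = 0.4, while a lagging-tier fix with impact 0.2, n = 7, and high risk scores 0.2 × 0.6 × 1.0 ÷ 2.0 = 0.06.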

Output

Write to eval/action_plan.md with summary, implementation priority table, per-fix entries, consolidated prompt skills, monitor items.

Stage 6: Human-In-The-Loop Gate

Present the action plan for informed approval.

Inputs

eval/action_plan.md
eval/baseline_metrics.md
eval/baseline_metrics.json
eval/stage1_insights_summary.md

Process

1. Executive summary

Counts: total insights, distinct after dedup, prompt/code/discards, discard reasons.

2. Top 3 highest-impact changes

For each: before/after behavior from actual traces, target metric delta, risk rating.

3. Full prioritized fix list

Table: priority, name, type, target metrics, risk, effort (Low/Medium/High).

4. "What we are NOT fixing and why"

Every discard with: ID, name, reason, what would change your mind.

5. Flag small-sample items

Call out metrics with denominator < 5.

6. Traceability chain

Per fix: insight → metric → fix → expected improvement.

7. Collect decision

Explain the branch workflow: approved fixes will be applied on a dedicated branch (ri/improve-<YYYYMMDD-HHMMSS>), not directly on the current branch. This means:

The user's current branch stays untouched
All changes can be reviewed with git diff main...ri/improve-<timestamp>
The user can open a PR, get CI feedback, and merge when ready
The branch is easy to discard with git branch -D ri/improve-<timestamp> if the fixes aren't right

Three options:

[A] Approve all — create branch, implement all fixes
[B] Approve with modifications — walk through each fix individually (approve/skip/modify), then create branch and implement
[C] Reject — collect feedback, re-run Stage 5

If [B]: update eval/action_plan.md with modifications. Write eval/stage6_decision.md.

Rules

Do NOT auto-approve
Do NOT proceed to fixes until clear approval is recorded
Always present small-sample warnings
Always present the "not fixing" section
Always explain the branch workflow before collecting the decision

Output

eval/stage6_decision.md
eval/action_plan.md (updated if modifications)

Stage 7: Fix Implementation

Implement every approved fix from the action plan.

Inputs

eval/action_plan.md
eval/stage6_decision.md (if exists)
eval/baseline_metrics.json

Pre-flight: Create Improvement Branch

All fixes are applied on a dedicated branch, keeping the user's current branch clean.

Run git status to verify a clean working tree. If there are uncommitted changes, ask the user to commit or stash them before proceeding.
Record the current branch name (e.g. main) as the base branch.

Create and switch to the improvement branch:

git checkout -b ri/improve-$(date +%Y%m%d-%H%M%S)

Record the branch name in eval/changes_log.md.

Pre-flight: HITL Modification Check

If eval/stage6_decision.md exists, identify user-modified items. Tag them [HITL-MODIFIED] in the changes log.

Pre-flight: Conflict Scan

Build file_path → [fix IDs] map. Flag co-located and overlapping fixes. Plan sequential application in priority order.
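
The conflict scan amounts to inverting the fix list into a per-file index. In this sketch each fix is a dict with hypothetical `id`, `priority` (1 is highest), and `files` keys, and both function names are illustrative.

```python
from collections import defaultdict


def conflict_map(fixes):
    """Build file_path -> [fix IDs], with IDs listed in priority order."""
    by_file = defaultdict(list)
    for fix in sorted(fixes, key=lambda f: f["priority"]):
        for path in fix["files"]:
            by_file[path].append(fix["id"])
    return dict(by_file)


def co_located(by_file):
    """Keep only the files touched by more than one fix."""
    return {path: ids for path, ids in by_file.items() if len(ids) > 1}
```

Files returned by `co_located` are the ones where fixes should be applied one at a time, in the listed order, with each result re-read before the next edit.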

Process

For each non-discarded fix in priority order:

Understand — read the recommendation and referenced files
Implement — minimal, targeted change. Code fixes: find file, make the change. Prompt fixes: find system prompt, add instruction. No refactoring beyond what's needed.
Log — append to eval/changes_log.md: type, verdict, files modified, before/after snippets, linked metrics, conflict notes
Handle uncertainty — if unsure, log as NEEDS REVIEW with what's unclear and continue

Post-Fix: Next Steps

After all fixes are applied on the improvement branch:

Ensure eval/benchmark_results.json is present (it was created by store-baseline in Stage 3). If missing, run: recursive-improve store-baseline
Commit all changes on the improvement branch with a descriptive message. Include eval/benchmark_results.json in the commit.
Stay on the improvement branch (do NOT switch back to the base branch).

Rules

Do NOT modify trace files
Do NOT make changes beyond what was recommended
Do NOT run compute_baselines.py (the existing traces predate the fixes, so the baselines would not reflect them)
Do NOT apply fixes directly on the user's current branch — always use the improvement branch
Make minimal, targeted changes

Output

eval/changes_log.md
The improvement branch with all code/prompt changes