| name | meta-harness-evolver |
| description | Meta-Harness-style evolution loop for AI4S / agentic codebases. Reads prior candidates from $EVOLVER_WORKSPACE, proposes one targeted code/config edit via NexAU, evaluates it via a user-provided evaluation script, logs results, and posts a summary to Feishu. |
Meta-Harness Evolver
What This Does
Implements a Meta-Harness-style outer loop. Each run:
- Reads the current best + all prior candidates in the evolution workspace
- Proposes one targeted modification via a NexAU coding agent
- Evaluates the candidate via --evaluate-script (a bash/Python script or any executable), which must print JSON as the last line of stdout
- Logs the candidate harness + scores + execution traces to the evolution filesystem
- Posts a summary report to Feishu (Lark)
The Meta-Harness Loop
Proposer Agent ──(filesystem access)──▶ Hoss Workspace
      ▲                                       │
      │ propose harness                       │
      │                                       ▼
      │                           Evaluate on benchmark
      │                                       │
      │                                       ▼
     log ◀── store: code + scores + traces ──▶ $EVOLVER_WORKSPACE/
Quick Start
Run from repo root:
bash meta-harness-evolver/scripts/example_run_evolution.sh
Directory Structure
$EVOLVER_WORKSPACE/
├── best/                   # Best harness found so far
│   └── current/
├── candidates/             # All evaluated harnesses
│   └── candidate_N/        # One dir per candidate
│       ├── harness/        # The proposed config files (SOUL.md, etc.)
│       ├── eval_scores.json
│       └── traces/         # Execution traces
├── benchmark/              # Evaluation tasks + scorer
│   └── scenarios/          # ~20 diverse task scenarios
├── proposer/               # Proposer's workspace
│   └── logs/               # Proposer's own reasoning traces
└── evolution_log.jsonl     # Full run history
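For orientation, eval_scores.json might look like the sketch below. The exact schema is set by your --evaluate-script, so every field name here is illustrative:

```json
{
  "final_score": 2.275,
  "categories": {
    "memory": 2.5,
    "code": 2.0,
    "coordination": 2.0,
    "research": 2.5,
    "communication": 2.0,
    "quality": 3.0
  },
  "complexity": 42
}
```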
What Can Be Evolved
Any files inside candidate_N/harness/ (e.g., model.py, train.py, config.yaml). Do not commit secrets into this workspace.
The Evolution Algorithm
- Seed: Start with Hoss's current configs as iteration 0
- Propose: Proposer reads the full history from $EVOLVER_WORKSPACE/candidates/, identifies failure patterns, and proposes one targeted edit
- Validate: Lightweight import/syntax check before running full benchmark
- Evaluate: Run the proposed harness against all 20 benchmark scenarios, scoring each
- Log: Store candidate harness + scores + proposer reasoning traces
- Select: Pareto frontier over (performance, simplicity); the proposer decides which candidates to keep exploring from
- Repeat: Next night's proposer can read ALL prior candidates to build on good ideas
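A minimal sketch of one iteration of this loop, in Python. The callables propose, validate, and evaluate are hypothetical stand-ins for the NexAU proposer call, the syntax check, and the --evaluate-script invocation; this is not the actual run_evolution.py API:

```python
import json
import os
from pathlib import Path

def run_iteration(propose, validate, evaluate):
    """One outer-loop step; propose/validate/evaluate are hypothetical hooks."""
    workspace = Path(os.environ["EVOLVER_WORKSPACE"])
    history = sorted((workspace / "candidates").glob("candidate_*"))
    cand = workspace / "candidates" / f"candidate_{len(history) + 1}"
    (cand / "harness").mkdir(parents=True, exist_ok=True)

    # Propose: the agent reads all prior candidates and writes one targeted edit.
    propose(history=history, out_dir=cand / "harness")

    # Validate: cheap import/syntax check before paying for the full benchmark.
    if not validate(cand / "harness"):
        return None  # invalid candidate -> iteration is skipped (see Notes)

    # Evaluate: run the benchmark; evaluate() returns the parsed score JSON.
    scores = evaluate(cand)
    (cand / "eval_scores.json").write_text(json.dumps(scores, indent=2))

    # Log: append to the shared history so the next proposer can build on it.
    with (workspace / "evolution_log.jsonl").open("a") as log:
        log.write(json.dumps({"candidate": cand.name, "scores": scores}) + "\n")
    return scores
```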
Key Insight from the Paper
The skill text is the strongest lever: it steers the proposer. Iterating on the proposer's prompt/role description had more effect than changing iteration count or population size.
Evaluation
Evaluation is delegated to a user-provided program via --evaluate-script. It must accept a single argument, <candidate_dir>, and print a JSON object as the last line of stdout.
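A minimal evaluate script satisfying that contract, assuming a final_score-plus-categories schema (adjust the JSON shape to whatever your pipeline expects):

```python
#!/usr/bin/env python3
import json
import sys
from pathlib import Path

candidate_dir = Path(sys.argv[1])  # run_evolution.py passes <candidate_dir>

# ... run your benchmark against candidate_dir/harness and collect scores ...
categories = {"memory": 2.5, "code": 2.0}  # illustrative results

# Diagnostics may go to stderr; only the LAST stdout line must be the JSON.
print(f"evaluated {candidate_dir}", file=sys.stderr)
print(json.dumps({"final_score": 2.3, "categories": categories}))
```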
Default benchmark has 20 scenarios across categories:
- Memory: Recall, update, synthesize from memory files
- Code: Write, review, debug code tasks
- Coordination: Spawn sub-agents, synthesize results
- Research: Web search, fetch, summarize, synthesize
- Communication: Draft emails, Feishu messages
- Quality: Spot errors, inconsistencies, broken links
Each scenario has:
- A concrete task description
- Expected outcome criteria
- A scoring rubric (0-3 per scenario: fail / partial / pass / excellent)
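An illustrative scenario file (the real format under benchmark/scenarios/ may differ):

```json
{
  "id": "memory_03",
  "category": "memory",
  "task": "Recall the deployment checklist from memory and update step 4.",
  "expected_outcome": "Updated checklist; all other steps preserved verbatim.",
  "rubric": {"0": "fail", "1": "partial", "2": "pass", "3": "excellent"}
}
```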
The Proposer Agent
The proposer is a coding-agent sub-agent (default: coder) that:
- Reads all prior candidates from $EVOLVER_WORKSPACE/candidates/ via filesystem ops
- Identifies patterns across failed and successful candidates
- Proposes targeted, specific edits (NOT wholesale rewrites)
- Writes proposed configs to the new candidate directory
- Logs its reasoning trace so future iterations can build on it
Proposer Prompt
The proposer's role is defined by the task prompt in scripts/run_evolution.py and should prefer targeted edits over full rewrites.
Workflow Steps
Step 1: Read Prior Candidates
ls "$EVOLVER_WORKSPACE/candidates/"
cat "$EVOLVER_WORKSPACE/best/current/eval_scores.json"
tail -20 "$EVOLVER_WORKSPACE/evolution_log.jsonl"
Step 2: Run Proposer
The proposer is launched by scripts/run_evolution.py (see Step 4), which hands it the task prompt; its reasoning traces land in $EVOLVER_WORKSPACE/proposer/logs/.
Step 3: Validate Before Benchmark
bash meta-harness-evolver/scripts/validate.sh <candidate_dir>
Step 4: Run Evaluation
python3 meta-harness-evolver/scripts/run_evolution.py --evaluate-script /path/to/evaluate.sh
Step 5: Log Results
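One way to append the record, assuming one JSON object per line in evolution_log.jsonl (field names are illustrative, not a fixed schema):

```python
import json
import os
import time
from pathlib import Path

entry = {
    "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "candidate": "candidate_7",   # illustrative values
    "final_score": 2.275,
    "complexity": 42,             # size/diff of the change, for the Pareto log
    "proposer_success": True,
}
log_path = Path(os.environ["EVOLVER_WORKSPACE"]) / "evolution_log.jsonl"
with log_path.open("a") as f:
    f.write(json.dumps(entry) + "\n")
```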
Step 6: Post to Feishu
python3 meta-harness-evolver/scripts/post_to_research.py <candidate_num> <candidate_dir> <score> <proposer_success>
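For example, with illustrative argument values: python3 meta-harness-evolver/scripts/post_to_research.py 7 "$EVOLVER_WORKSPACE/candidates/candidate_7" 2.275 true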
Scoring
Final score = weighted average across scenario categories (see the worked example after the list):
- Memory tasks: 25%
- Code tasks: 25%
- Coordination: 15%
- Research: 20%
- Communication: 10%
- Quality: 5%
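A worked example with illustrative category scores on the 0-3 rubric:

```python
WEIGHTS = {"memory": 0.25, "code": 0.25, "coordination": 0.15,
           "research": 0.20, "communication": 0.10, "quality": 0.05}
scores = {"memory": 2.5, "code": 2.0, "coordination": 2.0,
          "research": 2.5, "communication": 2.0, "quality": 3.0}

final = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
print(round(final, 3))  # 2.275
```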
Results are tracked as a Pareto frontier: for each candidate, log both score and "complexity" (size/diff of changes). Simpler harnesses that score equally get priority.
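A minimal dominance check for that frontier, where higher score and lower complexity are better (a sketch, not the selection logic run_evolution.py actually uses):

```python
def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a["score"] >= b["score"] and a["complexity"] <= b["complexity"]
            and (a["score"] > b["score"] or a["complexity"] < b["complexity"]))

def pareto_frontier(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]
```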
Resources
Notes
- If the proposer fails to produce a valid candidate, the iteration is skipped
- Keep evaluation scenarios diverse enough that no single strategy can game all of them