원클릭으로 Manus에서 모든 스킬 실행

long-running-harness

Run multi-hour autonomous coding tasks using a 5-agent harness: Researcher, Planner, Executor (with Opus advisor), Evaluator, and Orchestrator. Combines Anthropic's advisor strategy, GAN-inspired generator/evaluator design, and Boris Tane's annotation-cycle workflow.

Manus에서 실행

스타2

포크0

업데이트2026년 5월 27일 16:45

출처

jleechanorg

jleechanorg/smartclaw

GitHub 저장소 열기 Creator 저장소 보기

설치 명령

다운로드

Manus에서 실행

유용한 대상SOC

소프트웨어 개발자컴퓨터 및 수학직15-1252L4

파일 탐색기

10 개 파일

SKILL.md

readonly

이 저장소의 다른 Skills

같은 저장소

ascii-art

jleechanorg/smartclaw

ASCII art: pyfiglet, cowsay, boxes, image-to-ascii.

2026-05-272

baoyu-comic

jleechanorg/smartclaw

Knowledge comics (知识漫画): educational, biography, tutorial.

2026-05-272

baoyu-infographic

jleechanorg/smartclaw

Infographics: 21 layouts x 21 styles (信息图, 可视化).

2026-05-272

comfyui

jleechanorg/smartclaw

Generate images, video, and audio with ComfyUI — install, launch, manage nodes/models, run workflows with parameter injection. Uses the official comfy-cli for lifecycle and direct REST/WebSocket API for execution.

2026-05-272

pixel-art

jleechanorg/smartclaw

Pixel art w/ era palettes (NES, Game Boy, PICO-8).

2026-05-272

deploy-hermes

jleechanorg/smartclaw

Hermes deploy pipeline — sync, test, and prevent drift between staging and prod

2026-05-202

name	long-running-harness
version	1.0.0
description	Run multi-hour autonomous coding tasks using a 5-agent harness: Researcher, Planner, Executor (with Opus advisor), Evaluator, and Orchestrator. Combines Anthropic's advisor strategy, GAN-inspired generator/evaluator design, and Boris Tane's annotation-cycle workflow.
when_to_use	When jleechan asks to build a complex feature, refactor a system, or tackle any multi-step coding task that will take >30 min of autonomous agent time. Also when he says 'long running', 'harness', 'multi-agent', or references the advisor-strategy/harness-design/annotation-cycle posts.

Long-Running Harness

A structured multi-agent architecture for autonomous coding tasks that would overwhelm a single agent session. Combines three proven approaches:

Advisor Strategy (Anthropic Apr 2026) — Opus-on-demand, cheap executor always
Harness Design (Anthropic Mar 2026) — GAN-inspired generator + evaluator, context resets, structured handoffs
Annotation Cycle (Boris Tane Feb 2026) — Research → Plan → Annotate → Implement → Feedback

Architecture

┌─────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                         │
│  (human-in-loop + Opus advisor, gatekeeps transitions) │
└──────────┬──────────┬──────────┬──────────┬────────────┘
           │          │          │          │
     ┌─────▼──┐ ┌────▼───┐ ┌───▼────┐ ┌──▼──────┐
     │RESEARCH│ │ PLANNER│ │EXECUTOR│ │EVALUATOR │
     │(Haiku)│ │(Sonnet)│ │(Sonnet)│ │(Opus)    │
     └────────┘ └────────┘ └────────┘ └──────────┘

Agent Roles

Agent	Model	Job	Output
Researcher	Haiku (cheap, broad)	Deep-read codebase, understand system	`research.md`
Planner	Sonnet (balanced)	Write implementation plan with code snippets	`plan.md` + `todo.md`
Executor	Sonnet + Opus advisor (main workhorse)	Implement one todo at a time, context reset on fill	code changes + `handoff.md`
Evaluator	Opus (skeptical judge)	Grade output against rubric, never self-grade	`eval_report.md`
Orchestrator	Human + Opus	Phase transitions, annotation cycles, ship gate	approval / rejection

The Loop

1. Researcher → research.md
2. [ANNOTATION CYCLE] Human reviews, adds inline notes → Researcher revises (1-3x)
3. Planner → plan.md + todo.md
4. [ANNOTATION CYCLE] Human reviews plan, adds notes → Planner revises (1-3x)
5. [GATE] Plan approved?
   NO → back to step 3 or 1
   YES → proceed
6. For each todo item:
   a. Executor implements (with Opus advisor on-call)
   b. Context reset if anxiety/fill detected → handoff.md → fresh session
   c. Evaluator grades output against plan spec
   d. Eval PASS → next todo
      Eval FAIL → specific feedback → Executor retries (max 2x, then escalate)
7. All todos done → Final evaluation (full Evaluator pass)
8. [SHIP GATE] Human approves merge

Quick Start

# Run the full harness from CLI
python3 ~/.hermes/skills/long-running-harness/scripts/harness.py \
  --task "Add cursor-based pagination to the list endpoint" \
  --repo /path/to/repo \
  --project-id agent-orchestrator

# Or step-by-step (more control):
# 1. Research
python3 ~/.hermes/skills/long-running-harness/scripts/harness.py \
  research --task "..." --repo /path/to/repo

# 2. Plan (after reviewing research.md)
python3 ~/.hermes/skills/long-running-harness/scripts/harness.py \
  plan --task "..." --repo /path/to/repo

# 3. Implement (after approving plan.md)
python3 ~/.hermes/skills/long-running-harness/scripts/harness.py \
  implement --repo /path/to/repo --todo-file .hermes/plans/todo.md

# 4. Evaluate
python3 ~/.hermes/skills/long-running-harness/scripts/harness.py \
  evaluate --repo /path/to/repo --plan-file .hermes/plans/plan.md

Key Design Decisions

Why context resets over compaction

Anthropic's finding: Sonnet 4.5's context anxiety persisted even with compaction. A reset + structured handoff gives a clean slate. Trade-off: orchestration complexity + token overhead for handoff artifacts. In our stack, ao-babysit detects context fill and triggers resets.

Why separate evaluator

Agents grade their own work too generously (Anthropic's GAN insight). Tuning a standalone evaluator for skepticism is tractable; making a generator self-critical is not. The evaluator runs Opus because judgment quality matters more than speed here.

Why advisor tool (not sub-agent orchestration)

The advisor_20260301 tool is a one-line API change. No worker pool, no decomposition. Sonnet drives, Opus only on hard calls. Cost stays near Sonnet levels.

Why annotation cycles

Boris Tane's insight: the plan document is a shared working surface. Inline notes in the artifact beat chat-message steering every time. Domain knowledge injection happens before code, not after.

Artifact Chain

Phase	Input	Output	Location
Research	prompt + codebase	`research.md`	`.hermes/plans/research.md`
Planning	`research.md` + human notes	`plan.md` + `todo.md`	`.hermes/plans/`
Execution	`plan.md` + `todo.md`	code changes + `handoff.md`	worktree + `.hermes/plans/handoff.md`
Evaluation	plan spec + code diff	`eval_report.md`	`.hermes/plans/eval_report.md`
Ship	all artifacts	final commit	worktree

Model + Cost Profile

Component	Model	% of tokens
Research	Haiku	~15%
Planning	Sonnet	~10%
Execution	Sonnet + Opus advisor (3x/request max)	~60% (mostly Sonnet)
Evaluation	Opus	~15%
Total		~Sonnet-equivalent cost, Opus-level quality

Integration with Existing Stack

Harness Component	Hermes Equivalent
Research Agent	`ao spawn` with Haiku, writes to worktree
Planner Agent	`ao spawn` with Sonnet, outputs `plan.md`
Executor Agent	`ao spawn` with Sonnet + advisor tool, `ao-babysit` for context monitoring
Evaluator Agent	`ao spawn` with Opus, skeptic-evidence-capture skill
Orchestrator	Hermes gateway session + `bidi-cmux-alignment` for steering
Context resets	`ao-babysit` detects context fill → kills session → fresh spawn reads `handoff.md`
Annotation cycle	Human in Slack thread + `dispatch-task` for sending agent back to artifact

Anti-Patterns

Don't let the agent self-evaluate — always use separate evaluator
Don't skip research — garbage in, garbage out
Don't implement before plan approved — wasted tokens on wrong approach
Don't rely on compaction alone — context resets + handoffs for long tasks
Don't use Opus end-to-end — advisor tool for on-demand intelligence
Don't steer via chat messages — annotate the artifact, send agent back to it
Don't let the executor write more than one todo without evaluation — drift compounds fast

Template Files

Read these when building agent prompts:

templates/researcher-prompt.md — Research agent system prompt
templates/planner-prompt.md — Planner agent system prompt
templates/executor-prompt.md — Executor agent system prompt (includes advisor tool wiring)
templates/evaluator-prompt.md — Evaluator agent system prompt (skeptical rubric)
templates/handoff-schema.md — Structured handoff artifact schema

Script

scripts/harness.py — CLI orchestrator that runs phases, manages artifacts, triggers context resets

References

references/advisor-strategy-summary.md — Key points from Anthropic's advisor strategy post
references/harness-design-summary.md — Key points from Anthropic's harness design post
references/annotation-cycle-summary.md — Key points from Boris Tane's workflow