一键在 Manus 中运行任何 Skill

harness-engineering

Designing robust harnesses for long-running agent workflows, repository shaping, routing docs, and multi-agent orchestration. ALWAYS use this skill when the user discusses: - Multi-step agent tasks or long-running workflows (running longer than a few minutes) - Repository bootstrapping, scaffolding, or structural changes - AGENTS.md or docs/index.md routing and maintenance - Generator-evaluator patterns and quality control for autonomous agents - Planner/generator/evaluator role separation and handoffs - Sprint contracts and explicit work contracts for agent tasks - External verification strategies (browser automation, API checks, logs) - Context management and reset strategies for extended sessions Use for any task requiring durable artifacts, explicit contracts, or independent evaluation beyond simple unit tests. Do not use for routine feature work inside an already-stable repo with clear local verification.

在 Manus 中运行

概览

安装命令

npx skills add https://github.com/Viking602/harness-engineering --skill harness-engineering

复制此命令并粘贴到 Claude Code 中以安装该技能

来源

Viking602/harness-engineering

星标0

分支0

更新时间2026年3月29日 11:21

文件资源管理器

25 个文件

SKILL.md

readonly

来源

Viking602

Viking602/harness-engineering

打开 GitHub 仓库查看创作者相关仓库

安装命令

下载

在 Manus 中运行

适用职业SOC

软件开发工程师计算机与数学类职业15-1252L4

name

harness-engineering

description

Harness Engineering

Use this skill when repository shape and agent continuity matter more than raw code generation. Keep this file as the skill entrypoint. Keep bundled skill references under ./references/ and Codex-only agent files under ./codex/agents/.

For multi-hour, multi-session, or failure-prone work, read ./references/long-running-harness-reference.md before planning roles, handoffs, or reset strategy.

Core Questions

What must the next agent read, run, and verify?
Does the repo need scaffold, runtime orchestration, or both?
What is the smallest credible default scaffold for this request?
Is an implementation dispatch gate required before code changes?
Which agent should implement the next step?
Which durable artifacts must exist before the task is done?

Implicit Triggers

Trigger this skill even without explicit $harness-engineering when the task implies any of these:

onboarding an agent to an existing repo
bootstrapping a new repo, service, or module
adding, moving, splitting, or renaming source roots or docs trees
fixing stale AGENTS.md, stale docs/index.md, or missing deeper docs maps
creating plans, contracts, handoffs, or QA artifacts for work that changes repo structure, routing docs, or long-lived task entrypoints

Do not trigger this skill for:

routine feature or bugfix work inside an already-stable repo shape
language-specific implementation work that does not change routing docs or orchestration
isolated code review or test repair work with no repo-shaping consequence

Routing Docs

Require a root AGENTS.md. It is the fixed first entrypoint.
Keep AGENTS.md short and routing-only.
If the repo has durable docs, require docs/index.md as the deeper docs map.
Prefer progressive disclosure: AGENTS.md, then docs/index.md, then child indexes only where needed.
A stale path in AGENTS.md or docs/index.md is a defect.
If structure, commands, entrypoints, or current work paths change, update the routing docs in the same pass.

Interaction Modes

Follow ./references/question-gate.md as the shared user-question policy.
DEFAULT is the normal mode. Do not ask the user about routine defaults, optional scope expansion, or whether to keep going. Record assumptions and proceed.
PLAN is for explicit planning passes. Only hard blockers that change the identity of the first runnable artifact or implementation owner may become user questions.
Child agents never ask the user directly. They return NEEDS_CONTEXT or BLOCKED to the parent agent with a blocker classification.
The parent agent owns every user-facing question and should treat NEEDS_CONTEXT as a dispatch or context problem before considering a user question.

Repo As Record

Assume anything the agent cannot access locally does not exist.
Move durable decisions out of chat and into versioned files.
Keep active plans, handoffs, QA notes, references, and quality docs in the repo when the task actually needs them.
Mark generated material clearly.

Bootstrap Default

When the user asks to initialize or bootstrap an empty or near-empty repo and the stack plus target are already clear enough to build, proceed without asking the user to pick from template menus.
Default to the smallest credible baseline that satisfies the request and matches existing repo signals when they exist.
Distinguish identity questions from expansion questions. Identity questions decide what the first runnable artifact is or who should own implementation. Expansion questions only widen scope.
Do not stop for expansion questions during the first scaffold pass. Choose the narrowest runnable baseline, leave widening work out, and record the assumption.
Do not turn the chosen baseline into a confirmation question. Avoid forms such as "先按...吗", "是否先...", or "第一版先...?" once the baseline is already clear enough to execute.
Treat "should I proceed with the minimal baseline?" as a defect, not as cautious behavior, when the first runnable artifact is already clear.
For bootstrap tasks with a clear identity, the first user-facing move should be an execution update that states the assumed baseline and starts the work.
Record assumptions in the plan or handoff instead of turning them into avoidable questions.
Ask only when the missing detail is an identity question that blocks implementation or would force a high-cost wrong turn.

When To Simplify The Harness

As models improve, periodically re-test harness assumptions. Remove components that no longer add clear lift.

Simplification Signals

Consider removing or collapsing roles when:

The model sustains longer coherent runs without under-scoping or losing track
Sprint contracts become ceremonial - the generator and evaluator agree immediately without meaningful negotiation
Evaluators find nothing meaningful across multiple consecutive iterations
Context resets add more overhead than value - the model stays coherent across the full task
Repair loops converge immediately - the first pass is already correct

Real Example: Opus 4.6 Transition

With Opus 4.6's improved planning and sustained agentic performance:

Approach	Duration	Cost	Outcome
Solo agent	20 min	$9	Broken core functionality
Full harness (3-agent + sprint contracts)	6 hr	$200	Working with AI features
Simplified harness (single-pass evaluation)	4 hr	$124	Same quality, faster

What changed:

Removed iterative sprint contracts (model plans more carefully upfront)
Moved to single-pass evaluation with hard thresholds (model sustains longer tasks)
Kept the evaluator for "last mile issues" like stubbed features

Simplification Decision Tree

Is the task short and locally verifiable?
  → Yes: Stay single-agent
  → No: Continue

Does the model consistently complete similar tasks without drift?
  → Yes: Try single-agent with explicit contract
  → No: Add planner or evaluator

Does the evaluator consistently find real, actionable issues?
  → Yes: Keep evaluator
  → No: Remove or defer evaluator to final pass only

Do sprint contracts produce meaningful negotiation?
  → Yes: Keep contracts
  → No: Move to single-pass with post-hoc evaluation

Maintaining Quality While Simplifying

When removing harness layers:

Keep hard thresholds - even without contracts, define explicit pass/fail criteria
Keep external verification - evaluators should still touch the real system
Keep evidence requirements - concrete artifacts over narration
Add back quickly - if quality drops, restore the removed layer

Documentation

When simplifying:

Record the simplification decision and date
Note which model version justified it
Define rollback triggers (quality regressions that would restore the layer)

See ./references/long-running-harness-reference.md for the full simplification heuristics and article-backed rationale.

Dispatch Gate

Before stack-owned code-writing starts, run an implementation dispatch gate when the user request or repo signals make the stack clear enough to assign an implementation owner.
The gate decides whether a matching domain-specific implementation agent must own the build step or whether a generic generator is allowed.
Skip the gate for routing-doc-only, read-only, or clearly stack-neutral tasks.
Record the gate result in the handoff before implementation starts.
Keep dispatch in the parent agent. Do not have child agents open nested codex exec, codex review, or other Codex CLI sessions to delegate work.

Agent Selection

Codex Agents

planner: prefer harness-planner, then research-analyst, then product-manager.
dispatch gate: prefer harness-dispatch-gate, then the parent agent using ./references/subagent-dispatch-playbook.md.
generator: use the gate result. Prefer the matching language-specific implementation agent when required. Use harness-generator only when the gate explicitly allows a generic generator. Fall back to worker last.
evaluator: prefer harness-evaluator, then reviewer, qa-expert, or browser-debugger.
doc-gardener: use harness-doc-gardener after structure-heavy changes.

Claude Code Agents

planner: use harness-planner (see ./agents/harness-planner.md)
dispatch gate: use harness-dispatch-gate (see ./agents/harness-dispatch-gate.md)
generator: use harness-generator (see ./agents/harness-generator.md)
evaluator: use harness-evaluator (see ./agents/harness-evaluator.md)
doc-gardener: use harness-doc-gardener (see ./agents/harness-doc-gardener.md)

See ./agents/README.md for detailed usage instructions and platform differences.

Cross-Platform Rules

Do not silently stay single-agent after detecting an explicit language match.
Record the gate result, chosen implementation agent, or dispatch limitation in the handoff.
If another role is needed mid-task, the child agent must return NEEDS_CONTEXT or BLOCKED and ask the parent agent to dispatch it directly.
Keep language and framework mappings in ./references/subagent-dispatch-playbook.md.

Handoffs

Every planner, generator, and evaluator handoff should state:

objective
artifacts read first
write or read-only scope
done criteria
evidence required on return
retry or escalate trigger
routing docs to update when the project surface changes

Verification

Prefer commands, tests, logs, traces, screenshots, and concrete failures over narration.
Reproduce, repair, and re-verify instead of stopping at explanation.
If review feedback repeats, turn it into a rule, test, or doc update when practical.

Drift

After structure-heavy changes or merges, refresh routing docs against the current tree.

Skill Maintenance

After changing SKILL.md or agents/openai.yaml, verify that trigger wording, install paths, and referenced docs still match the current tree.
Regenerate or manually trim UI metadata so short_description stays short and default_prompt stays copyable.
Run the skill validator when the environment provides it. If it does not, do a manual frontmatter and path check before closing the update.

References

./references/harness-principles.md
./references/runtime-orchestration-principles.md
./references/long-running-harness-reference.md for article-backed long-running heuristics, role selection, contract shape, and reset or simplification decisions
./references/subagent-dispatch-playbook.md (Codex-specific dispatch patterns)
./references/doc-index-maintenance.md
./references/openai-docs-index-pattern.md
./references/question-gate.md for shared DEFAULT vs PLAN interaction rules and blocker taxonomy
./references/codex-to-claude-migration.md for platform differences and agent migration guide
./agents/ for Claude Code-compatible agent definitions
./codex/agents/*.toml for Codex-specific role behavior