Run any Skill in Manus with one click

$pwd:

writing-agent-relay-workflows

Name: Writing Agent Relay Workflows
Author: AgentWorkforce

// Use when building multi-agent workflows with the relay broker-sdk - covers conversation-shape vs pipeline-shape coordination, repairable/reliable workflow gates, the WorkflowBuilder API, DAG step dependencies, agent definitions, step output chaining via {{steps.X.output}}, verification gates, evidence-based completion, owner decisions, dedicated channels, dynamic channel management (subscribe/unsubscribe/mute/unmute), swarm patterns, chat-native coordination recipes (Q/A, broadcast-ack, peer review, standup, hand-off), error handling, event listeners, step sizing rules, authoring best practices, and the lead+workers team pattern for complex steps

Run Skill in Manus

$ git log --oneline --stat

stars:0

forks:0

updated:May 9, 2026 at 14:33

SKILL.md

readonly

related-skills.json

same repository

relay-80-100-workflow.md

from "AgentWorkforce/relayfile-adapters"

Use when writing agent-relay workflows that must fully validate features end-to-end before merging. Covers the 80-to-100 pattern - going beyond "code compiles" to "feature works, tested E2E locally." Includes repair-before-failure validation gates, PGlite for in-memory Postgres testing, mock sandbox patterns, test-fix-rerun loops, verify gates after every edit, and the full lifecycle from implementation through passing tests to commit.

2026-05-090

writing-agent-relay-workflows.md

from "AgentWorkforce/relayfile-adapters"

Use when building multi-agent workflows with the relay broker-sdk - covers conversation-shape vs pipeline-shape coordination, repairable/reliable workflow gates, the WorkflowBuilder API, DAG step dependencies, agent definitions, step output chaining via {{steps.X.output}}, verification gates, evidence-based completion, owner decisions, dedicated channels, dynamic channel management (subscribe/unsubscribe/mute/unmute), swarm patterns, chat-native coordination recipes (Q/A, broadcast-ack, peer review, standup, hand-off), error handling, event listeners, step sizing rules, authoring best practices, and the lead+workers team pattern for complex steps

2026-05-090

relay-80-100-workflow.md

from "AgentWorkforce/relayfile-adapters"

2026-05-090

package.json

"author": "AgentWorkforce"

"repository": "AgentWorkforce/relayfile-adapters"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

Run any Skill with one click

name

writing-agent-relay-workflows

description

Writing Agent Relay Workflows

Overview

The relay broker-sdk workflow system orchestrates multiple AI agents (Claude, Codex, Gemini, Aider, Goose) through typed DAG-based workflows. Workflows can be written in TypeScript (preferred), Python, or YAML.

Language preference: TypeScript > Python > YAML. Use TypeScript unless the project is Python-only or a simple config-driven workflow suits YAML.

Pattern selection: Do not default to dag blindly. If the job needs a different swarm/workflow type, consult the choosing-swarm-patterns skill when available and select the pattern that best matches the coordination problem.

When to Use

Building multi-agent workflows with step dependencies
Orchestrating different AI CLIs (claude, codex, gemini, aider, goose)
Creating DAG, pipeline, fan-out, or other swarm patterns
Needing verification gates, retries, or step output chaining
Designing product-contract workflows where failing checks should route to agents for repair instead of stopping the run
Dynamic channel management: agents joining/leaving/muting channels mid-workflow

Default Principle: Workflows Repair Before They Fail

The point of an agent team workflow is not to discover a red gate and stop. The point is to capture the failure, route it to the right agent, fix it, and continue toward a shippable result. Author non-trivial workflows as repairable systems:

Run deterministic checks as evidence-capturing gates with captureOutput: true.
Prefer failOnError: false for intermediate validation gates so the workflow can pass the output to a repair agent.
Add a repair step immediately after each red-prone gate. The repair agent reads {{steps.<gate>.output}}, fixes source/tests/config, reruns the same command locally, and exits only after the gate is green or the blocker is external.
Keep final acceptance deterministic, but still put an agent repair step before commit/PR creation. If the repair budget is exhausted or a true external blocker remains, write a blocked artifact and skip commit/PR creation; do not let the workflow end as FAILED.
Use .reliable() or .repairable() on SDK versions that support it, especially for product-contract workflows. As of AgentWorkforce/relay#827, retry-mode workflows with agents are repair-aware by default, repair agents run before retrying malformed/failed agent steps, and the SDK covers DAG, pipeline, fan-out, worktree-backed, deterministic-only, and agent-plus-gate shapes.

Avoid hard-stop gates (failOnError: true with no repair step) in workflows that are supposed to be self-healing. Even cheap preconditions such as missing credentials, wrong repository, or an unsafe dirty worktree should normally write a clear BLOCKED_* artifact and exit cleanly. For implementation, build, test, lint, schema, artifact, and review failures, model the fix path in the workflow.

Choose Your Coordination Style — Conversation vs Pipeline

Before writing the workflow, decide how the agents will coordinate. The relay primitive supports two very different shapes, and picking the wrong one wastes the most valuable thing the SDK gives you.

Shape	What it is	Use when
Conversation (chat-native)	Interactive agents share a channel; messages, `@-mentions`, and ambient awareness drive coordination. Lead and workers spawn in parallel and self-organize. The relay is the coordination layer, not just transport.	Multi-file work, peer review loops, cross-agent feedback, dynamic re-planning, multi-PR coordination, anything with a human-in-the-loop escape, swarms where workers pick up each other's output.
Pipeline (one-shot DAG)	Each step runs as a one-shot subprocess (`claude -p`, `codex exec`); steps hand off via `{{steps.X.output}}` text injection. No agents are alive at the same time; no chat happens.	Linear, well-specified transformations; deterministic data passing; no review loop expected; the work could be expressed as a `bash \| bash \| bash` pipe.

Default to Conversation for any non-trivial work. Pipeline DAGs are simpler to reason about but they do not exercise the relay primitive — they are a Unix pipe with extra steps. If you would happily write the same task as a single shell pipeline, pipeline-shape is fine. Otherwise, you almost certainly want a Conversation shape.

The two shapes can mix within one workflow: pipeline-style deterministic preflight → conversation in the middle → pipeline-style commit-and-PR at the end. See Quick Reference (Conversation) below and Common Patterns → Interactive Team for the canonical recipe.

A blunt rule of thumb: if your workflow only uses agent steps with preset: 'worker' chained by {{steps.X.output}}, you are not using the relay — you are using claude -p | codex exec. That may still be the right answer; just make it a deliberate choice.

Quick Reference (Pipeline shape)

Use this when steps are linear, well-specified, and need no agent-to-agent feedback. For anything with iteration, review, or coordination, jump to Quick Reference (Conversation shape) below.

Note: this Quick Reference assumes an ESM workflow file (the host package.json has "type": "module"). For CJS repos, see rule #1 in Critical TypeScript rules below — convert import { workflow } from '@agent-relay/sdk/workflows' to const { workflow } = require('@agent-relay/sdk/workflows') and wrap the workflow in async function main() { ... } main().catch(console.error) since CJS does not support top-level await. Always check package.json before copy-pasting the snippet.

import { workflow } from '@agent-relay/sdk/workflows';

const result = await workflow('my-workflow')
  .description('What this workflow does')
  .pattern('dag') // or 'pipeline', 'fan-out', etc.
  .channel('wf-my-workflow') // dedicated channel (auto-generated if omitted)
  .maxConcurrency(3)
  .timeout(3_600_000) // global timeout (ms)

  .agent('lead', { cli: 'claude', role: 'Architect', retries: 2 })
  .agent('worker', { cli: 'codex', role: 'Implementer', retries: 2 })

  .step('plan', {
    agent: 'lead',
    task: `Analyze the codebase and produce a plan.`,
    retries: 2,
    verification: { type: 'output_contains', value: 'PLAN_COMPLETE' },
  })
  .step('implement', {
    agent: 'worker',
    task: `Implement based on this plan:\n{{steps.plan.output}}`,
    dependsOn: ['plan'],
    verification: { type: 'exit_code' },
  })

  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

  console.log('Result:', result.status);

Quick Reference (Conversation shape)

Use this for any non-trivial work — peer review, multi-file edits, cross-agent feedback, dynamic re-planning. Lead and workers spawn in parallel on a shared channel and self-organize via messages. The relay primitive does the coordinating; verification gates downstream of the lead close the workflow.

import { workflow } from '@agent-relay/sdk/workflows';
import { ClaudeModels, CodexModels } from '@agent-relay/config';

const result = await workflow('my-workflow')
  .description('Multi-file change with peer review')
  .pattern('dag')
  .channel('wf-my-feature')          // dedicated channel — agents share it
  .maxConcurrency(4)
  .timeout(3_600_000)
  .repairable()

  // Interactive agents — no preset, they live on the channel
  .agent('lead', {
    cli: 'claude',
    model: ClaudeModels.OPUS,
    role: 'Architect + reviewer. Plans, assigns, reviews, posts feedback.',
    retries: 1,
  })
  .agent('impl-a', {
    cli: 'codex',
    model: CodexModels.GPT_5_4,
    role: 'Implementer. Listens on channel for assignments and feedback.',
    retries: 2,
  })
  .agent('impl-b', {
    cli: 'codex',
    model: CodexModels.GPT_5_4,
    role: 'Implementer. Listens on channel for assignments and feedback.',
    retries: 2,
  })

  // Deterministic context — pre-reads files once, posts to the channel for everyone
  .step('context', {
    type: 'deterministic',
    command: 'git ls-files src/',
    captureOutput: true,
  })

  // Lead and workers all depend on `context` — they start CONCURRENTLY.
  // They coordinate over #wf-my-feature, not via {{steps.X.output}}.
  .step('lead-coordinate', {
    agent: 'lead',
    dependsOn: ['context'],
    task: `You are the lead on #wf-my-feature. Workers: impl-a, impl-b.
Post the plan. Assign files. Review their PRs/diffs. Post feedback in-channel.
Workers iterate based on your feedback. Exit when both files pass review.`,
  })
  .step('impl-a-work', {
    agent: 'impl-a',
    dependsOn: ['context'],   // SAME dep as lead → starts in parallel, no deadlock
    task: `You are impl-a on #wf-my-feature. Wait for the lead's plan.
Implement your assigned file. Post a completion message. Address feedback.`,
  })
  .step('impl-b-work', {
    agent: 'impl-b',
    dependsOn: ['context'],   // SAME dep as lead
    task: `You are impl-b on #wf-my-feature. Wait for the lead's plan.
Implement your assigned file. Post a completion message. Address feedback.`,
  })

  // Downstream gates on the lead — lead exits when satisfied.
  // Capture failures, then hand them to an agent for repair.
  .step('verify', {
    type: 'deterministic',
    dependsOn: ['lead-coordinate'],
    command: 'npm run typecheck && npm test 2>&1',
    captureOutput: true,
    failOnError: false,
  })
  .step('repair-verify', {
    agent: 'lead',
    dependsOn: ['verify'],
    task: `If verification passed, summarize evidence.
If it failed, use this output to assign and fix issues, then rerun the command until green:
{{steps.verify.output}}`,
    verification: { type: 'exit_code' },
  })
  .step('verify-final', {
    type: 'deterministic',
    dependsOn: ['repair-verify'],
    command: 'npm run typecheck && npm test 2>&1',
    captureOutput: true,
    failOnError: true,
  })

  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

What this exercises that pipeline-shape does not:

Ambient awareness — workers see each other's completion messages and start dependent work without the lead relaying.
Lead-as-reviewer — the lead reads actual files between rounds and posts diff-aware feedback in chat. One agent does coordination + review; no separate reviewer step.
Iterative correction — when the lead pings "impl-a, the type on line 42 is wrong", impl-a fixes and re-posts. No new step, no re-spawn, no {{output}} chaining.

Critical workflow rules for this shape:

Lead and workers MUST share the same dependsOn (e.g., both depend on context). If a worker depends on the lead, you have a deadlock — the lead is waiting for worker output, the worker is waiting for the lead step to "complete."
Drop preset: 'worker' on the implementer agents — interactive mode is what lets them receive channel messages via PTY injection.
Downstream gates depend on the lead step, not the workers. The lead exits when it's satisfied; that's the workflow's signal that implementation is ready for repairable deterministic checks.
Use a dedicated .channel('wf-...') so the team is isolated from other workflows and the global general channel.

See Common Patterns → Interactive Team for production notes from real runs and decision criteria for picking this shape over one-shot DAG.

Critical TypeScript rules:

Check the project's package.json for "type": "module" — if ESM, use import and top-level await. If CJS, use require() and wrap in async function main().
agent-relay run <file.ts> executes the file as a standalone subprocess — it does NOT inspect exports. The file MUST call .run().
Use .run({ cwd: process.cwd() }) — createWorkflowRenderer does not exist
Validate with --dry-run before running: agent-relay run --dry-run workflow.ts

⚡ Parallelism — Design for Speed

This is the most important design consideration. Sequential workflows waste hours. Always design for maximum parallelism.

Cross-Workflow Parallelism: Wave Planning

When a project has multiple workflows, group independent ones into parallel waves:

# BAD — sequential (14 hours for 27 workflows at ~30 min each)
agent-relay run workflows/34-sst-wiring.ts
agent-relay run workflows/35-env-config.ts
agent-relay run workflows/36-loading-states.ts
# ... one at a time

# GOOD — parallel waves (3-4 hours for 27 workflows)
# Wave 1: independent infra (parallel)
agent-relay run workflows/34-sst-wiring.ts &
agent-relay run workflows/35-env-config.ts &
agent-relay run workflows/36-loading-states.ts &
agent-relay run workflows/37-responsive.ts &
wait
git add -A && git commit -m "Wave 1"

# Wave 2: testing (parallel — independent test suites)
agent-relay run workflows/40-unit-tests.ts &
agent-relay run workflows/41-integration-tests.ts &
agent-relay run workflows/42-e2e-tests.ts &
wait
git add -A && git commit -m "Wave 2"

Wave Planning Heuristics

Two workflows can run in parallel if they don't have write-write or write-read file conflicts:

Touch Zone	Can Parallelize?
Different `packages/*/src/` dirs	✅ Yes
Different `app/` routes	✅ Yes
Same package, different subdirs	⚠️ Usually yes
Same files (shared config, root package.json)	❌ No — sequential or same wave with merge
Explicit dependency	❌ No — ordered waves

Declare File Scope for Planning

Help wave planners (human or automated) understand what each workflow touches:

workflow('48-comparison-mode')
  .packages(['web', 'core'])                // monorepo packages touched
  .isolatedFrom(['49-feedback-system'])      // explicitly safe to parallelize
  .requiresBefore(['46-admin-dashboard'])    // explicit ordering constraint

Within-Workflow Parallelism

Use shared dependsOn to fan out independent sub-tasks:

// BAD — unnecessary sequential chain
.step('fix-component-a', { agent: 'worker', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'worker', dependsOn: ['fix-component-a'] })  // why wait?

// GOOD — parallel fan-out, merge at the end
.step('fix-component-a', { agent: 'impl-1', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'impl-2', dependsOn: ['review'] })  // same dep = parallel
.step('verify-all', { agent: 'reviewer', dependsOn: ['fix-component-a', 'fix-component-b'] })

Impact

Real-world example (Relayed — 60 workflows):

Sequential: ~30 min × 60 = 30 hours
Parallel waves (4-6 per wave): ~12 waves × 35 min = ~7 hours (4x faster)
Aggressive parallelism (8-way): ~4 hours (7.5x faster)

Failure Prevention

These workflow files are easy to break in ways that only appear mid-run. Follow these rules when authoring or editing workflow .ts files.

1. Do not use raw top-level `await`

Executor-driven workflow files may be run through a tsx/esbuild path that behaves like CJS. Raw top-level await can fail with:

Top-level await is currently not supported with the "cjs" output format

Always wrap execution like this:

async function runWorkflow() {
  const result = await workflow('my-workflow')
    // ...
    .run({ cwd: process.cwd() });

  console.log('Workflow status:', result.status);
}

runWorkflow().catch((error) => {
  console.error(error);
  process.exit(1);
});

Do not end workflow files with bare top-level await workflow(...).run(...).

1b. Make commit and PR boundaries explicit

Workflows do not get a PR for free just because they pass validation. If the intended deliverable is a branch, commit, push, or GitHub PR, the workflow itself must own that boundary explicitly and document the expected file scope.

Use this pattern only when the workflow is supposed to own repository delivery:

Preflight the git state and fail on unexpected staged changes.
Create or verify the intended branch.
Run implementation, review, repairable validation, fix, and final acceptance gates.
Stage only the declared target files and review/signoff artifacts.
Commit with a deterministic message.
Push the branch.
Use createGitHubStep({ action: 'createPR', ... }) from @agent-relay/sdk to open the PR.
Verify the PR URL/state deterministically and write it into the final signoff artifact.

Do not hide commit/PR work in agent prose. Model it as deterministic steps whenever possible. For PR creation, issue updates, file reads, or any GitHub operation, prefer createGitHubStep over shelling out to gh; import it from the SDK root (import { createGitHubStep } from '@agent-relay/sdk') on SDK versions that include AgentWorkforce/relay#823, or from the legacy subpath only when pinned to an older SDK. The downstream acceptance gate must still verify the PR exists before signoff, and any PR creation failure should route to a repair step before the workflow stops.

If commit or PR creation is intentionally outside the workflow, say that directly in the workflow description and signoff so the operator knows to do it after completion.

2. Avoid raw fenced code blocks inside workflow task template literals

Raw triple-backtick code fences inside large inline task: \...`template strings are fragile and can break outer TypeScript parsing, especially when they contain language tags likeswiftordiff`.

Preferred options, in order:

Avoid inline fenced examples entirely
Move larger examples to referenced files
Use plain indented examples instead of fenced blocks
If fenced blocks must exist inside generated inner code, escape them consistently and syntax-check the outer workflow file afterward

2b. Standard preflight template for resumable workflows

Every non-trivial workflow should start with a deterministic preflight step that validates the environment before any agent runs. A workflow that fails mid-DAG and gets re-run (or resumed via --start-from) will re-execute preflight, so preflight must tolerate the partial state left behind by the previous run — specifically, dirty files that the workflow itself is expected to edit.

The battle-tested template:

.step('preflight', {
  type: 'deterministic',
  command: [
    'set -e',
    'BRANCH=$(git rev-parse --abbrev-ref HEAD)',
    'echo "branch: $BRANCH"',
    'if [ "$BRANCH" != "fix/your-branch-name" ]; then echo "ERROR: wrong branch"; exit 1; fi',
    // Files the workflow is allowed to find dirty on entry:
    //   - package-lock.json: npm install is idempotent and often touches it
    //   - every file the workflow's edit steps will rewrite: a prior partial
    //     run may have left them dirty, and the edit step will rewrite
    //     them cleanly before commit
    // Everything else is unexpected drift and must fail preflight.
    'ALLOWED_DIRTY="package-lock.json|path/to/file1\\\\.ts|path/to/file2\\\\.ts"',
    'DIRTY=$(git diff --name-only | grep -vE "^(${ALLOWED_DIRTY})$" || true)',
    'if [ -n "$DIRTY" ]; then echo "ERROR: unexpected tracked drift:"; echo "$DIRTY"; exit 1; fi',
    'if ! git diff --cached --quiet; then echo "ERROR: staging area is dirty"; git diff --cached --stat; exit 1; fi',
    'gh auth status >/dev/null 2>&1 || (echo "ERROR: gh CLI not authenticated"; exit 1)',
    'echo PREFLIGHT_OK',
  ].join(' && '),
  captureOutput: true,
  failOnError: true,
}),

Rules baked into this template:

Always include package-lock.json in ALLOWED_DIRTY. Both npm install and npm ci can touch it idempotently.
Include every file the workflow's edit steps will rewrite. The commit step uses explicit git add <path> (never git add -A), so allowing these files to be dirty on entry is safe — unrelated drift in other files still fails preflight.
Escape dots in regex paths: setup\.ts not setup.ts. In a JS template literal this means four backslashes: "setup\\\\.ts".
Use grep -vE "^(...)$" for full-line match. Substring matches bleed across unrelated files (e.g., setup.ts would also match packages/core/src/bootstrap/setup.ts).
Append || true to the grep. Without it, an empty result triggers set -e and the whole preflight fails before the if can even run.
Check the staging area separately. A dirty index is different from a dirty working tree and both must be clean (modulo allow-list).
Check gh auth status early if downstream GitHub operations will use the local transport. Failing on auth at the end of a long DAG is painful.

Never use git diff --quiet alone as your "clean tree" check. It fails on any dirty file, including the ones the workflow is expected to rewrite, which causes false failures on every resume / re-run.

2c. Picking the right `.join()` for multi-line shell commands

When a command: field is a JS array that gets joined into a shell command string, the join delimiter determines what kinds of content the array can contain.

.join(' && ') — use when every element is a self-contained shell statement. Each element becomes independent and the next one runs only if the previous succeeded. Works for linear scripts with set -e.

command: [
  'set -e',
  'HITS=$(grep -c diag src/cli/commands/setup.ts || true)',
  'if [ "$HITS" -lt 6 ]; then echo "FAIL"; exit 1; fi',
  'echo OK',
].join(' && '),

.join('\n') — use when array elements must be part of a larger compound statement that spans multiple physical lines:

heredocs (cat <<EOF ... EOF)
multi-line if / while / for bodies
shell functions defined inline

&& is a command separator. It cannot appear between a heredoc's opening line and its body, between a for and its body, or inside an if's consequent block. Joining such content with && produces a shell syntax error.

Never mix heredocs with && joining. The most common failure mode:

// ❌ BROKEN — heredoc body gets && inserted between each line
command: [
  'set -e',
  'cat > /tmp/f <<EOF',
  'line 1',
  'line 2',
  'EOF',
  'next-command',
].join(' && '),

Results in set -e && cat > /tmp/f <<EOF && line 1 && line 2 && EOF && next-command — a shell syntax error because && cannot appear inside a heredoc body. Use .join('\n') or (better) sidestep the heredoc entirely.

The printf + mktemp alternative — use this for commit messages, raw-CLI fallback PR bodies, and any other multi-line file content. It avoids heredocs altogether and composes with .join(' && '):

command: [
  'set -e',
  'BODY=$(mktemp)',
  // Each line of the file is a separate printf argument. No heredoc,
  // no shell metacharacter hazards, no command-substitution nesting.
  'printf "%s\\n" "## Summary" "" "body line 1" "body line 2" > "$BODY"',
  'gh pr create --title "..." --body-file "$BODY"',
  'rm -f "$BODY"',
].join(' && '),

This pattern is specifically recommended over git commit -m "$(cat <<'EOF' ... EOF)" and raw gh pr create --body "$(cat <<'BODY' ... BODY)". Nesting a heredoc inside $(...) forces the shell to match a closing paren across many lines of unparsed body text, and any stray parenthesis in the body text can silently break the match. --body-file + mktemp + printf is immune to that entire class of bug. For workflow-owned PR creation, prefer createGitHubStep over raw gh; this shell pattern is for raw CLI fallback cases.

2d. Template-literal escape sequences are processed once before the string is rendered

If your file generates code as a giant template literal (the pattern used by packages/core/src/bootstrap/script-generator.ts in cloud), every backslash in that template gets processed by JavaScript before the string is returned. This silently breaks regexes and escape sequences that are meant to appear in the generated output.

Specifically:

\s is not a recognized string escape → the backslash is stripped → \s renders as a literal s
\b is a recognized string escape (backspace, U+0008) → \b renders as a backspace character in the output
\n, \t, \r, \\, \0, \uXXXX, \xXX all get resolved at template time

The footgun: the outer TypeScript compiles cleanly, the rendered code parses and runs, and the regex/escape just never matches what the author intended. See AgentWorkforce/cloud#113 for the exact incident (hasConfigExport = /^export\s+.../m silently became /^exports+.../m in the generated bootstrap, making every TS workflow fall through to the standalone-script fallback).

Guidelines:

If you want a regex pattern that survives the template-literal pass unchanged, double every backslash in the source: \\s, \\b, \\n (the \\ renders to \ in the output, producing a correct regex at runtime).
If you want to write a long string-literal newline into the output, '\\n' in the template renders to '\n' in the output, which the runtime JS interprets as a newline. Using a literal '\n' would render an actual newline into the JS source — visually messy and sometimes surprising.
If you add anything non-trivial to a generator file that returns a big template literal, add a unit test that calls the generator with canonical inputs and asserts something about the rendered output — either exact string matches or, for regexes, eval/construct the regex and test it against known samples. See tests/orchestrator/script-generator.test.ts in cloud for prior art.

Task-prompt workaround: for agent-relay workflow task prompts (where the contents go into a template literal but the inner content is plain text for an LLM), it's often cleaner to build the string as an array and .join('\n') at the boundary. That sidesteps the "does this backslash survive?" question entirely — no backslashes in the source, no processing to reason about. Several workflows in cloud/workflows/ use this pattern (see the sage migration PRs).

3. Keep final verification boring and deterministic

Final verification should validate real outputs with simple, portable shell commands. If checking for multiple symbols, use extended regex explicitly:

grep -Eq "foo|bar|baz" file.ts

Do not rely on basic grep alternation like:

grep -c "foo\|bar\|baz" file.ts

That can silently misbehave and create fake failures even when the generated code is correct.

4. Separate durable outputs from execution exhaust

Commit:

generated product code
migrations
tests
docs
workflow-definition fixes

Do not commit by default:

.logs/
transient executor output
retry artifacts
temporary step-output files

5. Prefer Codex for implementation-heavy roles and Claude for review

Default team split for workflow-authored agent roles:

lead / implementer / writer / fixer → codex
reviewer → claude

Use Claude as the primary implementer only when there is a specific reason.

6. Be explicit about shell requirements

If executor scripts use Bash-only features such as associative arrays, require modern Bash explicitly. On macOS, prefer a known-good Bash path when needed, for example:

/opt/homebrew/bin/bash workflows/your-workflow/execute.sh --wave 2

7. Make resume semantics explicit

Document clearly whether the executor supports:

full-run continuation
--wave
--workflow
--resume

Do not assume users will infer the behavior. In particular, --wave N should be understood as "run only this wave" unless the executor explicitly chains onward.

7a. `--resume` vs `--start-from` when fixing a buggy step

When a workflow fails at step X and you want to re-run it after editing the workflow file, the flag choice matters:

Flag	Reads workflow file fresh?	Uses cached step outputs?
`--resume <id>`	❌ replays stored config from DB	✅ from same run id
`--start-from <step> --previous-run-id <id>`	✅ reads fresh file	✅ from previous run id's cached outputs

Rule: if you edited the workflow file to fix the failing step, use --start-from <failing-step> --previous-run-id <id>, not --resume <id>. --resume pulls the entire workflow config from the run's DB record and replays it — your edits to the workflow file are ignored, and the step re-runs with its original (broken) definition.

This is counterintuitive because "resume" sounds like "pick up where you left off with whatever I just changed." It does not. It picks up where you left off with the stored config from when the run first started.

When to use each:

Transient failure (network hiccup, rate limit, flaky agent), no code edits: --resume <id> is fine, fast, and correct.
You edited the workflow file (any step definition, any prompt, any verify gate): always --start-from <failing-step> --previous-run-id <id>. Everything upstream of the failing step loads from cache, the fresh file supplies the fixed definition, and downstream steps run as normal.

If the runner complains that --start-from can't find cached outputs for the previous run id, fall back to a clean from-scratch run. The workflow's preflight should be forgiving enough (see §2b "Standard preflight template") that a from-scratch re-run succeeds even when a prior partial run left files dirty.

8. Syntax-check workflow files after editing

After editing workflow .ts files, run a lightweight syntax check before launching a large batch run. This is especially important if the workflow contains:

large inline task template literals
embedded code examples
escaped backticks
wrapper changes around workflow execution

9. Factor repo-specific setup into a shared helper

If multiple workflows in the same repo need the same boilerplate before any agent touches code (branch checkout, npm install, workspace-package prebuild, language toolchain init, etc.), do not copy-paste those steps into every workflow. Put them in workflows/lib/<repo>-setup.ts and import from there.

Why it matters: without a shared helper, the first workflow that needs a new prerequisite step (e.g. npm run build:platform because a workspace package's package.json points types at dist/) adds it locally, and every other workflow silently misses it. In a fresh cloud sandbox that means agents hit Cannot find module '@cloud/platform' during typecheck and paper over it with ad-hoc external-modules.d.ts shims or as GetObjectCommandOutput casts scattered across unrelated files. Those workarounds sync back down with the patch and pollute the PR.

Pattern:

// workflows/lib/cloud-repo-setup.ts
export interface CloudRepoSetupOptions {
  branch: string;
  committerName?: string;
  extraSetupCommands?: string[];
  skipWorkspaceBuild?: boolean;
}

export function applyCloudRepoSetup<T>(wf: T, opts: CloudRepoSetupOptions): T {
  // adds two steps: setup-branch, install-deps
  // install-deps runs: npm install + workspace prebuilds (build:platform, build:core, etc.)
  // ...
}

Consumer workflows break the builder chain once and call through:

const baseWf = workflow(NAME)
  .description(...)
  .pattern('dag')
  .agent(...)
  .agent(...);

const wf = applyCloudRepoSetup(baseWf, {
  branch: BRANCH,
  committerName: 'My Workflow Bot',
});

await wf
  .step('read-spec', { dependsOn: ['install-deps'], ... })
  ...
  .run(...);

Rules:

The helper lives in the consumer repo, not in the SDK. Different customer repos have different languages, package managers, and build graphs — @agent-relay/sdk should stay agnostic.
Pre-build any workspace package whose package.json main/types point at a generated dist/. Fresh sandboxes don't have that dist/ yet, and agents will invent workarounds rather than run the build. See the @cloud/platform case above.
Every install step includes --legacy-peer-deps --no-audit --no-fund 2>&1 | tail -10 (or equivalent noise-trimming) because full install output blows past captureOutput size limits.
Document the helper in the repo's CLAUDE.md / AGENTS.md so new workflow authors (and agents writing workflows) discover it.

End-to-End Bug Fix Workflows

For bug-fix or reliability workflows, do not stop at unit or integration tests. The workflow should explicitly prove that the original user-visible problem is fixed.

Required phases for fix workflows

Capture the original failure
- Reproduce the bug first in a deterministic or evidence-capturing step
- Save exact commands, logs, status codes, or screenshots/artifacts
State the acceptance contract
- Define the exact end-to-end success criteria before implementation
- Include the real entrypoint a user would run
Implement the fix
Rebuild / reinstall from scratch
- Do not trust dirty local state
- Prefer a clean environment when install/bootstrap behavior is involved
Run targeted regression checks
- Unit/integration tests are helpful but not sufficient by themselves
Run a full end-to-end validation
- Use the real CLI / API / install path
- Prefer a clean environment (Docker, sandbox, cloud workspace, Daytona, etc.) for install/runtime issues
Compare before vs after evidence
- Show that the original failure no longer occurs
Record residual risks
- Call out what was not covered
Ship the result as a PR
- Open the pull request from the workflow itself with createGitHubStep
- See Shipping the Result — Open a PR via createGitHubStep below
- A workflow that fixes a bug and stops short of the PR has only done half the loop

Clean-environment validation guidance

When the bug involves install, bootstrap, PATH/shims, auth, brokers, background services, OS-specific packaging, or first-run UX, add a second workflow (or second phase) that validates the fix in a fresh environment.

Preferred order of proving environments:

disposable sandbox / cloud workspace
Docker / containerized environment
fresh local shell with isolated paths

Meta-workflow guidance

If the right proving environment is unclear, first write a meta-workflow that:

compares candidate validation environments
defines the acceptance contract
chooses the best swarm pattern
then authors the final fix/validation workflow

This is often better than jumping straight to implementation.

Shipping the Result — Open a PR via `createGitHubStep`

A workflow whose final artifact is "a clean working tree on a sandbox you'll throw away" has not shipped anything. End every code-changing workflow by opening a pull request, and do it from inside the workflow using createGitHubStep from @agent-relay/sdk. Don't tell the operator to follow up with gh pr create — make the workflow's own last step the PR.

Why `createGitHubStep` (and not raw `gh` / `octokit`)

The primitive picks the right transport at runtime:

Where the workflow runs	Transport `createGitHubStep` uses	What you provide
Local (`agent-relay run`)	`gh` CLI	`gh auth status` works
Cloud (`agent-relay cloud run`) — tenant-scoped	Nango → workspace's GitHub App installation	Nothing — cloud injects credentials
Cloud — fallback	Relay-cloud GitHub proxy	Nothing — cloud injects credentials

You write one workflow. The same createPR step opens a PR via your local gh when you iterate on it on a laptop, and via the workspace's GitHub App when the same file runs in agent-relay cloud run. No branching by environment, no env-var sniffing in your task strings, no "this part only works in cloud" caveats. That's the whole point of the adapter.

Phase C interaction (cloud only): agent-relay cloud run already auto-pushes per-paths[] diffs as separate PRs after the workflow callback when the repos are allowlisted (see pushedTo in the run record). Phase C is the catch-all — if your workflow does nothing else, you still get one PR per declared path. Use createGitHubStep on top of that when you need PRs the catch-all can't produce: cross-cutting issues, follow-up tracking issues, opening one PR that spans multiple paths, draft PRs you want labeled/assigned in specific ways, or PRs against a repo you didn't paths[] in.

The minimal "open a PR" recipe

import { workflow } from '@agent-relay/sdk/workflows';
import { createGitHubStep } from '@agent-relay/sdk';

const REPO = 'AgentWorkforce/cloud';
const BRANCH = `agent-relay/run-${Date.now()}`;

await workflow('feature-x')
  // ... your real steps that produce code changes ...
  .step('write-marker', {
    type: 'deterministic',
    command: `echo "fix landed at $(date -u)" >> CHANGELOG.md`,
  })

  // Branch off main on the remote.
  .step('create-branch', createGitHubStep({
    dependsOn: ['write-marker'],
    action: 'createBranch',
    repo: REPO,
    params: { branch: BRANCH, source: 'main' },
  }))

  // Commit the change to the branch via Contents API.
  .step('commit-change', createGitHubStep({
    dependsOn: ['create-branch'],
    action: 'createFile',
    repo: REPO,
    params: {
      path: 'CHANGELOG.md',
      branch: BRANCH,
      content: '<file body here>',
      message: 'chore: changelog entry',
    },
  }))

  // Open the PR. This is the load-bearing step.
  .step('open-pr', createGitHubStep({
    dependsOn: ['commit-change'],
    action: 'createPR',
    repo: REPO,
    params: {
      title: 'feat: ship feature X',
      head: BRANCH,
      base: 'main',
      body: '## Summary\n\n- ...\n\n## Test plan\n\n- [x] ...',
      draft: false,
    },
    output: { mode: 'data', format: 'json', path: 'html_url' },
  }))

  .run({ cwd: process.cwd() });

createGitHubStep is bundled with @agent-relay/sdk; do not add a separate install. Its actions are stable across runtimes: getRepo, createBranch, createFile, updateFile, createPR, updatePR, getPR, listPRs, mergePR, createIssue, etc. See the SDK GitHub primitive docs for the full enum.

Authoring rules for PR-shipping workflows

Open the PR from the workflow, not from the operator's shell. "Tell the user to run gh pr create" is a regression to a manual step the workflow could have done. The whole point of running this in cloud is that there is no operator's shell.
One PR per workflow, by default. A workflow that opens five PRs from one run is almost always wrong — humans review one PR at a time. If you genuinely need multiple, prefer a tracking issue + linked PRs, or split into separate workflows.
Branch name encodes the run. agent-relay/run-${runId} or agent-relay/${workflow-name}-${timestamp} so reviewers can tell the PR apart from other automation, and so reruns don't clash.
draft: true while iterating. Once the workflow is stable end-to-end, flip to draft: false.
Body is a real PR description. Summary + Test plan, generated from the workflow's own evidence (verification step output, diff stats, test run output). If you find yourself writing a placeholder body, the workflow isn't done — capture the real evidence in an earlier step and template it in.
Don't use createGitHubStep to substitute for paths[] push-back in cloud. If the diff lives in a tarballed paths[] mount, let cloud's Phase C push-back open that PR (it handles the patch generation, branch lifecycle, and per-repo allowlist). Use createGitHubStep when you need a PR against a repo or branch outside the paths[] set, or when you want to add an extra PR (e.g. a tracking issue, a follow-up against a sibling repo, a docs-only PR).
PR creation failures route to repair. If createPR errors (auth, permissions, branch conflict), capture the output and give a repair owner a chance to fix auth, branch state, labels, or body generation before stopping. A "successful" workflow that silently failed to open the PR is the worst-case outcome — the human thinks the work shipped.

Where this fits in the bug-fix phases

End-to-End Bug Fix Workflows lists "Ship the result as a PR" as phase 9. Concretely that means: after phase 7 (compare before/after evidence) succeeds, the workflow's next step is createPR with that evidence templated into the body. The PR opening is the ship — there is no further manual step.

Key Concepts

Step Output Chaining

Use {{steps.STEP_NAME.output}} in a downstream step's task to inject the prior step's terminal output.

Mental model: this is a Unix pipe, not agent communication. {{steps.A.output}} flowing into step B is A | B — A is dead by the time B reads its stdout. There is no chat, no feedback, no addressing. If your workflow's coordination story is only output chaining, you're using the relay as transport, not as a coordination layer. See Choose Your Coordination Style before defaulting to this.

Only chain output from clean sources:

Deterministic steps (shell commands — always clean)
Non-interactive agents (preset: 'worker' — clean stdout)

Never chain from interactive agents (cli: 'claude' without preset) — PTY output includes spinners, ANSI codes, and TUI chrome. Instead, have the agent write to a file, then read it in a deterministic step. (Or: don't use chaining at all — let the agents coordinate over the channel.)

Verification Gates

verification: { type: 'exit_code' }                        // preferred for code-editing steps
verification: { type: 'output_contains', value: 'DONE' }   // optional accelerator
verification: { type: 'file_exists', value: 'src/out.ts' } // deterministic file check

Only these four types are valid: exit_code, output_contains, file_exists, custom. Invalid types are silently ignored and fall through to process-exit auto-pass.

Verification token gotcha: If the token appears in the task text, the runner requires it twice in output (once from task echo, once from agent). Prefer exit_code for code-editing steps to avoid this.

DAG Dependencies

Steps with dependsOn wait for all listed steps. Steps with no dependencies start immediately. Steps sharing the same dependsOn run in parallel:

.step('fix-types',  { agent: 'worker', dependsOn: ['review'], ... })
.step('fix-tests',  { agent: 'worker', dependsOn: ['review'], ... })
.step('final',      { agent: 'lead',   dependsOn: ['fix-types', 'fix-tests'], ... })

Self-Termination

Do NOT add exit instructions to task strings. The runner handles this automatically.

Step Completion Model

Steps complete through a multi-signal pipeline (highest priority first):

Deterministic verification — exit_code, file_exists, output_contains pass → immediate completion
Owner decision — OWNER_DECISION: COMPLETE|INCOMPLETE_RETRY|INCOMPLETE_FAIL
Evidence-based — channel signals, file artifacts, clean exit code
Marker fast-path — STEP_COMPLETE:<step-name> (optional accelerator)
Process-exit fallback — agent exits 0 with no signals → completes after grace period

Key principle: No single signal is mandatory. Describe the deliverable, not what to print.

Dynamic Channel Management

Agents can dynamically subscribe, unsubscribe, mute, and unmute channels after spawn. This eliminates the need for client-side channel filtering and manual peer fanout.

SDK API

// Subscribe an agent to additional channels post-spawn
relay.subscribe({ agent: 'security-auditor', channels: ['review-pr-456'] });

// Unsubscribe — agent leaves the channel entirely
relay.unsubscribe({ agent: 'security-auditor', channels: ['general'] });

// Mute — agent stays subscribed (history access) but messages are NOT injected into PTY
relay.mute({ agent: 'security-auditor', channel: 'review-pr-123' });

// Unmute — resume PTY injection
relay.unmute({ agent: 'security-auditor', channel: 'review-pr-123' });

Agent-level methods are also available:

const agent = await relay.claude.spawn({ name: 'auditor', channels: ['ch-a'] });
await agent.subscribe(['ch-b']);       // now subscribed to ch-a and ch-b
await agent.mute('ch-a');              // ch-a messages silenced (still in history)
await agent.unmute('ch-a');            // ch-a messages resume
await agent.unsubscribe(['ch-b']);     // leaves ch-b
console.log(agent.channels);          // ['ch-a']
console.log(agent.mutedChannels);     // []

Semantics

Operation	Channel membership	PTY injection	History access
`subscribe`	Yes	Yes	Yes
`unsubscribe`	No	No	No (leaves)
`mute`	Yes (stays)	No (silenced)	Yes (can query)
`unmute`	Yes	Yes (resumes)	Yes

Events

relay.onChannelSubscribed = (agent, channels) => { /* ... */ };
relay.onChannelUnsubscribed = (agent, channels) => { /* ... */ };
relay.onChannelMuted = (agent, channel) => { /* ... */ };
relay.onChannelUnmuted = (agent, channel) => { /* ... */ };

When to Use in Workflows

Multi-PR chat sessions: Agents focused on one PR can mute other PR channels to reduce noise
Phase transitions: Subscribe agents to new channels as work progresses between phases
Team isolation: Workers mute the main coordination channel during focused work, unmute for review
Dynamic fanout: A lead subscribes workers to sub-channels at runtime based on task decomposition

What This Eliminates

With broker-managed subscriptions, you no longer need:

Client-side persona filtering (personaNames.has(from) checks)
Channel prefix regex for message routing
Manual peer fanout (iterating agents to forward messages)
Dedup caches for dual-path delivery

Agent Definition

.agent('name', {
  cli: 'claude' | 'codex' | 'gemini' | 'aider' | 'goose' | 'opencode' | 'droid',
  role?: string,
  preset?: 'lead' | 'worker' | 'reviewer' | 'analyst',
  retries?: number,
  model?: string,
  interactive?: boolean, // default: true
})

Model Constants

Always use model constants from @agent-relay/config instead of string literals. Each CLI has a typed constants object with its available models:

import { ClaudeModels, CodexModels, GeminiModels } from '@agent-relay/config';

.agent('planner', { cli: 'claude', model: ClaudeModels.OPUS })    // not 'opus'
.agent('worker',  { cli: 'claude', model: ClaudeModels.SONNET })  // not 'sonnet'
.agent('coder',   { cli: 'codex',  model: CodexModels.GPT_5_4 })  // not 'gpt-5.4'

Post-spawn channel operations (available on Agent instances and AgentRelay facade):

// Agent instance methods
agent.subscribe(channels: string[]): Promise<void>
agent.unsubscribe(channels: string[]): Promise<void>
agent.mute(channel: string): Promise<void>
agent.unmute(channel: string): Promise<void>
agent.channels: string[]          // current subscribed channels
agent.mutedChannels: string[]     // currently muted channels

// AgentRelay facade methods (by agent name)
relay.subscribe({ agent: string, channels: string[] }): Promise<void>
relay.unsubscribe({ agent: string, channels: string[] }): Promise<void>
relay.mute({ agent: string, channel: string }): Promise<void>
relay.unmute({ agent: string, channel: string }): Promise<void>

Preset	Interactive	Relay access	Use for
`lead`	yes (PTY)	yes	Coordination, monitoring channels
`worker`	no (subprocess)	no	Bounded tasks, structured stdout
`reviewer`	no (subprocess)	no	Reading artifacts, producing verdicts
`analyst`	no (subprocess)	no	Reading code/files, writing findings

Non-interactive presets run via one-shot mode (claude -p, codex exec). Output is clean and available via {{steps.X.output}}.

Critical rule: Pre-inject content into non-interactive agents. Don't ask them to read large files — pre-read in a deterministic step and inject via {{steps.X.output}}.

Step Definition

Agent Steps

.step('name', {
  agent: string,
  task: string,                   // supports {{var}} and {{steps.NAME.output}}
  dependsOn?: string[],
  verification?: VerificationCheck,
  retries?: number,
})

Deterministic Steps (Shell Commands)

.step('verify-files', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['implement'],
  captureOutput: true,
  failOnError: false,
})
.step('repair-files', {
  agent: 'worker',
  dependsOn: ['verify-files'],
  task: `If verify-files failed, create or fix the missing file and rerun the check.
Output:
{{steps.verify-files.output}}`,
  verification: { type: 'exit_code' },
})
.step('verify-files-final', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['repair-files'],
  captureOutput: true,
  failOnError: true,
})

Use for: file checks, reading files for injection, build/test gates, git operations. For anything an agent can fix, follow the deterministic step with a repair step and a final deterministic proof step.

Common Patterns

Interactive Team (lead + workers on shared channel)

When a task involves creating/modifying multiple files with review feedback, use interactive agents on a shared channel instead of non-interactive one-shot workers. The lead coordinates, reviews, and posts feedback; workers implement and iterate.

.agent('lead', {
  cli: 'claude',
  model: ClaudeModels.OPUS,
  role: 'Architect and reviewer — assigns work, reviews, posts feedback',
  retries: 1,
  // No preset — interactive by default
})

.agent('impl-new', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Creates new files. Listens on channel for assignments and feedback.',
  retries: 2,
  // No preset — interactive, receives channel messages
})

.agent('impl-modify', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Edits existing files. Listens on channel for assignments and feedback.',
  retries: 2,
})

// All three share the same dependsOn — they start concurrently (no deadlock)
.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `You are the lead on #channel. Workers: impl-new, impl-modify.
Post the plan. Assign files. Review their work. Post feedback if needed.
Workers iterate based on your feedback. Exit when all files are correct.`,
})
.step('impl-new-work', {
  agent: 'impl-new',
  dependsOn: ['context'],   // same dep as lead = parallel start
  task: `You are impl-new on #channel. Wait for the lead's plan.
Create files as assigned. Report completion. Fix issues from feedback.`,
})
.step('impl-modify-work', {
  agent: 'impl-modify',
  dependsOn: ['context'],   // same dep as lead = parallel start
  task: `You are impl-modify on #channel. Wait for the lead's plan.
Edit files as assigned. Report completion. Fix issues from feedback.`,
})
// Downstream gates on lead (lead exits when satisfied)
.step('verify', { type: 'deterministic', dependsOn: ['lead-coordinate'], ... })

Key behaviors observed in production:

Workers self-organize from channel context. Workers read each other's completion messages and start dependent work without waiting for the lead to relay. The shared channel gives them ambient awareness.
Lead-as-reviewer is more efficient than a separate reviewer agent. The lead reads actual files and runs typecheck between rounds — one agent doing coordination + review eliminates a step.
Codex interactive mode works well with PTY channel injection. Don't default to preset: 'worker' — interactive Codex agents receive and act on channel messages reliably.
Workers may outpace the lead. If the lead is reviewing while workers are fast, the lead's "proceed" message may arrive after the worker already started from channel context. This is harmless but worth knowing.
No feedback loop needed = fast path. If workers get it right first try, the interactive pattern completes just as fast as one-shot. The feedback loop is insurance, not overhead.

When to use interactive team vs one-shot DAG:

Scenario	Pattern
4+ files, likely needs iteration	Interactive team
Simple edits, well-specified	One-shot DAG with `preset: 'worker'`
Cross-agent review feedback loop	Interactive team
Independent tasks, no coordination	Fan-out with non-interactive workers
Anything where the answer to "could this be `cmd1 \| cmd2`?" is no	Interactive team

Chat-Native Coordination Recipes

Once you're in the Interactive Team shape, the channel is your coordination medium. These are recipes for using it well — they are prompt-authoring patterns, not new SDK surface. All of them assume interactive agents (no preset) sharing a .channel('wf-...').

1. Question / Answer (blocking ask)

When agent A needs information only agent B has, instruct A to post a direct question and wait for a reply rather than guessing or proceeding.

.step('integrate', {
  agent: 'integrator',
  dependsOn: ['context'],
  task: `You are the integrator on #wf-feature.
Before writing code, post a direct question to @schema-owner asking which
table owns the new field. Do NOT proceed until @schema-owner replies in
channel. If no reply arrives in 5 minutes, @-mention the lead.`,
})

Why it beats {{steps.X.output}}: the answer depends on something only an agent (or human) can decide at runtime; encoding it as a prior step's stdout is wrong.

2. Broadcast / Ack

When a lead needs N workers to confirm receipt before proceeding (e.g., to make sure the plan was actually read), require explicit acks.

.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `Post the plan to #wf-feature, then @impl-a @impl-b @impl-c.
Wait for each to reply with "ACK <agent-name>" before issuing assignments.
If any worker hasn't acked in 3 minutes, re-post and ping again.
Only after all three have acked, post per-worker assignments.`,
})

Why it matters: in the Codex history, the most common silent failure is a worker step that started but never read the channel. An ack gate makes "did you actually receive this?" deterministic without a separate verification step.

3. Peer Review Handoff

The substantive form of "review my work." Worker pings reviewer in-channel with a concrete artifact reference; reviewer reads the actual files (not the chat); reviewer replies with a verdict.

.step('impl-a-work', {
  agent: 'impl-a',
  dependsOn: ['context'],
  task: `Implement src/foo.ts per the lead's assignment.
When done, post to #wf-feature: "@reviewer ready: src/foo.ts" — include the
commit SHA. Then wait for @reviewer's verdict in channel.
- If "APPROVED", you're done.
- If "CHANGES_REQUESTED <notes>", apply the notes and re-post.
- If no verdict in 5 min, @-mention the lead.`,
})

Pattern note: the reviewer must read the files themselves — never let the worker paste the diff into chat. Channel messages are for coordination, not content. That's also what keeps you under output-token limits.

4. Standup / Status Probe

For long-running workflows, have the lead post periodic @-mention probes so silently-stuck workers surface fast.

.step('lead-coordinate', {
  agent: 'lead',
  task: `... coordinate the team ...

Every 10 minutes, post a status probe: "@impl-a @impl-b status?"
Each worker should reply with one of:
  - "RUNNING <step>" (still working)
  - "BLOCKED <reason>" (@-mention the lead with the blocker)
  - "DONE <artifact>" (ready for review)

If a worker is silent for two probes in a row, mark them stalled and
reassign their work to a peer.`,
})

5. Hand-Off with Context

When work flows from agent A to agent B during a workflow (not just between steps), have A post a structured handoff message so B doesn't re-derive context.

.step('impl-a-work', {
  agent: 'impl-a',
  task: `... finish your part ...

When done, post a handoff to #wf-feature targeting the next worker:
"@impl-b HANDOFF: src/foo.ts ready. Touched: <files>. Open question: <if any>.
Tests: <pass/fail summary>. Commit: <sha>."`,
})

Vs {{steps.X.output}}: an output-chain forces B to parse A's entire stdout. A handoff message is a curated summary A writes for B — much higher signal, no PTY/ANSI noise.

Picking a recipe

Need	Recipe
One agent needs an answer from another at runtime	Q/A
Lead needs to confirm workers received the plan	Broadcast/Ack
Agent-to-agent code review	Peer Review Handoff
Long-running team, want stalled-worker visibility	Standup/Probe
Sequential agent work that needs context curation	Hand-Off with Context

Authoring rule: if your workflow has interactive agents on a channel but their task strings don't instruct them to talk to each other, you're not using the chat primitive — you've just paid the overhead of starting it. Either add an explicit recipe above, or drop to preset: 'worker' and pipeline-shape.

Pipeline (sequential handoff)

.pattern('pipeline')
.step('analyze', { agent: 'analyst', task: '...' })
.step('implement', { agent: 'dev', task: '{{steps.analyze.output}}', dependsOn: ['analyze'] })
.step('test', { agent: 'tester', task: '{{steps.implement.output}}', dependsOn: ['implement'] })

Error Handling

.onError('fail-fast')   // stop on first failure (default)
.onError('continue')    // skip failed branches, continue others
.onError('retry', { maxRetries: 3, retryDelayMs: 5000 })

For agent-team workflows, prefer retry over fail-fast, and use .repairable() or .reliable() when the installed SDK supports it. AgentWorkforce/relay#827 made reliability repair-aware: retry-mode workflows with agents should run repair agents before retrying failed or malformed agent steps, not only deterministic checks. Keep explicit repair steps when the failing output needs to be handed to a specific domain owner.

Multi-File Edit Pattern

When a workflow needs to modify multiple existing files, use one agent step per file with a deterministic verify gate after each. Agents reliably edit 1-2 files per step but fail on 4+.

steps:
  - name: read-types
    type: deterministic
    command: cat src/types.ts
    captureOutput: true

  - name: edit-types
    agent: dev
    dependsOn: [read-types]
    task: |
      Edit src/types.ts. Current contents:
      {{steps.read-types.output}}
      Add 'pending' to the Status union type.
      Only edit this one file.
    verification:
      type: exit_code

  - name: verify-types
    type: deterministic
    dependsOn: [edit-types]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: false

  - name: fix-types-verification
    agent: dev
    dependsOn: [verify-types]
    task: |
      If verify-types failed, fix src/types.ts and rerun the verify command.
      Output:
      {{steps.verify-types.output}}
    verification:
      type: exit_code

  - name: verify-types-final
    type: deterministic
    dependsOn: [fix-types-verification]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: true

  - name: read-service
    type: deterministic
    dependsOn: [verify-types-final]
    command: cat src/service.ts
    captureOutput: true

  - name: edit-service
    agent: dev
    dependsOn: [read-service]
    task: |
      Edit src/service.ts. Current contents:
      {{steps.read-service.output}}
      Add a handlePending() method.
      Only edit this one file.
    verification:
      type: exit_code

  - name: verify-service
    type: deterministic
    dependsOn: [edit-service]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: false

  - name: fix-service-verification
    agent: dev
    dependsOn: [verify-service]
    task: |
      If verify-service failed, fix src/service.ts and rerun the verify command.
      Output:
      {{steps.verify-service.output}}
    verification:
      type: exit_code

  - name: verify-service-final
    type: deterministic
    dependsOn: [fix-service-verification]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: true

  # Deterministic commit — never rely on agents to commit
  - name: commit
    type: deterministic
    dependsOn: [verify-service-final]
    command: npm run typecheck && npm test && git add src/types.ts src/service.ts && git commit -m "feat: add pending status"
    captureOutput: true
    failOnError: false

  - name: repair-commit
    agent: dev
    dependsOn: [commit]
    task: |
      If commit failed, fix the blocker, rerun npm run typecheck && npm test, and create the commit.
      If commit passed, confirm the commit subject.
      Output:
      {{steps.commit.output}}
    verification:
      type: exit_code

  - name: verify-commit-created
    type: deterministic
    dependsOn: [repair-commit]
    command: 'git log -1 --pretty=%s | grep -q "^feat: add pending status$" && echo "COMMIT_OK" || (echo "COMMIT_MISSING"; exit 1)'
    captureOutput: true
    failOnError: true

Key rules:

Read the file in a deterministic step right before the edit (not all files upfront)
Tell the agent "Only edit this one file" to prevent it touching other files
Verify tracked-only edits with git diff --quiet, hand failures back to an agent to repair, then rerun the deterministic check as proof
If the edit may create new files/packages, verify with git status --short -- <paths> because git diff --quiet ignores untracked files
Always commit with a deterministic step, never an agent step; rerun acceptance checks in that step, let an agent repair commit blockers, and prove the commit exists

File Materialization: Verify Before Proceeding

After any step that creates files, add a deterministic file_exists check before proceeding. Non-interactive agents may exit 0 without writing anything (wrong cwd, stdout instead of disk).

- name: verify-files
  type: deterministic
  dependsOn: [impl-auth, impl-storage]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ $missing -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  captureOutput: true
  failOnError: false

- name: fix-missing-files
  agent: impl-auth
  dependsOn: [verify-files]
  task: |
    If verify-files found missing files, create/fix them and rerun the check.
    Output:
    {{steps.verify-files.output}}
  verification:
    type: exit_code

- name: verify-files-final
  type: deterministic
  dependsOn: [fix-missing-files]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ $missing -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  captureOutput: true
  failOnError: true

Rules for file-writing tasks:

Use full paths from project root — say src/auth/credentials.ts, not credentials.ts
Add IMPORTANT: Write the file to disk. Do NOT output to stdout.
Use file_exists verification for creation steps (not just exit_code)
Gate all downstream steps on the final deterministic proof step that follows the repair step

Edit Gates Must See Untracked Files

For gates that validate new files, generated artifacts, tests, or package directories, do not use only git diff --quiet -- <paths>. git diff ignores untracked files, so a valid new package can be misclassified as NO_CHANGES.

Use git status --short -- <paths> for materialization/edit gates, and keep the first gate repairable:

- name: provider-edit-gate-capture
  type: deterministic
  dependsOn: [implement-providers]
  command: |
    if [ -z "$(git status --short -- packages/new-provider .workflow-artifacts/my-flow)" ]; then
      echo "NO_PROVIDER_CHANGES"
      exit 1
    fi
    echo "PROVIDER_EDIT_GATE_OK"
  captureOutput: true
  failOnError: false

- name: repair-edit-gate
  agent: provider-worker
  dependsOn: [provider-edit-gate-capture]
  task: |
    If provider-edit-gate-capture reported NO_PROVIDER_CHANGES, inspect git
    status including untracked files and add the missing provider artifacts.
    If it already passed, do nothing.
  verification:
    type: exit_code

- name: provider-edit-gate-final
  type: deterministic
  dependsOn: [repair-edit-gate]
  command: |
    if [ -z "$(git status --short -- packages/new-provider .workflow-artifacts/my-flow)" ]; then
      echo "NO_PROVIDER_CHANGES"
      exit 1
    fi
    echo "PROVIDER_EDIT_GATE_FINAL_OK"
  captureOutput: true
  failOnError: false

- name: repair-provider-edit-gate-final
  agent: provider-worker
  dependsOn: [provider-edit-gate-final]
  task: |
    If provider-edit-gate-final is still red, repair the missing provider
    artifacts and rerun the check. If repair is impossible, write
    .workflow-artifacts/my-flow/BLOCKED_NO_COMMIT.md with exact evidence and
    do not commit.
    Output:
    {{steps.provider-edit-gate-final.output}}
  verification:
    type: exit_code

Both gates capture evidence and give an agent a chance to fix. A still-red final gate becomes a blocked/no-commit artifact, not a workflow crash.

Agent Transport Must Not Be The First Hard Gate

Interactive lead-and-worker teams are useful, but they are still process sessions. A long-running PTY can go idle, emit noisy terminal output, or fail to respawn with a transport error before the workflow reaches tests. If every downstream gate depends directly on that agent step, the workflow can fail without giving a repair owner command output to fix.

For long rollouts, keep the critical path evidence-based:

.step('runtime-implementation', {
  agent: 'impl-runtime',
  dependsOn: ['context'],
  task: 'Implement the runtime slice and write .workflow-artifacts/runtime.md',
})
.step('adapter-implementation', {
  agent: 'impl-adapters',
  dependsOn: ['context'],
  task: 'Implement adapter wiring and write .workflow-artifacts/adapters.md',
})
.step('implementation-reconcile', {
  type: 'deterministic',
  dependsOn: ['context'],
  command: `git status --short -- packages/core packages/*/src/writeback.ts scripts tests .workflow-artifacts
test -f scripts/verify-e2e.mjs || echo "MISSING_E2E"
test -f packages/core/src/runtime/router.ts || echo "MISSING_ROUTER"`,
  captureOutput: true,
  failOnError: false,
})
.step('repair-implementation-reconcile', {
  agent: 'qa',
  dependsOn: ['implementation-reconcile'],
  task: `Finish anything missing before gates run:\n{{steps.implementation-reconcile.output}}`,
  verification: { type: 'exit_code' },
})
.step('run-e2e', {
  type: 'deterministic',
  dependsOn: ['repair-implementation-reconcile'],
  command: 'npm run verify:e2e',
  captureOutput: true,
  failOnError: false,
})

Implementation agents may still run and coordinate on a channel, but tests depend on the reconcile/repair path. That makes transport failures advisory. If final deterministic evidence is still red after repair, write a blocked artifact and skip commit/PR creation rather than failing the workflow.

DAG Deadlock Anti-Pattern

# WRONG — deadlock: coordinate depends on context, work-a depends on coordinate
steps:
  - name: coordinate
    dependsOn: [context]    # lead waits for WORKER_DONE...
  - name: work-a
    dependsOn: [coordinate] # ...but work-a can't start until coordinate finishes

# RIGHT — workers and lead start in parallel
steps:
  - name: context
    type: deterministic
  - name: work-a
    dependsOn: [context]    # starts with lead
  - name: coordinate
    dependsOn: [context]    # starts with workers
  - name: merge
    dependsOn: [work-a, coordinate]

Rule: if a lead step's task mentions downstream step names alongside waiting keywords, that's a deadlock.

Step Sizing

One agent, one deliverable. A step's task prompt should be 10-20 lines max.

Split into a lead + workers team when:

The task requires a 50+ line prompt
The deliverable is multiple files that must be consistent
You need one agent to verify another's output

# Team pattern: lead + workers on a shared channel
steps:
  - name: track-lead-coord
    agent: track-lead
    dependsOn: [prior-step]
    task: |
      Lead the track on #my-track. Workers: track-worker-1, track-worker-2.
      Post assignments to the channel. Review worker output.

  - name: track-worker-1-impl
    agent: track-worker-1
    dependsOn: [prior-step]  # same dep as lead — starts concurrently
    task: |
      Join #my-track. track-lead will post your assignment.
      Implement the file as directed.
    verification:
      type: exit_code

  - name: next-step
    dependsOn: [track-lead-coord]  # downstream depends on lead, not workers

Supervisor Pattern

When you set .pattern('supervisor') (or hub-spoke, fan-out), the runner auto-assigns a supervisor agent as owner for worker steps. The supervisor monitors progress, nudges idle workers, and issues OWNER_DECISION.

Auto-hardening only activates for hub patterns — not pipeline or dag.

Use case	Pattern	Why
Sequential, no monitoring	`pipeline`	Simple, no overhead
Workers need oversight	`supervisor`	Auto-owner monitors
Local/small models	`supervisor`	Supervisor catches stuck workers
All non-interactive	`pipeline` or `dag`	No PTY = no supervision needed

Concurrency

Cap maxConcurrency at 4-6. Spawning 10+ agents simultaneously causes broker timeouts.

Parallel agents	`maxConcurrency`
2-4	4 (default safe)
5-10	5
10+	6-8 max

Common Mistakes

Mistake	Fix
Treating relay as transport, not as a coordination layer (every step is `preset: 'worker'`, every handoff is `{{steps.X.output}}`)	Default to Conversation shape for non-trivial work — interactive agents on a shared channel. Pipeline-shape is only correct when the work could be expressed as a `bash \| bash \| bash` pipe.
Interactive agents on a channel whose task strings don't tell them to talk to each other	Pick a Chat-Native Coordination Recipe (Q/A, Broadcast/Ack, Peer Review, Standup, Hand-Off) and bake it into the task prompt — otherwise you're paying for a chat substrate you're not using
All workflows run sequentially	Group independent workflows into parallel waves (4-7x speedup)
Every step depends on the previous one	Only add `dependsOn` when there's a real data dependency
Self-review step with no timeout	Set `timeout: 300_000` (5 min) — Codex hangs in non-interactive review
One giant workflow per feature	Split into smaller workflows that can run in parallel waves
Adding exit instructions to tasks	Runner handles self-termination automatically
Setting `timeoutMs` on agents/steps	Use global `.timeout()` only
Using `general` channel	Set `.channel('wf-name')` for isolation
`{{steps.X.output}}` without `dependsOn: ['X']`	Output won't be available yet
Requiring exact sentinel as only completion gate	Use `exit_code` or `file_exists` verification
Writing 100-line task prompts	Split into lead + workers on a channel
`maxConcurrency: 16` with many parallel steps	Cap at 5-6
Non-interactive agent reading large files via tools	Pre-read in deterministic step, inject via `{{steps.X.output}}`
Workers depending on lead step (deadlock)	Both depend on shared context step
Validation gates depending directly on long interactive implementation agents	Add a deterministic implementation-reconcile step and make gates depend on its repair step
`fan-out`/`hub-spoke` for simple parallel workers	Use `dag` instead
`pipeline` but expecting auto-supervisor	Only hub patterns auto-harden. Use `.pattern('supervisor')`
Workers without `preset: 'worker'` in one-shot DAG lead+worker flows	Add preset for clean stdout when chaining `{{steps.X.output}}` (not needed for interactive team patterns)
Using `_` in YAML numbers (`timeoutMs: 1_200_000`)	YAML doesn't support `_` separators
Workflow timeout under 30 min for complex workflows	Use `3600000` (1 hour) as default
Using `require()` in ESM projects	Check `package.json` for `"type": "module"` — use `import` if ESM
Wrapping in `async function main()` in ESM	ESM supports top-level `await` — no wrapper needed
Using `createWorkflowRenderer`	Does not exist. Use `.run({ cwd: process.cwd() })`
`export default workflow(...)...build()`	No `.build()`. Chain ends with `.run()` — the file must call `.run()`, not just export config
Relative import `'../workflows/builder.js'`	Use `import { workflow } from '@agent-relay/sdk/workflows'`
Hardcoded model strings (`model: 'opus'`)	Use constants: `import { ClaudeModels } from '@agent-relay/config'` → `model: ClaudeModels.OPUS`
Thinking `agent-relay run` inspects exports	It executes the file as a subprocess. Only `.run()` invocations trigger steps
`pattern('single')` on cloud runner	Not supported — use `dag`
`pattern('supervisor')` with one agent	Same agent is owner + specialist. Use `dag`
Invalid verification type (`type: 'deterministic'`)	Only `exit_code`, `output_contains`, `file_exists`, `custom` are valid
Chaining `{{steps.X.output}}` from interactive agents	PTY output is garbled. Use deterministic steps or `preset: 'worker'`
Single step editing 4+ files	Agents modify 1-2 then exit. Split to one file per step with verify gates
Relying on agents to `git commit`	Agents emit markers without running git. Use deterministic commit step
File-writing steps without `file_exists` verification	`exit_code` auto-passes even if no file written
Edit gate uses `git diff --quiet` for new files/packages	`git diff` ignores untracked files and can fail a valid implementation with `NO_CHANGES`
Hard-stop validation gates in product workflows	A red check stops the agent team at the exact moment it should fix the problem. Capture gate output with `failOnError: false`, add a repair agent step, rerun, and reserve hard failure for exhausted repair budget or external blockers
Final acceptance before repair	Broken work can stop or commit without giving the team a final chance to fix it. Run final acceptance, hand output to a repair owner, rerun, then commit/open PR only after green deterministic evidence
Treating optional notification credentials as fatal	Workflow progress gets blocked by a non-core side effect. Prefer primitive/runtime fallbacks such as the Slack primitive's `cloud-relay` or `noop` shape from AgentWorkforce/relay#823 when notification is not the product contract
Manual peer fanout in `handleChannelMessage()`	Use broker-managed channel subscriptions — broker fans out to all subscribers automatically
Client-side `personaNames.has(from)` filtering	Use `relay.subscribe()`/`relay.unsubscribe()` — only subscribed agents receive messages
Agents receiving noisy cross-channel messages during focused work	Use `relay.mute({ agent, channel })` to silence non-primary channels without leaving them
Hardcoding all channels at spawn time	Use `agent.subscribe()` / `agent.unsubscribe()` for dynamic channel membership post-spawn
Using `preset: 'worker'` for Codex in interactive team patterns when coordination is needed	Codex interactive mode works fine with PTY channel injection. Drop the preset for interactive team patterns (keep it for one-shot DAG workers where clean stdout matters)
Separate reviewer agent from lead in interactive team	Merge lead + reviewer into one interactive Claude agent — reviews between rounds, fewer agents
Not printing PR URL after `createGitHubStep({ action: 'createPR' })`	Capture `html_url` with `output: { mode: 'data', format: 'json', path: 'html_url' }` and echo or write it in a final deterministic step
Workflow ending without worktree + PR for cross-repo changes	Add `setup-worktree` at start and `push-and-pr` + `cleanup-worktree` at end

YAML Alternative

version: '1.0'
name: my-workflow
swarm:
  pattern: dag
  channel: wf-my-workflow
agents:
  - name: lead
    cli: claude
    role: Architect
  - name: worker
    cli: codex
    role: Implementer
workflows:
  - name: default
    steps:
      - name: plan
        agent: lead
        task: 'Produce a detailed implementation plan.'
      - name: implement
        agent: worker
        task: 'Implement: {{steps.plan.output}}'
        dependsOn: [plan]
        verification:
          type: exit_code

Run with: agent-relay run path/to/workflow.yaml

Available Swarm Patterns

dag (default), fan-out, pipeline, hub-spoke, consensus, mesh, handoff, cascade, debate, hierarchical, map-reduce, scatter-gather, supervisor, reflection, red-team, verifier, auction, escalation, saga, circuit-breaker, blackboard, swarm

See skill choosing-swarm-patterns for pattern selection guidance.

name

writing-agent-relay-workflows

description

Writing Agent Relay Workflows

Overview

Language preference: TypeScript > Python > YAML. Use TypeScript unless the project is Python-only or a simple config-driven workflow suits YAML.

When to Use

Building multi-agent workflows with step dependencies
Orchestrating different AI CLIs (claude, codex, gemini, aider, goose)
Creating DAG, pipeline, fan-out, or other swarm patterns
Needing verification gates, retries, or step output chaining
Designing product-contract workflows where failing checks should route to agents for repair instead of stopping the run
Dynamic channel management: agents joining/leaving/muting channels mid-workflow

Default Principle: Workflows Repair Before They Fail

Run deterministic checks as evidence-capturing gates with captureOutput: true.
Prefer failOnError: false for intermediate validation gates so the workflow can pass the output to a repair agent.
Add a repair step immediately after each red-prone gate. The repair agent reads {{steps.<gate>.output}}, fixes source/tests/config, reruns the same command locally, and exits only after the gate is green or the blocker is external.
Keep final acceptance deterministic, but still put an agent repair step before commit/PR creation. If the repair budget is exhausted or a true external blocker remains, write a blocked artifact and skip commit/PR creation; do not let the workflow end as FAILED.
Use .reliable() or .repairable() on SDK versions that support it, especially for product-contract workflows. As of AgentWorkforce/relay#827, retry-mode workflows with agents are repair-aware by default, repair agents run before retrying malformed/failed agent steps, and the SDK covers DAG, pipeline, fan-out, worktree-backed, deterministic-only, and agent-plus-gate shapes.

Choose Your Coordination Style — Conversation vs Pipeline

Shape	What it is	Use when
Conversation (chat-native)	Interactive agents share a channel; messages, `@-mentions`, and ambient awareness drive coordination. Lead and workers spawn in parallel and self-organize. The relay is the coordination layer, not just transport.	Multi-file work, peer review loops, cross-agent feedback, dynamic re-planning, multi-PR coordination, anything with a human-in-the-loop escape, swarms where workers pick up each other's output.
Pipeline (one-shot DAG)	Each step runs as a one-shot subprocess (`claude -p`, `codex exec`); steps hand off via `{{steps.X.output}}` text injection. No agents are alive at the same time; no chat happens.	Linear, well-specified transformations; deterministic data passing; no review loop expected; the work could be expressed as a `bash \| bash \| bash` pipe.

A blunt rule of thumb: if your workflow only uses agent steps with preset: 'worker' chained by {{steps.X.output}}, you are not using the relay — you are using claude -p | codex exec. That may still be the right answer; just make it a deliberate choice.

Quick Reference (Pipeline shape)

Use this when steps are linear, well-specified, and need no agent-to-agent feedback. For anything with iteration, review, or coordination, jump to Quick Reference (Conversation shape) below.

Note: this Quick Reference assumes an ESM workflow file (the host package.json has "type": "module"). For CJS repos, see rule #1 in Critical TypeScript rules below — convert import { workflow } from '@agent-relay/sdk/workflows' to const { workflow } = require('@agent-relay/sdk/workflows') and wrap the workflow in async function main() { ... } main().catch(console.error) since CJS does not support top-level await. Always check package.json before copy-pasting the snippet.

import { workflow } from '@agent-relay/sdk/workflows';

const result = await workflow('my-workflow')
  .description('What this workflow does')
  .pattern('dag') // or 'pipeline', 'fan-out', etc.
  .channel('wf-my-workflow') // dedicated channel (auto-generated if omitted)
  .maxConcurrency(3)
  .timeout(3_600_000) // global timeout (ms)

  .agent('lead', { cli: 'claude', role: 'Architect', retries: 2 })
  .agent('worker', { cli: 'codex', role: 'Implementer', retries: 2 })

  .step('plan', {
    agent: 'lead',
    task: `Analyze the codebase and produce a plan.`,
    retries: 2,
    verification: { type: 'output_contains', value: 'PLAN_COMPLETE' },
  })
  .step('implement', {
    agent: 'worker',
    task: `Implement based on this plan:\n{{steps.plan.output}}`,
    dependsOn: ['plan'],
    verification: { type: 'exit_code' },
  })

  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

  console.log('Result:', result.status);

Quick Reference (Conversation shape)

Use this for any non-trivial work — peer review, multi-file edits, cross-agent feedback, dynamic re-planning. Lead and workers spawn in parallel on a shared channel and self-organize via messages. The relay primitive does the coordinating; verification gates downstream of the lead close the workflow.

import { workflow } from '@agent-relay/sdk/workflows';
import { ClaudeModels, CodexModels } from '@agent-relay/config';

const result = await workflow('my-workflow')
  .description('Multi-file change with peer review')
  .pattern('dag')
  .channel('wf-my-feature')          // dedicated channel — agents share it
  .maxConcurrency(4)
  .timeout(3_600_000)
  .repairable()

  // Interactive agents — no preset, they live on the channel
  .agent('lead', {
    cli: 'claude',
    model: ClaudeModels.OPUS,
    role: 'Architect + reviewer. Plans, assigns, reviews, posts feedback.',
    retries: 1,
  })
  .agent('impl-a', {
    cli: 'codex',
    model: CodexModels.GPT_5_4,
    role: 'Implementer. Listens on channel for assignments and feedback.',
    retries: 2,
  })
  .agent('impl-b', {
    cli: 'codex',
    model: CodexModels.GPT_5_4,
    role: 'Implementer. Listens on channel for assignments and feedback.',
    retries: 2,
  })

  // Deterministic context — pre-reads files once, posts to the channel for everyone
  .step('context', {
    type: 'deterministic',
    command: 'git ls-files src/',
    captureOutput: true,
  })

  // Lead and workers all depend on `context` — they start CONCURRENTLY.
  // They coordinate over #wf-my-feature, not via {{steps.X.output}}.
  .step('lead-coordinate', {
    agent: 'lead',
    dependsOn: ['context'],
    task: `You are the lead on #wf-my-feature. Workers: impl-a, impl-b.
Post the plan. Assign files. Review their PRs/diffs. Post feedback in-channel.
Workers iterate based on your feedback. Exit when both files pass review.`,
  })
  .step('impl-a-work', {
    agent: 'impl-a',
    dependsOn: ['context'],   // SAME dep as lead → starts in parallel, no deadlock
    task: `You are impl-a on #wf-my-feature. Wait for the lead's plan.
Implement your assigned file. Post a completion message. Address feedback.`,
  })
  .step('impl-b-work', {
    agent: 'impl-b',
    dependsOn: ['context'],   // SAME dep as lead
    task: `You are impl-b on #wf-my-feature. Wait for the lead's plan.
Implement your assigned file. Post a completion message. Address feedback.`,
  })

  // Downstream gates on the lead — lead exits when satisfied.
  // Capture failures, then hand them to an agent for repair.
  .step('verify', {
    type: 'deterministic',
    dependsOn: ['lead-coordinate'],
    command: 'npm run typecheck && npm test 2>&1',
    captureOutput: true,
    failOnError: false,
  })
  .step('repair-verify', {
    agent: 'lead',
    dependsOn: ['verify'],
    task: `If verification passed, summarize evidence.
If it failed, use this output to assign and fix issues, then rerun the command until green:
{{steps.verify.output}}`,
    verification: { type: 'exit_code' },
  })
  .step('verify-final', {
    type: 'deterministic',
    dependsOn: ['repair-verify'],
    command: 'npm run typecheck && npm test 2>&1',
    captureOutput: true,
    failOnError: true,
  })

  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

What this exercises that pipeline-shape does not:

Ambient awareness — workers see each other's completion messages and start dependent work without the lead relaying.
Lead-as-reviewer — the lead reads actual files between rounds and posts diff-aware feedback in chat. One agent does coordination + review; no separate reviewer step.
Iterative correction — when the lead pings "impl-a, the type on line 42 is wrong", impl-a fixes and re-posts. No new step, no re-spawn, no {{output}} chaining.

Critical workflow rules for this shape:

Lead and workers MUST share the same dependsOn (e.g., both depend on context). If a worker depends on the lead, you have a deadlock — the lead is waiting for worker output, the worker is waiting for the lead step to "complete."
Drop preset: 'worker' on the implementer agents — interactive mode is what lets them receive channel messages via PTY injection.
Downstream gates depend on the lead step, not the workers. The lead exits when it's satisfied; that's the workflow's signal that implementation is ready for repairable deterministic checks.
Use a dedicated .channel('wf-...') so the team is isolated from other workflows and the global general channel.

See Common Patterns → Interactive Team for production notes from real runs and decision criteria for picking this shape over one-shot DAG.

Critical TypeScript rules:

Check the project's package.json for "type": "module" — if ESM, use import and top-level await. If CJS, use require() and wrap in async function main().
agent-relay run <file.ts> executes the file as a standalone subprocess — it does NOT inspect exports. The file MUST call .run().
Use .run({ cwd: process.cwd() }) — createWorkflowRenderer does not exist
Validate with --dry-run before running: agent-relay run --dry-run workflow.ts

⚡ Parallelism — Design for Speed

This is the most important design consideration. Sequential workflows waste hours. Always design for maximum parallelism.

Cross-Workflow Parallelism: Wave Planning

When a project has multiple workflows, group independent ones into parallel waves:

# BAD — sequential (14 hours for 27 workflows at ~30 min each)
agent-relay run workflows/34-sst-wiring.ts
agent-relay run workflows/35-env-config.ts
agent-relay run workflows/36-loading-states.ts
# ... one at a time

# GOOD — parallel waves (3-4 hours for 27 workflows)
# Wave 1: independent infra (parallel)
agent-relay run workflows/34-sst-wiring.ts &
agent-relay run workflows/35-env-config.ts &
agent-relay run workflows/36-loading-states.ts &
agent-relay run workflows/37-responsive.ts &
wait
git add -A && git commit -m "Wave 1"

# Wave 2: testing (parallel — independent test suites)
agent-relay run workflows/40-unit-tests.ts &
agent-relay run workflows/41-integration-tests.ts &
agent-relay run workflows/42-e2e-tests.ts &
wait
git add -A && git commit -m "Wave 2"

Wave Planning Heuristics

Two workflows can run in parallel if they don't have write-write or write-read file conflicts:

Touch Zone	Can Parallelize?
Different `packages/*/src/` dirs	✅ Yes
Different `app/` routes	✅ Yes
Same package, different subdirs	⚠️ Usually yes
Same files (shared config, root package.json)	❌ No — sequential or same wave with merge
Explicit dependency	❌ No — ordered waves

Declare File Scope for Planning

Help wave planners (human or automated) understand what each workflow touches:

workflow('48-comparison-mode')
  .packages(['web', 'core'])                // monorepo packages touched
  .isolatedFrom(['49-feedback-system'])      // explicitly safe to parallelize
  .requiresBefore(['46-admin-dashboard'])    // explicit ordering constraint

Within-Workflow Parallelism

Use shared dependsOn to fan out independent sub-tasks:

// BAD — unnecessary sequential chain
.step('fix-component-a', { agent: 'worker', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'worker', dependsOn: ['fix-component-a'] })  // why wait?

// GOOD — parallel fan-out, merge at the end
.step('fix-component-a', { agent: 'impl-1', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'impl-2', dependsOn: ['review'] })  // same dep = parallel
.step('verify-all', { agent: 'reviewer', dependsOn: ['fix-component-a', 'fix-component-b'] })

Impact

Real-world example (Relayed — 60 workflows):

Sequential: ~30 min × 60 = 30 hours
Parallel waves (4-6 per wave): ~12 waves × 35 min = ~7 hours (4x faster)
Aggressive parallelism (8-way): ~4 hours (7.5x faster)

Failure Prevention

These workflow files are easy to break in ways that only appear mid-run. Follow these rules when authoring or editing workflow .ts files.

1. Do not use raw top-level `await`

Executor-driven workflow files may be run through a tsx/esbuild path that behaves like CJS. Raw top-level await can fail with:

Top-level await is currently not supported with the "cjs" output format

Always wrap execution like this:

async function runWorkflow() {
  const result = await workflow('my-workflow')
    // ...
    .run({ cwd: process.cwd() });

  console.log('Workflow status:', result.status);
}

runWorkflow().catch((error) => {
  console.error(error);
  process.exit(1);
});

Do not end workflow files with bare top-level await workflow(...).run(...).

1b. Make commit and PR boundaries explicit

Use this pattern only when the workflow is supposed to own repository delivery:

Preflight the git state and fail on unexpected staged changes.
Create or verify the intended branch.
Run implementation, review, repairable validation, fix, and final acceptance gates.
Stage only the declared target files and review/signoff artifacts.
Commit with a deterministic message.
Push the branch.
Use createGitHubStep({ action: 'createPR', ... }) from @agent-relay/sdk to open the PR.
Verify the PR URL/state deterministically and write it into the final signoff artifact.

If commit or PR creation is intentionally outside the workflow, say that directly in the workflow description and signoff so the operator knows to do it after completion.

2. Avoid raw fenced code blocks inside workflow task template literals

Raw triple-backtick code fences inside large inline task: \...`template strings are fragile and can break outer TypeScript parsing, especially when they contain language tags likeswiftordiff`.

Preferred options, in order:

Avoid inline fenced examples entirely
Move larger examples to referenced files
Use plain indented examples instead of fenced blocks
If fenced blocks must exist inside generated inner code, escape them consistently and syntax-check the outer workflow file afterward

2b. Standard preflight template for resumable workflows

The battle-tested template:

.step('preflight', {
  type: 'deterministic',
  command: [
    'set -e',
    'BRANCH=$(git rev-parse --abbrev-ref HEAD)',
    'echo "branch: $BRANCH"',
    'if [ "$BRANCH" != "fix/your-branch-name" ]; then echo "ERROR: wrong branch"; exit 1; fi',
    // Files the workflow is allowed to find dirty on entry:
    //   - package-lock.json: npm install is idempotent and often touches it
    //   - every file the workflow's edit steps will rewrite: a prior partial
    //     run may have left them dirty, and the edit step will rewrite
    //     them cleanly before commit
    // Everything else is unexpected drift and must fail preflight.
    'ALLOWED_DIRTY="package-lock.json|path/to/file1\\\\.ts|path/to/file2\\\\.ts"',
    'DIRTY=$(git diff --name-only | grep -vE "^(${ALLOWED_DIRTY})$" || true)',
    'if [ -n "$DIRTY" ]; then echo "ERROR: unexpected tracked drift:"; echo "$DIRTY"; exit 1; fi',
    'if ! git diff --cached --quiet; then echo "ERROR: staging area is dirty"; git diff --cached --stat; exit 1; fi',
    'gh auth status >/dev/null 2>&1 || (echo "ERROR: gh CLI not authenticated"; exit 1)',
    'echo PREFLIGHT_OK',
  ].join(' && '),
  captureOutput: true,
  failOnError: true,
}),

Rules baked into this template:

Always include package-lock.json in ALLOWED_DIRTY. Both npm install and npm ci can touch it idempotently.
Include every file the workflow's edit steps will rewrite. The commit step uses explicit git add <path> (never git add -A), so allowing these files to be dirty on entry is safe — unrelated drift in other files still fails preflight.
Escape dots in regex paths: setup\.ts not setup.ts. In a JS template literal this means four backslashes: "setup\\\\.ts".
Use grep -vE "^(...)$" for full-line match. Substring matches bleed across unrelated files (e.g., setup.ts would also match packages/core/src/bootstrap/setup.ts).
Append || true to the grep. Without it, an empty result triggers set -e and the whole preflight fails before the if can even run.
Check the staging area separately. A dirty index is different from a dirty working tree and both must be clean (modulo allow-list).
Check gh auth status early if downstream GitHub operations will use the local transport. Failing on auth at the end of a long DAG is painful.

2c. Picking the right `.join()` for multi-line shell commands

When a command: field is a JS array that gets joined into a shell command string, the join delimiter determines what kinds of content the array can contain.

command: [
  'set -e',
  'HITS=$(grep -c diag src/cli/commands/setup.ts || true)',
  'if [ "$HITS" -lt 6 ]; then echo "FAIL"; exit 1; fi',
  'echo OK',
].join(' && '),

.join('\n') — use when array elements must be part of a larger compound statement that spans multiple physical lines:

heredocs (cat <<EOF ... EOF)
multi-line if / while / for bodies
shell functions defined inline

Never mix heredocs with && joining. The most common failure mode:

// ❌ BROKEN — heredoc body gets && inserted between each line
command: [
  'set -e',
  'cat > /tmp/f <<EOF',
  'line 1',
  'line 2',
  'EOF',
  'next-command',
].join(' && '),

command: [
  'set -e',
  'BODY=$(mktemp)',
  // Each line of the file is a separate printf argument. No heredoc,
  // no shell metacharacter hazards, no command-substitution nesting.
  'printf "%s\\n" "## Summary" "" "body line 1" "body line 2" > "$BODY"',
  'gh pr create --title "..." --body-file "$BODY"',
  'rm -f "$BODY"',
].join(' && '),

2d. Template-literal escape sequences are processed once before the string is rendered

Specifically:

\s is not a recognized string escape → the backslash is stripped → \s renders as a literal s
\b is a recognized string escape (backspace, U+0008) → \b renders as a backspace character in the output
\n, \t, \r, \\, \0, \uXXXX, \xXX all get resolved at template time

Guidelines:

If you want a regex pattern that survives the template-literal pass unchanged, double every backslash in the source: \\s, \\b, \\n (the \\ renders to \ in the output, producing a correct regex at runtime).
If you want to write a long string-literal newline into the output, '\\n' in the template renders to '\n' in the output, which the runtime JS interprets as a newline. Using a literal '\n' would render an actual newline into the JS source — visually messy and sometimes surprising.
If you add anything non-trivial to a generator file that returns a big template literal, add a unit test that calls the generator with canonical inputs and asserts something about the rendered output — either exact string matches or, for regexes, eval/construct the regex and test it against known samples. See tests/orchestrator/script-generator.test.ts in cloud for prior art.

3. Keep final verification boring and deterministic

Final verification should validate real outputs with simple, portable shell commands. If checking for multiple symbols, use extended regex explicitly:

grep -Eq "foo|bar|baz" file.ts

Do not rely on basic grep alternation like:

grep -c "foo\|bar\|baz" file.ts

That can silently misbehave and create fake failures even when the generated code is correct.

4. Separate durable outputs from execution exhaust

Commit:

generated product code
migrations
tests
docs
workflow-definition fixes

Do not commit by default:

.logs/
transient executor output
retry artifacts
temporary step-output files

5. Prefer Codex for implementation-heavy roles and Claude for review

Default team split for workflow-authored agent roles:

lead / implementer / writer / fixer → codex
reviewer → claude

Use Claude as the primary implementer only when there is a specific reason.

6. Be explicit about shell requirements

If executor scripts use Bash-only features such as associative arrays, require modern Bash explicitly. On macOS, prefer a known-good Bash path when needed, for example:

/opt/homebrew/bin/bash workflows/your-workflow/execute.sh --wave 2

7. Make resume semantics explicit

Document clearly whether the executor supports:

full-run continuation
--wave
--workflow
--resume

Do not assume users will infer the behavior. In particular, --wave N should be understood as "run only this wave" unless the executor explicitly chains onward.

7a. `--resume` vs `--start-from` when fixing a buggy step

When a workflow fails at step X and you want to re-run it after editing the workflow file, the flag choice matters:

Flag	Reads workflow file fresh?	Uses cached step outputs?
`--resume <id>`	❌ replays stored config from DB	✅ from same run id
`--start-from <step> --previous-run-id <id>`	✅ reads fresh file	✅ from previous run id's cached outputs

When to use each:

Transient failure (network hiccup, rate limit, flaky agent), no code edits: --resume <id> is fine, fast, and correct.
You edited the workflow file (any step definition, any prompt, any verify gate): always --start-from <failing-step> --previous-run-id <id>. Everything upstream of the failing step loads from cache, the fresh file supplies the fixed definition, and downstream steps run as normal.

8. Syntax-check workflow files after editing

After editing workflow .ts files, run a lightweight syntax check before launching a large batch run. This is especially important if the workflow contains:

large inline task template literals
embedded code examples
escaped backticks
wrapper changes around workflow execution

9. Factor repo-specific setup into a shared helper

Pattern:

// workflows/lib/cloud-repo-setup.ts
export interface CloudRepoSetupOptions {
  branch: string;
  committerName?: string;
  extraSetupCommands?: string[];
  skipWorkspaceBuild?: boolean;
}

export function applyCloudRepoSetup<T>(wf: T, opts: CloudRepoSetupOptions): T {
  // adds two steps: setup-branch, install-deps
  // install-deps runs: npm install + workspace prebuilds (build:platform, build:core, etc.)
  // ...
}

Consumer workflows break the builder chain once and call through:

const baseWf = workflow(NAME)
  .description(...)
  .pattern('dag')
  .agent(...)
  .agent(...);

const wf = applyCloudRepoSetup(baseWf, {
  branch: BRANCH,
  committerName: 'My Workflow Bot',
});

await wf
  .step('read-spec', { dependsOn: ['install-deps'], ... })
  ...
  .run(...);

Rules:

The helper lives in the consumer repo, not in the SDK. Different customer repos have different languages, package managers, and build graphs — @agent-relay/sdk should stay agnostic.
Pre-build any workspace package whose package.json main/types point at a generated dist/. Fresh sandboxes don't have that dist/ yet, and agents will invent workarounds rather than run the build. See the @cloud/platform case above.
Every install step includes --legacy-peer-deps --no-audit --no-fund 2>&1 | tail -10 (or equivalent noise-trimming) because full install output blows past captureOutput size limits.
Document the helper in the repo's CLAUDE.md / AGENTS.md so new workflow authors (and agents writing workflows) discover it.

End-to-End Bug Fix Workflows

For bug-fix or reliability workflows, do not stop at unit or integration tests. The workflow should explicitly prove that the original user-visible problem is fixed.

Required phases for fix workflows

Capture the original failure
- Reproduce the bug first in a deterministic or evidence-capturing step
- Save exact commands, logs, status codes, or screenshots/artifacts
State the acceptance contract
- Define the exact end-to-end success criteria before implementation
- Include the real entrypoint a user would run
Implement the fix
Rebuild / reinstall from scratch
- Do not trust dirty local state
- Prefer a clean environment when install/bootstrap behavior is involved
Run targeted regression checks
- Unit/integration tests are helpful but not sufficient by themselves
Run a full end-to-end validation
- Use the real CLI / API / install path
- Prefer a clean environment (Docker, sandbox, cloud workspace, Daytona, etc.) for install/runtime issues
Compare before vs after evidence
- Show that the original failure no longer occurs
Record residual risks
- Call out what was not covered
Ship the result as a PR
- Open the pull request from the workflow itself with createGitHubStep
- See Shipping the Result — Open a PR via createGitHubStep below
- A workflow that fixes a bug and stops short of the PR has only done half the loop

Clean-environment validation guidance

Preferred order of proving environments:

disposable sandbox / cloud workspace
Docker / containerized environment
fresh local shell with isolated paths

Meta-workflow guidance

If the right proving environment is unclear, first write a meta-workflow that:

compares candidate validation environments
defines the acceptance contract
chooses the best swarm pattern
then authors the final fix/validation workflow

This is often better than jumping straight to implementation.

Shipping the Result — Open a PR via `createGitHubStep`

Why `createGitHubStep` (and not raw `gh` / `octokit`)

The primitive picks the right transport at runtime:

Where the workflow runs	Transport `createGitHubStep` uses	What you provide
Local (`agent-relay run`)	`gh` CLI	`gh auth status` works
Cloud (`agent-relay cloud run`) — tenant-scoped	Nango → workspace's GitHub App installation	Nothing — cloud injects credentials
Cloud — fallback	Relay-cloud GitHub proxy	Nothing — cloud injects credentials

Phase C interaction (cloud only): agent-relay cloud run already auto-pushes per-paths[] diffs as separate PRs after the workflow callback when the repos are allowlisted (see pushedTo in the run record). Phase C is the catch-all — if your workflow does nothing else, you still get one PR per declared path. Use createGitHubStep on top of that when you need PRs the catch-all can't produce: cross-cutting issues, follow-up tracking issues, opening one PR that spans multiple paths, draft PRs you want labeled/assigned in specific ways, or PRs against a repo you didn't paths[] in.

The minimal "open a PR" recipe

import { workflow } from '@agent-relay/sdk/workflows';
import { createGitHubStep } from '@agent-relay/sdk';

const REPO = 'AgentWorkforce/cloud';
const BRANCH = `agent-relay/run-${Date.now()}`;

await workflow('feature-x')
  // ... your real steps that produce code changes ...
  .step('write-marker', {
    type: 'deterministic',
    command: `echo "fix landed at $(date -u)" >> CHANGELOG.md`,
  })

  // Branch off main on the remote.
  .step('create-branch', createGitHubStep({
    dependsOn: ['write-marker'],
    action: 'createBranch',
    repo: REPO,
    params: { branch: BRANCH, source: 'main' },
  }))

  // Commit the change to the branch via Contents API.
  .step('commit-change', createGitHubStep({
    dependsOn: ['create-branch'],
    action: 'createFile',
    repo: REPO,
    params: {
      path: 'CHANGELOG.md',
      branch: BRANCH,
      content: '<file body here>',
      message: 'chore: changelog entry',
    },
  }))

  // Open the PR. This is the load-bearing step.
  .step('open-pr', createGitHubStep({
    dependsOn: ['commit-change'],
    action: 'createPR',
    repo: REPO,
    params: {
      title: 'feat: ship feature X',
      head: BRANCH,
      base: 'main',
      body: '## Summary\n\n- ...\n\n## Test plan\n\n- [x] ...',
      draft: false,
    },
    output: { mode: 'data', format: 'json', path: 'html_url' },
  }))

  .run({ cwd: process.cwd() });

Authoring rules for PR-shipping workflows

Open the PR from the workflow, not from the operator's shell. "Tell the user to run gh pr create" is a regression to a manual step the workflow could have done. The whole point of running this in cloud is that there is no operator's shell.
One PR per workflow, by default. A workflow that opens five PRs from one run is almost always wrong — humans review one PR at a time. If you genuinely need multiple, prefer a tracking issue + linked PRs, or split into separate workflows.
Branch name encodes the run. agent-relay/run-${runId} or agent-relay/${workflow-name}-${timestamp} so reviewers can tell the PR apart from other automation, and so reruns don't clash.
draft: true while iterating. Once the workflow is stable end-to-end, flip to draft: false.
Body is a real PR description. Summary + Test plan, generated from the workflow's own evidence (verification step output, diff stats, test run output). If you find yourself writing a placeholder body, the workflow isn't done — capture the real evidence in an earlier step and template it in.
Don't use createGitHubStep to substitute for paths[] push-back in cloud. If the diff lives in a tarballed paths[] mount, let cloud's Phase C push-back open that PR (it handles the patch generation, branch lifecycle, and per-repo allowlist). Use createGitHubStep when you need a PR against a repo or branch outside the paths[] set, or when you want to add an extra PR (e.g. a tracking issue, a follow-up against a sibling repo, a docs-only PR).
PR creation failures route to repair. If createPR errors (auth, permissions, branch conflict), capture the output and give a repair owner a chance to fix auth, branch state, labels, or body generation before stopping. A "successful" workflow that silently failed to open the PR is the worst-case outcome — the human thinks the work shipped.

Where this fits in the bug-fix phases

Key Concepts

Step Output Chaining

Use {{steps.STEP_NAME.output}} in a downstream step's task to inject the prior step's terminal output.

Mental model: this is a Unix pipe, not agent communication. {{steps.A.output}} flowing into step B is A | B — A is dead by the time B reads its stdout. There is no chat, no feedback, no addressing. If your workflow's coordination story is only output chaining, you're using the relay as transport, not as a coordination layer. See Choose Your Coordination Style before defaulting to this.

Only chain output from clean sources:

Deterministic steps (shell commands — always clean)
Non-interactive agents (preset: 'worker' — clean stdout)

Verification Gates

verification: { type: 'exit_code' }                        // preferred for code-editing steps
verification: { type: 'output_contains', value: 'DONE' }   // optional accelerator
verification: { type: 'file_exists', value: 'src/out.ts' } // deterministic file check

Only these four types are valid: exit_code, output_contains, file_exists, custom. Invalid types are silently ignored and fall through to process-exit auto-pass.

DAG Dependencies

Steps with dependsOn wait for all listed steps. Steps with no dependencies start immediately. Steps sharing the same dependsOn run in parallel:

.step('fix-types',  { agent: 'worker', dependsOn: ['review'], ... })
.step('fix-tests',  { agent: 'worker', dependsOn: ['review'], ... })
.step('final',      { agent: 'lead',   dependsOn: ['fix-types', 'fix-tests'], ... })

Self-Termination

Do NOT add exit instructions to task strings. The runner handles this automatically.

Step Completion Model

Steps complete through a multi-signal pipeline (highest priority first):

Deterministic verification — exit_code, file_exists, output_contains pass → immediate completion
Owner decision — OWNER_DECISION: COMPLETE|INCOMPLETE_RETRY|INCOMPLETE_FAIL
Evidence-based — channel signals, file artifacts, clean exit code
Marker fast-path — STEP_COMPLETE:<step-name> (optional accelerator)
Process-exit fallback — agent exits 0 with no signals → completes after grace period

Key principle: No single signal is mandatory. Describe the deliverable, not what to print.

Dynamic Channel Management

Agents can dynamically subscribe, unsubscribe, mute, and unmute channels after spawn. This eliminates the need for client-side channel filtering and manual peer fanout.

SDK API

// Subscribe an agent to additional channels post-spawn
relay.subscribe({ agent: 'security-auditor', channels: ['review-pr-456'] });

// Unsubscribe — agent leaves the channel entirely
relay.unsubscribe({ agent: 'security-auditor', channels: ['general'] });

// Mute — agent stays subscribed (history access) but messages are NOT injected into PTY
relay.mute({ agent: 'security-auditor', channel: 'review-pr-123' });

// Unmute — resume PTY injection
relay.unmute({ agent: 'security-auditor', channel: 'review-pr-123' });

Agent-level methods are also available:

const agent = await relay.claude.spawn({ name: 'auditor', channels: ['ch-a'] });
await agent.subscribe(['ch-b']);       // now subscribed to ch-a and ch-b
await agent.mute('ch-a');              // ch-a messages silenced (still in history)
await agent.unmute('ch-a');            // ch-a messages resume
await agent.unsubscribe(['ch-b']);     // leaves ch-b
console.log(agent.channels);          // ['ch-a']
console.log(agent.mutedChannels);     // []

Semantics

Operation	Channel membership	PTY injection	History access
`subscribe`	Yes	Yes	Yes
`unsubscribe`	No	No	No (leaves)
`mute`	Yes (stays)	No (silenced)	Yes (can query)
`unmute`	Yes	Yes (resumes)	Yes

Events

relay.onChannelSubscribed = (agent, channels) => { /* ... */ };
relay.onChannelUnsubscribed = (agent, channels) => { /* ... */ };
relay.onChannelMuted = (agent, channel) => { /* ... */ };
relay.onChannelUnmuted = (agent, channel) => { /* ... */ };

When to Use in Workflows

Multi-PR chat sessions: Agents focused on one PR can mute other PR channels to reduce noise
Phase transitions: Subscribe agents to new channels as work progresses between phases
Team isolation: Workers mute the main coordination channel during focused work, unmute for review
Dynamic fanout: A lead subscribes workers to sub-channels at runtime based on task decomposition

What This Eliminates

With broker-managed subscriptions, you no longer need:

Client-side persona filtering (personaNames.has(from) checks)
Channel prefix regex for message routing
Manual peer fanout (iterating agents to forward messages)
Dedup caches for dual-path delivery

Agent Definition

.agent('name', {
  cli: 'claude' | 'codex' | 'gemini' | 'aider' | 'goose' | 'opencode' | 'droid',
  role?: string,
  preset?: 'lead' | 'worker' | 'reviewer' | 'analyst',
  retries?: number,
  model?: string,
  interactive?: boolean, // default: true
})

Model Constants

Always use model constants from @agent-relay/config instead of string literals. Each CLI has a typed constants object with its available models:

import { ClaudeModels, CodexModels, GeminiModels } from '@agent-relay/config';

.agent('planner', { cli: 'claude', model: ClaudeModels.OPUS })    // not 'opus'
.agent('worker',  { cli: 'claude', model: ClaudeModels.SONNET })  // not 'sonnet'
.agent('coder',   { cli: 'codex',  model: CodexModels.GPT_5_4 })  // not 'gpt-5.4'

Post-spawn channel operations (available on Agent instances and AgentRelay facade):

// Agent instance methods
agent.subscribe(channels: string[]): Promise<void>
agent.unsubscribe(channels: string[]): Promise<void>
agent.mute(channel: string): Promise<void>
agent.unmute(channel: string): Promise<void>
agent.channels: string[]          // current subscribed channels
agent.mutedChannels: string[]     // currently muted channels

// AgentRelay facade methods (by agent name)
relay.subscribe({ agent: string, channels: string[] }): Promise<void>
relay.unsubscribe({ agent: string, channels: string[] }): Promise<void>
relay.mute({ agent: string, channel: string }): Promise<void>
relay.unmute({ agent: string, channel: string }): Promise<void>

Preset	Interactive	Relay access	Use for
`lead`	yes (PTY)	yes	Coordination, monitoring channels
`worker`	no (subprocess)	no	Bounded tasks, structured stdout
`reviewer`	no (subprocess)	no	Reading artifacts, producing verdicts
`analyst`	no (subprocess)	no	Reading code/files, writing findings

Non-interactive presets run via one-shot mode (claude -p, codex exec). Output is clean and available via {{steps.X.output}}.

Critical rule: Pre-inject content into non-interactive agents. Don't ask them to read large files — pre-read in a deterministic step and inject via {{steps.X.output}}.

Step Definition

Agent Steps

.step('name', {
  agent: string,
  task: string,                   // supports {{var}} and {{steps.NAME.output}}
  dependsOn?: string[],
  verification?: VerificationCheck,
  retries?: number,
})

Deterministic Steps (Shell Commands)

.step('verify-files', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['implement'],
  captureOutput: true,
  failOnError: false,
})
.step('repair-files', {
  agent: 'worker',
  dependsOn: ['verify-files'],
  task: `If verify-files failed, create or fix the missing file and rerun the check.
Output:
{{steps.verify-files.output}}`,
  verification: { type: 'exit_code' },
})
.step('verify-files-final', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['repair-files'],
  captureOutput: true,
  failOnError: true,
})

Common Patterns

Interactive Team (lead + workers on shared channel)

.agent('lead', {
  cli: 'claude',
  model: ClaudeModels.OPUS,
  role: 'Architect and reviewer — assigns work, reviews, posts feedback',
  retries: 1,
  // No preset — interactive by default
})

.agent('impl-new', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Creates new files. Listens on channel for assignments and feedback.',
  retries: 2,
  // No preset — interactive, receives channel messages
})

.agent('impl-modify', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Edits existing files. Listens on channel for assignments and feedback.',
  retries: 2,
})

// All three share the same dependsOn — they start concurrently (no deadlock)
.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `You are the lead on #channel. Workers: impl-new, impl-modify.
Post the plan. Assign files. Review their work. Post feedback if needed.
Workers iterate based on your feedback. Exit when all files are correct.`,
})
.step('impl-new-work', {
  agent: 'impl-new',
  dependsOn: ['context'],   // same dep as lead = parallel start
  task: `You are impl-new on #channel. Wait for the lead's plan.
Create files as assigned. Report completion. Fix issues from feedback.`,
})
.step('impl-modify-work', {
  agent: 'impl-modify',
  dependsOn: ['context'],   // same dep as lead = parallel start
  task: `You are impl-modify on #channel. Wait for the lead's plan.
Edit files as assigned. Report completion. Fix issues from feedback.`,
})
// Downstream gates on lead (lead exits when satisfied)
.step('verify', { type: 'deterministic', dependsOn: ['lead-coordinate'], ... })

Key behaviors observed in production:

Workers self-organize from channel context. Workers read each other's completion messages and start dependent work without waiting for the lead to relay. The shared channel gives them ambient awareness.
Lead-as-reviewer is more efficient than a separate reviewer agent. The lead reads actual files and runs typecheck between rounds — one agent doing coordination + review eliminates a step.
Codex interactive mode works well with PTY channel injection. Don't default to preset: 'worker' — interactive Codex agents receive and act on channel messages reliably.
Workers may outpace the lead. If the lead is reviewing while workers are fast, the lead's "proceed" message may arrive after the worker already started from channel context. This is harmless but worth knowing.
No feedback loop needed = fast path. If workers get it right first try, the interactive pattern completes just as fast as one-shot. The feedback loop is insurance, not overhead.

When to use interactive team vs one-shot DAG:

Scenario	Pattern
4+ files, likely needs iteration	Interactive team
Simple edits, well-specified	One-shot DAG with `preset: 'worker'`
Cross-agent review feedback loop	Interactive team
Independent tasks, no coordination	Fan-out with non-interactive workers
Anything where the answer to "could this be `cmd1 \| cmd2`?" is no	Interactive team

Chat-Native Coordination Recipes

1. Question / Answer (blocking ask)

When agent A needs information only agent B has, instruct A to post a direct question and wait for a reply rather than guessing or proceeding.

.step('integrate', {
  agent: 'integrator',
  dependsOn: ['context'],
  task: `You are the integrator on #wf-feature.
Before writing code, post a direct question to @schema-owner asking which
table owns the new field. Do NOT proceed until @schema-owner replies in
channel. If no reply arrives in 5 minutes, @-mention the lead.`,
})

Why it beats {{steps.X.output}}: the answer depends on something only an agent (or human) can decide at runtime; encoding it as a prior step's stdout is wrong.

2. Broadcast / Ack

When a lead needs N workers to confirm receipt before proceeding (e.g., to make sure the plan was actually read), require explicit acks.

.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `Post the plan to #wf-feature, then @impl-a @impl-b @impl-c.
Wait for each to reply with "ACK <agent-name>" before issuing assignments.
If any worker hasn't acked in 3 minutes, re-post and ping again.
Only after all three have acked, post per-worker assignments.`,
})

3. Peer Review Handoff

The substantive form of "review my work." Worker pings reviewer in-channel with a concrete artifact reference; reviewer reads the actual files (not the chat); reviewer replies with a verdict.

.step('impl-a-work', {
  agent: 'impl-a',
  dependsOn: ['context'],
  task: `Implement src/foo.ts per the lead's assignment.
When done, post to #wf-feature: "@reviewer ready: src/foo.ts" — include the
commit SHA. Then wait for @reviewer's verdict in channel.
- If "APPROVED", you're done.
- If "CHANGES_REQUESTED <notes>", apply the notes and re-post.
- If no verdict in 5 min, @-mention the lead.`,
})

4. Standup / Status Probe

For long-running workflows, have the lead post periodic @-mention probes so silently-stuck workers surface fast.

.step('lead-coordinate', {
  agent: 'lead',
  task: `... coordinate the team ...

Every 10 minutes, post a status probe: "@impl-a @impl-b status?"
Each worker should reply with one of:
  - "RUNNING <step>" (still working)
  - "BLOCKED <reason>" (@-mention the lead with the blocker)
  - "DONE <artifact>" (ready for review)

If a worker is silent for two probes in a row, mark them stalled and
reassign their work to a peer.`,
})

5. Hand-Off with Context

When work flows from agent A to agent B during a workflow (not just between steps), have A post a structured handoff message so B doesn't re-derive context.

.step('impl-a-work', {
  agent: 'impl-a',
  task: `... finish your part ...

When done, post a handoff to #wf-feature targeting the next worker:
"@impl-b HANDOFF: src/foo.ts ready. Touched: <files>. Open question: <if any>.
Tests: <pass/fail summary>. Commit: <sha>."`,
})

Vs {{steps.X.output}}: an output-chain forces B to parse A's entire stdout. A handoff message is a curated summary A writes for B — much higher signal, no PTY/ANSI noise.

Picking a recipe

Need	Recipe
One agent needs an answer from another at runtime	Q/A
Lead needs to confirm workers received the plan	Broadcast/Ack
Agent-to-agent code review	Peer Review Handoff
Long-running team, want stalled-worker visibility	Standup/Probe
Sequential agent work that needs context curation	Hand-Off with Context

Authoring rule: if your workflow has interactive agents on a channel but their task strings don't instruct them to talk to each other, you're not using the chat primitive — you've just paid the overhead of starting it. Either add an explicit recipe above, or drop to preset: 'worker' and pipeline-shape.

Pipeline (sequential handoff)

.pattern('pipeline')
.step('analyze', { agent: 'analyst', task: '...' })
.step('implement', { agent: 'dev', task: '{{steps.analyze.output}}', dependsOn: ['analyze'] })
.step('test', { agent: 'tester', task: '{{steps.implement.output}}', dependsOn: ['implement'] })

Error Handling

.onError('fail-fast')   // stop on first failure (default)
.onError('continue')    // skip failed branches, continue others
.onError('retry', { maxRetries: 3, retryDelayMs: 5000 })

Multi-File Edit Pattern

When a workflow needs to modify multiple existing files, use one agent step per file with a deterministic verify gate after each. Agents reliably edit 1-2 files per step but fail on 4+.

steps:
  - name: read-types
    type: deterministic
    command: cat src/types.ts
    captureOutput: true

  - name: edit-types
    agent: dev
    dependsOn: [read-types]
    task: |
      Edit src/types.ts. Current contents:
      {{steps.read-types.output}}
      Add 'pending' to the Status union type.
      Only edit this one file.
    verification:
      type: exit_code

  - name: verify-types
    type: deterministic
    dependsOn: [edit-types]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: false

  - name: fix-types-verification
    agent: dev
    dependsOn: [verify-types]
    task: |
      If verify-types failed, fix src/types.ts and rerun the verify command.
      Output:
      {{steps.verify-types.output}}
    verification:
      type: exit_code

  - name: verify-types-final
    type: deterministic
    dependsOn: [fix-types-verification]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: true

  - name: read-service
    type: deterministic
    dependsOn: [verify-types-final]
    command: cat src/service.ts
    captureOutput: true

  - name: edit-service
    agent: dev
    dependsOn: [read-service]
    task: |
      Edit src/service.ts. Current contents:
      {{steps.read-service.output}}
      Add a handlePending() method.
      Only edit this one file.
    verification:
      type: exit_code

  - name: verify-service
    type: deterministic
    dependsOn: [edit-service]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: false

  - name: fix-service-verification
    agent: dev
    dependsOn: [verify-service]
    task: |
      If verify-service failed, fix src/service.ts and rerun the verify command.
      Output:
      {{steps.verify-service.output}}
    verification:
      type: exit_code

  - name: verify-service-final
    type: deterministic
    dependsOn: [fix-service-verification]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    captureOutput: true
    failOnError: true

  # Deterministic commit — never rely on agents to commit
  - name: commit
    type: deterministic
    dependsOn: [verify-service-final]
    command: npm run typecheck && npm test && git add src/types.ts src/service.ts && git commit -m "feat: add pending status"
    captureOutput: true
    failOnError: false

  - name: repair-commit
    agent: dev
    dependsOn: [commit]
    task: |
      If commit failed, fix the blocker, rerun npm run typecheck && npm test, and create the commit.
      If commit passed, confirm the commit subject.
      Output:
      {{steps.commit.output}}
    verification:
      type: exit_code

  - name: verify-commit-created
    type: deterministic
    dependsOn: [repair-commit]
    command: 'git log -1 --pretty=%s | grep -q "^feat: add pending status$" && echo "COMMIT_OK" || (echo "COMMIT_MISSING"; exit 1)'
    captureOutput: true
    failOnError: true

Key rules:

Read the file in a deterministic step right before the edit (not all files upfront)
Tell the agent "Only edit this one file" to prevent it touching other files
Verify tracked-only edits with git diff --quiet, hand failures back to an agent to repair, then rerun the deterministic check as proof
If the edit may create new files/packages, verify with git status --short -- <paths> because git diff --quiet ignores untracked files
Always commit with a deterministic step, never an agent step; rerun acceptance checks in that step, let an agent repair commit blockers, and prove the commit exists

File Materialization: Verify Before Proceeding

After any step that creates files, add a deterministic file_exists check before proceeding. Non-interactive agents may exit 0 without writing anything (wrong cwd, stdout instead of disk).

- name: verify-files
  type: deterministic
  dependsOn: [impl-auth, impl-storage]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ $missing -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  captureOutput: true
  failOnError: false

- name: fix-missing-files
  agent: impl-auth
  dependsOn: [verify-files]
  task: |
    If verify-files found missing files, create/fix them and rerun the check.
    Output:
    {{steps.verify-files.output}}
  verification:
    type: exit_code

- name: verify-files-final
  type: deterministic
  dependsOn: [fix-missing-files]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ $missing -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  captureOutput: true
  failOnError: true

Rules for file-writing tasks:

Use full paths from project root — say src/auth/credentials.ts, not credentials.ts
Add IMPORTANT: Write the file to disk. Do NOT output to stdout.
Use file_exists verification for creation steps (not just exit_code)
Gate all downstream steps on the final deterministic proof step that follows the repair step

Edit Gates Must See Untracked Files

Use git status --short -- <paths> for materialization/edit gates, and keep the first gate repairable:

- name: provider-edit-gate-capture
  type: deterministic
  dependsOn: [implement-providers]
  command: |
    if [ -z "$(git status --short -- packages/new-provider .workflow-artifacts/my-flow)" ]; then
      echo "NO_PROVIDER_CHANGES"
      exit 1
    fi
    echo "PROVIDER_EDIT_GATE_OK"
  captureOutput: true
  failOnError: false

- name: repair-edit-gate
  agent: provider-worker
  dependsOn: [provider-edit-gate-capture]
  task: |
    If provider-edit-gate-capture reported NO_PROVIDER_CHANGES, inspect git
    status including untracked files and add the missing provider artifacts.
    If it already passed, do nothing.
  verification:
    type: exit_code

- name: provider-edit-gate-final
  type: deterministic
  dependsOn: [repair-edit-gate]
  command: |
    if [ -z "$(git status --short -- packages/new-provider .workflow-artifacts/my-flow)" ]; then
      echo "NO_PROVIDER_CHANGES"
      exit 1
    fi
    echo "PROVIDER_EDIT_GATE_FINAL_OK"
  captureOutput: true
  failOnError: false

- name: repair-provider-edit-gate-final
  agent: provider-worker
  dependsOn: [provider-edit-gate-final]
  task: |
    If provider-edit-gate-final is still red, repair the missing provider
    artifacts and rerun the check. If repair is impossible, write
    .workflow-artifacts/my-flow/BLOCKED_NO_COMMIT.md with exact evidence and
    do not commit.
    Output:
    {{steps.provider-edit-gate-final.output}}
  verification:
    type: exit_code

Both gates capture evidence and give an agent a chance to fix. A still-red final gate becomes a blocked/no-commit artifact, not a workflow crash.

Agent Transport Must Not Be The First Hard Gate

For long rollouts, keep the critical path evidence-based:

.step('runtime-implementation', {
  agent: 'impl-runtime',
  dependsOn: ['context'],
  task: 'Implement the runtime slice and write .workflow-artifacts/runtime.md',
})
.step('adapter-implementation', {
  agent: 'impl-adapters',
  dependsOn: ['context'],
  task: 'Implement adapter wiring and write .workflow-artifacts/adapters.md',
})
.step('implementation-reconcile', {
  type: 'deterministic',
  dependsOn: ['context'],
  command: `git status --short -- packages/core packages/*/src/writeback.ts scripts tests .workflow-artifacts
test -f scripts/verify-e2e.mjs || echo "MISSING_E2E"
test -f packages/core/src/runtime/router.ts || echo "MISSING_ROUTER"`,
  captureOutput: true,
  failOnError: false,
})
.step('repair-implementation-reconcile', {
  agent: 'qa',
  dependsOn: ['implementation-reconcile'],
  task: `Finish anything missing before gates run:\n{{steps.implementation-reconcile.output}}`,
  verification: { type: 'exit_code' },
})
.step('run-e2e', {
  type: 'deterministic',
  dependsOn: ['repair-implementation-reconcile'],
  command: 'npm run verify:e2e',
  captureOutput: true,
  failOnError: false,
})

DAG Deadlock Anti-Pattern

# WRONG — deadlock: coordinate depends on context, work-a depends on coordinate
steps:
  - name: coordinate
    dependsOn: [context]    # lead waits for WORKER_DONE...
  - name: work-a
    dependsOn: [coordinate] # ...but work-a can't start until coordinate finishes

# RIGHT — workers and lead start in parallel
steps:
  - name: context
    type: deterministic
  - name: work-a
    dependsOn: [context]    # starts with lead
  - name: coordinate
    dependsOn: [context]    # starts with workers
  - name: merge
    dependsOn: [work-a, coordinate]

Rule: if a lead step's task mentions downstream step names alongside waiting keywords, that's a deadlock.

Step Sizing

One agent, one deliverable. A step's task prompt should be 10-20 lines max.

Split into a lead + workers team when:

The task requires a 50+ line prompt
The deliverable is multiple files that must be consistent
You need one agent to verify another's output

# Team pattern: lead + workers on a shared channel
steps:
  - name: track-lead-coord
    agent: track-lead
    dependsOn: [prior-step]
    task: |
      Lead the track on #my-track. Workers: track-worker-1, track-worker-2.
      Post assignments to the channel. Review worker output.

  - name: track-worker-1-impl
    agent: track-worker-1
    dependsOn: [prior-step]  # same dep as lead — starts concurrently
    task: |
      Join #my-track. track-lead will post your assignment.
      Implement the file as directed.
    verification:
      type: exit_code

  - name: next-step
    dependsOn: [track-lead-coord]  # downstream depends on lead, not workers

Supervisor Pattern

Auto-hardening only activates for hub patterns — not pipeline or dag.

Use case	Pattern	Why
Sequential, no monitoring	`pipeline`	Simple, no overhead
Workers need oversight	`supervisor`	Auto-owner monitors
Local/small models	`supervisor`	Supervisor catches stuck workers
All non-interactive	`pipeline` or `dag`	No PTY = no supervision needed

Concurrency

Cap maxConcurrency at 4-6. Spawning 10+ agents simultaneously causes broker timeouts.

Parallel agents	`maxConcurrency`
2-4	4 (default safe)
5-10	5
10+	6-8 max

Common Mistakes

Mistake	Fix
Treating relay as transport, not as a coordination layer (every step is `preset: 'worker'`, every handoff is `{{steps.X.output}}`)	Default to Conversation shape for non-trivial work — interactive agents on a shared channel. Pipeline-shape is only correct when the work could be expressed as a `bash \| bash \| bash` pipe.
Interactive agents on a channel whose task strings don't tell them to talk to each other	Pick a Chat-Native Coordination Recipe (Q/A, Broadcast/Ack, Peer Review, Standup, Hand-Off) and bake it into the task prompt — otherwise you're paying for a chat substrate you're not using
All workflows run sequentially	Group independent workflows into parallel waves (4-7x speedup)
Every step depends on the previous one	Only add `dependsOn` when there's a real data dependency
Self-review step with no timeout	Set `timeout: 300_000` (5 min) — Codex hangs in non-interactive review
One giant workflow per feature	Split into smaller workflows that can run in parallel waves
Adding exit instructions to tasks	Runner handles self-termination automatically
Setting `timeoutMs` on agents/steps	Use global `.timeout()` only
Using `general` channel	Set `.channel('wf-name')` for isolation
`{{steps.X.output}}` without `dependsOn: ['X']`	Output won't be available yet
Requiring exact sentinel as only completion gate	Use `exit_code` or `file_exists` verification
Writing 100-line task prompts	Split into lead + workers on a channel
`maxConcurrency: 16` with many parallel steps	Cap at 5-6
Non-interactive agent reading large files via tools	Pre-read in deterministic step, inject via `{{steps.X.output}}`
Workers depending on lead step (deadlock)	Both depend on shared context step
Validation gates depending directly on long interactive implementation agents	Add a deterministic implementation-reconcile step and make gates depend on its repair step
`fan-out`/`hub-spoke` for simple parallel workers	Use `dag` instead
`pipeline` but expecting auto-supervisor	Only hub patterns auto-harden. Use `.pattern('supervisor')`
Workers without `preset: 'worker'` in one-shot DAG lead+worker flows	Add preset for clean stdout when chaining `{{steps.X.output}}` (not needed for interactive team patterns)
Using `_` in YAML numbers (`timeoutMs: 1_200_000`)	YAML doesn't support `_` separators
Workflow timeout under 30 min for complex workflows	Use `3600000` (1 hour) as default
Using `require()` in ESM projects	Check `package.json` for `"type": "module"` — use `import` if ESM
Wrapping in `async function main()` in ESM	ESM supports top-level `await` — no wrapper needed
Using `createWorkflowRenderer`	Does not exist. Use `.run({ cwd: process.cwd() })`
`export default workflow(...)...build()`	No `.build()`. Chain ends with `.run()` — the file must call `.run()`, not just export config
Relative import `'../workflows/builder.js'`	Use `import { workflow } from '@agent-relay/sdk/workflows'`
Hardcoded model strings (`model: 'opus'`)	Use constants: `import { ClaudeModels } from '@agent-relay/config'` → `model: ClaudeModels.OPUS`
Thinking `agent-relay run` inspects exports	It executes the file as a subprocess. Only `.run()` invocations trigger steps
`pattern('single')` on cloud runner	Not supported — use `dag`
`pattern('supervisor')` with one agent	Same agent is owner + specialist. Use `dag`
Invalid verification type (`type: 'deterministic'`)	Only `exit_code`, `output_contains`, `file_exists`, `custom` are valid
Chaining `{{steps.X.output}}` from interactive agents	PTY output is garbled. Use deterministic steps or `preset: 'worker'`
Single step editing 4+ files	Agents modify 1-2 then exit. Split to one file per step with verify gates
Relying on agents to `git commit`	Agents emit markers without running git. Use deterministic commit step
File-writing steps without `file_exists` verification	`exit_code` auto-passes even if no file written
Edit gate uses `git diff --quiet` for new files/packages	`git diff` ignores untracked files and can fail a valid implementation with `NO_CHANGES`
Hard-stop validation gates in product workflows	A red check stops the agent team at the exact moment it should fix the problem. Capture gate output with `failOnError: false`, add a repair agent step, rerun, and reserve hard failure for exhausted repair budget or external blockers
Final acceptance before repair	Broken work can stop or commit without giving the team a final chance to fix it. Run final acceptance, hand output to a repair owner, rerun, then commit/open PR only after green deterministic evidence
Treating optional notification credentials as fatal	Workflow progress gets blocked by a non-core side effect. Prefer primitive/runtime fallbacks such as the Slack primitive's `cloud-relay` or `noop` shape from AgentWorkforce/relay#823 when notification is not the product contract
Manual peer fanout in `handleChannelMessage()`	Use broker-managed channel subscriptions — broker fans out to all subscribers automatically
Client-side `personaNames.has(from)` filtering	Use `relay.subscribe()`/`relay.unsubscribe()` — only subscribed agents receive messages
Agents receiving noisy cross-channel messages during focused work	Use `relay.mute({ agent, channel })` to silence non-primary channels without leaving them
Hardcoding all channels at spawn time	Use `agent.subscribe()` / `agent.unsubscribe()` for dynamic channel membership post-spawn
Using `preset: 'worker'` for Codex in interactive team patterns when coordination is needed	Codex interactive mode works fine with PTY channel injection. Drop the preset for interactive team patterns (keep it for one-shot DAG workers where clean stdout matters)
Separate reviewer agent from lead in interactive team	Merge lead + reviewer into one interactive Claude agent — reviews between rounds, fewer agents
Not printing PR URL after `createGitHubStep({ action: 'createPR' })`	Capture `html_url` with `output: { mode: 'data', format: 'json', path: 'html_url' }` and echo or write it in a final deterministic step
Workflow ending without worktree + PR for cross-repo changes	Add `setup-worktree` at start and `push-and-pr` + `cleanup-worktree` at end

YAML Alternative

version: '1.0'
name: my-workflow
swarm:
  pattern: dag
  channel: wf-my-workflow
agents:
  - name: lead
    cli: claude
    role: Architect
  - name: worker
    cli: codex
    role: Implementer
workflows:
  - name: default
    steps:
      - name: plan
        agent: lead
        task: 'Produce a detailed implementation plan.'
      - name: implement
        agent: worker
        task: 'Implement: {{steps.plan.output}}'
        dependsOn: [plan]
        verification:
          type: exit_code

Run with: agent-relay run path/to/workflow.yaml

Available Swarm Patterns

See skill choosing-swarm-patterns for pattern selection guidance.

writing-agent-relay-workflows

More from this repository

More from this repository

Writing Agent Relay Workflows

Overview

When to Use

Default Principle: Workflows Repair Before They Fail

Choose Your Coordination Style — Conversation vs Pipeline

Quick Reference (Pipeline shape)

Quick Reference (Conversation shape)

⚡ Parallelism — Design for Speed

Cross-Workflow Parallelism: Wave Planning

Wave Planning Heuristics

Declare File Scope for Planning

Within-Workflow Parallelism

Impact

Failure Prevention

1. Do not use raw top-level await

1b. Make commit and PR boundaries explicit

2. Avoid raw fenced code blocks inside workflow task template literals

2b. Standard preflight template for resumable workflows

2c. Picking the right .join() for multi-line shell commands

2d. Template-literal escape sequences are processed once before the string is rendered

3. Keep final verification boring and deterministic

4. Separate durable outputs from execution exhaust

5. Prefer Codex for implementation-heavy roles and Claude for review

6. Be explicit about shell requirements

7. Make resume semantics explicit

7a. --resume vs --start-from when fixing a buggy step

8. Syntax-check workflow files after editing

9. Factor repo-specific setup into a shared helper

End-to-End Bug Fix Workflows

Required phases for fix workflows

Clean-environment validation guidance

Meta-workflow guidance

Shipping the Result — Open a PR via createGitHubStep

Why createGitHubStep (and not raw gh / octokit)

The minimal "open a PR" recipe

Authoring rules for PR-shipping workflows

Where this fits in the bug-fix phases

Key Concepts

Step Output Chaining

Verification Gates

DAG Dependencies

Self-Termination

Step Completion Model

Dynamic Channel Management

SDK API

Semantics

Events

When to Use in Workflows

What This Eliminates

Agent Definition

Model Constants

Step Definition

Agent Steps

Deterministic Steps (Shell Commands)

Common Patterns

Interactive Team (lead + workers on shared channel)

Chat-Native Coordination Recipes

1. Question / Answer (blocking ask)

2. Broadcast / Ack

3. Peer Review Handoff

4. Standup / Status Probe

5. Hand-Off with Context

Picking a recipe

Pipeline (sequential handoff)

Error Handling

Multi-File Edit Pattern

File Materialization: Verify Before Proceeding

Edit Gates Must See Untracked Files

Agent Transport Must Not Be The First Hard Gate

DAG Deadlock Anti-Pattern

Step Sizing

Supervisor Pattern

Concurrency

Common Mistakes

YAML Alternative

Available Swarm Patterns

Writing Agent Relay Workflows

1. Do not use raw top-level `await`

2c. Picking the right `.join()` for multi-line shell commands

7a. `--resume` vs `--start-from` when fixing a buggy step

Shipping the Result — Open a PR via `createGitHubStep`

Why `createGitHubStep` (and not raw `gh` / `octokit`)

1. Do not use raw top-level `await`

2c. Picking the right `.join()` for multi-line shell commands

7a. `--resume` vs `--start-from` when fixing a buggy step

Shipping the Result — Open a PR via `createGitHubStep`

Why `createGitHubStep` (and not raw `gh` / `octokit`)