| name | writing-agent-relay-workflows |
| description | Use when building multi-agent workflows with the relay broker-sdk - covers the WorkflowBuilder API, DAG step dependencies, agent definitions, step output chaining via {{steps.X.output}}, verification gates, evidence-based completion, owner decisions, dedicated channels, dynamic channel management (subscribe/unsubscribe/mute/unmute), swarm patterns, error handling, event listeners, step sizing rules, authoring best practices, and the lead+workers team pattern for complex steps |
Overview
The relay broker-sdk workflow system orchestrates multiple AI agents (Claude, Codex, Gemini, Aider, Goose) through typed DAG-based workflows. Workflows can be written in TypeScript (preferred), Python, or YAML.
Language preference: TypeScript > Python > YAML. Use TypeScript unless the project is Python-only or a simple config-driven workflow suits YAML.
Pattern selection: Do not default to dag blindly. If the job needs a different swarm/workflow type, consult the choosing-swarm-patterns skill when available and select the pattern that best matches the coordination problem.
When to Use
- Building multi-agent workflows with step dependencies
- Orchestrating different AI CLIs (claude, codex, gemini, aider, goose)
- Creating DAG, pipeline, fan-out, or other swarm patterns
- Needing verification gates, retries, or step output chaining
- Dynamic channel management: agents joining/leaving/muting channels mid-workflow
Quick Reference
```typescript
import { workflow } from '@agent-relay/sdk/workflows';

const result = await workflow('my-workflow')
  .description('What this workflow does')
  .pattern('dag')
  .channel('wf-my-workflow')
  .maxConcurrency(3)
  .timeout(3_600_000) // 1 hour, in milliseconds
  .agent('lead', { cli: 'claude', role: 'Architect', retries: 2 })
  .agent('worker', { cli: 'codex', role: 'Implementer', retries: 2 })
  .step('plan', {
    agent: 'lead',
    task: `Analyze the codebase and produce a plan.`,
    retries: 2,
    verification: { type: 'output_contains', value: 'PLAN_COMPLETE' },
  })
  .step('implement', {
    agent: 'worker',
    task: `Implement based on this plan:\n{{steps.plan.output}}`,
    dependsOn: ['plan'],
    verification: { type: 'exit_code' },
  })
  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

console.log('Result:', result.status);
```
⚡ Parallelism — Design for Speed
Cross-Workflow Parallelism: Wave Planning
Sequential (slow): each workflow waits for the previous one to finish.
agent-relay run workflows/34-sst-wiring.ts
agent-relay run workflows/35-env-config.ts
agent-relay run workflows/36-loading-states.ts
Parallel waves (fast): start independent workflows in the background, wait for the whole wave, then commit.
agent-relay run workflows/34-sst-wiring.ts &
agent-relay run workflows/35-env-config.ts &
agent-relay run workflows/36-loading-states.ts &
agent-relay run workflows/37-responsive.ts &
wait
git add -A && git commit -m "Wave 1"
agent-relay run workflows/40-unit-tests.ts &
agent-relay run workflows/41-integration-tests.ts &
agent-relay run workflows/42-e2e-tests.ts &
wait
git add -A && git commit -m "Wave 2"
Declare File Scope for Planning
workflow('48-comparison-mode')
  .packages(['web', 'core'])
  .isolatedFrom(['49-feedback-system'])
  .requiresBefore(['46-admin-dashboard'])
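One way declared scope could feed wave planning is sketched below. This is only an illustration under stated assumptions: `WorkflowScope` and `planWaves` are hypothetical names, not SDK exports; two workflows are treated as conflicting when their `packages` overlap; `requiresBefore` is read as "must finish in an earlier wave"; and `isolatedFrom` is left out because its exact semantics are not described here.

```typescript
// Hypothetical planning helper, not part of the SDK.
interface WorkflowScope {
  name: string;
  packages: string[];       // packages the workflow touches
  requiresBefore: string[]; // workflows that must finish in an earlier wave
}

function planWaves(scopes: WorkflowScope[]): string[][] {
  const done = new Set<string>();
  let remaining = [...scopes];
  const waves: string[][] = [];

  while (remaining.length > 0) {
    const wave: WorkflowScope[] = [];
    for (const scope of remaining) {
      // Ready only when every prerequisite ran in an earlier wave.
      const ready = scope.requiresBefore.every((dep) => done.has(dep));
      // Assumption: sharing a package means the two may edit the same files.
      const clash = wave.some((w) => w.packages.some((p) => scope.packages.includes(p)));
      if (ready && !clash) wave.push(scope);
    }
    if (wave.length === 0) {
      throw new Error('No workflow can start: check requiresBefore for cycles or unknown names');
    }
    for (const w of wave) done.add(w.name);
    remaining = remaining.filter((s) => !wave.includes(s));
    waves.push(wave.map((w) => w.name));
  }
  return waves;
}
```

Each inner array is a group that could be launched with `&` and joined with `wait`, as in the wave commands above.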
Within-Workflow Parallelism
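Avoid: fix-component-b waits on fix-component-a even though it only needs the review, and both steps share one agent.
.step('fix-component-a', { agent: 'worker', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'worker', dependsOn: ['fix-component-a'] })
Prefer: both fixes depend only on review and use separate agents, so they run in parallel; verify-all joins them.
.step('fix-component-a', { agent: 'impl-1', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'impl-2', dependsOn: ['review'] })
.step('verify-all', { agent: 'reviewer', dependsOn: ['fix-component-a', 'fix-component-b'] })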
Failure Prevention
1. Do not use raw top-level await
async function runWorkflow() {
  const result = await workflow('my-workflow')
    .run({ cwd: process.cwd() });
  console.log('Workflow status:', result.status);
}

runWorkflow().catch((error) => {
  console.error(error);
  process.exit(1);
});
3. Keep final verification boring and deterministic
grep -Eq "foo|bar|baz" file.ts
6. Be explicit about shell requirements
/opt/homebrew/bin/bash workflows/your-workflow/execute.sh --wave 2
End-to-End Bug Fix Workflows
- Capture the original failure
  - Reproduce the bug first in a deterministic or evidence-capturing step
  - Save exact commands, logs, status codes, or screenshots/artifacts
- State the acceptance contract
  - Define the exact end-to-end success criteria before implementation
  - Include the real entrypoint a user would run
- Implement the fix
- Rebuild / reinstall from scratch
  - Do not trust dirty local state
  - Prefer a clean environment when install/bootstrap behavior is involved
- Run targeted regression checks
  - Unit/integration tests are helpful but not sufficient by themselves
- Run a full end-to-end validation
  - Use the real CLI / API / install path
  - Prefer a clean environment (Docker, sandbox, cloud workspace, Daytona, etc.) for install/runtime issues
- Compare before vs after evidence
  - Show that the original failure no longer occurs
- Record residual risks
  - Call out what was not covered
- disposable sandbox / cloud workspace
- Docker / containerized environment
- fresh local shell with isolated paths
- compares candidate validation environments
- defines the acceptance contract
- chooses the best swarm pattern
- then authors the final fix/validation workflow
Key Concepts
Verification Gates
verification: { type: 'exit_code' }
verification: { type: 'output_contains', value: 'DONE' }
verification: { type: 'file_exists', value: 'src/out.ts' }
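As a rough mental model of how these three gates might be evaluated, consider the sketch below. `passesGate`, `StepResult`, and the injected `fileExists` callback are hypothetical names for illustration, not SDK exports. Treating exit code 0 as the passing value is an assumption, and the `custom` verification type is omitted.

```typescript
// Illustrative model only; the real runner's checks may differ.
type VerificationCheck =
  | { type: 'exit_code' }
  | { type: 'output_contains'; value: string }
  | { type: 'file_exists'; value: string };

interface StepResult {
  exitCode: number;
  output: string;
}

function passesGate(
  check: VerificationCheck,
  result: StepResult,
  fileExists: (path: string) => boolean, // injected so the sketch has no fs dependency
): boolean {
  switch (check.type) {
    case 'exit_code':
      return result.exitCode === 0; // assumption: zero means success
    case 'output_contains':
      return result.output.includes(check.value);
    case 'file_exists':
      return fileExists(check.value);
  }
}
```

Under this model an agent that exits 0 without writing anything still passes `exit_code`, which is why file-writing steps are better gated with `file_exists`.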
DAG Dependencies
.step('fix-types', { agent: 'worker', dependsOn: ['review'], ... })
.step('fix-tests', { agent: 'worker', dependsOn: ['review'], ... })
.step('final', { agent: 'lead', dependsOn: ['fix-types', 'fix-tests'], ... })
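To see which steps a set of `dependsOn` declarations allows to run side by side, the steps can be grouped by dependency depth. The helper below is a minimal sketch of that idea; `StepDeps` and `parallelLevels` are hypothetical names, and the real runner's scheduler may behave differently (for example by also honoring maxConcurrency).

```typescript
// Hypothetical helper: groups steps into levels that could run in parallel.
interface StepDeps {
  name: string;
  dependsOn?: string[];
}

function parallelLevels(steps: StepDeps[]): string[][] {
  const byName = new Map(steps.map((s): [string, StepDeps] => [s.name, s]));
  const depth = new Map<string, number>();
  const visiting = new Set<string>();

  const depthOf = (name: string): number => {
    const cached = depth.get(name);
    if (cached !== undefined) return cached;
    const step = byName.get(name);
    if (!step) throw new Error(`Unknown step: ${name}`);
    if (visiting.has(name)) throw new Error(`Dependency cycle at: ${name}`);
    visiting.add(name);
    const deps = step.dependsOn ?? [];
    // A step sits one level below its deepest dependency.
    const d = deps.length === 0 ? 0 : 1 + Math.max(...deps.map(depthOf));
    visiting.delete(name);
    depth.set(name, d);
    return d;
  };

  const levels: string[][] = [];
  for (const s of steps) {
    const d = depthOf(s.name);
    while (levels.length <= d) levels.push([]);
    const bucket = levels[d];
    if (bucket) bucket.push(s.name);
  }
  return levels;
}
```

For the three steps above plus a `review` step with no dependencies, this yields `review` first, then `fix-types` and `fix-tests` together, then `final`.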
SDK API
relay.subscribe({ agent: 'security-auditor', channels: ['review-pr-456'] });
relay.unsubscribe({ agent: 'security-auditor', channels: ['general'] });
relay.mute({ agent: 'security-auditor', channel: 'review-pr-123' });
relay.unmute({ agent: 'security-auditor', channel: 'review-pr-123' });
Events
relay.onChannelSubscribed = (agent, channels) => { };
relay.onChannelUnsubscribed = (agent, channels) => { };
relay.onChannelMuted = (agent, channel) => { };
relay.onChannelUnmuted = (agent, channel) => { };
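The difference between unsubscribing and muting can be pictured with a small in-memory model. This is not the broker's implementation: `ChannelMembership` and its methods are hypothetical, and the choice to clear a mute when an agent unsubscribes is an assumption made for the sketch.

```typescript
// Hypothetical model of channel membership, for illustration only.
class ChannelMembership {
  private subscribed = new Map<string, Set<string>>(); // channel -> agents
  private muted = new Map<string, Set<string>>();      // channel -> agents

  subscribe(agent: string, channels: string[]): void {
    for (const channel of channels) this.setFor(this.subscribed, channel).add(agent);
  }

  unsubscribe(agent: string, channels: string[]): void {
    for (const channel of channels) {
      this.subscribed.get(channel)?.delete(agent);
      this.muted.get(channel)?.delete(agent); // assumption: leaving clears the mute
    }
  }

  mute(agent: string, channel: string): void {
    this.setFor(this.muted, channel).add(agent);
  }

  unmute(agent: string, channel: string): void {
    this.muted.get(channel)?.delete(agent);
  }

  // Agents that would receive a message posted to `channel`, sorted by name.
  recipients(channel: string): string[] {
    const members = this.subscribed.get(channel) ?? new Set<string>();
    const silenced = this.muted.get(channel) ?? new Set<string>();
    return Array.from(members).filter((agent) => !silenced.has(agent)).sort();
  }

  private setFor(index: Map<string, Set<string>>, channel: string): Set<string> {
    let set = index.get(channel);
    if (!set) {
      set = new Set<string>();
      index.set(channel, set);
    }
    return set;
  }
}
```

In this model a muted agent stays a member of the channel but is skipped on delivery, so unmuting restores messages without a new subscribe call.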
Agent Definition
```typescript
.agent('name', {
  cli: 'claude' | 'codex' | 'gemini' | 'aider' | 'goose' | 'opencode' | 'droid',
  role?: string,
  preset?: 'lead' | 'worker' | 'reviewer' | 'analyst',
  retries?: number,
  model?: string,
  interactive?: boolean,
})
```
Model Constants
import { ClaudeModels, CodexModels, GeminiModels } from '@agent-relay/config';
.agent('planner', { cli: 'claude', model: ClaudeModels.OPUS })
.agent('worker', { cli: 'claude', model: ClaudeModels.SONNET })
.agent('coder', { cli: 'codex', model: CodexModels.GPT_5_4 })
Step Definition
Agent Steps
.step('name', {
  agent: string,
  task: string,
  dependsOn?: string[],
  verification?: VerificationCheck,
  retries?: number,
})
Deterministic Steps (Shell Commands)
.step('verify-files', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['implement'],
  captureOutput: true,
  failOnError: true,
})
Common Patterns
Interactive Team (lead + workers on shared channel)
.agent('lead', {
  cli: 'claude',
  model: ClaudeModels.OPUS,
  role: 'Architect and reviewer: assigns work, reviews, posts feedback',
  retries: 1,
})
.agent('impl-new', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Creates new files. Listens on channel for assignments and feedback.',
  retries: 2,
})
.agent('impl-modify', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Edits existing files. Listens on channel for assignments and feedback.',
  retries: 2,
})
.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `You are the lead on #channel. Workers: impl-new, impl-modify.
Post the plan. Assign files. Review their work. Post feedback if needed.
Workers iterate based on your feedback. Exit when all files are correct.`,
})
.step('impl-new-work', {
  agent: 'impl-new',
  dependsOn: ['context'],
  task: `You are impl-new on #channel. Wait for the lead's plan.
Create files as assigned. Report completion. Fix issues from feedback.`,
})
.step('impl-modify-work', {
  agent: 'impl-modify',
  dependsOn: ['context'],
  task: `You are impl-modify on #channel. Wait for the lead's plan.
Edit files as assigned. Report completion. Fix issues from feedback.`,
})
.step('verify', { type: 'deterministic', dependsOn: ['lead-coordinate'], ... })
Pipeline (sequential handoff)
.pattern('pipeline')
.step('analyze', { agent: 'analyst', task: '...' })
.step('implement', { agent: 'dev', task: '{{steps.analyze.output}}', dependsOn: ['analyze'] })
.step('test', { agent: 'tester', task: '{{steps.implement.output}}', dependsOn: ['implement'] })
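The handoff above relies on `{{steps.X.output}}` placeholders being replaced with the captured output of an earlier step. A minimal sketch of that substitution follows; `renderTask` is a hypothetical helper, not an SDK function, and the set of characters allowed in step names is an assumption.

```typescript
// Hypothetical illustration of placeholder substitution.
function renderTask(template: string, outputs: Map<string, string>): string {
  const placeholder = /\{\{steps\.([A-Za-z0-9_-]+)\.output\}\}/g;
  return template.replace(placeholder, (_match: string, stepName: string) => {
    const output = outputs.get(stepName);
    if (output === undefined) {
      // Mirrors the "output won't be available yet" failure mode.
      throw new Error(`No output recorded for step "${stepName}"`);
    }
    return output;
  });
}
```

A placeholder whose step has not run has nothing to substitute, which is why every referenced step should also appear in `dependsOn`.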
Error Handling
.onError('fail-fast')
.onError('continue')
.onError('retry', { maxRetries: 3, retryDelayMs: 5000 })
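The three strategies can be compared with a small synchronous simulation. Everything here is an assumption drawn from the option names rather than from documented runner behavior: `applyPolicy`, `ErrorPolicy`, and `StepOutcome` are hypothetical, `maxRetries` is read as "extra attempts after the first", and a step that still fails after its retries is assumed to halt the workflow.

```typescript
// Hypothetical simulation of error strategies; not the runner's code.
type ErrorPolicy =
  | { strategy: 'fail-fast' }
  | { strategy: 'continue' }
  | { strategy: 'retry'; maxRetries: number; retryDelayMs: number };

interface StepOutcome {
  ok: boolean;
  attempts: number;
  haltWorkflow: boolean;
}

function applyPolicy(attempt: () => boolean, policy: ErrorPolicy): StepOutcome {
  // Assumption: maxRetries counts attempts made after the first one.
  const maxAttempts = policy.strategy === 'retry' ? 1 + policy.maxRetries : 1;
  let attempts = 0;
  let ok = false;
  while (attempts < maxAttempts && !ok) {
    attempts += 1;
    ok = attempt(); // a real runner would wait retryDelayMs before the next attempt
  }
  // Assumption: only 'continue' lets the rest of the workflow proceed after a failure.
  return { ok, attempts, haltWorkflow: !ok && policy.strategy !== 'continue' };
}
```

With `maxRetries: 3`, a step that succeeds on its third attempt would be reported as passing after three attempts under these assumptions.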
Multi-File Edit Pattern
When a workflow needs to modify multiple existing files, use one agent step per file with a deterministic verify gate after each. Agents reliably edit 1-2 files per step but fail on 4+.
steps:
  - name: read-types
    type: deterministic
    command: cat src/types.ts
    captureOutput: true
  - name: edit-types
    agent: dev
    dependsOn: [read-types]
    task: |
      Edit src/types.ts. Current contents:
      {{steps.read-types.output}}
      Add 'pending' to the Status union type.
      Only edit this one file.
    verification:
      type: exit_code
  - name: verify-types
    type: deterministic
    dependsOn: [edit-types]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    failOnError: true
  - name: read-service
    type: deterministic
    dependsOn: [verify-types]
    command: cat src/service.ts
    captureOutput: true
  - name: edit-service
    agent: dev
    dependsOn: [read-service]
    task: |
      Edit src/service.ts. Current contents:
      {{steps.read-service.output}}
      Add a handlePending() method.
      Only edit this one file.
    verification:
      type: exit_code
  - name: verify-service
    type: deterministic
    dependsOn: [edit-service]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    failOnError: true
  - name: commit
    type: deterministic
    dependsOn: [verify-service]
    command: 'git add src/types.ts src/service.ts && git commit -m "feat: add pending status"'
    failOnError: true
File Materialization: Verify Before Proceeding
After any step that creates files, add a deterministic file_exists check before proceeding. Non-interactive agents may exit 0 without writing anything, for example when they run in the wrong cwd or print the file contents to stdout instead of writing them to disk.
- name: verify-files
  type: deterministic
  dependsOn: [impl-auth, impl-storage]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ "$missing" -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  failOnError: true
DAG Deadlock Anti-Pattern
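```yaml
# Wrong: work-a waits for coordinate to finish, but coordinate (the lead) is
# waiting for its workers to report, so neither step can complete.
steps:
  - name: coordinate
    dependsOn: [context]
  - name: work-a
    dependsOn: [coordinate]
---
# Right: the lead and the workers both depend on the shared context step and
# start together; a later step joins them.
steps:
  - name: context
    type: deterministic
  - name: work-a
    dependsOn: [context]
  - name: coordinate
    dependsOn: [context]
  - name: merge
    dependsOn: [work-a, coordinate]
```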
Step Sizing
One agent, one deliverable. A step's task prompt should be 10-20 lines max.
steps:
  - name: track-lead-coord
    agent: track-lead
    dependsOn: [prior-step]
    task: |
      Lead the track on #my-track. Workers: track-worker-1, track-worker-2.
      Post assignments to the channel. Review worker output.
  - name: track-worker-1-impl
    agent: track-worker-1
    dependsOn: [prior-step]
    task: |
      Join #my-track. track-lead will post your assignment.
      Implement the file as directed.
    verification:
      type: exit_code
  - name: next-step
    dependsOn: [track-lead-coord]
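The 10-20 line guideline is easy to check mechanically before a workflow runs. The helper below is a small sketch of such a check; `oversizedTasks` and `TaskSpec` are hypothetical names, and the default limit of 20 lines simply takes the upper end of the range given above.

```typescript
// Hypothetical authoring check, not part of the SDK.
interface TaskSpec {
  name: string;
  task: string;
}

// Returns the names of steps whose task prompt exceeds maxLines lines.
function oversizedTasks(steps: TaskSpec[], maxLines = 20): string[] {
  return steps
    .filter((step) => step.task.trim().split(/\r?\n/).length > maxLines)
    .map((step) => step.name);
}
```

Any step this flags is a candidate for splitting into a lead plus workers on a shared channel.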
Supervisor Pattern
When you set .pattern('supervisor') (or hub-spoke, fan-out), the runner auto-assigns a supervisor agent as owner for worker steps. The supervisor monitors progress, nudges idle workers, and issues OWNER_DECISION.
Auto-hardening only activates for hub patterns — not pipeline or dag.
| Use case | Pattern | Why |
|---|---|---|
| Sequential, no monitoring | pipeline | Simple, no overhead |
| Workers need oversight | supervisor | Auto-owner monitors |
| Local/small models | supervisor | Supervisor catches stuck workers |
| All non-interactive | pipeline or dag | No PTY = no supervision needed |
Concurrency
Cap maxConcurrency at 4-6. Spawning 10+ agents simultaneously causes broker timeouts.
| Parallel agents | maxConcurrency |
|---|---|
| 2-4 | 4 (default safe) |
| 5-10 | 5 |
| 10+ | 6-8 max |
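The table can be encoded as a small lookup when workflows are generated programmatically. This is a sketch only: `recommendedMaxConcurrency` is a hypothetical helper, exactly 10 agents is assigned to the middle row because the table's ranges overlap there, and the 6-8 range is resolved to its lower end.

```typescript
// Hypothetical encoding of the concurrency table above.
function recommendedMaxConcurrency(parallelAgents: number): number {
  if (parallelAgents <= 4) return 4;  // 2-4 agents: default safe value
  if (parallelAgents <= 10) return 5; // 5-10 agents (assumption: 10 belongs here)
  return 6;                           // more than 10: table allows 6-8, start low
}
```

The result would be passed to `.maxConcurrency()`; raising it toward 8 is only worth trying if the broker shows no timeouts.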
Common Mistakes
| Mistake | Fix |
|---|---|
| All workflows run sequentially | Group independent workflows into parallel waves (4-7x speedup) |
| Every step depends on the previous one | Only add dependsOn when there's a real data dependency |
| Self-review step with no timeout | Set timeout: 300_000 (5 min) — Codex hangs in non-interactive review |
| One giant workflow per feature | Split into smaller workflows that can run in parallel waves |
| Adding exit instructions to tasks | Runner handles self-termination automatically |
| Setting timeoutMs on agents/steps | Use global .timeout() only |
| Using general channel | Set .channel('wf-name') for isolation |
| {{steps.X.output}} without dependsOn: ['X'] | Output won't be available yet |
| Requiring exact sentinel as only completion gate | Use exit_code or file_exists verification |
| Writing 100-line task prompts | Split into lead + workers on a channel |
| maxConcurrency: 16 with many parallel steps | Cap at 5-6 |
| Non-interactive agent reading large files via tools | Pre-read in deterministic step, inject via {{steps.X.output}} |
| Workers depending on lead step (deadlock) | Both depend on shared context step |
| fan-out/hub-spoke for simple parallel workers | Use dag instead |
| pipeline but expecting auto-supervisor | Only hub patterns auto-harden. Use .pattern('supervisor') |
| Workers without preset: 'worker' in one-shot DAG lead+worker flows | Add preset for clean stdout when chaining {{steps.X.output}} (not needed for interactive team patterns) |
| Using _ in YAML numbers (timeoutMs: 1_200_000) | YAML doesn't support _ separators |
| Workflow timeout under 30 min for complex workflows | Use 3600000 (1 hour) as default |
| Using require() in ESM projects | Check package.json for "type": "module" and use import if the project is ESM |
| Wrapping in async function main() in ESM | ESM supports top-level await, so no wrapper is needed |
| Using createWorkflowRenderer | Does not exist. Use .run({ cwd: process.cwd() }) |
| export default workflow(...)...build() | No .build(). Chain ends with .run(): the file must call .run(), not just export config |
| Relative import '../workflows/builder.js' | Use import { workflow } from '@agent-relay/sdk/workflows' |
| Hardcoded model strings (model: 'opus') | Use constants: import { ClaudeModels } from '@agent-relay/config' → model: ClaudeModels.OPUS |
| Thinking agent-relay run inspects exports | It executes the file as a subprocess. Only .run() invocations trigger steps |
| pattern('single') on cloud runner | Not supported. Use dag |
| pattern('supervisor') with one agent | Same agent is owner + specialist. Use dag |
| Invalid verification type (type: 'deterministic') | Only exit_code, output_contains, file_exists, custom are valid |
| Chaining {{steps.X.output}} from interactive agents | PTY output is garbled. Use deterministic steps or preset: 'worker' |
| Single step editing 4+ files | Agents modify 1-2 then exit. Split to one file per step with verify gates |
| Relying on agents to git commit | Agents emit markers without running git. Use deterministic commit step |
| File-writing steps without file_exists verification | exit_code auto-passes even if no file written |
| Manual peer fanout in handleChannelMessage() | Use broker-managed channel subscriptions: the broker fans out to all subscribers automatically |
| Client-side personaNames.has(from) filtering | Use relay.subscribe()/relay.unsubscribe(); only subscribed agents receive messages |
| Agents receiving noisy cross-channel messages during focused work | Use relay.mute({ agent, channel }) to silence non-primary channels without leaving them |
| Hardcoding all channels at spawn time | Use agent.subscribe() / agent.unsubscribe() for dynamic channel membership post-spawn |
| Using preset: 'worker' for Codex in interactive team patterns when coordination is needed | Codex interactive mode works fine with PTY channel injection. Drop the preset for interactive team patterns (keep it for one-shot DAG workers where clean stdout matters) |
| Separate reviewer agent from lead in interactive team | Merge lead + reviewer into one interactive Claude agent — reviews between rounds, fewer agents |
| Not printing PR URL after gh pr create | Add a final deterministic step: echo "PR: $(cat pr-url.txt)" or capture in the gh pr create command |
| Workflow ending without worktree + PR for cross-repo changes | Add setup-worktree at start and push-and-pr + cleanup-worktree at end |
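Some of these mistakes can be caught by a pre-flight lint before the workflow runs. As one example, the sketch below flags `{{steps.X.output}}` references that lack a matching `dependsOn` entry. `missingDependencies` and `StepSpec` are hypothetical names, and the pattern used to recognize step names is an assumption.

```typescript
// Hypothetical pre-flight lint, not part of the SDK.
interface StepSpec {
  name: string;
  task?: string;
  dependsOn?: string[];
}

function missingDependencies(steps: StepSpec[]): string[] {
  const problems: string[] = [];
  for (const step of steps) {
    const declared = new Set(step.dependsOn ?? []);
    const placeholder = /\{\{steps\.([A-Za-z0-9_-]+)\.output\}\}/g;
    for (const match of (step.task ?? '').matchAll(placeholder)) {
      const source = match[1];
      if (source !== undefined && !declared.has(source)) {
        problems.push(`${step.name}: uses output of "${source}" without listing it in dependsOn`);
      }
    }
  }
  return problems;
}
```

An empty result means every referenced step is declared as a dependency; it says nothing about the other rows in the table.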
YAML Alternative
```yaml
version: '1.0'
name: my-workflow
swarm:
  pattern: dag
  channel: wf-my-workflow
agents:
  - name: lead
    cli: claude
    role: Architect
  - name: worker
    cli: codex
    role: Implementer
workflows:
  - name: default
    steps:
      - name: plan
        agent: lead
        task: 'Produce a detailed implementation plan.'
      - name: implement
        agent: worker
        task: 'Implement: {{steps.plan.output}}'
        dependsOn: [plan]
        verification:
          type: exit_code
```
Available Swarm Patterns
dag (default), fan-out, pipeline, hub-spoke, consensus, mesh, handoff, cascade, debate, hierarchical, map-reduce, scatter-gather, supervisor, reflection, red-team, verifier, auction, escalation, saga, circuit-breaker, blackboard, swarm
See skill choosing-swarm-patterns for pattern selection guidance.