| name | deepswarm |
| description | Use when running parallel AI workers for any long-running or multi-turn batch API task. Auto-calculates optimal workers + stagger. Supports tiered delegation (V4 Pro orchestrator → V4 Flash workers). 99.95% API success rate at scale. |
| version | 2.0.0 |
| author | Hermes Agent |
| license | MIT |
| metadata | {"hermes":{"tags":["orchestration","parallel","workers","api-generation","swarm","batch","multi-turn","delegation","tiered-models"],"related_skills":["tmux-agent-orchestrator","hermes-agent"]}} |
DeepSwarm — Task-Agnostic Parallel Worker Orchestration
Spawn N parallel API workers for any long-running or multi-turn batch task. Auto-calculates optimal worker count and stagger delay. Supports tiered model delegation: orchestrator plans with a frontier model (V4 Pro), workers execute with a cheaper model (V4 Flash).
Overview
DeepSwarm 2.0 generalizes the proven orchestration pattern from the 19,331-trace generation project to any batch API task. You define a task — translations, reasoning traces, code reviews, summarization — and DeepSwarm parallelizes it across optimal workers with the right stagger for your API.
The core insight: API rate limits are a function of simultaneous connections, not total volume. Auto-calculated stagger + worker count = 99.95% success.
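A minimal sketch of the stagger idea, assuming thread-based workers. The names `launch_staggered` and `demo_worker` are illustrative and not part of DeepSwarm's scripts. Each worker starts a fixed delay after the previous one, so first requests do not reach the API at the same instant.

```python
import threading
import time


def launch_staggered(worker_fn, num_workers, stagger_seconds):
    """Start num_workers threads, pausing stagger_seconds between starts."""
    threads = []
    for worker_id in range(num_workers):
        thread = threading.Thread(target=worker_fn, args=(worker_id,))
        thread.start()
        threads.append(thread)
        # Offset the next worker so first requests do not land together.
        if worker_id < num_workers - 1:
            time.sleep(stagger_seconds)
    # Wait for every worker before returning so none is orphaned.
    for thread in threads:
        thread.join()


start_times = {}


def demo_worker(worker_id):
    # Stand-in for a real API worker: it only records when it started.
    start_times[worker_id] = time.monotonic()


launch_staggered(demo_worker, num_workers=3, stagger_seconds=0.05)
print(sorted(start_times.items()))
```

In a real run the stagger would be seconds rather than milliseconds, and `demo_worker` would be replaced by the function that loops over a worker's batch.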
When to Use
- Any batch API task: generation, translation, summarization, extraction, classification
- Long-running individual calls (30s+) that benefit from parallelization
- Multi-turn tasks where each worker loops through conversation turns
- Cost optimization via tiered delegation (orchestrator ≠ worker model)
- Crash-resilient batch processing (checkpointed, idempotent)
Don't use for:
- Quick calls under 10s (overhead not worth it — just loop)
- Tasks requiring inter-worker coordination (use delegate_task)
- Real-time interactive sessions (use tmux-agent-orchestrator)
Quick Start
hermes skills tap add amanning3390/deepswarm
python3 scripts/seed.py --task task.yaml
python3 scripts/swarm.py --task task.yaml --total 1000
python3 scripts/filter.py --input-dir output/ --output clean.jsonl --errors errors.jsonl
Task Definition (task.yaml)
task_type: generation
prompt_template: |
  You are an AI assistant. {{seed}}
orchestrator_model: deepseek-v4-pro
worker_model: deepseek-v4-flash
worker_api_base: https://api.deepseek.com/v1/chat/completions
worker_max_tokens: 4096
multi_turn: true
max_turns: 20
seeds_file: seeds.jsonl
workers: auto
stagger: auto
batch_size: auto
output_dir: output/
output_format: jsonl
checkpoint_every: 10
worker_script: custom_worker.py
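As a rough illustration of how prompt_template and seeds_file might fit together, the sketch below fills the {{seed}} placeholder for each line of a JSONL seed file. The `seed` key and the helper names `load_seeds` and `render_prompt` are assumptions. The actual seed schema and rendering logic live in the DeepSwarm scripts and may differ.

```python
import json


def load_seeds(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def render_prompt(template, seed_text):
    """Substitute the {{seed}} placeholder with one seed's text."""
    return template.replace("{{seed}}", seed_text)


template = "You are an AI assistant. {{seed}}"
# Inline stand-in for the contents of seeds.jsonl (assumed schema).
seed_lines = [
    '{"seed": "Translate: bonjour"}',
    "",
    '{"seed": "Translate: merci"}',
]

seeds = load_seeds(seed_lines)
prompts = [render_prompt(template, item["seed"]) for item in seeds]
for prompt in prompts:
    print(prompt)
```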
Tiered Model Delegation
Orchestrator (V4 Pro) and workers (V4 Flash) can use different models:
User Task → V4 Pro (plans, monitors)
├─ V4 Flash Worker 0 → API → output/
├─ V4 Flash Worker 1 → API → output/
├─ V4 Flash Worker 2 → API → output/
└─ ...
Why tiered delegation matters:
- V4 Pro costs ~3× V4 Flash per token
- Orchestrator only plans + monitors (few calls)
- Workers make thousands of calls — use the cheapest model that works
- Typical savings: 60-70% vs using V4 Pro for everything
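The savings figure can be checked with simple arithmetic. The sketch below assumes cost is proportional to tokens and takes the ~3× price ratio from the list above. The function name and the worker token shares are illustrative, not measured values.

```python
def tiered_savings(worker_token_share, price_ratio=3.0):
    """Fraction saved versus running every token on the expensive model.

    worker_token_share: share of all tokens handled by workers (0 to 1).
    price_ratio: expensive-model price divided by cheap-model price.
    """
    if not 0.0 <= worker_token_share <= 1.0:
        raise ValueError("worker_token_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    # Baseline cost is normalized to 1.0 (everything on the expensive model).
    tiered_cost = (1.0 - worker_token_share) + worker_token_share / price_ratio
    return 1.0 - tiered_cost


for share in (0.90, 0.95, 1.00):
    print(f"worker share {share:.0%}: savings {tiered_savings(share):.1%}")
```

With a 3× ratio this gives 60.0%, 63.3%, and 66.7%. Savings cannot exceed about 67% at exactly 3×, so the upper end of the 60-70% range implies a price ratio somewhat above 3×.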
When to use same model for both:
- Task quality requires frontier reasoning at every step
- Worker model doesn't support the required format
- Budget allows it and quality is paramount
Auto-Optimization
When workers: auto and stagger: auto:
- DeepSwarm runs a single calibration call to measure call duration
- Calculates optimal workers: min(8, floor(rate_limit / call_duration))
- Sets stagger: call_duration / workers × 2
- Adjusts batch_size: total / workers
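The three formulas can be sketched as one function. This is a literal reading under stated assumptions: `rate_limit` and `call_duration` are taken to be in compatible units (the source does not specify them), the stagger formula is evaluated left to right as (call_duration / workers) × 2, batch_size is rounded up so every item is covered, and at least one worker is always used. The name `auto_tune` is illustrative.

```python
import math


def auto_tune(rate_limit, call_duration, total, max_workers=8):
    """Return (workers, stagger, batch_size) from the formulas above."""
    if call_duration <= 0 or total <= 0:
        raise ValueError("call_duration and total must be positive")
    # Cap at max_workers; floor at 1 worker (the floor is an assumption).
    workers = max(1, min(max_workers, math.floor(rate_limit / call_duration)))
    # Left-to-right reading of "call_duration / workers x 2" (assumption).
    stagger = call_duration / workers * 2
    # Round up so the last partial batch is not dropped (assumption).
    batch_size = math.ceil(total / workers)
    return workers, stagger, batch_size


workers, stagger, batch_size = auto_tune(
    rate_limit=480, call_duration=45, total=1000
)
print(workers, stagger, batch_size)
```

For these example inputs the result is 8 workers, an 11.25s stagger, and batches of 125 items.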
Calibration table (pre-computed):
| Call Duration | Workers | Stagger | Success | Throughput |
|---|---|---|---|---|
| <10s | 16 | 1s | 99.9% | ~5,760/hr |
| 10-30s | 12 | 2s | 99.9% | ~1,440/hr |
| 30-60s | 8 | 5s | 99.95% | ~440/hr |
| 60-90s | 6 | 10s | 99.9% | ~240/hr |
| >90s | 4 | 15s | 99.9% | ~96/hr |
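For a manual override, the table can be treated as a lookup keyed on call duration. The sketch below copies the worker and stagger columns from the table. How a duration that falls exactly on a shared boundary (10, 30, 60, or 90s) is bucketed is an assumption, and `preset_for` is an illustrative name. Note that the table lists worker counts above the cap of 8 used in the formula, so the two appear to be separate starting points.

```python
from bisect import bisect_right

# Boundaries (seconds) separating the five duration brackets.
DURATION_BOUNDS = [10, 30, 60, 90]
# (workers, stagger_seconds) per bracket, copied from the table.
PRESETS = [(16, 1), (12, 2), (8, 5), (6, 10), (4, 15)]


def preset_for(call_duration):
    """Return (workers, stagger_seconds) for a measured call duration."""
    if call_duration < 0:
        raise ValueError("call_duration must be non-negative")
    # A duration equal to a boundary goes to the slower bracket (assumption).
    return PRESETS[bisect_right(DURATION_BOUNDS, call_duration)]


for duration in (5, 45, 120):
    print(duration, preset_for(duration))
```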
Multi-Turn Task Support
For tasks requiring conversation loops (generation, debugging, interactive work):
Worker loop:
for each seed:
    messages = [system_prompt, user_task]
    for turn in range(max_turns):
        response = api_call(messages, model=worker_model)
        messages.append({"role": "assistant", "content": response})
        if task_complete(response):
            break
        if needs_tool_call(response):
            messages.append(simulate_tool_response(response))
Each turn is an independent API call. Multi-turn tasks benefit most from parallelization because per-task latency is high.
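The loop above can be exercised without a network call by injecting a stub in place of the API. In this sketch the completion marker ("DONE"), the "CALL:" tool convention, and the `fake_api` stub are all hypothetical stand-ins. A real worker would supply its own checks and an actual API client.

```python
def task_complete(reply):
    # Hypothetical completion marker; a real task defines its own check.
    return "DONE" in reply


def needs_tool_call(reply):
    # Hypothetical convention: replies starting with "CALL:" request a tool.
    return reply.startswith("CALL:")


def simulate_tool_response(reply):
    # Returned as a user message to stay API-neutral (an assumption).
    tool_name = reply[len("CALL:"):].strip()
    return {"role": "user", "content": f"TOOL_RESULT[{tool_name}]: ok"}


def run_conversation(seed, api_call, max_turns=20):
    """Run one seed through the multi-turn loop and return the transcript."""
    messages = [
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": seed},
    ]
    for _ in range(max_turns):
        reply = api_call(messages)  # one independent API call per turn
        messages.append({"role": "assistant", "content": reply})
        if task_complete(reply):
            break
        if needs_tool_call(reply):
            messages.append(simulate_tool_response(reply))
    return messages


def fake_api(messages):
    # Stub model: asks for one tool call, then declares completion.
    assistant_turns = sum(1 for m in messages if m["role"] == "assistant")
    return "CALL: lookup" if assistant_turns == 0 else "DONE"


history = run_conversation("Find the capital of France.", fake_api, max_turns=5)
for message in history:
    print(message["role"], "|", message["content"])
```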
Task-Agnostic Worker Design
The worker (worker.py) accepts a YAML task definition and executes any pipeline:
def run_task(seed, config):
    # Build the initial conversation for this seed from the task config.
    messages = build_messages(seed, config)
    for turn in range(config["max_turns"]):
        response = call_api(messages, config)
        if is_complete(response, config):
            return finish(response, messages)
        if needs_continuation(response, config):
            messages = append_turn(messages, response, config)
    # Turn budget exhausted without completion: return the conversation so far.
    return messages
Built-in task types:
generation — Generate content from seed (the trace generation pattern)
translation — Translate each seed text
summarization — Summarize each seed document
classification — Classify each seed input
custom — Uses worker_script for completely custom logic
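One way to picture the task-type switch is a dispatch table from task_type to a prompt builder, with custom deferring to user-supplied logic. The prompt wording and the names `BUILDERS` and `build_user_prompt` are illustrative assumptions, not the prompts DeepSwarm ships with.

```python
# Hypothetical prompt builders, one per built-in task type.
BUILDERS = {
    "generation": lambda seed: seed,
    "translation": lambda seed: f"Translate the following text:\n{seed}",
    "summarization": lambda seed: f"Summarize the following document:\n{seed}",
    "classification": lambda seed: f"Classify the following input:\n{seed}",
}


def build_user_prompt(task_type, seed, custom_builder=None):
    """Return the user prompt for one seed, based on task_type."""
    if task_type == "custom":
        # Stand-in for logic that would be loaded from worker_script.
        if custom_builder is None:
            raise ValueError("task_type 'custom' requires a custom_builder")
        return custom_builder(seed)
    try:
        builder = BUILDERS[task_type]
    except KeyError:
        raise ValueError(f"unknown task_type: {task_type!r}") from None
    return builder(seed)


print(build_user_prompt("translation", "hola"))
print(build_user_prompt("custom", "hola", custom_builder=str.upper))
```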
Common Pitfalls
- workers: auto choosing too many. If the calibration call was fast but actual calls are slow, override the worker count manually.
- Forgetting to stagger between workers. Even 2 workers starting in the same millisecond can trigger rate limits on slow APIs.
- Mixing models without checking format compatibility. Worker model must support the same prompt format as orchestrator.
- Not checkpointing. A worker that dies at item 120 of 125 without a checkpoint loses all of that work. Checkpoint every 10 items.
- Shell & without wait. Without wait, the shell exits early and kills child workers.
- Using V4 Pro for workers when V4 Flash would work. Check quality on a 10-sample test before committing to the expensive model.
- Not deleting error outputs before restarting. Error files consume indices and inflate disk usage.
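The checkpointing pitfall can be illustrated with a small resume loop. This is a sketch under assumptions: the checkpoint is a JSON list of completed seed indices, written atomically every `checkpoint_every` items. The file layout and function names are hypothetical and may not match what swarm.py does.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of completed seed indices (empty if no checkpoint)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as handle:
        return set(json.load(handle))


def save_done(checkpoint_path, done):
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(done), handle)
    os.replace(tmp_path, checkpoint_path)


def process_batch(seeds, handle_seed, checkpoint_path, checkpoint_every=10):
    """Process seeds not yet checkpointed; return how many ran this time."""
    done = load_done(checkpoint_path)
    processed_now = 0
    for index, seed in enumerate(seeds):
        if index in done:
            continue  # finished before the restart, so skip it
        handle_seed(index, seed)
        done.add(index)
        processed_now += 1
        if processed_now % checkpoint_every == 0:
            save_done(checkpoint_path, done)
    save_done(checkpoint_path, done)
    return processed_now


def crash_at_25(index, seed):
    # Simulated worker failure partway through the batch.
    if index == 25:
        raise RuntimeError("simulated worker crash")


checkpoint = os.path.join(tempfile.mkdtemp(), "worker0.ckpt.json")
seeds = [f"seed-{i}" for i in range(30)]

try:
    process_batch(seeds, crash_at_25, checkpoint)
except RuntimeError:
    pass

resumed = process_batch(seeds, lambda index, seed: None, checkpoint)
print("items processed after restart:", resumed)
```

After the simulated crash, only the items since the last checkpoint are repeated (10 of 30 here). Because those few items run twice, the per-seed handler should be idempotent.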
Verification Checklist