run-scaling-test
// Run an ECS scaling validation test with parallel monitoring, baseline enforcement, and automated result documentation
| name | run-scaling-test |
| description | Run an ECS scaling validation test with parallel monitoring, baseline enforcement, and automated result documentation |
Execute an ECS scaling validation scenario from the SIPp load test harness (../asset-scaling-load-test) with parallel monitoring, baseline enforcement, and automated result documentation in docs/results/scaling-tests/.

## When to Use Me

Use this skill when you want to run a scaling validation test against the deployed voice agent stack. Available scenarios:
| Scenario | Duration | What It Tests |
|---|---|---|
| `steady-state` | ~22 min | Phased scale-out/in with cold start gap |
| `burst` | ~10 min | Heavy-load proportional scale-out |
| `scale-in-protection` | ~15 min | Task protection during scale-in |
| `sustained-24` | ~30 min | Near-max capacity soak test |
- Scenario definitions: `../asset-scaling-load-test/scenarios/*.yaml`
- Run a scenario: `uv run load-test run --scenario <name>` (in `../asset-scaling-load-test`)
- Watch metrics: `uv run python scripts/poll_metrics.py --watch` (in `../asset-scaling-load-test`)
- ECS cluster and service: `voice-agent-poc-poc-voice-agent`
- AWS profile: `voice-agent`
- Results directory: `docs/results/scaling-tests/`

Ask the user which scenario they want to run. The default recommendation is `steady-state` for first-time validation. Show the table from "When to Use Me" above.
Before starting, verify all preconditions. Run these checks:
# Check ECS: must be 1 running, 1 desired, 0 pending
aws ecs describe-services \
--cluster voice-agent-poc-poc-voice-agent \
--services voice-agent-poc-poc-voice-agent \
--profile voice-agent --region us-east-1 \
--query 'services[0].{running:runningCount,desired:desiredCount,pending:pendingCount}'
Expected: {"running": 1, "desired": 1, "pending": 0}
# Check SIPp instance is reachable, no running SIPp processes
uv run python scripts/run_sipp.py status
Expected: Instance accessible, "No SIPp processes running"
# Check audio files are present on EC2
uv run python scripts/ec2_shell.py "ls /opt/sipp/audio/calls_pcmu/ | wc -l"
Expected: > 0 files
# One-shot metrics check: sessions should be 0
uv run python scripts/poll_metrics.py
Expected: Active: 0 or no active count
All SIPp/metrics commands run with workdir set to ../asset-scaling-load-test.
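
The ECS check above reduces to a single pass/fail decision. As a minimal sketch, assuming the JSON printed by the `describe-services` query has been captured as a string (the CLI's default JSON output), the comparison could look like this. `baseline_met` is a hypothetical helper for illustration, not part of the load test harness.

```python
import json

# Baseline from the check above: one running task, one desired, none pending.
EXPECTED_BASELINE = {"running": 1, "desired": 1, "pending": 0}


def baseline_met(describe_services_output: str) -> bool:
    """Return True when the ECS counts match the expected idle baseline.

    `describe_services_output` is the JSON text printed by the
    `aws ecs describe-services ... --query` command shown above.
    """
    # The query prints `null` when the service is not found; treat that as "not met".
    counts = json.loads(describe_services_output) or {}
    return all(counts.get(key) == value for key, value in EXPECTED_BASELINE.items())


if __name__ == "__main__":
    print(baseline_met('{"running": 1, "desired": 1, "pending": 0}'))
    print(baseline_met('{"running": 2, "desired": 2, "pending": 0}'))
```

A `False` result is the signal to move on to the recovery steps rather than start the scenario.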
If baseline is NOT met, attempt recovery:
1. `uv run python scripts/run_sipp.py stop` (kill stale SIPp)
2. `aws ecs update-service --cluster voice-agent-poc-poc-voice-agent --service voice-agent-poc-poc-voice-agent --desired-count 1 --profile voice-agent --region us-east-1`
3. Re-check until `runningCount == 1`

Before starting the scenario, launch two parallel Task tool sub-agents:
Agent A -- Metrics Watcher:
Launch a general sub-agent with this prompt:
Run the following command in workdir `/Users/schuettc/Documents/GitHub/ml-frameworks-voice/asset-scaling-load-test` with a timeout of 5400000ms (90 minutes):

`uv run python scripts/poll_metrics.py --watch --interval 30`

Capture all output. When the command is terminated, return the complete output.
Agent B -- NLB Distribution Checker:
Launch a general sub-agent with this prompt:
Run the following bash loop in workdir `/Users/schuettc/Documents/GitHub/ml-frameworks-voice/asset-scaling-load-test` with a timeout of 5400000ms:

`for i in $(seq 1 180); do echo "=== SNAPSHOT $(date -u +%H:%M:%S) ===" ; aws dynamodb scan --table-name $(aws ssm get-parameter --name /voice-agent/dynamodb/session-table-name --profile voice-agent --region us-east-1 --query 'Parameter.Value' --output text) --filter-expression "begins_with(PK, :prefix) AND SK = :sk" --expression-attribute-values '{":prefix":{"S":"TASK#"},":sk":{"S":"HEARTBEAT"}}' --projection-expression "PK, active_session_count" --profile voice-agent --region us-east-1 --output table 2>/dev/null || echo "(no heartbeats)" ; sleep 30 ; done`

Capture all output. When done, return the complete output showing per-task session distribution over time.
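
To turn one heartbeat snapshot into per-task rows with a share of total sessions, something like the following could work. It is a sketch under two assumptions: the scan is run with `--output json` instead of `--output table`, so the `Items` list can be parsed, and `active_session_count` is stored as a DynamoDB number. `session_distribution` is a hypothetical helper, not part of the harness.

```python
from typing import Dict, List, Tuple

TASK_PREFIX = "TASK#"


def session_distribution(items: List[Dict]) -> List[Tuple[str, int, float]]:
    """Turn heartbeat items into (task_id, active_sessions, percent_of_total) rows.

    Assumes DynamoDB attribute-value JSON, for example:
    {"PK": {"S": "TASK#abc"}, "active_session_count": {"N": "3"}}
    """
    rows = []
    for item in items:
        pk = item["PK"]["S"]
        task_id = pk[len(TASK_PREFIX):] if pk.startswith(TASK_PREFIX) else pk
        # Treat a missing counter as zero sessions (assumption).
        sessions = int(item.get("active_session_count", {}).get("N", "0"))
        rows.append((task_id, sessions))

    total = sum(sessions for _, sessions in rows)
    return [
        (task_id, sessions, round(100.0 * sessions / total, 1) if total else 0.0)
        for task_id, sessions in sorted(rows)
    ]


if __name__ == "__main__":
    snapshot = [
        {"PK": {"S": "TASK#a"}, "active_session_count": {"N": "3"}},
        {"PK": {"S": "TASK#b"}, "active_session_count": {"N": "1"}},
    ]
    for row in session_distribution(snapshot):
        print(row)
```

Applied to the snapshot taken at peak call count, these rows map directly onto the Task ID, Active Sessions, and % of Total columns of the report.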
Run the scenario in the foreground (in workdir: ../asset-scaling-load-test):
uv run load-test run --scenario <SCENARIO_NAME> --config config.yaml
Set a generous timeout based on the scenario:
- `steady-state`: 1800000ms (30 min)
- `burst`: 1200000ms (20 min)
- `scale-in-protection`: 1200000ms (20 min)
- `sustained-24`: 3600000ms (60 min)

Capture the full output printed by the harness, including the metrics summary table.
After the scenario completes:
- Read the results file `../asset-scaling-load-test/results/<scenario>-*.json` (the most recent one)

Create a markdown report at:
docs/results/scaling-tests/<scenario>-<YYYY-MM-DD-HHmmss>.md
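
Picking the most recent results file and building the report path can be sketched as below. Two assumptions are labeled in the code: "most recent" is read as the latest modification time, and the filename timestamp uses UTC to match the report's `Date` field. Both function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def latest_results_file(results_dir: Path, scenario: str) -> Optional[Path]:
    """Return the newest <scenario>-*.json in results_dir, or None if there is none.

    Assumption: "most recent" means the latest modification time.
    """
    candidates = list(results_dir.glob(f"{scenario}-*.json"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def report_path(scenario: str, now: Optional[datetime] = None) -> Path:
    """Build docs/results/scaling-tests/<scenario>-<YYYY-MM-DD-HHmmss>.md.

    Assumption: the timestamp is taken in UTC.
    """
    now = now or datetime.now(timezone.utc)
    return Path("docs/results/scaling-tests") / f"{scenario}-{now:%Y-%m-%d-%H%M%S}.md"


if __name__ == "__main__":
    print(latest_results_file(Path("../asset-scaling-load-test/results"), "steady-state"))
    print(report_path("steady-state"))
```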
Use this template:
# Scaling Test Results: <scenario-name>
**Date**: <YYYY-MM-DD HH:MM UTC>
**Duration**: <M>m <S>s
**Result**: PASSED / FAILED
## Scaling Configuration
| Parameter | Value |
|-----------|-------|
| Target tracking target | 3 (SessionsPerTask avg) |
| Scale-in | -3 per 30s (AvgSessionsPerTask < 1.0) |
| Max capacity | 12 |
| MAX_CONCURRENT_CALLS | 10 |
## Cold Start Timing (Measured)
New tasks take **~90s** from creation to receiving traffic. This is an
irreducible floor driven by the container image size (~824 MB compressed).
| Phase | Duration | Cumulative |
|-------|----------|-----------|
| ENI attach + scheduling | ~14s | 14s |
| Image pull (824 MB) | ~37s | 51s |
| Container init (app start) | ~17s | 68s |
| NLB health check (2 x 10s) | ~20s | **~88s** |
On top of this, the scaling **decision pipeline** adds 1-3 min:
| Phase | Duration |
|-------|----------|
| Session counter Lambda emits metric | every 60s |
| CloudWatch alarm evaluates (1 min period) | 0-60s |
| Target tracking reacts | 0-60s |
**Total: overload detected -> new task serving traffic = ~3-5 min.**
Test scenarios MUST account for this lag. Any calls placed before new
capacity is ready will be rejected (503) by the overloaded task's
`/ready` endpoint. Design scenarios with a wait phase between filling
the first task past the target and sending overflow calls.
## Assertions
| # | Assertion | Expected | Actual | Result |
|---|-----------|----------|--------|--------|
(populate from scenario results)
## Scaling Timeline
| Time | Event | Tasks (R/D) | SessionsPerTask | ActiveCount |
|------|-------|-------------|-----------------|-------------|
(populate from metrics history -- key inflection points)
## NLB Distribution (at peak)
| Task ID | Active Sessions | % of Total |
|---------|----------------|------------|
(populate from Agent B DynamoDB snapshots at peak call count)
## Metrics Summary (last 10 snapshots)
(paste the metrics summary table from the harness output)
## Call Summary
| Metric | Value |
|--------|-------|
| Total calls | N |
| Completed | N |
| Dropped | N |
| Failed | N |
Summarize: