Name: Rosetta Add Framework Tier Test
Author: Arize-ai

Updated: May 19, 2026 at 20:26

SKILL.md


name	rosetta-add-framework-tier-test
description	Test a freshly-built tier for a new framework — boots the backend, smoke-tests the chat endpoint, runs synthetic requests, and (for phoenix) runs the eval harness. Verifies traces land in the right project. Part of the rosetta-add-framework flow; can be invoked standalone after a build to validate.

Tier Test

Validate a single tier (no-observability, phoenix, or ax) after build.

Inputs

FRAMEWORK_DIR — e.g. google-adk-py
TIER — no-observability | phoenix | ax

Steps

1. Stop anything already on backend ports

lsof -ti :8001 2>/dev/null | xargs -r kill 2>/dev/null
sleep 1

2. Start the backend

The canonical way to start it is `npm run dev`, which boots ChromaDB, the Python backend, and Next.js. That is fine for a focused tier test; the chat tests just send their requests directly to the Python backend on :8001.

cd "$TIER/$FRAMEWORK_DIR"
npm run dev > /tmp/rosetta-tier-test.log 2>&1 &
DEV_PID=$!
until curl -sf http://127.0.0.1:8001/products/featured > /dev/null 2>&1; do
  if ! kill -0 $DEV_PID 2>/dev/null; then
    echo "dev script died — see /tmp/rosetta-tier-test.log"
    tail -30 /tmp/rosetta-tier-test.log
    exit 1
  fi
  sleep 2
done
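The loop above exits only when the endpoint responds or the dev script dies; if the process stays alive but the backend never binds :8001, it keeps polling. A minimal sketch of a bounded wait follows. `wait_until` is a hypothetical helper, not part of the repo, and the 120-second budget in the usage comment is an arbitrary assumption.

```shell
# wait_until is a hypothetical helper, not part of the repo.
# Usage: wait_until <timeout_seconds> <command> [args...]
# Re-runs the command about once per second until it succeeds (returns 0)
# or the timeout elapses (returns 1).
wait_until() {
  _wu_timeout=$1; shift
  _wu_deadline=$(( $(date +%s) + _wu_timeout ))
  until "$@" > /dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$_wu_deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Example (assumed 120 s budget):
#   wait_until 120 curl -sf http://127.0.0.1:8001/products/featured || {
#     echo "backend not ready after 120s"
#     tail -30 /tmp/rosetta-tier-test.log
#     exit 1
#   }
```

This does not replace the `kill -0 $DEV_PID` liveness check, which still reports a crashed dev script sooner than a timeout would.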

3. Smoke tests (all tiers)

Three chat tests against :8001/chat (Test C sends two requests):

# Test A: conversational (no tools expected)
curl -sN -X POST http://127.0.0.1:8001/chat \
  -H "Content-Type: application/json" \
  -H "x-user-id: test-a" \
  -d '{"messages":[{"role":"user","content":"Just say hi back in one sentence, no tools."}]}' \
  --max-time 60 | head -10

# Test B: tool call (should trigger search_products)
curl -sN -X POST http://127.0.0.1:8001/chat \
  -H "Content-Type: application/json" \
  -H "x-user-id: test-b" \
  -d '{"messages":[{"role":"user","content":"Show me one dinosaur toy"}]}' \
  --max-time 90 | head -25

# Test C: multi-turn (history persistence)
# First turn primes the per-user history
curl -sN -X POST http://127.0.0.1:8001/chat \
  -H "x-user-id: test-c" -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"My favorite color is teal. Acknowledge in one sentence."}]}' \
  --max-time 30 > /dev/null
# Second turn should remember
curl -sN -X POST http://127.0.0.1:8001/chat \
  -H "x-user-id: test-c" -H "Content-Type: application/json" \
  -d '{"messages":[
    {"role":"user","content":"My favorite color is teal. Acknowledge in one sentence."},
    {"role":"assistant","content":"Got it!"},
    {"role":"user","content":"What is my favorite color? One word, no tools."}
  ]}' \
  --max-time 30 | tail -3

Pass criteria: all three tests return SSE chunks ending in `data: [DONE]`, and /tmp/rosetta-tier-test.log contains no `ERROR:` lines (ignoring ExperimentalWarning / DeprecationWarning). Test C's response should contain "teal".
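The pass criteria can be checked mechanically rather than by eye. The sketch below assumes the stream's final event is the literal line `data: [DONE]`, as the criteria state; `sse_done` and `log_clean` are hypothetical helpers, not part of the repo. Because the smoke commands pipe through `head` and `tail` for display, the check needs the untruncated response, for example one saved to a file.

```shell
# sse_done and log_clean are hypothetical helpers, not part of the repo.

# sse_done: reads an SSE response on stdin and succeeds when the last
# non-empty line is exactly "data: [DONE]" (carriage returns are ignored).
sse_done() {
  _sd_last=$(tr -d '\r' | grep -v '^[[:space:]]*$' | tail -n 1)
  [ "$_sd_last" = "data: [DONE]" ]
}

# log_clean: succeeds when the log file has no "ERROR:" lines, ignoring
# lines that mention ExperimentalWarning or DeprecationWarning.
# Returns 2 if the log file cannot be read.
log_clean() {
  [ -r "$1" ] || return 2
  ! grep 'ERROR:' "$1" | grep -v -E 'ExperimentalWarning|DeprecationWarning' | grep -q .
}

# Example:
#   curl -sN -X POST http://127.0.0.1:8001/chat ... > /tmp/smoke-a.txt
#   sse_done < /tmp/smoke-a.txt && echo "smoke A: PASS"
#   log_clean /tmp/rosetta-tier-test.log || echo "errors in backend log"
```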

4. Tier-specific verification

`no-observability` — stop here

No traces to verify. Kill the backend and report.

`phoenix` tier

Trace verification + full eval harness.

# Wait for ingestion
sleep 10

# Verify a trace landed in the configured project
PHOENIX_PROJECT=$(grep '^PHOENIX_PROJECT_NAME=' .env.local | cut -d= -f2)
# Use the virtualenv at the repo root (assumes the repo is a git checkout)
"$(git rev-parse --show-toplevel)/.venv/bin/python" <<EOF
from phoenix.client import Client
c = Client(base_url="http://localhost:6006")
df = c.spans.get_spans_dataframe(project_identifier="$PHOENIX_PROJECT", limit=20)
print(f"spans: {len(df)}")
assert len(df) > 0, "no spans landed — check tracing.py"
EOF

# Stop backend (synthetic-requests will boot a fresh one)
lsof -ti :8001 -i :3000 2>/dev/null | xargs -r kill 2>/dev/null
sleep 1

# Run the full eval harness (~10 min)
npm run synthetic-requests 2>&1 | grep -E "^User:|ERROR|Response \(|Done!" | tail -30
sleep 30  # ingestion
npm run evals 2>&1 | tail -10

Pass criteria: all 25 synthetic requests return non-error responses, and `npm run evals` logs annotations to Phoenix without crashing.
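The project name is looked up with `grep | cut -d= -f2`, which returns only the text between the first and second `=` and leaves any surrounding quotes attached. A sketch of a more forgiving lookup follows, assuming `.env.local` holds plain `KEY=value` lines; `env_value` is a hypothetical helper, not part of the repo.

```shell
# env_value is a hypothetical helper, not part of the repo.
# Usage: env_value KEY [FILE]   (FILE defaults to .env.local)
# Prints the value of the last KEY=value line, keeping any "=" inside the
# value and stripping one pair of surrounding single or double quotes.
# Prints nothing if the key is absent.
env_value() {
  sed -n "s/^$1=//p" "${2:-.env.local}" \
    | tail -n 1 \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Example:
#   PHOENIX_PROJECT=$(env_value PHOENIX_PROJECT_NAME)
```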

`ax` tier

Trace verification plus synthetic requests. There is no eval harness to run because evals are configured in the AX UI.

sleep 30  # AX ingestion is slower than Phoenix
ARIZE_PROJECT=$(grep '^ARIZE_PROJECT_NAME=' .env.local | cut -d= -f2)
set -a; source .env.local; set +a
ax traces list "$ARIZE_PROJECT" \
  --space "$ARIZE_SPACE_ID" \
  --start-time "$(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u -v+1M +%Y-%m-%dT%H:%M:%SZ)" \
  --limit 10 --output /tmp/rosetta-ax-traces.json > /dev/null 2>&1

python3 <<EOF
import json
data = json.load(open('/tmp/rosetta-ax-traces.json'))
spans = data.get('spans', data) if isinstance(data, dict) else data
print(f"traces found: {len(spans)}")
assert len(spans) >= 3, "expected at least 3 traces (smoke A+B+C)"
EOF

# Then synthetic-requests
lsof -ti :8001 -i :3000 2>/dev/null | xargs -r kill 2>/dev/null
sleep 1
npm run synthetic-requests 2>&1 | grep -E "^User:|ERROR|Response \(|Done!" | tail -30

Pass criteria: at least 3 traces visible in AX from the smoke test; 25 synthetic requests all return non-error responses.
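The `date -u -v-15M` form used for the time window is BSD/macOS syntax; GNU `date` rejects `-v` and expects `-d` with a relative expression instead. A sketch that produces the same timestamps on either follows. `utc_offset_minutes` is a hypothetical helper, not part of the repo, and other `date` implementations (BusyBox, for example) are not covered.

```shell
# utc_offset_minutes is a hypothetical helper, not part of the repo.
# Usage: utc_offset_minutes <signed minutes>   e.g. -15 or +1
# Prints a UTC timestamp in the form YYYY-MM-DDTHH:MM:SSZ.
utc_offset_minutes() {
  _uo_fmt='+%Y-%m-%dT%H:%M:%SZ'
  if date -u -v+0M "$_uo_fmt" > /dev/null 2>&1; then
    # BSD/macOS date: -v adjusts the current time by the signed amount
    date -u -v"$1"M "$_uo_fmt"
  else
    # GNU date: -d accepts a relative time expression
    date -u -d "$1 minutes" "$_uo_fmt"
  fi
}

# Example:
#   --start-time "$(utc_offset_minutes -15)" --end-time "$(utc_offset_minutes +1)"
```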

5. Cleanup

lsof -ti :3000 -i :8001 2>/dev/null | xargs -r kill 2>/dev/null
# Don't touch :8000 (ChromaDB) or :6006 (Phoenix) — orchestrator manages those

Output

TIER_TESTED: <TIER>/<FRAMEWORK_DIR>
  smoke A (greeting): PASS
  smoke B (tool call): PASS
  smoke C (multi-turn): PASS  (recalled history: yes|no)
  traces verified: <count>
  synthetic-requests: 25/25 OK | <N>/25 OK (see log)
  evals: PASS | n/a | FAIL <error>
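If the results are collected in shell variables, the report above can be emitted with a small function. This is only a sketch: `print_tier_report` and its positional arguments are hypothetical, not part of the repo, and the total of 25 comes from the pass criteria.

```shell
# print_tier_report is a hypothetical helper, not part of the repo.
# Usage: print_tier_report TIER FRAMEWORK_DIR SMOKE_A SMOKE_B SMOKE_C \
#                          RECALLED TRACES SYNTH_OK EVALS
print_tier_report() {
  printf 'TIER_TESTED: %s/%s\n' "$1" "$2"
  printf '  smoke A (greeting): %s\n' "$3"
  printf '  smoke B (tool call): %s\n' "$4"
  printf '  smoke C (multi-turn): %s  (recalled history: %s)\n' "$5" "$6"
  printf '  traces verified: %s\n' "$7"
  if [ "$8" -eq 25 ]; then
    printf '  synthetic-requests: 25/25 OK\n'
  else
    printf '  synthetic-requests: %s/25 OK (see log)\n' "$8"
  fi
  printf '  evals: %s\n' "$9"
}

# Example:
#   print_tier_report phoenix google-adk-py PASS PASS PASS yes 12 25 PASS
```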

related-skills.json

same repository

rosetta-add-framework-discover.md

from "Arize-ai/project-rosetta-stone"

Refresh the list of agent frameworks supported by Arize tracing and diff against what's already in the repo. Pulls live data from https://arize.com/docs/llms.txt and produces a clean to-do list. Part of the rosetta-add-framework flow; can also be invoked standalone to answer "what frameworks are left to add?"

2026-05-19

rosetta-add-framework.md

from "Arize-ai/project-rosetta-stone"

Add a new agent framework to the Rosetta Stone repo — researches the framework, builds all three observability tiers (no-observability, phoenix, ax), tests each, runs a Playwright smoke against the UI, updates README + TODO, and raises a PR. Trigger when the user asks to "add the <framework> framework", "implement <framework>", "wire up <framework>", or similar. The framework must be one of the Arize-supported agent frameworks (see TODO below).

2026-05-19

rosetta-add-framework-tier-build.md

from "Arize-ai/project-rosetta-stone"

Build a single tier (no-observability, phoenix, or ax) for a new framework. Clones the closest existing tier, swaps in framework-specific agent.py / tools.py / requirements.txt, and (for observability tiers) adds tracing.py + main.py wiring + eval-harness scripts. Part of the rosetta-add-framework flow; can be invoked standalone to rebuild a single tier from scratch.

2026-05-19

rosetta-add-framework-playwright.md

from "Arize-ai/project-rosetta-stone"

Run a public-flow Playwright smoke test against a freshly-built framework tier's Next.js frontend. Covers home page rendering + product browsing — the parts that don't require X/Twitter OAuth. The Playwright project (package.json, config, tests) lives inside this skill directory and is checked into the repo. Part of the rosetta-add-framework flow.

2026-05-19

rosetta-add-framework-docs.md

from "Arize-ai/project-rosetta-stone"

Finalise a newly-added framework — updates the README's supported-frameworks table, directory tree, and per-framework "what differs" section, marks off the framework in the orchestrator skill's embedded TODO, commits per tier, and raises a PR. Part of the rosetta-add-framework flow.

2026-05-19

rosetta-demo-capture.md

from "Arize-ai/project-rosetta-stone"

Record a full Wonder Toys demo by running a canned 3-turn conversation (search dragons → buy plushie → ship), then opening Arize AX in Safari and screenshotting the session view plus every trace in it. Use when the user asks to "capture a demo", "record screenshots of an Arize session", "demo the agent flow", or any similar phrasing. macOS only — uses AppleScript and `screencapture`.

2026-05-19

package.json

"author": "Arize-ai"

"repository": "Arize-ai/project-rosetta-stone"
