test-run-analysis

Stars: 0

Forks: 0

Updated: February 6, 2026, 17:54

Analyze Freeplay test run results to surface insights, evaluation metrics, and test case details. Use when the user wants to review test run performance, compare two test runs, understand evaluation scores, or identify failing test cases. Do NOT use for executing new test runs (use run-test) or managing datasets (use dataset-management).


Source: freeplayai/freeplay-skills



name: test-run-analysis
description: Analyze Freeplay test run results to surface insights, evaluation metrics, and test case details. Use when the user wants to review test run performance, compare two test runs, understand evaluation scores, or identify failing test cases. Do NOT use for executing new test runs (use run-test) or managing datasets (use dataset-management).

Test Run Analysis

This skill helps you review Freeplay test run results, view evaluation metrics, and identify patterns across test cases.

When to use this skill

"Summarize the results of test run X"
"Summarize the major differences between two test runs in the same comparison"
"What were the metrics for my last test run?"
"Analyze test run [ID]"
"Which test cases failed in run X?"
"What are the evaluation scores from this test?"
The user mentions a specific test run ID and wants insights
The user wants to compare two test runs and summarize the differences in latency, cost, and evaluations
The user wants to pick a winner between two test runs based on the available data

Available API Endpoints

Get Test Run Summary

Endpoint: GET /api/v2/projects/{project_id}/test-runs/id/{test_run_id}

Returns:

{
  "id": "test-run-uuid",
  "name": "Test run name",
  "description": "Description",
  "created_at": 1234567890,
  "prompt_name": "Prompt template name",
  "prompt_version": "version-id",
  "model_name": "gpt-4",
  "sessions_count": 10,
  "summary_statistics": {
    "human_evaluation": {},
    "auto_evaluation": {
      "criteria_name": {
        "pass": 8,
        "fail": 2,
        "average_score": 0.85
      }
    },
    "client_evaluation": {}
  }
}

List All Test Runs

Endpoint: GET /api/v2/projects/{project_id}/test-runs

Returns a list of all test runs with their IDs and associated test case IDs.

How to Analyze a Test Run

When a user provides a test run ID, follow these steps:

1. Fetch Test Run Summary

Use curl or Python to call the API:

curl -H "Authorization: Bearer $FREEPLAY_API_KEY" \
     "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs/id/{test_run_id}"

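For the Python route, a minimal sketch using only the standard library might look like the following. The function names `build_summary_request` and `fetch_test_run_summary` are hypothetical helpers, not part of any Freeplay SDK; the URL path, the Bearer header, and the environment variable names are taken from the curl command above.

```python
import json
import os
import urllib.request


def build_summary_request(base_url, api_key, project_id, test_run_id):
    """Build the GET request for a single test run summary (hypothetical helper)."""
    url = (
        f"{base_url.rstrip('/')}/api/v2/projects/{project_id}"
        f"/test-runs/id/{test_run_id}"
    )
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_test_run_summary(project_id, test_run_id):
    """Call the endpoint and return the decoded JSON body as a dict."""
    request = build_summary_request(
        os.environ.get("FREEPLAY_BASE_URL", "https://app.freeplay.ai"),
        os.environ["FREEPLAY_API_KEY"],
        project_id,
        test_run_id,
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting request construction from the network call keeps the URL and header logic easy to check without contacting the API.
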
2. Extract Key Insights

From the response, identify:

Overall performance: sessions_count, average scores
Evaluation metrics: Parse the summary_statistics object
- auto_evaluation: LLM-as-judge metrics
- human_evaluation: Human review scores
- client_evaluation: Custom evaluation criteria
Prompt details: prompt_name, prompt_version, model_name
Timestamp: created_at (Unix timestamp)
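
As an illustration of the extraction step, the sketch below pulls these fields out of a summary shaped like the sample response shown earlier. `extract_overview` is a hypothetical helper name, and the code assumes `created_at` is expressed in seconds (as the sample value suggests) and should be shown in UTC.

```python
from datetime import datetime, timezone


def extract_overview(summary):
    """Pull the headline fields out of a test run summary dict."""
    stats = summary.get("summary_statistics", {})
    # Assumption: created_at is a Unix timestamp in seconds.
    created = datetime.fromtimestamp(summary["created_at"], tz=timezone.utc)
    return {
        "name": summary.get("name"),
        "prompt": summary.get("prompt_name"),
        "prompt_version": summary.get("prompt_version"),
        "model": summary.get("model_name"),
        "test_cases": summary.get("sessions_count"),
        "date": created.strftime("%b %d, %Y"),
        # Keep only the evaluation groups that actually contain criteria.
        "evaluation_groups": [
            group
            for group in ("auto_evaluation", "human_evaluation", "client_evaluation")
            if stats.get(group)
        ],
    }


sample = {
    "name": "Test run name",
    "created_at": 1234567890,
    "prompt_name": "Prompt template name",
    "prompt_version": "version-id",
    "model_name": "gpt-4",
    "sessions_count": 10,
    "summary_statistics": {
        "human_evaluation": {},
        "auto_evaluation": {
            "criteria_name": {"pass": 8, "fail": 2, "average_score": 0.85}
        },
        "client_evaluation": {},
    },
}

overview = extract_overview(sample)
print(overview["date"])               # Feb 13, 2009
print(overview["evaluation_groups"])  # ['auto_evaluation']
```
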

3. Analyze Metrics

For each evaluation criterion in summary_statistics:

Count pass/fail cases
Calculate pass rate percentage
Identify average scores
Flag criteria with low performance
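
These four steps can be sketched as a single pass over `summary_statistics`. The helper name `analyze_criteria` is hypothetical, the total is assumed to be `pass + fail`, and the 70% default threshold is borrowed from the Tips section rather than from the API.

```python
def analyze_criteria(summary_statistics, threshold=70.0):
    """Return one row per criterion with counts, pass rate, and a low-performance flag."""
    rows = []
    for group, criteria in summary_statistics.items():
        for name, counts in criteria.items():
            passed = counts.get("pass", 0)
            failed = counts.get("fail", 0)
            total = passed + failed  # assumption: every case is either pass or fail
            pass_rate = round(100 * passed / total, 1) if total else None
            rows.append(
                {
                    "group": group,
                    "criterion": name,
                    "pass": passed,
                    "fail": failed,
                    "pass_rate": pass_rate,
                    "average_score": counts.get("average_score"),
                    "flagged": pass_rate is not None and pass_rate < threshold,
                }
            )
    return rows


stats = {
    "human_evaluation": {},
    "auto_evaluation": {
        "criteria_name": {"pass": 8, "fail": 2, "average_score": 0.85},
        "relevance": {"pass": 8, "fail": 4, "average_score": 0.68},
    },
    "client_evaluation": {},
}

for row in analyze_criteria(stats):
    print(row["criterion"], row["pass_rate"], row["flagged"])
# criteria_name 80.0 False
# relevance 66.7 True
```

The `relevance` entry is made-up data added only to show a flagged criterion.
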

Environment Variables Required

The following environment variables must be set:

FREEPLAY_API_KEY: Your Freeplay API key
FREEPLAY_BASE_URL: Freeplay API base URL (default: https://app.freeplay.ai)

These are typically configured in the plugin's .mcp.json file or in the user's environment.

Project ID can come from:

User specification
MCP list_projects() tool to discover available projects
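
One way to read the two environment variables described above, with the documented default for the base URL, is sketched here. `load_freeplay_config` is a hypothetical helper; the project ID is deliberately left out because it comes from the user or from the MCP tool, not from the environment.

```python
import os

DEFAULT_BASE_URL = "https://app.freeplay.ai"


def load_freeplay_config(env=None):
    """Read the Freeplay settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("FREEPLAY_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of sending an empty Bearer token.
        raise RuntimeError("FREEPLAY_API_KEY is not set")
    base_url = env.get("FREEPLAY_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url.rstrip("/")}


config = load_freeplay_config({"FREEPLAY_API_KEY": "example-key"})
print(config["base_url"])  # https://app.freeplay.ai
```

Accepting a mapping argument makes the function usable with values loaded from a config file as well as from the process environment.
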

Response Format

When presenting test run results to the user, include:

Summary Header
- Test run name
- Prompt tested & version
- Model used
- Number of test cases
- Date/time
Evaluation Metrics
- For each criterion:
  - Criterion name
  - Pass/fail count
  - Pass rate percentage
  - Average score (if applicable)
Key Insights
- Overall pass rate
- Best performing criteria
- Areas needing improvement
- Any notable patterns or failures
Recommendations
- Suggest next steps based on results
- Highlight specific test cases to review if pass rate is low
- When no evaluations are configured: The test run completed but has no quantitative metrics to analyze. Provide specific, actionable guidance:
  - Explain that without evaluations, test runs only capture raw outputs — evaluations are what score those outputs automatically on each run
  - Direct the user to set up evaluations: open the prompt in the Freeplay UI → go to the Evaluate tab → create evaluation criteria (e.g., model-graded criteria that check for accuracy, relevance, tone, or format compliance)
  - Link to the documentation: Evaluations Overview
  - Suggest running the health-check skill (/health-check) to get a full assessment of their project setup, including evaluation coverage
  - For immediate value, recommend reviewing a sample of test case outputs directly in the Freeplay UI to manually spot-check quality before investing in automated evaluations
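
A rough sketch of how the summary header and evaluation metrics could be rendered, including the no-evaluations case, follows. `render_report` and the dict keys it expects are hypothetical; the wording of the fallback message paraphrases the guidance above and is not a Freeplay-defined string.

```python
def render_report(overview, rows):
    """Render a plain-text report from an overview dict and per-criterion rows."""
    lines = [
        f'# Test Run: "{overview["name"]}"',
        f'- Prompt: {overview["prompt"]} (version {overview["prompt_version"]})',
        f'- Model: {overview["model"]}',
        f'- Test Cases: {overview["test_cases"]}',
        f'- Date: {overview["date"]}',
        "",
        "## Evaluation Results",
    ]
    if not rows:
        # No criteria configured: explain the gap and point to the next step.
        lines.append("No evaluations are configured, so there are no metrics to analyze.")
        lines.append("Next step: create evaluation criteria in the prompt's Evaluate tab.")
        return "\n".join(lines)
    for row in rows:
        total = row["pass"] + row["fail"]
        rate = 100 * row["pass"] / total if total else 0.0
        lines += [
            "",
            f'### {row["criterion"]} ({row["group"]})',
            f'- Pass: {row["pass"]}/{total} ({rate:.1f}%)',
        ]
        if row.get("average_score") is not None:
            lines.append(f'- Average Score: {row["average_score"]}')
    return "\n".join(lines)


overview = {
    "name": "Test run name",
    "prompt": "Prompt template name",
    "prompt_version": "version-id",
    "model": "gpt-4",
    "test_cases": 10,
    "date": "Feb 13, 2009",
}
rows = [
    {"group": "auto_evaluation", "criterion": "criteria_name",
     "pass": 8, "fail": 2, "average_score": 0.85},
]
report = render_report(overview, rows)
print(report)
```

Key insights and recommendations are left out of the sketch because they depend on judgment about the specific run.
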

Example Analysis Output

# Test Run: "New test"
- Prompt: Chat Transcript to Issue
- Model: gpt-4-turbo
- Test Cases: 15
- Date: Jan 30, 2025

## Evaluation Results

### Accuracy (auto_evaluation)
- Pass: 13/15 (86.7%)
- Average Score: 0.87

### Relevance (auto_evaluation)
- Pass: 14/15 (93.3%)
- Average Score: 0.92

### Format Compliance (auto_evaluation)
- Pass: 15/15 (100.0%)
- Average Score: 1.0

## Overall Performance
✓ Strong performance across all criteria
✓ 93.3% overall pass rate
⚠ Review the 2 failing test cases for the Accuracy criterion

## Recommendations
1. Investigate the 2 accuracy failures to identify patterns
2. Consider adding more test cases for edge cases
3. Prompt is ready for staging deployment

Example Analysis Output (Multiple Test Runs)

# Test Run Comparison

## Test Run 1: "Bug Report Summarization"
- Prompt: Bug-to-Summary Converter
- Model: gpt-4-turbo
- Test Cases: 10
- Date: Jan 15, 2025

### Evaluation Results

#### Accuracy (auto_evaluation)
- Pass: 8/10 (80.0%)
- Average Score: 0.82

#### Clarity (human_evaluation)
- Pass: 7/10 (70.0%)
- Average Score: 0.73

#### Format Compliance (auto_evaluation)
- Pass: 10/10 (100.0%)
- Average Score: 1.0

### Overall Performance
✓ Good accuracy and format compliance
⚠ Clarity underperforms at 70%
✓ 83.3% average pass rate

### Recommendations
1. Refine prompt instructions to improve clarity in summaries
2. Review the 3 test cases with clarity failures
3. Maintain strong format adherence

---

## Test Run 2: "Support Ticket Categorization 234dv vs a24bd"
- Prompt: Ticket Classifier
- Versions compared: 
    - Version: 234dv
      - Model: Opus 4.5
      - Created by: Prompt optimizer
      - Created: 10/11/25 11:12:32
    - Version: a24bd
      - Model: Sonnet 4.5
      - Created by: Rob Rhyne
      - Created: 10/10/25 11:12:32
- Test Cases: 12
- Date: Feb 2, 2025

### Evaluation Results

#### Accuracy (auto_evaluation)
- Pass: 9/12 (75.0%)
- Average Score: 0.79

#### Category Relevance (auto_evaluation)
- Pass: 8/12 (66.7%) ⚠
- Average Score: 0.68

#### Format Compliance (auto_evaluation)
- Pass: 11/12 (91.7%)
- Average Score: 0.95

### Overall Performance
⚠ Category relevance below 70% – needs attention
✓ Good format compliance, moderate accuracy
✗ 77.8% average pass rate

### Recommendations
1. Investigate category relevance failures—review misclassified tickets
2. Consider adding category examples to the prompt
3. Re-evaluate if accuracy improves after prompt update
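
When two runs need to be compared side by side, the per-criterion pass rates can be diffed along the lines of the sketch below. The helper names and the winner rule (higher mean pass rate over shared criteria, no winner on a tie) are assumptions for illustration. Latency and cost are not compared because the documented summary response shows no such fields.

```python
def pass_rates(summary_statistics):
    """Map each criterion name to its pass rate in percent, rounded to 1 decimal."""
    rates = {}
    for criteria in summary_statistics.values():
        for name, counts in criteria.items():
            passed = counts.get("pass", 0)
            total = passed + counts.get("fail", 0)
            if total:
                rates[name] = round(100 * passed / total, 1)
    return rates


def compare_runs(stats_a, stats_b):
    """Compare two runs on the criteria they share and suggest a winner."""
    rates_a, rates_b = pass_rates(stats_a), pass_rates(stats_b)
    shared = sorted(set(rates_a) & set(rates_b))
    deltas = {name: round(rates_b[name] - rates_a[name], 1) for name in shared}
    winner = None
    if shared:
        mean_a = sum(rates_a[n] for n in shared) / len(shared)
        mean_b = sum(rates_b[n] for n in shared) / len(shared)
        if mean_a != mean_b:
            # Hypothetical rule: the higher mean pass rate wins.
            winner = "a" if mean_a > mean_b else "b"
    return {"deltas": deltas, "winner": winner}


run_a = {"auto_evaluation": {"accuracy": {"pass": 8, "fail": 2},
                             "clarity": {"pass": 7, "fail": 3}}}
run_b = {"auto_evaluation": {"accuracy": {"pass": 9, "fail": 1},
                             "clarity": {"pass": 8, "fail": 2}}}

result = compare_runs(run_a, run_b)
print(result["deltas"])  # {'accuracy': 10.0, 'clarity': 10.0}
print(result["winner"])  # b
```

The counts in `run_a` and `run_b` are illustrative, not real test data.
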

Tips

Always format Unix timestamps into readable dates
Calculate percentages and round to 1 decimal place
Use visual indicators (✓ ⚠ ✗) for quick scanning
Highlight concerning metrics (< 70% pass rate)
Keep the analysis concise but actionable