| name | health-check |
| description | Assess the health, completeness, and production readiness of a Freeplay project across the data flywheel. Use when the user asks about project status, wants to know if their project is ready for production, asks what's missing in their Freeplay setup, wants a project health check, or asks "what should I set up next?" Also use when the user is first connecting a project to Freeplay. |
Freeplay Project Health Check
This skill performs a comprehensive assessment of a Freeplay project's health across all dimensions of the data flywheel: observability, datasets, evaluations, testing, and continuous improvement.
When to use this skill
- "Is my project production ready?"
- "What's missing in my Freeplay setup?"
- "Check the health of project X"
- "How complete is my evaluation setup?"
- "Is my data flywheel working?"
- "What should I set up next?"
- When starting work on an unfamiliar project
- Before making significant changes to prompts or evaluations
- After onboarding a new project to Freeplay (AFTER initial integration)
The Data Flywheel Mental Model
Freeplay enables continuous improvement through a connected data flywheel:
Production (Observability)
↓ Logs sessions/traces/completions
Monitoring & Review
↓ Identifies patterns, failures
Datasets (Curation)
↓ Failures and successes become test cases
Prompt and Agent Iteration (Improvement)
↓ New versions created
Test Runs (Validation)
↓ Results inform changes and prevent regressions
Deployment (Versioning)
↓ Controlled rollout
Production (Repeat)
A healthy project has all stages connected and flowing.
Health Dimensions
Assess each dimension on a 3-level scale:
- ✅ Healthy: Well-configured, active, no action needed
- ⚠️ Needs Attention: Partially configured or showing warning signs
- ❌ Critical/Missing: Not set up or blocking the flywheel
1. Prompt Management
What to check:
- At least one prompt template exists
- Templates have multiple versions (showing iteration)
- Versions are deployed to environments (dev, staging, prod, etc.)
- Clear naming and versioning conventions
MCP tools to use:
list_prompt_templates(project_id)
get_prompt_version(project_id, template_id, version_id)
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/environments" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-templates/id/{template_id}/versions" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
What to look for:
list_prompt_templates returns templates with latest_version_id populated
- Each template's version shows deployed environments (check for prod/staging/dev)
- Multiple versions per template indicates active iteration
- Version names follow a pattern (e.g., semantic versioning, descriptive names)
Scoring:
- ✅ Multiple templates with versions deployed to multiple environments (or one template with lots of activity in the case of single prompt projects)
- ⚠️ Templates exist but only deployed to one environment, or no recent versions
- ❌ No templates, or templates with no deployed versions
2. Evaluation Setup
What to check:
- Evaluation criteria exist for key prompt templates/agents
- Multiple evaluation types configured or present in logs (model-graded, code, human)
- Evaluation criteria are enabled and running/published
- Sample rates are appropriate (not 0%)
- Insights generation is enabled at project and criteria level
MCP tools to use:
search_completions(project_id, limit=50) → check for evaluation results in logs
list_insights(project_id) → check if insights are being generated
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/evaluation-criteria" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
What to look for:
- Evaluation criteria list shows
is_enabled: true for active criteria
- Look for
type field: "llm_eval", "code_eval", "user_eval", "auto_categorization"
- Check
sample_rate is > 0 (1.0 = 100% of completions evaluated)
- Check
generate_insights on criteria
- Project settings should have
enable_eval_insights and enable_review_insights set to true
- Completions from
search_completions should show evaluation scores
Scoring:
- ✅ 3+ evaluation criteria, insights enabled, consistent scoring
- ⚠️ 1-2 evaluation criteria, or insights disabled, or inconsistent results
- ❌ No evaluation criteria configured, or all evaluation criteria disabled
3. Observability (Production Logging)
What to check:
- Sessions, traces*, AND completions are each being logged (*except in the case of projects with a single prompt template)
- Recent activity (within last 7 days)
- Completions linked to prompt templates (not orphaned)
- Evaluation criteria running on production data
- Customer feedback and/or custom metadata being logged
- Cost and latency being tracked
MCP tools to use:
search_sessions(project_id, limit=20)
search_completions(project_id, limit=20)
search_traces(project_id, limit=20)
find_logging_issues(project_id, template_name=<main_template>) → identifies missing logged fields
What to look for in search results:
search_sessions: Check count, most recent timestamp, presence of metadata
search_completions: Check for:
template_name populated (not orphaned)
environment set (tracking deployment context)
- Evaluation scores present in results
- Cost and latency data populated
search_traces: Check if traces exist (for multi-step/agentic projects)
find_logging_issues: Returns specific missing fields with fix suggestions
Date filtering for recency:
Use the start_date parameter to check recent activity (use a date 7 days ago):
search_completions(project_id, limit=20, start_date="YYYY-MM-DD") # 7 days ago
Scoring:
- ✅ Active logging (100+ sessions), recent activity, evaluations running, all key fields populated
- ⚠️ Some logging but sparse, or no recent activity, or completions not linked to prompt templates, or missing feedback/metadata
- ❌ No sessions/completions logged, or no activity in 30+ days
4. Dataset Coverage
What to check:
- At least one dataset exists
- Test cases exist in datasets that include inputs and output
- Various inputs in test cases cover key usage scenarios, based on what's happening in production logs
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-datasets" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/agent-datasets" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-datasets/id/{dataset_id}/test-cases" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/agent-datasets/id/{dataset_id}/test-cases" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
MCP tools to use:
These are useful for comparing dataset coverage to production usage patterns:
search_completions(project_id, limit=50) → see what inputs are common in production
search_traces(project_id, limit=20) → for agent workflows
What to look for:
- Dataset list returns one or more datasets with meaningful names
- Test cases include both
inputs and expected output (not just inputs)
- Compare test case inputs to production completion inputs for coverage gaps
- Look for dataset purposes: golden examples, failure cases, edge cases, red team
Scoring:
- ✅ 3+ datasets with 50+ total test cases, covering different purposes/scenarios
- ⚠️ 1-2 datasets, or fewer than 20 test cases each, or missing expected outputs
- ❌ No datasets, all datasets are empty, or fewer than 20 test cases total
5. Testing Cadence
What to check:
- Test runs being executed
- Recent test runs (within last 10 days)
- Multiple test runs per prompt template and/or agent (showing iteration)
- Test runs include evaluation results
- Comparison tests being created
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs/id/{test_run_id}" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
What to look for:
- Test runs list shows multiple runs with recent
created_at timestamps
- Same
prompt_name appears in multiple runs (showing iteration)
summary_statistics contains auto_evaluation and/or human_evaluation scores
- Look for paired runs with similar names (e.g., "baseline" vs "optimized") indicating A/B comparisons
- Check
sessions_count matches expected dataset size
Scoring:
- ✅ 10+ test runs, recent activity (within 10 days), comparative testing evident
- ⚠️ 1-9 test runs, or no recent tests, or no comparisons
- ❌ No test runs ever executed
6. Continuous Improvement Signals
What to check:
- Insights being generated (eval insights, review insights)
- Prompt optimization runs attempted
- Human reviews being conducted (manual evaluation criteria scoring or notes present)
- Patterns being identified and addressed
MCP tools to use:
list_insights(project_id) → check for active insights
get_prompt_version(project_id, template_id, version_id) → check for optimized versions
search_completions(project_id, limit=50) → look for human evaluation scores
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-templates/id/{template_id}/versions" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
What to look for:
list_insights returns insights with meaningful content (not empty)
- Prompt versions with names containing "Optimized" or created by optimization process
- Version descriptions mentioning optimization or improvement
- Completions showing
human_evaluation or manual_score values
- Insights have
status: "active" (not just orphaned/pruned)
- Multiple prompt versions over time (not just one static version)
Scoring:
- ✅ Active insights being created, prompt optimization used, human review scores present
- ⚠️ Some insights exist but not acted on, or no prompt optimization attempts, or sparse human reviews
- ❌ No insights, no optimization attempts, no human review activity
7. Configuration Completeness
What to check:
- API credentials working
- Environments being used (at least 1 prompt template deployed to prod at minimum)
- LLM provider credentials configured
- Project settings appropriate (data retention, spend limits, insights enabled)
MCP tools to use:
list_projects() → validates API credentials are working
list_prompt_templates(project_id) → check which environments have deployments
API calls:
curl -s "$FREEPLAY_BASE_URL/api/v2/environments" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-templates/environment/production" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
What to look for:
- MCP calls succeed (credentials valid)
- Environments list includes at least: production (or prod), and ideally staging/dev
- Project settings show:
enable_eval_insights: true
enable_review_insights: true
data_retention_days set appropriately
freeplay_spend_limit_usd configured if using Freeplay-hosted models
- At least one template deployed to production environment
Scoring:
- ✅ All credentials valid, 3+ environments with active deployments, insights enabled
- ⚠️ Missing some environments, or insight flags disabled, or no production deployment
- ❌ Invalid credentials, or no environments defined, or project misconfigured
How to Perform the Health Check
Step 1: Gather Project Context
First, identify the project. If not provided, ask the user or use list_projects() to show available projects.
Step 2: Collect Data (Parallel)
Run these MCP calls in parallel to gather comprehensive data:
list_prompt_templates(project_id)
search_sessions(project_id, limit=50)
search_completions(project_id, limit=50)
search_traces(project_id, limit=20)
list_insights(project_id)
find_logging_issues(project_id) → optional, for deeper observability analysis
And these API calls (can be run in parallel):
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/environments" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/evaluation-criteria" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-datasets" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/agent-datasets" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
Follow-up calls (based on initial results):
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-datasets/id/{dataset_id}/test-cases" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/prompt-templates/id/{template_id}/versions" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
curl -s "$FREEPLAY_BASE_URL/api/v2/projects/{project_id}/test-runs/id/{test_run_id}" \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
Step 3: Analyze Each Dimension
For each dimension, evaluate against the scoring criteria and note specific findings.
Step 4: Calculate Overall Health
Count the scores:
- Production Ready: 6-7 dimensions ✅, no ❌
- Almost Ready: 4-5 dimensions ✅, max 1 ❌
- Needs Work: 2-3 dimensions ✅, or 2+ ❌
- Getting Started: 0-1 dimensions ✅
Step 5: Generate Recommendations
Based on gaps, provide prioritized recommendations:
- Critical (❌ items): Must fix before production
- Important (⚠️ items): Should address for reliability
- Optimization: Nice-to-have improvements
Output Format
Present results in this structure:
# Project Health Check: {Project Name}
## Overall Status: {Production Ready | Almost Ready | Needs Work | Getting Started}
## Flywheel Scorecard
| Dimension | Status | Finding |
|------------------------|--------|---------|
| Prompt Management | ✅/⚠️/❌ | Brief description |
| Evaluation Setup | ✅/⚠️/❌ | Brief description |
| Observability | ✅/⚠️/❌ | Brief description |
| Dataset Coverage | ✅/⚠️/❌ | Brief description |
| Testing Cadence | ✅/⚠️/❌ | Brief description |
| Continuous Improvement | ✅/⚠️/❌ | Brief description |
| Configuration | ✅/⚠️/❌ | Brief description |
## Key Metrics
- **Prompt Templates**: X (Y versions total)
- **Active Evaluations**: X criteria
- **Sessions Logged**: X (last activity: date)
- **Datasets**: X (Y test cases total)
- **Test Runs**: X (last run: date)
- **Insights Generated**: X
## Critical Issues (if any)
1. {Issue}: {Impact and why it matters}
## Recommendations
### Priority 1: {Category}
- {Specific action}
- {Specific action}
### Priority 2: {Category}
- {Specific action}
## Next Steps
Based on this assessment, you should:
1. {First action with skill link if applicable}
2. {Second action}
3. {Third action}
---
*Use the `run-test` skill to execute tests after making changes*
*Use the `test-run-analysis` skill to analyze test results*
*Use the `dataset-management` skill to build or update datasets*
Common Patterns and Recommendations
Pattern: No Evaluation Criteria
Symptom: Completions or traces exist but no evaluation results
Recommendation:
- Create at least 2-3 model-graded evaluations for your main template
- Start with "bottoms-up" error conditions that act like unit tests: "citations present" or "accurate discount code format" or "includes suggested next action".
- Enable insights generation on evaluation criteria
Pattern: No Recent Test Runs
Symptom: Datasets exist but no test runs in 30+ days
Recommendation:
- Run tests before any prompt changes using
/freeplay:run-test
- Set up comparative testing (baseline vs. new version)
- Integrate testing into your development workflow
Pattern: Orphaned Completions
Symptom: Completions not linked to prompt templates
Recommendation:
0. Make sure prompt templates exist, or help create them if not
- Update SDK integration to include
prompt_template_id
- Link completions to environments for proper tracking
- Review Freeplay SDK documentation for proper logging
Pattern: Weak Dataset
Symptom: Only one dataset with limited test cases
Recommendation:
- Ask the user to confirm the semantic meaning of the dataset (i.e. is it a "Golden Dataset" of representative input/output pairs, or "Failure Cases" including known failures to improve, or "Red Team" test cases that help detect abuse)
- Analyze the existing test cases to understand what they cover
- Analyze a sample of 100-200 recent production logs for the same component (prompt template or agent) and assess whether the dataset is representative of the production sample
- Where production examples are markedly different or distinct, suggest examples to the user to add to their dataset. Always get confirmation from the user before changing the test cases in a dataset.
Pattern: No Insights
Symptom: Insights list is empty despite activity
Recommendation:
- Enable
enable_eval_insights on project settings
- Enable
enable_review_insights on project settings
- Ensure
generate_insights is true on evaluation criteria
- Wait for sufficient data (typically 50+ completions)
Pattern: Evaluation Criteria Misalignment
Symptom: Insights show consistent mis-scoring or low pass rates on expected-good outputs
Recommendation:
- Review evaluation criteria prompts for clarity
- Check if criteria are inverted (high score = bad)
- Validate model-graded evals against human judgment
- Consider creating calibration datasets
Security: Protecting API Keys in curl Commands
All curl commands use process substitution to pass the Authorization header,
preventing the API key from appearing in process listings:
curl -s "$FREEPLAY_BASE_URL/api/v2/..." \
-H @<(echo "Authorization: Bearer $FREEPLAY_API_KEY")
Never log, echo, or display the value of FREEPLAY_API_KEY in output.
Environment Variables
Required for API calls:
FREEPLAY_API_KEY: Freeplay API key
FREEPLAY_BASE_URL: API base URL (from .env file)
Project ID can come from:
- User specification
- MCP
list_projects() tool to discover available projects
Linking to Other Skills
After the health check, suggest relevant skills:
- Missing tests? → Use the
run-test skill
- Need to analyze results? → Use the
test-run-analysis skill
- Need to build datasets? → Use the
dataset-management skill
- Check deployments? → Use the
get_deployed_prompt_versions MCP tool