bernstein-quality

Stars: 583

Forks: 49

Updated: May 21, 2026 at 14:34

Show quality metrics for Bernstein runs - success rates per model, lint/test pass rates, completion time distributions. Use when the user asks about quality, reliability, which model performs best, or pass rates.

Source: sipyourdrink-ltd/bernstein

SKILL.md

name	bernstein-quality
description	Show quality metrics for Bernstein runs - success rates per model, lint/test pass rates, completion time distributions. Use when the user asks about quality, reliability, which model performs best, or pass rates.

Bernstein Quality Metrics

Analyze quality and reliability of agent-generated code.

When to Use

User asks "how reliable are the agents?" or "which model is best?"
User wants success rates, pass rates, or completion time stats
User asks about test failures or lint issues across models
User says "show me quality metrics"

Instructions

Run scripts/quality.sh metrics for overall quality metrics.
Run scripts/quality.sh pass-rates for lint, type-check, and test pass rates by model.
Run scripts/quality.sh times for completion time distributions.
Present a quality dashboard:

## Quality Dashboard

### Success Rate by Model
| Model | Tasks | Success | Fail | Rate |
|-------|-------|---------|------|------|
| claude-sonnet-4 | 24 | 22 | 2 | 91.7% |
| gpt-4.1 | 12 | 10 | 2 | 83.3% |

### Pass Rates
| Check | Overall | claude-sonnet-4 | gpt-4.1 |
|-------|---------|-----------------|---------|
| Lint | 96% | 98% | 92% |
| Type-check | 88% | 91% | 83% |
| Tests | 85% | 89% | 75% |

### Completion Times
| Percentile | Time |
|------------|------|
| p50 | 3m 20s |
| p90 | 8m 45s |
| p99 | 15m 12s |
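
The Rate column and the percentile rows in the dashboard above are simple derived figures. As a minimal sketch, assuming the raw data is available as per-model success/failure counts and a list of completion times in seconds (the output format of scripts/quality.sh is not specified here, so these inputs and function names are hypothetical), they could be computed like this:

```python
import math


def success_rate(success: int, fail: int) -> str:
    """Format successes over total tasks as a percentage with one decimal."""
    total = success + fail
    if total == 0:
        return "n/a"  # a model with no tasks has no meaningful rate
    return f"{100 * success / total:.1f}%"


def percentile(seconds: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not seconds:
        raise ValueError("no completion times recorded")
    ordered = sorted(seconds)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def format_duration(seconds: float) -> str:
    """Render a duration in the dashboard's 'Xm Ys' style."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


# Counts taken from the example Success Rate table.
print(success_rate(22, 2))   # 91.7%
print(success_rate(10, 2))   # 83.3%
print(format_duration(200))  # 3m 20s
```

Nearest-rank is one of several percentile definitions; an interpolating method would give slightly different values on small samples, so the dashboard should state which one it uses if the distinction matters.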

Highlight any model whose pass rate on a check is significantly lower than the overall rate or than the other models.
Recommend model routing adjustments if one model consistently underperforms.
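
What counts as "significantly lower" is left to judgment. One way to make it concrete is sketched below, assuming pass rates are available as percentages per check and per model, and using a hypothetical 10-point gap below the overall rate as the cutoff. Neither the data shape nor the threshold comes from Bernstein itself.

```python
def flag_underperformers(
    pass_rates: dict[str, dict[str, float]],
    overall: dict[str, float],
    max_gap: float = 10.0,  # hypothetical threshold, in percentage points
) -> list[tuple[str, str, float]]:
    """Return (model, check, gap) for each rate trailing the overall rate by max_gap or more."""
    flagged = []
    for check, by_model in pass_rates.items():
        for model, rate in by_model.items():
            gap = overall[check] - rate
            if gap >= max_gap:
                flagged.append((model, check, gap))
    return flagged


# Figures taken from the example Pass Rates table.
rates = {
    "Lint": {"claude-sonnet-4": 98, "gpt-4.1": 92},
    "Type-check": {"claude-sonnet-4": 91, "gpt-4.1": 83},
    "Tests": {"claude-sonnet-4": 89, "gpt-4.1": 75},
}
overall = {"Lint": 96, "Type-check": 88, "Tests": 85}

for model, check, gap in flag_underperformers(rates, overall):
    print(f"{model} trails the overall {check} pass rate by {gap:.0f} points")
```

With the example figures only the gpt-4.1 Tests rate is flagged. A routing recommendation would normally need the same model to be flagged across several runs, not just once, before it counts as consistent underperformance.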
