$pwd:

analyze-eval

Name: Analyze Eval
Author: get-convex

// Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

$ git log --oneline --stat

stars:122

forks:10

updated: February 27, 2026, 02:46

SKILL.md

readonly

name	analyze-eval
description	Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

Analyze Eval

When to use

User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
User asks "why did this eval fail?" or "what went wrong with this eval?"
User references a specific eval ID

Step 1: Extract the eval ID from the URL

The visualizer URL pattern is:

/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps

$runId — the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
$evalId — the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need the evalId to query.
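The extraction can be sketched as a small parser. This is a minimal illustration, assuming the URL follows the pattern above; `parseEvalUrl` and `EvalUrlParts` are hypothetical names, not part of the convex-evals codebase.

```typescript
// Path segments of a visualizer URL, following the pattern
// /experiment/$experimentId/run/$runId/$category/$evalId
interface EvalUrlParts {
  experimentId: string;
  runId: string;
  category: string;
  evalId: string;
}

// Each segment stops at "/", "?" or "#", so a trailing ?tab=steps is ignored.
const EVAL_URL_PATTERN =
  /\/experiment\/([^/?#]+)\/run\/([^/?#]+)\/([^/?#]+)\/([^/?#]+)/;

// Hypothetical helper: returns null when the input does not match the pattern.
// Accepts either a full URL or just the path.
function parseEvalUrl(url: string): EvalUrlParts | null {
  const match = EVAL_URL_PATTERN.exec(url);
  if (match === null) return null;
  const [, experimentId, runId, category, evalId] = match;
  return { experimentId, runId, category, evalId };
}
```

Only `evalId` is needed for the query in Step 2; the other segments are returned for context in the final report.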

Step 2: Query the debug action

Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):

npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'

This returns a JSON object with:

Field	Contents
`eval`	Name, category, evalPath, status (pass/fail + failure reason), task text
`run`	Model name, provider, experiment name, run status
`steps`	Array of step results: filesystem, install, deploy, tsc, eslint, tests — each with pass/fail/skipped and failure reason
`outputFiles`	Map of file path -> file content from the model's generated output (unzipped)
`evalSourceFiles`	Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.)
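For orientation, the result could be typed roughly as follows. This is a sketch under stated assumptions: only the five top-level field names, `eval.task`, `status.kind === "failed"`, and `failureReason` come from this document. The remaining property names, the `"passed"` and `"skipped"` kinds, and the `debugCommandArgs` helper are guesses for illustration and should be checked against the actual output.

```typescript
// Assumed status shape: only kind === "failed" and failureReason are
// confirmed by the skill text; the other two kinds are guesses.
type StepStatus =
  | { kind: "passed" }
  | { kind: "skipped" }
  | { kind: "failed"; failureReason: string };

interface StepResult {
  name: string; // assumed field, e.g. "filesystem", "install", "deploy", "tsc", "eslint", "tests"
  status: StepStatus;
}

interface EvalDebugInfo {
  // Property names inside eval and run are assumptions, except eval.task.
  eval: { name: string; category: string; evalPath: string; status: StepStatus; task: string };
  run: { model: string; provider: string; experimentName: string; status: string };
  steps: StepResult[];
  outputFiles: Record<string, string>; // file path -> generated file content
  evalSourceFiles: Record<string, string>; // file path -> eval source content
}

// Hypothetical helper: builds the arguments to pass to npx, mirroring the
// command shown above. JSON.stringify takes care of quoting the eval ID.
function debugCommandArgs(evalId: string): string[] {
  return ["convex", "run", "--prod", "debug:getEvalDebugInfo", JSON.stringify({ evalId })];
}
```

Passing the arguments as an array (for example to a process-spawning API) avoids shell-quoting problems with the JSON payload.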

Step 3: Analyze the failure

With the data returned, compare:

Which step failed? — Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
What did the model generate? — Look at outputFiles for the model's code.
What was expected? — Look at evalSourceFiles for the answer directory and grader test files.
What was the task? — Check eval.task for the TASK.txt content.
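The first check in the list above can be sketched as a lookup over `steps`. This assumes each step carries a `name` and that `failureReason` sits on the `status` object; both are assumptions about the payload, and `firstFailedStep` is a hypothetical helper.

```typescript
// Minimal step shape for this sketch; name and the placement of
// failureReason are assumptions, status.kind comes from the skill text.
interface Step {
  name: string;
  status: { kind: string; failureReason?: string };
}

// Returns the first failed step and its error message, or null if no step failed.
function firstFailedStep(steps: Step[]): { name: string; reason: string } | null {
  const failed = steps.find((step) => step.status.kind === "failed");
  if (failed === undefined) return null;
  return {
    name: failed.name,
    reason: failed.status.failureReason ?? "(no failure reason recorded)",
  };
}
```

Taking the first failure matters because later steps may be skipped or fail only as a consequence of the earlier one.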

Common failure patterns:

eslint fail — Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
tsc fail — TypeScript compilation error. Check the failure reason for the specific type error.
convex dev fail — Schema or function definition issues that prevent Convex from deploying.
tests fail — The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.
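When comparing the model output with the reference solution, a rough file-level comparison can narrow down where to look. The sketch below assumes that reference files appear in `evalSourceFiles` under an `answer/` prefix and that the rest of each path mirrors the keys of `outputFiles`; that layout is an assumption, and `compareWithAnswer` is a hypothetical helper.

```typescript
type FileMap = Record<string, string>;

interface FileComparison {
  missingFromOutput: string[]; // in the answer, absent from the model output
  differing: string[]; // present in both, contents differ
  identical: string[]; // present in both, same contents ignoring outer whitespace
}

// Compares model output against answer files (assumed to live under answerPrefix).
function compareWithAnswer(
  outputFiles: FileMap,
  evalSourceFiles: FileMap,
  answerPrefix: string = "answer/",
): FileComparison {
  const result: FileComparison = { missingFromOutput: [], differing: [], identical: [] };
  for (const [path, expected] of Object.entries(evalSourceFiles)) {
    if (!path.startsWith(answerPrefix)) continue; // skip grader, TASK.txt, etc.
    const relativePath = path.slice(answerPrefix.length);
    const actual: string | undefined = outputFiles[relativePath];
    if (actual === undefined) {
      result.missingFromOutput.push(relativePath);
    } else if (actual.trim() === expected.trim()) {
      result.identical.push(relativePath);
    } else {
      result.differing.push(relativePath);
    }
  }
  return result;
}
```

A differing file is only a pointer for manual review: a model's solution can be correct without matching the answer text.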

Step 4: Classify and report findings

Classify the failure as one of:

MODEL_FAULT: The model genuinely got it wrong
OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)

Summarize:

The eval name, model, and experiment
Which step failed and the exact error
The classification and reasoning
The relevant code from the model output that caused the failure
What the correct code should look like (from the answer/eval source)
Whether any action is recommended (config change, task clarification, etc.)
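One way to keep reports consistent is to capture the classification and summary items in a typed record. This is only an illustrative sketch: the four classification labels come from this skill, while `EvalFinding`, its property names, and `formatFinding` are hypothetical.

```typescript
// The four labels defined in Step 4.
type Classification = "MODEL_FAULT" | "OVERLY_STRICT" | "AMBIGUOUS_TASK" | "KNOWN_GAP";

// Hypothetical record mirroring the summary items listed above.
interface EvalFinding {
  evalName: string;
  model: string;
  experiment: string;
  failedStep: string;
  error: string;
  classification: Classification;
  reasoning: string;
  modelCode: string; // relevant excerpt from outputFiles
  expectedCode: string; // relevant excerpt from the answer or eval source
  recommendedAction?: string; // omitted when no action is recommended
}

// Renders a finding as a plain-text report, one summary item per section.
function formatFinding(finding: EvalFinding): string {
  return [
    `Eval: ${finding.evalName} (model: ${finding.model}, experiment: ${finding.experiment})`,
    `Failed step: ${finding.failedStep}`,
    `Error: ${finding.error}`,
    `Classification: ${finding.classification}`,
    `Reasoning: ${finding.reasoning}`,
    `Model output:\n${finding.modelCode}`,
    `Expected:\n${finding.expectedCode}`,
    `Recommended action: ${finding.recommendedAction ?? "none"}`,
  ].join("\n");
}
```

Using a string-literal union for the classification lets the compiler reject labels outside the four defined categories.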

related-skills.json

Same repository

add-eval.md

from "get-convex/convex-evals"

Design, implement, validate, and calibrate a new eval for the convex-evals suite. Use when the user wants to add a new eval, create an eval, test a new Convex concept, or expand eval coverage.

2026-03-24 · 122

add-model.md

from "get-convex/convex-evals"

Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.

2026-03-13 · 122

analyze-run.md

from "get-convex/convex-evals"

Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.

2026-02-27 · 122

validate-guidelines.md

from "get-convex/convex-evals"

Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

2026-02-27 · 122

analyze-ablation.md

from "get-convex/convex-evals"

Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.

2026-02-18 · 122

package.json

"author": "get-convex"

"repository": "get-convex/convex-evals"

$ useful --forSOC

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · 15-1253 · L4
