skillgrade-graders

Authors deterministic and LLM rubric graders for skillgrade evaluations. Use when creating scoring scripts, writing evaluation rubrics, or combining multiple graders with weighted scoring. Don't use for setting up eval pipelines, configuring eval.yaml defaults, or general test writing.

Skillgrade Grader Authoring

Procedures

Step 1: Identify the Grading Strategy

Determine whether the task requires objective verification (deterministic) or qualitative assessment (LLM rubric).
For most tasks, combine both: deterministic graders verify outcomes (weight 0.7), LLM rubrics assess approach quality (weight 0.3).

Step 2: Write a Deterministic Grader

Create a script in the skill's graders/ directory (bash or TypeScript).

The script must output a JSON object to stdout with the following structure:

{"score": 0.67, "details": "2/3 checks passed", "checks": [{"name": "check-name", "passed": true, "message": "Description"}]}

score (0.0–1.0) and details are required. checks is optional but recommended.
Read references/grader-output-schema.md for the full output specification.
Use awk for arithmetic in bash scripts, because bc is not available in the node:20-slim image.
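The requirements above can be sketched as a small bash grader. This is a minimal illustration, not skillgrade's own tooling: the helper names (add_check, run_check, emit_result), the file output.txt, and the three checks are hypothetical and should be replaced with whatever the task actually produces. Only the output shape (score, details, checks) and the use of awk come from this document.

```shell
#!/usr/bin/env bash
# Hypothetical graders/check.sh. The checked file (output.txt) and the
# helper names are illustrative assumptions, not part of skillgrade.
set -u

passed=0
total=0
checks=""

# Record one result: $1 = check name, $2 = "true" or "false", $3 = message.
add_check() {
  total=$((total + 1))
  if [ "$2" = "true" ]; then passed=$((passed + 1)); fi
  entry=$(printf '{"name": "%s", "passed": %s, "message": "%s"}' "$1" "$2" "$3")
  if [ -z "$checks" ]; then checks="$entry"; else checks="$checks, $entry"; fi
}

# Run a command silently and record whether it succeeded.
run_check() {
  name=$1; message=$2; shift 2
  if "$@" >/dev/null 2>&1; then
    add_check "$name" true "$message"
  else
    add_check "$name" false "$message"
  fi
}

# Print the final JSON object; awk does the division because bc may be missing.
emit_result() {
  score=$(awk -v p="$passed" -v t="$total" \
    'BEGIN { if (t == 0) printf "0.00"; else printf "%.2f", p / t }')
  printf '{"score": %s, "details": "%s/%s checks passed", "checks": [%s]}\n' \
    "$score" "$passed" "$total" "$checks"
}

run_check "output-exists" "output.txt was created" test -f output.txt
run_check "output-non-empty" "output.txt has content" test -s output.txt
run_check "output-has-marker" "output.txt contains done" grep -q done output.txt
emit_result
```

Names and messages are interpolated without JSON escaping, so this sketch assumes they contain no double quotes or backslashes.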

Reference the grader in eval.yaml:

- type: deterministic
  run: bash graders/check.sh
  weight: 0.7

Step 3: Write an LLM Rubric Grader

Draft a rubric with explicit scoring criteria and point allocations.

Structure the rubric into weighted sections that sum to 1.0:

Workflow Compliance (0-0.5):
- Did the agent follow the mandatory workflow steps?
Efficiency (0-0.5):
- Completed in ≤5 commands without trial-and-error?

Reference the rubric in eval.yaml:

- type: llm_rubric
  rubric: |
    [rubric text or file path]
  weight: 0.3
  provider: gemini               # optional: gemini (default) | anthropic | openai
  model: gemini-3-flash-preview  # optional, each provider has a default model

For long rubrics, store the rubric in a separate file and reference it by path, for example rubric: rubrics/quality.md.
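Each provider value needs a matching API key in the environment: GEMINI_API_KEY for gemini, ANTHROPIC_API_KEY for anthropic, and OPENAI_API_KEY for openai. A preflight check along the following lines may catch a missing key before any LLM call is made. The function names and the PROVIDER variable are hypothetical conveniences for this sketch, not skillgrade features.

```shell
# key_var_for prints the environment variable name a provider needs.
# Hypothetical helper; the provider-to-key mapping is the one listed above.
key_var_for() {
  case "$1" in
    gemini)    echo "GEMINI_API_KEY" ;;
    anthropic) echo "ANTHROPIC_API_KEY" ;;
    openai)    echo "OPENAI_API_KEY" ;;
    *)         return 1 ;;
  esac
}

# check_key succeeds only if the provider's key is exported and non-empty.
check_key() {
  var=$(key_var_for "$1") || { echo "unknown provider: $1" >&2; return 2; }
  if [ -z "$(printenv "$var")" ]; then
    echo "missing $var for provider $1" >&2
    return 1
  fi
}

# PROVIDER is an assumed variable for this sketch; it defaults to gemini.
check_key "${PROVIDER:-gemini}" || echo "set the key before running LLM rubric graders" >&2
```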

Step 4: Combine Multiple Graders

Assign weights to each grader based on importance. Weights are normalized automatically.
Final reward is calculated as: Σ (grader_score × weight) / Σ weight.

Example configuration:

graders:
  - type: deterministic
    run: bash graders/check.sh
    weight: 0.7
  - type: llm_rubric
    rubric: rubrics/quality.md
    weight: 0.3
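To see how the weights in this configuration turn into a final reward, the formula Σ (grader_score × weight) / Σ weight can be worked through with awk. The function weighted_reward is a hypothetical helper written for this illustration, and the scores 0.67 and 0.9 are made-up inputs rather than real results.

```shell
# weighted_reward takes alternating score/weight arguments and prints
# sum(score * weight) / sum(weight), rounded to three decimals.
weighted_reward() {
  awk 'BEGIN {
    for (i = 1; i < ARGC; i += 2) {
      num += ARGV[i] * ARGV[i + 1]
      den += ARGV[i + 1]
    }
    if (den == 0) { print "0.000"; exit }
    printf "%.3f\n", num / den
  }' "$@"
}

# Deterministic grader scored 0.67 (weight 0.7), LLM rubric scored 0.9 (weight 0.3).
weighted_reward 0.67 0.7 0.9 0.3   # prints 0.739

# Because of the division by the weight sum, weights 7 and 3 give the same result.
weighted_reward 0.67 7 0.9 3       # prints 0.739
```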

Step 5: Validate Graders

Create a reference solution script that produces the expected output.
Run skillgrade --validate to verify graders score the reference solution correctly.
Test only deterministic graders: skillgrade --grader=deterministic (skips LLM calls, faster iteration).
Test only LLM rubric graders: skillgrade --grader=llm_rubric.
Run a specific eval with a specific grader type: skillgrade --eval=my-eval --grader=deterministic.
If a grader returns unexpected scores, inspect the script output and adjust scoring logic.

Error Handling

If a deterministic grader outputs non-JSON, ensure all echo/console.log statements except the final JSON result are redirected to stderr.
If an LLM rubric grader returns 0.00 with a missing API key message, set the appropriate key for your provider: GEMINI_API_KEY (provider: gemini), ANTHROPIC_API_KEY (provider: anthropic), or OPENAI_API_KEY (provider: openai).
To use a custom or self-hosted LLM endpoint (for example Ollama or vLLM), set ANTHROPIC_BASE_URL (for provider: anthropic) or OPENAI_BASE_URL (for provider: openai).
If scores are inconsistent across trials, reduce rubric ambiguity by adding concrete examples of passing and failing behavior.
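The non-JSON problem described first in this section usually comes from progress messages sharing stdout with the result. One way to keep them apart, sketched here with a hypothetical log helper and a hard-coded result, is to send every diagnostic to stderr and reserve stdout for the final JSON line.

```shell
# log writes diagnostics to stderr so they never mix with the JSON on stdout.
# The helper name and the messages are illustrative assumptions.
log() { echo "[grader] $*" >&2; }

grade() {
  log "starting checks"
  log "checked 3 conditions"
  # Only the final result goes to stdout.
  echo '{"score": 1.0, "details": "3/3 checks passed"}'
}

grade
```

Running the script with 2>/dev/null should then show exactly one line of JSON, which is a quick way to confirm nothing else leaks onto stdout.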
