$pwd:

eval-report-workflow

Name: Eval Report Workflow
Author: UKGovernmentBEIS

// Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.


$ git log --oneline --stat

stars:518

forks:336

updated: 24 May 2026, 04:48

File explorer

2 files

SKILL.md

readonly

related-skills.json

same repository

ci-maintenance-workflow.md

from "UKGovernmentBEIS/inspect_evals"

CI and GitHub Actions maintenance workflows — fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.

2026-05-12, stars:518

create-eval.md

from "UKGovernmentBEIS/inspect_evals"

Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository — they live in standalone repos. Use when user asks to create/implement/build a new evaluation.

2026-05-04, stars:518

prepare-submission-workflow.md

from "UKGovernmentBEIS/inspect_evals"

Prepare an evaluation for PR submission as an entry to the register. Use when user asks to prepare an eval for submission or finalize a PR. Trigger when the user asks you to run the "Prepare Evaluation For Submission" workflow.

2026-05-04, stars:518

ensure-test-coverage.md

from "UKGovernmentBEIS/inspect_evals"

Ensure test coverage for a single evaluation - both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).

2026-04-30, stars:518

eval-validity-review.md

from "UKGovernmentBEIS/inspect_evals"

Review a single evaluation's validity — whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).

2026-04-30, stars:518

security-audit-eval.md

from "UKGovernmentBEIS/inspect_evals"

Audit a third-party Inspect AI evaluation for security risks before running it locally. Decide whether the eval is safe by checking for malicious host-side code, externally-fetched files that aren't quality-controlled, sandbox-breakout instructions, weak sandbox configuration, supply-chain hazards, credential exposure, resource exhaustion, and provenance signals. Use when the user asks to audit / vet / security-review an eval repo (GitHub URL or local path), or asks "is it safe to run X". Do NOT use for assessing whether an eval *measures what it claims* (use eval-validity-review) or for general code-quality review (use eval-quality-workflow / code-quality-review-all).

2026-04-30, stars:518

package.json

"author": "UKGovernmentBEIS"

"repository": "UKGovernmentBEIS/inspect_evals"


$ useful --forSOC

Software Quality Assurance Analysts and Testers, Computer and Mathematical Occupations, 15-1253, L4

name: eval-report-workflow
description: Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.

Make an Evaluation Report

This workflow drives tools/evaluation_report.py, which reads a per-eval report_config.yaml and produces a full reproducible report.md (results table, reference comparison, per-category breakdowns, token totals, approximate cost) plus header-only JSON copies of the input logs under results/. The report_config.yaml, regenerated report.md, and results/ folder are committed alongside the eval's eval.yaml.

Report Formatting

The evaluation report included in the README.md is the rendered report.md produced by tools/evaluation_report.py. The underlying evaluation should be run on the entire dataset or on as many samples as $5 of compute per model allows, whichever is cheaper. Use the token counts from smaller trial runs to predict this cost.
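The budgeting rule can be sketched as a small calculation. This is a minimal illustration and not part of tools/evaluation_report.py: the function name, the per-million-token prices, and the trial numbers are hypothetical placeholders, and real prices should come from references/frontier-models.md.

```python
import math


def estimate_sample_limit(
    trial_samples: int,
    trial_input_tokens: int,
    trial_output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    dataset_size: int,
    budget_usd: float = 5.0,
) -> tuple[int, float]:
    """Return (sample limit, estimated full-dataset cost) for one model."""
    # Cost of the trial run, from its token counts and per-million-token prices.
    trial_cost = (
        trial_input_tokens * usd_per_m_input
        + trial_output_tokens * usd_per_m_output
    ) / 1_000_000
    cost_per_sample = trial_cost / trial_samples
    full_cost = cost_per_sample * dataset_size
    # Whichever is cheaper: the whole dataset, or what the budget can buy.
    affordable = math.floor(budget_usd / cost_per_sample)
    return min(dataset_size, affordable), full_cost


# Hypothetical trial: 5 samples used 40,000 input and 10,000 output tokens,
# priced at $2 / $10 per million input / output tokens, on a 1,000-sample dataset.
limit, full_cost = estimate_sample_limit(5, 40_000, 10_000, 2.0, 10.0, 1_000)
print(limit, round(full_cost, 2))
```

With these made-up numbers the full dataset would exceed the $5 budget, so the limit is capped by the budget rather than by the dataset size.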

A typical rendered report looks like this:

# Evaluation Report

## Implementation Details

Brief description of any deviations from the paper, known limitations, etc.

## Results

| Model         | Inspect (accuracy) | Reference | Δ      | Samples | Stderr | Time |
| ------------- | ------------------ | --------- | ------ | ------- | ------ | ---- |
| openai/...    | 0.600              | 0.580     | +0.020 | 100/100 | 0.049  | 18s  |
| anthropic/... | 0.400              | 0.420     | -0.020 | 100/100 | 0.049  | 6s   |

_Reference: Paper, Table 3_

## Reproducibility Information

- Samples: 100 / 100 per model
- Run dates: 2026-04-29
- Versions: inspect_ai=0.3.x, inspect_evals=0.x
- Models: ...
- Total tokens: 1,234,567
- Approximate cost: $0.42 USD (prices as of 2026-04)

Reproduction commands: ...

Register entries: for register entries (register/<name>/eval.yaml), populate the optional evaluation_report block in eval.yaml instead of editing README.md directly — the README is regenerated from the YAML by make check. The block accepts timestamp, a results list (with model, accuracy, and optionally provider, stderr, time, date), and notes. Extra fields at either level are allowed for eval-specific metric columns. See register/README.md for the schema.
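As a rough illustration of the block described above, the sketch below builds an evaluation_report structure as a Python dictionary and checks that every results entry carries the two non-optional fields. The helper name, the value formats (for example the timestamp and time strings), and the check itself are assumptions made for illustration; register/README.md remains the authority on the schema.

```python
# Fields the text lists without marking them optional (assumed to be required).
REQUIRED_RESULT_KEYS = {"model", "accuracy"}


def missing_result_fields(block: dict) -> list[str]:
    """List results entries that lack a required field (hypothetical helper)."""
    problems = []
    for i, row in enumerate(block.get("results", [])):
        for key in sorted(REQUIRED_RESULT_KEYS.difference(row)):
            problems.append(f"results[{i}] is missing '{key}'")
    return problems


# Placeholder values mirroring the sample table; formats are assumptions.
evaluation_report = {
    "timestamp": "2026-04-29",
    "results": [
        {"model": "openai/...", "accuracy": 0.600, "stderr": 0.049, "time": "18s"},
        {"model": "anthropic/...", "accuracy": 0.400, "stderr": 0.049, "time": "6s"},
    ],
    "notes": "Brief description of any deviations from the paper.",
}

print(missing_result_fields(evaluation_report))
```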

If the eval.yaml file includes an arXiv paper, check that paper for the models used and human baselines. Include the human baseline in the notes section if it is present. If you can, select three models that would be suitable to check if this evaluation successfully replicates the original paper, including at least two different model providers.

If you cannot do this, or if there aren't three suitable candidates, use frontier models to fill out the remainder of the list. The current frontier models are listed in the next section.

Frontier Models

See references/frontier-models.md for the current list of frontier models and their costs. This file should be updated when models or prices change.

Workflow Steps

Set up the working directory:
   a. If the user provides specific instructions about any step, assume the user's instructions override these instructions.
   b. If there is no evaluation name, ask the user for one.
   c. The evaluation name should be the eval folder name plus its version (from the @task function's version argument). For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If this exact folder name already exists, add a number to it via "gpqa_1_1_2_analysis2". This name will be referred to as <eval_name>.
   d. Create a folder called agent_artefacts/<eval_name>/evalreport if it isn't present.
   e. Whenever you create a .md file as part of this workflow, assume it is made in agent_artefacts/<eval_name>/evalreport.
   f. Copy EVALUATION_CHECKLIST.md to the folder.
   g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
Read the Evaluation Report Guidelines.
Check to see if the README for the evaluation already has an evaluation report. If so, double-check with the user that they want it overwritten.
Read the eval's main file, which should be src/inspect_evals/EVAL_NAME/EVAL_NAME.py, to see how many tasks there are.
Perform an initial test with 'uv run inspect eval inspect_evals/EVAL_NAME --model gpt-5.1-2025-11-13 --limit 5' to get estimated token counts. Use -T shuffle=True if possible to produce random samples; to find out whether the task supports this, check the evaluation's code.
Perform model selection as above to decide which models to run.
Select a method of randomising the samples that ensures all meaningfully different subsets of the data are checked. In most cases this is as simple as ensuring the dataset is shuffled. Explicitly track how many meaningfully different subsets exist.
Tell the user what commands to run and how much compute it is expected to cost.

Base your compute calculation on the most expensive model in your list. Every model should be run on the same number of samples regardless of its cost, so the most expensive model sets the limit. If running the full dataset would cost more than $5 per model, give the user your estimate for the cost of the full dataset. If there are multiple tasks to run, split the cost among them according to their token usage in the initial estimate. When performing these calculations, assume Gemini reasoning models take roughly 10x the tokens of the other models.
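One way to read these rules is sketched below. It is a simplified illustration, not the repository's method: the function names, model names, and prices are hypothetical, a single blended price per million tokens stands in for separate input and output prices, and "splitting the cost" is interpreted as dividing the per-model budget in proportion to trial token usage.

```python
def most_expensive_model(
    usd_per_m_tokens: dict[str, float],
    gemini_reasoning_models: set[str],
    token_multiplier: float = 10.0,
) -> str:
    """Pick the model with the highest effective cost per sample."""

    def effective_price(model: str) -> float:
        # Gemini reasoning models are assumed to use roughly 10x the tokens.
        factor = token_multiplier if model in gemini_reasoning_models else 1.0
        return usd_per_m_tokens[model] * factor

    return max(usd_per_m_tokens, key=effective_price)


def split_budget_by_tokens(
    trial_tokens_per_task: dict[str, int], budget_usd: float = 5.0
) -> dict[str, float]:
    """Divide the per-model budget across tasks by trial token usage."""
    total = sum(trial_tokens_per_task.values())
    return {
        task: budget_usd * tokens / total
        for task, tokens in trial_tokens_per_task.items()
    }


# Hypothetical model names and blended prices (USD per million tokens).
prices = {"provider_a/model_x": 10.0, "provider_b/model_y": 3.0, "google/gemini-z": 2.0}
priciest = most_expensive_model(prices, {"google/gemini-z"})
shares = split_budget_by_tokens({"task_a": 30_000, "task_b": 10_000})
print(priciest, shares)
```

Here the nominally cheapest model becomes the limiting one once the 10x token multiplier is applied, which is why the multiplier matters for the estimate.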

As a rough example, tell the user something like this:

**I recommend running N samples of this dataset, for an estimated compute cost of $X. I recommend the following models: <list of models>. I recommend them because <reasoning here>. The full dataset is expected to cost roughly <full_cost> for all three models. This is based on running <model> for 5 samples and using Y input tokens and Z output tokens. Please note that this is a rough estimate, especially if samples vary significantly in difficulty.**

**The command to perform this run is as follows:**

uv run inspect eval inspect_evals/<eval_name> --model model1,model2,model3 --limit <limit> --max-tasks <num_tasks> --epochs <epochs>

**After you have run the command, let me know and I'll fill out the evaluation report from there.**

<num_tasks> should be the number of models run multiplied by the number of @task functions being tested. <epochs> should be 1 if the limit is below the size of the full dataset; otherwise, use as many epochs as fit within the cost requirements.

Make sure to give them the command on one line, or with multiline escapes, so that it runs correctly when copy-pasted.
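The rules for <num_tasks>, <epochs>, and the one-line command can be sketched as follows. The helper names are hypothetical, and the epochs rule is one interpretation of "as many epochs as fit within the cost requirements", expressed through an assumed count of samples the budget can afford.

```python
import shlex


def choose_epochs(limit: int, dataset_size: int, affordable_samples: int) -> int:
    """Use 1 epoch for a partial run; otherwise as many full passes as fit."""
    if limit < dataset_size:
        return 1
    return max(1, affordable_samples // dataset_size)


def build_eval_command(
    eval_name: str, models: list[str], n_task_functions: int, limit: int, epochs: int
) -> str:
    """Assemble the run command as a single copy-pasteable line."""
    max_tasks = len(models) * n_task_functions  # models x @task functions
    args = [
        "uv", "run", "inspect", "eval", f"inspect_evals/{eval_name}",
        "--model", ",".join(models),
        "--limit", str(limit),
        "--max-tasks", str(max_tasks),
        "--epochs", str(epochs),
    ]
    # shlex.join quotes any argument that would otherwise break in a shell.
    return shlex.join(args)


command = build_eval_command("gpqa", ["model1", "model2", "model3"], 1, 100, 1)
print(command)
```

Building the command from an argument list and joining it once avoids stray line breaks, which is the failure mode the copy-paste advice is guarding against.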

If the number of samples recommended by this process is fewer than 20, or fewer than (3 * meaningful subcategories), you should also inform the user that the number of samples achievable on the $5/model budget is too small for a meaningful evaluation, and that more resources are needed to test properly. A minimum of 20 samples is required for error testing. If they need help, they can ask the repository's maintainers for testing resources.
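The sufficiency check in the paragraph above reduces to two comparisons. A minimal sketch, with a hypothetical function name:

```python
def sample_budget_is_sufficient(
    n_samples: int, n_subcategories: int, minimum: int = 20
) -> bool:
    """True when the run meets both thresholds: at least 20 samples and
    at least 3 samples per meaningful subcategory."""
    return n_samples >= minimum and n_samples >= 3 * n_subcategories


# 25 samples cover 4 subcategories, but not 10 (which would need 30).
print(sample_budget_is_sufficient(25, 4), sample_budget_is_sufficient(25, 10))
```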

If the user asks you to run this for them, remind them that they won't be able to see the progress of the evaluation due to the way Inspect works, and ask whether they're sure. Do not proactively offer to run the command for them.
Once the eval has been run, create src/inspect_evals/<eval_name>/report_config.yaml with the headline metric, any reference_results from the original paper or leaderboard, a reference_source citation, and notes describing implementation details and any deviations. The schema is defined by tools.report_utils.ReportConfig; see tools/README.md for the full set of fields.
Run the report script, passing in the .eval files produced by the run:
```
uv run python tools/evaluation_report.py src/inspect_evals/<eval_name>/report_config.yaml \
  --logs logs/file1.eval logs/file2.eval logs/file3.eval
```
The script writes src/inspect_evals/<eval_name>/report.md and a header-only JSON copy of each input log under src/inspect_evals/<eval_name>/results/<task>/<YYYY-MM-DD>_<model>.json (the machine-readable companion). If any problems arise, ask the user to give you the relevant information manually.
Splice the contents of src/inspect_evals/<eval_name>/report.md into the README.md in the appropriate section, and commit report_config.yaml, report.md, and the results/ folder alongside eval.yaml. Then tell the user the task is done. Do not add human baseline or random baseline data unless they already appear in the README.
(Optional) If the evaluation uses an LLM judge and an oracle log exists or can be created (e.g., by running the eval with a stronger reference judge, or by collecting human labels via --oracle-labels), run uv run python tools/judge_calibration_diagnostics.py <eval_logs> --oracle-log <reference_log> to produce calibrated estimates with confidence intervals. Include findings in the evaluation report notes if relevant. See tools/README.md for details.
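The naming rule from the working-directory step (folder name plus version, with an "_analysisN" suffix when the name is taken) can be sketched as below. The function name is hypothetical, and continuing the suffix past "analysis2" is an assumption, since the workflow only gives the first fallback name. The example uses a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path


def artefact_name(eval_folder: str, version: str, root: Path) -> str:
    """Derive <eval_name>, e.g. ("gpqa", "1.1.2") -> "gpqa_1_1_2"."""
    base = f"{eval_folder}_{version.replace('.', '_')}"
    if not (root / base).exists():
        return base
    # Name already taken: try _analysis2, _analysis3, ... (assumed continuation).
    n = 2
    while (root / f"{base}_analysis{n}").exists():
        n += 1
    return f"{base}_analysis{n}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "agent_artefacts"
    first = artefact_name("gpqa", "1.1.2", root)
    # Create agent_artefacts/<eval_name>/evalreport, as the workflow requires.
    (root / first / "evalreport").mkdir(parents=True)
    second = artefact_name("gpqa", "1.1.2", root)

print(first, second)
```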
