| name | eval-report-workflow |
| description | Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow. |
Make an Evaluation Report
This workflow drives tools/evaluation_report.py, which reads a per-eval report_config.yaml and produces a full reproducible report.md (results table, reference comparison, per-category breakdowns, token totals, approximate cost) plus header-only JSON copies of the input logs under results/. The report_config.yaml, regenerated report.md, and results/ folder are committed alongside the eval's eval.yaml.
Report Formatting
The evaluation report included in the README.md is the rendered report.md produced by tools/evaluation_report.py. It should run on the entire dataset or $5 of compute per model, whichever is cheaper. Use the token count from smaller runs to make this prediction.
A typical rendered report looks like this:
# Evaluation Report
## Implementation Details
Brief description of any deviations from the paper, known limitations, etc.
## Results
| Model | Inspect (accuracy) | Reference | Δ | Samples | Stderr | Time |
| ------------- | ------------------ | --------- | ------ | ------- | ------ | ---- |
| openai/... | 0.600 | 0.580 | +0.020 | 100/100 | 0.049 | 18s |
| anthropic/... | 0.400 | 0.420 | -0.020 | 100/100 | 0.049 | 6s |
_Reference: Paper, Table 3_
## Reproducibility Information
- Samples: 100 / 100 per model
- Run dates: 2026-04-29
- Versions: inspect_ai=0.3.x, inspect_evals=0.x
- Models: ...
- Total tokens: 1,234,567
- Approximate cost: $0.42 USD (prices as of 2026-04)
Reproduction commands: ...
Register entries: for register entries (register/<name>/eval.yaml), populate the optional evaluation_report block in eval.yaml instead of editing README.md directly — the README is regenerated from the YAML by make check. The block accepts timestamp, a results list (with model, accuracy, and optionally provider, stderr, time, date), and notes. Extra fields at either level are allowed for eval-specific metric columns. See register/README.md for the schema.
If the eval.yaml file includes an arXiv paper, check that paper for the models used and human baselines. Include the human baseline in the notes section if it is present. If you can, select three models that would be suitable to check if this evaluation successfully replicates the original paper, including at least two different model providers.
If you cannot do this, or if there aren't three suitable candidates, use frontier models to fill out the remainder of the list. Currently, these are as follows:
Frontier Models
See references/frontier-models.md for the current list of frontier models and their costs. This file should be updated when models or prices change.
Workflow Steps
-
Set up the working directory:
a. If the user provides specific instructions about any step, assume the user's instructions override these instructions.
b. If there is no evaluation name, ask the user for one.
c. The evaluation name should be the eval folder name plus its version (from the @task function's version argument). For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If this exact folder name already exists, add a number to it via "gpqa_1_1_2_analysis2". This name will be referred to as <eval_name>.
d. Create a folder called agent_artefacts/<eval_name>/evalreport if it isn't present.
e. Whenever you create a .md file as part of this workflow, assume it is made in agent_artefacts/<eval_name>/evalreport.
f. Copy EVALUATION_CHECKLIST.md to the folder.
g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
-
Read the Evaluation Report Guidelines.
-
Check to see if the README for the evaluation already has an evaluation report. If so, double-check with the user that they want it overwritten.
-
Read the main file in EVAL_NAME, which should be src/inspect_evals/EVAL_NAME/EVAL_NAME.py in order to see how many tasks there are.
-
Perform an initial test with 'uv run inspect eval inspect_evals/EVAL_NAME --model gpt-5.1-2025-11-13 --limit 5' to get estimated token counts. Use -T shuffle=True if possible to produce random samples - to see if it's possible, you'll need to check the evaluation itself.
-
Perform model selection as above to decide which models to run.
-
Select a method of randomisation of the samples that ensures all meaningfully different subsets of the data are checked. This is as simple as ensuring the dataset is shuffled in most cases. Explicitly track how many meaningfully different subsets exist.
-
Tell the user what commands to run and how much compute it is expected to cost.
Base your compute calculation on the most expensive model in your list. Each model should be run on the same dataset size regardless of cost, hence we limit it via the most expensive one. If the task will be more expensive than $5 per model to run the full dataset, give the user your estimate for the cost of the full dataset. This means if there are multiple tasks to run, you'll need to split the cost among them according to token usage in the initial estimate. You should assume Gemini reasoning models take roughly 10x the tokens of the other models when performing these calculations.
As a rough example, tell the user something like this:
**I recommend running N samples of this dataset, for an estimated compute cost of $X. I recommend the following models: <list of models>. I recommend them because <reasoning here>. The full dataset is expected to cost roughly <full_cost> for all three models. This is based on running <model> for 5 samples and using Y input tokens and Z output tokens. Please note that this is a rough estimate, especially if samples vary significantly in difficulty.
The command to perform this run is as follows:
uv run inspect eval inspect_evals/<eval_name> --model model1,model2,model3 --limit <limit> --max-tasks <num_tasks> --epochs <epochs>**
After you have run the command, let me know and I'll fill out the evaluation report from there.**
Num tasks should be number of models run * the number of @task functions being tested.
Epochs should be 1 if the limit is below the size of the full dataset, or as many epochs as can be fit into the cost requirements otherwise.
Make sure to give them the command in one line or with multiline escapes to ensure it can be run after copy-pasted.
If the number of samples recommended by this process is less than 20, or less than (3 * meaningful subcategories), you should also inform the user that the number of samples achievable on the $5/model budget is too small for a meaningful evaluation, and that more resources are needed to test properly. A minimum of 20 samples are required for error testing. If they need help, they can ask the repository's maintainers for testing resources.
If the user asks you to run this for them, remind them that they won't be able to see the progress of the evaluation due to the way Inspect works, and asking if they're sure. Do not proactively offer to run the command for them.
-
Once the eval has been run, create src/inspect_evals/<eval_name>/report_config.yaml with the headline metric, any reference_results from the original paper or leaderboard, a reference_source citation, and notes describing implementation details and any deviations. The schema is defined by tools.report_utils.ReportConfig; see tools/README.md for the full set of fields.
-
Run the report script, passing in the .eval files produced by the run:
uv run python tools/evaluation_report.py src/inspect_evals/<eval_name>/report_config.yaml \
--logs logs/file1.eval logs/file2.eval logs/file3.eval
The script writes src/inspect_evals/<eval_name>/report.md and a header-only JSON copy of each input log under src/inspect_evals/<eval_name>/results/<task>/<YYYY-MM-DD>_<model>.json (the machine-readable companion). If any problems arise, ask the user to give you the relevant information manually.
-
Splice the contents of src/inspect_evals/<eval_name>/report.md into the README.md in the appropriate section, and commit report_config.yaml, report.md, and the results/ folder alongside eval.yaml. Then tell the user the task is done. Do not add human baseline or random baseline data unless they already appear in the README.
-
(Optional) If the evaluation uses an LLM judge and an oracle log exists or can be created (e.g., by running the eval with a stronger reference judge, or by collecting human labels via --oracle-labels), run uv run python tools/judge_calibration_diagnostics.py <eval_logs> --oracle-log <reference_log> to produce calibrated estimates with confidence intervals. Include findings in the evaluation report notes if relevant. See tools/README.md for details.