| name | inspect-ai |
| description | Use this skill whenever asked to run Inspect AI evaluations, analyze .eval or .json log files, read trajectory data, or work with the Inspect AI Python API. Triggers include mentions of "inspect eval", ".eval files", "eval logs", "trajectory", "EvalLog", "samples", "scores", or references to the inspect_ai Python package. Also use when asked to analyze agent behavior from evaluation runs, extract scores/metrics, retry failed evals, or compare evaluation results programmatically. |
Inspect AI — Running Evaluations & Analyzing Log Files
This skill covers two core workflows with the Inspect AI framework:
- Running evaluations using the Python API (eval(), eval_set(), eval_retry())
- Reading and analyzing .eval / .json log files using the Log File API
Reference documentation:
Installation
pip install inspect-ai
Provider-specific packages are also needed (e.g. pip install openai, pip install anthropic). API keys must be set as environment variables or in a .env file:
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
INSPECT_LOG_DIR=./logs
For OpenRouter, set:
OPENROUTER_API_KEY=...
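Before launching a run, it can help to confirm that the key for each provider is actually set. The sketch below maps the provider prefix of a model string to the variable names listed above. `PROVIDER_KEYS` and `missing_keys` are hypothetical helpers written for illustration and are not part of inspect_ai.

```python
import os

# Hypothetical mapping from model-name prefix to the environment
# variable named in this document (not an inspect_ai API).
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def missing_keys(models, env=None):
    """Return the sorted names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    missing = set()
    for model in models:
        # "openrouter/openai/gpt-5-mini" -> "openrouter"
        provider = model.split("/", 1)[0]
        key = PROVIDER_KEYS.get(provider)
        if key is not None and not env.get(key):
            missing.add(key)
    return sorted(missing)
```

This checks only the process environment, so keys kept solely in a .env file would be reported as missing unless that file has already been loaded.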
Part 1: Running Evaluations from Python
Basic eval() Usage
from inspect_ai import eval
logs = eval("path/to/task.py", model="openai/gpt-4o")  # run the tasks defined in a file
logs = eval("path/to/task.py@my_task_name", model="anthropic/claude-sonnet-4-0")  # run one named task
logs = eval("agents/task.py@game_1830_multi_agent", model="openrouter/openai/gpt-5-mini")  # via OpenRouter
Running a Task Object Directly
from inspect_ai import eval
from my_tasks import my_task
logs = eval(my_task(), model="openai/gpt-4o")
logs = eval([task_a(), task_b()], model="openai/gpt-4o")
Common eval() Parameters
logs = eval(
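    tasks="task.py",
    model="openai/gpt-4o",
    log_dir="./my-logs",  # where log files are written
    log_format="eval",  # "eval" (default) or "json"
    max_connections=10,  # max concurrent connections to the model API
    max_samples=None,  # max samples to run in parallel (None = default)
    max_tasks=None,  # max tasks to run in parallel (None = default)
    fail_on_error=True,  # fail the eval on the first sample error
    log_level="warning",  # console log level
    tags=["experiment-1"],  # tags stored with the log
    metadata={"version": "2.0"},  # metadata stored with the log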
)
eval() Return Value
eval() returns a list[EvalLog] — one per task evaluated. Always check status:
logs = eval("task.py", model="openai/gpt-4o")
for log in logs:
    if log.status == "success":
        print(f"Results: {log.results}")
        print(f"Samples: {len(log.samples)}")
    elif log.status == "error":
        print(f"Error: {log.error}")
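When several tasks are run at once, the status check above is often easier to act on after grouping. The following is a minimal sketch that relies only on each log having a `status` attribute. `group_by_status` and `all_succeeded` are hypothetical helper names, not inspect_ai functions.

```python
from collections import defaultdict


def group_by_status(logs):
    """Group logs by their status string ("success", "error", ...)."""
    groups = defaultdict(list)
    for log in logs:
        groups[log.status].append(log)
    return dict(groups)


def all_succeeded(logs):
    """True only if there is at least one log and every log succeeded."""
    logs = list(logs)
    return bool(logs) and all(log.status == "success" for log in logs)
```

For example, `group_by_status(logs).get("error", [])` would give the logs worth passing to a retry.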
Running Eval Sets
from inspect_ai import eval_set
success, logs = eval_set(
    tasks=["task_a.py", "task_b.py"],
    model=["openai/gpt-4o", "anthropic/claude-sonnet-4-0"],
    log_dir="logs-run-42",
)
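Passing several tasks and several models means each task is evaluated against each model. As a rough sketch of how many logs to expect from the call above, the pairs can be enumerated directly. `planned_runs` is a hypothetical helper for illustration only.

```python
from itertools import product


def planned_runs(tasks, models):
    """List every (task, model) pair, one per expected log."""
    return list(product(tasks, models))


runs = planned_runs(
    ["task_a.py", "task_b.py"],
    ["openai/gpt-4o", "anthropic/claude-sonnet-4-0"],
)
```

Comparing `len(runs)` with the number of logs returned is a cheap sanity check. If a task file defines more than one task, the real count will be higher.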
Retrying Failed Evals
from inspect_ai import eval_retry
eval_retry("logs/2024-05-29T12-38-43_math_Gprr29Mv.eval")  # retry from a log file path
log = eval("task.py", model="openai/gpt-4o")[0]
if log.status != "success":
    eval_retry(log, max_connections=3)  # retry from an EvalLog, with lower concurrency
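A single retry may not be enough when failures are transient. The sketch below wraps the retry in a bounded loop. It assumes the retry function returns a list of logs (as eval_retry() does) and takes that function as a parameter so the loop can be exercised without calling a model. `retry_until_success` is a hypothetical helper.

```python
def retry_until_success(log, retry_fn, max_attempts=3):
    """Call retry_fn(log) until the status is "success" or attempts run out.

    retry_fn is assumed to return a list of logs, as eval_retry() does.
    Returns (final_log, attempts_used).
    """
    attempts = 0
    while log.status != "success" and attempts < max_attempts:
        log = retry_fn(log)[0]
        attempts += 1
    return log, attempts
```

In real use this would be called as `retry_until_success(log, eval_retry)`. A pause between attempts may be worth adding when the failures are rate limits.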
CLI Equivalents (for reference)
inspect eval task.py --model openai/gpt-4o
inspect eval agents/task.py@game_1830_multi_agent --model openrouter/openai/gpt-5-mini
inspect eval task.py --model openai/gpt-4o --log-dir ./my-logs --max-connections 20
inspect eval-retry logs/2024-05-29T12-38-43_math_Gprr29Mv.eval
inspect eval-set task_a.py task_b.py --model openai/gpt-4o --log-dir logs-run-42
Part 2: Analyzing Log Files (Log File API)
Key Imports
from inspect_ai.log import (
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    list_eval_logs,
    write_eval_log,
    edit_score,
    recompute_metrics,
    resolve_sample_attachments,
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)
Log File Formats
| Type | Description |
|---|---|
| .eval | Binary format: compressed, fast, incremental sample access. Default since v0.3.46. |
| .json | Text JSON: larger, slower for big files, but human-readable. |
Always use the Python Log File API to read/write logs. Never parse the JSON directly — the API handles both formats transparently.
Reading a Full Log
log = read_eval_log("logs/2024-05-29T12-38-43_math_Gprr29Mv.eval")
assert log.status == "success"
print(log.version)
print(log.status)
print(log.eval)
print(log.plan)
print(log.results)
print(log.stats)
print(log.error)
print(log.samples)
print(log.reductions)
print(log.location)
EvalLog Fields Reference
| Field | Type | Description |
|---|---|---|
| version | int | File format version (currently 2) |
| status | str | "started", "success", "cancelled", or "error" |
| eval | EvalSpec | Task name, model, creation time, dataset info, config |
| plan | EvalPlan | Solver list and model generation config |
| results | EvalResults | Aggregate scores computed by scorer metrics |
| stats | EvalStats | Start/completion times and per-model token usage (model_usage) |
| error | EvalError | Error info + traceback (if status == "error") |
| samples | list[EvalSample] | Each sample: input, output, target, messages, events, scores |
| reductions | list[EvalSampleReduction] | Reductions for multi-epoch evaluations |
| location | str | URI the log was read from |
Reading Header Only (Fast — No Samples)
For large logs (multi-GB), read only metadata + aggregate scores:
log_header = read_eval_log("path/to/log.eval", header_only=True)
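Once the header is loaded, the aggregate metrics usually need flattening before they can be compared across runs. The sketch below assumes `log.results.scores` is a list of objects with a `name` and a `metrics` dict whose values expose `.value`. That matches my understanding of EvalResults but should be checked against the installed version. `flatten_metrics` is a hypothetical helper.

```python
def flatten_metrics(results):
    """Map "scorer/metric" to its value for every metric in an EvalResults-like object."""
    flat = {}
    if results is None:  # e.g. the eval did not complete
        return flat
    for score in results.scores:
        for metric_name, metric in score.metrics.items():
            flat[f"{score.name}/{metric_name}"] = metric.value
    return flat
```

Typical use would be `flatten_metrics(log_header.results)`, which yields a plain dict suitable for printing or tabulating.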
Reading Sample Summaries (Efficient Filtering)
summaries = read_eval_log_sample_summaries("path/to/log.eval")
for s in summaries:
    print(f"Sample {s.id}: score={s.scores}, error={s.error}")
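Summaries are a convenient place to decide which samples deserve a full read. As a minimal sketch that uses only the `id`, `scores` and `error` fields shown above, the helper below separates errored samples from scored ones. `split_summaries` is a hypothetical name.

```python
def split_summaries(summaries):
    """Partition sample summaries into (errored_ids, scored_ids)."""
    errored, scored = [], []
    for s in summaries:
        if s.error:
            errored.append(s.id)
        elif s.scores:
            scored.append(s.id)
    return errored, scored
```

The errored ids can then be passed one at a time to read_eval_log_sample() to inspect what went wrong.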
Reading Individual Samples
sample = read_eval_log_sample("path/to/log.eval", id=42)
sample = read_eval_log_sample("path/to/log.eval", id=42, epoch=1)
Streaming All Samples (Memory-Efficient)
for sample in read_eval_log_samples("path/to/log.eval"):
    print(f"Sample {sample.id}: {sample.scores}")
for sample in read_eval_log_samples("path/to/log.eval", all_samples_required=False):  # also accept logs where some samples did not complete
    ...
Listing All Logs in a Directory
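logs = list_eval_logs()  # default log directory (INSPECT_LOG_DIR, else ./logs)
logs = list_eval_logs(log_dir="./my-experiment-logs")  # a specific directory
logs = list_eval_logs(filter=lambda log: log.status == "success")  # filter on the log header
logs = list_eval_logs(recursive=False)  # top-level directory only
for info in logs:
    print(info)  # each item is an EvalLogInfo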
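A log directory often holds several runs of the same task, and usually only the newest is of interest. The sketch below assumes each listed item exposes `task` and `mtime` attributes, which I believe EvalLogInfo does but which is worth verifying. `latest_per_task` is a hypothetical helper.

```python
def latest_per_task(infos):
    """Return {task: info}, keeping the entry with the largest mtime per task."""
    latest = {}
    for info in infos:
        current = latest.get(info.task)
        # mtime may be missing on some filesystems; treat that as oldest.
        if current is None or (info.mtime or 0) > (current.mtime or 0):
            latest[info.task] = info
    return latest
```

Each value could then be handed to read_eval_log(), for instance with header_only=True to keep the pass cheap.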
Analyzing Samples — Common Patterns
log = read_eval_log("path/to/log.eval")
for sample in log.samples:  # per-sample scores, keyed by scorer name
    for scorer_name, score in (sample.scores or {}).items():
        print(f"Sample {sample.id} [{scorer_name}]: {score.value}")
for sample in log.samples:  # message history
    for msg in sample.messages:
        print(f"  [{msg.role}]: {str(msg.content)[:100]}...")  # content may be a string or a list of content parts
for sample in log.samples:  # event transcript
    for event in sample.events:
        print(f"  Event: {event}")
failed = [s for s in log.samples if s.scores and "accuracy" in s.scores and s.scores["accuracy"].value == 0]  # "accuracy" stands in for your scorer's name
passed = [s for s in log.samples if s.scores and "accuracy" in s.scores and s.scores["accuracy"].value == 1]  # some scorers store "C"/"I" rather than 1/0
for model_name, usage in log.stats.model_usage.items():  # token usage is recorded per model
    print(f"{model_name}: input={usage.input_tokens}, output={usage.output_tokens}")
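Because score values differ by scorer (numbers for some, "C"/"I" for others), counting whatever values actually occur is a safer first step than filtering on an assumed value. The sketch below relies only on `sample.scores` being a dict (or None) of objects with a `value`. `tally_scores` is a hypothetical helper.

```python
from collections import Counter


def tally_scores(samples, scorer_name):
    """Count score values for one scorer; samples without that score count as None."""
    counts = Counter()
    for sample in samples:
        score = (sample.scores or {}).get(scorer_name)
        counts[None if score is None else score.value] += 1
    return counts
```

Calling `tally_scores(log.samples, "match")`, with the scorer name taken from the log, shows at a glance which values to filter on.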
Resolving Attachments
Large content (images, etc.) is de-duplicated and stored as attachments. Resolve them when you need the actual content:
from inspect_ai.log import resolve_sample_attachments
sample = read_eval_log_sample("path/to/log.eval", id=42, resolve_attachments=True)
sample = resolve_sample_attachments(sample)
You typically only need this if:
- You want base64 images from input or messages
- You are directly reading the events transcript
Editing Scores
from inspect_ai.log import read_eval_log, write_eval_log, edit_score
from inspect_ai.scorer import ScoreEdit, ProvenanceData
log = read_eval_log("my_eval.eval")
edit = ScoreEdit(
    value=0.95,
    explanation="Corrected model grader bug",
    provenance=ProvenanceData(
        author="jack",
        reason="Scoring bug in original grader",
    ),
)
edit_score(
    log=log,
    sample_id=log.samples[0].id,
    score_name="accuracy",
    edit=edit,
)
write_eval_log(log)
Batch Editing (Defer Metric Recomputation)
from inspect_ai.log import recompute_metrics
edit_score(log, sample_id_1, "accuracy", edit1, recompute_metrics=False)
edit_score(log, sample_id_2, "accuracy", edit2, recompute_metrics=False)
recompute_metrics(log)
write_eval_log(log)
Writing / Copying Logs
write_eval_log(log)
write_eval_log(log, location="./backup/my_log.eval")
Part 3: Dataframes API (for Structured Analysis)
For pandas-style analysis, use the inspect_ai.analysis module:
from inspect_ai.analysis import evals_df, samples_df, messages_df, events_df
df_evals = evals_df("./logs")  # the first argument is the log path(s): a directory or a log file
df_samples = samples_df("path/to/log.eval")
df_messages = messages_df("path/to/log.eval")
df_events = events_df("path/to/log.eval")
Part 4: CLI Log Commands (Non-Python Interop)
inspect log list --json --log-dir ./logs
inspect log list --json --status success
inspect log dump path/to/log.eval
inspect log convert source.json --to eval --output-dir log-output
inspect log convert logs/ --to eval --output-dir logs-eval
inspect log schema
inspect view
Common Workflow: Run → Analyze
from inspect_ai import eval
from inspect_ai.log import read_eval_log
logs = eval("agents/task.py@game_1830_multi_agent", model="openrouter/openai/gpt-5-mini")
log = logs[0]
if log.status == "success":
    print(f"Results: {log.results}")
    for sample in log.samples:
        print(f"\nSample {sample.id}:")
        print(f"  Input: {str(sample.input)[:200]}")
        print(f"  Target: {sample.target}")
        print(f"  Scores: {sample.scores}")
        for msg in sample.messages:
            role = msg.role
            content = str(msg.content)[:300]
            print(f"  [{role}]: {content}")
log = read_eval_log(log.location)  # re-read the same log from disk, e.g. in a later session
Environment Variables Reference
| Variable | Description |
|---|---|
| INSPECT_LOG_DIR | Default log directory (default: ./logs) |
| INSPECT_LOG_FORMAT | Default log format: eval or json |
| INSPECT_LOG_LEVEL | Console log level |
| INSPECT_LOG_LEVEL_TRANSCRIPT | Transcript log level |
| INSPECT_EVAL_MODEL | Default model for evals |
| INSPECT_EVAL_MAX_RETRIES | Default max retries |
| INSPECT_EVAL_MAX_CONNECTIONS | Default max connections |
| INSPECT_EVAL_LOG_IMAGES | Whether to log base64 images |
| INSPECT_EVAL_LOG_FILE_PATTERN | Log filename pattern (e.g. {task}_{model}_{id}) |
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| OPENROUTER_API_KEY | OpenRouter API key |
| GOOGLE_API_KEY | Google API key |
Tips
- Always check log.status == "success" before accessing samples or results.
- Use header_only=True when you only need aggregate metrics; it avoids loading potentially gigabytes of sample data.
- Use the read_eval_log_samples() generator for memory-efficient iteration over large logs.
- Use read_eval_log_sample_summaries() when you only need scores and metadata, not full message histories.
- Resolve attachments only when needed; most analysis doesn't require the raw image/content data.
- The .eval format is strongly preferred over .json: it's ~8x smaller and supports incremental sample access.
- eval() returns EvalLog objects in memory, so you don't need to re-read from disk unless you want to analyze logs from a previous session.
- Use eval_retry() for transient failures (rate limits, network errors); it preserves already-completed samples.