| name | read-eval-logs |
| description | View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts. |
Analysing Eval Log Files
This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.
Quick Reference
CLI Commands
uv run inspect log list --json
uv run inspect log list --json --status success
uv run inspect log list --json --status error
uv run inspect log list --json --retryable
uv run inspect log dump <log_file_path>
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval
uv run inspect log schema
Interactive Log Viewer
uv run inspect view
Python API
Key Imports
from inspect_ai.log import (
list_eval_logs,
read_eval_log,
read_eval_log_sample,
read_eval_log_samples,
read_eval_log_sample_summaries,
write_eval_log,
retryable_eval_logs,
recompute_metrics,
resolve_sample_attachments,
EvalLog,
EvalLogInfo,
EvalSample,
EvalSampleSummary,
)
Listing Logs
logs = list_eval_logs()
logs = list_eval_logs(log_dir="./experiment-logs")
logs = list_eval_logs(formats=["eval"])
logs = list_eval_logs(filter=lambda log: log.status == "success")
logs = list_eval_logs(recursive=False)
Reading Logs
log = read_eval_log("path/to/logfile.eval")
log = read_eval_log("path/to/logfile.eval", header_only=True)
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)
Reading Samples
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")
for sample in read_eval_log_samples("path/to/logfile.eval"):
process(sample)
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
process(sample)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")
Filtering Samples
errors = []
for summary in read_eval_log_sample_summaries(log_file):
if summary.error is not None:
errors.append(
read_eval_log_sample(log_file, summary.id, summary.epoch)
)
EvalLog Structure
The EvalLog object contains:
| Field | Type | Description |
|---|
version | int | File format version (currently 2) |
status | str | "started", "success", "cancelled", or "error" |
eval | EvalSpec | Task, model, creation time, config |
plan | EvalPlan | Solvers and generation config |
results | EvalResults | Aggregate scores and metrics |
stats | EvalStats | Runtime, model usage statistics |
error | EvalError | Error info if status == "error" |
samples | list[EvalSample] | Individual samples (if not header_only) |
reductions | list[EvalSampleReductions] | Multi-epoch reductions |
location | str | URI where log was read from |
Always Check Status
log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
for score in log.results.scores:
print(f"{score.name}: {score.metrics}")
EvalSample Structure
Each sample contains:
| Field | Type | Description |
|---|
id | int | str | Unique sample ID |
epoch | int | Epoch number |
input | str | list[ChatMessage] | Sample input |
target | str | list[str] | Expected target(s) |
messages | list[ChatMessage] | Full conversation history |
output | ModelOutput | Model's output |
scores | dict[str, Score] | Scores from scorers |
metadata | dict[str, Any] | Sample metadata |
store | dict[str, Any] | State at end of execution |
events | list[Event] | Transcript events |
error | EvalError | Error if sample failed |
total_time | float | Total sample runtime |
model_usage | dict[str, ModelUsage] | Token usage |
Common Analysis Patterns
Get Aggregate Metrics
log = read_eval_log(log_file, header_only=True)
if log.results:
for score in log.results.scores:
print(f"Scorer: {score.name}")
for metric_name, metric in score.metrics.items():
print(f" {metric_name}: {metric.value}")
Analyse Failed Samples
log = read_eval_log(log_file)
if log.samples:
failed = [s for s in log.samples if s.error is not None]
for sample in failed:
print(f"Sample {sample.id}: {sample.error.message}")
Extract Model Usage
log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")
Compare Multiple Runs
logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
log = read_eval_log(log_info, header_only=True)
if log.results and log.results.scores:
accuracy = log.results.scores[0].metrics.get("accuracy")
print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
Find Retryable Logs
all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
print(f"Can retry: {log_info.name}")
Log File Formats
| Type | Description |
|---|
.eval | Binary format, ~1/8 size of JSON, fast incremental access |
.json | Text format, human-readable, slower for large files |
Both formats are fully supported by the API and can be intermixed.
Working with Large Logs
For logs too large to fit in memory:
- Use
.eval format - supports compression and incremental access
- Read header only -
read_eval_log(log_file, header_only=True)
- Stream samples -
read_eval_log_samples() yields one at a time
- Use summaries -
read_eval_log_sample_summaries() for quick overview
Modifying Logs
log = read_eval_log(log_file)
write_eval_log(log)
write_eval_log(log, location="new_path.eval")
recompute_metrics(log)
write_eval_log(log)
Environment Variables
| Variable | Description |
|---|
INSPECT_LOG_DIR | Default log directory (default: ./logs) |
INSPECT_EVAL_LOG_FILE_PATTERN | Log filename pattern (e.g., {task}_{model}_{id}) |
Related Tools
- Inspect Scout - Transcript analysis
- Inspect Viz - Data visualization
- Log Dataframes - Extract pandas DataFrames from logs (see
inspect_ai.analysis)