read-eval-logs

Last updated: 6 March 2026, 05:48

View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.

Source: UKGovernmentBEIS/inspect_evals

Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

Quick Reference

CLI Commands

# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema
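
The `--json` listings above can also be driven from Python when a script needs the same filters. The sketch below is a minimal wrapper, not part of Inspect: `build_list_command` and `list_logs_via_cli` are hypothetical helper names, and it assumes `uv` and Inspect are installed and that the command prints a single JSON document to stdout.

```python
import json
import subprocess


def build_list_command(status=None, retryable=False):
    """Build the argv for `uv run inspect log list --json` with optional filters."""
    cmd = ["uv", "run", "inspect", "log", "list", "--json"]
    if status is not None:
        cmd += ["--status", status]
    if retryable:
        cmd.append("--retryable")
    return cmd


def list_logs_via_cli(status=None, retryable=False):
    """Run the listing command and parse its stdout as JSON (hypothetical helper).

    Assumption: stdout holds one JSON document; the exact shape is not checked here.
    """
    result = subprocess.run(
        build_list_command(status, retryable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return json.loads(result.stdout)
```

Keeping the command builder separate from the subprocess call makes the flags easy to check without running Inspect.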

Interactive Log Viewer

# Start the interactive log viewer (updates automatically)
uv run inspect view

Python API

Key Imports

from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)

Listing Logs

# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)

Reading Logs

# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)

Reading Samples

# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")

Filtering Samples

# Use lightweight summaries to find errored samples, then load only those in full
log_file = "path/to/logfile.eval"
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        errors.append(
            read_eval_log_sample(log_file, id=summary.id, epoch=summary.epoch)
        )

EvalLog Structure

The EvalLog object contains:

| Field | Type | Description |
| --- | --- | --- |
| `version` | `int` | File format version (currently 2) |
| `status` | `str` | `"started"`, `"success"`, `"cancelled"`, or `"error"` |
| `eval` | `EvalSpec` | Task, model, creation time, config |
| `plan` | `EvalPlan` | Solvers and generation config |
| `results` | `EvalResults` | Aggregate scores and metrics |
| `stats` | `EvalStats` | Runtime, model usage statistics |
| `error` | `EvalError` | Error info if `status == "error"` |
| `samples` | `list[EvalSample]` | Individual samples (if not header_only) |
| `reductions` | `list[EvalSampleReductions]` | Multi-epoch reductions |
| `location` | `str` | URI where log was read from |

Always Check Status

log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")
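
As a minimal sketch of the check above, the helper below returns metrics only when the run succeeded and results are present. `metrics_if_successful` is a hypothetical name, and the `SimpleNamespace` object is a stand-in that mimics only the fields used here, not a real `EvalLog`.

```python
from types import SimpleNamespace


def metrics_if_successful(log):
    """Return {scorer: {metric: value}} for a successful log, else an empty dict."""
    if log.status != "success" or log.results is None:
        return {}
    return {
        score.name: {name: metric.value for name, metric in score.metrics.items()}
        for score in log.results.scores
    }


# Stand-in object with the same attribute names as the fields accessed above.
fake_log = SimpleNamespace(
    status="success",
    results=SimpleNamespace(
        scores=[
            SimpleNamespace(
                name="match",
                metrics={"accuracy": SimpleNamespace(value=0.75)},
            )
        ]
    ),
)
print(metrics_if_successful(fake_log))  # {'match': {'accuracy': 0.75}}
```

With a real log, the same function could be called on the result of `read_eval_log(..., header_only=True)`, since it touches only header fields.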

EvalSample Structure

Each sample contains:

| Field | Type | Description |
| --- | --- | --- |
| `id` | `int \| str` | Unique sample ID |
| `epoch` | `int` | Epoch number |
| `input` | `str \| list[ChatMessage]` | Sample input |
| `target` | `str \| list[str]` | Expected target(s) |
| `messages` | `list[ChatMessage]` | Full conversation history |
| `output` | `ModelOutput` | Model's output |
| `scores` | `dict[str, Score]` | Scores from scorers |
| `metadata` | `dict[str, Any]` | Sample metadata |
| `store` | `dict[str, Any]` | State at end of execution |
| `events` | `list[Event]` | Transcript events |
| `error` | `EvalError` | Error if sample failed |
| `total_time` | `float` | Total sample runtime |
| `model_usage` | `dict[str, ModelUsage]` | Token usage |
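
To illustrate how the `scores` field might be used, the sketch below counts score values for one scorer across samples. `tally_score_values` is a hypothetical helper. It assumes each `Score` exposes a `value` attribute and that `scores` may be empty or `None` for unscored samples. The stand-in objects mimic only those fields.

```python
from collections import Counter
from types import SimpleNamespace


def tally_score_values(samples, scorer):
    """Count how often each score value occurs for one scorer across samples."""
    counts = Counter()
    for sample in samples:
        scores = sample.scores or {}  # treat a missing scores dict as empty
        if scorer in scores:
            counts[str(scores[scorer].value)] += 1
        else:
            counts["<unscored>"] += 1
    return counts


# Stand-ins for EvalSample objects; the "C" and "I" values are illustrative.
fake_samples = [
    SimpleNamespace(scores={"match": SimpleNamespace(value="C")}),
    SimpleNamespace(scores={"match": SimpleNamespace(value="I")}),
    SimpleNamespace(scores={"match": SimpleNamespace(value="C")}),
    SimpleNamespace(scores=None),
]
print(tally_score_values(fake_samples, "match"))
```

Because the function accepts any iterable, it should also work with a stream of samples rather than a fully loaded list.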

Common Analysis Patterns

Get Aggregate Metrics

log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")

Analyse Failed Samples

log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")

Extract Model Usage

log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")

Compare Multiple Runs

# The filter receives header-only logs, so matching on the task name is cheap
logs = list_eval_logs(filter=lambda log: log.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        # Assumes the first scorer reports an "accuracy" metric
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
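
When comparing many runs, it can help to collect `(model, accuracy)` pairs first and rank them afterwards. The sketch below shows one way to do that. `rank_runs` is a hypothetical helper, the rows are made-up values rather than real results, and a missing accuracy is represented as `None`.

```python
def rank_runs(rows):
    """Sort (model, accuracy) pairs best-first; runs without an accuracy go last."""
    return sorted(rows, key=lambda row: (row[1] is None, -(row[1] or 0.0)))


# Made-up values for illustration only, not real evaluation results.
rows = [("model-a", 0.5), ("model-b", None), ("model-c", 0.9)]
for model, accuracy in rank_runs(rows):
    print(f"{model}: {accuracy if accuracy is not None else 'N/A'}")
```

In practice the rows would be built inside the comparison loop from `log.eval.model` and the metric's `value`.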

Find Retryable Logs

all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")

Log File Formats

| Type | Description |
| --- | --- |
| `.eval` | Binary format, ~1/8 size of JSON, fast incremental access |
| `.json` | Text format, human-readable, slower for large files |

Both formats are fully supported by the API and can be intermixed.

Working with Large Logs

For logs too large to fit in memory:

- Use `.eval` format: supports compression and incremental access
- Read header only: `read_eval_log(log_file, header_only=True)`
- Stream samples: `read_eval_log_samples()` yields one at a time
- Use summaries: `read_eval_log_sample_summaries()` for a quick overview
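
The streaming approach can be sketched as a single pass that keeps only running totals. `stream_totals` is a hypothetical helper. It assumes each sample has `error` and `total_time` attributes as described in the EvalSample structure, and that `total_time` may be `None`. The generator below yields stand-in objects instead of reading a real log.

```python
from types import SimpleNamespace


def stream_totals(samples):
    """Accumulate counts over an iterable of samples without storing them."""
    totals = {"samples": 0, "errors": 0, "total_time": 0.0}
    for sample in samples:
        totals["samples"] += 1
        if sample.error is not None:
            totals["errors"] += 1
        totals["total_time"] += sample.total_time or 0.0  # None counts as 0
    return totals


def fake_sample_stream():
    """Stand-in for read_eval_log_samples(); yields one object at a time."""
    yield SimpleNamespace(error=None, total_time=1.5)
    yield SimpleNamespace(error=SimpleNamespace(message="boom"), total_time=2.0)
    yield SimpleNamespace(error=None, total_time=None)


print(stream_totals(fake_sample_stream()))
```

With a real log, the argument would be `read_eval_log_samples("path/to/logfile.eval", all_samples_required=False)`, so only one sample is held in memory at a time.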

Modifying Logs

# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)

Environment Variables

| Variable | Description |
| --- | --- |
| `INSPECT_LOG_DIR` | Default log directory (default: `./logs`) |
| `INSPECT_EVAL_LOG_FILE_PATTERN` | Log filename pattern (e.g., `{task}_{model}_{id}`) |
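
A script that wants to mirror the default directory lookup could read `INSPECT_LOG_DIR` itself. This is a minimal sketch under the assumption that an unset or empty variable falls back to `./logs`. `resolve_log_dir` is a hypothetical helper and does not call into Inspect.

```python
import os


def resolve_log_dir(env=None):
    """Return INSPECT_LOG_DIR if set and non-empty, otherwise "./logs"."""
    env = os.environ if env is None else env  # allow a plain dict in tests
    return env.get("INSPECT_LOG_DIR") or "./logs"


print(resolve_log_dir({"INSPECT_LOG_DIR": "./experiment-logs"}))  # ./experiment-logs
print(resolve_log_dir({}))  # ./logs
```

The resolved path can then be passed explicitly, for example as `list_eval_logs(log_dir=resolve_log_dir())`.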

Related Tools

- Inspect Scout: transcript analysis
- Inspect Viz: data visualization
- Log Dataframes: extract pandas DataFrames from logs (see `inspect_ai.analysis`)