inspect-ai
// Analyze Inspect AI evaluation logs, understand EvalLog structure, extract samples, events, and scoring data using dataframes
| name | inspect-ai |
| description | Analyze Inspect AI evaluation logs, understand EvalLog structure, extract samples, events, and scoring data using dataframes |
| user-invocable | false |
Use this knowledge when working with Inspect AI evaluation logs (.eval or .json files).
.eval files use a compressed binary format: each one is essentially a zip archive containing JSON data. To read them programmatically, use the Inspect Python API.
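Because an .eval file is essentially a zip archive, the standard library can at least list what is inside one. The sketch below builds a small in-memory archive as a stand-in for a real log; the member names (`header.json`, `samples/1_epoch_1.json`) are assumptions for illustration, not a documented layout, so the Inspect Python API remains the right tool for real analysis.

```python
import io
import json
import zipfile


def list_archive_members(data: bytes) -> list[str]:
    """Return the member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# Build a stand-in archive; the member names below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("header.json", json.dumps({"status": "success"}))
    archive.writestr("samples/1_epoch_1.json", json.dumps({"id": 1, "epoch": 1}))

members = list_archive_members(buffer.getvalue())
print(members)
```

For a file on disk, `zipfile.ZipFile(path)` accepts the path directly in place of the `BytesIO` wrapper.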
class EvalLog:
    version: int  # File format version (currently 2)
    status: str  # "started", "success", "cancelled", or "error"
    eval: EvalSpec  # Task, model, creation time
    plan: EvalPlan  # Solvers and generation config
    results: EvalResults  # Aggregate scorer metrics
    stats: EvalStats  # Token usage statistics
    error: EvalError | None  # Error details if status="error"
    samples: list[EvalSample]  # Individual sample records
    reductions: list[EvalSampleReduction]  # Multi-epoch reductions
Always check log.status == "success" before analyzing results.
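The status check can be wrapped in a small guard so that analysis code never touches results from an unfinished or failed run. This is a minimal sketch: `FakeLog` and `results_or_raise` are hypothetical names, and `FakeLog` only mimics the `status`, `results`, and `error` fields of a real log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLog:
    """Hypothetical stand-in exposing only the fields used below."""
    status: str
    results: Optional[dict] = None
    error: Optional[str] = None


def results_or_raise(log):
    """Return log.results, refusing logs that did not finish successfully."""
    if log.status != "success":
        raise ValueError(f"log status is {log.status!r}, not 'success': {log.error}")
    return log.results


ok_log = FakeLog(status="success", results={"accuracy": 0.75})
bad_log = FakeLog(status="error", error="rate limit exceeded")

print(results_or_raise(ok_log))
try:
    results_or_raise(bad_log)
except ValueError as exc:
    print(f"skipped: {exc}")
```

The same function should work on an object returned by `read_eval_log()`, since it only reads the three attributes listed in the class above.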
class EvalSample:
    id: int | str  # Unique sample identifier
    epoch: int  # Epoch number (for multi-epoch runs)
    input: str | list[ChatMessage]  # The prompt/task given to model
    target: str | list[str]  # Expected answer(s)
    choices: list[str] | None  # Multiple choice options if applicable

    # Execution results
    messages: list[ChatMessage]  # Full conversation history
    output: ModelOutput  # Final model output
    scores: dict[str, Score] | None  # Scoring results by scorer name

    # Events and state
    events: list[Event]  # Complete transcript of all events
    store: dict[str, Any]  # State at end of execution
    attachments: dict[str, str]  # Referenced content (images, etc.)

    # Metadata
    metadata: dict[str, Any]  # Custom key-value pairs from dataset
    sandbox: SandboxEnvironmentSpec | None  # Sandbox config
    files: list[str] | None  # Files provided to sandbox
    setup: str | None  # Setup script run in sandbox

    # Timing
    started_at: datetime | None
    completed_at: datetime | None
    total_time: float | None  # Wall clock time
    working_time: float | None  # Active processing time

    # Token usage
    model_usage: dict[str, ModelUsage]  # Tokens by model

    # Error/limit info
    error: EvalError | None  # Error that halted sample
    error_retries: list[EvalError] | None  # Retried errors
    limit: EvalSampleLimit | None  # Limit that halted sample (context/time/message/token)
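The timing and token fields above combine naturally into a per-sample overview. The sketch below uses stand-in objects rather than real `EvalSample` instances, and it assumes each usage entry exposes a `total_tokens` attribute; `sample_overview` is a hypothetical helper name.

```python
from types import SimpleNamespace


def sample_overview(sample):
    """Summarize non-working time and total tokens for one sample-like object."""
    waiting = None
    # Both timing fields are optional, so only subtract when both are present.
    if sample.total_time is not None and sample.working_time is not None:
        waiting = sample.total_time - sample.working_time
    tokens = sum(usage.total_tokens for usage in sample.model_usage.values())
    return {"id": sample.id, "waiting_time": waiting, "total_tokens": tokens}


# Hypothetical stand-in with only the fields the helper reads.
sample = SimpleNamespace(
    id="task-7",
    total_time=12.5,
    working_time=10.0,
    model_usage={
        "model-a": SimpleNamespace(total_tokens=120),
        "model-b": SimpleNamespace(total_tokens=30),
    },
)

overview = sample_overview(sample)
print(overview)
```

A large gap between `total_time` and `working_time` is worth a closer look before comparing runs on speed.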
Events are the core of behavioral analysis. Each sample has an events list containing:
Event = Union[
    SampleInitEvent,  # Sample initialization
    SampleLimitEvent,  # Limit reached
    SandboxEvent,  # Sandbox operations (exec, read_file, write_file)
    StateEvent,  # State changes
    StoreEvent,  # Store updates
    ModelEvent,  # LLM API calls
    ToolEvent,  # Tool invocations
    ApprovalEvent,  # Human approval events
    InputEvent,  # User input
    ScoreEvent,  # Scoring events
    ScoreEditEvent,  # Score modifications
    ErrorEvent,  # Errors
    LoggerEvent,  # Log messages
    InfoEvent,  # Info messages
    SpanBeginEvent,  # Span start (for timing)
    SpanEndEvent,  # Span end
    StepEvent,  # Solver step events
    SubtaskEvent,  # Subtask events
]
class ModelEvent:
    event: "model"
    model: str  # Model name
    input: list[ChatMessage]  # Messages sent to model
    tools: list[ToolInfo]  # Available tools
    tool_choice: ToolChoice  # Tool selection directive
    config: GenerateConfig  # Generation parameters
    output: ModelOutput  # Model response
    retries: int | None  # API retries
    error: str | None  # Error if failed
    cache: "read" | "write" | None  # Cache hit/miss
    timestamp: datetime  # When call started
    completed: datetime | None  # When call finished
    working_time: float | None  # Processing time
class ToolEvent:
    event: "tool"
    id: str  # Unique tool call ID
    function: str  # Tool/function name
    arguments: dict[str, JsonValue]  # Arguments passed
    result: ToolResult  # Return value
    error: ToolCallError | None  # Error if failed
    truncated: tuple[int, int] | None  # If output was truncated
    timestamp: datetime  # When call started
    completed: datetime | None  # When call finished
    working_time: float | None  # Processing time
    agent: str | None  # Agent name if handoff
    failed: bool | None  # Hard failure flag
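Since every event carries an `event` discriminator, a sample's event list can be filtered by type before reading type-specific fields such as `function` or `error`. The sketch below uses hypothetical stand-in objects in place of real events; `tool_call_counts` and `failed_tool_calls` are illustrative helper names, not part of Inspect.

```python
from collections import Counter
from types import SimpleNamespace


def tool_call_counts(events):
    """Count tool invocations per function name, ignoring other event types."""
    return Counter(e.function for e in events if e.event == "tool")


def failed_tool_calls(events):
    """Return the tool events that carry a non-empty error field."""
    return [e for e in events if e.event == "tool" and e.error is not None]


# Hypothetical stand-ins for sample.events; real events carry many more fields.
events = [
    SimpleNamespace(event="model", model="some-model"),
    SimpleNamespace(event="tool", function="bash", error=None),
    SimpleNamespace(event="tool", function="bash", error="timeout"),
    SimpleNamespace(event="tool", function="python", error=None),
]

counts = tool_call_counts(events)
failures = failed_tool_calls(events)
print(dict(counts), len(failures))
```

Checking `e.event` first matters because non-tool events do not have a `function` attribute.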
class Score:
    value: float | str | int | bool | list  # The score value
    answer: str | None  # Model's answer extracted
    explanation: str | None  # Explanation of score
    metadata: dict[str, Any] | None  # Additional scoring metadata
    history: list[ScoreEdit]  # Edit history (history[0] = original)
The inspect_ai.analysis module provides functions to convert logs into Pandas dataframes.
from inspect_ai.analysis import evals_df
df = evals_df("logs") # Read all logs in directory
df = evals_df(["path/to/file1.eval", "path/to/file2.eval"]) # Specific files
Default columns (~51):
- eval_id - Unique evaluation identifier
- log - URI of source file
- task, task_version, task_file, task_arg_* - Task info
- model, model_args, generate_config_* - Model info
- status, error - Completion status
- score_<scorer>_<metric> - All scores expanded as columns
- samples_completed, samples_total
- created, git_commit, tags, metadata_*

Pre-built column groups:
from inspect_ai.analysis import (
    EvalInfo,  # created, tags, metadata, git
    EvalTask,  # task name, file, args, solver
    EvalModel,  # model name, args, generation config
    EvalDataset,  # dataset name, location, sample IDs
    EvalConfig,  # epochs, approval, sample limits
    EvalResults,  # status, errors, samples completed
    EvalScores,  # all scores as separate columns
    EvalColumns,  # all of the above (~50 columns)
)
from inspect_ai.analysis import samples_df, SampleSummary, SampleScores, SampleMessages
# Fast read (summaries only, 12 columns)
df = samples_df("logs")
# With detailed scores
df = samples_df("logs", columns=SampleSummary + SampleScores)
# With message content (slower, loads full samples)
df = samples_df("logs", columns=SampleSummary + SampleMessages)
SampleSummary columns (default, 12 columns):
- sample_id - Globally unique identifier
- eval_id - Links to evaluation
- id, epoch - Sample ID within eval and epoch number
- input, target - Task input and expected output
- metadata_* - Expanded metadata dictionary
- score_* - Score values only
- model_usage - Token counts
- total_time, working_time - Timing data
- error, limit, retries - Failure info
- log - Source file URI

SampleScores adds:
SampleMessages adds:
from inspect_ai.analysis import messages_df
# All messages
df = messages_df("logs")
# Filter by role
df = messages_df("logs", filter=["assistant"])
df = messages_df("logs", filter=["user", "assistant"])
# Custom filter function
df = messages_df("logs", filter=lambda msg: "error" in msg.text.lower())  # .text also covers multi-part content
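A custom filter is just a predicate that takes one message and returns a boolean. The sketch below writes such a predicate defensively, assuming message content may be either a plain string or a list of parts where only some parts expose a `text` attribute. The stand-in messages and the helper names `message_text` and `mentions_error` are hypothetical.

```python
from types import SimpleNamespace


def message_text(msg) -> str:
    """Flatten message content to a single string."""
    content = msg.content
    if isinstance(content, str):
        return content
    # Assumed shape: a list of parts, only some of which have a .text attribute.
    return " ".join(getattr(part, "text", "") for part in content)


def mentions_error(msg) -> bool:
    """Predicate suitable as a filter: True when the message text mentions 'error'."""
    return "error" in message_text(msg).lower()


# Hypothetical stand-in messages covering both content shapes.
messages = [
    SimpleNamespace(role="tool", content="Error: file not found"),
    SimpleNamespace(
        role="assistant",
        content=[SimpleNamespace(text="Retrying after the ERROR"), SimpleNamespace(image="png-bytes")],
    ),
    SimpleNamespace(role="user", content="All good"),
]

flags = [mentions_error(m) for m in messages]
print(flags)
```

A named function like this is easier to test in isolation than an inline lambda, and it can be passed as `filter=mentions_error`.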
Default columns:
- sample_id, eval_id - Links to sample and evaluation
- event_id - Unique message identifier
- role - user, assistant, system, tool
- content - Message text
- source - Origin of message
- tool_calls - Formatted function calls
- tool_call_id, tool_call_function, tool_call_error
- log - Source file URI

from inspect_ai.analysis import (
    events_df,
    EventInfo,  # event type, span ID
    EventTiming,  # start/end times
    ModelEventColumns,  # model event data
    ToolEventColumns,  # tool event data
)
# Must specify columns (events are heterogeneous)
df = events_df("logs", columns=EventInfo + EventTiming)
# Filter to specific event types
df = events_df(
    "logs",
    columns=EventInfo + ToolEventColumns,
    filter=lambda e: e.event == "tool",
)
EventInfo columns:
- event_type - Type of event (model, tool, sandbox, etc.)
- span_id - Span identifier for grouping

EventTiming columns:
- timestamp - When event started
- completed - When event finished
- working_time - Active processing time

Use eval_id and sample_id to join across dataframes:
# Join evals with samples
merged = samples.merge(evals, on='eval_id')
# Join samples with messages
merged = messages.merge(samples, on='sample_id')
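To see the `eval_id` join at work without real logs or pandas, the sketch below uses the standard library's `sqlite3` with hand-written rows standing in for the evals and samples dataframes. The table contents and the `score_accuracy` column (which assumes a scorer named `accuracy`) are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE evals (eval_id TEXT PRIMARY KEY, model TEXT)")
con.execute("CREATE TABLE samples (sample_id TEXT, eval_id TEXT, score_accuracy REAL)")

# Hypothetical rows: two evaluations with two samples each.
con.executemany(
    "INSERT INTO evals VALUES (?, ?)",
    [("e1", "model-a"), ("e2", "model-b")],
)
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("s1", "e1", 1.0), ("s2", "e1", 0.0), ("s3", "e2", 1.0), ("s4", "e2", 1.0)],
)

# Join each sample to its evaluation, then average the score per model.
rows = con.execute(
    """
    SELECT e.model, AVG(s.score_accuracy) AS mean_accuracy
    FROM samples s JOIN evals e ON s.eval_id = e.eval_id
    GROUP BY e.model
    ORDER BY e.model
    """
).fetchall()
print(rows)
```

The join key carries all the structure: every sample row points at exactly one evaluation row, so per-model aggregates follow from a single join and group-by.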
# DuckDB integration
import duckdb
con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))
result = con.execute("""
    SELECT e.model, AVG(s.score_accuracy) AS mean_accuracy
    FROM samples s JOIN evals e ON s.eval_id = e.eval_id
    GROUP BY e.model
""").df()  # fetch the query result as a pandas DataFrame
from inspect_ai.analysis import prepare, model_info, task_info, frontier
# Add model metadata columns
df = prepare(df, model_info())
# Adds: model_organization_name, model_display_name, model_snapshot,
# model_release_date, model_knowledge_cutoff_date
# Map task names to display names
df = prepare(df, task_info({"gpqa_diamond": "GPQA Diamond"}))
# Add frontier indicator (requires model_info first)
df = prepare(df, frontier())
# Adds boolean column: was model top-scoring at release date?
from inspect_ai.log import (
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    list_eval_logs,
)
# Read complete log
log = read_eval_log("path/to/file.eval")
# Read just header (no samples) - fast for large files
log = read_eval_log("path/to/file.eval", header_only=True)
# Stream samples one at a time (memory efficient)
for sample in read_eval_log_samples("path/to/file.eval"):
    process(sample)  # process() is a placeholder for your own handler
# Get lightweight summaries for filtering
summaries = read_eval_log_sample_summaries("path/to/file.eval")
# Read specific sample
sample = read_eval_log_sample("path/to/file.eval", id="sample_id", epoch=1)
# List all logs in directory
logs = list_eval_logs("./logs", recursive=True)
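Streaming pays off when the per-sample work is a running aggregate, because only one sample is held in memory at a time. The sketch below tallies errors and limits from any iterable of sample-like objects; `fake_sample_stream` is a hypothetical stand-in for the generator returned by `read_eval_log_samples()`, and `tally` is an illustrative helper name.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def fake_sample_stream() -> Iterator[SimpleNamespace]:
    """Hypothetical stand-in for a sample generator: yields one sample at a time."""
    yield SimpleNamespace(id=1, epoch=1, error=None, limit=None)
    yield SimpleNamespace(id=2, epoch=1, error="sandbox crashed", limit=None)
    yield SimpleNamespace(id=3, epoch=1, error=None, limit="token")


def tally(samples: Iterable) -> dict:
    """Count total, errored, and limit-halted samples in a single pass."""
    counts = {"total": 0, "errored": 0, "limited": 0}
    for sample in samples:
        counts["total"] += 1
        if sample.error is not None:
            counts["errored"] += 1
        if sample.limit is not None:
            counts["limited"] += 1
    return counts


summary = tally(fake_sample_stream())
print(summary)
```

Because `tally` never builds a list, its memory use stays flat regardless of how many samples the log contains.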
# List logs with filtering
inspect log list --json
inspect log list --status success
# Dump log as JSON
inspect log dump path/to/file.eval
# Convert between formats
inspect log convert file.json --to eval --output-dir ./converted
# Check all evaluations completed successfully
evals = evals_df("logs")
failed = evals[evals['status'] != 'success']
# Find samples with errors
samples = samples_df("logs")
errored = samples[samples['error'].notna()]
# Find samples that hit limits
limited = samples[samples['limit'].notna()]
# Get tool usage patterns
events = events_df(
    "logs",
    columns=EventInfo + ToolEventColumns,
    filter=lambda e: e.event == "tool",
)
tool_counts = events.groupby(['eval_id', 'function']).size()
# Analyze message patterns
messages = messages_df("logs")
msg_counts = messages.groupby(['eval_id', 'role']).size().unstack()
# Compare successful vs failed attempts
samples = samples_df("logs")
successful = samples[samples['score_accuracy'] == 1.0]
failed = samples[samples['score_accuracy'] == 0.0]
evals = evals_df("logs")
by_model = evals.groupby('model').agg({
    'score_accuracy_mean': 'mean',
    'samples_completed': 'sum',
})
- Use SampleSummary (default) for fast reads; it only loads headers
- Use parallel=True for large datasets: samples_df("logs", parallel=True)
- Use header_only=True with read_eval_log() when you don't need samples
- Use read_eval_log_samples() for memory-constrained environments
- Use strict=False to get partial results: df, errors = evals_df("logs", strict=False)