updated:March 23, 2026 at 15:39
SKILL.md
| name | eval-parquet |
| description | Working with Eval Parquet Files |
| advertise | true |
You can create and access Parquet files produced by evals/eda/eval_to_parquet_cli.py.
These files contain flattened results from Keystone eval runs.
# S3 → local
uv run python evals/eda/eval_to_parquet_cli.py \
s3://int8-datasets/keystone/evals/<run_name> \
./results.parquet
# Local → local
uv run python evals/eda/eval_to_parquet_cli.py \
/path/to/eval_output /tmp/results.parquet
Each row is one trial (one agent run on one repo). Columns:
| Column | Type | Description |
|---|---|---|
| source_path | str | fsspec path to the original eval_result.json this row was read from |
| raw_json | str | Full KeystoneRepoResult serialized as JSON; deserialize with KeystoneRepoResult.model_validate_json(row.raw_json) to access any field |
| config_name | str \| null | EvalConfig.name, the primary identifier for which eval configuration was used (e.g. model name or experiment label) |
| repo_id | str | Unique repo identifier (e.g. "requests", "flask") |
| trial_index | int \| null | Trial number (0-indexed) when trials_per_repo > 1 |
| success | bool | Whether the keystone run succeeded |
| error_message | str \| null | Error message if success is False |
| agent_exit_code | int \| null | Agent process exit code |
| agent_walltime_seconds | float \| null | Wall-clock time for the agent |
| agent_timed_out | bool \| null | Whether the agent hit its time limit |
| cost_usd | float \| null | Inference cost in USD |
| input_tokens | int \| null | LLM input tokens consumed |
| output_tokens | int \| null | LLM output tokens consumed |
| image_build_seconds | float \| null | Docker image build time |
| test_execution_seconds | float \| null | Test suite execution time |
| tests_passed | int \| null | Number of unique tests passed (deduplicated by test name; logical OR across duplicate runs) |
| tests_failed | int \| null | Number of unique tests failed (deduplicated; tests where no execution passed) |
| tests_discovered | int \| null | Total unique test names discovered (= tests_passed + tests_failed) |
| summary | str \| null | agent.summary.message, the agent's final summary of what it did |
| status_messages | str (JSON) | JSON array of {"timestamp": "...", "message": "..."} objects recording the agent's step-by-step progress |
Rows are uniquely identified by (config_name, repo_id, trial_index).
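Before analyzing a file, it can be worth confirming that it satisfies the two invariants stated above: the key uniqueness and tests_discovered = tests_passed + tests_failed. The following is a minimal sketch of such a check; the helper name check_eval_frame and the synthetic demo values are illustrative assumptions, not part of the eval tooling.

```python
import pandas as pd

# Key columns as documented above.
KEY = ["config_name", "repo_id", "trial_index"]


def check_eval_frame(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: return a list of problems; empty means both checks passed."""
    problems = []

    # Every row involved in a repeated key, not just the later copies.
    dupes = df[df.duplicated(subset=KEY, keep=False)]
    if not dupes.empty:
        problems.append(f"{len(dupes)} rows share a (config_name, repo_id, trial_index) key")

    # Only compare rows where all three test counts are present.
    counts = df[["tests_passed", "tests_failed", "tests_discovered"]].dropna()
    bad = counts[counts.tests_passed + counts.tests_failed != counts.tests_discovered]
    if not bad.empty:
        problems.append(f"{len(bad)} rows have tests_discovered != tests_passed + tests_failed")

    return problems


# Synthetic frame with made-up values: one repeated key and one inconsistent count.
demo = pd.DataFrame({
    "config_name": ["m1", "m1", "m1"],
    "repo_id": ["requests", "requests", "flask"],
    "trial_index": [0, 0, 0],
    "tests_passed": [5, 5, 3],
    "tests_failed": [1, 1, 2],
    "tests_discovered": [6, 6, 9],
})
problems = check_eval_frame(demo)
for p in problems:
    print(p)
```

On a real file, call check_eval_frame(pd.read_parquet("results.parquet")) and investigate any reported rows via source_path.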
When the user asks you to analyze eval results, load the parquet with pandas and answer their questions. Here are common patterns:
import pandas as pd
df = pd.read_parquet("results.parquet")
# Success rate per config, highest first
df.groupby("config_name")["success"].mean().sort_values(ascending=False)
# Cost per config (rows with a null cost_usd are skipped by these aggregations)
df.groupby("config_name").agg(
    mean_cost=("cost_usd", "mean"),
    median_cost=("cost_usd", "median"),
    total_cost=("cost_usd", "sum"),
)
# Ten slowest agent runs
df.nlargest(10, "agent_walltime_seconds")[["config_name", "repo_id", "agent_walltime_seconds", "success"]]
# Failed runs with their error message and the agent's summary
df[~df["success"]][["config_name", "repo_id", "error_message", "summary"]]
# Head-to-head comparison of two configs on the trials both of them ran
a, b = "claude-opus", "codex"
merged = df[df.config_name == a].merge(
    df[df.config_name == b],
    on=["repo_id", "trial_index"],
    suffixes=("_a", "_b"),
)
merged["both_pass"] = merged.success_a & merged.success_b
merged["a_only"] = merged.success_a & ~merged.success_b
merged["b_only"] = ~merged.success_a & merged.success_b
# Number of trials in each outcome bucket
merged[["both_pass", "a_only", "b_only"]].sum()
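The head-to-head merge above is an inner join, so any (repo_id, trial_index) pair that only one of the two configs ran is silently dropped from the comparison. A small sketch for listing those pairs, using the indicator option of DataFrame.merge; the helper name unmatched_trials and the demo data are illustrative assumptions.

```python
import pandas as pd


def unmatched_trials(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
    """Hypothetical helper: trials present for only one of the two configs."""
    keys = ["repo_id", "trial_index"]
    left = df.loc[df.config_name == a, keys]
    right = df.loc[df.config_name == b, keys]
    # indicator=True adds a "_merge" column: "left_only", "right_only", or "both".
    both = left.merge(right, on=keys, how="outer", indicator=True)
    return both[both["_merge"] != "both"]


# Made-up example: each config ran one repo the other did not.
demo = pd.DataFrame({
    "config_name": ["A", "A", "B", "B"],
    "repo_id": ["requests", "flask", "requests", "django"],
    "trial_index": [0, 0, 0, 0],
    "success": [True, False, True, True],
})
missing = unmatched_trials(demo, "A", "B")
print(missing)
```

If the result is non-empty, the both_pass / a_only / b_only counts describe only the overlapping trials, which is worth stating when reporting them.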
import json
# Print the step-by-step progress log for one trial
row = df[(df.config_name == "claude-opus") & (df.repo_id == "requests")].iloc[0]
messages = json.loads(row.status_messages)
for m in messages:
    print(f"[{m['timestamp']}] {m['message']}")
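Because status_messages is stored as a JSON string, any per-row statistic over it needs a parse step first. As a rough sketch, the number of progress messages per trial can be derived like this; the n_steps column name and the demo payloads are assumptions made for illustration.

```python
import json

import pandas as pd


def count_steps(status_messages: str) -> int:
    """Number of entries in the JSON array stored in status_messages."""
    return len(json.loads(status_messages))


# Made-up rows: one trial with two progress messages, one with none.
demo = pd.DataFrame({
    "config_name": ["m1", "m1"],
    "repo_id": ["requests", "flask"],
    "status_messages": [
        json.dumps([
            {"timestamp": "t0", "message": "cloning repo"},
            {"timestamp": "t1", "message": "running tests"},
        ]),
        json.dumps([]),
    ],
})
demo["n_steps"] = demo["status_messages"].map(count_steps)
print(demo[["repo_id", "n_steps"]])
```

The same pattern applies to the real frame: df["status_messages"].map(count_steps), followed by a groupby on config_name if a per-config view is wanted.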
from eval_schema import KeystoneRepoResult
result = KeystoneRepoResult.model_validate_json(row.raw_json)
# Now access any nested field, e.g.:
result.bootstrap_result.verification.test_results
# Per-trial test pass rate, summarized per config (NaN for trials with no tests or null counts)
df["test_pass_rate"] = df.tests_passed / (df.tests_passed + df.tests_failed)
df.groupby("config_name")["test_pass_rate"].describe()
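The per-trial rate above weights every trial equally, however many tests it ran. If a pooled rate is more useful (total tests passed over total tests run per config), one possible sketch is below; the helper name pooled_pass_rate and the demo numbers are assumptions, and which of the two rates is appropriate depends on the question being asked.

```python
import pandas as pd


def pooled_pass_rate(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: sum(tests_passed) / sum(tests_passed + tests_failed) per config."""
    totals = df.groupby("config_name")[["tests_passed", "tests_failed"]].sum()
    ran = totals.tests_passed + totals.tests_failed
    # where() turns a zero denominator into NaN so configs with no tests yield NaN.
    return (totals.tests_passed / ran.where(ran > 0)).rename("pooled_pass_rate")


# Made-up counts: config "x" ran 100 tests over two trials, config "y" ran none.
demo = pd.DataFrame({
    "config_name": ["x", "x", "y"],
    "tests_passed": [9, 10, 0],
    "tests_failed": [1, 80, 0],
})
rates = pooled_pass_rate(demo)
print(rates)
```

In this example "x" has a pooled rate of 19/100, while the mean of its per-trial rates would be noticeably higher because the small trial passed most of its tests.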
raw_json preserves everything: if a column doesn't exist for something you need, deserialize raw_json back into KeystoneRepoResult.
status_messages is a JSON string, not a list; use json.loads() to parse it.
source_path lets you go back to the original file on S3 or local disk for debugging.
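When the KeystoneRepoResult class is not importable in the current environment, raw_json can still be inspected as plain JSON. The sketch below assumes the serialized keys mirror the model's attribute names (for example bootstrap_result, verification, test_results), which has not been verified against the schema; get_nested and the demo payload are hypothetical.

```python
import json


def get_nested(raw_json: str, *path: str, default=None):
    """Hypothetical helper: walk a key path through the parsed JSON, returning default if absent."""
    node = json.loads(raw_json)
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Made-up payload shaped like the nested access shown earlier; real files may differ.
payload = json.dumps({
    "bootstrap_result": {"verification": {"test_results": [{"name": "test_a", "passed": True}]}}
})
results = get_nested(payload, "bootstrap_result", "verification", "test_results")
print(results)
print(get_nested(payload, "bootstrap_result", "missing_field"))
```

For a real row, pass row.raw_json as the first argument. Prefer KeystoneRepoResult.model_validate_json when the class is available, since it validates the structure instead of assuming it.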