self-play
| name | self-play |
| description | Run schema probing self-play loop to find and fix ClickHouse schema ambiguity in the panda repo. Use when the user wants to improve query reliability by finding where the agent picks different tables for the same question. |
You are running the self-play loop for the ethpandaops/panda project. This finds schema ambiguities by asking the same question N times with different personas and checking if the generated queries agree on which tables to use. When they disagree, you use schema introspection to determine the correct tables and write the fix autonomously.
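The agreement check described above can be sketched as follows. This is only an illustration, not the code in tests/eval/probes/analysis.py: the regex, the helper names `extract_tables` and `all_agreed`, and the table names in the usage are assumptions made for the example. A regex like this will also misfire on constructs such as `EXTRACT(... FROM ...)`, so a real implementation would likely use a SQL parser.

```python
import re

# Matches the identifier that follows FROM or JOIN; subqueries ("FROM (") are skipped
# because the identifier must start with a letter, underscore, or "{".
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_{][\w.{}]*)", re.IGNORECASE)


def extract_tables(sql):
    """Return the lower-cased table names after FROM/JOIN, without any database prefix."""
    return {name.rsplit(".", 1)[-1].lower() for name in TABLE_RE.findall(sql)}


def all_agreed(queries):
    """True when every generated query references exactly the same set of tables."""
    table_sets = [extract_tables(q) for q in queries]
    return all(ts == table_sets[0] for ts in table_sets)


# Hypothetical queries produced by two personas for the same question.
q1 = "SELECT count() FROM {network}.fct_block WHERE slot_start_date_time > now() - INTERVAL 1 DAY"
q2 = "select * from fct_block b join fct_block_mev m on b.slot = m.slot"
print(extract_tables(q1), extract_tables(q2), all_agreed([q1, q2]))
```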
The primary metric is average entropy across all probes. Lower is better (0 = perfect agreement). This number should trend down over time as you add examples and runbooks.
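As a rough illustration of the metric, the sketch below computes Shannon entropy over the distinct table sets chosen across runs of one probe, then averages over probes. The exact entropy definition used by the probe runner is not stated here, so treat the formula, the function names, and the sample table names as assumptions.

```python
import math
from collections import Counter


def probe_entropy(table_sets):
    """Shannon entropy in bits of the table choices across runs of one probe."""
    counts = Counter(frozenset(tables) for tables in table_sets)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def average_entropy(probes):
    """Mean entropy over all probes; `probes` maps probe ID to a list of table sets."""
    if not probes:
        return 0.0
    return sum(probe_entropy(runs) for runs in probes.values()) / len(probes)


# Hypothetical data: four runs per probe, one probe with a single dissenting run.
runs = {
    "blob_count": [{"fct_block_blob_count"}] * 4,
    "mev_share": [{"fct_block_mev"}] * 3 + [{"mev_relay_bid_trace"}],
}
print(round(average_entropy(runs), 4))
```

Under this definition a probe where every run picks the same tables scores 0, and a 3-to-1 split scores about 0.81 bits.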
The panda repo must be at the working directory. The probe infrastructure lives in tests/eval/.
Server: Build first, then the probe runner auto-starts a local server on :2481:
make build # builds panda-server binary
Dependencies: The evaluator LLM needs OPENROUTER_API_KEY set in the environment.
Run all probes:
cd tests/eval
uv run python -m scripts.run_probes --model claude-haiku-4-5
To filter by domain: --tag blobs, --tag mev, --tag attestations, etc.
The local server starts on :2481 and shuts down automatically when probes finish. First run takes ~10s for startup.
Read the latest results file from tests/eval/probes/results/.
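Locating the newest results file and listing the probes that disagreed might look like the sketch below. The file format is an assumption: it presumes JSON files whose timestamped names sort chronologically, with a top-level `probes` list whose entries carry `id` and `all_agreed` fields. Check an actual results file before relying on this layout.

```python
import json
from pathlib import Path


def latest_results(results_dir):
    """Return the newest results file, relying on timestamped names sorting chronologically."""
    files = sorted(Path(results_dir).glob("*.json"))
    if not files:
        raise FileNotFoundError(f"no results files in {results_dir}")
    return files[-1]


def disagreeing_probes(path):
    """Return the IDs of probes whose runs did not all pick the same tables."""
    data = json.loads(Path(path).read_text())
    return [p["id"] for p in data["probes"] if not p.get("all_agreed", False)]
```

With that layout, `disagreeing_probes(latest_results("tests/eval/probes/results"))` would give the probe IDs to investigate.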
For each probe where all_agreed is false, resolve it yourself using the actual schema:
- Run ./panda schema <table> for each candidate to get columns, types, and comments.
- If a candidate table does not appear in ./panda schema, it was hallucinated; discard it.
- Prefer fact tables (fct_) over canonical tables, pre-aggregated over raw, and xatu-cbt over xatu for performance.

Only escalate to the user if the schema genuinely doesn't disambiguate, e.g., two tables have overlapping columns and it's unclear which is the right source of truth for the question.
Skip these:
Based on your schema analysis, decide the best intervention. The entire repo is in scope — pick whatever will most effectively resolve the ambiguity. Find the root cause of why the model is confused rather than adding surface-level patches.
Possible fixes, in rough order of impact:
- Examples (modules/clickhouse/examples.yaml): add a query example showing the correct table and pattern. Best for "which table do I use for X?" ambiguities.
- Runbooks (runbooks/*.md): add or update a runbook with procedural guidance. Best for multi-step cross-cluster workflows.

For examples specifically:
- Name the cluster the query targets (xatu vs xatu-cbt)
- Include a time filter (slot_start_date_time) and network filter
- Use the {network} placeholder for network name in CBT tables, or a meta_network_name filter for xatu tables
- Avoid patterns like SELECT max(block_number) FROM ... that cause full table scans

Read existing files before modifying them.
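Before committing a new example, the guidelines on time filters, network filters, and full-scan patterns could be checked mechanically. The sketch below is a deliberately crude, hypothetical linter built on substring and regex checks. The function name `lint_example` and the table names in the sample queries are assumptions, and a query can pass these checks while still being wrong.

```python
import re


def lint_example(sql):
    """Return a list of guideline violations found in an example query."""
    problems = []
    if "slot_start_date_time" not in sql:
        problems.append("missing time filter on slot_start_date_time")
    if "{network}" not in sql and "meta_network_name" not in sql:
        problems.append("missing network filter")
    has_max = re.search(r"select\s+max\s*\(", sql, re.IGNORECASE)
    has_where = re.search(r"\bwhere\b", sql, re.IGNORECASE)
    if has_max and not has_where:
        problems.append("unfiltered max() may scan the full table")
    return problems


# Hypothetical example queries; the table names are placeholders.
good = (
    "SELECT count() FROM {network}.fct_block "
    "WHERE slot_start_date_time >= now() - INTERVAL 1 DAY"
)
bad = "SELECT max(block_number) FROM canonical_execution_block"
print(lint_example(good))
print(lint_example(bad))
```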
Important: Fixes must generalize. Don't add a narrow example that only answers the exact probe question — add something that teaches the agent how to handle the whole class of questions. For example:
The goal is that fixing one probe also fixes 5 others in the same domain that we haven't written yet.
Tip: Add clear positive examples that demonstrate the correct pattern. Never resort to "do NOT use X" negative guidance — that's lazy and doesn't teach anything.
After making fixes:
- Run make build to rebuild the server
- ... (probes.yaml)
- Re-run the affected probes: uv run python -m scripts.run_probes --model claude-haiku-4-5 -c 20 --tag <domain>
- If the result is worse, git revert <commit> and try a different approach

Go back to Step 1. The goal is to drive average entropy toward zero across all 38 probes.
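The keep-or-revert decision can be reduced to comparing average entropy before and after the fix. The sketch below assumes a strict-improvement rule over the probes present in both runs. That rule, the function names, and the sample numbers are assumptions, not something the probe runner is known to enforce.

```python
def mean(values):
    """Arithmetic mean, returning 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def keep_fix(before, after):
    """True when mean entropy over probes present in both runs went strictly down."""
    shared = sorted(before.keys() & after.keys())
    return mean(after[p] for p in shared) < mean(before[p] for p in shared)


# Hypothetical per-probe entropies from two runs of the same tag.
before = {"mev_share": 0.81, "mev_relays": 0.0}
after = {"mev_share": 0.0, "mev_relays": 0.0}
print("keep" if keep_fix(before, after) else "revert")
```

Comparing only shared probes avoids crediting a fix for probes that were added or removed between runs.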
- Probe definitions: tests/eval/cases/probes.yaml
- Results: tests/eval/probes/results/

- Use --probe "glob_pattern" to filter specific probes by ID (fnmatch syntax, single pattern only, no commas). Examples: --probe "block_*", --probe "mev_*"
- Use --tag <tag> to filter probes by domain tag (e.g., --tag blobs, --tag mev)
- Use -n N to limit how many probes to run
- Use -v for verbose output showing generated code
- Use --only-previously-failed to re-run only probes that disagreed in the last run

- tests/eval/cases/probes.yaml: probe questions
- tests/eval/scripts/run_probes.py: probe runner
- tests/eval/probes/analysis.py: table extraction and agreement scoring
- tests/eval/probes/results/: timestamped result files
- modules/clickhouse/examples.yaml: where fixes go (query examples)
- runbooks/*.md: alternative fix target (procedural guides)
- tests/eval/config-probe.yaml: server config for local probe runs
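To see why --probe accepts only a single pattern, it helps to look at how fnmatch treats a comma. The sketch below uses Python's standard fnmatch module on a hypothetical list of probe IDs. The `filter_probes` helper is illustrative and not the runner's actual code.

```python
from fnmatch import fnmatch


def filter_probes(probe_ids, pattern):
    """Keep the probe IDs that match a single fnmatch-style glob pattern."""
    return [pid for pid in probe_ids if fnmatch(pid, pattern)]


# Hypothetical probe IDs.
ids = ["block_timing", "block_size", "mev_relay_share", "blob_count"]

print(filter_probes(ids, "block_*"))
# A comma is a literal character in fnmatch, so this matches nothing.
print(filter_probes(ids, "block_*,mev_*"))
```

To cover several ID prefixes, run the probes once per pattern or filter by tag instead.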