skill-autoctx-evaluate
// Guides the agent to execute an evaluation of a generated ContextSet against a golden dataset utilizing the Evalbench framework.
| name | skill-autoctx-evaluate |
| description | Guides the agent to execute an evaluation of a generated ContextSet against a golden dataset utilizing the Evalbench framework. |
This skill guides the process of rigorously evaluating an existing ContextSet against a specific golden truth dataset using Google's Evalbench architecture. It structures the evaluation experiments and executes the binaries.
Before beginning the workflow, you need the following:
A tools.yaml file securely located in the workspace root directory containing the target database connection details.
A golden evaluation dataset (golden_dataset_path), formatted as an absolute system path. The file must be in the simplified user-facing format.
Simplified User-Facing Dataset Format: A JSON list of objects, where each object must have the following keys:
id: Unique string identifier (e.g., eval_001).
database: Target database name.
nlq: Natural language question.
golden_sql: The correct reference SQL query.
Example:
[
  {
    "id": "eval_001",
    "database": "my_db",
    "nlq": "Count users",
    "golden_sql": "SELECT COUNT(*) FROM users"
  }
]
The context_set_id (the Data Agent's authored context configuration identifier, retrievable by the user directly from the GCP Database Studio console; e.g., projects/<project_id>/locations/<region>/contextSets/<context_set_name>).
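The simplified dataset format described above can be checked before it is handed to the workflow. The following is a minimal sketch, not part of the skill itself: the function names `validate_golden_dataset` and `load_and_validate` are hypothetical, and the duplicate-id check is an assumption drawn from the word "unique" in the key description.

```python
import json
from pathlib import Path

# The four keys the simplified user-facing format requires on every object.
REQUIRED_KEYS = ("id", "database", "nlq", "golden_sql")


def validate_golden_dataset(records):
    """Return a list of problems; an empty list means the data looks valid.

    Hypothetical helper for illustration only.
    """
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of objects"]
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"entry {index}: missing key '{key}'")
        record_id = record.get("id")
        if isinstance(record_id, str):
            # Assumption: "unique" means unique within this file.
            if record_id in seen_ids:
                problems.append(f"entry {index}: duplicate id '{record_id}'")
            seen_ids.add(record_id)
        elif "id" in record:
            problems.append(f"entry {index}: 'id' must be a string")
    return problems


def load_and_validate(golden_dataset_path):
    """Check that the path is absolute, then validate the file's contents."""
    path = Path(golden_dataset_path)
    if not path.is_absolute():
        return [f"{golden_dataset_path} is not an absolute path"]
    return validate_golden_dataset(json.loads(path.read_text(encoding="utf-8")))
```

Returning a list of problems rather than raising on the first one lets the agent report every issue to the user in a single pass.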
Follow these steps exactly in order:
Experiment Selection & Memory:
Inspect the autoctx/experiments/ directory and list the available tuning workflows/subfolders to the user.
[!IMPORTANT] Inform the user that if they have an existing context, it must be uploaded to GCP Database Studio to obtain a context_set_id for evaluation.
Have the user select an experiment under autoctx/experiments/.
Check autoctx/state.md for long-term memory.
Record the selection in the autoctx/state.md file to act as long-term memory so you don't forget it during subsequent evaluations.
Parameter Collection:
Ask the user for the golden_dataset_path and the context_set_id (if they haven't provided them already). Do NOT ask them to explain or verify database configurations.
Read the autoctx/tools.yaml file to list available databases to the user:
Identify kind: source blocks with supported evaluation engines (consult the generate_evalbench_configs tool description for the exact list of supported types).
Present each source's name and type and let the user select which database to evaluate.
Config Generation (Core Execution):
Call the generate_evalbench_configs MCP tool. This is the only way to generate Evalbench configs. Never invent configs from scratch.
Pass experiment_name, dataset_path, context_set_id, the absolute toolbox_config_path (e.g. autoctx/tools.yaml), and the selected toolbox_source_name.
The tool writes the generated files (including golden_queries.json) directly to the eval_configs/ directory inside the chosen autoctx/experiments/<experiment_name>/ folder.
Evalbench Run Integration:
Use run_shell_command natively to execute the evaluation from the ROOT of the workspace using the following exact command template:
uvx google-evalbench --experiment_config=autoctx/experiments/<experiment_name>/eval_configs/run_config.yaml
Evaluation output goes to the autoctx/experiments/<experiment_name>/eval_reports/ directory.
Upon successful completion, the workspace must contain:
The generated Evalbench configs in the eval_configs/ folder.
Conclude by providing a succinct summary to the user:
When listing sources from tools.yaml, ensure you only present kind: source records to the user.
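The filtering rule above can be sketched as follows. This is an illustration under stated assumptions: it presumes tools.yaml has already been parsed (by a YAML library of your choice) into a list of mappings, each carrying `kind`, `name`, and `type` keys as the workflow wording suggests. The function name `list_source_blocks` and the sample non-source record are hypothetical.

```python
def list_source_blocks(documents):
    """Return (name, type) pairs for every block whose kind is 'source'.

    `documents` is assumed to be the already-parsed content of tools.yaml:
    a list of mappings. Anything that is not a mapping is skipped.
    """
    sources = []
    for doc in documents:
        if isinstance(doc, dict) and doc.get("kind") == "source":
            sources.append((doc.get("name"), doc.get("type")))
    return sources


# Hypothetical parsed content: one source block and one non-source block.
parsed = [
    {"kind": "source", "name": "my_db", "type": "cloud-sql-postgres"},
    {"kind": "tool", "name": "example_tool", "type": "example-type"},
]

for name, source_type in list_source_blocks(parsed):
    print(f"{name} ({source_type})")
```

Only the first record is printed, since the second is not a `kind: source` block. Whether a given type is a supported evaluation engine is still decided by the generate_evalbench_configs tool description, not by this filter.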
The tool generate_evalbench_configs will find the selected block inside the file and validate its connection parameters deterministically using Python code. You do not need to manually parse or map individual properties such as host, port, or database yourself. If the tool indicates a verification failure for a specific database type, refer to the schema examples inside references/ (e.g., cloud-sql-postgres.md) to guide the user on fixing their tools.yaml definition.
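To make the command template from the Evalbench run step concrete, the sketch below shows how the template expands for a given experiment name. It is illustrative only: within the skill the agent runs the command through run_shell_command, and the helper names `build_evalbench_command` and `run_evaluation`, as well as the name check, are assumptions added here.

```python
import subprocess


def build_evalbench_command(experiment_name):
    """Expand the command template for one experiment (hypothetical helper)."""
    # Assumption: an experiment name is a single folder name, not a path.
    if not experiment_name or "/" in experiment_name or experiment_name in (".", ".."):
        raise ValueError(f"invalid experiment name: {experiment_name!r}")
    config_path = f"autoctx/experiments/{experiment_name}/eval_configs/run_config.yaml"
    return ["uvx", "google-evalbench", f"--experiment_config={config_path}"]


def run_evaluation(experiment_name, workspace_root):
    """Run the command from the workspace root and return its exit code."""
    command = build_evalbench_command(experiment_name)
    return subprocess.run(command, cwd=workspace_root, check=False).returncode
```

Passing the command as a list and setting `cwd` keeps the relative config path resolved against the workspace root, which matches the requirement to run from the ROOT of the workspace.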