Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

create-context-tests

Name: Create Context Tests
Author: getnao

// Generate a test suite of natural-language → SQL pairs that becomes the quality benchmark for a nao agent, then run it via `nao test`. Use when the user wants to start measuring agent reliability, extend an existing test suite, or add tests for new metrics. Tests are the only honest answer to "is the context working?". Do not use for writing rules (write-context-rules) or diagnosing failures (audit-context).

In Manus ausführen

$ git log --oneline --stat

stars:1.229

forks:164

updated:30. April 2026 um 08:57

Datei-Explorer

2 Dateien

SKILL.md

readonly

name

create-context-tests

description

Generate a test suite of natural-language → SQL pairs that becomes the quality benchmark for a nao agent, then run it via `nao test`. Use when the user wants to start measuring agent reliability, extend an existing test suite, or add tests for new metrics. Tests are the only honest answer to "is the context working?". Do not use for writing rules (write-context-rules) or diagnosing failures (audit-context).

create-context-tests

nao test runs each natural-language prompt through the agent, executes both the agent's SQL and the test's expected SQL against the warehouse, and diffs the result data row-by-row. A test passes only if the actual data matches — same rows, same values. The suite is the reliability benchmark; every change to RULES.md is measured against it. Reference: docs.getnao.io/nao-agent/context-engineering/evaluation.

How many tests

One test per key metric in ## Key Metrics Reference is the floor. Then add tests for: time scoping (especially "last 8 weeks" / "last 30 days"), CTE / multi-step queries, edge cases (NULLs, empty windows), and ambiguous wording ("our users", "active") to validate naming-convention rules.

Two authoring rules — apply to every test

Rule 1 — Prompts read like real chat. Vague, short, no table/column/method hints. The test verifies the agent reaches the right answer from a real-user input.

Bad	Good
"What was the churn rate from `fct_subscriptions` in Q1?"	"How's churn looking this quarter?"
"Compute MRR as SUM(`mrr_amount`) where status='active'"	"What's our MRR?"

Rule 2 — Output column names encode format / unit, not source. A column name communicates how to interpret the value.

Bad	Good
`churn_rate_from_fct_subscriptions`	`churn_rate_float_0_1`
`mrr_amount_fct_stripe_mrr`	`mrr_usd_dollars`
`signup_at_dim_users`	`signup_date_yyyy_mm_dd`

Naming patterns: <metric>_float_0_1 or <metric>_percentage_0_100 for rates; <metric>_<currency>_<unit> for money; <thing>_count; <thing>_date_yyyy_mm_dd. See templates/test.yaml.

Steps

Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft new tests one per metric.
Save flat under tests/ (no subfolders), one YAML file per test. Use templates/test.yaml.
Have the user validate — confirm prompts match their team's phrasing and SQL matches their definition of truth.
Run nao test. Prerequisites:
- cd into the project directory (where nao_config.yaml lives).
- Start nao chat & in the background (the test runner reuses the chat server).
- LLM configured in nao_config.yaml.
- First run prompts for login credentials — let the user type them; don't script around it.
- If you see AI_APICallError: Not Found at https://api.anthropic.com/messages (no /v1/), run unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY first (parent agent CLI is leaking env vars). See setup-context for the full note.
```
nao test -m <model_id> -t 10   # -t = parallelism
```
Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
Diagnose failures (optional): read tests/outputs/ for each failure, identify the rule gap, propose the smallest fix, then route to write-context-rules (or audit-context for systemic issues). Re-run between fixes so impact is attributable.

Guardrails

Tests' SQL must execute as-is — no <placeholder> in FROM. Use real table / column names.
Never leak the answer in prompt or output column names (Rules 1 + 2).
One test per metric is the floor; coverage tests come after.
Apply one context fix at a time between runs.
If a test contradicts RULES.md, stop and ask which is correct — it's a bug in one or the other.

Templates

templates/test.yaml — single-test format.

related-skills.json

gleiches Repository

deploy-context.md

from "getnao/nao"

Wire a nao project's context folder to a remote nao instance (Cloud or self-hosted) so every push to `main` automatically runs `nao deploy` via GitHub Actions. Use when the user has a working local nao project versioned in Git and wants their team's deployed instance to always reflect the latest committed context. Covers `nao deploy` usage, the GitHub Actions workflow, organization API keys, GitHub Secrets, `.naoignore`, environment-variable references in `nao_config.yaml`, and create-vs-update behavior. Do not use for first-time project setup (use `setup-context`) or for `nao sync` automation that commits warehouse metadata back to the repo (covered in the docs' synchronization page).

2026-05-201.2k

setup-context.md

from "getnao/nao"

Bootstrap a nao agent for a project — gather warehouse + scope + extra-context info in one round, look up the warehouse-specific config from nao docs, write nao_config.yaml, run nao init + nao sync, set up the LLM key, and generate the first RULES.md. Use when the user has just decided to use nao on a new project. Only for first-time setup; for editing rules, generating tests, or reviewing an existing context, use write-context-rules / create-context-tests / audit-context.

2026-05-041.2k

add-semantic-layer.md

from "getnao/nao"

Wire a semantic layer into a nao agent so that metric queries are routed through a single source of truth. Supports dbt MetricFlow (dbt Cloud with Semantic Layer), Snowflake (views or semantic views via MCP), an in-house nao YAML semantic layer, or other tools (via MCP discovery). Installs the right MCP server, updates RULES.md to route metric queries through the semantic layer, and (for the nao YAML option) generates starter metric files. Use after a first round of tests has shown the agent struggling with metric reliability. Do not use for raw rule writing (write-context-rules) or first-time setup (setup-context).

2026-04-301.2k

audit-context.md

from "getnao/nao"

Diagnose the health of a nao context at any stage of its lifecycle. Use when the user wants a structured review of what's been synced, how RULES.md compares to the target structure, whether every table is documented, whether the data model is MECE, whether tests exist and what their failures reveal, and whether context files are bloated. Outputs a structured audit report with ranked recommendations. Do not use for first-time setup (setup-context) or routine rule writing (write-context-rules).

2026-04-301.2k

write-context-rules.md

from "getnao/nao"

Create or extend a nao project's RULES.md. Owns the RULES.md template. Use when the user wants to generate the initial RULES.md from synced metadata (called by setup-context), or improve their existing RULES.md. Do not use for first-time scope setup (use setup-context) or for diagnosing existing problems (use audit-context).

2026-04-301.2k

package.json

"author": "getnao"

"repository": "getnao/nao"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

Softwarequalitätssicherungsanalysten und -testerInformatik- und Mathematikberufe15-1253L4

name

create-context-tests

description

create-context-tests

How many tests

Two authoring rules — apply to every test

Rule 1 — Prompts read like real chat. Vague, short, no table/column/method hints. The test verifies the agent reaches the right answer from a real-user input.

Bad	Good
"What was the churn rate from `fct_subscriptions` in Q1?"	"How's churn looking this quarter?"
"Compute MRR as SUM(`mrr_amount`) where status='active'"	"What's our MRR?"

Rule 2 — Output column names encode format / unit, not source. A column name communicates how to interpret the value.

Bad	Good
`churn_rate_from_fct_subscriptions`	`churn_rate_float_0_1`
`mrr_amount_fct_stripe_mrr`	`mrr_usd_dollars`
`signup_at_dim_users`	`signup_date_yyyy_mm_dd`

Naming patterns: <metric>_float_0_1 or <metric>_percentage_0_100 for rates; <metric>_<currency>_<unit> for money; <thing>_count; <thing>_date_yyyy_mm_dd. See templates/test.yaml.

Steps

Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft new tests one per metric.
Save flat under tests/ (no subfolders), one YAML file per test. Use templates/test.yaml.
Have the user validate — confirm prompts match their team's phrasing and SQL matches their definition of truth.
Run nao test. Prerequisites:
- cd into the project directory (where nao_config.yaml lives).
- Start nao chat & in the background (the test runner reuses the chat server).
- LLM configured in nao_config.yaml.
- First run prompts for login credentials — let the user type them; don't script around it.
- If you see AI_APICallError: Not Found at https://api.anthropic.com/messages (no /v1/), run unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY first (parent agent CLI is leaking env vars). See setup-context for the full note.
```
nao test -m <model_id> -t 10   # -t = parallelism
```
Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
Diagnose failures (optional): read tests/outputs/ for each failure, identify the rule gap, propose the smallest fix, then route to write-context-rules (or audit-context for systemic issues). Re-run between fixes so impact is attributable.

Guardrails

Tests' SQL must execute as-is — no <placeholder> in FROM. Use real table / column names.
Never leak the answer in prompt or output column names (Rules 1 + 2).
One test per metric is the floor; coverage tests come after.
Apply one context fix at a time between runs.
If a test contradicts RULES.md, stop and ask which is correct — it's a bug in one or the other.

Templates

templates/test.yaml — single-test format.

create-context-tests

create-context-tests

How many tests

Two authoring rules — apply to every test

Steps

Guardrails

Templates

Mehr aus diesem Repository

create-context-tests

How many tests

Two authoring rules — apply to every test

Steps

Guardrails

Templates

Mehr aus diesem Repository