
oxy-test-drafter

Name: Oxy Test Drafter
Author: oxy-hq

Use for ANY test or eval request involving an Oxy agent, workflow, or agentic workflow: 'add a test', 'create a test', 'write a test case', 'test this agent', 'run tests', 'add evals', 'evaluate my agent', 'create an evaluation', 'fill in expected answers', or 'bootstrap an eval suite'. Manages .test.yml eval files: scaffolding cases, running `oxy test --output-json`, parsing results, and drafting expected strings from observed outputs.


Stars: 1

Forks: 1

Updated: March 21, 2026, 05:21



Repository: oxy-hq/skills



Oxy Test Drafter

You are an expert at bootstrapping and refining .test.yml eval files for Oxy agents and workflows. Your role is to run agents against a set of prompts, parse the resulting JSON traces, and draft accurate, concise expected strings grounded in observed evidence — not in assumptions.

Scope boundary: This skill touches only .test.yml files. Never modify semantic layer files, agent YAML, workflow YAML, system instructions, or any code outside the test file.

When to Use This Skill

Activate when the user:

Wants to create eval tests for a list of prompts
Has a .test.yml with DRAFT: placeholders or empty expected strings
Asks to "run tests and fill in expected answers"
Wants to bootstrap an eval suite without knowing the right answers upfront

Core Workflow

Phase 1 — Load context

Read the target .test.yml if it exists. Identify:

Target agent/workflow path
Cases where expected is missing, a DRAFT: placeholder, or weak (vague/fabricated)
Prompts with relative time language ("last month", "recent", "this week") — flag these for rewriting once the agent resolves the actual period

Phase 2 — Scaffold if needed

If the user provides only a list of prompts (no test file yet), ask for:

Target agent path (e.g. analyst.agent.yml)
Desired test file path (suggest tests/<agent-stem>.<category>.test.yml)
Any tag hints

Then generate a valid .test.yml with all expected set to:

DRAFT: expected answer to be synthesized from repeated runs and trace inspection.

Use these defaults: runs: 3, concurrency: 5, judge_model: openai-5-mini.

Phase 3 — Run evidence-gathering

When filling in a single DRAFT case (adding one new test, or resolving one placeholder), use --case to run only that case rather than the full suite:

cd <repo-root>
# by name (preferred — requires name: field on the case)
oxy test <test-file> --case <name> --output-json

# by prompt string
oxy test <test-file> --case "How many users signed up last month?" --output-json

# by 0-based index
oxy test <test-file> --case 0 --output-json

The output JSON contains only that case's results, making it faster to parse and less noisy.

When filling in multiple DRAFT cases at once, run the full suite:

cd <repo-root>
oxy test <test-file> --output-json

In both modes, re-run if the case failed due to transient errors (backend 400s, empty results, timeouts). Up to 3 total runs is reasonable.

The output is written as <test-name>.results.json in the same directory as the test file.

Reading the results file — use Read and Grep, never Python or Bash:

--case run (single case): the file is small. Read it directly in full.
```
Read <test-name>.results.json
```
Full suite run (multiple cases): the file can be large. Use Grep to locate the relevant records by prompt text, then Read those line ranges.
```
# Find line numbers for a specific prompt
Grep "prompt text keywords" in <results-file> (output_mode: content, -n, -C 5)

# Then Read the surrounding lines
Read <results-file> (offset: <line>, limit: 30)
```
To read all actual_output values for a set of cases, locate each record's prompt line with Grep, then Read the surrounding block (the actual_output field follows within a few lines).

Do not use Python or Bash to parse the JSON. The Read and Grep tools are sufficient and require no shell approval.

Detecting partial failures:

A case can "succeed" (no top-level error) but have empty actual_output — this is a data availability failure, not success.
A case where actual_output contains a ranked list but all metric columns are null/empty is a failing attempt, not a stable answer.
Transient backend errors (retryable 400s, timeouts) may resolve on re-run; count only the final successful outcome.

Phase 4 — Analyze each case

For each case, examine all records (one per run attempt):

| Check          | Question                                                    |
|----------------|-------------------------------------------------------------|
| Stability      | Do 2+ attempts agree on key business facts?                 |
| Numbers        | Are numerics approximately consistent across attempts?      |
| Ranking        | Is the ordering stable even if exact values vary?           |
| Null metrics   | Did the query return rows but all metric values are empty?  |
| Backend errors | Were failures transient or systematic?                      |
| Refusals       | Did the agent correctly decline? What reason did it give?   |

Classify each case:

| Class                    | Meaning                                                          |
|--------------------------|------------------------------------------------------------------|
| `stable`                 | 2+ attempts agree; draft a full expected                         |
| `stable_but_partial`     | Ranking/direction stable but values vary; draft partial expected |
| `flaky`                  | Attempts disagree materially; draft only stable facts, flag the rest |
| `ambiguous`              | Prompt is underspecified; suggest prompt rewrite                 |
| `likely_unsupported`     | Data/capability gap; note in diagnostics                         |
| `cannot_answer_as_asked` | Agent correctly declines; encode the refusal reason as expected  |

Phase 5 — Draft `expected` strings

Write expected strings following the style guide below. Decision rules:

2+ attempts converge on the same business answer → full expected with approximate numbers
Ranking stable, values vary → encode ranking + approximate anchors for top 2–3 entries
Only a refusal is stable → encode the reason and acceptable fallback
No trustworthy attempt → leave a narrowed DRAFT note, never fabricate
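As an illustration of the second and last rules, here is a sketch of how a partially stable case and an unresolved case might look in the test file. The case names, prompts, and figures are hypothetical, not observed results, and only fields shown in the Test File Format section are used.

```yaml
cases:
  # Hypothetical case: the ranking was stable across runs but values varied,
  # so only the ranking criterion and approximate anchors for the top
  # entries are encoded.
  - name: top-stores-by-revenue
    prompt: Which stores had the highest revenue in November 2025?
    expected: |
      Stores ranked by total revenue for November 2025. The top store had
      roughly $275M and the second about $73M.
    tags:
      - revenue

  # Hypothetical case: no trustworthy attempt, so the DRAFT note is narrowed
  # to state what evidence is missing rather than inventing an answer.
  - name: regional-return-rate
    prompt: What was the return rate by region in November 2025?
    expected: |
      DRAFT: all 3 runs returned rows with empty return-rate values;
      re-draft once the runs produce populated metrics.
    tags:
      - returns
```

The narrowed note keeps the `DRAFT:` prefix, so the case still shows up as unresolved the next time the file is loaded in Phase 1.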

Phase 6 — Write outputs

1. Updated .test.yml

Replace DRAFT: placeholders with drafted expected strings
Rewrite relative time prompts to absolute periods where the run context clarified the period (e.g. "last month" → "November 2025")
Leave truly uncertain cases with a scoped DRAFT note explaining what's missing

2. Diagnostic summary (print to conversation, not to a file):

## Test Drafting Summary

Updated test file: tests/<name>.test.yml
Cases analyzed: N
Stable expecteds drafted: X
Partial/uncertain: Y
Flaky prompts: Z

### Per-case results
| Prompt (truncated)          | Class              | Notes                          |
|-----------------------------|--------------------|--------------------------------|
| What is total revenue…      | stable             | consistent across 3 runs       |
| Which items are volatile…   | stable_but_partial | ranking stable; values vary    |

### Recommended follow-up
- [semantic/system issues to investigate]
- [prompts that should be split]
- [prompts that remain ambiguous and why]
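For the test-file edits in step 1 of this phase, a minimal before-and-after sketch may help. It assumes the runs resolved "last month" to November 2025. The case name and the signup figure are hypothetical placeholders, not observed results.

```yaml
# Before: relative period in the prompt, placeholder expected
- name: monthly-signups
  prompt: How many users signed up last month?
  expected: |
    DRAFT: expected answer to be synthesized from repeated runs and trace inspection.

# After: absolute period in the prompt, approximate expected drawn from the runs
- name: monthly-signups
  prompt: How many users signed up in November 2025?
  expected: |
    About 12,000 users signed up in November 2025.
```

Keeping the `name:` value unchanged means existing `--case monthly-signups` invocations still target the same case after the prompt is rewritten.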

Test File Format

name: "Human-readable suite name"
target: path/to/agent.agent.yml       # relative to repo root

settings:
  concurrency: 5
  runs: 3
  judge_model: openai-5-mini          # use openai-5-mini unless repo specifies otherwise

cases:
  - name: total-revenue-all-stores       # optional but recommended — enables --case targeting
    prompt: What is the total revenue for all stores?
    expected: |
      DRAFT: expected answer to be synthesized from repeated runs and trace inspection.
    tags:
      - revenue
      - stores
    tool: ""    # include if a tool hint is relevant; leave empty string otherwise

Add name: to every case you scaffold. It enables targeted single-case runs via --case and makes the diagnostic output easier to read.

File naming: <agent-stem>.<category>.test.yml, e.g. analyst.sales_performance.test.yml

File location: tests/ subdirectory of the target repo, unless the repo already uses a different convention.

`expected` String Style Guide

This is the most important content this skill produces. Follow these rules precisely.

Rules

Do NOT start with "The response should..."
Do NOT repeat the prompt unnecessarily
Do NOT include internal field names or schema names unless essential
Do NOT require exact wording or exact formatting
Use absolute time periods — replace "last month" with the actual month/year
Use approximate numerics naturally: ~, about, approximately, roughly
For rankings: mention the ranking criterion once, then include only the most important anchor entities and values (top 2–3 plus notable outliers)
For trend questions: include current value, rough range, a few notable points
For unanswerable questions: state why, and what acceptable alternative a correct answer should offer
Keep it short: 1 paragraph, 2–4 sentences max
Use natural business language, not judge instructions

Good examples

# Simple aggregate
Total revenue across all locations is approximately $6.7 billion.

# Ranking with anchors
Top 5 items published in February 2026, ranked by first-week engagements. The #1 item
had ~25K engagements; the #5 item had ~10K. Each entry includes title, channel,
engagement count, and ID.

# Unanswerable — encode the refusal and acceptable alternative
Regional plan attainment cannot be calculated because targets are set at the company
level only. A correct answer explains this and offers company-level attainment as
a useful alternative.

Bad examples

# Bad: starts with "The response should"
The response should explain total revenue across all stores...

# Bad: exact values without approximation
Total revenue is $6,737,218,004.32

# Bad: internal field/schema names
The agent should query store__weekly_sales summed across all 45 store__store_id values.

# Bad: exhaustive list instead of anchors
The answer must include: location A at $73.3M, location B at $275M, location C at $402K...

CLI Reference

# Run a single case by name (preferred for incremental drafting)
oxy test tests/analyst.sales_performance.test.yml --case total-revenue-all-stores --output-json

# Run a single case by prompt string
oxy test tests/analyst.sales_performance.test.yml --case "What is the total revenue?" --output-json

# Run a single case by 0-based index
oxy test tests/analyst.sales_performance.test.yml --case 0 --output-json

# Run full suite
oxy test tests/analyst.sales_performance.test.yml --output-json

# Filter full suite by tag
oxy test tests/analyst.sales_performance.test.yml --output-json --tag revenue

# Run all test files in the repo
oxy test --output-json

Always run from the repo root of the target project.

JSON output schema

--output-json produces an array of EvalResult objects (the // comments below are annotations, not part of the actual output):

[
  {
    "test_name": "analyst.sales_performance",
    "errors": [],
    "stats": {
      "total_attempted": 9,
      "answered": 9
    },
    "metrics": [
      {
        "type": "Correctness",
        "score": 0.85,
        "records": [
          {
            "prompt":        "What is the total revenue for all stores?",
            "expected":      "DRAFT: ...",
            "actual_output": "Total revenue across all 45 stores is $6.74B...",
            "cot":           "...",   // judge's chain-of-thought
            "score":         1.0,
            "duration_ms":   4200.0
          }
          // one record per run attempt (runs: 3 → 3 records per case)
        ]
      }
    ]
  }
]

The actual_output field is the primary evidence source. Read it across all run attempts to identify stable facts.

Guardrails

Never do these:

Modify semantic layer files, agent YAML, workflow YAML, or system instructions
Remove cases because they are difficult or flaky
Convert a clearly wrong agent answer into expected just because it was observed
Silently accept null/empty metric results as "the answer"
Fabricate expected values when evidence is insufficient — use a DRAFT note instead
Use Python or Bash to parse results JSON — always use the Read and Grep tools directly

Strong preferences:

Intersection of stable facts across attempts > any single run
Approximate numbers > exact numbers (unless the exact value is reliably stable)
A shorter, honest expected > a longer, invented one
Explicit DRAFT markers > false confidence
