$pwd:

codex-promptfoo-agentic-eval

Name: Codex Promptfoo Agentic Eval
Author: scouzi1966

// Run and review the Promptfoo-based AFM agentic evaluation suite. Use when the user wants structured-output, tool-calling, grammar, guided-json, streaming, concurrency, or agentic QA coverage for AFM, and especially when they want help choosing harness options or interpreting failures.

$ git log --oneline --stat

stars:292

forks:15

updated: April 3, 2026 at 14:47

SKILL.md

name: codex-promptfoo-agentic-eval
description: Run and review the Promptfoo-based AFM agentic evaluation suite. Use when the user wants structured-output, tool-calling, grammar, guided-json, streaming, concurrency, or agentic QA coverage for AFM, and especially when they want help choosing harness options or interpreting failures.

Promptfoo Agentic Eval

Use this skill when the user wants to run, expand, or interpret the Promptfoo agentic suite for AFM.

This skill is for two linked goals:

AFM functional validation
model-quality evaluation for agentic use

Always distinguish:

afm_bug
model_quality
harness_bug

Always report provenance for the suite you run:

afm_internal
primary_source
public_benchmark_inspired
synthetic

Never present a benchmark-inspired or synthetic suite as if it were a public benchmark import.

First questions to ask

Before running the suite, ask the user the minimum needed questions:

Which model should be tested?
Which scope should be run?
- structured
- structured-stress
- toolcall
- toolcall-quality
- agentic
- frameworks
- opencode
- all
- one profile only: default, adaptive-xml, adaptive-xml-grammar
Is the goal:
- AFM functional QA
- model quality
- both
Should the run stay serial/safe, or include concurrency cases?
Should you only review existing reports, or also execute the harness?
Should the run prefer:
- primary-source-only cases
- public-benchmark-inspired cases
- synthetic representative cases
- mixed

If the user does not specify, assume:

model: the repo's current primary MLX model under test
scope: all
goal: both
run mode: serial/safe first
action: execute and then review
provenance preference: mixed, but explicitly labeled

Working directory

Run from:

cd /Volumes/edata/codex/dev/git/maclocal-api/NEXT/maclocal-api

Main assets

Read only what is needed:

Scripts/feature-promptfoo-agentic/README.md
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh
Scripts/feature-promptfoo-agentic/providers/afm_provider.mjs
Scripts/feature-promptfoo-agentic/matrix/functional-matrix.yaml
Scripts/feature-promptfoo-agentic/matrix/failure-classification.yaml
docs/roadmap/promptfoo-agentic-matrix.md

Relevant suite configs and datasets:

Scripts/feature-promptfoo-agentic/promptfooconfig.structured.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.structured-stress.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.toolcall.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.toolcall-quality.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.agentic.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.agentic-frameworks.yaml
Scripts/feature-promptfoo-agentic/promptfooconfig.opencode.yaml
Scripts/feature-promptfoo-agentic/datasets/agentic/opencode-primary-tools.yaml

If reviewing failures, inspect:

test-reports/promptfoo-agentic/*.json
test-reports/promptfoo-agentic/*.classified.json
test-reports/promptfoo-agentic/*.classified.summary.md
test-reports/promptfoo-agentic/server-*.log
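When many reports accumulate, it can help to find the raw reports that have not been classified yet. The following is a minimal sketch, not part of the repo. The function name `unclassified_reports` is hypothetical, and the pairing of `NAME.json` with `NAME.classified.json` is an assumption inferred from the glob patterns above.

```shell
# List raw reports that have no classified counterpart yet.
# Assumption: a report named NAME.json is classified into NAME.classified.json.
unclassified_reports() {
  dir="${1:-test-reports/promptfoo-agentic}"
  for report in "$dir"/*.json; do
    [ -e "$report" ] || continue        # the glob matched nothing
    case "$report" in
      *.classified.json) continue ;;    # skip classifier output itself
    esac
    classified="${report%.json}.classified.json"
    [ -e "$classified" ] || echo "$report"
  done
}
```

Calling `unclassified_reports` with no argument scans `test-reports/promptfoo-agentic` relative to the working directory.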

Execution workflow

1. Run the harness

Use the wrapper unless the user explicitly wants a narrower manual run:

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
  Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh

Allowed narrowed runs:

Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh structured
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh structured-stress
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh toolcall
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh toolcall-quality
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh agentic
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh frameworks
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh opencode
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh default
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh adaptive-xml
Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh adaptive-xml-grammar
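A minimal sketch of guarding the scope argument before invoking the wrapper. The functions `is_known_scope` and `plan_run` are hypothetical and do not exist in the repo. Mapping `all` to an invocation without arguments is an assumption based on the unnarrowed command above; the wrapper may or may not accept a literal `all`.

```shell
# Return 0 when the name is one of the narrowed runs listed above.
is_known_scope() {
  case "${1:-}" in
    structured|structured-stress|toolcall|toolcall-quality) return 0 ;;
    agentic|frameworks|opencode) return 0 ;;
    default|adaptive-xml|adaptive-xml-grammar) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the command that would run, without executing it.
plan_run() {
  scope="${1:-}"
  wrapper="Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh"
  if [ -z "$scope" ] || [ "$scope" = "all" ]; then
    echo "$wrapper"                     # assumption: no argument means the full suite
  elif is_known_scope "$scope"; then
    echo "$wrapper $scope"
  else
    echo "unknown scope: $scope" >&2
    return 1
  fi
}
```

Printing the command instead of executing it keeps the sketch safe to try; pipe the output to `sh` only after checking it.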

Known current suite sizes:

structured: 6 cases (afm_internal)
structured-stress: 4 cases (public_benchmark_inspired)
toolcall: 7 cases (afm_internal)
toolcall-quality: 6 cases (public_benchmark_inspired)
agentic: 4 cases (synthetic representative)
frameworks: 8 cases (mixed / currently assumption-heavy)
opencode: 37 cases (primary_source)

If a requested run is under 20 cases, explicitly warn the user that it is a small sample, not broad coverage.
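The small-sample rule can be sketched as a lookup plus a threshold check. The function names are hypothetical, and the counts are copied from the list above, so they will drift as the suites grow. The parser profiles (`default`, `adaptive-xml`, `adaptive-xml-grammar`) have no listed size and are treated as unknown here.

```shell
# Case counts copied from the "Known current suite sizes" list.
suite_size() {
  case "${1:-}" in
    structured) echo 6 ;;
    structured-stress) echo 4 ;;
    toolcall) echo 7 ;;
    toolcall-quality) echo 6 ;;
    agentic) echo 4 ;;
    frameworks) echo 8 ;;
    opencode) echo 37 ;;
    *) return 1 ;;
  esac
}

# Print a warning when the suite has fewer than 20 cases.
small_sample_warning() {
  size="$(suite_size "${1:-}")" || { echo "unknown suite: ${1:-}" >&2; return 1; }
  if [ "$size" -lt 20 ]; then
    echo "warning: $1 has only $size cases; treat it as a small sample, not broad coverage"
  fi
}
```

With the counts listed above, only `opencode` clears the 20-case threshold.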

2. Review results

Check:

pass/fail counts
whether failures are real or harness-related
differences across parser profiles
structured-output vs tool-calling behavior
provenance of the suite, and whether it limits how strong the conclusions can be

3. Classify failures

If using the automated judge path:

AFM_JUDGE_MODEL="$MODEL_ID" \
AFM_JUDGE_BASE_URL=http://127.0.0.1:9999/v1 \
node Scripts/feature-promptfoo-agentic/judges/classify-failures.mjs <report.json>

If working interactively in Codex/Claude-style CLI, classify manually using the rubric in:

docs/roadmap/promptfoo-agentic-matrix.md
Scripts/feature-promptfoo-agentic/matrix/failure-classification.yaml

Classification rubric

`afm_bug`

Use when AFM violates server/runtime/protocol invariants:

malformed JSON or SSE
broken tool_calls envelope
wrong tool_choice semantics
grammar-constrained output violates grammar/schema
stream/non-stream deterministic mismatch
parser corrupts an otherwise valid call
timeout, truncation, duplicate emission, crash

`model_quality`

Use when AFM output is valid but the model behavior is weak:

wrong tool
missing tool
unnecessary tool
wrong arguments
poor multi-turn or refusal behavior

`harness_bug`

Use when the test machinery is wrong:

assertion false negative
provider normalization issue
Promptfoo config mismatch
classification/judge pipeline issue
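For manual triage, the rubric above can be expressed as a lookup from symptom to category. This is an illustrative sketch only: the symptom labels are hypothetical shorthand for the bullets above, not identifiers used by the harness or by `failure-classification.yaml`.

```shell
# Map a symptom label to a rubric category.
# Labels are hypothetical shorthand for the rubric bullets.
classify_symptom() {
  case "${1:-}" in
    malformed_json|malformed_sse|broken_tool_calls_envelope) echo afm_bug ;;
    wrong_tool_choice_semantics|grammar_violation|stream_mismatch) echo afm_bug ;;
    parser_corruption|timeout|truncation|duplicate_emission|crash) echo afm_bug ;;
    wrong_tool|missing_tool|unnecessary_tool|wrong_arguments) echo model_quality ;;
    poor_multi_turn|poor_refusal) echo model_quality ;;
    assertion_false_negative|provider_normalization) echo harness_bug ;;
    config_mismatch|judge_pipeline) echo harness_bug ;;
    *) echo not_yet_classified ;;       # anything unrecognized stays open
  esac
}
```

A real failure often shows several symptoms at once, so treat the output as a starting point for review rather than a verdict.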

Reporting format

When reporting results, give:

overall run status
total tests executed
pass/fail counts per suite/profile
suite provenance summary, with the number of cases from each source:
- afm_internal
- primary_source
- public_benchmark_inspired
- synthetic
failure classification summary:
- afm_bug
- model_quality
- harness_bug
remaining not_yet_classified count, if any
top next actions

Prefer concise summaries, but include concrete failing cases when they matter.
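The failure classification summary can be tallied from a list of labels. A minimal sketch, assuming one label per line on standard input; the function name `summarize_classifications` is hypothetical, and the real `*.classified.json` layout is not described here, so extracting the labels from a report is left out.

```shell
# Count classification labels read from stdin, one per line.
# Anything other than the three rubric categories counts as not_yet_classified.
summarize_classifications() {
  afm=0; model=0; harness=0; other=0
  while IFS= read -r label; do
    case "$label" in
      afm_bug) afm=$((afm + 1)) ;;
      model_quality) model=$((model + 1)) ;;
      harness_bug) harness=$((harness + 1)) ;;
      "") ;;                            # ignore blank lines
      *) other=$((other + 1)) ;;
    esac
  done
  printf 'afm_bug: %d\nmodel_quality: %d\nharness_bug: %d\nnot_yet_classified: %d\n' \
    "$afm" "$model" "$harness" "$other"
}
```

For example, `printf 'afm_bug\nmodel_quality\n' | summarize_classifications` prints one count per category in the order used by the reporting format.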

Expansion guidance

When the user asks to extend the suite, prioritize:

stronger custom assertions
streaming and grammar-specific cases
primary-source-derived framework suites
public benchmark sampling:
- BFCL
- When2Call
- StructEval
- tau-bench-style multi-turn cases
real AFM use cases:
- coding agents
- OpenClaw/Hermes-style tool orchestration
- structured output workflows

Prefer primary sources over secondary descriptions. If a suite is built from secondary material or assumptions, say so explicitly and do not overstate its authority.

Do not explode the matrix blindly. Use the layered matrix in docs/roadmap/promptfoo-agentic-matrix.md.

related-skills.json

same repository

afm-build-promote-nightly.md

from "scouzi1966/maclocal-api"

Use when promoting afm to a stable release — builds from main HEAD or a nightly commit, verifies patches, updates Homebrew stable tap (afm.rb), builds a PyPI wheel, updates README and version files, and verifies both brew install and pip install work. Repo admin only.

2026-04-21 · 292 stars

afm-release-wheel.md

from "scouzi1966/maclocal-api"

Use when user wants to build a PyPI wheel from an existing compiled afm binary and publish to PyPI. Covers staging assets, building the wheel, and providing the uv publish command. Only for official stable releases, not nightly builds.

2026-04-18 · 292 stars

build-afm-nightly-publish.md

from "scouzi1966/maclocal-api"

Build, test, and publish an afm-next nightly release — full from-scratch build, user testing pause, GitHub release, and Homebrew tap update. Use when user types /build-afm-nightly-publish or asks to publish a nightly build.

2026-04-18 · 292 stars

build-afm.md

from "scouzi1966/maclocal-api"

Build AFM from scratch — submodules, patches, webui, and Swift build. Use when user types /build-afm, asks to build afm, or needs a fresh build from a clean clone.

2026-04-18 · 292 stars

test-afm-binary.md

from "scouzi1966/maclocal-api"

Test a pre-built afm binary at any path — runs pre-flight safety checks, then any combination of unit tests, assertions, smart analysis, promptfoo evals, batch validation, OpenAI compat, GPU profiling. Use when user wants to validate a binary post-build, after code changes, or before release.

2026-04-18 · 292 stars

test-macafm.md

from "scouzi1966/maclocal-api"

Run the maclocal-api (AFM/MLX) test suite — automated assertions and smart analysis. Use when asked to test, validate, regression-check, or benchmark AFM before release, after code changes, or for model onboarding.

2026-03-28 · 292 stars

package.json

"author": "scouzi1966"

"repository": "scouzi1966/maclocal-api"

$ useful --forSOC

Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · 15-1253 · L4
