Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

$pwd:

test-afm-binary

Name: Test Afm Binary
Author: scouzi1966

// Test a pre-built afm binary at any path — runs pre-flight safety checks, then any combination of unit tests, assertions, smart analysis, promptfoo evals, batch validation, OpenAI compat, GPU profiling. Use when user wants to validate a binary post-build, after code changes, or before release.

Exécuter dans Manus

$ git log --oneline --stat

stars:292

forks:15

updated:18 avril 2026 à 15:26

SKILL.md

readonly

related-skills.json

même dépôt

afm-build-promote-nightly.md

from "scouzi1966/maclocal-api"

Use when promoting afm to a stable release — builds from main HEAD or a nightly commit, verifies patches, updates Homebrew stable tap (afm.rb), builds a PyPI wheel, updates README and version files, and verifies both brew install and pip install work. Repo admin only.

2026-04-21292

afm-release-wheel.md

from "scouzi1966/maclocal-api"

Use when user wants to build a PyPI wheel from an existing compiled afm binary and publish to PyPI. Covers staging assets, building the wheel, and providing the uv publish command. Only for official stable releases, not nightly builds.

2026-04-18292

build-afm-nightly-publish.md

from "scouzi1966/maclocal-api"

Build, test, and publish an afm-next nightly release — full from-scratch build, user testing pause, GitHub release, and Homebrew tap update. Use when user types /build-afm-nightly-publish or asks to publish a nightly build.

2026-04-18292

build-afm.md

from "scouzi1966/maclocal-api"

Build AFM from scratch — submodules, patches, webui, and Swift build. Use when user types /build-afm, asks to build afm, or needs a fresh build from a clean clone.

2026-04-18292

codex-promptfoo-agentic-eval.md

from "scouzi1966/maclocal-api"

Run and review the Promptfoo-based AFM agentic evaluation suite. Use when the user wants structured-output, tool-calling, grammar, guided-json, streaming, concurrency, or agentic QA coverage for AFM, and especially when they want help choosing harness options or interpreting failures.

2026-04-03292

test-macafm.md

from "scouzi1966/maclocal-api"

Run the maclocal-api (AFM/MLX) test suite — automated assertions and smart analysis. Use when asked to test, validate, regression-check, or benchmark AFM before release, after code changes, or for model onboarding.

2026-03-28292

package.json

"author": "scouzi1966"

"repository": "scouzi1966/maclocal-api"

Ouvrir le dépôt GitHub Voir les dépôts du créateur

$ install --global

$ download --local

Exécuter dans Manus

$ useful --forSOC

Analystes en assurance qualité des logiciels et testeursProfessions informatiques et mathématiques15-1253L4

name	test-afm-binary
description	Test a pre-built afm binary at any path — runs pre-flight safety checks, then any combination of unit tests, assertions, smart analysis, promptfoo evals, batch validation, OpenAI compat, GPU profiling. Use when user wants to validate a binary post-build, after code changes, or before release.
user_invocable	true

Test AFM Binary

Test any pre-built afm binary with a menu of test suites. Validates the binary won't crash when relocated (pip/Homebrew install), then runs the selected tests.

Usage

/test-afm-binary — interactive: asks for binary path, model, and test selection
/test-afm-binary /path/to/afm — test the binary at the given path
/test-afm-binary .build/arm64-apple-macosx/release/afm — test the current build

Instructions

Step 1: Resolve Binary Path

Ask for the binary path if not provided as an argument. Default: .build/arm64-apple-macosx/release/afm.

BIN="${1:-.build/arm64-apple-macosx/release/afm}"
[ -x "$BIN" ] || BIN=".build/release/afm"
BIN_ABS="$(cd "$(dirname "$BIN")" && pwd)/$(basename "$BIN")"
echo "Binary: $BIN_ABS"

If the binary doesn't exist or isn't executable, STOP and tell the user.

Step 2: Pre-Flight Safety Checks (MANDATORY — always run)

These checks run before any test suite. They catch fatal distribution bugs that would crash every pip/Homebrew user. If any check fails, STOP — do not proceed to testing.

Check A: Binary version

REPORTED=$($BIN_ABS --version 2>&1)
echo "Version: $REPORTED"

If the version shows only a base version without a SHA suffix (e.g., v0.9.8 instead of v0.9.8-62395ab), warn the user: this likely means the binary was built with an incremental swift build instead of ./Scripts/build-from-scratch.sh. The SHA injection only happens in the build script. This is a warning, not a blocker — the binary may still be valid for testing.

Check B: Metallib present

BIN_DIR="$(dirname "$BIN_ABS")"

# Check for metallib in either location (SPM bundle or loose file)
if [ -f "$BIN_DIR/MacLocalAPI_MacLocalAPI.bundle/default.metallib" ]; then
  echo "PASS: Metallib in SPM bundle ($(du -h "$BIN_DIR/MacLocalAPI_MacLocalAPI.bundle/default.metallib" | cut -f1))"
elif [ -f "$BIN_DIR/default.metallib" ]; then
  echo "PASS: Loose metallib ($(du -h "$BIN_DIR/default.metallib" | cut -f1))"
else
  echo "FAIL: No metallib found next to binary"
  echo "The binary will crash on first inference without default.metallib"
fi

Check C: Relocated binary does NOT crash

TMPDIR=$(mktemp -d)
cp "$BIN_ABS" "$TMPDIR/"

# Copy metallib as loose file (pip wheel layout)
if [ -f "$BIN_DIR/MacLocalAPI_MacLocalAPI.bundle/default.metallib" ]; then
  cp "$BIN_DIR/MacLocalAPI_MacLocalAPI.bundle/default.metallib" "$TMPDIR/"
elif [ -f "$BIN_DIR/default.metallib" ]; then
  cp "$BIN_DIR/default.metallib" "$TMPDIR/"
fi

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
  "$TMPDIR/afm" mlx -m mlx-community/SmolLM3-3B-4bit -s "hello" --max-tokens 3 2>&1 | head -3
EXIT_CODE=${PIPESTATUS[0]}
rm -rf "$TMPDIR"

if [ "$EXIT_CODE" -ne 0 ]; then
  echo "FATAL: Relocated binary crashed (exit $EXIT_CODE)"
  echo "Bundle.module fatalError is still reachable — pip/Homebrew install will crash"
  echo "STOP. Fix MLXMetalLibrary.swift — it must NOT call Bundle.module"
else
  echo "PASS: Relocated binary runs without crash"
fi

If this fails, STOP IMMEDIATELY. Do not run any tests. The binary is broken for distribution.

Check D: No Bundle.module in source code

HITS=$(grep -r 'Bundle\.module' Sources/ --include='*.swift' | grep -v '^\s*//' | grep -v '// ' | wc -l | tr -d ' ')
if [ "$HITS" -gt 0 ]; then
  echo "FAIL: Found $HITS Bundle.module call(s) in source"
  grep -rn 'Bundle\.module' Sources/ --include='*.swift' | grep -v '//'
  echo "This WILL crash when installed via pip or Homebrew"
else
  echo "PASS: No Bundle.module calls in source"
fi

Check E: Info.plist embedded with privacy usage descriptions

macOS 26 SIGABRTs any process that requests privacy-sensitive APIs (Speech Recognition, microphone, camera, etc.) without a matching *UsageDescription key in the binary's embedded Info.plist. PR #107's Apple Speech feature triggers this on every afm speech / POST /v1/audio/transcriptions / chat input_audio call.

# Verify __TEXT,__info_plist section exists
if otool -l "$BIN_ABS" | grep -q '__info_plist'; then
  echo "PASS: __info_plist section present"
else
  echo "FAIL: Missing __TEXT,__info_plist section"
  echo "Check Package.swift linker flags and Sources/MacLocalAPI/Info.plist"
fi

# Verify NSSpeechRecognitionUsageDescription key is in the embedded plist
if strings "$BIN_ABS" | grep -q 'NSSpeechRecognitionUsageDescription'; then
  echo "PASS: NSSpeechRecognitionUsageDescription embedded"
else
  echo "FAIL: NSSpeechRecognitionUsageDescription missing"
  echo "afm speech / /v1/audio/transcriptions will SIGABRT on macOS 26"
fi

Note on testing Speech from an unattended context: If this skill is running inside Claude Code / an editor terminal / any parent process that does NOT have NSSpeechRecognitionUsageDescription, macOS 26 attributes the TCC subject to the parent and the child crashes even with a correct embedded plist. This is a test-environment artifact, not a binary bug. To verify Speech end-to-end, run afm speech -f <file.wav> from a fresh Terminal.app window (stock /System/Applications/Utilities/Terminal.app).

Present pre-flight results

Check	What it catches	Result
A: Version	Incremental build (no SHA)	PASS/WARN/FAIL
B: Metallib	Missing Metal shaders → crash on inference	PASS/FAIL
C: Relocated binary	Bundle.module fatalError → crash on pip install	PASS/FAIL
D: No Bundle.module	Source code regression guard	PASS/FAIL
E: Info.plist embedded	macOS 26 SIGABRT on Speech Recognition without UsageDescription	PASS/FAIL

If B, C, D, or E fail, STOP. Do not proceed.

Step 3: Select Model

Show available models and let the user pick:

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache ./Scripts/list-models.sh

Use AskUserQuestion with the model list. Default recommendation: mlx-community/Qwen3.5-35B-A3B-4bit (19 GB, MoE, best coverage).

For quick smoke tests, suggest mlx-community/SmolLM3-3B-4bit (1.6 GB, fast).

Step 4: Select Tests

Use AskUserQuestion with multiSelect: true. Present these options:

Option	Script	Server?	Port	Runtime	What it tests
All	(runs everything below)	—	—	~3-4 hours	Complete validation
Unit tests	`swift test`	No	—	~5s	261 Swift unit tests (XML parsing, batch scheduler, KV cache, etc.)
Assertions (smoke)	`test-assertions.sh --tier smoke`	Yes	9998	~2 min	Server reachable, basic completion, stop, logprobs, think, tools, errors
Assertions (standard)	`test-assertions.sh --tier standard`	Yes	9998	~5 min	+ streaming, cache, concurrent, kwargs, XML tools, adaptive XML, grammar, batch
Assertions (full)	`test-assertions.sh --tier full`	Yes	9998	~15 min	+ performance (TTFT, tok/s, long context 2K/4K tokens)
Assertions + grammar + forced parser	`test-assertions-multi.sh`	Managed	9998	~30 min	Full tier × 2 (auto-detect + forced qwen3_xml) with grammar constraints
Comprehensive smart analysis	`mlx-model-test.sh --smart 1:claude`	Managed	9877	~45-90 min	91 test variants across samplers, stop, JSON, tools, code, math with AI judge
Promptfoo agentic evals	`run-promptfoo-agentic.sh all`	Managed	9999	~60-120 min	137 tests × 8 server profiles: structured, toolcall, grammar, agentic, frameworks
Batch correctness	`validate_responses.py`	Yes	9999	~10-15 min	Known-answer correctness at B={1,2,4,8}
Batch mixed workload	`validate_mixed_workload.py`	Yes	9999	~15-25 min	Short+long decode mix with GPU metrics
Batch multiturn prefix	`validate_multiturn_prefix.py`	Yes	9999	~15-25 min	Multi-turn prefix cache under concurrency
OpenAI compat evals	`test-openai-compat-evals.py`	Managed	9999	~5-10 min	OpenAI Python SDK compatibility (stream, logprobs, usage)
Guided JSON evals	`test-guided-json-evals.py`	Managed	9999	~10-15 min	`response_format: json_schema` with real-world fixtures
GPU profile	`gpu-profile-report.py`	No (CLI)	—	~30-60s	DRAM bandwidth, GPU power, shader kernel names, HTML report

Step 5: Run Selected Tests

For each selected test, set the correct environment and invoke. The binary path must be passed to every script.

Environment (always set):

export MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache

Parallelism rules:

Unit tests (no server) → can run in parallel with anything
Promptfoo (port 9999) → can run in parallel with assertions (port 9998)
Batch validation (port 9999) → must NOT overlap with promptfoo
Smart analysis (port 9877) → can run in parallel with assertions (port 9998)

Per-test invocation:

Test	Command
Unit tests	`swift test`
Assertions (any tier)	Start server: `MACAFM_MLX_MODEL_CACHE=... $BIN_ABS mlx -m MODEL --port 9998 --tool-call-parser afm_adaptive_xml --enable-prefix-caching --enable-grammar-constraints &` then `./Scripts/test-assertions.sh --tier TIER --model MODEL --port 9998 --bin "$BIN_ABS" --grammar-constraints`
Assertions + grammar + forced	`./Scripts/test-assertions-multi.sh --models "MODEL" --tier full --also-forced-parser qwen3_xml --grammar-constraints` with `AFM_BINARY="$BIN_ABS"`
Smart analysis	`AFM_BIN="$BIN_ABS" ./Scripts/mlx-model-test.sh --model MODEL --prompts Scripts/test-llm-comprehensive.txt --smart 1:claude`
Promptfoo	`AFM_MODEL=MODEL AFM_BINARY="$BIN_ABS" ./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh all`
Batch correctness	Start server: `$BIN_ABS mlx -m MODEL --port 9999 --concurrent 8 &` then `python3 Scripts/feature-mlx-concurrent-batch/validate_responses.py`
Batch mixed	Same server, then `python3 Scripts/feature-mlx-concurrent-batch/validate_mixed_workload.py`
Batch multiturn	Same server, then `python3 Scripts/feature-mlx-concurrent-batch/validate_multiturn_prefix.py`
OpenAI compat	`python3 Scripts/feature-codex-optimize-api/test-openai-compat-evals.py --start-server --model MODEL` with `AFM_BINARY="$BIN_ABS"`
Guided JSON	`python3 Scripts/feature-codex-optimize-api/test-guided-json-evals.py --start-server --model MODEL` with `AFM_BINARY="$BIN_ABS"`
GPU profile	`python3 Scripts/gpu-profile-report.py MODEL` with `AFM_BIN="$BIN_ABS"`

After each test completes, present its results immediately. Don't wait for all tests to finish before showing anything.

Step 6: Present Results Summary

After all selected tests complete, present a summary table:

Suite	Pass	Total	Rate	Notes
Pre-flight checks	N	4	—	—
Unit tests	N	N	—	—
Assertions (tier)	N	N	N%	—
...	...	...	...	—

Step 7: Open Promptfoo Web UI (if promptfoo tests were run)

After promptfoo evals complete, launch the interactive web interface:

promptfoo view -y &
# Opens browser at http://localhost:15500
# Shows all evaluations with interactive filtering, pass/fail drill-down, response comparison
# Results are persisted in ~/.promptfoo/promptfoo.db — all historical runs are visible
echo "Promptfoo UI running at http://localhost:15500 — press Ctrl+C to stop"

Leave the server running for the user to explore results. The web UI provides:

Side-by-side comparison of outputs across server profiles (default vs adaptive-xml vs grammar)
Drill-down into individual test failures with full request/response bodies
Filtering by pass/fail status, test description, or provider
Historical comparison with previous promptfoo runs

Step 8: Archive Results

TODAY=$(date +%Y-%m-%d)
ARCHIVE_DIR="test-reports/binary-test/$TODAY"
mkdir -p "$ARCHIVE_DIR"

# Copy all reports generated during this session
cp test-reports/assertions-report-*.html test-reports/assertions-report-*.jsonl "$ARCHIVE_DIR/" 2>/dev/null
cp test-reports/multi-assertions-report-*.html test-reports/multi-assertions-report-*.jsonl "$ARCHIVE_DIR/" 2>/dev/null
cp test-reports/smart-analysis-*.md "$ARCHIVE_DIR/" 2>/dev/null
cp test-reports/mlx-model-report-*.html test-reports/mlx-model-report-*.jsonl "$ARCHIVE_DIR/" 2>/dev/null

# Copy promptfoo results
PROMPTFOO_DIR="${AFM_PROMPTFOO_OUT_DIR:-/Volumes/edata/promptfoo/data/maclocal-api/current}"
cp "$PROMPTFOO_DIR"/*-mlx-community_*.json "$ARCHIVE_DIR/" 2>/dev/null

Write a SUMMARY.md in the archive directory with: binary path, version, model tested, platform, date, and a pass/fail table for every test suite run.

Interpreting Results

Server-Critical Suites (must be 100% pass — failures = server bug)

Suite	What it validates
Assertions: sections 0-8, 10-15	Core server functionality
Promptfoo: structured, toolcall, grammar (non-concurrent), frameworks	API-level tool calling and structured output
OpenAI compat evals	SDK compatibility
Batch correctness	KV cache isolation under concurrency

Model-Quality Suites (failures expected — not server bugs)

Suite	Typical pass rate	Why it varies
Promptfoo: opencode, pi, openclaw, hermes	70-90%	Model can't always pick correct tool for complex scenarios
Promptfoo: toolcall-quality	~80%	Model quality on when-to-call decisions
Promptfoo: grammar (concurrent)	50-70%	Known race condition in `--concurrent 2` grammar path
Smart analysis	Varies	AI judge scoring variance, thinking model token budget
Batch multiturn prefix	~85-90%	Model answer quality at high concurrency

When to Investigate

Any assertion failure in sections 0-8 → server bug, investigate immediately
Relocated binary crash (Check C) → Bundle.module regression, fix before doing anything else
All tool calls missing → wrong tool call format detection, check model_type in config.json
NaN/garbage in long context → SDPA regression, check MLX version (must be pinned to 0.30.3)
Streaming tool calls missing finish_reason → check MLXChatCompletionsController state machine

Quick Reference

# Smoke test the current build
/test-afm-binary .build/arm64-apple-macosx/release/afm

# Test a Homebrew-installed binary
/test-afm-binary $(brew --prefix afm-next)/bin/afm

# Test a pip-installed binary
/test-afm-binary $(python3 -c "import macafm_next; print(macafm_next.binary_path())")

test-afm-binary

Plus depuis ce dépôt

Test AFM Binary

Usage

Instructions

Step 1: Resolve Binary Path

Step 2: Pre-Flight Safety Checks (MANDATORY — always run)

Check A: Binary version

Check B: Metallib present

Check C: Relocated binary does NOT crash

Check D: No Bundle.module in source code

Check E: Info.plist embedded with privacy usage descriptions

Present pre-flight results

Step 3: Select Model

Step 4: Select Tests

Step 5: Run Selected Tests

Step 6: Present Results Summary

Step 7: Open Promptfoo Web UI (if promptfoo tests were run)

Step 8: Archive Results

Interpreting Results

Server-Critical Suites (must be 100% pass — failures = server bug)

Model-Quality Suites (failures expected — not server bugs)

When to Investigate

Quick Reference

Test AFM Binary

Usage

Instructions

Step 1: Resolve Binary Path

Step 2: Pre-Flight Safety Checks (MANDATORY — always run)

Check A: Binary version

Check B: Metallib present

Check C: Relocated binary does NOT crash

Check D: No Bundle.module in source code

Check E: Info.plist embedded with privacy usage descriptions

Present pre-flight results

Step 3: Select Model

Step 4: Select Tests

Step 5: Run Selected Tests

Step 6: Present Results Summary

Step 7: Open Promptfoo Web UI (if promptfoo tests were run)

Step 8: Archive Results

Interpreting Results

Server-Critical Suites (must be 100% pass — failures = server bug)

Model-Quality Suites (failures expected — not server bugs)

When to Investigate

Quick Reference

Plus depuis ce dépôt