test-macafm
Run the maclocal-api (AFM/MLX) test suite: automated assertions and smart analysis. Use when asked to test, validate, regression-check, or benchmark AFM before release, after code changes, or for model onboarding.
| name | test-macafm |
| description | Run the maclocal-api (AFM/MLX) test suite — automated assertions and smart analysis. Use when asked to test, validate, regression-check, or benchmark AFM before release, after code changes, or for model onboarding. |
Run the maclocal-api test suite: automated pass/fail assertions and AI-judge smart analysis.
Use this skill when the user asks to:
.build/release/afm. Ask if user has a custom build location.

| Tier | Time | When to use | What runs |
|---|---|---|---|
| smoke | ~2 min | Quick sanity check, any small model, CI | test-assertions.sh --tier smoke |
| standard | ~15 min | After feature changes, mid-size model | test-assertions.sh --tier standard |
| full | ~60 min | Release validation, production model | test-assertions.sh --tier full + mlx-model-test.sh --smart with test-llm-comprehensive.txt + promptfoo agentic evals |
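
Each tier in the table corresponds to one `test-assertions.sh` invocation. The following is a minimal Python sketch of that mapping, assuming only the flags shown in this document (`--tier`, `--model`, `--port`). The helper name `assertion_command` is hypothetical and is not part of the repo.

```python
import shlex

# Tier names taken from the table above.
TIERS = ("smoke", "standard", "full")


def assertion_command(tier, model, port=9998):
    """Build the argv for Scripts/test-assertions.sh (hypothetical helper)."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return [
        "./Scripts/test-assertions.sh",
        "--tier", tier,
        "--model", model,
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the smoke tier.
    print(shlex.join(assertion_command("smoke", "MODEL")))
```

Returning an argv list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.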
Quick guide:
# Ensure release build is current
swift build -c release
MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
.build/release/afm mlx -m MODEL --port 9998 \
--tool-call-parser afm_adaptive_xml \
--enable-prefix-caching \
--enable-grammar-constraints &
# Wait for server to be ready
until curl -sf http://127.0.0.1:9998/v1/models >/dev/null 2>&1; do sleep 1; done
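
The `until curl` loop above waits forever if the server exits during model load. A bounded wait could be sketched in Python as follows. The function names and the 300-second default are assumptions for illustration, not part of the AFM scripts. Only the `/v1/models` endpoint comes from this document.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ok(base_url, timeout=2.0):
    """Return True if GET {base_url}/v1/models answers with a 2xx status."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, deadline_s=300.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or deadline_s seconds have elapsed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A caller would use it as `wait_until_ready(lambda: models_endpoint_ok("http://127.0.0.1:9998"))` and abort the test run when it returns `False`.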
Recommended flags for testing:
- `--tool-call-parser afm_adaptive_xml`: best tool call parser, with JSON-in-XML fallback
- `--enable-prefix-caching`: 67-79% prompt token savings on repeated requests
- `--enable-grammar-constraints`: EBNF constrained decoding forces valid XML tool calls, improving success from 60% to 100% on realistic workloads

./Scripts/test-assertions.sh --tier TIER --model MODEL --port 9998
Interpret results immediately. If any FAIL, investigate before proceeding.
The smart analysis harness manages its own server (port 9877) — do NOT pass --port.
It uses test-llm-comprehensive.txt which has an [all] baseline prompt and [@ label]
template sections. The --smart flag accepts batch mode prefix and tool list.
AFM_BIN=.build/release/afm ./Scripts/mlx-model-test.sh \
--model MODEL \
--prompts Scripts/test-llm-comprehensive.txt \
--smart 1:claude
Smart analysis options:
- `--smart claude` or `--smart codex`: batch mode 0 (one big swoop, may fail on large test suites)
- `--smart 1:claude` or `--smart 1:codex`: batch mode 1 (test-by-test, more reliable)
- `--smart 1:claude,codex`: run multiple AI judges
- `--tests 1,5,10`: run only specific test numbers (1-indexed)

Note: The [all] prompt runs for every test variant. With high max_tokens (e.g., 32768
on code tests), thinking models may generate very long reasoning for the baseline prompt.
Total run time for full suite: ~45-90 min depending on model speed.
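
The `--smart` value combines an optional batch-mode prefix with a comma-separated judge list. The sketch below shows that syntax as described in this section. It is an illustration only: `parse_smart_spec` is a hypothetical name, and the real parsing inside `mlx-model-test.sh` may differ.

```python
def parse_smart_spec(spec):
    """Split a --smart value into (batch_mode, judges).

    A value without a "N:" prefix is treated as batch mode 0, as this
    section describes for `--smart claude`.
    """
    mode_text, sep, tools_text = spec.partition(":")
    if sep:
        batch_mode = int(mode_text)  # raises ValueError on a non-numeric prefix
    else:
        batch_mode, tools_text = 0, spec
    judges = [name.strip() for name in tools_text.split(",") if name.strip()]
    if not judges:
        raise ValueError(f"no judge named in {spec!r}")
    return batch_mode, judges


if __name__ == "__main__":
    for example in ("claude", "1:codex", "1:claude,codex"):
        print(example, "->", parse_smart_spec(example))
```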
Generates an interactive HTML report with measured DRAM bandwidth, GPU utilization/power timelines, and per-kernel Metal shader names from xctrace Shader Timeline.
One-time setup (creates custom Instruments template with Shader Timeline enabled):
python3 Scripts/create-shader-template.py
Run the profile (no server needed — uses single-prompt mode):
python3 Scripts/gpu-profile-report.py MODEL [max_tokens] [prompt]
# Default: 4096 tokens, built-in GPU analysis prompt
# Example: python3 Scripts/gpu-profile-report.py mlx-community/Qwen3.5-35B-A3B-4bit
This does everything automatically:
- `--gpu-profile --gpu-trace 15`
- `/tmp/afm-gpu-profile.html` and opens in browser

Or use individual flags on any AFM invocation:
afm mlx -m MODEL --gpu-profile -s "prompt" # Zero-overhead stats
afm mlx -m MODEL --gpu-profile-bw -s "prompt" # + mactop bandwidth (~5s)
afm mlx -m MODEL --gpu-trace 10 -s "prompt" # xctrace shader trace
Live bandwidth monitor (run in separate terminal during server requests):
./Scripts/gpu-profile.sh bandwidth
What the report shows:
What to look for:
- affine_qmv_fast (decode bottleneck), steel_gemm_fused (prefill), sdpa_vector (attention)

Clients can request GPU profiling data via the X-AFM-Profile HTTP header.
No server flags required — works on any running AFM server.
Two levels:
# Summary: GPU power, memory, bandwidth, tok/s
curl http://127.0.0.1:9999/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-AFM-Profile: true" \
-d '{"model":"m","messages":[{"role":"user","content":"Hi"}]}'
# Extended: summary + 300ms time-series samples (for charts/dashboards)
curl http://127.0.0.1:9999/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-AFM-Profile: extended" \
-d '{"model":"m","messages":[{"role":"user","content":"Hi"}]}'
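
The same request can be issued from a script. This is a minimal Python sketch under stated assumptions: the helper names are hypothetical, and `afm_profile` is assumed to be a top-level key of the response JSON, which this document does not spell out. The field names `est_bandwidth_gbs` and `theoretical_bw_gbs` are the ones listed in this section.

```python
import json
import urllib.request


def build_profile_request(base_url, level="true", model="m", prompt="Hi"):
    """Build a chat-completions request carrying the X-AFM-Profile header."""
    if level not in ("true", "extended"):
        raise ValueError("level must be 'true' or 'extended'")
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "X-AFM-Profile": level},
        method="POST",
    )


def fetch_profile(base_url, level="true"):
    """Send the request and return the afm_profile object, or None if absent."""
    with urllib.request.urlopen(build_profile_request(base_url, level)) as resp:
        payload = json.load(resp)
    # Assumption: afm_profile sits at the top level of the response JSON.
    return payload.get("afm_profile")


def bandwidth_utilization(profile):
    """Fraction of theoretical DRAM bandwidth reached, or None if fields are missing."""
    estimated = profile.get("est_bandwidth_gbs")
    theoretical = profile.get("theoretical_bw_gbs")
    if not estimated or not theoretical:
        return None
    return estimated / theoretical
```

For example, an estimated 170 GB/s against a theoretical 800 GB/s gives a utilization of about 0.21.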
Response fields (afm_profile):
- gpu_power_avg_w / gpu_power_peak_w: GPU power via native IOReport (no mactop)
- memory_weights_gib / memory_kv_gib / memory_peak_gib: memory breakdown in GiB
- prefill_tok_s / decode_tok_s: throughput
- est_bandwidth_gbs: DRAM bandwidth from IOReport power (calibrated at startup via MLX GPU stress)
- chip / theoretical_bw_gbs: hardware context
- gpu_samples: number of 300ms readings taken

Extended adds (afm_profile_extended):
- summary: same as afm_profile
- samples[]: one reading per 300ms, each {t, bw_gbs, gpu_pct, gpu_power_w, dram_power_w}

How it works internally:
- Energy Model + GPU Stats channels sampled every 300ms via DispatchSource timer
- [DONE]) and non-streaming

What to look for:
- gpu_power_peak_w ~28W during decode on M3 Ultra (matches mactop)
- est_bandwidth_gbs ~170-180 GB/s for Qwen3.5-35B-A3B-4bit (21% of 800 GB/s theoretical)
- afm_profile absent from response when header not sent (no null pollution)

The promptfoo agentic eval suite tests AFM's tool-calling and structured-output across multiple server configurations and real-world agent framework schemas. It manages its own server lifecycle.
Prerequisites: promptfoo CLI must be installed (npm install -g promptfoo).
Run the full suite:
AFM_MODEL=MODEL \
AFM_BINARY=.build/arm64-apple-macosx/release/afm \
MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh all
Run individual suites:
# Just structured output tests
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh structured
# Just tool calling (all 3 parser profiles)
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh toolcall
# Just grammar constraint validation (8 server phases)
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh grammar-constraints
# Just one agent framework
./Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh opencode
Available modes: all, structured, structured-stress, toolcall, toolcall-quality, grammar-constraints, agentic, frameworks, opencode, pi, openclaw, hermes, default, adaptive-xml, adaptive-xml-grammar
| Suite | Tests | Profiles | What it validates |
|---|---|---|---|
| structured | 6 | 1 (api json_schema) | response_format=json_schema strict compliance |
| structured-stress | 4 | 1 | Nested arrays, enums, nullable types in schema |
| toolcall | 7 | 3 (default, adaptive-xml, grammar) | Basic tool call parsing: weather, time, multi-tool |
| toolcall-quality | 6 | 3 | BFCL-inspired when-to-call decisions (should model use a tool?) |
| grammar-constraints | 17 | 8 server phases | Schema + tool enforcement across: no-grammar, grammar-enabled, adaptive-xml, concurrent, prefix-cache, mixed-strict, header downgrade/enforce |
| agentic | 4 | 3 | Multi-turn coding workflow tool chains |
| frameworks | 8 | 3 | Agent framework tool shapes (OpenCode, Pi, OpenClaw, Hermes) |
| opencode | 37 | 3 | OpenCode built-in tools (primary-source derived) |
| pi | 20 | 3 | Pi coding-agent tools |
| openclaw | 12 | 3 | OpenClaw tool coverage |
| hermes | 12 | 3 | Hermes agentic framework tools |

| Profile | AFM flags | Purpose |
|---|---|---|
| default | (none) | Baseline: auto-detected tool call format |
| adaptive-xml | --tool-call-parser afm_adaptive_xml | Adaptive XML with JSON-in-XML fallback |
| adaptive-xml-grammar | --tool-call-parser afm_adaptive_xml --enable-grammar-constraints | Adaptive XML + EBNF grammar enforcement |
| grammar-enabled | --enable-grammar-constraints | Grammar without adaptive XML |
| grammar-enabled-adaptive-xml | Both flags | Regression guard: grammar + adaptive XML |
| grammar-enabled-concurrent | --enable-grammar-constraints --concurrent 2 | Grammar under concurrency |
| grammar-enabled-prefix-cache | --enable-grammar-constraints --enable-prefix-caching | Grammar + prefix caching interaction |
| grammar-enabled-concurrent-cache | All three flags | Full feature stack |

- providers/afm_provider.mjs: custom promptfoo provider with two transports, api (OpenAI-compatible HTTP) and cli-guided-json (direct binary invocation). Supports extract modes content, tool_calls, normalized_message, and full_response. Captures responseHeaders for grammar header assertions.
- judges/assert-grammar-header.mjs: validates the X-Grammar-Constraints response header. Expects "downgraded" when grammar is not available, and the header absent when grammar is active.
- judges/classify-failures.mjs: post-run AI-based failure classifier. Categorizes each failure as afm_bug (server/protocol), model_quality (wrong tool/args), or harness_bug (false negative).

| Variable | Default | Purpose |
|---|---|---|
| AFM_MODEL | mlx-community/Qwen3.5-35B-A3B-4bit | Model to test |
| AFM_BINARY | .build/arm64-apple-macosx/release/afm | Binary path |
| AFM_PROMPTFOO_OUT_DIR | /Volumes/edata/promptfoo/data/maclocal-api/current | Report output dir |
| AFM_PROMPTFOO_PORT | 9999 | Server port |
| MACAFM_MLX_MODEL_CACHE | (none) | Model cache dir |

JSON reports per suite+profile in $AFM_PROMPTFOO_OUT_DIR:
- structured-MODEL_SLUG.json
- toolcall-{default,adaptive-xml,adaptive-xml-grammar}-MODEL_SLUG.json
- grammar-{schema,tools}-{no-grammar,grammar-enabled,adaptive-xml,concurrent,prefix-cache}-MODEL_SLUG.json
- {agentic,frameworks,opencode,pi,openclaw,hermes}-{default,adaptive-xml,adaptive-xml-grammar}-MODEL_SLUG.json

- test-reports/assertions-report-*.html
- test-reports/smart-analysis-{tool}-*.md
- test-reports/mlx-model-report-*.html
- /tmp/afm-gpu-profile.html (+ /tmp/afm-metal.trace for Instruments)
- test-reports/assertions-report-*.jsonl, test-reports/mlx-model-report-*.jsonl
- $AFM_PROMPTFOO_OUT_DIR/{suite}-{profile}-MODEL_SLUG.json (default: /Volumes/edata/promptfoo/data/maclocal-api/current/)

kill %1 # or whatever the background job is

| Group | Common failures | What to check |
|---|---|---|
| Stop | Stop string found in output | Check MLXModelService.swift stop buffer logic, streaming vs non-streaming paths |
| Logprobs | Schema invalid, logprob > 0 | Check resolveLogprobs() and buildChoiceLogprobs() |
| Think | <think> tags in content | Check extractThinkContent() and extractThinkTags() |
| Tools | No tool_calls, invalid JSON args | Check extractToolCallsFallback(), model's tool call format |
| Cache | cached_tokens always 0 | Check enablePrefixCaching, findPrefixLength(), PromptCacheBox |
| Concurrent | Non-200 responses | Check SerialAccessContainer locking, request queuing |
| Error | Wrong HTTP status codes | Check controller validation logic |
| Kwargs | Thinking not disabled by enable_thinking: false | Check chat_template_kwargs merging into additionalContext in MLXModelService.swift |
| Perf | Low tok/s, high TTFT | Check model quantization, Metal kernel performance |
| OpenAI-compat | Stream usage chunk missing, logprobs absent | Check StreamingUsageChunk encoding, empty choices on final chunk |
| Guided JSON | Schema validation failure, invalid JSON | Check --guided-json / response_format pipeline, grammar constraints |
| Batch | Garbage output, wrong answers at B>1 | Check BatchScheduler, KV cache isolation, mask generation |
Known patterns where AI judges score incorrectly (see references/interpreting-scores.md):
- `<think>`: correct, not a bug
- max_tokens budget on reasoning with empty visible content: model behavior, not a server bug
- [all] baseline prompt scored low when it runs with a code/math test's high max_tokens and system prompt: irrelevant context for the baseline prompt

| Category | Typical pass rate | What failures mean |
|---|---|---|
| structured, structured-stress | 100% | Server bug in response_format pipeline — investigate immediately |
| toolcall (all profiles) | 100% | Server bug in tool call parsing — investigate immediately |
| toolcall-quality | ~80% | Model chose wrong tool or missed when-to-call — model quality, not server |
| grammar-schema / grammar-tools (non-concurrent) | 100% | Grammar constraint enforcement broken — server bug |
| grammar-schema / grammar-tools (concurrent) | ~50-70% | Known race condition in --concurrent 2 grammar path — not release blocker |
| grammar-header / grammar-mixed | 100% | X-Grammar-Constraints header or mixed-strict wiring broken — server bug |
| agentic | ~75-100% | Multi-turn failures are usually model quality; 0% pass = server bug |
| frameworks | 100% | Framework tool shapes must parse correctly — server bug if failing |
| opencode | ~70-80% | Complex 37-tool scenarios; model can't always pick correct tool — model quality |
| pi | ~80-90% | Model prompt injection resistance varies — model quality |
| openclaw | ~80-85% | Model quality on OpenClaw-specific schemas |
| hermes | ~90-100% | Hermes format failures on adaptive-xml profiles = parser difference, not bug |
Key rule: structured, toolcall, grammar-* (non-concurrent), frameworks suites should be 100% pass. Any failure there is a server bug. Everything else has model-quality variance.
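
The key rule can be applied mechanically when reading a batch of results. The sketch below is a hypothetical triage helper in Python. The suite names and the must-pass set come from the table above. The function name, the three labels, and the `concurrent` flag are assumptions for illustration and do not exist in the repo.

```python
# Suites the key rule expects at 100%. Non-concurrent grammar-* suites are
# handled separately by prefix.
MUST_PASS = {"structured", "structured-stress", "toolcall", "frameworks"}


def triage(suite, passed, total, concurrent=False):
    """Label one suite result as 'ok', 'server-bug', or 'expected-variance'."""
    if total <= 0:
        raise ValueError("total must be positive")
    if passed == total:
        return "ok"
    is_grammar = suite.startswith("grammar-")
    if suite in MUST_PASS or (is_grammar and not concurrent):
        return "server-bug"
    if suite == "agentic" and passed == 0:
        # The table treats a 0% agentic pass rate as a server bug.
        return "server-bug"
    # Model-quality suites, plus the known concurrent grammar race.
    return "expected-variance"


if __name__ == "__main__":
    print(triage("toolcall", 6, 7))          # server-bug
    print(triage("toolcall-quality", 5, 6))  # expected-variance
```

An exact-match set is used instead of a prefix test so that `toolcall-quality`, which has model-quality variance, is not mistaken for the must-pass `toolcall` suite.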
Post-run failure classification (optional): Run judges/classify-failures.mjs on any result JSON to get AI-based afm_bug vs model_quality vs harness_bug classification.
- ToolCallFormat.infer() and model's config.json
- Scripts/apply-mlx-patches.sh --check

Full-harness concurrency sweep that starts the server, runs warmup, tests all concurrency levels, collects GPU metrics via mactop, saves JSON results, and generates a comparison chart.
Scripts/benchmarks/benchmark_afm_vs_mlxlm.py
# AFM-only concurrency sweep (recommended for quick benchmarks)
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --afm-only
# Full AFM vs mlx-lm comparison (both servers, fair A/B)
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py
# Re-generate graph from existing results
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --graph
python3 Scripts/benchmarks/benchmark_afm_vs_mlxlm.py --graph Scripts/benchmark-results/FILE.json
- `--concurrent N`
- `[1, 2, 4, 8, 12, 16, 20, 24, 32, 40, 50]`
- Scripts/benchmark-results/concurrency-benchmark-TIMESTAMP.json
- Scripts/benchmark-results/concurrency-benchmark-TIMESTAMP.png

| Variable | Default | Purpose |
|---|---|---|
| MODEL_ID | mlx-community/Qwen3.5-35B-A3B-4bit | Model to benchmark |
| MAX_TOKENS | 4096 | Tokens per request (forces long decode) |
| MAX_CONCURRENT | 50 | --concurrent flag value (must be >= max level) |
| LEVELS | [1,2,4,8,12,16,20,24,32,40,50] | Concurrency levels to test |
| AFM_PORT | 9999 | Port for AFM server |
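
| B | Agg t/s | Per-req | Wall | GPU% | GPU W |
|---|---|---|---|---|---|
| 1 | 118.7 | 118.7 | 34.5s | 94% | 28.5W |
| 2 | 193.9 | 97.0 | 42.2s | 93% | 41.6W |
| 4 | 298.4 | 74.6 | 54.9s | 97% | 62.7W |
| 8 | 407.3 | 50.9 | 80.5s | 96% | 75.5W |
| 12 | 493.4 | 41.1 | 99.6s | 98% | 83.4W |
| 16 | 573.9 | 35.9 | 114.2s | 99% | 88.2W |
| 20 | 581.6 | 29.1 | 140.8s | 98% | 79.1W |
| 24 | 629.6 | 27.4 | 149.6s | 99% | 83.2W |
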
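
One way to read the sample results above is to express aggregate throughput as a speedup over B=1 and as a per-slot efficiency. The Python sketch below does that arithmetic on the aggregate tok/s column copied from the table. The `scaling` helper is hypothetical and is not part of the benchmark script.

```python
# (B, aggregate tok/s) pairs copied from the sample benchmark output above.
RESULTS = [
    (1, 118.7), (2, 193.9), (4, 298.4), (8, 407.3),
    (12, 493.4), (16, 573.9), (20, 581.6), (24, 629.6),
]


def scaling(results):
    """Yield (B, speedup over B=1, efficiency = speedup / B) for each level."""
    baseline = dict(results)[1]  # aggregate tok/s at B=1
    for batch, aggregate in results:
        speedup = aggregate / baseline
        yield batch, speedup, speedup / batch


if __name__ == "__main__":
    for batch, speedup, efficiency in scaling(RESULTS):
        print(f"B={batch:>2}  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")
```

On these numbers the aggregate throughput at B=24 is about 5.3x the single-request rate, so most of the gain arrives by B=16 and further levels add little.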

| Script | Purpose |
|---|---|
| Scripts/feature-mlx-concurrent-batch/batch_stress_mactop.py | Quick stress test at arbitrary concurrency (client-only, needs running server on port 9876) |
| Scripts/feature-mlx-concurrent-batch/batch_stress_ioreg.py | Same but uses ioreg for GPU stats (less accurate) |
| Scripts/feature-mlx-concurrent-batch/validate_responses.py | Known-answer correctness at B={1,2,4,8} |
| Scripts/feature-mlx-concurrent-batch/validate_mixed_workload.py | Mixed short+long workload batch validation |
| Scripts/feature-mlx-concurrent-batch/validate_multiturn_prefix.py | Multi-turn prefix cache under concurrency |


| File | Purpose |
|---|---|
| Scripts/benchmarks/benchmark_afm_vs_mlxlm.py | Full concurrency benchmark harness (server lifecycle, warmup, sweep, GPU metrics, chart generation) |
| Scripts/test-assertions.sh | Automated pass/fail assertion tests (unit/smoke/standard/full tiers, includes swift test) |
| Scripts/test-llm-comprehensive.txt | Comprehensive smart analysis test suite (model-generic, [@ label] template mode, has [all] baseline) |
| Scripts/test-Qwen3.5-35B-A3B-4bit.txt | Model-specific test suite for Qwen3.5-35B-A3B-4bit (same tests as comprehensive, hardcoded model) |
| Scripts/test-edge-cases.txt | Legacy smart analysis test prompts (smaller set) |
| Scripts/test-sampling-params.sh | Sampling parameter tests (seed, temp, top_p, etc.) |
| Scripts/test-structured-outputs.sh | JSON schema / structured output tests |
| Scripts/test-tool-call-parsers.py | Unit tests for tool call parsing |
| Scripts/mlx-model-test.sh | Test harness: runs prompts, collects results, generates reports |
| Scripts/test-chat-template-kwargs.sh | Standalone chat_template_kwargs tests (includes --no-think CLI + precedence) |
| Scripts/regression-test.sh | Quick regression smoke test |
| Scripts/feature-codex-optimize-api/test-openai-compat-evals.py | OpenAI-python SDK compatibility evals (non-stream, stream, logprobs, vllm bench) |
| Scripts/feature-codex-optimize-api/test-guided-json-evals.py | Guided JSON / structured output evals (API, streaming, CLI, SDK parse, edge cases) |
| Scripts/feature-mlx-concurrent-batch/validate_responses.py | Batched generation correctness: known-answer questions at B={1,2,4,8} |
| Scripts/feature-mlx-concurrent-batch/validate_mixed_workload.py | Mixed short+long workload batch validation with GPU metrics |
| Scripts/feature-mlx-concurrent-batch/validate_multiturn_prefix.py | Multi-turn prefix cache validation under concurrency |
| Scripts/gpu-profile-report.py | Full GPU shader profiling harness: mactop BW + --gpu-profile + --gpu-trace + HTML report |
| Scripts/gpu-profile.sh | GPU profiling helpers: bandwidth monitor, capture, trace, power |
| Scripts/create-shader-template.py | One-time: patches Metal System Trace template for per-kernel shader names |
| Tests/MacLocalAPITests/StreamingUsageChunkTests.swift | Unit tests: streaming usage chunks, finish reasons, Foundation commonPrefixLength |
| Tests/MacLocalAPITests/ConcurrentBatchTests.swift | Unit tests: RequestSlot, StreamChunk, BatchScheduler internals |
| Scripts/feature-promptfoo-agentic/run-promptfoo-agentic.sh | Promptfoo agentic eval orchestrator: 11 modes, 8 server profiles, 16 configs |
| Scripts/feature-promptfoo-agentic/providers/afm_provider.mjs | Custom promptfoo provider: api + cli-guided-json transports, 4 extract modes |
| Scripts/feature-promptfoo-agentic/judges/assert-grammar-header.mjs | Custom assertion: validates X-Grammar-Constraints response header |
| Scripts/feature-promptfoo-agentic/judges/classify-failures.mjs | AI-based failure classifier: afm_bug vs model_quality vs harness_bug |
| Scripts/feature-promptfoo-agentic/promptfooconfig.*.yaml | 16 promptfoo config files (~137 test cases total) |
| Scripts/feature-promptfoo-agentic/datasets/ | 16 YAML dataset files across structured, toolcall, grammar, agentic directories |

- enable_thinking=false disables thinking (if model supports it)
- test-openai-compat-evals.py (non-stream, stream, logprobs, usage chunk)
- test-guided-json-evals.py (API schema, streaming schema, SDK parse)
- --smart 1:claude or --smart 1:codex)
- validate_responses.py at B={1,2,4,8}
- validate_mixed_workload.py (short+long decode, GPU metrics)
- validate_multiturn_prefix.py (multi-turn conversations under concurrency)
- gpu-profile-report.py (bandwidth, power, kernel names, HTML report)
- X-AFM-Profile: true returns afm_profile with GPU power + bandwidth
- X-AFM-Profile: extended returns afm_profile_extended with samples array
- afm_profile fields in response (no null pollution)
- [DONE]