Name: Test Opencode Tooling
Author: scouzi1966

Updated: March 11, 2026 at 11:29

Repository: scouzi1966/maclocal-api

name	test-opencode-tooling
description	Use when testing tool call reliability between OpenCode and afm — captures streaming XML tool call errors, classifies them as afm translation bugs vs model generation errors, and produces a diagnostic report without fixing anything

Test OpenCode Tooling

Automated loop that runs OpenCode tasks against afm, captures tool call errors from both sides, classifies each as an afm bug or model error, and generates a report. Does not fix anything.

When to Use

- After changing tool call parsing code (XML, streaming, type coercion)
- When onboarding a new model, to verify tool call reliability
- When investigating user-reported tool call failures with OpenCode
- When comparing tool call error rates across models

First Questions to Ask

1. Prompt/PRD: Ask the user to paste the prompt text or provide a file path. This is the task OpenCode will execute (e.g., a PRD, coding task, or test scenario that exercises tool calls).

2. Model(s): Which model(s) to test? List the available ones:

MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache ./Scripts/list-models.sh

3. afm start parameters: Any extra flags beyond the defaults? (e.g., --tool-call-parser afm_adaptive_xml, --enable-prefix-caching, --enable-grammar-constraints, --no-think). Recommended: --tool-call-parser afm_adaptive_xml --enable-grammar-constraints. This combination gives the highest tool call success rate (100% on 35B-A3B vs 60% without grammar constraints on realistic workloads).
4. Iterations: How many times to run the same prompt per model? Default: 1. More runs help distinguish flaky model errors from deterministic afm bugs.
5. Working directory: Temp dir for OpenCode to work in. Default: create a fresh /tmp/opencode-test-TIMESTAMP per run.

OpenCode CLI Gotchas

CRITICAL: opencode run hangs silently without a PTY. It prints one INFO line and freezes — no error, no output. You must use one of these approaches:

- opencode serve + run --attach (recommended): start a headless server, then attach run to it through expect, which supplies the PTY
- expect wrapper: provides the pseudo-TTY that opencode run requires

Other gotchas:

- The opencode.json model field must be a string, not an object: "model": "ollama/model-id", not "model": {"default": "..."}
- The npm provider format (@ai-sdk/openai-compatible) is required for a custom baseURL; the "api": "openai" format does NOT accept baseURL
- OpenCode loads config from both ~/.config/opencode/opencode.json (global) AND $WORKDIR/opencode.json (local); local overrides global
- The workdir should be a git repo (git init) for OpenCode to function properly

OpenCode Log & Error Data

Log Files (limited — no tool call errors)

OpenCode writes logs to ~/.local/share/opencode/log/ in UTC-timestamped files (e.g., 2026-03-09T172212.log). These logs do NOT contain tool call errors or tool input/output. They only log permission checks, bus events, and registry start/complete.

# Find the latest OpenCode log
ls -t ~/.local/share/opencode/log/*.log | head -1

# Monitor the latest log in real-time
tail -f "$(ls -t ~/.local/share/opencode/log/*.log | head -1)"

Gotcha: Log filenames use UTC timestamps but ls -lt shows local time. A file named 2026-03-10T001322.log was created at 8:13 PM EDT. Use lsof -p <PID> | grep log to find the current session's log file if it doesn't appear in directory listings yet (OpenCode buffers writes).

When monitoring both afm and OpenCode simultaneously:

afm log: /tmp/afm-opencode-test.log (or wherever you tee'd it)
OpenCode log: ~/.local/share/opencode/log/<latest>.log

ALWAYS start OpenCode with --log-level "DEBUG" --print-logs — both opencode serve and opencode run commands must include these flags.

SQLite Database (structured tool call data with errors)

Tool call inputs, outputs, and errors are stored in OpenCode's SQLite database — not in the log files. This is the only place to get the full JSON of failed tool calls.

Database path: ~/.local/share/opencode/opencode.db

Schema: Tool calls are in the part table as JSON in the data column, keyed by session_id.

# List recent sessions
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT id, title, datetime(time_created/1000, 'unixepoch', 'localtime') FROM session ORDER BY time_created DESC LIMIT 5;"

# Get ALL tool call errors for a session
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT data FROM part WHERE session_id = '<SESSION_ID>' AND data LIKE '%\"status\":\"error\"%';"

# Get errors for the most recent session
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT data FROM part WHERE session_id = (SELECT id FROM session ORDER BY time_created DESC LIMIT 1) AND data LIKE '%\"status\":\"error\"%';"

# Get all edit tool errors across all sessions
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT data FROM part WHERE data LIKE '%\"tool\":\"edit\"%' AND data LIKE '%\"status\":\"error\"%' ORDER BY time_created DESC LIMIT 10;"

Error JSON format:

{
  "type": "tool",
  "callID": "call_8B05B790A94F4A0EBF2850C0",
  "tool": "edit",
  "state": {
    "status": "error",
    "input": {
      "filePath": "/path/to/file.py",
      "oldString": "text the model expected to find",
      "newString": "replacement text"
    },
    "error": "Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.",
    "time": {
      "start": 1773102597849,
      "end": 1773102597850
    }
  }
}

Successful tool call JSON format:

{
  "type": "tool",
  "callID": "call_34CB225B0D184310BD64A839",
  "tool": "edit",
  "state": {
    "status": "completed",
    "input": {
      "filePath": "/path/to/file.py",
      "oldString": "...",
      "newString": "..."
    },
    "output": "Edit applied successfully.",
    "title": "path/to/file.py",
    "metadata": {
      "diagnostics": {},
      "diff": "Index: /path/to/file.py\n===...",
      "filediff": { "file": "...", "before": "...", "after": "..." }
    }
  }
}

Useful queries for test analysis:

# Count tool calls by status for a session
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT json_extract(data, '$.tool') as tool,
          json_extract(data, '$.state.status') as status,
          COUNT(*) as cnt
   FROM part
   WHERE session_id = '<SESSION_ID>' AND json_extract(data, '$.type') = 'tool'
   GROUP BY tool, status;"

# Get all tool call inputs/outputs, pretty-printed. Each row is a separate JSON
# object, so json.tool needs --json-lines (Python 3.8+); piping to jq . also works.
sqlite3 ~/.local/share/opencode/opencode.db \
  "SELECT data FROM part WHERE session_id = '<SESSION_ID>' AND json_extract(data, '$.type') = 'tool';" | python3 -m json.tool --json-lines

Execution Workflow

1. Setup

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
TEST_PORT=9877
OC_PORT=4096
REPORT_DIR="test-reports/opencode-tooling-${TIMESTAMP}"
mkdir -p "$REPORT_DIR"

Save the user's prompt to a file:

cat > "$REPORT_DIR/prompt.md" << 'PROMPT_EOF'
<paste user's prompt here>
PROMPT_EOF

2. Start OpenCode Serve

Create a workdir with git init and config pointing at afm:

OC_WORKDIR="/tmp/opencode-serve-${TIMESTAMP}"
mkdir -p "$OC_WORKDIR"
cd "$OC_WORKDIR" && git init -q && cd -

Write the OpenCode config. Must use npm provider with options.baseURL:

cat > "$OC_WORKDIR/opencode.json" << EOF
{
  "\$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "afm-test",
      "options": {
        "baseURL": "http://localhost:${TEST_PORT}/v1"
      },
      "models": {
        "${MODEL}": {
          "name": "${MODEL}"
        }
      }
    }
  }
}
EOF

Start the headless server:

cd "$OC_WORKDIR"
opencode serve --port $OC_PORT --print-logs --log-level DEBUG \
  > "$REPORT_DIR/opencode-serve.log" 2>&1 &
OC_SERVE_PID=$!
cd -

# Wait for serve to be ready
until curl -sf http://127.0.0.1:${OC_PORT}/ >/dev/null 2>&1; do sleep 1; done
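The readiness loop above polls forever if the server never comes up. A minimal sketch of a bounded variant follows. It assumes a hypothetical helper named `wait_for`, which is not part of OpenCode or afm, and retries any command once per second up to a limit:

```shell
# Hypothetical helper: retry a command once per second, give up after N attempts.
# Usage: wait_for MAX_ATTEMPTS command [args...]
wait_for() {
  max_attempts=$1
  shift
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1   # command never succeeded within the limit
    fi
    sleep 1
  done
  return 0
}
```

It could replace the bare loop as `wait_for 60 curl -sf "http://127.0.0.1:${OC_PORT}/" >/dev/null 2>&1 || exit 1`, so a failed start ends the test run instead of blocking it. The same helper would fit the afm readiness check.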

3. For Each Model

3a. Start afm with verbose logging

AFM_DEBUG=1 MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
  .build/release/afm mlx -m "$MODEL" --port $TEST_PORT -V \
  $EXTRA_AFM_FLAGS \
  > "$REPORT_DIR/${MODEL_SLUG}-afm.log" 2>&1 &
AFM_PID=$!

# Wait for server ready
until curl -sf http://127.0.0.1:${TEST_PORT}/v1/models >/dev/null 2>&1; do sleep 1; done

Where MODEL_SLUG is the model ID with / replaced by _.
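One way to derive the slug, shown as a small sketch. The model ID below is a placeholder, not a real model:

```shell
# Placeholder model ID; substitute the model under test.
MODEL="example-org/example-model"

# Replace every "/" with "_" so the slug is safe to use in file names.
MODEL_SLUG=$(printf '%s' "$MODEL" | tr '/' '_')
echo "$MODEL_SLUG"   # prints example-org_example-model
```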

3b. Run OpenCode via expect + attach (per iteration)

expect provides the PTY that opencode run requires. The --attach flag connects to the serve instance which already has the config and workdir.

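# Load the saved prompt and pass it through the environment so Tcl does not
# re-parse characters such as [ ] $ or " that appear in the prompt text.
export OC_PORT PROMPT="$(cat "$REPORT_DIR/prompt.md")"
/usr/bin/expect << 'EXPECT_EOF' > "$REPORT_DIR/${MODEL_SLUG}-run${RUN}-opencode.json" 2>&1
set timeout 600
log_user 1

spawn opencode run --attach http://localhost:$env(OC_PORT) --log-level DEBUG --print-logs --format json $env(PROMPT)
expect {
    timeout { puts "TIMEOUT"; exit 1 }
    eof { puts "EOF"; exit 0 }
}
EXPECT_EOF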

The --format json flag outputs structured JSON events:

{"type":"tool_use",...} — tool call with input/output/error
{"type":"text",...} — assistant text content
{"type":"step_start",...} / {"type":"step_finish",...} — generation boundaries

Each run creates a new session on the same serve instance. The timeout (600s = 10 min) should be enough for most PRDs — increase for complex tasks.
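For a quick tally of the event types listed above, a sketch like the following may help. It assumes each event sits on its own line and is serialized without whitespace after the colon, as in the samples shown; `count_events` is a hypothetical helper name:

```shell
# Hypothetical helper: count events of one type in a run's output file.
# $1 = event type (e.g. tool_use), $2 = path to a *-opencode.json file
count_events() {
  # grep -c prints the number of matching lines (0 when there are none)
  grep -c "\"type\":\"$1\"" "$2"
}
```

For example, `count_events tool_use "$REPORT_DIR/${MODEL_SLUG}-run${RUN}-opencode.json"` would give the number of tool call events in one run. Because the file also holds log lines from --print-logs, treat the count as a rough check and use the SQLite queries for exact figures.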

IMPORTANT: Clean workdir between iterations. Before each run, remove all generated files from the OpenCode workdir so that results from a previous iteration don't contaminate the next one (e.g., OpenCode's "must read file before overwriting" guard triggers on leftover files). The cleanest approach is to stop opencode serve, recreate the workdir from scratch (rm -rf "$OC_WORKDIR" && mkdir -p "$OC_WORKDIR" && cd "$OC_WORKDIR" && git init -q && cd -), copy the opencode.json config back, and restart opencode serve. This ensures each iteration starts with a pristine empty git repo.

# Between iterations: reset workdir
kill $OC_SERVE_PID 2>/dev/null; wait $OC_SERVE_PID 2>/dev/null
rm -rf "$OC_WORKDIR"
mkdir -p "$OC_WORKDIR"
cd "$OC_WORKDIR" && git init -q && cd -
# Re-copy opencode.json config (same as setup step)
cat > "$OC_WORKDIR/opencode.json" << EOF
{ ... same config as before ... }
EOF
cd "$OC_WORKDIR"
opencode serve --port $OC_PORT --print-logs --log-level DEBUG \
  >> "$REPORT_DIR/opencode-serve.log" 2>&1 &
OC_SERVE_PID=$!
cd -
until curl -sf http://127.0.0.1:${OC_PORT}/ >/dev/null 2>&1; do sleep 1; done

3c. Stop afm (after all iterations for this model)

kill $AFM_PID 2>/dev/null
wait $AFM_PID 2>/dev/null

4. Stop OpenCode Serve

kill $OC_SERVE_PID 2>/dev/null
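Steps 3a to 3c can be tied together with a nested loop over models and iterations. The sketch below shows only the control flow. The four functions are hypothetical stubs standing in for the commands given in each step, and the model IDs are placeholders:

```shell
# Hypothetical stubs: replace each body with the commands from the matching step.
start_afm()     { echo "start afm for $1"; }   # step 3a
reset_workdir() { echo "reset workdir"; }      # workdir reset described in 3b
run_opencode()  { echo "run $2 of $1"; }       # expect + attach from 3b
stop_afm()      { echo "stop afm for $1"; }    # step 3c

MODELS="example-org/model-a example-org/model-b"   # placeholder IDs
ITERATIONS=2

run_all() {
  for MODEL in $MODELS; do
    MODEL_SLUG=$(printf '%s' "$MODEL" | tr '/' '_')
    start_afm "$MODEL"
    RUN=1
    while [ "$RUN" -le "$ITERATIONS" ]; do
      reset_workdir                       # pristine git repo for every run
      run_opencode "$MODEL_SLUG" "$RUN"
      RUN=$((RUN + 1))
    done
    stop_afm "$MODEL"
  done
}
```

Calling `run_all` with the stubs prints the order of operations, which is a cheap way to check the loop before wiring in the real commands.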

5. Analyze Logs

For each run, analyze both log files to extract and classify errors.

From afm logs (`-afm.log`), look for:

| Pattern | Classification |
|---------|----------------|
| `SKIP false </tool_call> end tag` | afm handled correctly (model emitted premature end tag) |
| `EMIT param[N]: key→...` with wrong value | Check if model sent wrong value (model error) or afm mangled it (afm bug) |
| `RECV </tool_call>` with `raw=` body | Raw model output; compare against what OpenCode received |
| `extractToolCallsFallback` activated | Incremental parser failed, fallback used; note if result was correct |
| `SEND tool_call fallback: found 0 tool calls` | Critical: tool call body couldn't be parsed at all. Usually means the model emitted JSON instead of XML inside `<tool_call>` tags |
| `SEND tool_call name:` with JSON in name | afm extracted JSON payload as function name; model mixed formats |
| `coerceArgumentTypes` log entries | Type coercion activated; check if result matches schema |
| Malformed XML in raw body (e.g., `<function=X>` instead of `<parameter=X>`) | Model error: wrong XML tag |
| Duplicate `<parameter=key>` tags | Model error: model emitted same param twice |
| Missing `</function>` in body | Model error: incomplete XML generation |

From OpenCode output (`-opencode.json`), look for:

| Pattern | Classification |
|---------|----------------|
| `"tool":"invalid"` with mangled tool name | afm parsed function name wrong; cross-reference afm `SEND tool_call name:` log |
| `"invalid arguments"` with `undefined` values | Parameter was lost; cross-reference afm log to determine if afm dropped it or model never sent it |
| `"expected number, received string"` | Type coercion failed; afm bug if schema had `type: "integer"` |
| Tool name not in schema | Model hallucinated tool (model error) |
| `"command" undefined` for bash tool | Cross-reference afm raw body: if `<parameter=command>` present → afm bug; if `<function=command>` → model error |
| `SyntaxError` with `\\\"` in written files | Possible afm double-escaping of quotes in tool call arguments |
Cross-referencing (the key step):

For each OpenCode error:

1. Find the corresponding tool call in afm's log (match by timestamp proximity)
2. Read the raw= body from afm's RECV </tool_call> log
3. Compare what the model generated vs what afm emitted vs what OpenCode received
4. Classify:
   - afm schema→model bug: afm sent wrong/incomplete tool schema to the model
   - afm model→client bug: Model output was correct but afm mangled it (dropped param, wrong type, truncated body)
   - Model generation error: Model produced invalid XML, wrong tags, missing params, hallucinated tools
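For the common case of a parameter that OpenCode reports as missing, the comparison against the raw body can be sketched as a small shell function. `classify_missing_param` is a hypothetical helper, and it covers only this one error type; other errors still need the manual comparison described above:

```shell
# Hypothetical helper: classify a parameter that OpenCode reported as missing,
# given the raw= body copied from afm's log.
# Usage: classify_missing_param PARAM_NAME RAW_BODY
classify_missing_param() {
  param=$1
  raw=$2
  case $raw in
    # The model sent the parameter, so it was lost on the afm side.
    *"<parameter=$param>"*) echo "afm model→client bug" ;;
    # The model used the function tag where a parameter tag belongs.
    *"<function=$param>"*)  echo "model generation error: wrong XML tag" ;;
    # The parameter is absent from the raw output altogether.
    *)                      echo "model generation error: parameter never sent" ;;
  esac
}
```

For example, `classify_missing_param command "$RAW_BODY"` applies the bash-tool rule from the OpenCode output table, where `RAW_BODY` holds the text after `raw=` in the matching afm log line.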

6. Generate Report

Create $REPORT_DIR/report.md:

# OpenCode Tooling Test Report
- Date: TIMESTAMP
- Model(s): ...
- Prompt: (first 200 chars)
- afm flags: ...
- Iterations per model: N

## Summary
| Model | Runs | Tool Calls | Errors | afm Bugs | Model Errors |
|-------|------|------------|--------|----------|--------------|

## Errors by Category

### afm Translation Bugs (model→client)
| # | Model | Run | Tool | Parameter | What Happened | afm Raw Body |
|---|-------|-----|------|-----------|---------------|-------------|

### afm Translation Bugs (schema→model)
| # | Model | Run | Tool | What Happened |
|---|-------|-----|------|---------------|

### Model Generation Errors
| # | Model | Run | Tool | Error Type | Raw Output |
|---|-------|-----|------|------------|------------|

## Raw Logs
- afm: [link to log file]
- OpenCode: [link to json file]

7. Present Results

Show the user:

- Summary table (pass rate per model)
- Each error with classification and evidence
- Recommendation: which errors are actionable afm bugs vs model limitations
- Do NOT propose or implement fixes; report only
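The source does not define pass rate. One plausible reading, assumed here, is the share of tool calls that did not end in an error. A sketch with a hypothetical `pass_rate` helper, fed with the totals from the per-status count query:

```shell
# Hypothetical helper: percentage of tool calls that did not error.
# $1 = total tool calls, $2 = tool calls with status "error"
pass_rate() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f%%\n", (total - errors) * 100 / total
  }'
}

pass_rate 20 3   # prints 85.0%
```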

Error Classification Guide

Definitely afm Bug

- Parameter present in raw model output but missing in OpenCode's received arguments
- Type mismatch when schema has explicit type and afm didn't coerce
- Tool call body truncated (false end tag not caught)
- Function name mangled or lost

Definitely Model Error

- JSON inside XML tags (most common): The model emits <tool_call>{"name":"write","arguments":{...}}</tool_call> instead of the XML <function=write><parameter=...> format. afm's fallback logs "found 0 tool calls" and the content is silently lost. Qwen3-Coder-Next switches formats unpredictably, especially in longer conversations.
- <function=X> used instead of <parameter=X> (wrong XML tag)
- Tool name not in provided schema (hallucinated tool)
- Parameter never appears in raw model output
- Garbage characters in parameter values (e.g., trailing })
- Incomplete XML (missing </function> or </parameter>)
- <parameter=KEY> without a wrapping <function=NAME>: parameters emitted without function context

Ambiguous (needs investigation)

- Empty parameter value: could be the model sending an empty value or afm dropping content
- Duplicate parameters: the model may emit a parameter twice, or afm may deduplicate incorrectly
- Streaming assembly errors: compare raw chunks vs assembled result
- Escaped triple quotes (\\\"\\\"\\\") in written files: could be afm double-escaping or model pre-escaping

Common Mistakes

- Not checking raw afm body: Always cross-reference OpenCode errors against afm's raw= log. Without this, you can't classify.
- Blaming afm for model errors: Models frequently emit broken XML. Check the raw output first.
- Blaming the model for afm bugs: afm has had bugs with dropped empty params, false end tags, and type coercion failures. Don't assume the model is always wrong.
- Running without the -V flag: Without verbose logging, you can't see raw model output or per-parameter emissions. Always use -V.
