| name | test-opencode-tooling |
| description | Use when testing tool call reliability between OpenCode and afm — captures streaming XML tool call errors, classifies them as afm translation bugs vs model generation errors, and produces a diagnostic report without fixing anything |
Test OpenCode Tooling
Automated loop that runs OpenCode tasks against afm, captures tool call errors from both sides, classifies each as an afm bug or model error, and generates a report. Does not fix anything.
When to Use
- After changing tool call parsing code (XML, streaming, type coercion)
- Onboarding a new model to verify tool call reliability
- Investigating user-reported tool call failures with OpenCode
- Comparing tool call error rates across models
First Questions to Ask
- Prompt/PRD: Ask the user to paste the prompt text or provide a file path. This is the task OpenCode will execute (e.g., a PRD, coding task, or test scenario that exercises tool calls).
- Model(s): Which model(s) to test? Show the available models with:
MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache ./Scripts/list-models.sh
- afm start parameters: Any extra flags beyond the defaults (e.g., --tool-call-parser afm_adaptive_xml, --enable-prefix-caching, --enable-grammar-constraints, --no-think)? Recommended: --tool-call-parser afm_adaptive_xml --enable-grammar-constraints. This combination gives the highest tool call success rate (100% on 35B-A3B vs 60% without grammar constraints on realistic workloads).
- Iterations: How many times to run the same prompt per model? Default: 1. More runs help distinguish flaky model errors from deterministic afm bugs.
- Working directory: Temp dir for OpenCode to work in. Default: create a fresh /tmp/opencode-test-TIMESTAMP per run.
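The answers to these questions can be captured as shell variables before the workflow starts. The following is a minimal sketch, assuming the variable names ITERATIONS and OC_WORKDIR and a hypothetical helper is_positive_int; none of these names come from OpenCode or afm.

```bash
# Hypothetical helper: succeeds only for a positive decimal integer.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

# Defaults taken from the list above; the model and prompt have no default.
ITERATIONS="${ITERATIONS:-1}"
OC_WORKDIR="${OC_WORKDIR:-/tmp/opencode-test-$(date +%Y%m%d_%H%M%S)}"

if ! is_positive_int "$ITERATIONS"; then
  echo "ITERATIONS must be a positive integer, got: $ITERATIONS" >&2
fi
```

Validating the iteration count up front avoids discovering a typo only after a model has been loaded.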
OpenCode CLI Gotchas
CRITICAL: opencode run hangs silently without a PTY. It prints one INFO line and then freezes, with no error and no further output. You must use one of these approaches:
- opencode serve + run --attach (recommended): Start a headless server, then attach opencode run to it via expect, which supplies the PTY
- expect wrapper: Provides the pseudo-TTY that opencode run requires
Other gotchas:
- The opencode.json model field must be a string, not an object: "model": "ollama/model-id", not "model": {"default": "..."}
- The npm provider format (@ai-sdk/openai-compatible) is required for a custom baseURL; the "api": "openai" format does NOT accept baseURL
- OpenCode config is loaded from both ~/.config/opencode/opencode.json (global) AND $WORKDIR/opencode.json (local); local overrides global
- The workdir should be a git repo (git init) for OpenCode to function properly
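The two config gotchas above can be caught with a rough text check before starting OpenCode. This is only a sketch: check_opencode_config is a hypothetical helper, and it uses grep on the raw file rather than a real JSON parser, so it assumes each key and the start of its value sit on the same line.

```bash
# Hypothetical pre-flight check for an opencode.json file.
# Returns non-zero if either gotcha from the list above is detected.
check_opencode_config() {
  local cfg="$1" rc=0
  # "model" must be a string, so an opening brace after the key is wrong.
  if grep -Eq '"model"[[:space:]]*:[[:space:]]*\{' "$cfg"; then
    echo "model must be a string, not an object" >&2
    rc=1
  fi
  # A custom baseURL only works with the npm provider format.
  if grep -q -F '"baseURL"' "$cfg" &&
     ! grep -q -F '"@ai-sdk/openai-compatible"' "$cfg"; then
    echo "baseURL requires the npm provider format" >&2
    rc=1
  fi
  return "$rc"
}
```

Because the local config overrides the global one, it may be worth running the check against both files.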
OpenCode Log & Error Data
Log Files (limited — no tool call errors)
OpenCode writes logs to ~/.local/share/opencode/log/ in UTC-timestamped files (e.g., 2026-03-09T172212.log). These logs do NOT contain tool call errors or tool input/output. They only log permission checks, bus events, and registry start/complete.
ls -t ~/.local/share/opencode/log/*.log | head -1
tail -f "$(ls -t ~/.local/share/opencode/log/*.log | head -1)"
Gotcha: Log filenames use UTC timestamps but ls -lt shows local time. A file named 2026-03-10T001322.log was created at 8:13 PM EDT on March 9 (00:13 UTC on March 10). Use lsof -p <PID> | grep log to find the current session's log file if it doesn't appear in directory listings yet (OpenCode buffers writes).
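To line a log file up with afm timestamps, the UTC stamp in the filename can be converted to local time. The sketch below assumes the filename pattern shown above (YYYY-MM-DDTHHMMSS.log); log_name_to_local is a hypothetical helper, and it branches on GNU versus BSD/macOS date because the two accept different options.

```bash
# Print the local-time equivalent of an OpenCode log filename (UTC stamp).
log_name_to_local() {
  local base stamp epoch
  base="$(basename "$1" .log)"                 # e.g. 2026-03-10T001322
  stamp="$(printf '%s\n' "$base" |
    sed -E 's/^([0-9]{4}-[0-9]{2}-[0-9]{2})T([0-9]{2})([0-9]{2})([0-9]{2})$/\1 \2:\3:\4/')"
  if date --version >/dev/null 2>&1; then
    # GNU date: parse the stamp as UTC, print it in the local zone
    date -d "$stamp UTC" '+%Y-%m-%d %H:%M:%S'
  else
    # BSD/macOS date: parse as UTC into epoch seconds, then format locally
    epoch="$(date -j -u -f '%Y-%m-%d %H:%M:%S' "$stamp" '+%s')" || return 1
    date -r "$epoch" '+%Y-%m-%d %H:%M:%S'
  fi
}
```

For the example above, running log_name_to_local 2026-03-10T001322.log in a US Eastern time zone prints 2026-03-09 20:13:22.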
When monitoring both afm and OpenCode simultaneously:
- afm log: /tmp/afm-opencode-test.log (or wherever you tee'd it)
- OpenCode log: ~/.local/share/opencode/log/<latest>.log
ALWAYS start OpenCode with --log-level "DEBUG" --print-logs — both opencode serve and opencode run commands must include these flags.
SQLite Database (structured tool call data with errors)
Tool call inputs, outputs, and errors are stored in OpenCode's SQLite database — not in the log files. This is the only place to get the full JSON of failed tool calls.
Database path: ~/.local/share/opencode/opencode.db
Schema: Tool calls are in the part table as JSON in the data column, keyed by session_id.
# List the five most recent sessions
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT id, title, datetime(time_created/1000, 'unixepoch', 'localtime') FROM session ORDER BY time_created DESC LIMIT 5;"
# Failed tool calls for a specific session
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT data FROM part WHERE session_id = '<SESSION_ID>' AND data LIKE '%\"status\":\"error\"%';"
# Failed tool calls for the most recent session
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT data FROM part WHERE session_id = (SELECT id FROM session ORDER BY time_created DESC LIMIT 1) AND data LIKE '%\"status\":\"error\"%';"
# The ten most recent failed edit calls across all sessions
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT data FROM part WHERE data LIKE '%\"tool\":\"edit\"%' AND data LIKE '%\"status\":\"error\"%' ORDER BY time_created DESC LIMIT 10;"
Error JSON format:
{
"type": "tool",
"callID": "call_8B05B790A94F4A0EBF2850C0",
"tool": "edit",
"state": {
"status": "error",
"input": {
"filePath": "/path/to/file.py",
"oldString": "text the model expected to find",
"newString": "replacement text"
},
"error": "Error: Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.",
"time": {
"start": 1773102597849,
"end": 1773102597850
}
}
}
Successful tool call JSON format:
{
"type": "tool",
"callID": "call_34CB225B0D184310BD64A839",
"tool": "edit",
"state": {
"status": "completed",
"input": {
"filePath": "/path/to/file.py",
"oldString": "...",
"newString": "..."
},
"output": "Edit applied successfully.",
"title": "path/to/file.py",
"metadata": {
"diagnostics": {},
"diff": "Index: /path/to/file.py\n===...",
"filediff": { "file": "...", "before": "...", "after": "..." }
}
}
}
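Given the two JSON shapes above, the failing tool and its error text can be selected with SQLite's json_extract instead of LIKE matching on the raw string. The sketch below only builds the query text; error_query is a hypothetical helper, and the field paths ($.tool, $.state.status, $.state.error) are taken from the examples above.

```bash
# Build a query listing tool name and error text for failed tool calls.
# With no argument it targets the most recent session.
error_query() {
  local sid="${1:-}" where
  if [ -n "$sid" ]; then
    case "$sid" in
      *"'"*) echo "session id must not contain a single quote" >&2; return 1 ;;
    esac
    where="session_id = '$sid'"
  else
    where="session_id = (SELECT id FROM session ORDER BY time_created DESC LIMIT 1)"
  fi
  printf '%s\n' "SELECT json_extract(data, '\$.tool'), json_extract(data, '\$.state.error') FROM part WHERE $where AND json_extract(data, '\$.state.status') = 'error';"
}

# Usage:
#   sqlite3 ~/.local/share/opencode/opencode.db "$(error_query)"
```

Keeping the query in one function means the session filter is written once and the single-quote check guards against a malformed session ID breaking the SQL.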
Useful queries for test analysis:
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT json_extract(data, '$.tool') as tool,
json_extract(data, '$.state.status') as status,
COUNT(*) as cnt
FROM part
WHERE session_id = '<SESSION_ID>' AND json_extract(data, '$.type') = 'tool'
GROUP BY tool, status;"
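The per-tool counts from the query above can be reduced to a single failure rate. This sketch assumes sqlite3's default list output (tool|status|cnt, one row per line); summarize_counts is a hypothetical helper that reads those rows on stdin.

```bash
# Read "tool|status|cnt" rows on stdin and print the overall failure rate.
summarize_counts() {
  awk -F'|' '
    { total += $3; if ($2 == "error") errors += $3 }
    END {
      if (total == 0) { print "no tool calls"; exit }
      printf "%d/%d tool calls failed (%.1f%%)\n", errors, total, 100 * errors / total
    }'
}

# Usage: pipe the GROUP BY query into it, e.g.
#   sqlite3 ~/.local/share/opencode/opencode.db "<query above>" | summarize_counts
```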
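# Pretty-print every tool call; --json-lines (Python 3.8+) formats each row separately, since plain json.tool rejects more than one JSON document
sqlite3 ~/.local/share/opencode/opencode.db \
"SELECT data FROM part WHERE session_id = '<SESSION_ID>' AND json_extract(data, '$.type') = 'tool';" | python3 -m json.tool --json-lines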
Execution Workflow
1. Setup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
TEST_PORT=9877
OC_PORT=4096
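# Absolute path: later steps cd into the OpenCode workdir before redirecting output into REPORT_DIR
REPORT_DIR="$PWD/test-reports/opencode-tooling-${TIMESTAMP}"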
mkdir -p "$REPORT_DIR"
Save the user's prompt to a file:
cat > "$REPORT_DIR/prompt.md" << 'PROMPT_EOF'
<paste user's prompt here>
PROMPT_EOF
2. Start OpenCode Serve
Create a workdir with git init and config pointing at afm:
OC_WORKDIR="/tmp/opencode-serve-${TIMESTAMP}"
mkdir -p "$OC_WORKDIR"
git init -q "$OC_WORKDIR"
Write the OpenCode config. It must use the npm provider with options.baseURL, and MODEL must already be set because the heredoc expands it:
cat > "$OC_WORKDIR/opencode.json" << EOF
{
"\$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "afm-test",
"options": {
"baseURL": "http://localhost:${TEST_PORT}/v1"
},
"models": {
"${MODEL}": {
"name": "${MODEL}"
}
}
}
}
}
EOF
Start the headless server:
cd "$OC_WORKDIR"
opencode serve --port $OC_PORT --print-logs --log-level DEBUG \
> "$REPORT_DIR/opencode-serve.log" 2>&1 &
OC_SERVE_PID=$!
cd -
until curl -sf http://127.0.0.1:${OC_PORT}/ >/dev/null 2>&1; do sleep 1; done
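The readiness loop above waits forever if the server never comes up. A bounded variant might look like the sketch below; wait_for_url is a hypothetical helper, and the timeout is approximate because it counts one-second sleeps rather than wall-clock time.

```bash
# Poll a URL until it answers or roughly <timeout> seconds have passed.
# Optional third argument: PID of the server; give up early if it has exited.
wait_for_url() {
  local url="$1" timeout="${2:-60}" pid="${3:-}" waited=0
  until curl -sf --max-time 2 "$url" >/dev/null 2>&1; do
    if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then
      echo "process $pid exited before $url became ready" >&2
      return 1
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage (names from the setup step):
#   wait_for_url "http://127.0.0.1:${OC_PORT}/" 60 "$OC_SERVE_PID" || exit 1
```

Passing the PID lets the loop fail fast when the server crashes on startup instead of waiting out the full timeout.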
3. For Each Model
3a. Start afm with verbose logging
AFM_DEBUG=1 MACAFM_MLX_MODEL_CACHE=/Volumes/edata/models/vesta-test-cache \
.build/release/afm mlx -m "$MODEL" --port $TEST_PORT -V \
$EXTRA_AFM_FLAGS \
> "$REPORT_DIR/${MODEL_SLUG}-afm.log" 2>&1 &
AFM_PID=$!
until curl -sf http://127.0.0.1:${TEST_PORT}/v1/models >/dev/null 2>&1; do sleep 1; done
Where MODEL_SLUG is the model ID with / replaced by _.
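One way to derive the slug is shown below. model_slug is a hypothetical helper name, and the model ID in the usage comment is a placeholder rather than a real model.

```bash
# Replace every "/" in a model ID with "_" so it is safe in a filename.
model_slug() {
  printf '%s\n' "$1" | tr '/' '_'
}

MODEL_SLUG="$(model_slug "${MODEL:-}")"

# Example with a placeholder ID:
#   model_slug "example-org/example-model"   prints example-org_example-model
```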
3b. Run OpenCode via expect + attach (per iteration)
expect provides the PTY that opencode run requires. The --attach flag connects to the serve instance which already has the config and workdir.
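# Load the prompt saved during setup and pass values through the environment
PROMPT="$(cat "$REPORT_DIR/prompt.md")"
export PROMPT OC_PORT
/usr/bin/expect << 'EXPECT_EOF' > "$REPORT_DIR/${MODEL_SLUG}-run${RUN}-opencode.json" 2>&1
set timeout 600
log_user 1
# $env(...) keeps quotes, brackets, and $ in the prompt from being re-parsed by Tcl
spawn opencode run --attach http://localhost:$env(OC_PORT) --log-level DEBUG --print-logs --format json $env(PROMPT)
expect {
timeout { puts "TIMEOUT"; exit 1 }
eof { puts "EOF"; exit 0 }
}
EXPECT_EOF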
The --format json flag outputs structured JSON events:
- {"type":"tool_use",...}: tool call with input/output/error
- {"type":"text",...}: assistant text content
- {"type":"step_start",...} / {"type":"step_finish",...}: generation boundaries
Each run creates a new session on the same serve instance. The timeout (600s = 10 min) should be enough for most PRDs — increase for complex tasks.
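Because the run is captured with 2>&1 under a PTY, the -opencode.json file also contains the spawn line, debug log lines, the final EOF or TIMEOUT marker, and carriage returns. The sketch below filters out just the events and counts them by type. It assumes each event is a single line beginning with {"type":, as in the examples above; json_events and count_event_types are hypothetical helper names.

```bash
# Print only the JSON event lines from a captured opencode run.
json_events() {
  tr -d '\r' < "$1" | grep '^{"type":'
}

# Print "<event type> <count>" for each event type, sorted by type name.
count_event_types() {
  json_events "$1" |
    sed -E 's/^\{"type":"([^"]*)".*/\1/' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   count_event_types "$REPORT_DIR/${MODEL_SLUG}-run${RUN}-opencode.json"
```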
IMPORTANT: Clean the workdir between iterations. Files left over from a previous iteration contaminate the next one (e.g., they trigger OpenCode's "must read file before overwriting" guard). The cleanest approach is to stop opencode serve, recreate the workdir from scratch, write the opencode.json config again, and restart opencode serve, so that each iteration starts with a pristine empty git repo:
kill $OC_SERVE_PID 2>/dev/null; wait $OC_SERVE_PID 2>/dev/null
rm -rf "$OC_WORKDIR"
mkdir -p "$OC_WORKDIR"
git init -q "$OC_WORKDIR"
cat > "$OC_WORKDIR/opencode.json" << EOF
{ ... same config as before ... }
EOF
cd "$OC_WORKDIR"
opencode serve --port $OC_PORT --print-logs --log-level DEBUG \
>> "$REPORT_DIR/opencode-serve.log" 2>&1 &
OC_SERVE_PID=$!
cd -
until curl -sf http://127.0.0.1:${OC_PORT}/ >/dev/null 2>&1; do sleep 1; done
3c. Stop afm (after all iterations for this model)
kill $AFM_PID 2>/dev/null
wait $AFM_PID 2>/dev/null
4. Stop OpenCode Serve
kill $OC_SERVE_PID 2>/dev/null
5. Analyze Logs
For each run, analyze both log files to extract and classify errors.
From afm logs (-afm.log), look for:
| Pattern | Classification |
|---|---|
| SKIP false </tool_call> end tag | afm handled correctly (model emitted premature end tag) |
| EMIT param[N]: key→... with wrong value | Check if model sent wrong value (model error) or afm mangled it (afm bug) |
| RECV </tool_call> with raw= body | Raw model output: compare against what OpenCode received |
| extractToolCallsFallback activated | Incremental parser failed, fallback used; note whether the result was correct |
| SEND tool_call fallback: found 0 tool calls | Critical: tool call body couldn't be parsed at all. Usually means model emitted JSON instead of XML inside <tool_call> tags |
| SEND tool_call name: with JSON in name | afm extracted JSON payload as function name; model mixed formats |
| coerceArgumentTypes log entries | Type coercion activated; check if result matches schema |
| Malformed XML in raw body (e.g., <function=X> instead of <parameter=X>) | Model error: wrong XML tag |
| Duplicate <parameter=key> tags | Model error: model emitted same param twice |
| Missing </function> in body | Model error: incomplete XML generation |
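A first pass over an afm log can simply count how often some of these markers occur. The sketch below treats four of the patterns from the table as fixed substrings, which is an assumption about the exact log wording; scan_afm_log is a hypothetical helper.

```bash
# Print "<count><TAB><pattern>" for a few markers from the table above.
scan_afm_log() {
  local log="$1" pattern
  for pattern in \
    'SKIP false </tool_call> end tag' \
    'extractToolCallsFallback' \
    'found 0 tool calls' \
    'coerceArgumentTypes'
  do
    # grep -c prints 0 when nothing matches, so every pattern gets a row
    printf '%s\t%s\n' "$(grep -c -F -- "$pattern" "$log")" "$pattern"
  done
}

# Usage:
#   scan_afm_log "$REPORT_DIR/${MODEL_SLUG}-afm.log"
```

A non-zero count for "found 0 tool calls" is the row to look at first, since the table marks it as critical.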
From OpenCode output (-opencode.json), look for:
| Pattern | Classification |
|---|---|
| "tool":"invalid" with mangled tool name | afm parsed function name wrong; cross-reference afm's SEND tool_call name: log |
| "invalid arguments" with undefined values | Parameter was lost; cross-reference afm log to determine if afm dropped it or model never sent it |
| "expected number, received string" | Type coercion failed; afm bug if schema had type: "integer" |
| Tool name not in schema | Model hallucinated tool: model error |
| "command" undefined for bash tool | Cross-reference afm raw body: if <parameter=command> is present, afm bug; if <function=command> appears instead, model error |
| SyntaxError with \\\" in written files | Possible afm double-escaping of quotes in tool call arguments |
Cross-referencing (the key step):
For each OpenCode error:
- Find the corresponding tool call in afm's log (match by timestamp proximity)
- Read the raw= body from afm's RECV </tool_call> log
- Compare what the model generated vs what afm emitted vs what OpenCode received
- Classify:
- afm schema→model bug: afm sent wrong/incomplete tool schema to the model
- afm model→client bug: Model output was correct but afm mangled it (dropped param, wrong type, truncated body)
- Model generation error: Model produced invalid XML, wrong tags, missing params, hallucinated tools
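For the common case of a parameter that OpenCode reports as undefined, the decision can be sketched as a check against the raw body. classify_missing_param is a hypothetical helper, and it assumes the raw body has already been copied out of afm's log by hand.

```bash
# Classify a parameter that OpenCode reported as missing, given the raw
# model output from afm's log and the parameter name.
classify_missing_param() {
  local raw_body="$1" param="$2"
  case "$raw_body" in
    *"<parameter=${param}>"*)
      echo "afm bug (model->client): parameter was in the raw body" ;;
    *"<function=${param}>"*)
      echo "model error: <function=${param}> used instead of <parameter=${param}>" ;;
    *)
      echo "model error: parameter never appears in the raw body" ;;
  esac
}

# Usage:
#   classify_missing_param "$RAW_BODY" command
```

This covers only the missing-parameter branch. Type mismatches, truncated bodies, and schema problems still need the manual comparison described above.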
6. Generate Report
Create $REPORT_DIR/report.md:
# OpenCode Tooling Test Report
- Date: TIMESTAMP
- Model(s): ...
- Prompt: (first 200 chars)
- afm flags: ...
- Iterations per model: N
## Summary
| Model | Runs | Tool Calls | Errors | afm Bugs | Model Errors |
|-------|------|------------|--------|----------|--------------|
## Errors by Category
### afm Translation Bugs (model→client)
| # | Model | Run | Tool | Parameter | What Happened | afm Raw Body |
|---|-------|-----|------|-----------|---------------|-------------|
### afm Translation Bugs (schema→model)
| # | Model | Run | Tool | What Happened |
|---|-------|-----|------|---------------|
### Model Generation Errors
| # | Model | Run | Tool | Error Type | Raw Output |
|---|-------|-----|------|------------|------------|
## Raw Logs
- afm: [link to log file]
- OpenCode: [link to json file]
7. Present Results
Show the user:
- Summary table (pass rate per model)
- Each error with classification and evidence
- Recommendation: which errors are actionable afm bugs vs model limitations
- Do NOT propose or implement fixes — report only
Error Classification Guide
Definitely afm Bug
- Parameter present in raw model output but missing in OpenCode's received arguments
- Type mismatch when schema has explicit type and afm didn't coerce
- Tool call body truncated (false end tag not caught)
- Function name mangled or lost
Definitely Model Error
- JSON inside XML tags (most common): Model emits <tool_call>{"name":"write","arguments":{...}}</tool_call> instead of the XML <function=write><parameter=...> format. afm's fallback logs found 0 tool calls, and the content is silently lost. Qwen3-Coder-Next switches formats unpredictably, especially in longer conversations.
- <function=X> used instead of <parameter=X> (wrong XML tag)
- Tool name not in provided schema (hallucinated tool)
- Parameter never appears in raw model output
- Garbage characters in parameter values (e.g., trailing })
- Incomplete XML (missing </function> or </parameter>)
- <parameter=KEY> without a wrapping <function=NAME>: parameters emitted without function context
Ambiguous (needs investigation)
- Empty parameter value: could be the model sending an empty value or afm dropping content
- Duplicate parameters: the model may emit a parameter twice, or afm may deduplicate incorrectly
- Streaming assembly errors: compare raw chunks vs assembled result
- Escaped triple quotes (\\\"\\\"\\\") in written files: could be afm double-escaping or model pre-escaping
Common Mistakes
- Not checking raw afm body: Always cross-reference OpenCode errors against afm's raw= log. Without this, you can't classify.
- Blaming afm for model errors: Models frequently emit broken XML. Check the raw output first.
- Blaming the model for afm bugs: afm has had bugs dropping empty params, false end tags, type coercion failures. Don't assume the model is always wrong.
- Running without the -V flag: Without verbose logging, you can't see raw model output or per-parameter emissions. Always use -V.