ln-840-benchmark-compare
Use when benchmarking hex-line MCP against Claude built-in tools with scenario manifests, activation checks, and diff-based correctness.
| Field | Value |
|---|---|
| name | ln-840-benchmark-compare |
| description | Use when benchmarking hex-line MCP against Claude built-in tools with scenario manifests, activation checks, and diff-based correctness. |
| license | MIT |
| model | claude-haiku-4-5 |
Paths: File paths (references/) are relative to this skill directory.
Type: L3 Worker Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.
| Direction | Content |
|---|---|
| Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json |
| Output | Comparison report in plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md plus machine-readable benchmark summary artifact |
- claude --version succeeds
- git succeeds
- mcp/hex-line-mcp/server.mjs exists
- mcp/hex-line-mcp/hook.mjs exists
- plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md exists
- plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json exists
- plugins/optimization-suite/skills/ln-840-benchmark-compare/references/mcp-bench.json exists

bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md] \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json]
Optional extra session profile:
EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh
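As a rough illustration of how the `EXTRA_SESSION_*` variables above could be turned into a session profile, here is a minimal sketch. The helper name `readExtraProfile`, the shape of the returned object, and the fallback of the label to the ID are assumptions made for this example; the actual `run-benchmark.sh` may handle these variables differently.

```javascript
// Hypothetical helper: maps EXTRA_SESSION_* variables onto a profile object.
// Not taken from run-benchmark.sh; field names are illustrative only.
function readExtraProfile(env) {
  if (!env.EXTRA_SESSION_ID) return null; // no extra session requested
  return {
    id: env.EXTRA_SESSION_ID,
    // Assumption: fall back to the ID when no label is given.
    label: env.EXTRA_SESSION_LABEL || env.EXTRA_SESSION_ID,
    mcpConfig: env.EXTRA_MCP_CONFIG || null,
    // EXTRA_SETTINGS is a JSON string; malformed JSON throws here,
    // before any scenario would run.
    settings: env.EXTRA_SETTINGS ? JSON.parse(env.EXTRA_SETTINGS) : {},
  };
}

const profile = readExtraProfile({
  EXTRA_SESSION_ID: "other-mcp",
  EXTRA_SESSION_LABEL: "Other MCP",
  EXTRA_MCP_CONFIG: "/abs/path/to/other-mcp.json",
  EXTRA_SETTINGS: '{"disableAllHooks":true}',
});
console.log(profile);
```

Parsing `EXTRA_SETTINGS` eagerly means a quoting mistake in the shell surfaces as an immediate error rather than as a silently misconfigured session.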
MANDATORY READ: Load references/monitor_integration_pattern.md
Stream benchmark progress:
Monitor(command="bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")
Fallback: Bash(run_in_background=true).
The runner handles:
- goals.md

Current scope:

- hex-line
- EXTRA_SESSION_* environment variables

External baseline note:

- goals.md and expectations.json
- stream-json log shape and diff artifacts

Use one canonical pair owned by this skill:

- plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md
- plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json

Rules:

- ... hex-line.
- ... goals.md must have a matching entry in expectations.json.
- expectations.json is the source of truth for correctness.

Supported expectation fields per scenario:
| Field | Meaning |
|---|---|
| id | Scenario identifier used in result filenames |
| expectedChangedFiles | Files that must change |
| forbiddenChangedFiles | Files that must not change |
| requiredDiffPatterns | Regex patterns required in the saved diff |
| forbiddenDiffPatterns | Regex patterns that must not appear in the diff |
| requiredResultPatterns | Regex patterns required in the final assistant result text |
| requiredCommands | Regex patterns that must match at least one Bash command |
| exactChangedFiles | If true, no extra changed files are allowed |
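
To make the field semantics in the table concrete, the following sketch checks one expectation entry against observed session output. It is not the logic of `parse-results.mjs`: the function `evaluateScenario`, the `observed` object shape (`changedFiles`, `diff`, `resultText`, `bashCommands`), and the sample scenario are all assumptions for illustration. Patterns are compiled without flags, which is also an assumption.

```javascript
// Hypothetical evaluator for a single expectations.json entry.
function evaluateScenario(expectation, observed) {
  const failures = [];
  const changed = new Set(observed.changedFiles);
  const matchesAny = (pattern, texts) =>
    texts.some((text) => new RegExp(pattern).test(text));

  for (const file of expectation.expectedChangedFiles ?? []) {
    if (!changed.has(file)) failures.push(`expected change missing: ${file}`);
  }
  for (const file of expectation.forbiddenChangedFiles ?? []) {
    if (changed.has(file)) failures.push(`forbidden file changed: ${file}`);
  }
  if (expectation.exactChangedFiles) {
    // No file outside expectedChangedFiles may appear in the diff.
    const allowed = new Set(expectation.expectedChangedFiles ?? []);
    for (const file of changed) {
      if (!allowed.has(file)) failures.push(`unexpected change: ${file}`);
    }
  }
  for (const p of expectation.requiredDiffPatterns ?? []) {
    if (!matchesAny(p, [observed.diff])) failures.push(`diff lacks /${p}/`);
  }
  for (const p of expectation.forbiddenDiffPatterns ?? []) {
    if (matchesAny(p, [observed.diff])) failures.push(`diff contains /${p}/`);
  }
  for (const p of expectation.requiredResultPatterns ?? []) {
    if (!matchesAny(p, [observed.resultText])) failures.push(`result lacks /${p}/`);
  }
  for (const p of expectation.requiredCommands ?? []) {
    // At least one Bash command must match each required pattern.
    if (!matchesAny(p, observed.bashCommands)) failures.push(`no command matches /${p}/`);
  }
  return { id: expectation.id, pass: failures.length === 0, failures };
}

// Illustrative scenario; the id, files, and patterns are made up.
const expectation = {
  id: "rename-helper",
  expectedChangedFiles: ["src/util.js"],
  forbiddenChangedFiles: ["package.json"],
  requiredDiffPatterns: ["\\+.*formatDate"],
  requiredCommands: ["^npm test"],
  exactChangedFiles: true,
};
const observed = {
  changedFiles: ["src/util.js"],
  diff: "-function fmt(d) {\n+function formatDate(d) {",
  resultText: "Renamed fmt to formatDate.",
  bashCommands: ["npm test"],
};
const verdict = evaluateScenario(expectation, observed);
console.log(verdict);
```

Collecting every failure instead of returning at the first one keeps the report useful when a scenario breaks several parts of its contract at once.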
The runner must pass:
- node --check server.mjs
- node --check hook.mjs
- node --check extract-scenarios.mjs
- node --check parse-results.mjs
- ... hook.mjs

If preflight fails, the benchmark is invalid and must stop before scenarios run.
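
The stop-on-first-failure behaviour of preflight can be sketched as below. This is an illustration, not the runner's code: `runPreflight` and the injected `run` function are hypothetical, and the fake runner simulates a syntax error only so the example runs without the real files.

```javascript
// Hypothetical sketch: `run` executes a command and returns its exit code.
// Injecting it keeps the control flow visible without touching the filesystem.
function runPreflight(checks, run) {
  for (const command of checks) {
    const exitCode = run(command);
    if (exitCode !== 0) {
      // The benchmark is invalid; no scenario should run after this point.
      return { valid: false, failedCheck: command };
    }
  }
  return { valid: true, failedCheck: null };
}

const checks = [
  "node --check server.mjs",
  "node --check hook.mjs",
  "node --check extract-scenarios.mjs",
  "node --check parse-results.mjs",
];

// Fake runner for illustration: pretend hook.mjs fails the syntax check.
const executed = [];
const fakeRun = (command) => {
  executed.push(command);
  return command.includes("hook.mjs") ? 1 : 0;
};

const preflight = runPreflight(checks, fakeRun);
console.log(preflight, executed.length);
```

In a real runner, `run` might wrap `child_process.spawnSync` and return its `status` field; that wiring is left out here.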
For each ## scenario in goals.md:
- .jsonl logs and .diff.txt artifacts

Built-in session:

Hex-line session:

- server.mjs
- outputStyle: "hex-line"
- PreToolUse hook through hook.mjs

parse-results.mjs evaluates each scenario for both sessions.
Scenario pass requires:
The final report has these sections:
Interpretation rules:
- invalid run means setup/adoption failure, not product performance
- FAIL means correctness contract was not met
- ... hex-line, not external noise

plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md must answer:

- ... hex-line activate cleanly without discovery drift?

Do not treat raw time/cost as sufficient without scenario correctness.
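
One way to apply the interpretation rules above is to classify each session first and to compare cost only on scenarios that both sessions solved correctly. The sketch below assumes a parsed session shape (`activationValid`, per-scenario `pass` and `costUsd`) that is not defined in this document; the function names and sample numbers are likewise illustrative.

```javascript
// Hypothetical classification following the interpretation rules.
function classifySession(session) {
  if (!session.activationValid) return "invalid run"; // setup/adoption failure
  return session.scenarios.every((s) => s.pass) ? "PASS" : "FAIL";
}

// Compare cost only on scenarios that passed in both sessions,
// so raw cost never stands in for correctness.
function comparableCost(builtIn, hexLine) {
  const passedInHexLine = new Set(
    hexLine.scenarios.filter((s) => s.pass).map((s) => s.id)
  );
  const shared = builtIn.scenarios
    .filter((s) => s.pass && passedInHexLine.has(s.id))
    .map((s) => s.id);
  const total = (session) =>
    session.scenarios
      .filter((s) => shared.includes(s.id))
      .reduce((sum, s) => sum + s.costUsd, 0);
  return { shared, builtInCost: total(builtIn), hexLineCost: total(hexLine) };
}

// Made-up sample data for illustration.
const builtIn = {
  activationValid: true,
  scenarios: [
    { id: "a", pass: true, costUsd: 0.1 },
    { id: "b", pass: true, costUsd: 0.2 },
  ],
};
const hexLine = {
  activationValid: true,
  scenarios: [
    { id: "a", pass: true, costUsd: 0.05 },
    { id: "b", pass: false, costUsd: 0.01 },
  ],
};

console.log(classifySession(builtIn), classifySession(hexLine));
console.log(comparableCost(builtIn, hexLine));
```

In this sample the hex-line session is cheaper on scenario `b`, but that figure is excluded because the scenario failed its correctness contract.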
- ... hex-line against external alternatives, they must reuse the same goals.md, expectations.json, and diff-based evaluation rules.
- ... hex-line must say so explicitly.

MANDATORY READ: Load references/benchmark_worker_runtime_contract.md, references/coordinator_summary_contract.md
Runtime CLI:
node references/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node references/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default
Required state fields:
- report_ready
- summary_recorded
- final_result
- self_check_passed

Domain checkpoints:

- PHASE_0_CONFIG
- PHASE_1_PREFLIGHT
- PHASE_2_LOAD_SUITE
- PHASE_3_RUN_SCENARIOS
- PHASE_4_PARSE_RESULTS
- PHASE_5_WRITE_REPORT
- PHASE_6_WRITE_SUMMARY
- PHASE_7_SELF_CHECK

Guard rules:

- ... benchmark-worker summary is recorded
- ... runId and exact summaryArtifactPath.
- ... benchmark-worker family.

MANDATORY READ: Load references/coordinator_summary_contract.md
Emit a benchmark-worker summary envelope after the comparison report is written.
Managed mode:
- summaryArtifactPath

Standalone mode:

- .hex-skills/runtime-artifacts/runs/{run_id}/benchmark-worker/ln-840-benchmark-compare--{identifier}.json

Recommended payload:
- scenarios_total
- scenarios_passed
- scenarios_failed
- activation_valid
- validity_verdict
- report_path
- warnings
- metrics

| Pitfall | Solution |
|---|---|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into ToolSearch before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
| External comparison uses edited scenarios or relaxed expectations | Treat the comparison as invalid |
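
The recommended payload fields and the standalone artifact path listed earlier can be assembled roughly as follows. This is a sketch under assumptions: the builder names, the `"valid"`/`"invalid"` vocabulary for `validity_verdict`, the empty defaults for `warnings` and `metrics`, and the sample run ID and report date are not specified in this document.

```javascript
// Hypothetical builder for the recommended summary payload.
function buildSummaryPayload({ scenarios, activationValid, reportPath, warnings = [], metrics = {} }) {
  const passed = scenarios.filter((s) => s.pass).length;
  return {
    scenarios_total: scenarios.length,
    scenarios_passed: passed,
    scenarios_failed: scenarios.length - passed,
    activation_valid: activationValid,
    // Assumption: the verdict mirrors activation validity.
    validity_verdict: activationValid ? "valid" : "invalid",
    report_path: reportPath,
    warnings,
    metrics,
  };
}

// Standalone-mode artifact path, following the template given in this document.
function standaloneSummaryPath(runId, identifier) {
  return `.hex-skills/runtime-artifacts/runs/${runId}/benchmark-worker/ln-840-benchmark-compare--${identifier}.json`;
}

// Illustrative inputs; the run ID and date are made up.
const payload = buildSummaryPayload({
  scenarios: [
    { id: "a", pass: true },
    { id: "b", pass: false },
  ],
  activationValid: true,
  reportPath:
    "plugins/optimization-suite/skills/ln-840-benchmark-compare/results/2026-03-24-comparison.md",
});
const artifactPath = standaloneSummaryPath("run-001", "suite-default");
console.log(JSON.stringify(payload, null, 2));
console.log(artifactPath);
```

The serialized payload is the kind of JSON string that would be passed to the `record-summary` command's `--payload` argument.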
- goals.md defines the canonical balanced suite
- expectations.json fully describes scenario correctness
- ... plugins/optimization-suite/skills/ln-840-benchmark-compare/results/
- benchmark-worker summary artifact is written to the managed or standalone runtime path

Version: 2.0.0 Last Updated: 2026-03-24