nexus-eval-harness
// Run or update the Nexus LLM eval harness for arbitrary provider/model matrices. Use when comparing models in the real vault-like tool environment, changing eval configs, adding eval scenarios, or debugging eval reports.
| name | nexus-eval-harness |
| description | Run or update the Nexus LLM eval harness for arbitrary provider/model matrices. Use when comparing models in the real vault-like tool environment, changing eval configs, adding eval scenarios, or debugging eval reports. |
Use this skill for full Nexus model-behavior evals, not simple provider availability checks.
Use the shared harness in tests/eval/eval.test.ts. Do not create one-off runners unless the harness itself is broken and you are actively fixing it.
The harness should run provider/model/scenario jobs in parallel. Avoid sequential loops for model comparisons.
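The parallel shape described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `EvalJob`, `EvalResult`, `buildJobs`, `runMatrix`, and the `runJob` callback are hypothetical names, and the real types in tests/eval/eval.test.ts may differ.

```typescript
// Hypothetical job and result shapes, for illustration only.
interface EvalJob {
  provider: string;
  model: string;
  scenario: string;
}

interface EvalResult extends EvalJob {
  passed: boolean;
  error?: string;
}

// Expand every target against every scenario into one flat job list.
function buildJobs(
  targets: Array<{ provider: string; model: string }>,
  scenarios: string[],
): EvalJob[] {
  const jobs: EvalJob[] = [];
  for (const target of targets) {
    for (const scenario of scenarios) {
      jobs.push({ provider: target.provider, model: target.model, scenario });
    }
  }
  return jobs;
}

// Start all jobs at once with Promise.all. Each job catches its own error,
// so one failing model does not reject the whole matrix.
function runMatrix(
  jobs: EvalJob[],
  runJob: (job: EvalJob) => Promise<boolean>,
): Promise<EvalResult[]> {
  return Promise.all(
    jobs.map(async (job): Promise<EvalResult> => {
      try {
        return { ...job, passed: await runJob(job) };
      } catch (err) {
        return { ...job, passed: false, error: String(err) };
      }
    }),
  );
}
```

Catching inside each job is a design choice of this sketch: a bare `Promise.all` rejects on the first failure, which would hide the results of the other models in a comparison run.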
For the native vault environment, run live mode against the two-tool surface:
RUN_EVAL=1 EVAL_MODE=live EVAL_TOOL_SET=meta EVAL_TARGETS='openrouter=deepseek/deepseek-v4-pro,openrouter=deepseek/deepseek-v4-flash' npx jest tests/eval/eval.test.ts --runInBand --no-coverage --verbose
Notes:
- --runInBand only keeps Jest in one worker; the harness runs the eval matrix with Promise.all.
- RUN_EVAL=1 is required so ordinary test runs do not hit live provider APIs.
- EVAL_TOOL_SET=meta restricts scenarios to the production getTools/useTools contract.
- EVAL_SCENARIO_NAMES narrows a run to specific scenario names.
- EVAL_TRACE_STREAM=1 writes per-scenario JSONL traces under test-artifacts/traces/ as chunks, tool calls, tool events, and assertions arrive.
- … .env; never print credentials.
- … test-artifacts/.

Preferred arbitrary target format:
EVAL_TARGETS='provider=model,provider=model'
Examples:
EVAL_TARGETS='openrouter=anthropic/claude-sonnet-4.6,openrouter=openai/gpt-5.4-mini'
EVAL_TARGETS='openai=gpt-5.4,openrouter=openai/gpt-5.4'
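To make the `provider=model` format concrete, here is one way such a string could be parsed. This is a sketch under assumptions: `parseEvalTargets` and `EvalTarget` are hypothetical names, and the harness's own parser may handle whitespace or malformed entries differently. It splits each entry on the first `=` only, so model IDs that contain `/` (or a later `=`) stay intact.

```typescript
// Hypothetical target shape, for illustration only.
interface EvalTarget {
  provider: string;
  model: string;
}

// Parse "provider=model,provider=model" into a list of targets.
function parseEvalTargets(raw: string): EvalTarget[] {
  const targets: EvalTarget[] = [];
  for (const entry of raw.split(",")) {
    const trimmed = entry.trim();
    if (trimmed === "") continue; // tolerate trailing commas
    const eq = trimmed.indexOf("=");
    // Reject entries with no provider or no model.
    if (eq <= 0 || eq === trimmed.length - 1) {
      throw new Error(`Invalid EVAL_TARGETS entry: "${trimmed}"`);
    }
    targets.push({
      provider: trimmed.slice(0, eq).trim(),
      model: trimmed.slice(eq + 1).trim(),
    });
  }
  return targets;
}
```

With this sketch, `parseEvalTargets("openai=gpt-5.4,openrouter=openai/gpt-5.4")` yields two targets that share a model family but route through different providers.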
Single-provider shorthand:
EVAL_PROVIDER=openrouter EVAL_MODELS='deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash'
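Assuming the shorthand is equivalent to repeating the provider for each model (the source calls it a shorthand but does not spell out the expansion), the relationship to `EVAL_TARGETS` can be sketched like this. `shorthandToTargets` is a hypothetical helper, not part of the harness.

```typescript
// Expand a single provider plus a comma-separated model list into the
// "provider=model,provider=model" form used by EVAL_TARGETS.
function shorthandToTargets(provider: string, models: string): string {
  return models
    .split(",")
    .map((model) => model.trim())
    .filter((model) => model !== "")
    .map((model) => `${provider}=${model}`)
    .join(",");
}
```

For the shorthand above, `shorthandToTargets("openrouter", "deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash")` produces the same target string as the live-mode command earlier in this document.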
Useful overrides:
EVAL_CONFIG=tests/eval/configs/live.yaml
RUN_EVAL=1
EVAL_MODE=live
EVAL_SCENARIOS='tests/eval/scenarios/search-variations.eval.yaml'
EVAL_SCENARIO_NAMES='simple-read,replace-content'
EVAL_TOOL_SET=meta
EVAL_TRACE_STREAM=1
EVAL_MAX_RETRIES=3
EVAL_RETRY_DELAY_MS=1000
EVAL_RETRY_BACKOFF_MULTIPLIER=2
EVAL_RETRY_MAX_DELAY_MS=30000
EVAL_TIMEOUT_MS=120000
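The four retry overrides suggest capped exponential backoff. The sketch below shows the delay schedule those values would produce under that assumption; the exact formula the harness uses is not stated here, and `RetryConfig` and `retryDelays` are hypothetical names (only `maxRetries` appears in the retry notes).

```typescript
// Hypothetical config shape mirroring the EVAL_* retry overrides.
interface RetryConfig {
  maxRetries: number;        // EVAL_MAX_RETRIES
  retryDelayMs: number;      // EVAL_RETRY_DELAY_MS
  backoffMultiplier: number; // EVAL_RETRY_BACKOFF_MULTIPLIER
  maxDelayMs: number;        // EVAL_RETRY_MAX_DELAY_MS
}

// Delay before each retry: base * multiplier^attempt, capped at maxDelayMs.
function retryDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < cfg.maxRetries; attempt++) {
    const uncapped = cfg.retryDelayMs * Math.pow(cfg.backoffMultiplier, attempt);
    delays.push(Math.min(uncapped, cfg.maxDelayMs));
  }
  return delays;
}

const listedOverrides: RetryConfig = {
  maxRetries: 3,
  retryDelayMs: 1000,
  backoffMultiplier: 2,
  maxDelayMs: 30000,
};

console.log(retryDelays(listedOverrides)); // [ 1000, 2000, 4000 ]
```

With the listed values the cap never applies; it would only matter from the sixth retry onward, where 1000 × 2^5 = 32000 ms exceeds the 30000 ms limit.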
Retry notes:
- … maxRetries, but auth/validation-only stream errors should fail fast.

Separate these failure classes:
- … workspaceId, sessionId, memory, or goal; wrong CLI flags; skipped getTools.

When the user asks how a model performs "in our environment," report the meta live-run numbers first.