Name: Nexus Eval Harness
Author: ProfSynapse

Description: Run or update the Nexus LLM eval harness for arbitrary provider/model matrices. Use when comparing models in the real vault-like tool environment, changing eval configs, adding eval scenarios, or debugging eval reports.

stars:130

forks:17

updated: April 29, 2026 at 18:07

name	nexus-eval-harness
description	Run or update the Nexus LLM eval harness for arbitrary provider/model matrices. Use when comparing models in the real vault-like tool environment, changing eval configs, adding eval scenarios, or debugging eval reports.

Nexus Eval Harness

Use this skill for full Nexus model-behavior evals, not simple provider availability checks.

Core Rule

Use the shared harness in tests/eval/eval.test.ts. Do not create one-off runners unless the harness itself is broken and you are actively fixing it.

The harness should run provider/model/scenario jobs in parallel. Avoid sequential loops for model comparisons.
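
As an illustration of the parallel rule above, here is a minimal sketch of expanding targets and scenarios into a flat job list and awaiting all jobs together. The names `EvalJob`, `EvalResult`, `buildJobs`, and `runMatrix` are hypothetical and are not taken from tests/eval/eval.test.ts; the real harness may structure this differently.

```typescript
// Hypothetical job and result shapes, for illustration only.
interface EvalJob {
  provider: string;
  model: string;
  scenario: string;
}

interface EvalResult extends EvalJob {
  passed: boolean;
  error?: string;
}

// Expand every target x scenario pair into one flat list of jobs.
function buildJobs(
  targets: Array<{ provider: string; model: string }>,
  scenarios: string[],
): EvalJob[] {
  const jobs: EvalJob[] = [];
  for (const target of targets) {
    for (const scenario of scenarios) {
      jobs.push({ provider: target.provider, model: target.model, scenario });
    }
  }
  return jobs;
}

// Start all jobs at once and wait for them together. Each job catches its
// own error, so one failing job cannot reject the whole Promise.all.
function runMatrix(
  jobs: EvalJob[],
  runJob: (job: EvalJob) => Promise<boolean>,
): Promise<EvalResult[]> {
  return Promise.all(
    jobs.map(async (job): Promise<EvalResult> => {
      try {
        const passed = await runJob(job);
        return { ...job, passed };
      } catch (err) {
        return { ...job, passed: false, error: String(err) };
      }
    }),
  );
}
```

The contrast is with a `for` loop that awaits each model in turn: there, total wall time is the sum of all job durations, while here it is roughly the duration of the slowest job.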

Production-Like Vault Runs

For the native vault environment, run live mode against the production two-tool surface (getTools/useTools):

RUN_EVAL=1 EVAL_MODE=live EVAL_TOOL_SET=meta EVAL_TARGETS='openrouter=deepseek/deepseek-v4-pro,openrouter=deepseek/deepseek-v4-flash' npx jest tests/eval/eval.test.ts --runInBand --no-coverage --verbose

Notes:

--runInBand only makes Jest run test files serially in the current process; it does not serialize the eval matrix, which the harness runs concurrently with Promise.all.
RUN_EVAL=1 is required so ordinary test runs do not hit live provider APIs.
EVAL_TOOL_SET=meta restricts scenarios to the production getTools/useTools contract.
EVAL_SCENARIO_NAMES narrows a run to specific scenario names.
EVAL_TRACE_STREAM=1 writes per-scenario JSONL traces under test-artifacts/traces/ as chunks, tool calls, tool events, and assertions arrive.
API keys are read from process env or repo .env; never print credentials.
Reports are written under test-artifacts/.
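
When debugging a run, a quick tally of a trace file can show where a scenario stalled. The sketch below assumes each JSONL line is a JSON object with a string `type` field; that field name is an assumption, since the trace schema is not shown here. Because traces are written as events arrive, the last line may be incomplete, so unparseable lines are counted rather than treated as fatal.

```typescript
// Count trace events by their (assumed) `type` field.
// `jsonl` is the full text content of one per-scenario trace file.
function countTraceEvents(jsonl: string): { [type: string]: number } {
  const counts: { [type: string]: number } = {};
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines

    let parsed: any;
    try {
      parsed = JSON.parse(trimmed);
    } catch (err) {
      // Likely a partially written line from a run that is still streaming.
      counts["malformed"] = (counts["malformed"] || 0) + 1;
      continue;
    }

    const isObject = parsed !== null && typeof parsed === "object";
    const type =
      isObject && typeof parsed.type === "string" ? parsed.type : "unknown";
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}
```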

Target Selection

Preferred arbitrary target format:

EVAL_TARGETS='provider=model,provider=model'

Examples:

EVAL_TARGETS='openrouter=anthropic/claude-sonnet-4.6,openrouter=openai/gpt-5.4-mini'
EVAL_TARGETS='openai=gpt-5.4,openrouter=openai/gpt-5.4'

Single-provider shorthand:

EVAL_PROVIDER=openrouter EVAL_MODELS='deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash'
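
The two selection formats above could be resolved roughly as follows. This is a sketch under assumptions: `parseTargets`, `EvalTarget`, and `EnvMap` are hypothetical names, and giving EVAL_TARGETS precedence over the shorthand is a guess, not confirmed harness behavior. Each entry is split on the first `=` only, so model IDs containing `/` pass through unchanged.

```typescript
interface EvalTarget {
  provider: string;
  model: string;
}

type EnvMap = { [key: string]: string | undefined };

// Split a comma-separated value, trimming whitespace and dropping empties.
function splitList(value: string): string[] {
  const out: string[] = [];
  for (const part of value.split(",")) {
    const trimmed = part.trim();
    if (trimmed !== "") out.push(trimmed);
  }
  return out;
}

function parseTargets(env: EnvMap): EvalTarget[] {
  const targets: EvalTarget[] = [];

  // Preferred format: EVAL_TARGETS='provider=model,provider=model'
  const rawTargets = env.EVAL_TARGETS;
  if (rawTargets) {
    for (const entry of splitList(rawTargets)) {
      const eq = entry.indexOf("=");
      if (eq <= 0 || eq === entry.length - 1) {
        throw new Error('Invalid EVAL_TARGETS entry: "' + entry + '"');
      }
      targets.push({
        provider: entry.slice(0, eq),
        model: entry.slice(eq + 1),
      });
    }
    return targets; // assumed precedence over the shorthand
  }

  // Shorthand: EVAL_PROVIDER=... EVAL_MODELS='model,model'
  const provider = env.EVAL_PROVIDER;
  const models = env.EVAL_MODELS;
  if (provider && models) {
    for (const model of splitList(models)) {
      targets.push({ provider, model });
    }
  }
  return targets;
}
```

In a Node context this would be called as `parseTargets(process.env)`; taking the environment as a parameter keeps the function easy to test.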

Useful overrides:

EVAL_CONFIG=tests/eval/configs/live.yaml
RUN_EVAL=1
EVAL_MODE=live
EVAL_SCENARIOS='tests/eval/scenarios/search-variations.eval.yaml'
EVAL_SCENARIO_NAMES='simple-read,replace-content'
EVAL_TOOL_SET=meta
EVAL_TRACE_STREAM=1
EVAL_MAX_RETRIES=3
EVAL_RETRY_DELAY_MS=1000
EVAL_RETRY_BACKOFF_MULTIPLIER=2
EVAL_RETRY_MAX_DELAY_MS=30000
EVAL_TIMEOUT_MS=120000

Retry notes:

The harness retries provider/server failures such as 408, 409, 425, 429, 5xx, timeouts, and transient transport errors with exponential backoff.
Behavioral scenario failures may still retry up to maxRetries, but auth/validation-only stream errors should fail fast.
Retry delays are per parallel job; one throttled model should not serialize the rest of the matrix.
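
To make the retry settings concrete, the sketch below shows one common way to combine them: the delay grows by the multiplier on each retry and is capped at the maximum. The formula and the names `RetryConfig`, `isRetryableStatus`, `retryDelayMs`, and `shouldRetry` are assumptions for illustration; the harness may add jitter or compute delays differently. The status list mirrors the one given above.

```typescript
interface RetryConfig {
  maxRetries: number; // EVAL_MAX_RETRIES
  delayMs: number; // EVAL_RETRY_DELAY_MS
  backoffMultiplier: number; // EVAL_RETRY_BACKOFF_MULTIPLIER
  maxDelayMs: number; // EVAL_RETRY_MAX_DELAY_MS
}

// Values taken from the example overrides, not confirmed harness defaults.
const exampleRetryConfig: RetryConfig = {
  maxRetries: 3,
  delayMs: 1000,
  backoffMultiplier: 2,
  maxDelayMs: 30000,
};

// Provider/server statuses worth retrying: 408, 409, 425, 429, and any 5xx.
function isRetryableStatus(status: number): boolean {
  if (status >= 500 && status <= 599) return true;
  return [408, 409, 425, 429].indexOf(status) !== -1;
}

// Delay before the Nth retry (1-based), capped at maxDelayMs.
function retryDelayMs(retryNumber: number, config: RetryConfig): number {
  const uncapped =
    config.delayMs * Math.pow(config.backoffMultiplier, retryNumber - 1);
  return Math.min(uncapped, config.maxDelayMs);
}

// Retry only retryable statuses, and only while retries remain.
function shouldRetry(
  status: number,
  retriesSoFar: number,
  config: RetryConfig,
): boolean {
  return retriesSoFar < config.maxRetries && isRetryableStatus(status);
}
```

With the example values, the first three retries would wait 1000 ms, 2000 ms, and 4000 ms. Because each parallel job tracks its own retry count, a throttled model sleeps on its own without delaying the others.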

Interpreting Results

Separate these failure classes:

Provider/API failures: stream errors, auth failures, rate limits, transport errors.
Tool contract failures: missing workspaceId, sessionId, memory, or goal; wrong CLI flags; skipped getTools.
Task failures: valid tools called but wrong tool choice, wrong order, or incomplete multi-step plan.
Harness failures: zero loaded scenarios, non-production tool surface for a vault eval, or leftover generated test vault state.
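
One way to keep these classes separate in a report is to tally them independently of passes, so provider outages are not read as model behavior. The sketch below is illustrative only: the type names, the class labels, and the `summarizeOutcomes` function are hypothetical and do not reflect the actual report format.

```typescript
// Hypothetical labels for the four failure classes listed above.
type FailureClass = "provider" | "tool-contract" | "task" | "harness";

// A failed outcome must carry its class; a passing one carries none.
type ScenarioOutcome =
  | { scenario: string; passed: true }
  | { scenario: string; passed: false; failureClass: FailureClass };

interface OutcomeSummary {
  passed: number;
  failuresByClass: { [K in FailureClass]: number };
}

// Count passes and per-class failures for one target's outcomes.
function summarizeOutcomes(outcomes: ScenarioOutcome[]): OutcomeSummary {
  const summary: OutcomeSummary = {
    passed: 0,
    failuresByClass: { provider: 0, "tool-contract": 0, task: 0, harness: 0 },
  };
  for (const outcome of outcomes) {
    if (outcome.passed) {
      summary.passed += 1;
    } else {
      summary.failuresByClass[outcome.failureClass] += 1;
    }
  }
  return summary;
}
```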

When the user asks how a model performs “in our environment,” report the meta live-run numbers first.

package.json

"author": "ProfSynapse"

"repository": "ProfSynapse/nexus"
