Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

eval-harness

Name: Eval Harness
Author: allenhutchison

// Run the eval harness against a real Obsidian instance, monitor for the harness's known reliability gotchas (concurrent runs, CLI-bridge hangs, stale child processes), and bless a result as a baseline only when the run is clean. Use when the user asks to "run the evals", "bless a baseline", "measure the impact of <change> on solve rate", "run a model sweep", or similar. Has Obsidian-CLI side effects (drives the agent view, creates eval-scratch fixtures, may set chatModelName for the duration of the run); does NOT modify plugin source code or commit unless explicitly asked.

Ejecutar en Manus

$ git log --oneline --stat

stars:453

forks:61

updated:22 de mayo de 2026, 21:59

SKILL.md

readonly

related-skills.json

mismo repositorio

shipping-code-formatter.md

from "allenhutchison/obsidian-gemini"

Use this skill whenever the user asks for a Quill Shipping Company tracking code or shipping reference. It defines the exact, non-obvious format every tracking code must follow.

2026-05-22453

gemini-scribe-help.md

from "allenhutchison/obsidian-gemini"

Answer questions about Gemini Scribe plugin features, settings, and usage, and diagnose plugin errors by reading the user's debug.log (when File Logging is enabled). Activate this skill whenever the user asks how to use the plugin or configure settings, OR reports that something went wrong with the plugin — bug, error, crash, broken behavior, "not working", "what just happened" — especially when they mention the debug log, log file, console output, or want help troubleshooting. Always activate this skill before searching the vault for plugin log files; debug.log lives in the plugin state folder which the standard read_file tool blocks.

2026-05-16453

architecture-audit.md

from "allenhutchison/obsidian-gemini"

Walk the codebase looking for TypeScript technical debt — oversized files, DRY violations, dead code, missing tests, sloppy typing (`any`/`@ts-ignore`), and weak abstractions. Categorize each finding into a discrete unit of work; open a focused PR for mechanically-safe fixes and file a GitHub issue for refactors that need design discussion. One PR or one issue per finding — never bundled. Use when the user asks to "audit the architecture", "find tech debt", "look for code smells", "do an architecture sweep", or when invoked nightly by a scheduled remote agent. Has working-tree side effects (branches + PRs) and GitHub side effects (issues, labels). Quiet-day result is "codebase looks good" with no PR or issue — that's a valid outcome.

2026-05-12453

obsidian-cli.md

from "allenhutchison/obsidian-gemini"

Use the Obsidian CLI to debug, inspect, and test Obsidian plugins during development. Covers plugin reloading, console inspection, runtime evaluation, driving the UI (commands, CDP, screenshots, mobile emulation), frontmatter properties, and common debugging recipes for the gemini-scribe plugin.

2026-05-03453

plugin-test.md

from "allenhutchison/obsidian-gemini"

Three-pass acceptance test for the obsidian-gemini plugin — unit tests, then UI/state via the Obsidian CLI (cheap pass), then API-spending verification (only with explicit user authorization). Driven by the user-facing docs as the source of truth for what should work, with extra focus on functionality shipped since the last release. The agent acts as judge between passes; later passes only run when the earlier ones pass cleanly. Use when the user asks to "test the plugin", "smoke test the release", "verify before release", "run the pre-release tests", "act as a judge on the plugin", or similar. Has Obsidian-CLI side effects (modal opens, plugin reloads, screenshots) but does NOT modify source code or commit; reports go to the working tree under `planning/test-reports/`.

2026-05-03453

audit-docs.md

from "allenhutchison/obsidian-gemini"

Walk every user-facing doc in `docs/guide/`, `docs/reference/`, plus `README.md` and `AGENTS.md`, and validate each factual claim against the code (settings names + defaults, command-palette IDs, file paths, schedule formats, tool names). Patch drift in place; add new docs only for user-visible features that aren't covered. Use when the user asks to "audit the docs", "review the docs", "validate docs against the code", "find doc gaps", "sync docs with the codebase", or similar. Writes drift fixes + new docs to the working tree and stops; the caller (a human, or the `daily-update` meta-skill) is responsible for committing and opening a PR.

2026-04-26453

package.json

"author": "allenhutchison"

"repository": "allenhutchison/obsidian-gemini"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Analistas de garantía de calidad de software y probadoresOcupaciones informáticas y matemáticas15-1253L4

name	eval-harness
description	Run the eval harness against a real Obsidian instance, monitor for the harness's known reliability gotchas (concurrent runs, CLI-bridge hangs, stale child processes), and bless a result as a baseline only when the run is clean. Use when the user asks to "run the evals", "bless a baseline", "measure the impact of <change> on solve rate", "run a model sweep", or similar. Has Obsidian-CLI side effects (drives the agent view, creates eval-scratch fixtures, may set chatModelName for the duration of the run); does NOT modify plugin source code or commit unless explicitly asked.
metadata	{"author":"obsidian-gemini","version":"1.0"}
compatibility	Requires Obsidian desktop with the CLI enabled, the plugin installed in a vault, and a Gemini API key configured in the plugin or `EVAL_JUDGE_API_KEY` for judge-matcher tasks. The Obsidian CLI must be reachable on PATH.

Eval harness — run, monitor, bless

This skill drives npm run eval against a live Obsidian instance, watches for the failure modes we've actually hit in practice, and treats baseline blessing as a quality gate rather than a rubber stamp.

The harness itself is documented operationally in evals/README.md. This skill is the agent-side procedure for using it without producing corrupted baselines or leaving orphaned state behind.

When to use this skill

The user wants to run the eval harness — full sweep, single task, or model sweep
The user wants to bless a result as the baseline for a (provider, model) pair
The user wants to measure the impact of a change on the eval suite (run, compare against baseline, decide)
A new pinned model has shipped and we want a baseline for it

Don't use this skill for:

Editing tasks or fixtures — that's regular development work, just edit evals/tasks/ and evals/fixtures/ directly
Investigating a single failed task in isolation — npm run eval -- --task=<id> once is enough; no skill orchestration needed

Vault guard (mandatory preflight)

The harness drives the agent view, creates eval-scratch/ fixtures in the vault, and may temporarily change settings.chatModelName. Running it against the wrong vault would dirty the user's actual notes.

The Obsidian CLI's vault=<name> flag does not actually route by name — it always targets the focused window. So before running any obsidian or npm run eval command, ask the user to prepare Obsidian and wait for confirmation:

Before I start the eval run, please prepare Obsidian:

Close every vault except the test vault. Focus can drift mid-run; the only safe posture is to have just the test vault open.

Open the test vault (default: Test Vault) and make it the focused window.

Open the agent view pane (Gemini Scribe ribbon icon, or palette → "Open Agent View"). The harness drives this view; if it's not visible you won't see activity, but the run still drives the model.

Save and close any unsaved work — fixtures get planted into eval-scratch/ and torn down per task.

Reply "ready" once Obsidian has only the test vault open and focused.

Then verify with the standard preflight:

EXPECTED="Test Vault"
ACTIVE=$(obsidian eval code="app.vault.getName()" | sed 's/^=> //')
if [ "$ACTIVE" != "$EXPECTED" ]; then
  echo "Aborting: focused vault is \"$ACTIVE\", expected \"$EXPECTED\"."
  exit 1
fi

If the user wants a different test vault, take the name from them and use it as EXPECTED.

Build the plugin under test (mandatory preflight)

The harness drives the plugin already installed in the test vault over the Obsidian CLI — npm run eval builds nothing. If you've changed plugin source (src/) and skip this step, the eval silently measures the old code: the run looks clean, the numbers are just wrong. That's the worst failure mode for a baseline.

Before any run that's meant to reflect current src/:

npm run build && npm run install:test-vault && obsidian plugin:reload id=gemini-scribe
# plugin:reload reports success even if onload() threw — confirm a clean load:
obsidian dev:errors

When you only changed harness files (evals/, baselines, fixtures) and not plugin source, skip this — those run in Node, not inside Obsidian, so no rebuild is needed. When in doubt, build: it's cheap, and a stale-plugin baseline is not.

Pre-run cleanup

State from a previous interrupted run can corrupt a new one (see "Operational gotchas" in evals/README.md, and #777). Always preflight:

# 1. No leftover harness processes
ps aux | grep "node evals/run.mjs" | grep -v grep

# 2. No leftover obsidian eval children
ps aux | grep "obsidian eval" | grep -v grep

# 3. No leaked collector / subscribers / scratch
obsidian eval code="JSON.stringify({hasCollector:typeof window.__evalCollector !== 'undefined', subs:(window.__evalUnsubscribers||[]).length, scratch:!!app.vault.getAbstractFileByPath('eval-scratch')})"

# 4. chatModelName matches the user's expected setting (not a previous run's override)
obsidian eval code="app.plugins.plugins['gemini-scribe'].settings.chatModelName"

If any of these show stale state, clean it up before starting:

# Kill leftover processes
kill -KILL <pid>

# Reap leaked collector + subscribers
obsidian eval code="(() => { for (const u of (window.__evalUnsubscribers || [])) try { u(); } catch {}; window.__evalUnsubscribers = []; delete window.__evalCollector; })()"

# Restore chatModelName if it's wrong (note the actual user's setting first if you don't already know it)
obsidian eval code="(() => { app.plugins.plugins['gemini-scribe'].settings.chatModelName = '<correct-id>'; })()"

Single-tenant rule

Never run two npm run eval instances against the same Obsidian process. They will fight for the agent view, leave fixtures in inconsistent state, and produce stuck CLI children that block both. Symptoms: log shows Setting up N fixture files... but no Session: line; agent view shows a session from a different task than the one printed.

If the user asks for a model sweep, run them sequentially. Wait for one process to fully exit before kicking off the next.

Running

For a single-model run against the user's currently-configured chat model:

npm run eval 2>&1 | tee /tmp/eval-run.log

For Ollama-only sweeps that include judge output matchers, provide a Gemini key for the judge without changing the plugin provider:

EVAL_JUDGE_API_KEY=... npm run eval -- --task=multi-file-summary 2>&1 | tee /tmp/eval-ollama.log

If neither EVAL_JUDGE_API_KEY nor a plugin Gemini API key is available, judge-matcher tasks record judge_skipped: true and print [judge unavailable]; treat those as harness setup failures, not model-quality regressions.

For a model sweep, the canonical pattern is one npm run eval -- --model=<id> per model, awaiting full completion of the previous before starting the next:

for MODEL in gemini-3-flash-preview gemini-2.5-flash gemini-2.5-pro; do
  npm run eval -- --model=$MODEL 2>&1 | tee /tmp/eval-$MODEL.log
  # Wait for full process exit before next iteration
done

A 21-run sweep typically takes 30–90 min depending on the model. Pro is roughly 2× slower than flash and ~9× more expensive per task.

Skip these models

*-latest pointers (gemini-flash-latest, gemini-pro-latest, etc.) — Google may swap the underlying model on any given day. Baselines against -latest aren't stable regression detectors. Use pinned IDs.
gemini-2.5-flash-lite — reproducibly hangs the harness mid-sweep (#778). Skip until that's fixed.

Monitoring during the run

Use a Monitor (or equivalent shell loop) that watches for the things that need intervention, not for routine progress. The harness's own per-turn polling already prints heartbeat lines.

What to watch for

The reliability flag worth catching is the CLI-bridge hang: Turn completed. line printed in the log, but no SOLVED / PASSED / FAILED / TIMEOUT verdict line for ≥ 30s. The harness's execFile timeoutMs: 10000 should fire but doesn't always.

Triage when you see this:

# Check if a manual obsidian eval still works (CLI itself responsive?)
obsidian eval code="1+1"
# Find stale CLI children (alive but 0 CPU for minutes)
ps aux | grep "obsidian eval" | grep -v grep

If a fresh CLI call returns instantly but the harness is still stuck, the parent's specific child is wedged. Workaround:

kill -KILL <stuck-child-pid>

The parent's exec promise will reject with "Command failed", runTask will record an ERROR verdict and continue. Note: this corrupts the run for blessing — see "Bless gate" below.

Quiet monitor pattern

Watching every per-task verdict on a 21-run sweep produces 21 notifications. Prefer a monitor that fires only on the final summary or process exit. Match evals/run.mjs directly (without --model=) so the same pattern works for both npm run eval and npm run eval -- --model=<id>:

until ! pgrep -f "node evals/run.mjs" >/dev/null; do sleep 60; done; echo "exited"

If you need to disambiguate when multiple sweeps are running back-to-back (or when restarting after a hang), capture the PID at launch and wait on it directly:

npm run eval -- --model=gemini-2.5-pro 2>&1 | tee /tmp/eval-pro.log &
EVAL_PID=$!
until ! kill -0 $EVAL_PID 2>/dev/null; do sleep 60; done; echo "exited"

Either pattern gives one notification per run. Spot-check progress between notifications by tailing the log.

Bless gate (quality bar for baselines)

A baseline must reflect actual model behavior, not harness friction. Do not bless a result if any of these happened during the run:

A manual kill -KILL of a stuck CLI child (produces a fake ERROR verdict that suppresses pass^k)
Concurrent runs against the same Obsidian instance
A second Ctrl-C while the first interrupt was still cleaning up (state may be partial)
Any unexplained ERROR line in the verdict list

If the gate fails, rerun the entire sweep before blessing. Don't bless a partial run "with a caveat" — caveats accumulate and degrade the baseline as a regression detector.

If the gate passes, bless and commit:

npm run eval:bless                # or: npm run eval:bless <result.json>

# The bless writes evals/baselines/<provider>-<sanitized-model>.json (gitignored
# results/ stays out of source control). Commit only the baseline file.
git add evals/baselines/<file>.json
git commit -m "chore(evals): bless baseline for <model>"

The commit message should include the headline numbers: pass^3, solve^3, mean cache rate, total cost, flaky-task list, plus a one-line note on anything unusual.

Cross-model observations worth surfacing

When running a model sweep, after all baselines are blessed, write up the cross-model picture in the PR body. Useful axes:

solve^3 ceiling — does spending more on a more capable model actually move the solve rate, or is the ceiling matcher brittleness rather than model capability? (We've seen pro and flash-preview tie at 71.4% — same ceiling, 9× cost difference.)
Cache hit rate variation — same plugin, same prompts, different rates. Cross-model variance is interesting (we've seen 35% → 66% across models).
Flaky-task overlap — if the same task is flaky on multiple models, it's a rubric problem (move to judge matcher per #713). If different tasks are flaky on different models, it's likely real model variance.

Cleanup after run

# Confirm no stragglers
ps aux | grep -E "node evals|obsidian eval" | grep -v grep
# Confirm no leaked plugin state
obsidian eval code="JSON.stringify({hasCollector:typeof window.__evalCollector !== 'undefined', subs:(window.__evalUnsubscribers||[]).length, scratch:!!app.vault.getAbstractFileByPath('eval-scratch')})"
# Confirm chatModelName restored
obsidian eval code="app.plugins.plugins['gemini-scribe'].settings.chatModelName"

If any of these show stale state, run the manual cleanup commands from the pre-run section.

Anti-patterns

❌ Blessing a baseline that contains a manual-kill ERROR
❌ Running two evals concurrently against the same Obsidian process
❌ Blessing a baseline against a *-latest model pointer
❌ Running model sweeps in parallel "to save time"
❌ Per-task notifications on a 21-run sweep (use a quiet monitor that only fires on exit)
❌ Skipping the bless-gate caveat because "it's just one ERROR" — the baseline is the bar; one bad run sets a wrong bar

Related skills

obsidian-cli — for ad-hoc Obsidian inspection / debugging
plugin-test — three-pass acceptance test (different scope, different cost profile)

Tracking issues

#687 — eval harness hill-climb (umbrella)
#776 — obsidian eval children hang after work completes
#777 — SIGINT handler leaks the event-bus collector + subscribers
#778 — gemini-2.5-flash-lite reproducibly hangs the harness

eval-harness

Más de este repositorio

Eval harness — run, monitor, bless

When to use this skill

Vault guard (mandatory preflight)

Build the plugin under test (mandatory preflight)

Pre-run cleanup

Single-tenant rule

Running

Skip these models

Monitoring during the run

What to watch for

Quiet monitor pattern

Bless gate (quality bar for baselines)

Cross-model observations worth surfacing

Cleanup after run

Anti-patterns

Related skills

Tracking issues

Eval harness — run, monitor, bless

When to use this skill

Vault guard (mandatory preflight)

Build the plugin under test (mandatory preflight)

Pre-run cleanup

Single-tenant rule

Running

Skip these models

Monitoring during the run

What to watch for

Quiet monitor pattern

Bless gate (quality bar for baselines)

Cross-model observations worth surfacing

Cleanup after run

Anti-patterns

Related skills

Tracking issues

Más de este repositorio