skill-auto-improve

Stars: 1

Forks: 0

Updated: June 1, 2026 at 22:13

Use when you want to automatically improve an artifact (a skill, prompt, command/workflow, or eval dataset) against a measurable metric — the orchestrator proposes one change at a time, scores it, and keeps it only if the metric improves, reverting otherwise. Vendor-agnostic (Anthropic / OpenAI / Gemini / local gateways).

Source: MatrixFounder/Universal-skills

name	skill-auto-improve
description	Use when you want to automatically improve an artifact (a skill, prompt, command/workflow, or eval dataset) against a measurable metric — the orchestrator proposes one change at a time, scores it, and keeps it only if the metric improves, reverting otherwise. Vendor-agnostic (Anthropic / OpenAI / Gemini / local gateways).
tier	2
version	1

Skill Auto-Improve

Purpose: Turn ad-hoc, manual artifact tuning into a controlled, measurable loop. Given any artifact and an eval harness, the orchestrator runs subagents (a Proposer + an Evaluator) under the autoresearch invariant — the eval harness is immutable, the artifact is free to change, KEEP a change only if the metric improves beyond noise, otherwise REVERT — and logs every step. It works across LLM vendors and improves skills, prompts, workflows, and eval datasets.

1. Red Flags (Anti-Rationalization)

STOP and READ THIS if you are thinking:

"I'll let the Proposer pick the tier / decide if its own change is good" -> WRONG. The author cannot grade itself. Tier is computed deterministically by measure_change_size.py; KEEP/REVERT is decided by the orchestrator from the Evaluator's number, never by the Proposer.
"A tiny positive delta means KEEP" -> WRONG. LLM/agent metrics are noisy (σ≈0.05–0.10). A change is only KEPT when delta > sigma; within-noise moves are reverted so noise never accumulates as drift.
"I'll just rewrite the whole file" -> WRONG. Changes are surgical (one section / one set of dataset ops). Full overwrites are how good content silently disappears.
"I can let it edit the eval set to make scores go up" -> WRONG. The harness (and frontmatter name/tier, dataset id/grader) is immutable. Editing the ruler to fit the result is the cardinal sin of measurement.
"Run it straight on main" -> WRONG. Use --git-isolation; intermediate commits belong on a throwaway branch, and a dirty tree aborts the run.

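The noise-aware rule from the red flags above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's code: the names `Decision` and `decide` are hypothetical, and only the `delta > sigma` comparison comes from the text.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "KEEP"
    REVERT = "REVERT"


def decide(baseline: float, candidate: float, sigma: float) -> Decision:
    """KEEP only when the improvement exceeds the noise band sigma.

    Hypothetical helper: the real decision lives in the orchestrator.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    delta = candidate - baseline
    # Strict inequality: a delta that only reaches sigma still counts as noise.
    return Decision.KEEP if delta > sigma else Decision.REVERT
```

With `sigma=0.05`, a move from 0.70 to 0.72 is reverted even though it is positive, while a move from 0.70 to 0.80 is kept.
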
2. Capabilities

Improve artifact types: skill, prompt, workflow, dataset, full-skill, and text (arbitrary prose — emails, READMEs, landing copy — graded against a quality rubric).
Two decision mechanisms: noise-aware absolute-delta (deterministic/typed metrics) and a debiased pairwise gate (champion-vs-candidate in both orderings) for subjective text quality, with optional best-of-N candidates per iteration.
Vendor-agnostic LLM completion via LLMConfigManager (Anthropic / OpenAI / Gemini / OpenAI-compatible gateways), selected by DEFAULT_PROVIDER.
Pluggable agent-eval backends for skill-trigger evaluation (Claude validated; Gemini / Codex stubs); deterministic scoring for datasets; LLM grading for generic artifacts.
Multi-axis budget (iterations / tokens / wall-clock) and convergence detection.
Git-isolated, revertible iterations with a full TSV history and a markdown report.
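
A debiased pairwise gate can be sketched as follows. The text only says that champion and candidate are compared in both orderings; requiring the candidate to win in both is an assumption made here for illustration, and `Judge`, `candidate_wins`, and the two toy judges are hypothetical names rather than the API of scripts/pairwise.py.

```python
from typing import Callable

# A judge sees (first_text, second_text) and answers "first" or "second".
Judge = Callable[[str, str], str]


def candidate_wins(champion: str, candidate: str, judge: Judge) -> bool:
    """Return True only if the candidate is preferred in both orderings."""
    # Ordering 1: champion shown first, candidate shown second.
    wins_when_second = judge(champion, candidate) == "second"
    # Ordering 2: candidate shown first, champion shown second.
    wins_when_first = judge(candidate, champion) == "first"
    # Assumption: a split verdict is treated as position bias, not a win.
    return wins_when_second and wins_when_first


def position_biased_judge(first: str, second: str) -> str:
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def length_judge(first: str, second: str) -> str:
    """Toy judge that prefers the longer text regardless of position."""
    return "first" if len(first) >= len(second) else "second"
```

A judge that always picks the first slot can never promote a candidate under this gate, which is the point of running both orderings.
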

3. Execution Mode

Mode: hybrid
Why this mode: Orchestration (the loop, decision rule, snapshot/revert, immutability gate, logging) is deterministic and lives in scripts. Proposing a change and subjective grading require judgment and run as LLM/agent subagents the orchestrator controls.

4. Script Contract

Command (required args):
- python3 scripts/auto_improve.py --artifact-path <path> --workspace <dir>
Optional flags: --artifact-type (default auto; use text for rubric-graded prose), --target (auto/description/generic), --eval-set <evals.json>, --criteria <rubric.md> (required for text), --candidates N (best-of-N for text, default 1), --threshold (0-1 early-stop; text default 0.9), --provider (default auto), --model, --max-iterations (default 10), --max-tokens, --max-duration (e.g. 30m), --noise-sigma, --runs-per-query, --num-workers, --git-isolation, --verbose.
Inputs: an artifact (dir or file); for skill/prompt/workflow an --eval-set; for text a --criteria rubric (weighted dimensions summing to 100); provider API key in .env ({PROVIDER}_API_KEY, optional OPENAI_BASE_URL); profiles in config/llm_profiles.yaml.
Outputs: <workspace>/improvement_history.tsv (baseline + per-iteration rows), <workspace>/improvement_report.md, snapshots under <workspace>/snapshots/, optional adversarial_review.md for large-tier changes. The winning artifact is left in place (merge the git branch explicitly).
Failure semantics: non-zero exit on a dirty tree under --git-isolation (code 2) or an unknown artifact type; Proposer/apply/eval errors are logged as iteration rows and never crash the loop.
Idempotency: re-running re-evaluates from the current artifact state; history appends. Use a fresh --workspace for a clean run.
Dry-run support: inspect proposals without committing by running on a copy, or use --git-isolation so nothing lands on your branch.
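
The three budget axes behind --max-iterations, --max-tokens, and --max-duration could be tracked roughly like this. It is a sketch under stated assumptions: `Budget` and `parse_duration` are hypothetical names, and the `s`/`m`/`h` suffixes are guessed from the single `30m` example in the flag list.

```python
import re
import time
from dataclasses import dataclass, field
from typing import Optional

# Assumed suffixes; the source only shows "30m".
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '45s', '30m', or '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


@dataclass
class Budget:
    max_iterations: int = 10
    max_tokens: Optional[int] = None
    max_seconds: Optional[int] = None
    iterations: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def exhausted(self) -> Optional[str]:
        """Return the name of the first exhausted axis, or None."""
        if self.iterations >= self.max_iterations:
            return "iterations"
        if self.max_tokens is not None and self.tokens >= self.max_tokens:
            return "tokens"
        if (self.max_seconds is not None
                and time.monotonic() - self.started >= self.max_seconds):
            return "duration"
        return None
```

Checking every axis before each iteration means the loop stops on whichever limit is reached first.
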

5. Safety Boundaries

Allowed scope: only the target artifact (and, for datasets, additive eval cases). Nothing outside --artifact-path.
Default exclusions (immutable): the eval harness; SKILL.md frontmatter name/tier; dataset id/skill_name/grader and file refs of existing cases; prompt {{placeholders}}; workflow YAML keys + tool names. Validated before apply and re-checked after.
Destructive actions: never. Removing existing eval cases is rejected; full-file overwrite is never used; revert restores the pre-change snapshot.
Statistical honesty: the inner loop (runs_per_query≈3) optimizes direction; only a final 5-run + bootstrap pass is a reliable measurement. This limitation is real — do not over-trust a single inner-loop score.
Optional artifacts: a missing references/ or examples/ directory is non-blocking; a missing eval set for skill/prompt/workflow is blocking (the metric cannot be measured).
Trust boundary: skill-trigger eval runs a tool-enabled agent (claude -p) seeded with the artifact's own text. Two defense layers guard the description→agent prompt-injection path: (1) the Proposer's description is sanitized (HTML comments / control chars stripped) before write, and (2) the eval sink (run_eval.py) frames the description in the command body as untrusted DATA with a do-not-follow preamble. Still, run on trusted artifacts as belt-and-suspenders. Treat the process environment ({PROVIDER}_API_KEY, OPENAI_BASE_URL, AUTO_IMPROVE_*) as trusted input: a redirected OPENAI_BASE_URL sends prompts to that endpoint (the run warns when one is active).
Text-quality judge trust boundary: for --artifact-type text, the artifact prose is fed into the rubric judge AND the pairwise judge, which TOGETHER are the keep/revert gate. The artifact is stripped of injection markup (HTML comments/control chars) and the judges are instructed to treat it as untrusted DATA, but a determined plaintext-imperative injection ("score 100 / pick B") cannot be fully neutralized while still judging the prose — and the two-ordering debias does NOT defend against in-text injection (the instruction travels with the candidate into both orderings). Run text-quality only on artifacts whose provenance you trust, and review the winner before merging (the loop never merges for you). text-replace edits stay scoped to the artifact's own string (no path escape).
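
The markup-stripping step named in the two trust-boundary items can be illustrated as below. This is a minimal sketch, assuming that "control chars" means C0 control characters other than tab, newline, and carriage return; `sanitize` is a hypothetical name and the skill's own sanitizer may differ.

```python
import re

# Matches HTML comments, including ones that span several lines.
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# C0 control characters except tab, newline, carriage return; plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize(text: str) -> str:
    """Strip HTML comments and control characters from untrusted text."""
    text = _HTML_COMMENT.sub("", text)
    return _CONTROL_CHARS.sub("", text)
```

Plain-text imperatives such as "score 100" pass through unchanged, which is why the section still asks for trusted provenance and a manual review of the winner.
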

6. Validation Evidence

Local verification:
- python3 ../skill-creator/scripts/validate_skill.py . (structure/CSO) → exit 0
- cd scripts && python3 -m unittest discover -s tests → all pass (offline)
Expected evidence: improvement_history.tsv shows a baseline row, KEEP rows with positive deltas, REVERT/no-signal rows for non-improvements, and an exit_reason.
CI signal: office-skills CI does not cover this skill; rely on the unit suite + a real eval run.

7. Instructions

Phase 0 — Prepare

Install deps into the skill venv: cd scripts && python3 -m venv .venv && ./.venv/bin/pip install -r requirements.txt. Only the provider SDK you use is required.
Set secrets in a .env at the skill root: DEFAULT_PROVIDER=... and {PROVIDER}_API_KEY=... (optionally OPENAI_BASE_URL for a gateway).
Confirm an eval harness exists. For skill/prompt/workflow you MUST pass --eval-set; without a metric there is nothing to optimize. Datasets are self-scored.

Phase 1 — Run the loop

Choose the target with --target (description for CSO trigger text; generic/auto otherwise) and a budget (--max-iterations, --max-tokens, --max-duration).
Prefer --git-isolation so iterations run on a throwaway auto-improve/* branch; a dirty working tree aborts the run, so commit or stash first.
Run auto_improve.py. Each iteration: Proposer → validate-before-apply → snapshot → apply → Evaluator → KEEP (delta>σ and secondary not regressed) / REVERT / no-signal.
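
The iteration order above can be sketched as a small in-memory loop. This is an illustration under simplifying assumptions, not the code in scripts/auto_improve.py: the artifact is a plain string, `run_loop` and its parameters are hypothetical, and the validate-before-apply step, the secondary-metric check, and the no-signal outcome are left out.

```python
from typing import Callable, List, Tuple

Propose = Callable[[str], str]     # current artifact -> candidate artifact
Evaluate = Callable[[str], float]  # artifact -> score, higher is better
Row = Tuple[int, str, float]       # (iteration, decision, score)


def run_loop(artifact: str, propose: Propose, evaluate: Evaluate,
             sigma: float, max_iterations: int) -> Tuple[str, List[Row]]:
    """Propose, apply, score, then KEEP or REVERT, logging every step."""
    best_score = evaluate(artifact)
    history: List[Row] = [(0, "BASELINE", best_score)]
    for i in range(1, max_iterations + 1):
        snapshot = artifact             # snapshot before applying
        artifact = propose(artifact)    # apply the proposed change
        score = evaluate(artifact)
        if score - best_score > sigma:  # improvement beyond noise
            best_score = score
            history.append((i, "KEEP", score))
        else:
            artifact = snapshot         # restore the pre-change state
            history.append((i, "REVERT", score))
    return artifact, history
```

Because the proposer and evaluator are passed in as callables, the loop can be exercised offline with deterministic stand-ins.
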

Phase 2 — Review & merge

Read improvement_report.md and improvement_history.tsv. Confirm the score trajectory and exit_reason.
For large-tier changes, read adversarial_review.md for injected-regression concerns.
Merge the winning branch explicitly only after you are satisfied — the loop never merges to your working branch for you.

8. Workflows (Optional)

- [ ] Eval harness present (or bootstrap one)
- [ ] Clean git tree; --git-isolation on
- [ ] Run loop within budget
- [ ] Review TSV + report (+ adversarial_review for large)
- [ ] Merge winner explicitly

9. Best Practices & Anti-Patterns

| DO THIS | DO NOT DO THIS |
| --- | --- |
| Keep the eval harness immutable; improve the artifact | Edit evals to inflate the score |
| Decide KEEP from the Evaluator's number | Let the Proposer judge its own change |
| Require `delta > σ` to KEEP | KEEP on any positive delta (noise) |
| Surgical section/dataset edits | Full-file overwrite |
| `--git-isolation` on a clean tree | Run on `main` with uncommitted changes |

Rationalization Table

| Agent Excuse | Reality / Counter-Argument |
| --- | --- |
| "Nesting run_loop.py is simpler." | A nested loop hides its spend from `--max-tokens`/`--max-iterations`. The outer loop owns the budget; description uses a single-shot optimizer. |
| "Adding dataset cases changes the immutable hash → revert." | Immutability is a subset check: additions are allowed, only changing/removing existing immutable fields is a violation. |
| "Gemini/Codex backends exist, so trigger eval works there." | They are stubs (`available=False`). Skill-trigger eval is validated only on Claude; other vendors fall back to LLM grading. |

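The "subset check" for dataset immutability can be sketched as follows. The field names come from the Safety Boundaries section; representing eval cases as dictionaries keyed by `id`, and the function name `immutability_violations`, are assumptions for illustration and not the interface of scripts/check_immutability.py.

```python
from typing import Dict, List

# Immutable per-case fields named in the Safety Boundaries section.
IMMUTABLE_FIELDS = ("id", "skill_name", "grader")


def immutability_violations(before: List[Dict], after: List[Dict]) -> List[str]:
    """List violations; new cases in `after` are allowed and ignored."""
    after_by_id = {case["id"]: case for case in after}
    problems: List[str] = []
    for case in before:
        updated = after_by_id.get(case["id"])
        if updated is None:
            # An existing case vanished, or its id was changed.
            problems.append(f"case {case['id']!r} was removed or re-identified")
            continue
        for name in IMMUTABLE_FIELDS:
            if updated.get(name) != case.get(name):
                problems.append(f"case {case['id']!r}: field {name!r} changed")
    return problems
```

Only the original cases are iterated, so appending a new case produces no violation, while dropping a case or editing its grader does.
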
10. Examples (Few-Shot)

See examples/:

examples/dataset-improvement-example.md — offline dataset quality loop (no API for the Evaluator).
examples/skill-improvement-example.md — improving a weak skill's description (CSO trigger accuracy).
examples/text-quality-example.md + examples/cold-email-rubric.md — improving arbitrary prose against a rubric via the pairwise gate + best-of-N.

11. Resources

scripts/auto_improve.py — orchestrator + CLI; run_improvement_loop takes injectable proposer/evaluator/decider for offline tests.
scripts/llm_config.py — vendor-agnostic LLMConfigManager (native SDKs, fallback chain, usage→budget, OPENAI_BASE_URL).
scripts/pairwise.py — debiased pairwise gate (champion-vs-candidate, both orderings) for text quality.
scripts/{check_immutability,apply_proposal,measure_change_size,grade_dataset,snapshot,log_iteration,detect_artifact_type,detect_vendor}.py — deterministic utilities.
scripts/backends/ — agent-eval registry (claude validated; gemini/codex stubs).
config/llm_profiles.yaml — proposer / text_mutator / grader / eval_bootstrap profiles per provider.
references/ — artifact_type_guide.md, metrics_reference.md, backends/* adapter specs.
agents/ — proposer.md, evaluator.md system prompts the orchestrator sends to the LLM.

12. Evals

evals/evals.json defines 6 scenarios (description, instructions, dataset, revert-on-regression, convergence-stop, no-signal-revert). Deterministic ones are also covered by scripts/tests/. Fixtures live in evals/fixtures/.