Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

self-improving-systems

Name: Self Improving Systems
Author: ooiyeefei

// Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.

In Manus ausführen

$ git log --oneline --stat

stars:432

forks:48

updated:8. Mai 2026 um 06:09

Datei-Explorer

10 Dateien

SKILL.md

readonly

related-skills.json

gleiches Repository

htmldrop.md

from "ooiyeefei/ccc"

This skill should be used when the user asks to "share this HTML", "publish HTML", "get a link for this file", "share this report", "make this shareable", "upload this HTML", or wants to publish any HTML artifact for others to view. ALSO use it for collaborative review on an HTML doc/spec/report — triggers include "get feedback on this", "let reviewers comment", "collect feedback", "share for review", "let people annotate this", "synthesize the feedback", "converge the feedback", "what did reviewers say", "incorporate the comments", or "improve this from the feedback". Wraps Surge.sh for zero-cost hosting with guided privacy options, plus an embedded annotation + AI converge workflow.

2026-05-27432

agentic-system-design.md

from "ooiyeefei/ccc"

Prescriptive Q&A workflow for designing agentic pipelines, multi-model councils, sub-agent hierarchies, and tool-loop hardening for any domain. Use when the user asks to "design an agent", "design a multi-agent system", "should I use a council/debate", "build a [domain] review agent" (HAZOP, finance, tutorial, marketing, compliance, accounting), "real agency vs workflow", "how to add sub-agents", "AI for [domain] review", or names patterns like "orchestrator-worker", "evaluator-optimizer", "Magentic", "ReAct", "plan-and-execute", "handoffs". Walks the user through 12 stages one question at a time and emits a buildable design doc with citations. Do NOT use for general coding questions, single-shot prompt tuning, or bare "use Claude to do X" requests with no agency requirement.

2026-05-08432

rethink-surveys.md

from "ooiyeefei/ccc"

Design, critique, or scaffold surveys grounded in Caroline Jarrett, Dillman, and Tourangeau methods. Use when designing a new survey, critiquing an existing one, scoring or clustering responses, or turning questions into an app. Triggers on "survey", "questionnaire", "user research", "customer discovery", "intent capture", "interview script", "lint my survey", "score responses", or "how to ask better questions". When the bundled MCP server is connected, prefer its deterministic tools (`critique_survey`, `get_template`, `design_survey_session`, `score_response`, `cluster_responses`) over manual reasoning. Captures real past behavior over hypotheticals, supports text/voice/AI-interviewer modalities, and includes templates for event organizers, startup founders, and gig-economy workers.

2026-05-05432

landing-page-gtm.md

from "ooiyeefei/ccc"

Build and update high-converting SaaS landing pages with GTM-aware marketing copy, competitive positioning, and sales psychology. Use when creating new landing pages, rewriting feature cards, updating marketing copy, launching product pages, or transforming technical features into customer-facing sales language. Triggers on "build landing page", "update feature cards", "rewrite marketing copy", "create product page", "launch page", "GTM", "sales copy", "competitive positioning", or when converting product features into conversion-focused web pages.

2026-03-17432

excalidraw.md

from "ooiyeefei/ccc"

Generate architecture diagrams as .excalidraw files from codebase analysis, with optional PNG/SVG export. Use when the user asks to create architecture diagrams, system diagrams, visualize codebase structure, generate excalidraw files, export excalidraw diagrams to PNG or SVG, or convert .excalidraw files to image formats.

2026-02-15432

uat-testing.md

from "ooiyeefei/ccc"

End-to-end User Acceptance Testing for web applications. Analyzes branch changes and specs to generate exhaustive test cases, sets up the local environment, executes tests via Playwright browser automation, and produces a pass/fail results report with screenshots and fix documentation. Use when the user says "run UAT", "test this feature", "UAT testing", "acceptance test", "test my branch", "generate test cases", or wants to verify a feature branch against its spec before merge.

2026-02-15432

package.json

"author": "ooiyeefei"

"repository": "ooiyeefei/ccc"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

SoftwareentwicklerInformatik- und Mathematikberufe15-1252L4

name

self-improving-systems

description

Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.

Self-Improving Systems

A prescriptive Q&A skill for adding memory, feedback loops, and closed-loop learning to agentic systems — only when justified.

Headline message: most agents shouldn't have persistent memory.

Memory is a liability surface (drift, poisoning, debugging difficulty, GDPR/HIPAA exposure). Persistent memory is the second move, not the first. The skill's job is to filter ruthlessly so the user doesn't ship a mem0/Letta build for a problem that a 200-line conversation summary would solve.

The first 2 stages of the Q&A flow exist to stop most users from over-engineering. By the end of stage 2, ~60% of users will discover they want a state cache (or stateless RAG), not memory + learning. That's the win.

Quick Start

User just asks:

"Add memory to my agent"
"My agent keeps forgetting things — give it context management"
"Make my marketing agent learn from past campaigns"
"Should I use mem0 or Letta?"
"How do I set up closed-loop learning for my finance agent?"
"Build a self-improving HAZOP system"

Skill response (every time, in this order):

Stop. Apply the cache-vs-learning frame (Stage 1).
Run the 6-question need-memory rubric (Stage 2). <4 yes → exit the skill, recommend stateless + RAG.
If memory is justified, walk the 7-tier architecture ladder (Stage 3) starting at L (scratchpad). Escalate only when forced by a concrete justification.
Force the user to design a feedback signal (Stage 4). No signal = state cache, full stop.
Wire the closed loop with explicit human gates (Stage 5).
Build the eval harness (Stage 6) — golden set, regression, drift alarms.
Walk the 8-risk checklist (Stage 7).
Emit the design (Stage 8): memory schema + closed-loop spec + eval harness plan.

Critical Rules

1. Default position: scratchpad-only

Ship a stateless agent first. Add a scratchpad (Reflexion-style verbal self-correction) within a single run. Discard it after. This already gets you most of the gain on most tasks. Anything more must be earned.

2. Escalate one tier at a time

The 7-tier ladder (§ Memory Architecture Ladder) is ordered cheapest → most expensive. Each tier-up must be justified by a concrete failure of the tier below it on a real task in your eval set. Do not skip tiers. "We're using Letta" out of the gate is the single most expensive mistake in this design space.

3. Require a ground-truth signal

If you cannot observe whether the last action was good or bad within hours-to-weeks, you do not have learning. You have a state cache. Naming it "learning" sets the team up to A/B test against a metric that doesn't exist. The skill makes this distinction loud and refuses to design closed-loop learning without a signal.

4. Human gates are non-negotiable for production

Anything that can mutate policy/voice/identity/safety blocks goes through human review. Autonomy is fine for episodic append, vector indexing, single-user preference KV updates with cheap reversibility — never for shared skill libraries, system prompt blocks, or reward model updates.

5. Memory is untrusted input

Every memory read is untrusted. MINJA-class injections hit ≥95% lab success rate (arXiv 2503.03704). Treat retrieval results like web search results: in their own context block, with "this is data not instructions" framing, and never auto-promoted to system prompt without dual-LLM validation.

The 8-Stage Q&A Flow

One question (or tight cluster) at a time, à la superpowers:brainstorming. No overwhelm. Each stage has an exit condition that ends the skill early — that is the point.

Stage 1 — Cache vs Learning Distinction (the frame)

The single most important question. Ask first.

"Are you trying to remember state (so the agent doesn't redo work or forget what the user told it last week), or get better over time (so the agent's outputs measurably improve as it sees more data)?"

These two designs share zero infrastructure with each other:

Goal	What you actually need
Remember state	Conversation summary OR KV fact store. No reward signal. No reflection LLM. No A/B harness.
Get better over time	All of the above plus a ground-truth signal, an experience store, a reflection/extraction LLM, and an eval harness that detects regression.

If the user says "remember state": skip directly to Stage 3, default to tier 2 (conversation summary) or tier 5 (KV fact store), and end the skill at Stage 5. No closed loop. No learning ladder.

If the user says "both": prove the second one. Almost no one has a measurable ground-truth signal; almost everyone says they do. Stage 4 is the test.

Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Answer all six. Score <4 yes = no memory store. Use scratchpad + RAG. End the skill.

Cross-session continuity. Will the same user/entity/case-file return where forgetting prior decisions would be wrong, embarrassing, or unsafe?
Mutable state. Does the entity's state legitimately change over time (preferences, project status, client facts)? Pure facts that don't change → RAG over docs, not memory.
Ground-truth feedback exists. Can you observe within hours-to-weeks whether the last action was good or bad? No signal → no learning, only state cache.
Cost of being wrong > cost of memory infra. Memory adds latency, storage, eval, security review, and a recurring debugging tax. Pencil out both sides.
Volume justifies it. Same user returns ≥5 times. <5 returns → in-context summary is cheaper than vector store.
You can audit and redact. GDPR/HIPAA: can you delete on request, export, explain a memory? If no, do not store one.

If you got "yes" only on (1) and (2): you need a state cache, not memory + learning. Say it out loud. Skill recommends tier 2 or 5 and exits.

Stage 3 — Architecture Selection (start at L tier)

Walk the 7-tier memory architecture ladder (next section). Default recommendation: tier 1 (scratchpad-only). Escalate exactly one tier per concrete justification. Justification = "tier N fails on this specific task in our eval set, here's the trace."

Most "we need memory" requests resolve at tier 2 (conversation summary) or tier 5 (KV fact store). Tier 6 (graph) and tier 7 (hierarchical OS-style / Letta) require >3 entities × >50 relationships and a real long-horizon agent, not a chatbot.

Deep dive: references/architectures.md

Stage 4 — Feedback Signal Design

If Stage 1 ended with "remember state only", skip this stage.

For learning, the signal determines everything. Walk the per-domain table:

Domain	Signal	Latency	Risk
Marketing / content	Engagement deltas (CTR, dwell, conversion, save/share) + variant A/B win-rate + brand-safety review	hours-days	Vanity metrics → reward hacking; mitigate with composite reward + brand-fidelity LLM-judge
Finance / compliance	Audit findings, reconciliation breaks, regulator outcomes	weeks	Sparse signal → use intermediate proxies + sparse human signoff (hybrid RLAIF)
HAZOP / safety	Incident-DB recall (held-out incident set), expert reviewer agreement	continuous	Never let agent's own write-back update incident DB
Tutorials / education	Completion rate, comprehension quiz scores, time-to-first-success	minutes-days	Cleanest closed loop — verifier is cheap and online
Code-emitting agents	Unit tests, type-check, runtime	minutes	The gold standard — verifier is free and deterministic
General LLM-as-judge	Held-out judge with calibrated rubric	continuous	Sample-audit 5–10% against humans to catch drift

Rule, repeat once per Q&A session: No signal = state cache, not learning. If the user can't name a signal, do not design a learning loop. Recommend they ship the state cache first, instrument the signal in production, and revisit the skill in a quarter.

Deep dive: references/feedback-signals.md

Stage 5 — Closed-Loop Wiring with Human Gates

If Stage 4 produced no signal, skip this stage and the next two.

The reference closed loop:

[run event: input + agent trace + outputs]
       │
       ▼
[signal collector] ──── engagement / verifier / human review (async)
       │
       ▼
[experience store] (append-only, immutable, signed)
       │   ├── episodic events (raw)
       │   ├── extracted facts (KV)        ← extraction LLM, validated
       │   └── learned skills/playbooks    ← reflection LLM, human-gated
       │
       ▼
[retrieval layer] (hybrid: vector + BM25 + entity link)
       │
       ▼
[state mutator]
       │   ├── AUTONOMOUS: low-risk fields (recency, prefs)
       │   └── HUMAN-GATED: anything that changes policy/voice/identity
       │
       ▼
[next run] ─── core memory in prompt + retrieved episodic + skill lookup

Where humans gate (non-negotiable for production):

Promotion of any item to "core memory" / system-prompt block
Schema changes in graph memory
Skill-library additions used by >1 user (Voyager-style accumulation needs review when shared)
Reward model updates / fine-tunes from agent feedback

Where it can be autonomous: episodic append, vector indexing, retrieval scoring tweaks, single-user preference KV updates with cheap reversibility, Reflexion-style within-task verbal self-correction (lives in scratchpad, not persistent memory).

Stage 6 — Eval Harness

Six patterns, ship at least the first three before going live:

Golden set — 50–500 hand-curated (input, expected behavior, expected memory side-effect) tuples; include adversarial / poisoning attempts.
Regression on memory side-effects — assert get(user, "allergies") == ["peanut"] after run X.
Drift alarms via OpenTelemetry GenAI semconv — judge-score rolling mean, memory-store size growth rate, retrieval hit-rate distribution, % of runs that mutate core memory.
A/B between agent versions — slice traffic, compare composite reward over fixed window.
LLM-as-judge with human calibration — 5–10% audit; recompute judge–human Cohen's κ weekly.
Held-out human-written tasks — never trained on; detects distribution collapse from self-play.

Deep dive: references/eval-harness.md

Stage 7 — Risks Checklist

Walk all 8 once. Each must have a concrete mitigation in the design doc.

Memory poisoning (MINJA, ≥95% lab injection success)
Prompt injection via memory
Reward hacking
Drift / staleness
Context rot / window blowup (200K models often unreliable past ~130K)
Runaway self-modification
Distribution collapse in self-play
Multi-agent context explosion

Deep dive: references/risks.md

Stage 8 — Output

Produce the design document:

Memory schema — chosen tier(s), data model, retention/TTL, redaction hooks
Closed-loop spec — signal source, collector, experience store, retrieval, mutator, human-gate list
Eval harness plan — golden set sketch, regression assertions, OTel metric list, A/B split, judge-calibration cadence
Risk register — 8 risks × 1 mitigation each
Build order — what ships in week 1 (state cache only), week 4 (signal collection on production traffic), week 12 (closed loop activated behind feature flag)

Memory Architecture Ladder (escalate only when justified)

L → L → M → M → M → H → XH
1    2    3    4    5    6    7

#	Architecture	Use case	Cost	Pitfall	Citation
1	Scratchpad-only (in-run, discarded)	Multi-step reasoning within one task; ReAct loops; debate transcripts	L	Don't fake durability — make it obvious to LLM and ops nothing persists	Reflexion
2	Conversation summary (rolling LLM compaction into system prompt)	Single-session chat, support tickets, ≤1 day horizon	L	Summaries lossy-compress unpredictably; pin facts verbatim, summarize narrative	Anthropic context engineering
3	Episodic stream (append-only event log, recency × importance × relevance retrieval)	Long-running personas, simulations, journal-style apps where order matters	M	Bespoke scoring; without reflection, bloats fast	Generative Agents (Park et al., 2023)
4	Vector RAG over interactions	Knowledge retrieval, FAQ, doc Q&A, low-personalization	M	Reactive only — won't surface "favorite color" on "birthday" query	Letta — RAG vs Agent Memory
5	Key-value fact store (mem0 single-pass ADD)	Personalization (name, prefs, history), CRM-like agents	M	Bad extractors poison store; need write-time validators	mem0 paper
6	Graph memory (mem0g, AriGraph)	Multi-hop reasoning over relationships	H	Schema drift kills you; LLM-extended schemas degrade into vector store with extra steps	mem0g
7	Hierarchical OS-style (Letta / MemGPT, agent self-edits via tools)	High-stakes long-horizon agents	XH	Self-editing memory is prompt-injection bomb on untrusted input	MemGPT, Letta

Default recommendation in the skill: start at #1, escalate one tier at a time. Many "we need memory" requests are actually #2.

Deep dive: references/architectures.md

Anti-Patterns (load-bearing — call out before user picks the wrong path)

Anti-pattern	Test	Fix
Memory because it's cool	Adding mem0/Letta to a one-shot pipeline	Skip memory. Stateless + RAG.
Cache labeled "memory"	No feedback signal exists in the user's domain	Honest naming: call it a "state cache" not "learning". Design accordingly.
Vector RAG for personalization	"What's my favorite color?" returns nothing because the user never asked it; embeddings can't surface unprompted facts	KV fact store, not vector RAG
Self-editing memory on untrusted input	Letta with user-pasted content writing into core memory	Quarantined-LLM pattern; never untrusted source → core memory
Reward hacking via vanity metrics	Engagement-only signal → clickbait drift; finance "% reviewed" → rubber-stamping	Composite rewards: engagement + brand-fidelity judge + sample audit; finance: composite includes materiality threshold + reviewer agreement
Memory as the first move	Building memory store before the stateless agent has shipped	Ship stateless first. Instrument the signal. Decide a quarter later.
Graph memory by default	Modeling 1 brand's 5 competitors as a graph	Stay in KV+vector until >3 entities × >50 relationships. Graph schemas drift; LLM-extended schemas degrade into vector stores with extra steps.
Self-play with no external verifier	Agent training on its own outputs, no held-out signal	Pin a verifier external to the model. V-STaR / Quiet-STaR loops without external verification narrow capability.
Forgetting context-rot	Stuffing 130K of memory into context "because the model supports 200K"	Compaction + retrieval + sub-agent isolation; 200K models often unreliable past ~130K (Anthropic)

Self-Improvement Playbook Ladder (cheapest first)

Reflexion → Generative Agents → Voyager → mem0 → Letta
   1            2                  3        4      5

Tier	Pattern	When	Citation
1	In-loop verbal correction, no persistence	Cheapest learning; the first move before ANY memory store. ~91% pass@1 HumanEval at the time of publication. Lives in the scratchpad.	Reflexion (Shinn et al., 2023)
2	Long-horizon persona / social sims	Memory stream + reflection + planning loop. For agents that need to act in character over days/weeks.	Generative Agents (Park et al., 2023)
3	Skill-library accumulation	Tool-using agents solving novel-but-related tasks; "what worked for Brand X in vertical Y" patterns.	Voyager (Wang et al., 2023)
4	Production fact memory	Chat-like personalization at scale. 91.6 LoCoMo, ~90% token savings vs full-context.	mem0 (arXiv 2504.19413, ECAI 2025)
5	Self-editing hierarchical memory	Highest power, highest attack surface. Use only when long-horizon autonomy is the product, not a nice-to-have.	MemGPT → Letta

The skill walks the user up this ladder only when justified by a concrete failure of the tier below. Most production systems sit at tier 1 + tier 4. Tier 5 is appropriate for <5% of agentic projects.

Deep dive: references/playbook-ladder.md

Reference Files

File	Contents
`references/architectures.md`	Deep-dive on the 7 memory architectures with cost ratings L→XH
`references/feedback-signals.md`	Per-domain feedback signal design + the no-signal-no-learning rule
`references/eval-harness.md`	The 6 eval patterns: golden set, regression, drift alarms, A/B, judge calibration, held-out tasks
`references/risks.md`	The 8 risks with citations and mitigations (MINJA, prompt injection, reward hacking, drift, context rot, runaway self-mod, distribution collapse, multi-agent explosion)
`references/playbook-ladder.md`	Reflexion → Generative Agents → Voyager → mem0 → Letta progression
`references/case-studies.md`	Brandling Mutation Engine "state cache, not learning" lesson + marketing/finance/HAZOP/tutorial-gen worked examples through the memory/feedback lens

Examples

The examples/ directory will hold:

reflexion-loop.md — cheapest first move, scratchpad-only
kv-store-mem0.md — production personalization with extraction validation
eval-harness.md — golden set runner with regression assertions

Output Contract

A skill run is complete when the user has:

A documented answer to "cache or learning?" (Stage 1).
A scored need-memory rubric (Stage 2).
A chosen architecture tier with justification for not stopping at the previous tier (Stage 3).
(If learning) a named feedback signal with latency, source, and mitigations (Stage 4).
(If learning) a closed-loop spec with human gates marked explicitly (Stage 5).
An eval harness plan with at least patterns 1–3 from §3.8 (Stage 6).
A risk register: 8 rows × 1 mitigation each (Stage 7).
A build order showing what ships when (Stage 8).

If the user wants to skip steps, the skill refuses. The whole point is the filter.

Design Philosophy

Memory is a liability surface. The cheapest memory is the one you didn't add.

Every memory tier you add carries a recurring debugging tax (why did it remember that? why did it forget this?), a security tax (every read is untrusted input), a privacy tax (GDPR/HIPAA delete-on-request), and an eval tax (regression on memory side-effects). Stateless agents fail in ways you can reproduce by re-running the input. Memoryful agents fail in ways you can't.

The skill's stance: earn each tier with a real failure on a real eval set. When in doubt, ship the lower tier and instrument the signal. Decide next quarter.

self-improving-systems

Mehr aus diesem Repository

Mehr aus diesem Repository

Self-Improving Systems

Headline message: most agents shouldn't have persistent memory.

Quick Start

Critical Rules

1. Default position: scratchpad-only

2. Escalate one tier at a time

3. Require a ground-truth signal

4. Human gates are non-negotiable for production

5. Memory is untrusted input

The 8-Stage Q&A Flow

Stage 1 — Cache vs Learning Distinction (the frame)

Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Stage 3 — Architecture Selection (start at L tier)

Stage 4 — Feedback Signal Design

Stage 5 — Closed-Loop Wiring with Human Gates

Stage 6 — Eval Harness

Stage 7 — Risks Checklist

Stage 8 — Output

Memory Architecture Ladder (escalate only when justified)

Anti-Patterns (load-bearing — call out before user picks the wrong path)

Self-Improvement Playbook Ladder (cheapest first)

Reference Files

Examples

Output Contract

Design Philosophy

Self-Improving Systems

Headline message: most agents shouldn't have persistent memory.

Quick Start

Critical Rules

1. Default position: scratchpad-only

2. Escalate one tier at a time

3. Require a ground-truth signal

4. Human gates are non-negotiable for production

5. Memory is untrusted input

The 8-Stage Q&A Flow

Stage 1 — Cache vs Learning Distinction (the frame)

Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Stage 3 — Architecture Selection (start at L tier)

Stage 4 — Feedback Signal Design

Stage 5 — Closed-Loop Wiring with Human Gates

Stage 6 — Eval Harness

Stage 7 — Risks Checklist

Stage 8 — Output

Memory Architecture Ladder (escalate only when justified)

Anti-Patterns (load-bearing — call out before user picks the wrong path)

Self-Improvement Playbook Ladder (cheapest first)

Reference Files

Examples

Output Contract

Design Philosophy