prompt-engineering
// Write, rewrite, evaluate, and self-improve agent prompts dynamically. Each agent can call this skill to optimize its own context, role, constraints, and output quality over time.
| Field | Value |
|---|---|
| name | prompt-engineering |
| description | Write, rewrite, evaluate, and self-improve agent prompts dynamically. Each agent can call this skill to optimize its own context, role, constraints, and output quality over time. |
| trigger | Agent calls `/prompt-improve` or the Context-Manager plugin invokes it during self-improvement cycles |
| allowed_tools | Memory(create_entities, search_nodes, add_observations), Sequential-Thinking(sequentialthinking), Write |
This skill enables any agent to dynamically rewrite its own system prompt based on execution quality feedback. It turns prompt engineering from a one-time setup task into a continuous, automated optimization loop.
generate: Generate an initial prompt template for a given agent role and objective.
Usage: prompt-engineering generate <agent-role> [objective]
Example:
prompt-engineering generate "code-reviewer" "Review Python code for security vulnerabilities and suggest improvements"
Generates a structured prompt with:
rewrite: Rewrite an existing prompt based on quality feedback or failure analysis.
Usage: prompt-engineering rewrite <current-prompt> [--reason "<why it underperformed>"] [--strategy "more-examples|tighter-constraints|different-role|split-task|expand-context|simplify"]
Example:
prompt-engineering rewrite "review this code" --reason "agent missed 3 SQLi vulnerabilities" --strategy "tighter-constraints"
Rewrite Strategies:
| Strategy | Action |
|---|---|
| more-examples | Add targeted few-shot examples that demonstrate the missed pattern |
| tighter-constraints | Strengthen guardrails (e.g., "ALWAYS check for: X, Y, Z") |
| different-role | Reframe the agent's identity (e.g., "You are a penetration tester" → "You are a security auditor") |
| split-task | Break one complex prompt into multiple specialized sub-prompts |
| expand-context | Add more background (docs, prior results, related code) |
| simplify | Remove noise, make instruction more direct and unambiguous |
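The strategy names in the table above could be modeled as a closed set, so that an unknown `--strategy` value is rejected before any rewrite is attempted. This is a minimal sketch, not the skill's actual implementation; `STRATEGY_ACTIONS`, `RewriteStrategy`, `isRewriteStrategy`, and `parseStrategy` are hypothetical names introduced for illustration.

```typescript
// Hypothetical typing of the strategy table; all names here are illustrative.
const STRATEGY_ACTIONS = {
  "more-examples": "Add targeted few-shot examples that demonstrate the missed pattern",
  "tighter-constraints": "Strengthen guardrails",
  "different-role": "Reframe the agent's identity",
  "split-task": "Break one complex prompt into multiple specialized sub-prompts",
  "expand-context": "Add more background (docs, prior results, related code)",
  "simplify": "Remove noise, make instruction more direct and unambiguous",
} as const;

type RewriteStrategy = keyof typeof STRATEGY_ACTIONS;

// Type guard: narrows an arbitrary CLI string to a known strategy name.
// hasOwnProperty avoids matching inherited keys such as "toString".
function isRewriteStrategy(value: string): value is RewriteStrategy {
  return Object.prototype.hasOwnProperty.call(STRATEGY_ACTIONS, value);
}

// Returns the strategy unchanged, or throws listing the accepted values.
function parseStrategy(value: string): RewriteStrategy {
  if (!isRewriteStrategy(value)) {
    const accepted = Object.keys(STRATEGY_ACTIONS).join(", ");
    throw new Error(`Unknown strategy "${value}". Accepted: ${accepted}`);
  }
  return value;
}

const chosen = parseStrategy("tighter-constraints");
console.log(chosen, "->", STRATEGY_ACTIONS[chosen]);
```

Keeping the table and the type in one object means a strategy added to the table is accepted by the parser without a second edit.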
evaluate: After a task completes, evaluate the quality of the prompt that produced it.
Usage: prompt-engineering evaluate <output> [--expected <expected-output>] [--criteria relevance,completeness,accuracy] [--confidence 0.0-1.0]
Evaluation Criteria (scored 0.0–1.0 each):
Produces a composite quality score and identifies specific weaknesses.
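One way the composite score and weakness list could be derived from per-criterion scores is sketched below. The source does not say how criteria are combined, so the unweighted mean and the 0.7 weakness cutoff are assumptions, and `compositeScore` is a hypothetical helper name.

```typescript
type CriterionScores = Record<string, number>;

interface Evaluation {
  score: number;        // composite score in the range 0.0 to 1.0
  weaknesses: string[]; // criteria below the cutoff, weakest first
}

// Assumption: the composite is an unweighted mean of the criterion scores.
// Assumption: a criterion counts as a weakness when it scores below `weaknessCutoff`.
function compositeScore(scores: CriterionScores, weaknessCutoff = 0.7): Evaluation {
  const entries = Object.entries(scores);
  if (entries.length === 0) {
    throw new Error("At least one criterion score is required");
  }
  for (const [name, value] of entries) {
    // The negated range check also rejects NaN.
    if (!(value >= 0 && value <= 1)) {
      throw new RangeError(`Score for "${name}" must be within 0.0 and 1.0`);
    }
  }
  const total = entries.reduce((sum, [, value]) => sum + value, 0);
  const weaknesses = entries
    .filter(([, value]) => value < weaknessCutoff)
    .sort((a, b) => a[1] - b[1])
    .map(([name]) => name);
  return { score: total / entries.length, weaknesses };
}

const evaluation = compositeScore({ relevance: 0.9, completeness: 0.6, accuracy: 0.75 });
console.log(evaluation.score.toFixed(2), evaluation.weaknesses);
```

A weighted mean would be a natural variation if some criteria matter more for a given agent role.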
history: Track prompt versions alongside quality metrics to build a training signal.
Usage: prompt-engineering history <agent-role> [--limit 10]
Tracks:
audit: A privileged operation where one agent audits another agent's prompt and suggests improvements.
Usage: prompt-engineering audit <agent-name> [--target-task "description of recent task"]
This analyzes an agent's prompt effectiveness using available tools:
At session start:
1. Agent loads current prompt from memory MCP
2. Agent runs /prompt-improve evaluate on last 3 task outputs
3. If average score < 0.75 → /prompt-improve rewrite with [strategy]
4. New prompt deployed for this session
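The decision in step 3 can be sketched as a small predicate. The 0.75 threshold and the window of three outputs come from the steps above; the function name `needsRewrite` and the choice to keep the current prompt when no history exists are assumptions.

```typescript
// Decides whether a rewrite is due, based on the most recent evaluation scores.
// `recentScores` is ordered oldest to newest.
function needsRewrite(recentScores: number[], threshold = 0.75, window = 3): boolean {
  const lastScores = recentScores.slice(-window);
  if (lastScores.length === 0) {
    return false; // assumption: with no history, keep the current prompt
  }
  const average = lastScores.reduce((sum, s) => sum + s, 0) / lastScores.length;
  return average < threshold;
}

// Last three scores are 0.70, 0.72, 0.68, which average 0.70.
console.log(needsRewrite([0.9, 0.7, 0.72, 0.68]));
// Scores average 0.85, above the threshold.
console.log(needsRewrite([0.8, 0.85, 0.9]));
```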
Every N iterations:
- Evaluate current output quality
- If quality degrading → /prompt-improve rewrite (real-time)
- If quality stable → continue with current prompt
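The source does not define "degrading", so the sketch below picks one plausible reading: the latest window of scores averages lower than the window before it by more than a tolerance. The names `qualityTrend`, `window`, and `tolerance`, and their default values, are assumptions.

```typescript
type Trend = "degrading" | "stable";

const mean = (values: number[]): number =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

// Assumption: quality is "degrading" when the most recent `window` scores
// average lower than the preceding `window` scores by more than `tolerance`.
// `scores` is ordered oldest to newest.
function qualityTrend(scores: number[], window = 3, tolerance = 0.05): Trend {
  if (window < 1 || scores.length < 2 * window) {
    return "stable"; // not enough data to compare two full windows
  }
  const recent = scores.slice(-window);
  const previous = scores.slice(-2 * window, -window);
  return mean(previous) - mean(recent) > tolerance ? "degrading" : "stable";
}

console.log(qualityTrend([0.9, 0.9, 0.9, 0.7, 0.7, 0.7]));
console.log(qualityTrend([0.8, 0.8, 0.8, 0.8, 0.79, 0.8]));
```

Comparing window averages instead of single scores keeps one noisy evaluation from triggering a real-time rewrite.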
At session end:
1. Final evaluation of all outputs produced
2. Store quality metrics + prompt version in memory MCP
3. Identify top-performing prompt patterns
4. Update prompt version history
All prompt versions and quality scores are stored in memory MCP:
{
  "entityType": "prompt-version",
  "name": "agent:qa-guardian:prompt:v3",
  "observations": [
    {
      "text": "Prompt template for QA agent v3: improved SQLi detection",
      "confidence": 0.88,
      "quality_score": 0.82,
      "strategy": "tighter-constraints",
      "compared_to": "v2",
      "improvement": "+0.15",
      "date": "2026-05-08"
    }
  ]
}
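The record above could be typed and assembled as follows. The field names mirror the JSON; the interfaces and the helpers `formatDelta` and `buildPromptVersion` are hypothetical, as is the previous score of 0.67 used in the usage example. If the memory server in use only accepts string observations, each observation object could be passed through `JSON.stringify` before storage.

```typescript
interface PromptVersionObservation {
  text: string;
  confidence: number;
  quality_score: number;
  strategy: string;
  compared_to: string;
  improvement: string; // signed delta against `compared_to`, e.g. "+0.15"
  date: string;        // ISO date, YYYY-MM-DD
}

interface PromptVersionEntity {
  entityType: "prompt-version";
  name: string;
  observations: PromptVersionObservation[];
}

// Formats a score delta with an explicit sign and two decimals.
function formatDelta(current: number, previous: number): string {
  const delta = current - previous;
  return `${delta >= 0 ? "+" : "-"}${Math.abs(delta).toFixed(2)}`;
}

// Builds the entity for version N, compared against version N - 1.
function buildPromptVersion(
  agent: string,
  version: number,
  previousScore: number,
  fields: Omit<PromptVersionObservation, "compared_to" | "improvement">,
): PromptVersionEntity {
  return {
    entityType: "prompt-version",
    name: `agent:${agent}:prompt:v${version}`,
    observations: [{
      ...fields,
      compared_to: `v${version - 1}`,
      improvement: formatDelta(fields.quality_score, previousScore),
    }],
  };
}

const entity = buildPromptVersion("qa-guardian", 3, 0.67, {
  text: "Prompt template for QA agent v3: improved SQLi detection",
  confidence: 0.88,
  quality_score: 0.82,
  strategy: "tighter-constraints",
  date: "2026-05-08",
});
console.log(JSON.stringify(entity, null, 2));
```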
The core improvement loop runs in the Context-Manager plugin:
// On task completion (hook in context-manager.ts).
// Method of the Context-Manager plugin class; `this` is the plugin instance.
async onTaskComplete(result) {
  const quality = await this.evaluate(result.output, result.expected);
  await this.storeObservation(result.agent, result.prompt, quality);

  // Rewrite only when the score falls below the configured threshold.
  if (quality.score < this.config.qualityThreshold) {
    const rewrite = await this.promptEngineering.rewrite(
      result.prompt,
      { reason: quality.weakness, strategy: autoSelectStrategy(quality) }
    );
    await this.updateAgentPrompt(result.agent, rewrite);
    console.log(`🎯 Prompt auto-rewritten: ${quality.score} → ${rewrite.expectedImprovement}`);
  }
}

// Pick a rewrite strategy from the lowest-scoring evaluation dimension.
function autoSelectStrategy(quality) {
  const strategyMap = {
    "relevance": "simplify",
    "completeness": "expand-context",
    "accuracy": "more-examples",
    "formatCompliance": "tighter-constraints",
    "actionability": "split-task",
    "efficiency": "simplify"
  };
  // Sort ascending by score; the first entry is the weakest dimension.
  const ranked = Object.entries(quality.dimensions ?? {})
    .sort((a, b) => a[1] - b[1]);
  // Guard against an empty dimensions object instead of throwing on [0][0].
  const weakest = ranked.length > 0 ? ranked[0][0] : undefined;
  return strategyMap[weakest] ?? "more-examples";
}
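To make the selection rule concrete, here is a self-contained typed variant with a worked call. The dimension-to-strategy mapping is copied from the code above; the `Quality` interface and the names `STRATEGY_BY_DIMENSION` and `selectStrategy` are assumptions made for this sketch.

```typescript
interface Quality {
  score: number;
  dimensions: Record<string, number>; // per-dimension scores, 0.0 to 1.0
}

const STRATEGY_BY_DIMENSION: Record<string, string> = {
  relevance: "simplify",
  completeness: "expand-context",
  accuracy: "more-examples",
  formatCompliance: "tighter-constraints",
  actionability: "split-task",
  efficiency: "simplify",
};

// Returns the strategy mapped to the lowest-scoring dimension.
// Falls back to "more-examples" for an empty or unmapped dimension.
function selectStrategy(quality: Quality): string {
  const ranked = Object.entries(quality.dimensions).sort((a, b) => a[1] - b[1]);
  const weakest = ranked[0];
  if (weakest === undefined) {
    return "more-examples";
  }
  return STRATEGY_BY_DIMENSION[weakest[0]] ?? "more-examples";
}

// Completeness is the weakest dimension here, so "expand-context" is selected.
console.log(selectStrategy({
  score: 0.72,
  dimensions: { relevance: 0.9, completeness: 0.55, accuracy: 0.7 },
}));
```

Because the sort is stable, a tie between two dimensions resolves to whichever was listed first in `dimensions`.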
Before (Static prompt):
User asks: "How do I fix a memory leak in my Node.js app?"
Agent prompt: "You are a Node.js developer. Help with debugging."
Result: Generic debugging tips that miss memory-specific analysis.
After (Self-improved prompt):
User asks: "How do I fix a memory leak in my Node.js app?"
Agent prompt: "You are an expert Node.js performance engineer specializing in memory diagnostics.
- ALWAYS check: heap snapshots, Node.js/V8 flags (--trace-gc, --inspect), event loop lag
- Use heapdump, memwatch-next, node-inspect
- Analyze: retained objects, circular refs, event listener leaks
- Format: provide exact code fixes with line numbers"
Result: Targeted, actionable fix with specific tools and code patterns.
| Metric | Description | Target |
|---|---|---|
| prompt_quality_score | Composite of all evaluation criteria | > 0.80 |
| rewrite_improvement_delta | Score change after rewrite | +0.15 or more |
| self_correct_rate | % of rewrites that improve quality | > 70% |
| iterations_to_convergence | Rewrites before quality stabilizes | < 5 |
| token_efficiency | Output quality per token spent | Improving trend |
| failure_pattern_catch_rate | % of known failure patterns caught by evaluator | > 85% |
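Two of these metrics, `rewrite_improvement_delta` and `self_correct_rate`, can be computed directly from before/after scores of past rewrites. The sketch below assumes such pairs are available from the version history; `RewriteRecord`, `RewriteMetrics`, and `summarizeRewrites` are hypothetical names, and the sample scores are made up for illustration.

```typescript
interface RewriteRecord {
  scoreBefore: number; // composite score of the prompt that was replaced
  scoreAfter: number;  // composite score of the rewritten prompt
}

interface RewriteMetrics {
  meanImprovementDelta: number; // average of (after - before)
  selfCorrectRate: number;      // fraction of rewrites where after > before
}

function summarizeRewrites(records: RewriteRecord[]): RewriteMetrics {
  if (records.length === 0) {
    // Assumption: report zeros when there is no rewrite history yet.
    return { meanImprovementDelta: 0, selfCorrectRate: 0 };
  }
  const deltas = records.map((r) => r.scoreAfter - r.scoreBefore);
  const improved = deltas.filter((d) => d > 0).length;
  return {
    meanImprovementDelta: deltas.reduce((sum, d) => sum + d, 0) / deltas.length,
    selfCorrectRate: improved / deltas.length,
  };
}

// Illustrative data only: three of four rewrites improved the score.
const metrics = summarizeRewrites([
  { scoreBefore: 0.6, scoreAfter: 0.8 },
  { scoreBefore: 0.7, scoreAfter: 0.65 },
  { scoreBefore: 0.5, scoreAfter: 0.7 },
  { scoreBefore: 0.8, scoreAfter: 0.9 },
]);
console.log(metrics.meanImprovementDelta.toFixed(4), metrics.selfCorrectRate);
```

Against the targets in the table, this sample would meet the self-correct rate (75% > 70%) but fall short of the +0.15 delta target.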