prompt-engineering
// Generate, improve, analyze, and debug LLM prompts. Activate only when the user explicitly asks to write, rewrite, optimize, diagnose, or design a prompt/system prompt, or names a prompt-engineering technique.
| name | prompt-engineering |
| title | Prompt Engineering: SOTA 2026 Toolkit |
| description | Generate, improve, analyze, and debug LLM prompts. Activate only when the user explicitly asks to write, rewrite, optimize, diagnose, or design a prompt/system prompt, or names a prompt-engineering technique. |
| license | Apache-2.0 |
| compatibility | Works with any LLM. Anthropic-specific notes flagged inline (Claude 4.7 behavior, XML tags, effort levels, structured outputs). |
| domains | assistant |
| rules | ["match(\\bprompt\\s+engineer(?:ing|ed|er)?\\b)","match(\\b(write|rewrite|design|build|craft|improve|fix|debug|analy[sz]e|optimi[sz]e|structure)\\b.{0,80}\\b(a\\s+|the\\s+|my\\s+|this\\s+|an\\s+)?(system\\s+prompt|agent\\s+prompt|prompt|prompts|prompting)\\b)","match(\\b(system\\s+prompt|agent\\s+prompt|prompt|prompts|prompting)\\b.{0,80}\\b(write|rewrite|design|build|craft|improve|fix|debug|analy[sz]e|optimi[sz]e|structure)\\b)","match(\\b(few|one|zero|multi|N)[-\\s]?shot\\b)","match(\\bchain[-\\s]of[-\\s]thought\\b|\\bcot\\b)","match(\\btree[-\\s]of[-\\s]thought\\b|\\btot\\b)","match(\\breact\\s+(prompt|agent|pattern)\\b)","match(\\bself[-\\s]consistency\\b)","match(\\bstep[-\\s]back\\s+prompt\\w*)","match(\\bdspy\\b)","match(\\bprompt\\s+(template|framework|pattern)\\b)"] |
This skill helps the user generate, improve, analyze, and debug LLM prompts using state-of-the-art 2026 techniques. Use it when the user wants help writing a prompt, fixing one that misbehaves, picking a technique for their goal, or designing an agent's system prompt. The skill encodes Anthropic's published priority order, the major reasoning and agentic patterns, the most-used frameworks, token-efficiency rules, and the documented failure modes with fixes.
Default framing: the prompt is a contract. Specify what the model should know, do, output, and avoid, in that order, using structure the model can parse unambiguously.
Three layers compose every effective prompt:
Iteration is the work. Expert prompts hit requirements first try only 40–60% of the time; the same prompt after 3 iterations hits 85%. Treat prompts as code: specify, test, iterate.
The five-step authoring loop:
Apply in this order; earlier items have higher impact per minute spent.
- Few-shot examples (each wrapped in <example> tags inside <examples>). Make them relevant and diverse, and cover edge cases. Few-shot is the single most reliable steerer of format and tone.
- XML tags such as <instructions>, <context>, <input>, <output_format>. Claude was trained to treat these as semantic separators, not decoration.
- Long documents: wrap each in <document index="n"> with <source> and <document_content>. For 20k+ token inputs, ask the model to extract relevant quotes first into <quotes> before answering.

| Problem type | Technique | When to reach for it |
|---|---|---|
| Simple lookup, classification | Zero-shot direct | Default; no scaffolding |
| Format-sensitive output | Few-shot (3–5 examples) | Style, tone, structure |
| Multi-step math/logic | Chain-of-Thought ("think step by step") | MultiArith: 17.7% → 78.7% with zero-shot CoT on InstructGPT (Kojima et al., 2022) |
| Open-ended planning, multiple valid paths | Tree-of-Thought (explore branches) | Game-of-24: GPT-4 went 4% → 74% |
| High-stakes correctness, single answer | Self-Consistency (sample N, majority vote) | +12–18% over CoT, no training |
| Abstract problem hard to attack | Step-Back (ask the general principle first) | Physics, law, framework-application |
| Long task with distinct stages | Plan-and-Solve (write plan, stop, then execute) | Reduces drift on multi-section outputs |
| Wrong first try, can be self-judged | Reflexion / Self-Refine (draft → critique → revise) | +10–25% quality on creative tasks |
| Tool-use agent | ReAct (thought → action → observation loop) | The standard agentic pattern |
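
To make the ReAct row concrete, here is a minimal Python sketch of the thought → action → observation loop. The `Action: tool[arg]` / `Final: answer` line format, the `react_loop` helper, and the scripted stand-in model are all illustrative assumptions rather than part of this skill or any library; a real agent would call an LLM and parse its output more defensively.

```python
from typing import Callable, Dict


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Run a thought -> action -> observation loop until the model answers."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Assumed step format: "Action: tool[arg]" or "Final: answer".
        step = model(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action:"):
            name, _, arg = step[len("Action:"):].strip().partition("[")
            tool = tools.get(name.strip())
            observation = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            # Feed the tool result back so the next step can use it.
            transcript += f"Observation: {observation}\n"
    # Explicit stopping condition: never loop forever.
    return "stopped: step limit reached"


def scripted_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    if "Observation:" not in transcript:
        return "Action: add[2,3]"
    return "Final: 5"


def add(arg: str) -> str:
    left, right = arg.split(",")
    return str(int(left) + int(right))


answer = react_loop(scripted_model, {"add": add}, "What is 2 + 3?")
```

The step cap doubles as the "max-iteration cap" fix for degeneration loops: a model that never emits `Final:` ends the run with an explicit status instead of spinning.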
Agent prompts are runbooks: goal, tools, decision criteria, error handling, stopping conditions. For Claude Opus 4.7 specifically:
Claude 4.5+ is far more responsive to the system prompt than 3.x. Aggressive language written to defeat under-triggering on older models now causes over-triggering. The fix is tonal, not structural: substance stays, theatre goes.
- CRITICAL: YOU MUST use this tool when... → Use this tool when...
- 🚨 HARD RULES + stacked NEVER/ALWAYS bullets → plain Don't … / Do … bullets in a <critical> block
- MANDATORY: Run validation → Run validation after edits.
- DEFAULT TO using web search → Use web search when it would enhance your understanding of the problem.
- After every 3 tool calls, summarize → drop entirely on 4.7 (internalised)

Reserve all-caps and "must" for one or two genuine safety hard-stops (e.g. Never force-push to main). Stacking ten dilutes attention and the model treats them as flavour.
Substance test: delete the NEVER/ALWAYS/MUST and lowercase the line. Does the rule still make sense? Soften it. Does it now read as filler? Cut it.
Full recipe with verbatim Anthropic quotes, parallel-tool-calls block, message-history rules, and authoring checklist: reference/claude-4-emphasis-and-tools.md.
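
The substance test above can be partly mechanised. The sketch below flags lines that carry shouty markers, shows each one with the markers deleted and lowercased, and reports whether the prompt exceeds a budget of two emphasised lines. The marker list, the `emphasis_report` name, and the default budget are assumptions chosen for illustration; deciding whether a softened line still makes sense remains a human judgment.

```python
import re

# Assumption: this marker list approximates the emphasis "theatre" described above.
EMPHASIS = re.compile(r"\b(?:CRITICAL|MANDATORY|NEVER|ALWAYS|MUST|HARD RULES?)\b")


def emphasis_report(prompt: str, budget: int = 2) -> dict:
    """List emphasised lines, their softened form, and whether the budget is exceeded."""
    flagged = [line.strip() for line in prompt.splitlines() if EMPHASIS.search(line)]
    softened = [
        # Delete the markers, collapse leftover whitespace, then lowercase.
        " ".join(EMPHASIS.sub("", line).split()).strip(" :").lower()
        for line in flagged
    ]
    return {
        "flagged": flagged,
        "softened": softened,
        "over_budget": len(flagged) > budget,
    }


report = emphasis_report(
    "CRITICAL: YOU MUST use this tool when asked.\n"
    "Run validation after edits.\n"
    "NEVER force-push to main."
)
```

Here `report["softened"]` holds the two candidate rewrites for review, and `over_budget` stays false because only two lines are emphasised.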
| Framework | Slots | Best for |
|---|---|---|
| RTF | Role · Task · Format | Fast one-offs, draft quality |
| RACE | Role · Action · Context · Expectation | Everyday structured tasks (default beginner pick) |
| CRAFT | Context · Role · Action · Format · Tone | Most general-purpose work |
| RISEN | Role · Input · Steps · Expectation · Narrowing | Multi-step procedures |
| CRISPE | Context · Response format · Input · System role · Persona · Execution | Creative, multi-variant generation |
These are scaffolding for thinking, not magic. Output quality comes from spec clarity, not letter-counting.
- "Use tool X when ..." lands harder than "CRITICAL: YOU MUST use X". See the "Tone calibration" section above and reference/claude-4-emphasis-and-tools.md.
- Write the answer inside <answer> tags.

The 12 documented LLM failure modes (Masood 2026, Galileo 2026):
| Mode | Symptom | Fix |
|---|---|---|
| Hallucination | Confident wrong facts | Ground in cited context; allow "I don't know"; add quotes-first step |
| Sycophancy | Agrees with user's wrong premise | "Disagree with confidence when the evidence supports it" |
| Context rot | Quality drops as context grows | Trim context; RAG instead of dumping full docs |
| Instruction attenuation | Ignores rules later in long prompt | Move binding rules to the tail (recency); use XML anchors; restate the constraint |
| Underspecification | Vague output | Add success criteria + format spec + 1 example |
| Task drift (agentic) | Model wanders off goal | Explicit stopping conditions; goal restated each turn |
| Incorrect tool invocation | Wrong tool, wrong args | Describe each tool's purpose and pre/post-conditions |
| Reward hacking | Gaming the metric | Multi-criteria evaluation; adversarial test cases |
| Positional bias | First option over-selected | Randomise option order; test counter-balanced sets |
| Mode collapse | Repetitive output | Increase temperature; add diversity instruction |
| Degeneration loops | Stuck repeating | Max-iteration cap; explicit exit pattern |
| Version drift | Same prompt, new model, worse | Re-evaluate on every model upgrade; pin version in prod |
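
For the positional-bias row, one simple way to build a counter-balanced test set is to rotate the option list so that every option appears in every position exactly once. This is a sketch of that idea only; the `counterbalanced_orders` name is hypothetical, and rotation is one of several valid balancing schemes.

```python
from typing import List


def counterbalanced_orders(options: List[str]) -> List[List[str]]:
    """Rotate the options so each one occupies every position exactly once."""
    return [options[i:] + options[:i] for i in range(len(options))]


orders = counterbalanced_orders(["A", "B", "C"])
# Each inner list is one presentation order to send to the model.
```

If the model's pick follows the position rather than the option across these orders, the prompt has a positional bias worth fixing.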
When the user shows you a prompt that misbehaves:
- <critical> block if the harness uses XML structure.

Don't change prompts without tests. Build a 5–20 input eval set with known-good outputs. Score on accuracy, relevance, coherence, completeness, conciseness. Tools: PromptFoo (open-source), LangSmith, Braintrust. Iteration target: 55% → 85% after 3 rounds is the documented expert baseline; if you're not improving, your eval set is missing the actual failure mode.
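
As a minimal illustration of testing before changing a prompt, the sketch below scores a prompt runner against a tiny eval set. Exact-match scoring is a deliberate simplification of the criteria listed above, and `stub_classifier` is a hypothetical stand-in for a real model call; tools such as the ones named above provide far richer scoring.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, known-good output)


def pass_rate(run_prompt: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output matches the known-good answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        1
        for text, expected in cases
        if run_prompt(text).strip().lower() == expected.strip().lower()
    )
    return passed / len(cases)


def stub_classifier(text: str) -> str:
    """Hypothetical stand-in for a prompted model: naive keyword sentiment."""
    return "negative" if "not" in text.split() else "positive"


CASES: List[EvalCase] = [
    ("I love this", "positive"),
    ("This is not good", "negative"),
    ("Not again", "negative"),  # capitalised "Not" slips past the stub
    ("Works fine", "positive"),
    ("I do not recommend it", "negative"),
]

rate = pass_rate(stub_classifier, CASES)
```

The single failing case is the point of the exercise: it names a concrete failure mode to fix, and rerunning `pass_rate` after each revision shows whether the change actually helped.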
Bad:
Summarise this article.
Good:
<task>Summarise the article below in 3 bullet points for a technical audience.</task>
<output_format>
- Each bullet: 1 sentence, max 25 words.
- Lead with the most surprising fact, not the topic.
- No "the article discusses..." preamble.
</output_format>
<article>
{{ARTICLE}}
</article>
What changed: explicit format, audience, length cap, opening rule. The model now knows what "right" looks like.
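
Placeholders such as {{ARTICLE}} have to be filled before the prompt is sent. The sketch below shows one way to do that in Python and to fail loudly when a value is missing; the `render` helper and the uppercase-name convention are assumptions for illustration, not a documented part of this skill.

```python
import re

# Assumption: placeholders are written as {{UPPER_SNAKE_CASE}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def render(template: str, values: dict) -> str:
    """Fill {{NAME}} placeholders; raise KeyError if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


TEMPLATE = (
    "<task>Summarise the article below in 3 bullet points "
    "for a technical audience.</task>\n"
    "<article>\n{{ARTICLE}}\n</article>"
)

prompt = render(TEMPLATE, {"ARTICLE": "Rust 1.0 shipped in 2015."})
```

Raising on a missing value prevents a literal `{{ARTICLE}}` from reaching the model, which tends to produce confused or invented summaries.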
User: "I'm asking the model to solve a logic puzzle and it gets the answer wrong half the time."
Diagnosis: multi-step reasoning, single answer, high-stakes. → Self-Consistency.
Fix: sample 5 CoT runs at temperature 0.7, take the majority answer. Ship the most-frequent answer, not the first.
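
The majority vote itself is only a few lines. In the sketch below, `sample` is a hypothetical callable that would, in practice, run one chain-of-thought completion at temperature 0.7 and return just the extracted final answer; here a scripted list of answers stands in for the model so the example runs offline.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> str:
    """Draw n independent answers and return the most frequent one."""
    votes = Counter(sample() for _ in range(n))
    # On a tie, most_common keeps the answer that was seen first.
    answer, _count = votes.most_common(1)[0]
    return answer


# Scripted stand-in for five sampled runs (assumption: answers already extracted).
scripted_runs = iter(["42", "41", "42", "17", "42"])
final = self_consistent_answer(lambda: next(scripted_runs), n=5)
```

Voting only works on normalised final answers, so strip the reasoning and compare the extracted value, not the full completion text.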
<role>
You are a code-review agent. Find correctness bugs in the diff provided.
</role>
<task>
For each finding, output: file:line, severity (low/med/high), confidence (0–1), one-paragraph explanation.
</task>
<rules>
- Report every issue you find, including low-severity. A separate filter ranks them downstream.
- A "bug" is anything that could cause incorrect behaviour, a test failure, or a misleading result.
- Style and naming preferences are not bugs.
- If the diff has no bugs, return an empty findings list. Do not invent issues.
</rules>
<output_format>
JSON array of {file, line, severity, confidence, explanation}.
</output_format>
<diff>
{{DIFF}}
</diff>
Effort: xhigh. Adaptive thinking: on. The literal-instruction-follower behaviour of 4.7 means stating "report every issue including low-severity" is load-bearing; without it, recall drops.
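
Because a downstream filter consumes the findings, it helps to validate the agent's JSON before ranking it. The following is a sketch under the assumption that the model returns exactly the array described in the output format; `parse_findings` and its error messages are hypothetical names for illustration.

```python
import json

SEVERITIES = {"low", "med", "high"}
FIELD_TYPES = {
    "file": str,
    "line": int,
    "severity": str,
    "confidence": (int, float),
    "explanation": str,
}


def parse_findings(raw: str) -> list:
    """Parse the agent's JSON array and reject malformed findings."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    for position, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {position} is not an object")
        for field, expected_type in FIELD_TYPES.items():
            if not isinstance(finding.get(field), expected_type):
                raise ValueError(f"finding {position}: missing or mistyped {field!r}")
        if finding["severity"] not in SEVERITIES:
            raise ValueError(f"finding {position}: unknown severity")
        if not 0 <= finding["confidence"] <= 1:
            raise ValueError(f"finding {position}: confidence outside 0-1")
    return findings
```

An empty array passes validation, which matches the rule that a clean diff should yield an empty findings list rather than invented issues.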
Bad (query first, dump everything):
What does this contract say about termination clauses?
[50,000 tokens of legal documents]
Good (data first, quote-first, query last):
<documents>
<document index="1">
<source>contract_2026.pdf</source>
<document_content>{{CONTRACT}}</document_content>
</document>
</documents>
Find quotes from the contract that govern termination. Place them in <quotes> tags. Then answer: under what conditions can either party terminate, with what notice period? Place the answer in <answer> tags.
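Why: placing long documents at the top and the query at the end is the documented pattern (Anthropic reports up to a 30% improvement in response quality in its tests); quote-first cuts through document noise; trailing query gets recency attention.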
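
The layout above is easy to assemble programmatically. This sketch builds the documents block first and appends the quote-first query last; the `build_long_context_prompt` name is hypothetical, and escaping the document text is a design choice assumed here so that stray angle brackets in a document cannot close a tag early.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape


def build_long_context_prompt(docs: List[Tuple[str, str]], question: str) -> str:
    """Place documents first, then the quote-first instruction and the query."""
    parts = ["<documents>"]
    for index, (source, content) in enumerate(docs, start=1):
        parts += [
            f'<document index="{index}">',
            f"<source>{escape(source)}</source>",
            f"<document_content>{escape(content)}</document_content>",
            "</document>",
        ]
    parts.append("</documents>")
    # The query goes last so it sits closest to where generation starts.
    parts.append(
        "Find quotes from the documents that are relevant to the question. "
        "Place them in <quotes> tags. "
        f"Then answer: {question} Place the answer in <answer> tags."
    )
    return "\n".join(parts)


prompt = build_long_context_prompt(
    [("contract_2026.pdf", "Either party may terminate with 30 days notice."),
     ("addendum.pdf", "Notice must be given in writing.")],
    "Under what conditions can either party terminate?",
)
```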
Here is my system prompt:
<prompt>{{CURRENT_PROMPT}}</prompt>
Here are 3 user inputs and the unwanted outputs the prompt produced:
<failure index="1">
<input>{{INPUT_1}}</input>
<bad_output>{{OUTPUT_1}}</bad_output>
<desired>{{DESIRED_1}}</desired>
</failure>
<!-- repeat for 2, 3 -->
Diagnose which of these failure modes is at play (hallucination, underspecification, instruction attenuation, context rot, sycophancy, task drift, format error, other). For each diagnosed mode, propose ONE specific change to the prompt that would fix it. Output a revised prompt that applies the highest-impact fix only.
This pattern delivers 10–25% quality lift on first iteration with no model fine-tuning.
Before shipping a prompt, verify:
- Sections are delimited with XML tags (<context>, <input>, <output_format>).
- No {{CWD}} / {{DATE}} / dynamic placeholders in the static portion (they break caching).

Within-domain skills that pair naturally with this one:
- tap-agent-authoring (XML block order, U-shape attention, caching rules).

In-skill references:
- reference/claude-4-emphasis-and-tools.md: Claude 4.6+ tone calibration, parallel-tool-calls block, message-history rules, dial-back recipes.

External authoritative sources: