| name | prompt-engineering |
| description | Applies TDD methodology and research-backed practices (Meincke 2025) for writing or improving LLM prompts: measure baseline, test don't assume, iterate rigorously. Prevents assuming universal techniques work. Includes persuasion principles for compliance. Triggers: 'write a prompt', 'improve prompt', 'prompt not working', general prompting, application development. Not for formatting-only tasks requiring no iteration. |
Prompt Engineering
Overview
Prompt engineering is complicated and contingent (Meincke et al. 2025).
Techniques that help in one context may hurt in another.
The solution: test, don't assume.
Core principle: Measure baseline, apply minimal changes, test rigorously.
TDD Methodology for Prompts
Follow the red-green-refactor skill for the core TDD methodology.
This section defines prompt-specific specializations.
Red Phase: Prompt-Specific Baseline
In addition to red-green-refactor baseline requirements:
- Create test cases:
  - Minimum 5-10 representative examples
  - Include edge cases
  - Cover expected input variations
- Run the current prompt and document results:
  - How many test cases pass?
  - What patterns appear in the failures?
  - Where does it struggle?
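The baseline steps above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not part of the red-green-refactor skill: `PromptCase`, `run_baseline`, and `fake_model` are hypothetical names, and `fake_model` is a deterministic stand-in that a real run would replace with a call to your model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptCase:
    name: str
    input_text: str
    expected: str
    tag: str = "typical"  # "typical" or "edge", used to group failures


def run_baseline(prompt: str, cases: list[PromptCase], model: Callable[[str], str]) -> dict:
    """Run each case once; record the pass count and which cases failed."""
    failures = [
        case for case in cases
        if model(f"{prompt}\n\n{case.input_text}").strip() != case.expected
    ]
    return {
        "passed": len(cases) - len(failures),
        "total": len(cases),
        "failed_names": [case.name for case in failures],
        "failure_tags": Counter(case.tag for case in failures),
    }


def fake_model(full_prompt: str) -> str:
    # Hypothetical stand-in for a real model call: any text containing
    # "great" is labeled positive, everything else unknown.
    return "positive" if "great" in full_prompt.lower() else "unknown"


CASES = [
    PromptCase("plain praise", "Great phone, love it.", "positive"),
    PromptCase("plain complaint", "Battery died in a week.", "negative"),
    PromptCase("mixed", "Screen is great, speaker is awful.", "mixed", tag="edge"),
    PromptCase("sarcasm", "Oh great, it broke on day one.", "negative", tag="edge"),
    PromptCase("empty input", "", "unknown", tag="edge"),
]

baseline = run_baseline("Classify the sentiment of this review.", CASES, fake_model)
print(f"{baseline['passed']}/{baseline['total']} passed; failures: {baseline['failed_names']}")
```

Recording `failed_names` and `failure_tags` alongside the pass count keeps the failure patterns available for comparison after each later change.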
Prompt-specific forbidden rationalizations:
- "I'll just try this technique"
- "Step-by-step thinking always helps"
- "Being polite to AI improves output"
- "This is a simple prompt, no need to test"
- "I know this works from experience"
Green Phase: Prompt Techniques
Apply one technique at a time. Test after each change.
Core Techniques
1. Clear, Explicit Instructions (highest priority)
State exactly what you want.
Modern models respond best to direct communication.
Bad: "Explain climate change"
Good: "Write 3 paragraphs explaining climate change for high school students. Use bullet points for key facts. Maintain neutral tone."
Required elements:
- Task description
- Output format
- Tone/style
- Constraints
- Audience (if relevant)
2. Examples (Few-Shot)
Show the model what you want through examples.
Provide 2-5 input-output pairs demonstrating desired behavior.
More examples generally improve performance, with diminishing returns after about 5.
Format:
Example 1:
Input: [example input]
Output: [desired output]
Example 2:
Input: [example input]
Output: [desired output]
Now complete:
Input: [actual input]
Output:
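The format above can be generated from a list of pairs rather than written by hand, which makes it easy to add or remove one example at a time between test runs. A minimal sketch; `build_few_shot_prompt` is a hypothetical helper and the review texts are invented for illustration.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], actual_input: str) -> str:
    """Render input-output pairs in the Example N / Input / Output layout."""
    blocks = [
        f"Example {i}:\nInput: {example_input}\nOutput: {desired_output}"
        for i, (example_input, desired_output) in enumerate(examples, start=1)
    ]
    # The final block leaves "Output:" open for the model to complete.
    blocks.append(f"Now complete:\nInput: {actual_input}\nOutput:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [
        ("Great phone, love it.", "positive"),
        ("Battery died in a week.", "negative"),
    ],
    "Oh great, it broke on day one.",
)
print(prompt)
```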
When an abstract rule misclassifies a borderline case during testing,
the repair is rarely a longer abstract definition. Add the specific
misclassified case verbatim, with the correct verdict, inside the rule.
The example does the discrimination work the abstract definition cannot.
Illustration — a rule that flags redundant transition sentences:
Drop sentences that re-state a fact already implied by surrounding prose.
Example: "The Lambda's MemorySize lives outside this repo. It still
needs a bump on the infra side..." — "infra side" already implies
"outside this repo", so drop the first sentence.
Abstract rules invoke whatever priors the model has; "implied by
surrounding prose" is interpreted differently by different models.
A worked example pins the rule to a concrete classification and acts
as a few-shot anchor every time the rule is read.
Decision rule: when testing reveals a misclassified case, copy that
case into the rule with the correct verdict. Prefer a real draft/
revision pair over a synthetic example — real cases are already
calibrated to the intent.
For the full per-delta refinement loop, see
rule-verification-loop.md.
3. Chain of Thought
For reasoning tasks, ask the model to show its work.
Add: "Think through this step-by-step" or "Explain your reasoning"
Note: Meincke et al. found that the value of CoT decreases with newer models.
Test whether it helps your use case.
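One way to test whether the chain-of-thought instruction helps is to score the same cases with and without it and keep it only if the pass count rises. The sketch below assumes a deterministic stub in place of a model: `stub_model` ignores the suffix entirely, so both variants tie. Its behavior is invented so the script runs and says nothing about real models. All function names are hypothetical.

```python
from typing import Callable

COT_SUFFIX = "Think through this step-by-step."

Case = tuple[str, str]  # (question, expected final answer)


def final_answer(output: str) -> str:
    """Take the last non-empty line so reasoning text does not break the check."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def pass_count(prompt: str, cases: list[Case], model: Callable[[str], str]) -> int:
    return sum(
        1 for question, expected in cases
        if final_answer(model(f"{prompt}\n{question}")) == expected
    )


def compare_cot(base_prompt: str, cases: list[Case], model: Callable[[str], str]) -> dict:
    """Score the prompt with and without the CoT suffix on identical cases."""
    baseline = pass_count(base_prompt, cases, model)
    with_cot = pass_count(f"{base_prompt}\n{COT_SUFFIX}", cases, model)
    return {"baseline": baseline, "with_cot": with_cot, "keep_cot": with_cot > baseline}


def stub_model(full_prompt: str) -> str:
    # Hypothetical stand-in for a real model call. It answers from a canned
    # table (one entry wrong on purpose) and ignores the CoT suffix.
    canned = {"2 + 2": "4", "3 * 7": "21", "10 / 4": "2"}
    question = full_prompt.splitlines()[-1]
    return canned.get(question, "")


result = compare_cot(
    "Answer with the number only.",
    [("2 + 2", "4"), ("3 * 7", "21"), ("10 / 4", "2.5")],
    stub_model,
)
print(result)
```

Requiring a strict improvement (`with_cot > baseline`) follows the minimal-change principle: a tie means the extra instruction is dropped.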
4. Structure and Formatting
Separate different parts clearly.
Use:
- Clear headings
- Whitespace between sections
- Consistent formatting
Use XML tags when sections must not bleed into each other
(scoped rules, multi-section instruction docs, mixed instructions + data + examples),
or when parsing output programmatically.
Models do not suggest XML for context bleed unprompted, so apply it proactively.
See structural-delimiters.md
for model-specific preferences (Claude, GPT, Gemini) and the rendering caveat.
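As a rough illustration of keeping instructions, examples, and data from bleeding into each other, the sketch below wraps each section in its own tag pair. `wrap_sections` and the tag names are hypothetical choices, not names taken from structural-delimiters.md, and the helper does not escape tag-like text inside the section bodies.

```python
def wrap_sections(sections: dict[str, str]) -> str:
    """Wrap each named section in a matching open/close XML-style tag."""
    parts = [
        f"<{tag}>\n{body.strip()}\n</{tag}>"
        for tag, body in sections.items()
    ]
    return "\n\n".join(parts)


prompt = wrap_sections({
    "instructions": "Summarize the review in one sentence.",
    "examples": "Input: Great phone.\nOutput: The customer is happy with the phone.",
    "review": "Battery died in a week. Support never replied.",
})
print(prompt)
```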
5. Output Format Specification
Be explicit about format.
For applications: Use SDK schema/configuration when available.
Don't embed JSON schema in prompt if SDK provides native schema support.
For chat: Specify format in instructions ("Respond in bullet points", "Use markdown table").
Refactor Phase: Prompt-Specific Testing
In addition to red-green-refactor testing requirements:
Single-run testing masks variability.
Meincke et al. showed 60-point swings on individual questions.
Performance collapses when the passing threshold is strict (for example, correct on every run).
Required testing:
1. Run prompt on all test cases (minimum 5-10)
2. For high-stakes applications: Run multiple times per test case
3. Measure:
   - Success rate
   - Consistency across runs
   - Edge case handling
4. Match evaluation to use case:
   - Exploratory tasks: 51% majority-correct may suffice
   - Critical applications: Need higher consistency
For parallel fresh-agent pressure runs (the operational form of step 2),
see pressure-testing.md.
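The measurements above can be sketched as per-case pass rates over repeated runs, then scored against a lenient and a strict threshold. This is an illustration under stated assumptions: `flaky_model` is a hypothetical stub that takes a run index to simulate variability deterministically, whereas a real model call varies on its own, and all function names are invented here.

```python
from typing import Callable


def case_pass_rates(
    prompt: str,
    cases: dict[str, tuple[str, str]],
    model: Callable[[str, int], str],
    runs: int = 10,
) -> dict[str, float]:
    """Fraction of runs that pass, computed separately for each test case."""
    rates = {}
    for name, (input_text, expected) in cases.items():
        passes = sum(
            1 for run in range(runs)
            if model(f"{prompt}\n{input_text}", run) == expected
        )
        rates[name] = passes / runs
    return rates


def share_meeting(rates: dict[str, float], threshold: float) -> float:
    """Fraction of test cases whose pass rate reaches the threshold."""
    return sum(1 for rate in rates.values() if rate >= threshold) / len(rates)


def flaky_model(full_prompt: str, run: int) -> str:
    # Hypothetical stand-in: one case is stable, one passes on 6 of 10 runs,
    # one passes on 2 of 10 runs.
    question = full_prompt.splitlines()[-1]
    if question == "2 + 2":
        return "4"
    if question == "10 / 4":
        return "2.5" if run % 5 < 3 else "2"
    return "391" if run % 5 == 0 else "381"


rates = case_pass_rates(
    "Answer with the number only.",
    {"stable": ("2 + 2", "4"), "flaky": ("10 / 4", "2.5"), "hard": ("17 * 23", "391")},
    flaky_model,
)
print(rates)
print("majority-correct (51%):", share_meeting(rates, 0.51))
print("correct on every run:", share_meeting(rates, 1.0))
```

With this stub, two of three cases clear the 51% bar but only one is correct on every run, which is the kind of gap a single run would hide.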
Only after rigorous testing can you claim improvement.
Persuasion Principles
LLMs respond to psychological persuasion principles.
Apply these when writing prompts for agent compliance.
See persuasion-principles.md
for the full framework and ethical guidelines.
Three Use Cases
1. Claude Configuration (Skills/Agents)
When creating skills or agent prompts:
- Apply TDD methodology from this skill
- Use frontmatter for metadata (name, description)
- Structure with clear headings
- Test agent behavior against baseline
- Refine based on observed failures
Note: Specialized guidance for agent documentation exists
(quality gates, token efficiency requirements)
but core TDD methodology remains the same.
2. Ad-Hoc Prompting
When using chat interfaces:
- Start with clear instructions
- Add examples if output misses the mark
- Refine based on actual responses
- Iterate quickly in conversation
3. Application Development
When building applications with LLM APIs:
Separate concerns:
- Prompt content: Task instructions, examples, context
- SDK configuration: Output schema, system prompts, temperature, safety settings
Decision criteria - Use SDK configuration for:
- Output format/schema (JSON structure, field types)
- Generation parameters (temperature, top_k, max_tokens)
- System-level instructions (role, behavior guidelines)
- Safety settings, stop sequences
Use prompt content for:
- Task-specific instructions
- Domain context
- Examples demonstrating desired behavior
- Input data to process
Don't embed in prompt what SDK handles natively.
This improves maintainability and leverages SDK optimizations.
Keep prompts focused on the task, let SDK handle structure.
Example separation:
// SDK Configuration
schema = {type: "object", properties: {...}}
systemPrompt = "You are a helpful assistant"
temperature = 0.7
// Prompt Content (focused on task)
userPrompt = "Analyze the following customer review and extract sentiment..."
Don't mix these in one blob.
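The same separation can be sketched in Python. Nothing here is a real SDK's API: `SdkConfig`, `build_user_prompt`, `build_request`, the request keys, and the `sentiment` schema field are all hypothetical, chosen only to show configuration and prompt content living in different objects until call time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SdkConfig:
    """Settings an SDK would accept as parameters (hypothetical field names)."""
    system_prompt: str
    temperature: float
    output_schema: dict = field(default_factory=dict)


def build_user_prompt(review: str) -> str:
    """Prompt content only: the task instruction plus the input data."""
    return (
        "Analyze the following customer review and extract sentiment.\n\n"
        f"Review: {review}"
    )


def build_request(config: SdkConfig, user_prompt: str) -> dict:
    """Combine the two at call time; this request shape is illustrative."""
    return {
        "system": config.system_prompt,
        "temperature": config.temperature,
        "schema": config.output_schema,
        "user": user_prompt,
    }


config = SdkConfig(
    system_prompt="You are a helpful assistant",
    temperature=0.7,
    # Hypothetical schema field, for illustration only.
    output_schema={"type": "object", "properties": {"sentiment": {"type": "string"}}},
)
request = build_request(config, build_user_prompt("Battery died in a week."))
print(request["user"])
```

Because the schema never appears in the user prompt, changing the output structure does not require re-testing the task wording, and vice versa.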
Context-Dependent Effectiveness
Meincke et al. findings:
- No universal formulas exist
- Politeness sometimes helps, sometimes hurts
- Constraints help in some cases, hurt in others
- Small wording changes can cause large swings
Implications:
- Test techniques in your context
- Don't copy prompts blindly from the internet
- What worked for others may not work for you
- Measure, don't assume
Red Flags
Stop immediately if you catch yourself thinking:
- "Being polite always helps" → Test it
- "Chain of thought is best practice" → Test it
- "XML tags will improve this" → Identify the failure mode first (e.g., context bleed, mixed content), then test
- "This worked on Twitter" → Test in your context
- "I'll embed the schema in the prompt" → Check if SDK supports native schema
- "One test run looks good" → Run multiple test cases, and repeat runs for high-stakes use
- "Let me try this technique" → Measure baseline first
- "I'll put everything in the prompt" → Separate SDK config from content
- "Simple prompts don't need testing" → All prompts need baseline
- "Let me add multiple improvements" → One change at a time
All of these mean: Stop. Follow TDD methodology.
Anti-Patterns
Forbidden without testing:
- Applying multiple techniques simultaneously
- Assuming universal best practices
- Single-shot evaluation
- Copying prompts without validation
- Embedding structure in prompt when SDK handles it
- Changing prompts without baseline measurement
Required before claiming success:
- Test on multiple cases
- Measure consistency
- Compare to baseline
- Document what actually improved
Writing Anti-Pattern Sections
When writing the anti-patterns section of a prompt,
never put a token (word, command, concept) under a "Bad" label
when a valid use of that token exists.
Agents avoid the token wholesale,
even when a "Good" variant is shown nearby.
Name the dangerous condition, not the token.
See label-poisoning.md for
examples across tool invocation, prose, concepts, and APIs.
Common Issues
Technique applied before failure mode identified: skipped Red phase.
Document what the current prompt does wrong before reaching for a fix.
Baseline not documented: improvement can't be measured.
Record pass/fail counts before any change.
Integration with Other Skills
- red-green-refactor: Core TDD methodology
- verification-before-completion: Test prompts before claiming they work
- systematic-debugging: When prompts fail, investigate systematically
Examples
Good: TDD Approach
User: I need a prompt to summarize customer reviews.
Agent: Let me start by establishing a baseline.
First, what are your success criteria?
- How long should summaries be?
- What information must be included?
- Who is the audience?
[User provides criteria]
Agent: Let me create 5 test reviews covering different scenarios.
I'll run a simple prompt first to measure baseline performance.
[Tests baseline, documents results]
Agent: Baseline passes 2/5 test cases. Common failure: missing sentiment.
Let me add an explicit instruction to include sentiment.
[Tests improved version]
Agent: Now passes 4/5. Remaining failure is an edge case with sarcasm.
Adding an example of sarcastic review handling.
[Tests final version]
Agent: Passes 5/5 test cases. Ready for your review.
Bad: Assumption-Based Approach
User: I need a prompt to summarize customer reviews.
Agent: Here's a prompt using best practices:
Please think step-by-step and carefully analyze the review.
Be polite and thorough. Use structured output.
I used chain of thought, politeness, and structure — these are proven techniques.
[No baseline, no testing, assumed techniques work]
LLM Anti-Pattern Detection
If you output these phrases before testing:
- "Here's a prompt using best practices..." → Test it first
- "Chain of thought will improve..." → Measure baseline
- "Being polite helps..." → Prove it
- "XML tags make this clearer" → Identify the structural problem first (e.g., context bleed, mixed content), then apply and test
- "This should work..." → Run test cases
When detected: Stop, measure baseline, test rigorously, then proceed.
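If drafts are reviewed programmatically, the phrase list above could be checked with a simple scan. This is a loose sketch, assuming plain case-insensitive matching is enough: `flag_untested_claims` and the pattern table are hypothetical, and paraphrases of these phrases would slip through.

```python
import re

# Phrase -> required action, taken from the list above.
UNTESTED_CLAIM_PATTERNS = {
    r"here's a prompt using best practices": "Test it first",
    r"chain of thought will improve": "Measure baseline",
    r"being polite helps": "Prove it",
    r"xml tags make this clearer": "Identify the structural problem first, then apply and test",
    r"this should work": "Run test cases",
}


def flag_untested_claims(draft: str) -> list[tuple[str, str]]:
    """Return (matched phrase, required action) pairs found in a draft reply."""
    found = []
    for pattern, action in UNTESTED_CLAIM_PATTERNS.items():
        match = re.search(pattern, draft, flags=re.IGNORECASE)
        if match:
            found.append((match.group(0), action))
    return found


flags = flag_untested_claims(
    "Here's a prompt using best practices. This should work for most reviews."
)
for phrase, action in flags:
    print(f"{phrase!r} -> {action}")
```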