Run any Skill in Manus with one click

$pwd:

prompt-engineering

Name: Prompt Engineering
Author: sebnow

// Applies TDD methodology and research-backed practices (Meincke 2025) for writing or improving LLM prompts: measure baseline, test don't assume, iterate rigorously. Prevents assuming universal techniques work. Includes persuasion principles for compliance. Triggers: 'write a prompt', 'improve prompt', 'prompt not working', general prompting, application development. Not for formatting-only tasks requiring no iteration.

Run Skill in Manus

$ git log --oneline --stat

stars:7

forks:0

updated:May 16, 2026 at 20:55

File Explorer

6 files

SKILL.md

readonly

package.json

"author": "sebnow"

"repository": "sebnow/configs"

View GitHub Repository

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

Run any Skill with one click

name

prompt-engineering

description

Applies TDD methodology and research-backed practices (Meincke 2025) for writing or improving LLM prompts: measure baseline, test don't assume, iterate rigorously. Prevents assuming universal techniques work. Includes persuasion principles for compliance. Triggers: 'write a prompt', 'improve prompt', 'prompt not working', general prompting, application development. Not for formatting-only tasks requiring no iteration.

Prompt Engineering

Overview

Prompt engineering is complicated and contingent (Meincke et al. 2025). Techniques that help in one context may hurt in another. The solution: test, don't assume.

Core principle: Measure baseline, apply minimal changes, test rigorously.

TDD Methodology for Prompts

Follow red-green-refactor skill for the core TDD methodology. This section defines prompt-specific specializations.

Red Phase: Prompt-Specific Baseline

In addition to red-green-refactor baseline requirements:

Create test cases:
- Minimum 5-10 representative examples
- Include edge cases
- Cover expected input variations
Run current prompt and document results:
- How many test cases pass?
- What patterns in failures?
- Where does it struggle?

Prompt-specific forbidden rationalizations:

"I'll just try this technique"
"Step-by-step thinking always helps"
"Being polite to AI improves output"
"This is a simple prompt, no need to test"
"I know this works from experience"

Green Phase: Prompt Techniques

Apply one technique at a time. Test after each change.

Core Techniques

1. Clear, Explicit Instructions (highest priority)

State exactly what you want. Modern models respond best to direct communication.

Bad: "Explain climate change" Good: "Write 3 paragraphs explaining climate change for high school students. Use bullet points for key facts. Maintain neutral tone."

Required elements:

Task description
Output format
Tone/style
Constraints
Audience (if relevant)

2. Examples (Few-Shot)

Show the model what you want through examples.

Provide 2-5 input-output pairs demonstrating desired behavior. More examples = better performance, but diminishing returns after ~5.

Format:

Example 1:
Input: [example input]
Output: [desired output]

Example 2:
Input: [example input]
Output: [desired output]

Now complete:
Input: [actual input]
Output:

When an abstract rule misclassifies a borderline case during testing, the repair is rarely a longer abstract definition. Add the specific misclassified case verbatim, with the correct verdict, inside the rule. The example does the discrimination work the abstract definition cannot.

Illustration — a rule that flags redundant transition sentences:

Drop sentences that re-state a fact already implied by surrounding prose.

Example: "The Lambda's MemorySize lives outside this repo. It still needs a bump on the infra side..." — "infra side" already implies "outside this repo", so drop the first sentence.

Abstract rules invoke whatever priors the model has; "implied by surrounding prose" is interpreted differently by different models. A worked example pins the rule to a concrete classification and acts as a few-shot anchor every time the rule is read.

Decision rule: when testing reveals a misclassified case, copy that case into the rule with the correct verdict. Prefer a real draft/ revision pair over a synthetic example — real cases are already calibrated to the intent.

For the full per-delta refinement loop, see rule-verification-loop.md.

3. Chain of Thought

For reasoning tasks, ask model to show its work.

Add: "Think through this step-by-step" or "Explain your reasoning"

Note: Meincke et al. found CoT value decreasing with newer models. Test whether it helps your use case.

4. Structure and Formatting

Separate different parts clearly.

Use:

Clear headings
Whitespace between sections
Consistent formatting

Use XML tags when sections must not bleed into each other (scoped rules, multi-section instruction docs, mixed instructions + data + examples), or when parsing output programmatically. Models do not suggest XML for context bleed unprompted — apply proactively.

See structural-delimiters.md for model-specific preferences (Claude, GPT, Gemini) and the rendering caveat.

5. Output Format Specification

Be explicit about format.

For applications: Use SDK schema/configuration when available. Don't embed JSON schema in prompt if SDK provides native schema support.

For chat: Specify format in instructions ("Respond in bullet points", "Use markdown table").

Refactor Phase: Prompt-Specific Testing

In addition to red-green-refactor testing requirements:

Single-run testing masks variability. Meincke et al. showed 60-point swings on individual questions. Performance collapses at strict thresholds.

Required testing:

Run prompt on all test cases (minimum 5-10)
For high-stakes applications: Run multiple times per test case
Measure:
- Success rate
- Consistency across runs
- Edge case handling
Match evaluation to use case:
- Exploratory tasks: 51% majority-correct may suffice
- Critical applications: Need higher consistency

For parallel fresh-agent pressure runs (the operational form of step 2), see pressure-testing.md.

Only after rigorous testing can you claim improvement.

Persuasion Principles

LLMs respond to psychological persuasion principles. Apply these when writing prompts for agent compliance.

See persuasion-principles.md for the full framework and ethical guidelines.

Three Use Cases

1. Claude Configuration (Skills/Agents)

When creating skills or agent prompts:

Apply TDD methodology from this skill
Use frontmatter for metadata (name, description)
Structure with clear headings
Test agent behavior against baseline
Refine based on observed failures

Note: Specialized guidance for agent documentation exists (quality gates, token efficiency requirements) but core TDD methodology remains the same.

2. Ad-Hoc Prompting

When using chat interfaces:

Start with clear instructions
Add examples if output misses the mark
Refine based on actual responses
Iterate quickly in conversation

3. Application Development

When building applications with LLM APIs:

Separate concerns:

Prompt content: Task instructions, examples, context
SDK configuration: Output schema, system prompts, temperature, safety settings

Decision criteria - Use SDK configuration for:

Output format/schema (JSON structure, field types)
Generation parameters (temperature, top_k, max_tokens)
System-level instructions (role, behavior guidelines)
Safety settings, stop sequences

Use prompt content for:

Task-specific instructions
Domain context
Examples demonstrating desired behavior
Input data to process

Don't embed in prompt what SDK handles natively. This improves maintainability and leverages SDK optimizations.

Keep prompts focused on the task, let SDK handle structure.

Example separation:

// SDK Configuration
schema = {type: "object", properties: {...}}
systemPrompt = "You are a helpful assistant"
temperature = 0.7

// Prompt Content (focused on task)
userPrompt = "Analyze the following customer review and extract sentiment..."

Don't mix these in one blob.

Context-Dependent Effectiveness

Meincke et al. findings:

No universal formulas exist
Politeness sometimes helps, sometimes hurts
Constraints help in some cases, hurt in others
Small wording changes can cause large swings

Implications:

Test techniques in your context
Don't copy prompts blindly from internet
What worked for others may not work for you
Measure, don't assume

Red Flags

Stop immediately if you catch yourself thinking:

"Being polite always helps" → Test it
"Chain of thought is best practice" → Test it
"XML tags will improve this" → Identify the failure mode first (e.g., context bleed, mixed content), then test
"This worked on Twitter" → Test in your context
"I'll embed the schema in the prompt" → Check if SDK supports native schema
"One test run looks good" → Run multiple test cases
"Let me try this technique" → Measure baseline first
"I'll put everything in the prompt" → Separate SDK config from content
"Simple prompts don't need testing" → All prompts need baseline
"Let me add multiple improvements" → One change at a time

All of these mean: Stop. Follow TDD methodology.

Anti-Patterns

Forbidden without testing:

Applying multiple techniques simultaneously
Assuming universal best practices
Single-shot evaluation
Copying prompts without validation
Embedding structure in prompt when SDK handles it
Changing prompts without baseline measurement

Required before claiming success:

Test on multiple cases
Measure consistency
Compare to baseline
Document what actually improved

Writing Anti-Pattern Sections

When writing the anti-patterns section of a prompt, never put a token (word, command, concept) under a "Bad" label when a valid use of that token exists. Agents avoid the token wholesale, even when a "Good" variant is shown nearby. Name the dangerous condition, not the token.

See label-poisoning.md for examples across tool invocation, prose, concepts, and APIs.

Common Issues

Technique applied before failure mode identified: skipped Red phase. Document what the current prompt does wrong before reaching for a fix.

Baseline not documented: improvement can't be measured. Record pass/fail counts before any change.

Integration with Other Skills

red-green-refactor: Core TDD methodology
verification-before-completion: Test prompts before claiming they work
systematic-debugging: When prompts fail, investigate systematically

Examples

Good: TDD Approach

User: I need a prompt to summarize customer reviews.
Agent: Let me start by establishing a baseline.

First, what are your success criteria?
- How long should summaries be?
- What information must be included?
- Who is the audience?

[User provides criteria]

Agent: Let me create 5 test reviews covering different scenarios.
I'll run a simple prompt first to measure baseline performance.

[Tests baseline, documents results]

Agent: Baseline passes 2/5 test cases. Common failure: missing sentiment.
Let me add explicit instruction to include sentiment.

[Tests improved version]

Agent: Now passes 4/5. Remaining failure is edge case with sarcasm.
Adding example of sarcastic review handling.

[Tests final version]

Agent: Passes 5/5 test cases. Ready for your review.

Bad: Assumption-Based Approach

User: I need a prompt to summarize customer reviews.
Agent: Here's a prompt using best practices:

Please think step-by-step and carefully analyze the review.
Be polite and thorough. Use structured output.

I used chain of thought, politeness, and structure — these are proven techniques.

[No baseline, no testing, assumed techniques work]

LLM Anti-Pattern Detection

If you output these phrases before testing:

"Here's a prompt using best practices..." → Test it first
"Chain of thought will improve..." → Measure baseline
"Being polite helps..." → Prove it
"XML tags make this clearer" → Identify the structural problem first (e.g., context bleed, mixed content), then apply and test
"This should work..." → Run test cases

When detected: Stop, measure baseline, test rigorously, then proceed.

name

prompt-engineering

description

Prompt Engineering

Overview

Prompt engineering is complicated and contingent (Meincke et al. 2025). Techniques that help in one context may hurt in another. The solution: test, don't assume.

Core principle: Measure baseline, apply minimal changes, test rigorously.

TDD Methodology for Prompts

Follow red-green-refactor skill for the core TDD methodology. This section defines prompt-specific specializations.

Red Phase: Prompt-Specific Baseline

In addition to red-green-refactor baseline requirements:

Create test cases:
- Minimum 5-10 representative examples
- Include edge cases
- Cover expected input variations
Run current prompt and document results:
- How many test cases pass?
- What patterns in failures?
- Where does it struggle?

Prompt-specific forbidden rationalizations:

"I'll just try this technique"
"Step-by-step thinking always helps"
"Being polite to AI improves output"
"This is a simple prompt, no need to test"
"I know this works from experience"

Green Phase: Prompt Techniques

Apply one technique at a time. Test after each change.

Core Techniques

1. Clear, Explicit Instructions (highest priority)

State exactly what you want. Modern models respond best to direct communication.

Bad: "Explain climate change" Good: "Write 3 paragraphs explaining climate change for high school students. Use bullet points for key facts. Maintain neutral tone."

Required elements:

Task description
Output format
Tone/style
Constraints
Audience (if relevant)

2. Examples (Few-Shot)

Show the model what you want through examples.

Provide 2-5 input-output pairs demonstrating desired behavior. More examples = better performance, but diminishing returns after ~5.

Format:

Example 1:
Input: [example input]
Output: [desired output]

Example 2:
Input: [example input]
Output: [desired output]

Now complete:
Input: [actual input]
Output:

Illustration — a rule that flags redundant transition sentences:

Drop sentences that re-state a fact already implied by surrounding prose.

Example: "The Lambda's MemorySize lives outside this repo. It still needs a bump on the infra side..." — "infra side" already implies "outside this repo", so drop the first sentence.

For the full per-delta refinement loop, see rule-verification-loop.md.

3. Chain of Thought

For reasoning tasks, ask model to show its work.

Add: "Think through this step-by-step" or "Explain your reasoning"

Note: Meincke et al. found CoT value decreasing with newer models. Test whether it helps your use case.

4. Structure and Formatting

Separate different parts clearly.

Use:

Clear headings
Whitespace between sections
Consistent formatting

See structural-delimiters.md for model-specific preferences (Claude, GPT, Gemini) and the rendering caveat.

5. Output Format Specification

Be explicit about format.

For applications: Use SDK schema/configuration when available. Don't embed JSON schema in prompt if SDK provides native schema support.

For chat: Specify format in instructions ("Respond in bullet points", "Use markdown table").

Refactor Phase: Prompt-Specific Testing

In addition to red-green-refactor testing requirements:

Single-run testing masks variability. Meincke et al. showed 60-point swings on individual questions. Performance collapses at strict thresholds.

Required testing:

Run prompt on all test cases (minimum 5-10)
For high-stakes applications: Run multiple times per test case
Measure:
- Success rate
- Consistency across runs
- Edge case handling
Match evaluation to use case:
- Exploratory tasks: 51% majority-correct may suffice
- Critical applications: Need higher consistency

For parallel fresh-agent pressure runs (the operational form of step 2), see pressure-testing.md.

Only after rigorous testing can you claim improvement.

Persuasion Principles

LLMs respond to psychological persuasion principles. Apply these when writing prompts for agent compliance.

See persuasion-principles.md for the full framework and ethical guidelines.

Three Use Cases

1. Claude Configuration (Skills/Agents)

When creating skills or agent prompts:

Apply TDD methodology from this skill
Use frontmatter for metadata (name, description)
Structure with clear headings
Test agent behavior against baseline
Refine based on observed failures

Note: Specialized guidance for agent documentation exists (quality gates, token efficiency requirements) but core TDD methodology remains the same.

2. Ad-Hoc Prompting

When using chat interfaces:

Start with clear instructions
Add examples if output misses the mark
Refine based on actual responses
Iterate quickly in conversation

3. Application Development

When building applications with LLM APIs:

Separate concerns:

Prompt content: Task instructions, examples, context
SDK configuration: Output schema, system prompts, temperature, safety settings

Decision criteria - Use SDK configuration for:

Output format/schema (JSON structure, field types)
Generation parameters (temperature, top_k, max_tokens)
System-level instructions (role, behavior guidelines)
Safety settings, stop sequences

Use prompt content for:

Task-specific instructions
Domain context
Examples demonstrating desired behavior
Input data to process

Don't embed in prompt what SDK handles natively. This improves maintainability and leverages SDK optimizations.

Keep prompts focused on the task, let SDK handle structure.

Example separation:

// SDK Configuration
schema = {type: "object", properties: {...}}
systemPrompt = "You are a helpful assistant"
temperature = 0.7

// Prompt Content (focused on task)
userPrompt = "Analyze the following customer review and extract sentiment..."

Don't mix these in one blob.

Context-Dependent Effectiveness

Meincke et al. findings:

No universal formulas exist
Politeness sometimes helps, sometimes hurts
Constraints help in some cases, hurt in others
Small wording changes can cause large swings

Implications:

Test techniques in your context
Don't copy prompts blindly from internet
What worked for others may not work for you
Measure, don't assume

Red Flags

Stop immediately if you catch yourself thinking:

"Being polite always helps" → Test it
"Chain of thought is best practice" → Test it
"XML tags will improve this" → Identify the failure mode first (e.g., context bleed, mixed content), then test
"This worked on Twitter" → Test in your context
"I'll embed the schema in the prompt" → Check if SDK supports native schema
"One test run looks good" → Run multiple test cases
"Let me try this technique" → Measure baseline first
"I'll put everything in the prompt" → Separate SDK config from content
"Simple prompts don't need testing" → All prompts need baseline
"Let me add multiple improvements" → One change at a time

All of these mean: Stop. Follow TDD methodology.

Anti-Patterns

Forbidden without testing:

Applying multiple techniques simultaneously
Assuming universal best practices
Single-shot evaluation
Copying prompts without validation
Embedding structure in prompt when SDK handles it
Changing prompts without baseline measurement

Required before claiming success:

Test on multiple cases
Measure consistency
Compare to baseline
Document what actually improved

Writing Anti-Pattern Sections

See label-poisoning.md for examples across tool invocation, prose, concepts, and APIs.

Common Issues

Technique applied before failure mode identified: skipped Red phase. Document what the current prompt does wrong before reaching for a fix.

Baseline not documented: improvement can't be measured. Record pass/fail counts before any change.

Integration with Other Skills

red-green-refactor: Core TDD methodology
verification-before-completion: Test prompts before claiming they work
systematic-debugging: When prompts fail, investigate systematically

Examples

Good: TDD Approach

User: I need a prompt to summarize customer reviews.
Agent: Let me start by establishing a baseline.

First, what are your success criteria?
- How long should summaries be?
- What information must be included?
- Who is the audience?

[User provides criteria]

Agent: Let me create 5 test reviews covering different scenarios.
I'll run a simple prompt first to measure baseline performance.

[Tests baseline, documents results]

Agent: Baseline passes 2/5 test cases. Common failure: missing sentiment.
Let me add explicit instruction to include sentiment.

[Tests improved version]

Agent: Now passes 4/5. Remaining failure is edge case with sarcasm.
Adding example of sarcastic review handling.

[Tests final version]

Agent: Passes 5/5 test cases. Ready for your review.

Bad: Assumption-Based Approach

User: I need a prompt to summarize customer reviews.
Agent: Here's a prompt using best practices:

Please think step-by-step and carefully analyze the review.
Be polite and thorough. Use structured output.

I used chain of thought, politeness, and structure — these are proven techniques.

[No baseline, no testing, assumed techniques work]

LLM Anti-Pattern Detection

If you output these phrases before testing:

"Here's a prompt using best practices..." → Test it first
"Chain of thought will improve..." → Measure baseline
"Being polite helps..." → Prove it
"XML tags make this clearer" → Identify the structural problem first (e.g., context bleed, mixed content), then apply and test
"This should work..." → Run test cases

When detected: Stop, measure baseline, test rigorously, then proceed.