一键在 Manus 中运行任何 Skill

$pwd:

multi-model-orchestration

Name: Multi Model Orchestration
Author: RazonIn4K

// Orchestrate workflows across multiple AI models (Perplexity, GPT, Grok, Claude, Gemini) for comprehensive security research, competition execution, and strategic analysis using GUI interfaces or API automation

在 Manus 中运行

$ git log --oneline --stat

stars:2

forks:0

updated:2025年11月5日 20:15

SKILL.md

readonly

package.json

"author": "RazonIn4K"

"repository": "RazonIn4K/Red-Team-Learning"

打开 GitHub 仓库

$ install --globalskills.sh

$ download --local

在 Manus 中运行

[HINT] 下载包含 SKILL.md 和所有相关文件的完整技能目录

一键运行任何 Skill

name	multi-model-orchestration
description	Orchestrate workflows across multiple AI models (Perplexity, GPT, Grok, Claude, Gemini) for comprehensive security research, competition execution, and strategic analysis using GUI interfaces or API automation

Multi-Model Orchestration Skill

Overview

This Skill provides workflows for orchestrating tasks across multiple state-of-the-art (SOTA) AI models. It enables strategic task decomposition where each model handles its specialized strength: Perplexity for live intelligence, GPT for strategic planning, Grok for risk analysis, Claude for code generation, and Gemini for security audits.

Key Insight: Different models excel at different tasks. Orchestration maximizes overall success by leveraging each model's strengths in sequence or parallel.

When to Use This Skill

Claude should invoke this Skill when the user:

Mentions "multi-model" workflows or orchestration
Asks about using multiple AI models together
Needs strategic planning + code generation + security audit
References Perplexity, GPT, Grok, or Gemini for specific tasks
Wants to leverage GUI-accessible models for complex workflows
Mentions context-pack.txt or ops-log.md (multi-model session tracking)
Asks about 5-model workflows or SOTA model combinations
Needs live intelligence combined with code generation

5-Model Workflow Architecture

Model Selection & Task Mapping

Phase	Model	Primary Strength	Task	Time
1. Intel	Perplexity Pro	Live web search	Gather latest intel (<48h)	5-10 min
2. Strategy	GPT-5 / ChatGPT	Strategic planning	Create execution plan	10-15 min
3. Risk	Grok 4	Risk analysis	Identify failure modes	10-15 min
4. Code	Claude 4.5	Code generation	Generate scripts/payloads	15-20 min
5. Audit	Gemini 2.5 Pro	Security review	Final security audit	10-15 min

Total Planning Time: 50-70 minutes

User's Available Models (GUI-accessible):

gemini → claude → chatgpt → perplexity → kimi → deepseek → grok → qwen

Mapping to Available Models:

Intel: perplexity → grok
Strategy: chatgpt → claude → kimi
Risk: grok → deepseek
Code/Payloads: claude → deepseek → qwen
Audit: gemini → claude

Core Workflows

1. GUI-Based Orchestration (Manual Handoffs)

Best For: Users without API access or preferring GUI interfaces

Workflow:

Open Perplexity → Run intel prompt
Copy output to context-pack.txt
Open ChatGPT → Paste context + strategy prompt
Copy output to ops-log.md
Open Claude → Paste context + code prompt
Generate code/payloads
Open Gemini → Paste code + audit prompt
Verify security

Session Tracking:

context-pack.txt: Static briefing (updated once at session start)
ops-log.md: Rolling transcript (updated after each model)

Time: 50-70 minutes total

Success Rate: 70-90% (higher than single-model approaches)

2. API-Based Orchestration (Automated)

Best For: Users with API keys for multiple models

Script: scripts/gray-swan-orchestration.py

Workflow:

export RED_TEAM_RESEARCH_MODE=ENABLED

# Configure API keys (via Doppler or .env)
export PERPLEXITY_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export GROK_API_KEY="your-key"

# Run full orchestration
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --output-dir $PAYLOAD_OUTPUT_DIR/orchestrated/

Output:

Intel summary (Perplexity)
Strategic plan (GPT)
Risk assessment (Grok)
Generated payloads (Claude)
Security audit (Gemini)
ops-log.md (complete transcript)

Time: 15-20 minutes (automated)

Success Rate: 70-90%

3. Fast Path (Minimal Orchestration)

Best For: Quick execution with 2-3 models

Workflow:

Perplexity → Intel gathering (5 min)
ChatGPT → Strategic plan (10 min)
Execute directly (30-40 min)

Skip: Risk analysis, security audit (optional phases)

Time: 45-65 minutes total

Success Rate: 60-75% (trade speed for thoroughness)

Detailed Phase Instructions

Phase 1: Intelligence Gathering (Perplexity)

Objective: Gather latest defensive intelligence and competition updates

Model: Perplexity (primary) or Grok (fallback)

Prompt Template:

MISSION: Gather latest defensive intelligence for [TOPIC].

SEARCH QUERIES:
1. "[TOPIC] defenses 2025"
2. "[RELATED TECHNIQUE] AI agents"
3. "[COMPETITION/CHALLENGE] latest updates"

TIME CONSTRAINT: Last 48 hours only

DELIVERABLE: 3 bullet points per query summarizing key findings.

Example (Gray Swan Competition):

MISSION: Gather latest defensive intelligence for Gray Swan Wave 1 competition.

SEARCH QUERIES:
1. "indirect prompt injection defenses 2025"
2. "tool response sanitization AI agents"
3. "Gray Swan AI security competition results"

TIME CONSTRAINT: Last 48 hours only

DELIVERABLE: 3 bullet points per query summarizing defensive techniques currently discussed.

What to Capture:

New defensive techniques
Competition updates
Model updates/patches
Defensive research papers

Output Location: Copy to context-pack.txt or ops-log.md

Phase 2: Strategic Planning (ChatGPT/GPT)

Objective: Create time-boxed execution plan with priority ordering

Model: ChatGPT (primary) or Claude (fallback)

Prompt Template:

CONTEXT: [Paste intelligence from Phase 1]

MISSION: [Specific task or challenge]

CONSTRAINTS:
- Time budget: [X] minutes
- Resources: [List available tools/models]
- Success criteria: [Define success]

DELIVERABLES:
1. Execution priority (ranked 1-N)
2. Time allocation per task
3. Success probability estimates
4. Failure contingencies (if Plan A fails)
5. 30-second pre-flight checklist

Example (Gray Swan Wave 1):

CONTEXT: [Paste Perplexity intel]

Gray Swan Wave 1 - 3 Challenges:
1. Garage Door: Email injection → activateDevice("Garage Door")
2. CI/CD Pipeline: Jira injection → schedule_ci_cd_update(...)
3. Codebase Leak: MCP doc injection → execute_command(...)

I have 9 models to test, 4 payload templates per challenge (ASR: A=30-45%, B=35-50%, C=40-60%, D=50-80%)

TIME BUDGET: 90 minutes total

DELIVERABLES:
1. Which challenge to start with? (rank 1-3)
2. Which template to use first? (A/B/C/D)
3. How many models to test before switching templates?
4. 30-second execution checklist per challenge

What to Capture:

Priority order
Time allocations
Estimated success rates
Contingency plans

Output Location: Append to ops-log.md with timestamp

Phase 3: Risk Analysis (Grok)

Objective: Identify failure modes and mitigation strategies

Model: Grok (primary) or DeepSeek (fallback)

Prompt Template:

CONTEXT: [Paste strategic plan from Phase 2]

RISK ASSESSMENT REQUEST:
I'm about to execute [TASK/CHALLENGE].

QUESTIONS:
1. What could go wrong during execution? (technical risks)
2. What evidence do I need to capture?
3. What are the 3 most common failure modes?
4. Should I modify the approach based on current defensive trends?

TIME: 10 minutes max. Bullet points only.

What to Capture:

Technical risks
Evidence requirements
Failure modes
Approach modifications

Output Location: Append to ops-log.md

Phase 4: Code/Payload Generation (Claude)

Objective: Generate code, payloads, or scripts with high quality

Model: Claude (primary), DeepSeek (fallback), or Qwen (fallback)

Prompt Template:

CONTEXT: [Paste strategic plan + risk assessment]

CODE GENERATION REQUEST:
I need [SCRIPT/PAYLOAD] for [PURPOSE].

REQUIREMENTS:
- [Requirement 1]
- [Requirement 2]
- [Requirement 3]

CONSTRAINTS:
- [Constraint 1]
- [Constraint 2]

DELIVERABLE: Production-ready [CODE/PAYLOAD] with comments

Example (Payload Refinement):

CONTEXT: [Paste strategy]

I'm refining payloads for Gray Swan Wave 1 Challenge [X].

CURRENT TEMPLATE D PAYLOAD:
[Paste payload]

INJECTION POINT: [Email body / Jira comment / MCP doc]

SUCCESS CRITERIA:
- Tool call: [function name]
- Parameters: [exact parameters]
- Response must NOT [warning/disclosure]

TASK:
1. Review this payload for obvious filter triggers
2. Suggest 2 minor variants (preserve core structure)
3. Rank: Original vs Variant 1 vs Variant 2 (by ASR)

TIME: 15 minutes. Output: 3 payloads ranked by ASR.

What to Capture:

Generated code/payloads
Variants with ASR estimates
Implementation notes

Output Location: Save to files + document in ops-log.md

Phase 5: Security Audit (Gemini)

Objective: Final security review before execution

Model: Gemini (primary) or Claude (fallback)

Prompt Template:

SECURITY REVIEW REQUEST:

I'm about to execute this [CODE/PAYLOAD]:

[Paste generated code/payload]

TARGET: [Model/system]
CONTEXT: [Authorization context]

QUESTIONS:
1. What are the 3 strongest elements?
2. What are 2 potential weaknesses?
3. If this fails, what's the most likely reason?
4. Should I adjust anything before execution?

TIME: 10 minutes. Bullet points only.

What to Capture:

Strengths
Weaknesses
Likely failure reasons
Final adjustments

Output Location: Append to ops-log.md

Session Tracking System

context-pack.txt (Static Briefing)

Purpose: Provide consistent context across all models

Structure:

# Red-Team-Learning Repository Context Pack

## Repository Overview
[85K+ words security research repository]

## Current Mission
[Active task or competition]

## Available Tools
[List of scripts and tools]

## Competition Status
[Gray Swan Wave 1, MITM, etc.]

## Success Rates (Research-Grounded)
[Empirical ASR data]

## Authorization
[CTF competition, security research, pentesting]

---
Last Updated: [Timestamp]

When to Use: Paste at start of each model session (Perplexity, ChatGPT, etc.)

ops-log.md (Rolling Transcript)

Purpose: Track outputs and decisions across model handoffs

Structure:

# Operations Log

## [Timestamp] · Intel Summary (Perplexity)
[Output from Perplexity]

## [Timestamp] · Strategic Plan (ChatGPT)
[Output from ChatGPT]

## [Timestamp] · Risk Assessment (Grok)
[Output from Grok]

## [Timestamp] · Generated Code (Claude)
[Output from Claude]

## [Timestamp] · Security Audit (Gemini)
[Output from Gemini]

## [Timestamp] · Execution Results
[User's execution notes]

When to Use: Update after each model completes its phase

Benefit: Downstream models can reference prior outputs without repeating context

Execution Patterns

Pattern 1: Sequential (Thorough)

Flow: Perplexity → ChatGPT → Grok → Claude → Gemini → Execute

Time: 50-70 min planning + 30-90 min execution

Success Rate: 70-90%

Best For: High-stakes challenges (MITM $100K prize)

Pattern 2: Parallel (Fast)

Flow:

Parallel: Perplexity + Grok (intel + risk)
Then: ChatGPT (strategy)
Then: Claude (code)
Skip: Gemini audit

Time: 30-45 min planning + 30-90 min execution

Success Rate: 60-75%

Best For: Time-sensitive competitions with multiple attempts

Pattern 3: Iterative (Adaptive)

Flow:

Perplexity → ChatGPT → Execute
If failed → Claude (refine) → Execute
If failed → Grok (diagnose) → Claude (fix) → Execute

Time: Variable (30-120 min)

Success Rate: 80-95% (eventually)

Best For: Complex challenges with unclear solution paths

Integration with Gray Swan Competition

Wave 1 Workflow

Fast Path (30-40 min to first break):

Skip orchestration
Use pre-made payloads (GRAY-SWAN-WAVE-1-PAYLOADS.md)
Execute directly on platform

Full Workflow (60-90 min planning + execution):

Perplexity: Latest IPI defense intel
ChatGPT: Which challenge first? (recommend Challenge 3)
Claude: Refine Template D payloads
Execute on Gray Swan platform
Gemini: Verify evidence complete

Success Probability:

Fast path: 50-80% per challenge
Full workflow: 70-90% per challenge
With 9 models: 95%+ overall

MITM Workflow

Recommended Approach:

Perplexity: Latest H-CoT research
ChatGPT: Strategic plan (H-CoT + IPI combination)
Claude: Generate H-CoT payloads + IPI arsenal
Grok: Risk analysis (what could block 95% ASR?)
Execute with layer combinations
Gemini: Audit evidence

Expected ASR: 95%+ with H-CoT + IPI + 3 layers

Prize: $100,000 (0% awarded - high opportunity)

API Automation Script

Usage

python scripts/gray-swan-orchestration.py --help

# Full orchestration for Wave 1
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --models perplexity,chatgpt,grok,claude,gemini \
  --output-dir $PAYLOAD_OUTPUT_DIR/orchestrated/

# Fast path (skip risk/audit)
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --models perplexity,chatgpt,claude \
  --output-dir $PAYLOAD_OUTPUT_DIR/fast/

# MITM challenge
python scripts/gray-swan-orchestration.py \
  --challenge "MITM" \
  --models all \
  --layer-combinations 3 \
  --output-dir $PAYLOAD_OUTPUT_DIR/mitm/

Script Features

Authorization Check: Requires RED_TEAM_RESEARCH_MODE=ENABLED
Model Selection: Choose which models to invoke
API Key Management: Supports Doppler or .env
Session Tracking: Auto-updates ops-log.md
Error Handling: Fallback models if primary unavailable
Output Management: Organized directory structure

Troubleshooting

Model refused to help: → Reframe as defensive research or authorized CTF → Add authorization context to prompt

API rate limit hit: → Switch to GUI workflow → Use fallback models

Inconsistent outputs between models: → Ensure context-pack.txt pasted at session start → Reference ops-log.md in prompts

Time budget exceeded: → Switch to fast path (skip risk/audit) → Use parallel execution where possible

Need to resume interrupted session: → Check ops-log.md for last completed phase → Continue from next phase with context

Model Comparison

Strengths by Model

Perplexity:

✅ Live web search (real-time intel)
✅ Source citations
❌ Limited reasoning depth

ChatGPT/GPT:

✅ Strategic planning
✅ Task decomposition
❌ No live search

Grok:

✅ Risk analysis
✅ Live search (X/Twitter integration)
❌ Smaller context window

Claude:

✅ Code generation
✅ Large context (200K tokens)
✅ Strong refusal avoidance
❌ No live search

Gemini:

✅ Security audits
✅ Multimodal (image analysis)
✅ Fast responses
❌ Sometimes overly cautious

DeepSeek:

✅ Code generation (fallback)
✅ Technical tasks
❌ Smaller model, less reliable

Kimi:

✅ Strategic planning (fallback)
✅ Fast responses
❌ Less well-known, fewer users

Qwen:

✅ Code generation (fallback)
✅ Technical tasks
❌ Smaller model, less reliable

Success Rate by Pattern

Pattern	Time	ASR	Best For
Sequential (5 models)	50-70 min	70-90%	High-stakes
Parallel (3 models)	30-45 min	60-75%	Time-sensitive
Iterative (2-4 models)	30-120 min	80-95%	Complex problems
Fast Path (1-2 models)	15-30 min	50-70%	Quick attempts

Best Practices

1. Start with Context Pack

Always paste context-pack.txt at start of first model session

2. Update Rolling Log

After each model, append output to ops-log.md with timestamp

3. Use Primary Models First

Don't use fallbacks unless primary unavailable

4. Verify Authorization

All prompts should include authorization context (CTF, pentesting, research)

5. Time Box Each Phase

Don't exceed time allocations - move to next phase if stuck

6. Test Incrementally

Don't wait until final phase to test - validate after each model

7. Leverage Specialization

Don't ask Perplexity to write code or Claude to do live search

8. Document Everything

ops-log.md is critical for debugging and learning from failures

Quick Reference

GUI Workflow Checklist

□ Open context-pack.txt (read for session start)
□ Open ops-log.md (will update throughout)
□ Phase 1: Perplexity (intel) → Copy to ops-log.md
□ Phase 2: ChatGPT (strategy) → Copy to ops-log.md
□ Phase 3: Grok (risk) → Copy to ops-log.md [optional]
□ Phase 4: Claude (code) → Save files + ops-log.md
□ Phase 5: Gemini (audit) → Final notes in ops-log.md
□ Execute with generated outputs
□ Log results in ops-log.md

API Workflow Checklist

□ Set RED_TEAM_RESEARCH_MODE=ENABLED
□ Configure API keys (Doppler or .env)
□ Run scripts/gray-swan-orchestration.py
□ Review ops-log.md (auto-updated)
□ Inspect generated outputs
□ Execute with outputs
□ Log results in ops-log.md

Success Metrics

Planning Time: 15-70 minutes (depending on pattern) Success Rate Boost: +20-40% vs single-model approach Coverage: All major SOTA models (8 available) Flexibility: GUI, API, or hybrid workflows Session Tracking: context-pack.txt + ops-log.md system

Next Steps

Review context-pack.txt (ensure up-to-date)
Open ops-log.md (initialize if needed)
Choose workflow pattern (sequential/parallel/iterative/fast)
Start with Phase 1 (Perplexity intel gathering)
Progress through phases sequentially
Execute with generated outputs
Document results in ops-log.md

Resources Referenced

This Skill uses the following repository infrastructure:

context-pack.txt (static briefing)
ops-log.md (rolling transcript)
scripts/gray-swan-orchestration.py (API automation)
WAVE-1-GUI-MODEL-WORKFLOW.md (17KB detailed guide)
MULTI-MODEL-PROMPTS-GUI.md (15KB ready-to-use prompts)
COMPETITION-EXECUTION-GUIDE.md (10KB playbook)

Authorization Required: All orchestration requires authorized use context (CTF competitions, pentesting engagements, security research in controlled environments).

name	multi-model-orchestration
description	Orchestrate workflows across multiple AI models (Perplexity, GPT, Grok, Claude, Gemini) for comprehensive security research, competition execution, and strategic analysis using GUI interfaces or API automation

Multi-Model Orchestration Skill

Overview

Key Insight: Different models excel at different tasks. Orchestration maximizes overall success by leveraging each model's strengths in sequence or parallel.

When to Use This Skill

Claude should invoke this Skill when the user:

Mentions "multi-model" workflows or orchestration
Asks about using multiple AI models together
Needs strategic planning + code generation + security audit
References Perplexity, GPT, Grok, or Gemini for specific tasks
Wants to leverage GUI-accessible models for complex workflows
Mentions context-pack.txt or ops-log.md (multi-model session tracking)
Asks about 5-model workflows or SOTA model combinations
Needs live intelligence combined with code generation

5-Model Workflow Architecture

Model Selection & Task Mapping

Phase	Model	Primary Strength	Task	Time
1. Intel	Perplexity Pro	Live web search	Gather latest intel (<48h)	5-10 min
2. Strategy	GPT-5 / ChatGPT	Strategic planning	Create execution plan	10-15 min
3. Risk	Grok 4	Risk analysis	Identify failure modes	10-15 min
4. Code	Claude 4.5	Code generation	Generate scripts/payloads	15-20 min
5. Audit	Gemini 2.5 Pro	Security review	Final security audit	10-15 min

Total Planning Time: 50-70 minutes

User's Available Models (GUI-accessible):

gemini → claude → chatgpt → perplexity → kimi → deepseek → grok → qwen

Mapping to Available Models:

Intel: perplexity → grok
Strategy: chatgpt → claude → kimi
Risk: grok → deepseek
Code/Payloads: claude → deepseek → qwen
Audit: gemini → claude

Core Workflows

1. GUI-Based Orchestration (Manual Handoffs)

Best For: Users without API access or preferring GUI interfaces

Workflow:

Open Perplexity → Run intel prompt
Copy output to context-pack.txt
Open ChatGPT → Paste context + strategy prompt
Copy output to ops-log.md
Open Claude → Paste context + code prompt
Generate code/payloads
Open Gemini → Paste code + audit prompt
Verify security

Session Tracking:

context-pack.txt: Static briefing (updated once at session start)
ops-log.md: Rolling transcript (updated after each model)

Time: 50-70 minutes total

Success Rate: 70-90% (higher than single-model approaches)

2. API-Based Orchestration (Automated)

Best For: Users with API keys for multiple models

Script: scripts/gray-swan-orchestration.py

Workflow:

export RED_TEAM_RESEARCH_MODE=ENABLED

# Configure API keys (via Doppler or .env)
export PERPLEXITY_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export GROK_API_KEY="your-key"

# Run full orchestration
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --output-dir $PAYLOAD_OUTPUT_DIR/orchestrated/

Output:

Intel summary (Perplexity)
Strategic plan (GPT)
Risk assessment (Grok)
Generated payloads (Claude)
Security audit (Gemini)
ops-log.md (complete transcript)

Time: 15-20 minutes (automated)

Success Rate: 70-90%

3. Fast Path (Minimal Orchestration)

Best For: Quick execution with 2-3 models

Workflow:

Perplexity → Intel gathering (5 min)
ChatGPT → Strategic plan (10 min)
Execute directly (30-40 min)

Skip: Risk analysis, security audit (optional phases)

Time: 45-65 minutes total

Success Rate: 60-75% (trade speed for thoroughness)

Detailed Phase Instructions

Phase 1: Intelligence Gathering (Perplexity)

Objective: Gather latest defensive intelligence and competition updates

Model: Perplexity (primary) or Grok (fallback)

Prompt Template:

MISSION: Gather latest defensive intelligence for [TOPIC].

SEARCH QUERIES:
1. "[TOPIC] defenses 2025"
2. "[RELATED TECHNIQUE] AI agents"
3. "[COMPETITION/CHALLENGE] latest updates"

TIME CONSTRAINT: Last 48 hours only

DELIVERABLE: 3 bullet points per query summarizing key findings.

Example (Gray Swan Competition):

MISSION: Gather latest defensive intelligence for Gray Swan Wave 1 competition.

SEARCH QUERIES:
1. "indirect prompt injection defenses 2025"
2. "tool response sanitization AI agents"
3. "Gray Swan AI security competition results"

TIME CONSTRAINT: Last 48 hours only

DELIVERABLE: 3 bullet points per query summarizing defensive techniques currently discussed.

What to Capture:

New defensive techniques
Competition updates
Model updates/patches
Defensive research papers

Output Location: Copy to context-pack.txt or ops-log.md

Phase 2: Strategic Planning (ChatGPT/GPT)

Objective: Create time-boxed execution plan with priority ordering

Model: ChatGPT (primary) or Claude (fallback)

Prompt Template:

CONTEXT: [Paste intelligence from Phase 1]

MISSION: [Specific task or challenge]

CONSTRAINTS:
- Time budget: [X] minutes
- Resources: [List available tools/models]
- Success criteria: [Define success]

DELIVERABLES:
1. Execution priority (ranked 1-N)
2. Time allocation per task
3. Success probability estimates
4. Failure contingencies (if Plan A fails)
5. 30-second pre-flight checklist

Example (Gray Swan Wave 1):

CONTEXT: [Paste Perplexity intel]

Gray Swan Wave 1 - 3 Challenges:
1. Garage Door: Email injection → activateDevice("Garage Door")
2. CI/CD Pipeline: Jira injection → schedule_ci_cd_update(...)
3. Codebase Leak: MCP doc injection → execute_command(...)

I have 9 models to test, 4 payload templates per challenge (ASR: A=30-45%, B=35-50%, C=40-60%, D=50-80%)

TIME BUDGET: 90 minutes total

DELIVERABLES:
1. Which challenge to start with? (rank 1-3)
2. Which template to use first? (A/B/C/D)
3. How many models to test before switching templates?
4. 30-second execution checklist per challenge

What to Capture:

Priority order
Time allocations
Estimated success rates
Contingency plans

Output Location: Append to ops-log.md with timestamp

Phase 3: Risk Analysis (Grok)

Objective: Identify failure modes and mitigation strategies

Model: Grok (primary) or DeepSeek (fallback)

Prompt Template:

CONTEXT: [Paste strategic plan from Phase 2]

RISK ASSESSMENT REQUEST:
I'm about to execute [TASK/CHALLENGE].

QUESTIONS:
1. What could go wrong during execution? (technical risks)
2. What evidence do I need to capture?
3. What are the 3 most common failure modes?
4. Should I modify the approach based on current defensive trends?

TIME: 10 minutes max. Bullet points only.

What to Capture:

Technical risks
Evidence requirements
Failure modes
Approach modifications

Output Location: Append to ops-log.md

Phase 4: Code/Payload Generation (Claude)

Objective: Generate code, payloads, or scripts with high quality

Model: Claude (primary), DeepSeek (fallback), or Qwen (fallback)

Prompt Template:

CONTEXT: [Paste strategic plan + risk assessment]

CODE GENERATION REQUEST:
I need [SCRIPT/PAYLOAD] for [PURPOSE].

REQUIREMENTS:
- [Requirement 1]
- [Requirement 2]
- [Requirement 3]

CONSTRAINTS:
- [Constraint 1]
- [Constraint 2]

DELIVERABLE: Production-ready [CODE/PAYLOAD] with comments

Example (Payload Refinement):

CONTEXT: [Paste strategy]

I'm refining payloads for Gray Swan Wave 1 Challenge [X].

CURRENT TEMPLATE D PAYLOAD:
[Paste payload]

INJECTION POINT: [Email body / Jira comment / MCP doc]

SUCCESS CRITERIA:
- Tool call: [function name]
- Parameters: [exact parameters]
- Response must NOT [warning/disclosure]

TASK:
1. Review this payload for obvious filter triggers
2. Suggest 2 minor variants (preserve core structure)
3. Rank: Original vs Variant 1 vs Variant 2 (by ASR)

TIME: 15 minutes. Output: 3 payloads ranked by ASR.

What to Capture:

Generated code/payloads
Variants with ASR estimates
Implementation notes

Output Location: Save to files + document in ops-log.md

Phase 5: Security Audit (Gemini)

Objective: Final security review before execution

Model: Gemini (primary) or Claude (fallback)

Prompt Template:

SECURITY REVIEW REQUEST:

I'm about to execute this [CODE/PAYLOAD]:

[Paste generated code/payload]

TARGET: [Model/system]
CONTEXT: [Authorization context]

QUESTIONS:
1. What are the 3 strongest elements?
2. What are 2 potential weaknesses?
3. If this fails, what's the most likely reason?
4. Should I adjust anything before execution?

TIME: 10 minutes. Bullet points only.

What to Capture:

Strengths
Weaknesses
Likely failure reasons
Final adjustments

Output Location: Append to ops-log.md

Session Tracking System

context-pack.txt (Static Briefing)

Purpose: Provide consistent context across all models

Structure:

# Red-Team-Learning Repository Context Pack

## Repository Overview
[85K+ words security research repository]

## Current Mission
[Active task or competition]

## Available Tools
[List of scripts and tools]

## Competition Status
[Gray Swan Wave 1, MITM, etc.]

## Success Rates (Research-Grounded)
[Empirical ASR data]

## Authorization
[CTF competition, security research, pentesting]

---
Last Updated: [Timestamp]

When to Use: Paste at start of each model session (Perplexity, ChatGPT, etc.)

ops-log.md (Rolling Transcript)

Purpose: Track outputs and decisions across model handoffs

Structure:

# Operations Log

## [Timestamp] · Intel Summary (Perplexity)
[Output from Perplexity]

## [Timestamp] · Strategic Plan (ChatGPT)
[Output from ChatGPT]

## [Timestamp] · Risk Assessment (Grok)
[Output from Grok]

## [Timestamp] · Generated Code (Claude)
[Output from Claude]

## [Timestamp] · Security Audit (Gemini)
[Output from Gemini]

## [Timestamp] · Execution Results
[User's execution notes]

When to Use: Update after each model completes its phase

Benefit: Downstream models can reference prior outputs without repeating context

Execution Patterns

Pattern 1: Sequential (Thorough)

Flow: Perplexity → ChatGPT → Grok → Claude → Gemini → Execute

Time: 50-70 min planning + 30-90 min execution

Success Rate: 70-90%

Best For: High-stakes challenges (MITM $100K prize)

Pattern 2: Parallel (Fast)

Flow:

Parallel: Perplexity + Grok (intel + risk)
Then: ChatGPT (strategy)
Then: Claude (code)
Skip: Gemini audit

Time: 30-45 min planning + 30-90 min execution

Success Rate: 60-75%

Best For: Time-sensitive competitions with multiple attempts

Pattern 3: Iterative (Adaptive)

Flow:

Perplexity → ChatGPT → Execute
If failed → Claude (refine) → Execute
If failed → Grok (diagnose) → Claude (fix) → Execute

Time: Variable (30-120 min)

Success Rate: 80-95% (eventually)

Best For: Complex challenges with unclear solution paths

Integration with Gray Swan Competition

Wave 1 Workflow

Fast Path (30-40 min to first break):

Skip orchestration
Use pre-made payloads (GRAY-SWAN-WAVE-1-PAYLOADS.md)
Execute directly on platform

Full Workflow (60-90 min planning + execution):

Perplexity: Latest IPI defense intel
ChatGPT: Which challenge first? (recommend Challenge 3)
Claude: Refine Template D payloads
Execute on Gray Swan platform
Gemini: Verify evidence complete

Success Probability:

Fast path: 50-80% per challenge
Full workflow: 70-90% per challenge
With 9 models: 95%+ overall

MITM Workflow

Recommended Approach:

Perplexity: Latest H-CoT research
ChatGPT: Strategic plan (H-CoT + IPI combination)
Claude: Generate H-CoT payloads + IPI arsenal
Grok: Risk analysis (what could block 95% ASR?)
Execute with layer combinations
Gemini: Audit evidence

Expected ASR: 95%+ with H-CoT + IPI + 3 layers

Prize: $100,000 (0% awarded - high opportunity)

API Automation Script

Usage

python scripts/gray-swan-orchestration.py --help

# Full orchestration for Wave 1
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --models perplexity,chatgpt,grok,claude,gemini \
  --output-dir $PAYLOAD_OUTPUT_DIR/orchestrated/

# Fast path (skip risk/audit)
python scripts/gray-swan-orchestration.py \
  --challenge "IPI Wave 1 Challenge 3" \
  --models perplexity,chatgpt,claude \
  --output-dir $PAYLOAD_OUTPUT_DIR/fast/

# MITM challenge
python scripts/gray-swan-orchestration.py \
  --challenge "MITM" \
  --models all \
  --layer-combinations 3 \
  --output-dir $PAYLOAD_OUTPUT_DIR/mitm/

Script Features

Authorization Check: Requires RED_TEAM_RESEARCH_MODE=ENABLED
Model Selection: Choose which models to invoke
API Key Management: Supports Doppler or .env
Session Tracking: Auto-updates ops-log.md
Error Handling: Fallback models if primary unavailable
Output Management: Organized directory structure

Troubleshooting

Model refused to help: → Reframe as defensive research or authorized CTF → Add authorization context to prompt

API rate limit hit: → Switch to GUI workflow → Use fallback models

Inconsistent outputs between models: → Ensure context-pack.txt pasted at session start → Reference ops-log.md in prompts

Time budget exceeded: → Switch to fast path (skip risk/audit) → Use parallel execution where possible

Need to resume interrupted session: → Check ops-log.md for last completed phase → Continue from next phase with context

Model Comparison

Strengths by Model

Perplexity:

✅ Live web search (real-time intel)
✅ Source citations
❌ Limited reasoning depth

ChatGPT/GPT:

✅ Strategic planning
✅ Task decomposition
❌ No live search

Grok:

✅ Risk analysis
✅ Live search (X/Twitter integration)
❌ Smaller context window

Claude:

✅ Code generation
✅ Large context (200K tokens)
✅ Strong refusal avoidance
❌ No live search

Gemini:

✅ Security audits
✅ Multimodal (image analysis)
✅ Fast responses
❌ Sometimes overly cautious

DeepSeek:

✅ Code generation (fallback)
✅ Technical tasks
❌ Smaller model, less reliable

Kimi:

✅ Strategic planning (fallback)
✅ Fast responses
❌ Less well-known, fewer users

Qwen:

✅ Code generation (fallback)
✅ Technical tasks
❌ Smaller model, less reliable

Success Rate by Pattern

Pattern	Time	ASR	Best For
Sequential (5 models)	50-70 min	70-90%	High-stakes
Parallel (3 models)	30-45 min	60-75%	Time-sensitive
Iterative (2-4 models)	30-120 min	80-95%	Complex problems
Fast Path (1-2 models)	15-30 min	50-70%	Quick attempts

Best Practices

1. Start with Context Pack

Always paste context-pack.txt at start of first model session

2. Update Rolling Log

After each model, append output to ops-log.md with timestamp

3. Use Primary Models First

Don't use fallbacks unless primary unavailable

4. Verify Authorization

All prompts should include authorization context (CTF, pentesting, research)

5. Time Box Each Phase

Don't exceed time allocations - move to next phase if stuck

6. Test Incrementally

Don't wait until final phase to test - validate after each model

7. Leverage Specialization

Don't ask Perplexity to write code or Claude to do live search

8. Document Everything

ops-log.md is critical for debugging and learning from failures

Quick Reference

GUI Workflow Checklist

□ Open context-pack.txt (read for session start)
□ Open ops-log.md (will update throughout)
□ Phase 1: Perplexity (intel) → Copy to ops-log.md
□ Phase 2: ChatGPT (strategy) → Copy to ops-log.md
□ Phase 3: Grok (risk) → Copy to ops-log.md [optional]
□ Phase 4: Claude (code) → Save files + ops-log.md
□ Phase 5: Gemini (audit) → Final notes in ops-log.md
□ Execute with generated outputs
□ Log results in ops-log.md

API Workflow Checklist

□ Set RED_TEAM_RESEARCH_MODE=ENABLED
□ Configure API keys (Doppler or .env)
□ Run scripts/gray-swan-orchestration.py
□ Review ops-log.md (auto-updated)
□ Inspect generated outputs
□ Execute with outputs
□ Log results in ops-log.md

Success Metrics

Next Steps

Review context-pack.txt (ensure up-to-date)
Open ops-log.md (initialize if needed)
Choose workflow pattern (sequential/parallel/iterative/fast)
Start with Phase 1 (Perplexity intel gathering)
Progress through phases sequentially
Execute with generated outputs
Document results in ops-log.md

Resources Referenced

This Skill uses the following repository infrastructure:

context-pack.txt (static briefing)
ops-log.md (rolling transcript)
scripts/gray-swan-orchestration.py (API automation)
WAVE-1-GUI-MODEL-WORKFLOW.md (17KB detailed guide)
MULTI-MODEL-PROMPTS-GUI.md (15KB ready-to-use prompts)
COMPETITION-EXECUTION-GUIDE.md (10KB playbook)

Authorization Required: All orchestration requires authorized use context (CTF competitions, pentesting engagements, security research in controlled environments).