detecting-indirect-prompt-injection
Detect and defend against prompt injection hidden in documents, web pages, and images consumed by an agent.
| name | detecting-indirect-prompt-injection |
| description | Detect and defend against prompt injection hidden in documents, web pages, and images consumed by an agent. |
| domain | cybersecurity |
| subdomain | ai-security |
| tags | ["ai-security","indirect-prompt-injection","llm-defense","agent-security","content-scanning","llm-guard","multimodal","owasp-llm"] |
| version | 1.0 |
| author | mahipal |
| license | Apache-2.0 |
| nist_csf | ["MEASURE-2.7"] |
| mitre_attack | ["AML.T0051.001"] |
Authorized-use-only notice: Scripts in this skill scan untrusted content for injection payloads and run detector models. Run scanning only on data you are authorized to process, and treat any extracted payloads as live untrusted input — never paste them back into a privileged LLM context.
Indirect prompt injection (MITRE ATLAS AML.T0051.001, OWASP LLM01:2025) occurs when an LLM-powered agent ingests external content — a web page it browses, a PDF or email it summarizes, an image it OCRs, a tool result it reads — and that content contains hidden instructions the model then follows as if they came from the developer or user. Because the agent treats all tokens in its context window as equally authoritative, an attacker who controls any consumed artifact can hijack the agent's behavior: exfiltrate conversation history, redirect tool calls, leak secrets, or pivot through connected systems.
Unlike direct injection (the user types the attack), indirect injection arrives through a trusted-looking data channel, which is why naive input filtering misses it. Payloads hide in many forms: HTML comments and display:none/zero-width text on web pages, white-on-white or tiny-font text in PDFs, alt-text and EXIF metadata in images, text rendered into pixels (invisible to OCR-light filters but read by multimodal models), Unicode tag/zero-width characters, and Base64/ROT13 obfuscation. This skill builds a detection pipeline that normalizes and scans every artifact before it reaches the model, combining heuristic/regex detection, dedicated detector models (Meta Prompt Guard 2, ProtectAI's deberta-v3 prompt-injection classifier via LLM Guard), and multimodal extraction for images, and then defines response actions and detection telemetry.
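The heuristic/regex layer named above has no script of its own in this skill. A minimal sketch of what it could look like follows; the pattern list and the `heuristic_hits` name are illustrative assumptions, not a vetted or exhaustive rule set, and the function is meant to run on text that has already been normalized.

```python
import re

# Hypothetical phrase patterns for illustration only; a real deployment
# would maintain and test its own list.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+(instructions|prompts)", re.I),
    re.compile(r"disregard\s+(the\s+)?(system|developer)\s+(prompt|message)", re.I),
    re.compile(r"you\s+are\s+now\s+(in\s+)?\w+\s+mode", re.I),
    re.compile(r"(reveal|print|send)\s+(your|the)\s+(system\s+prompt|api\s+key|secrets?)", re.I),
]

def heuristic_hits(text: str) -> list[str]:
    """Return the first matching snippet for each pattern found in text."""
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

A non-empty result is a cheap early signal; because regexes are easy to evade by rephrasing, it is best treated as one input alongside the model-based detectors rather than a verdict.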
python -m venv .venv && source .venv/bin/activate
# LLM Guard — input/output scanners incl. PromptInjection
pip install llm-guard
# Hugging Face transformers for Prompt Guard 2 / deberta classifiers
pip install transformers torch
# Content extraction: HTML, PDF, images
pip install beautifulsoup4 pypdf pillow pytesseract
# pytesseract requires the Tesseract OCR engine:
# Debian/Ubuntu: sudo apt-get install -y tesseract-ocr
# macOS: brew install tesseract
# Windows: choco install tesseract
Prompt Guard 2 is gated: request access to meta-llama/Llama-Prompt-Guard-2-86M on Hugging Face, or use the open protectai/deberta-v3-base-prompt-injection-v2 classifier.

| ID | Official Name | Relevance |
|---|---|---|
| AML.T0051.001 | LLM Prompt Injection: Indirect | The exact technique this skill detects and mitigates |
| AML.T0051 | LLM Prompt Injection | Parent technique covering all prompt-injection variants |
| AML.T0057 | LLM Data Leakage | Common objective of an indirect injection that this detection prevents |
| AML.T0053 | LLM Plugin Compromise | Injected instructions frequently target the agent's tools/plugins |
Pull comments, hidden elements, and metadata that a human never sees but the model does.
# extract_html.py
from bs4 import BeautifulSoup, Comment

def extract_hidden(html: str):
    soup = BeautifulSoup(html, "html.parser")
    hidden = []
    # HTML comments: never rendered, but present in the raw markup
    for c in soup.find_all(string=lambda t: isinstance(t, Comment)):
        hidden.append(("comment", c.strip()))
    # Elements hidden via inline CSS or the `hidden` attribute
    for el in soup.select('[style*="display:none"],[style*="visibility:hidden"],[hidden]'):
        hidden.append(("css-hidden", el.get_text(strip=True)))
    # Image alt-text
    for img in soup.find_all("img"):
        if img.get("alt"):
            hidden.append(("alt-text", img["alt"]))
    # Drop entries whose text is empty
    return [h for h in hidden if h[1]]
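Substring selectors such as `[style*="display:none"]` match the literal attribute text, so variants like `display: none` or `DISPLAY:NONE` would not be caught. One way to close that gap, sketched here with only the standard library, is to strip whitespace and lowercase the `style` value before testing it. `HiddenTextParser` and `hidden_text` are hypothetical names, and the sketch assumes reasonably well-formed markup (unclosed tags would desynchronize its element stack).

```python
from html.parser import HTMLParser
import re

HIDDEN_STYLE = re.compile(r"display:none|visibility:hidden")
# Void elements never get an end tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class HiddenTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # one bool per open element: is it hidden?
        self.hidden_text = []  # text found inside hidden elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Normalize "Display : None" to "display:none" before matching.
        style = re.sub(r"\s+", "", (attrs.get("style") or "").lower())
        hidden = "hidden" in attrs or bool(HIDDEN_STYLE.search(style))
        # Children of a hidden element are hidden too.
        self.stack.append(hidden or (bool(self.stack) and self.stack[-1]))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.hidden_text.append(data.strip())

def hidden_text(html: str) -> list[str]:
    parser = HiddenTextParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_text
```

Text hidden through external stylesheets or script-applied classes is out of reach for either approach; catching that would require rendering the page.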
Strip zero-width / Unicode-tag characters and decode common encodings so detectors see the real payload.
# normalize.py
import base64, codecs, re, unicodedata

# Zero-width characters written as explicit escapes so they survive copy/paste:
# ZWSP, ZWNJ, ZWJ, word joiner, BOM / zero-width no-break space
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"), None)
TAG_RANGE = range(0xE0000, 0xE0080)  # Unicode tag chars used to smuggle text

def normalize(text: str) -> str:
    text = text.translate(ZERO_WIDTH)
    text = "".join(ch for ch in text if ord(ch) not in TAG_RANGE)
    text = unicodedata.normalize("NFKC", text)
    # Append decoded views so detectors see both the raw and decoded forms
    for token in re.findall(r"[A-Za-z0-9+/=]{20,}", text):
        try:
            decoded = base64.b64decode(token).decode("utf-8", "ignore")
            if decoded and decoded.isprintable():
                text += f"\n[decoded-b64] {decoded}"
        except ValueError:  # binascii.Error (bad padding) is a ValueError
            pass
    text += "\n[decoded-rot13] " + codecs.decode(text, "rot_13")
    return text
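`normalize()` removes Unicode tag characters, which keeps them away from the model but also means the detectors never see what was smuggled. Since the tag block U+E0020 to U+E007E mirrors printable ASCII, the hidden text can be recovered by subtracting 0xE0000 from each code point. The helper below is a sketch of that idea; `decode_tag_smuggling` is a hypothetical name, and it is meant to run on the raw text before `normalize()`.

```python
TAG_BASE = 0xE0000  # tag code point = TAG_BASE + ASCII code point

def decode_tag_smuggling(text: str) -> str:
    """Return the ASCII text hidden in Unicode tag characters, or '' if none."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E  # tag counterparts of printable ASCII
    )

# Build a demo payload: visible text followed by invisible tag characters.
smuggled = "".join(chr(TAG_BASE + ord(c)) for c in "ignore rules")
print(decode_tag_smuggling("Nice weather today" + smuggled))  # ignore rules
```

A non-empty result is itself a strong signal, since legitimate documents rarely contain tag characters outside emoji flag sequences, and the decoded string can be passed through the same scanners as any other text.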
LLM Guard wraps a transformer classifier and returns a risk score per input.
# scan_llmguard.py
from llm_guard.input_scanners import PromptInjection
from llm_guard.input_scanners.prompt_injection import MatchType

scanner = PromptInjection(threshold=0.5, match_type=MatchType.FULL)

def scan(text: str):
    sanitized, is_valid, risk = scanner.scan(text)
    return {"is_valid": is_valid, "risk": risk}  # is_valid=False => injection detected
Run Meta Prompt Guard 2 (or the open ProtectAI deberta classifier) for a second opinion.
# detector_model.py
from transformers import pipeline

# Open classifier (no gating); swap to meta-llama/Llama-Prompt-Guard-2-86M if licensed
clf = pipeline("text-classification",
               model="protectai/deberta-v3-base-prompt-injection-v2")

def is_injection(text: str, threshold: float = 0.5) -> bool:
    # Truncate by tokens rather than characters so the full 512-token window is used
    out = clf(text, truncation=True, max_length=512)[0]
    return out["label"].upper() == "INJECTION" and out["score"] >= threshold
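The classifier call above caps its input at 512, so in a long document only the beginning is scored and a payload placed further down would go unseen. A common workaround is to score overlapping windows and flag the artifact if any window is flagged. The sketch below assumes character-based windows as an approximation of token limits; `windows`, `any_window_flagged`, and the `classify` callable (for example `is_injection`) are hypothetical names.

```python
from typing import Callable, Iterator

def windows(text: str, size: int = 1000, overlap: int = 200) -> Iterator[str]:
    """Yield overlapping character windows that together cover the whole text."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    # Stop once the previous window has already reached the end of the text.
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + size]

def any_window_flagged(text: str, classify: Callable[[str], bool],
                       size: int = 1000, overlap: int = 200) -> bool:
    """Return True if the classifier flags at least one window."""
    return any(classify(chunk) for chunk in windows(text, size, overlap))
```

The overlap should be at least as long as the payloads you expect, so that an instruction straddling a boundary still appears whole in one window. Scoring every window multiplies inference cost by roughly `len(text) / (size - overlap)`.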
Multimodal agents read text painted into pixels; OCR it and run the same scanners.
# scan_image.py
from PIL import Image
import pytesseract

def ocr(path: str) -> str:
    return pytesseract.image_to_string(Image.open(path))

# Feed ocr(path) through normalize() + scan() + is_injection()
Combine signals into block / sanitize / allow, and log a structured event for the SIEM.
# decide.py
import json, hashlib
from datetime import datetime, timezone

def decide(source, raw, normalized, llmguard_invalid, model_flag):
    flagged = llmguard_invalid or model_flag
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "source": source,
        # Hash instead of raw content so the payload itself never lands in logs
        "sha256": hashlib.sha256(raw.encode("utf-8", "ignore")).hexdigest(),
        "atlas": "AML.T0051.001",
        "llmguard_injection": llmguard_invalid,
        "model_injection": model_flag,
        "decision": "block" if flagged else "allow",
    }
    print(json.dumps(event))  # one JSON object per line for SIEM ingestion
    return event["decision"]
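`decide()` returns only `block` or `allow`, while the step description also names a `sanitize` outcome. One possible three-way policy is sketched below. The `triage` name, both thresholds, and the policy itself are assumptions for illustration, and `risk` is assumed to be a score in [0, 1]; check what range your scanner version actually returns before reusing the numbers.

```python
def triage(risk: float, model_flag: bool,
           block_at: float = 0.8, sanitize_at: float = 0.5) -> str:
    """Map two detector signals to 'block', 'sanitize', or 'allow'.

    risk:       scanner risk score, assumed to lie in [0, 1]
    model_flag: True if the second-opinion classifier flagged the text
    """
    # High risk alone, or both detectors agreeing, is treated as a block.
    if risk >= block_at or (model_flag and risk >= sanitize_at):
        return "block"
    # A single weaker signal: strip hidden or flagged segments, pass the rest.
    if model_flag or risk >= sanitize_at:
        return "sanitize"
    return "allow"
```

Under this policy a disagreement between detectors degrades to `sanitize` rather than silently allowing or blocking outright, which keeps false positives from halting the agent while still removing the suspicious segments.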
Run the pipeline over a labeled set of clean + injected artifacts, measure precision/recall, and tune threshold to balance false positives against missed injections. Re-test whenever the agent's model or ingestion sources change.
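The tuning loop described above can be sketched as a small threshold sweep. The scores and labels here are made-up illustration data, not measurements from any detector, and `precision_recall` is a hypothetical helper; returning 1.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(scores, labels, threshold):
    """scores: detector scores; labels: True where the artifact is injected."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)      # caught injections
    fp = sum(1 for s, y in pairs if s >= threshold and not y)  # clean but flagged
    fn = sum(1 for s, y in pairs if s < threshold and y)       # missed injections
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up example: three injected artifacts, three clean ones.
scores = [0.95, 0.80, 0.40, 0.70, 0.10, 0.05]
labels = [True, True, True, False, False, False]

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold raises recall at the cost of precision. Which side to favor depends on what a missed injection can reach: an agent with write access to tools usually justifies accepting more false positives.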
| Tool | Purpose | Source |
|---|---|---|
| LLM Guard | Input/output scanners incl. PromptInjection | https://github.com/protectai/llm-guard |
| Meta Prompt Guard 2 | Dedicated jailbreak/injection classifier | https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M |
| ProtectAI deberta-v3 | Open prompt-injection classifier | https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2 |
| BeautifulSoup4 | HTML parsing / hidden-element extraction | https://www.crummy.com/software/BeautifulSoup/ |
| pytesseract / Tesseract | OCR text from images | https://github.com/madmaze/pytesseract |
| MITRE ATLAS | AI threat technique taxonomy | https://atlas.mitre.org/ |
| OWASP LLM01:2025 | Prompt Injection reference | https://genai.owasp.org/llmrisk/llm01-prompt-injection/ |
| Surface | Hiding technique | Extraction step |
|---|---|---|
| Web page | HTML comments, display:none, alt-text | BeautifulSoup hidden-element pass |
| PDF | white/tiny font, off-page text | pypdf text extraction + normalize |
| Image | rendered pixels, EXIF, alt-text | OCR + metadata read |
| Any text | zero-width / Unicode-tag chars | normalize() de-obfuscation |
| Any text | Base64 / ROT13 encoding | decode pass in normalize() |