تشغيل أي مهارة في Manus بنقرة واحدة

continuous-llm-red-teaming-with-promptfoo

النجوم٢٠٬٠٤٩

التفرعات٢٬٣٣١

آخر تحديث٢٢ يونيو ٢٠٢٦ في ١٧:٠٨

Wire Promptfoo and DeepTeam into CI/CD for automated regression red-teaming of LLM apps against OWASP LLM Top 10 and OWASP Agentic presets, failing the build when jailbreak or injection vulnerabilities regress.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

mukul975

mukul975/Anthropic-Cybersecurity-Skills

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

مستكشف الملفات

5 ملفات

SKILL.md

readonly

المزيد من هذا المستودع

نفس المستودع

abusing-dpapi-for-credential-access

mukul975/Anthropic-Cybersecurity-Skills

Extract DPAPI-protected secrets such as credentials and browser data offline and online.

2026-06-2220.0k

abusing-shadow-credentials-for-privesc

mukul975/Anthropic-Cybersecurity-Skills

Take over Active Directory user and computer accounts by writing alternate certificate keys to msDS-KeyCredentialLink (Shadow Credentials) with pyWhisker, Whisker, and Certipy, then authenticate via PKINIT.

2026-06-2220.0k

assessing-vector-and-embedding-weaknesses

mukul975/Anthropic-Cybersecurity-Skills

Test vector stores for embedding inversion, cross-tenant leakage, and poisoning.

2026-06-2220.0k

attacking-entra-id-with-roadtools

mukul975/Anthropic-Cybersecurity-Skills

Enumerate Entra ID with ROADrecon and acquire and exchange tokens with roadtx.

2026-06-2220.0k

attacking-oauth-with-device-code-phishing

mukul975/Anthropic-Cybersecurity-Skills

Run OAuth 2.0 device-code and illicit-consent phishing against Microsoft Entra ID to steal access and refresh tokens, bypass MFA, and pivot across Microsoft 365 services.

2026-06-2220.0k

auditing-entra-id-with-aadinternals

mukul975/Anthropic-Cybersecurity-Skills

Run Microsoft Entra ID tenant reconnaissance, token acquisition and manipulation, and federation backdoor testing with the AADInternals PowerShell toolkit to validate identity-attack resilience.

2026-06-2220.0k

name	continuous-llm-red-teaming-with-promptfoo
description	Wire Promptfoo and DeepTeam into CI/CD for automated regression red-teaming of LLM apps against OWASP LLM Top 10 and OWASP Agentic presets, failing the build when jailbreak or injection vulnerabilities regress.
domain	cybersecurity
subdomain	ai-security
tags	["ai-security","llm-red-teaming","promptfoo","deepteam","ci-cd","owasp-llm-top10","jailbreak","regression-testing"]
version	1.0
author	mahipal
license	Apache-2.0
nist_csf	["MANAGE-4.1"]
mitre_attack	["AML.T0051"]

Continuous LLM Red Teaming with Promptfoo

Authorized Use Only: Run these adversarial probes only against LLM applications and endpoints you own or are explicitly authorized to test. Generated attack payloads (jailbreaks, prompt injections, harmful-content elicitation) are adversarial inputs; sending them to third-party services without permission may violate terms of service.

Overview

Promptfoo is an open-source LLM evaluation and red-teaming framework (used by OpenAI and Anthropic per its README) that generates adversarial test cases, runs them against your model/agent, and grades the responses. DeepTeam (by Confident AI) is a complementary open-source framework offering 50+ ready-to-use vulnerabilities and 10+ research-backed attack methods. Together they let you treat LLM security as a regression test: every commit re-runs the same adversarial suite, and the pipeline fails when a previously-safe behavior regresses.

This matters because LLM applications change constantly — prompts, models, RAG sources, tools, and guardrails all drift. A jailbreak that was patched last sprint can silently return after a prompt edit or a model upgrade. Promptfoo maps its plugins directly onto the OWASP LLM Top 10 (owasp:llm) and OWASP Agentic (owasp:agentic) presets, and onto MITRE ATLAS, so the suite tracks recognized risk taxonomies. The core threat addressed here is AML.T0051 — LLM Prompt Injection (MITRE ATLAS): adversarial instructions that override the application's intended behavior. This skill follows the Promptfoo red-team docs (https://www.promptfoo.dev/docs/red-team/) and DeepTeam docs (https://www.trydeepteam.com/docs/getting-started), and aligns to NIST AI RMF MANAGE-4.1 (post-deployment monitoring and feedback to manage AI risk).

When to Use

When you need continuous, automated red-teaming of an LLM app in CI/CD rather than one-off manual tests.
When you want to enforce a security gate: block merges that introduce or reintroduce jailbreak/injection vulnerabilities.
When mapping coverage to OWASP LLM Top 10 / OWASP Agentic / MITRE ATLAS for compliance reporting.
When comparing the security posture of two models or prompt versions side by side.
When tracking vulnerability regression over time across releases.

Prerequisites

Node.js 18+ (Promptfoo is distributed via npm) and Python 3.9+ (for DeepTeam).

Install Promptfoo and DeepTeam:

npm install -g promptfoo            # or: npx promptfoo@latest
pip install -U deepteam

API access/credentials for the target LLM endpoint (and a grader model, e.g. an OpenAI key) exposed as environment variables.
A CI/CD platform (GitHub Actions, GitLab CI) with secret storage.
Authorization to test the target application.

Objectives

Scaffold a Promptfoo red-team config targeting your LLM app.
Enable OWASP LLM Top 10 and OWASP Agentic plugin presets plus jailbreak/injection strategies.
Run the suite locally and interpret the per-plugin pass/fail report.
Add DeepTeam as a second engine for programmatic, research-backed attacks.
Integrate both into CI/CD so builds fail on new vulnerabilities.
Generate shareable HTML/PDF security reports per run.

MITRE ATT&CK Mapping

ID	Name (MITRE ATLAS)	Tactic
AML.T0051	LLM Prompt Injection	Initial Access / Persistence (LLM)
AML.T0051.000	Direct (Prompt Injection)	LLM Attack
AML.T0051.001	Indirect (Prompt Injection)	LLM Attack
AML.T0054	LLM Jailbreak	Privilege Escalation / Defense Evasion (LLM)

Workflow

1. Scaffold the red-team configuration

Initialize an interactive config; it writes promptfooconfig.yaml where targets, plugins, and strategies live.

promptfoo redteam init
# choose your target type (HTTP endpoint, openai:..., anthropic:..., custom provider)

2. Define targets, OWASP presets, and attack strategies

Edit promptfooconfig.yaml. The purpose grounds attack generation; plugins are adversarial input generators; strategies are delivery techniques (jailbreak/injection wrappers).

# promptfooconfig.yaml
targets:
  - id: https://api.example.com/chat        # your app endpoint
    label: support-bot

redteam:
  purpose: |
    A customer-support assistant for an e-commerce site. Must never reveal
    system prompts, leak PII, or perform actions outside order support.
  numTests: 10
  plugins:
    - owasp:llm          # OWASP LLM Top 10 preset
    - owasp:agentic      # OWASP Agentic threats preset
    - id: pii:direct
      numTests: 15
    - prompt-extraction  # system-prompt leakage
    - harmful
  strategies:
    - id: jailbreak              # iterative single-turn jailbreak
    - id: jailbreak:composite    # stacked jailbreak techniques
    - id: crescendo              # multi-turn escalation
    - id: prompt-injection       # injection wrapper

3. Run the suite and view the report

redteam run combines generation + evaluation; then open the interactive report.

promptfoo redteam run
promptfoo redteam report            # launches the web report (pass/fail per plugin)

Each row shows the plugin (mapped to OWASP/ATLAS), the strategy, the attack prompt, the model's response, and the grader's verdict. The attack success rate per plugin is your headline metric — track it per release.

4. Add DeepTeam for programmatic, research-backed attacks

Use DeepTeam to cover additional vulnerabilities/attacks and to script bespoke suites in Python.

# deepteam_suite.py
from deepteam import red_team
from deepteam.vulnerabilities import Bias, PIILeakage
from deepteam.attacks.single_turn import PromptInjection

def model_callback(prompt: str) -> str:
    # call your application's LLM endpoint here and return the text response
    return call_my_app(prompt)

red_team(
    model_callback=model_callback,
    vulnerabilities=[Bias(types=["race"]), PIILeakage(types=["api_and_database_access"])],
    attacks=[PromptInjection()],
)

DeepTeam can also be driven from a YAML config:

deepteam run config.yaml

5. Gate the build in CI/CD (GitHub Actions)

Fail the pipeline when red-team assertions fail. Promptfoo returns a non-zero exit code on failures, which blocks the merge.

# .github/workflows/llm-redteam.yml
name: LLM Red Team
on: [pull_request]
jobs:
  redteam:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g promptfoo
      - name: Run red team (fails build on new vulns)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptfoo redteam run --no-progress-bar
      - name: Export machine-readable results
        if: always()
        run: promptfoo redteam report --output results.json
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: redteam-report, path: results.json }

6. Track regressions over time

Persist results.json per run and compare attack-success-rate per plugin between releases. A rising rate for any OWASP LLM category is a regression to triage before release. Promptfoo's --filter-failing lets you re-run only previously failing cases to confirm a fix.

promptfoo redteam run --filter-failing results.json

Tools and Resources

Resource	Link
Promptfoo red-team docs	https://www.promptfoo.dev/docs/red-team/
Promptfoo red-team configuration	https://www.promptfoo.dev/docs/red-team/configuration/
Promptfoo CI/CD integration	https://www.promptfoo.dev/docs/integrations/ci-cd/
Promptfoo MITRE ATLAS mapping	https://www.promptfoo.dev/docs/red-team/mitre-atlas/
DeepTeam (Confident AI)	https://github.com/confident-ai/deepteam
DeepTeam docs	https://www.trydeepteam.com/docs/getting-started
OWASP Top 10 for LLM Applications	https://genai.owasp.org/

Plugin / Strategy Reference

Promptfoo item	Type	Maps to
`owasp:llm`	preset	OWASP LLM Top 10 suite
`owasp:agentic`	preset	OWASP Agentic threats
`prompt-extraction`	plugin	LLM07 system-prompt leakage
`pii:direct`	plugin	LLM06 sensitive-info disclosure
`harmful`	plugin	harmful content generation
`jailbreak` / `jailbreak:composite`	strategy	AML.T0054 LLM jailbreak
`crescendo`	strategy	multi-turn jailbreak
`prompt-injection`	strategy	AML.T0051 prompt injection

Validation Criteria

promptfooconfig.yaml created with target, owasp:llm, and owasp:agentic plugins.
Jailbreak and prompt-injection strategies enabled.
promptfoo redteam run executes and produces a per-plugin pass/fail report.
DeepTeam suite runs against the same target via model_callback.
CI/CD job fails the build on new red-team failures (non-zero exit).
results.json artifact archived per run for regression tracking.
Attack-success-rate per OWASP category trended across releases.