detecting-model-extraction-attacks

Stars: 20,049

Forks: 2,331

Updated: June 22, 2026 at 17:08

Detect model stealing, model inversion, and membership inference performed through inference-API abuse by monitoring query patterns, applying output perturbation, and red-teaming your own model's extractability.


Source

mukul975

mukul975/Anthropic-Cybersecurity-Skills



name	detecting-model-extraction-attacks
description	Detect model stealing, model inversion, and membership inference performed through inference-API abuse by monitoring query patterns, applying output perturbation, and red-teaming your own model's extractability.
domain	cybersecurity
subdomain	ai-security
tags	["ai-security","model-extraction","membership-inference","model-inversion","inference-api","mitre-atlas","query-monitoring","mlsecops"]
version	1.0
author	mahipal
license	Apache-2.0
nist_csf	["MEASURE-2.7"]
mitre_attack	["AML.T0024"]

Detecting Model Extraction Attacks

Authorized Use Only: The extraction, inversion, and membership-inference techniques described here are intended for defenders testing their own models and for red teams operating under written authorization. Querying a third-party model to clone it, reconstruct its training data, or infer membership without permission may violate terms of service, copyright, and privacy law.

Overview

Model extraction is the family of attacks in which an adversary abuses a model's inference API to steal value that the model owner intended to keep private. MITRE ATLAS catalogs these under AML.T0024 — Exfiltration via AI Inference API, in the Exfiltration tactic, with three sub-techniques:

AML.T0024.000 — Infer Training Data Membership (membership inference): the adversary determines whether a specific record was part of the training set, a privacy violation that can expose, for example, whether a patient's record trained a medical model.
AML.T0024.001 — Invert AI Model (model inversion): the adversary reconstructs representative training inputs (e.g., faces, text) by exploiting confidence scores returned by the API.
AML.T0024.002 — Extract ML Model (model stealing): the adversary repeatedly queries the victim model, collects (input, prediction) pairs, and trains a surrogate model offline that mimics the victim's decision boundary — avoiding the per-query cost of a Machine-Learning-as-a-Service offering and stealing the owner's intellectual property.

All three typically share a common signal: the attacker must send many queries, often crafted to probe the decision boundary (high-entropy, near-boundary, synthetic, or systematically grid-sampled inputs), and frequently requests full confidence vectors or logits rather than just the top label. Detection therefore centers on per-principal query monitoring, input-distribution analysis, and confidence-exposure controls, while defense centers on rate limiting, output perturbation, and reducing the information returned per query. This skill follows the MITRE ATLAS technique definition for AML.T0024 (https://atlas.mitre.org/techniques/AML.T0024) and the NIST AI RMF MEASURE function, specifically the subcategory on evaluating AI system security and resilience (MEASURE 2.7 in AI RMF 1.0).
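
The "near-boundary" signal mentioned above can be approximated from the confidence vectors the API already computes. The following is a minimal sketch, assuming the service can see the full probability vector for each query; the function names and the 0.1 margin threshold are illustrative choices, not values taken from MITRE ATLAS or ART.

```python
import numpy as np

def boundary_margin(probs: np.ndarray) -> np.ndarray:
    """Top-1 minus top-2 probability per row; small values mean near-boundary."""
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest values, ascending
    return top2[:, 1] - top2[:, 0]

def near_boundary_ratio(probs: np.ndarray, margin_threshold: float = 0.1) -> float:
    """Fraction of a principal's queries whose margin falls below the threshold."""
    return float(np.mean(boundary_margin(probs) < margin_threshold))

# Hypothetical confidence vectors returned to one principal.
probs = np.array([
    [0.50, 0.45, 0.05],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.48, 0.47, 0.05],
])
ratio = near_boundary_ratio(probs)
print(f"near-boundary ratio: {ratio:.2f}")
```

A principal whose ratio stays far above the ratio seen across ordinary traffic is a candidate for closer review; the baseline has to be measured on your own workload.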

When to Use

When you operate a model behind a public or partner inference API and need to detect cloning, inversion, or membership inference.
When performing a pre-deployment AI red-team exercise to measure how many queries are needed to extract your own model.
When validating that rate limiting, output perturbation, and confidence-suppression controls actually reduce extractability.
When investigating anomalous billing/usage spikes that may indicate surrogate-model harvesting.
When responding to a privacy incident where membership inference against a model is suspected.

Prerequisites

Python 3.9+ environment.
Access to inference-API access logs (per-API-key/per-principal query counts, timestamps, input features or hashes, returned confidence vectors).
For self-assessment red-teaming, install the Adversarial Robustness Toolbox (ART), the reference framework for extraction/inference attacks and defenses:
```
pip install adversarial-robustness-toolbox scikit-learn numpy
```
Optional: access to the target model object (white/grey-box) or only its API (black-box).
Authorization to test the target model.

Objectives

Instrument the inference API to record per-principal query volume, input diversity, and confidence-exposure.
Build a detector that scores principals for extraction-like behavior (volume, near-boundary sampling, full-vector requests).
Run an ART-based extraction attack against your own model to measure fidelity vs. query budget.
Run a membership-inference attack to quantify training-data leakage.
Apply and validate defenses: rate limiting, label-only responses, confidence rounding/perturbation, and prediction poisoning.

MITRE ATLAS Mapping

ID	Name (MITRE ATLAS)	Tactic
AML.T0024	Exfiltration via AI Inference API	Exfiltration
AML.T0024.000	Infer Training Data Membership	Exfiltration
AML.T0024.001	Invert AI Model	Exfiltration
AML.T0024.002	Extract ML Model	Exfiltration

Workflow

1. Instrument the inference API for detection signals

Capture the fields a detector needs. Per request, log the principal (API key / IP / account), timestamp, an input fingerprint, and whether the caller requested probabilities/logits.

import hashlib, json, time

def log_inference(principal, features, returned_probs):
    record = {
        "ts": time.time(),
        "principal": principal,
        # hash inputs so logs don't store raw sensitive data
        "input_hash": hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest(),
        "wants_probs": returned_probs,
        "n_features": len(features),
    }
    with open("inference_audit.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
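
A plain SHA-256 of a low-entropy input can be reversed by hashing candidate inputs, and tiny floating-point differences produce unrelated hashes. One possible variant, sketched here under the assumption that inputs are flat numeric feature vectors, uses a keyed HMAC over rounded values. The name `input_fingerprint`, the `key` argument, and the rounding precision are hypothetical and would need tuning for real data.

```python
import hashlib
import hmac
import json

def input_fingerprint(features, key: bytes, decimals: int = 4) -> str:
    """Keyed fingerprint of a numeric feature vector.

    Rounding makes near-identical inputs collide on purpose; the secret key
    prevents offline guessing of inputs from the logged digest.
    """
    canonical = json.dumps([round(float(v), decimals) for v in features])
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

key = b"rotate-me-and-store-in-a-secret-manager"  # placeholder secret
fp_a = input_fingerprint([1.0, 2.00001], key)
fp_b = input_fingerprint([1.0, 2.00002], key)
fp_c = input_fingerprint([1.0, 2.5], key)
print(fp_a == fp_b, fp_a == fp_c)
```

Rounding trades precision for robustness: it lowers the unique-input ratio of callers who jitter their inputs slightly, but it also merges genuinely distinct inputs that differ below the chosen precision.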

2. Detect extraction-like query patterns

Score each principal on the three signals that distinguish extraction from normal use: high query volume in a window, high unique-input ratio (attackers rarely repeat), and a high rate of full-probability requests.

import collections
import json
import time

def score_principals(audit_path="inference_audit.jsonl", query_threshold=100,
                     window_seconds=3600, now=None):
    """Flag principals whose traffic inside the window looks like extraction."""
    cutoff = (time.time() if now is None else now) - window_seconds
    by_principal = collections.defaultdict(lambda: {"q": 0, "uniq": set(), "probs": 0})
    with open(audit_path) as f:
        for line in f:
            r = json.loads(line)
            if r["ts"] < cutoff:
                continue  # only score queries inside the window
            p = by_principal[r["principal"]]
            p["q"] += 1
            p["uniq"].add(r["input_hash"])
            p["probs"] += int(r["wants_probs"])
    findings = []
    for principal, p in by_principal.items():
        uniq_ratio = len(p["uniq"]) / max(p["q"], 1)
        prob_ratio = p["probs"] / max(p["q"], 1)
        # query_threshold is a query count per window, not a per-second rate
        suspicious = p["q"] > query_threshold and uniq_ratio > 0.9 and prob_ratio > 0.8
        findings.append({"principal": principal, "queries": p["q"],
                         "unique_ratio": round(uniq_ratio, 3),
                         "prob_request_ratio": round(prob_ratio, 3),
                         "suspected_extraction": suspicious})
    return sorted(findings, key=lambda x: -x["queries"])

3. Measure your model's extractability with ART (self red-team)

Use ART's KnockoffNets (or CopycatCNN) to train a surrogate from black-box queries and report its fidelity, meaning its agreement with the victim, at a given query budget. High agreement at a low query budget indicates high extraction risk.

import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.extraction import KnockoffNets
from sklearn.ensemble import RandomForestClassifier

# Assumed to exist already: trained_model (your fitted production model),
# x_pool (unlabeled query inputs) and x_test (held-out evaluation inputs).
QUERY_BUDGET = 2000

victim = SklearnClassifier(model=trained_model)   # wrap the victim for ART
# NOTE: ART's extraction attacks may forward batch_size/nb_epochs to the
# thief's fit(). If your ART version rejects them for a scikit-learn thief,
# wrap a neural-network thief (e.g. PyTorchClassifier) instead.
thief = SklearnClassifier(model=RandomForestClassifier(n_estimators=100))

attack = KnockoffNets(classifier=victim, batch_size_fit=64,
                      batch_size_query=64, nb_epochs=10, nb_stolen=QUERY_BUDGET)
stolen = attack.extract(x=x_pool, thief_classifier=thief)

# Fidelity = fraction of test inputs on which surrogate and victim agree.
agreement = np.mean(stolen.predict(x_test).argmax(1) == victim.predict(x_test).argmax(1))
print(f"Surrogate fidelity (agreement with victim): {agreement:.2%} at {QUERY_BUDGET} queries")
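
To report fidelity against query budget rather than at a single point, the same idea can be swept over several budgets. The sketch below deliberately swaps ART for plain scikit-learn so that it runs on its own: it trains a stand-in victim on synthetic data, queries it for labels only, and fits a surrogate at each budget. The dataset, model choices, and budgets are all assumptions for illustration, not a substitute for running KnockoffNets against your real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: victim training set, attacker query pool, test set.
X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
x_train, y_train = X[:2000], y[:2000]
x_pool, x_test = X[2000:5000], X[5000:]

victim = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_train, y_train)

def fidelity_at_budget(budget: int) -> float:
    """Train a surrogate on `budget` victim-labelled queries; return agreement."""
    queries = x_pool[:budget]
    stolen_labels = victim.predict(queries)          # label-only black-box access
    surrogate = RandomForestClassifier(n_estimators=50, random_state=1)
    surrogate.fit(queries, stolen_labels)
    return float(np.mean(surrogate.predict(x_test) == victim.predict(x_test)))

curve = {budget: fidelity_at_budget(budget) for budget in (100, 500, 2000)}
for budget, fidelity in curve.items():
    print(f"{budget:>5} queries -> fidelity {fidelity:.2%}")
```

The budget at which the curve flattens gives a rough target for rate limits: a per-principal quota well below that point makes a high-fidelity copy more expensive to obtain.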

4. Quantify training-data leakage with membership inference

Run ART's black-box membership-inference attack. An accuracy meaningfully above 50% indicates the model leaks membership (AML.T0024.000).

from art.attacks.inference.membership_inference import MembershipInferenceBlackBox

mia = MembershipInferenceBlackBox(victim, attack_model_type="rf")
# fit the attack on a labeled split of known members / non-members
mia.fit(x_train[:500], y_train[:500], x_test[:500], y_test[:500])
member_pred = mia.infer(x_train[500:1000], y_train[500:1000])
nonmember_pred = mia.infer(x_test[500:1000], y_test[500:1000])
acc = (member_pred.mean() + (1 - nonmember_pred.mean())) / 2
print(f"Membership-inference accuracy: {acc:.2%} (0.50 = no leakage)")
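
As a sanity check next to the ART attack, a much simpler baseline is often informative: guess "member" whenever the model's confidence in the true label exceeds a threshold. The sketch below is self-contained and uses a synthetic dataset with a deliberately overfit random forest. All names are illustrative, and picking the best threshold on the evaluation data makes the reported number optimistic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with label noise so the model has something to memorize.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
x_members, y_members = X[:1000], y[:1000]   # used for training
x_non, y_non = X[1000:], y[1000:]           # never seen by the model
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_members, y_members)

def true_label_confidence(model, x, y):
    """Probability the model assigns to each record's true label."""
    probs = model.predict_proba(x)
    return probs[np.arange(len(y)), y]

def threshold_attack_accuracy(conf_in, conf_out, threshold):
    """Balanced accuracy of the rule: confidence >= threshold means member."""
    tpr = np.mean(conf_in >= threshold)
    tnr = np.mean(conf_out < threshold)
    return float((tpr + tnr) / 2)

conf_in = true_label_confidence(model, x_members, y_members)
conf_out = true_label_confidence(model, x_non, y_non)
acc = max(threshold_attack_accuracy(conf_in, conf_out, t)
          for t in np.linspace(0.0, 1.0, 101))
print(f"Threshold-attack balanced accuracy: {acc:.2%} (0.50 = no leakage)")
```

If this baseline already scores well above 50%, the model is leaking membership through confidence alone, and label-only responses or stronger regularization are worth testing before more elaborate attacks.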

5. Apply and validate defenses

Reduce the information returned and the query economics. Re-run steps 3 and 4 after each control to confirm extractability drops.

import numpy as np

# (a) Label-only responses: never return full probability vectors to untrusted callers.
def respond(probs, trusted):
    return probs.tolist() if trusted else int(probs.argmax())

# (b) Confidence rounding / output perturbation (raises queries needed for inversion):
def perturb(probs, decimals=2, noise=0.01):
    p = np.round(probs, decimals) + np.random.normal(0, noise, probs.shape)
    p = np.clip(p, 0, None)
    return p / p.sum(axis=-1, keepdims=True)  # renormalize each vector, batched or not
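
Perturbation has a utility cost, because noise can change which class ranks first. A quick way to size that cost is to measure how often the top-1 label flips at a given noise level. The sketch below re-declares a seeded variant of `perturb` so it runs standalone; the Dirichlet-sampled probabilities and the two noise levels are placeholders for your model's real outputs and candidate settings.

```python
import numpy as np

def perturb(probs, decimals=2, noise=0.01, rng=None):
    """Round, add Gaussian noise, clip, and renormalize each row."""
    rng = rng if rng is not None else np.random.default_rng()
    p = np.round(probs, decimals) + rng.normal(0.0, noise, probs.shape)
    p = np.clip(p, 0.0, None)
    total = p.sum(axis=-1, keepdims=True)
    uniform = np.full_like(p, 1.0 / p.shape[-1])   # fallback if a row clips to zero
    return np.where(total > 0, p / np.maximum(total, 1e-12), uniform)

def label_flip_rate(probs, **kwargs) -> float:
    """Fraction of rows whose top-1 label changes after perturbation."""
    return float(np.mean(perturb(probs, **kwargs).argmax(axis=1) != probs.argmax(axis=1)))

# Placeholder confidence vectors; substitute real model outputs.
probs = np.random.default_rng(0).dirichlet(np.ones(5), size=1000)
rate_small = label_flip_rate(probs, noise=0.01, rng=np.random.default_rng(1))
rate_large = label_flip_rate(probs, noise=0.10, rng=np.random.default_rng(1))
print(f"flip rate at noise=0.01: {rate_small:.1%}, at noise=0.10: {rate_large:.1%}")
```

Pair this number with the extraction and membership-inference results from steps 3 and 4 to choose a noise level whose accuracy cost is acceptable.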

Defense in depth combines these with strict per-principal rate limiting, anomaly alerting from step 2, ART's ReverseSigmoid / prediction-poisoning postprocessor, and watermarking so an extracted surrogate remains attributable.
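
Per-principal rate limiting is commonly implemented as a token bucket. The class below is a minimal in-memory sketch, assuming a single process; `TokenBucketLimiter`, its parameters, and the injectable `clock` are hypothetical names, and a production deployment would normally keep this state in a shared store or at the API gateway.

```python
import time

class TokenBucketLimiter:
    """Allow `burst` queries at once, refilled at `rate_per_sec` per principal."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._state = {}  # principal -> (tokens, last_timestamp)

    def allow(self, principal: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(principal, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[principal] = (tokens, now)
        return allowed

limiter = TokenBucketLimiter(rate_per_sec=5.0, burst=10)
decisions = [limiter.allow("api-key-123") for _ in range(12)]
print(decisions.count(True), "allowed,", decisions.count(False), "rejected")
```

The sustained rate bounds how many (input, prediction) pairs a single key can harvest per day, which ties the limit directly to the query budgets measured in step 3.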

6. Alert and respond

Wire step-2 findings into your SIEM. On a confirmed extraction pattern: throttle or revoke the API key, switch the principal to label-only responses, preserve the audit log as evidence, and assess membership-inference exposure for any sensitive training data.
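
The response steps above can be encoded as a small decision function so that alerts carry a recommended action list. This is a sketch only: the action names, the `revoke_after_queries` cutoff, and the event fields are hypothetical and should be mapped onto whatever your SIEM and API gateway actually accept. The input dictionary follows the shape of the step-2 findings.

```python
import json

def plan_response(finding: dict, revoke_after_queries: int = 10_000) -> list:
    """Map one detector finding to an ordered list of response actions."""
    if not finding.get("suspected_extraction"):
        return []
    actions = ["preserve_audit_log", "switch_to_label_only"]
    if finding["queries"] >= revoke_after_queries:
        actions.append("revoke_api_key")
    else:
        actions.append("throttle_api_key")
    actions.append("assess_membership_inference_exposure")
    return actions

def to_siem_event(finding: dict) -> str:
    """Serialize a finding plus its planned actions as one JSON event."""
    event = {"event_type": "model_extraction_suspected",
             "technique": "AML.T0024", **finding,
             "actions": plan_response(finding)}
    return json.dumps(event, sort_keys=True)

finding = {"principal": "api-key-123", "queries": 25_000, "unique_ratio": 0.98,
           "prob_request_ratio": 0.95, "suspected_extraction": True}
print(to_siem_event(finding))
```

Preserving evidence comes first in the list so that throttling or revocation never races with log retention.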

Tools and Resources

Resource	Link
MITRE ATLAS AML.T0024 — Exfiltration via AI Inference API	https://atlas.mitre.org/techniques/AML.T0024
Adversarial Robustness Toolbox (ART)	https://github.com/Trusted-AI/adversarial-robustness-toolbox
ART extraction attacks (CopycatCNN, KnockoffNets)	https://adversarial-robustness-toolbox.readthedocs.io/
MITRE ATLAS Matrix	https://atlas.mitre.org/matrices/ATLAS
NIST AI RMF (MEASURE function)	https://www.nist.gov/itl/ai-risk-management-framework

Detection Signal Reference

Signal	Normal use	Extraction behavior
Query volume per principal	Bounded, bursty	Very high, sustained
Unique-input ratio	Repeats common inputs	Near-1.0 (rarely repeats)
Confidence-vector requests	Mostly top label	Demands full probs/logits
Input distribution	In-distribution	Near-boundary / synthetic / grid
Inter-query timing	Human-paced	Automated, regular
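
The inter-query timing row can be turned into a number with the coefficient of variation of the gaps between a principal's requests: scripted clients tend to produce near-constant gaps, human-driven traffic irregular ones. The sketch below assumes the `ts` values from the audit log are available per principal; the example timestamps and any cutoff you derive from them are illustrative.

```python
import numpy as np

def timing_cv(timestamps) -> float:
    """Coefficient of variation (std / mean) of inter-query gaps.

    Values near 0 indicate machine-regular pacing. Returns NaN when there
    are too few requests to judge.
    """
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if len(gaps) < 2 or gaps.mean() == 0:
        return float("nan")
    return float(gaps.std() / gaps.mean())

scripted = [0.0, 0.5, 1.0, 1.5, 2.0]          # fixed 0.5 s cadence
interactive = [0.0, 2.0, 3.0, 30.0, 31.0, 90.0]
print(f"scripted CV: {timing_cv(scripted):.2f}, interactive CV: {timing_cv(interactive):.2f}")
```

Regular timing alone is weak evidence, since legitimate batch jobs are also regular, so it works best combined with the volume, uniqueness, and confidence-request signals in the table.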

Validation Criteria

Inference API logs per-principal query volume, input fingerprint, and confidence-exposure.
Detector scores principals and flags high-volume, high-unique-ratio, full-vector callers.
ART extraction attack run against own model; surrogate fidelity vs. query budget reported.
Membership-inference accuracy measured and compared against the 50% baseline.
Label-only / confidence-perturbation defenses applied and re-tested.
Per-principal rate limiting enforced and validated.
Alerts routed to SIEM with response playbook (throttle, revoke, preserve evidence).