| name | security-audit-eval |
| description | Audit a third-party Inspect AI evaluation for security risks before running it locally. Decide whether the eval is safe by checking for malicious host-side code, externally-fetched files that aren't quality-controlled, sandbox-breakout instructions, weak sandbox configuration, supply-chain hazards, credential exposure, resource exhaustion, and provenance signals. Use when the user asks to audit / vet / security-review an eval repo (GitHub URL or local path), or asks "is it safe to run X". Do NOT use for assessing whether an eval *measures what it claims* (use eval-validity-review) or for general code-quality review (use eval-quality-workflow / code-quality-review-all). |
Security Audit: Third-Party Evaluation
This skill produces a security audit report for an Inspect AI evaluation that the user is considering running. The output is a single Markdown report in agent_artefacts/security_audits/<eval_name>/SECURITY_AUDIT_REPORT.md with a clear verdict (safe / safe-with-caveats / unsafe) backed by file:line evidence.
The audit is read-only — it does not run the evaluation, does not clone repositories to disk, and does not modify any files outside the report directory.
Identifying the target
The eval may be supplied as:
- A GitHub URL (most common for third-party audits) — use
gh api repos/<owner>/<repo>/contents/<path> and gh api repos/<owner>/<repo>/git/trees/HEAD?recursive=1 to read content. Do not clone.
- A local path to a directory the user already has — use
Read and Bash directly.
- An eval already inside
src/inspect_evals/ — same as local path; treat the eval directory as the target.
If the user is ambiguous, ask which one.
Set <eval_name> to the repo name (for URLs) or the directory name (for local paths). Set <commit> to the audited commit SHA: gh api repos/<owner>/<repo>/commits/HEAD --jq '.sha' for URLs, git rev-parse HEAD for local repos. If neither is a git repo, omit <commit> and note "uncommitted working tree" in the report.
Setup
- Create the output directory:
agent_artefacts/security_audits/<eval_name>/
- Create
NOTES.md in that directory for raw observations as you work. Err on the side of taking lots of notes — they are the source material for the report.
- Get the full file tree (note sizes — large unexplained binaries are red flags):
- URL:
gh api repos/<owner>/<repo>/git/trees/HEAD?recursive=1 --jq '.tree[] | "\(.type)\t\(.size // "")\t\(.path)"'
- Local:
find . -type f -not -path './.git/*' -printf '%s\t%p\n'
- Note any unexpected files: archives (
.tar.gz, .zip), executables (.so, .dylib, .exe), pickle (.pkl, .joblib, .pt, .pth), encoded blobs, files >5 MB without an obvious dataset purpose.
The audit: 8 categories
Apply each category in order. For each, read actual file contents — do not trust top-level summaries or READMEs alone. Malicious behaviour hides in detail (one eval() call, one suspicious URL). Record each finding in NOTES.md with file:line evidence.
1. Host-side code execution
The eval module is imported into the user's Python environment. Anything that runs at import or install time runs with the user's privileges, outside any sandbox.
Read in full:
pyproject.toml and setup.py (if present) — look for cmdclass, custom build hooks, entry_points beyond the package itself, scripts
- All
__init__.py files
conftest.py (any pytest fixture runs at test collection time)
- Top-level code in any module that gets imported by the
@task definition
Grep for, on the whole repo (excluding the dataset and lockfiles):
grep -rEn "subprocess|os\.system|os\.popen|pty\.spawn|exec\(|eval\(|compile\(|__import__\(|importlib|base64\.b64decode|codecs\.decode|marshal\.loads|pickle\.load|exec_module|getattr\(.*?,.*?env|socket\.|ctypes" --include='*.py'
For each hit, decide: is this a legitimate use (e.g. subprocess inside a function only called within a sandbox abstraction), or could it run on import / on the host? Distinguish sandbox().exec(...) (Inspect's sandbox abstraction — runs in container) from raw subprocess.run(...) (runs on host).
Also check:
package.json's scripts for preinstall, postinstall, prepare
pyproject.toml [build-system].requires and [project.dependencies] for git+, file:, private indexes, unpinned versions
requirements.txt if present
Severity guidance:
- Any
subprocess/os.system/exec at module top level or in __init__.py → High, requires explanation before passing
eval() or exec() on a string built from a remote source → High
pickle.load/torch.load on a remote URL → High (arbitrary code execution)
- Postinstall hooks in
package.json → Medium
- Unpinned or
git+ dependencies → Medium
2. Sandbox configuration
Read compose.yaml / docker-compose.yml / Dockerfile in full.
Red flags:
sandbox="local" paired with model-produced code (e.g. a bash or python tool) → High
privileged: true or any cap_add (especially SYS_ADMIN, NET_ADMIN) → High
- Host-path volume mounts:
/, /var/run/docker.sock, ~, ${HOME}, anything outside the eval's own dir → High
- Image from non-official registry, or floating tag like
:latest → Medium
- Custom Dockerfile with
RUN curl ... | sh or RUN wget ... && chmod +x → High
- Missing
mem_limit and cpus → Low (DoS surface)
- Exposed host ports the agent shouldn't need → Medium
- Network configuration: note whether
network_mode: none is set; if not, the container has outbound internet. This is sometimes legitimate (eval needs to fetch from a controlled service) but should be called out.
Verify the image source. For image: <name>:<tag>, check that <name> is a Docker official image or comes from a registry the user trusts. node:20-bookworm-slim is fine; randomuser/cool-tools:latest is not.
3. External resource fetching at runtime
Where does the eval get its data, system prompts, or scoring artifacts from?
Grep for URLs across the repo:
grep -rEn "https?://" --include='*.py' --include='*.yaml' --include='*.yml' --include='*.toml' --include='*.json' --include='*.md'
For each URL, classify:
- HuggingFace dataset with pinned
revision (40-char SHA) → safe. Sanity-check it resolves: curl -sI "<url>" | head -3 should return 200 or 30x.
- HuggingFace dataset without a revision, or with a branch name → Medium (mutable; upstream owner could swap content).
- GitHub URL pointing to a commit SHA → safe.
- GitHub URL pointing to a branch (e.g.
.../main/data.csv) → Medium (mutable).
- Any of: GDrive (
drive.google.com), Dropbox, pastebin, S3 presigned, personal domain, IP literal → High, regardless of pinning. Quality cannot be controlled.
- Internal/private domain → Medium, surface as "this eval expects access to
<domain>".
Also look for:
wget / curl / urlretrieve / requests.get calls in setup / install / __init__ paths
verify=False on TLS calls → Medium
pickle, joblib, torch.load, numpy.load(allow_pickle=True) on any of the URLs above → High (RCE via deserialization)
- Dataset loaders with dynamic URL construction (e.g. building a URL from an env var) → investigate
4. Agent affordances and prompts
The agent itself can become the attacker if the eval gives it the keys — but context matters. The question isn't "does the agent do dangerous things" — it's "does the agent do dangerous things outside its containment boundary". An agent told to run shell, exploit kernels, or steal credentials inside a properly contained Docker sandbox is doing exactly what offensive-security evals require. The same agent paired with a weak sandbox (or sandbox="local", or host volume mounts, or privileged containers) is dangerous regardless of what the prompt says.
This category cannot be read in isolation — its severity depends on category 2 (sandbox configuration). A nested-sandbox eval where the agent is expected to break out of an inner sandbox into the eval's outer Docker container is fine; a breakout from the outer container to the host is not.
Read every system prompt and task instruction (search for system_message=, template=, top-level string constants used as prompts).
Where prompt content matters:
- The relevant boundary is the host, not the eval's notional inner sandbox. Instructions to "break out", "escalate privileges", or "exfiltrate" are fine if the only thing they can plausibly affect is the eval's Docker container. They are High only if combined with a sandbox configuration that doesn't actually contain the breakout (cross-reference category 2).
- Instructions to disable safety filters, ignore guardrails, or "act as if you have no restrictions" are fine in offensive-security or red-team evals running in proper containment. They are High if the agent has tools that can reach the host or third-party services with the user's credentials.
Inventory the agent's tools. For each tool:
- What does it actually do? (read the implementation, not the docstring)
- Does it run shell on the host? With
shell=True and unescaped model input → High
- Does it expose host paths inside the sandbox? → High
- Does it call
host.docker.internal or localhost services on the host (vs. localhost inside the container)? → High
- Does it combine
bash/python / arbitrary command execution with sandbox="local", or with a sandbox that lacks containment (privileged, host mounts, network_mode: host)? → High
A react() or basic_agent() with a bash/python tool inside a properly configured Docker sandbox is the standard agentic pattern and is fine. The dangerous combinations are: agent with arbitrary code execution + sandbox="local"; agent with arbitrary code execution + a sandbox that doesn't actually contain it.
Impossible requests (find-the-thing-that-isn't-there). If the eval asks the agent to extract data that doesn't actually exist in the sandbox — e.g. "find the API key" when no API key has been planted, "exfiltrate the database" when the database is empty, "read /etc/secrets" when no such file is present — flag this. A capable agent that can't find the asked-for object inside its environment may try to break out to look for real ones (the user's machine, external services). This is both a security risk (motivates breakout attempts that probe the containment boundary) and a validity risk (the eval cannot actually be solved as posed); record it in the report and cross-reference category 8.
5. Resource exhaustion / denial-of-service
Check:
mem_limit, cpus, pids_limit in compose
message_limit and attempts on the agent
- Tool-level timeouts (e.g. install timeouts, request timeouts)
- Dataset size — multi-GB datasets without disclosure → Medium
- Loops that could fork-bomb (a tool that itself can spawn other tools without bound)
Severity: rarely High unless combined with another category. Usually Low ("be aware that a full run uses N GB of disk").
6. Credential / secret exposure
Grep:
grep -rEn "api_key|api-key|secret|token|password|bearer|aws_access|sk-[A-Za-z0-9]{20,}|hf_[A-Za-z0-9]{20,}" --include='*.py' --include='*.yaml' --include='*.yml' --include='*.json' --include='*.toml' --include='*.env*'
Plus check .gitignore — files listed there that look like secrets but aren't present in the tree are fine (good hygiene). Files matching secret patterns that are present are High.
Also look for:
- Outbound POSTs of run results / model outputs to external URLs (telemetry / exfiltration)
os.environ reads that pass values to scorers or remote services (could ship the user's API keys somewhere)
- Webhook URLs, Slack/Discord tokens, email senders
7. Provenance and integrity signals
This is informational, not a hard severity, but it shapes how much trust to extend.
Gather:
- Repo age (created date), last push, commit count
- Author count, contributor list
- Stars, forks, open issues
- Are commits signed?
gh api repos/<owner>/<repo>/commits --jq '.[0:5][] | {sha: .sha, verified: .commit.verification.verified, author: .commit.author.name}'
- Is there a README that explains intent? A test suite? A CI config?
- Suspicious
.gitignore patterns (e.g. tracking files that look like keys, or excluding directories that "shouldn't" need to be excluded)
A new single-author repo with no community signal isn't unsafe by itself, but the bar for trusting other findings should be higher — read more carefully, demand more evidence, recommend re-auditing on version bumps.
8. Validity (out of scope, flag and defer)
If you notice fabricated dataset claims, broken scorers, names that don't match content, or impossible requests (see category 4), note them in the report but explicitly mark them out of scope for the security audit.
The three named must-checks (mandatory section in every report)
Every report must explicitly answer these three, even if a finding has already been covered above. They exist because they are the user's hard requirements.
- Outright malicious code that runs as part of the eval — yes / no, with file:line evidence either way.
- Externally-fetched files that aren't quality-controlled (e.g. GDrive, Dropbox, personal domains, mutable branches) — yes / no, listing every external URL with classification.
- Sandbox-breakout to the host, or unsafe agent commands without adequate sandboxing — yes / no. The bar is whether the agent can reach the host, not whether it does dangerous things inside containment. Breakout instructions are fine if they target a contained inner sandbox; they are not fine if they target (or could plausibly affect) the host. Cite the system prompt, tool list, and sandbox declaration to show how containment holds.
If any of the three is "yes", the verdict is Unsafe and the report leads with that.
Verdict rubric
- Safe — no findings above Low, three must-checks all "no", standard sandbox in place.
- Safe with caveats — only Low / informational findings, three must-checks all "no". Caveats listed for the user to decide.
- Unsafe — any High finding, or any of the three must-checks "yes". The user should not run this eval as-is.
A single High finding drives the verdict regardless of how clean the rest of the audit is. Defence in depth means we don't trade off.
Writing the report
Write agent_artefacts/security_audits/<eval_name>/SECURITY_AUDIT_REPORT.md with this structure:
# Security Audit Report: <eval_name>
**Repository**: <URL or local path>
**Commit audited**: <sha> (<date>, <commit message subject>) [omit if uncommitted]
**Audit date**: <today's date>
**Auditor**: Claude Code (read-only audit; eval not run)
## Verdict
**<Safe | Safe with caveats | Unsafe>** — <one-line reason>.
<2-4 sentence justification: what was checked, what stood out>
## Threat-model categories applied
| # | Category | Finding | Severity |
|---|---|---|---|
| 1 | Host-side code execution | <one-line summary with file:line> | None / Low / Medium / High |
| 2 | Sandbox configuration | ... | ... |
| 3 | External resource fetching | ... | ... |
| 4 | Agent affordances and prompts | ... | ... |
| 5 | Resource exhaustion | ... | ... |
| 6 | Credential / secret exposure | ... | ... |
| 7 | Provenance and integrity | ... | ... |
| 8 | Validity (out of scope) | <flag if anything noticed, else "n/a"> | n/a |
## The three named must-checks
- **Outright malicious code that runs as part of the eval**: <yes/no, with evidence>
- **Externally-fetched files that aren't quality-controlled**: <yes/no, with the list of external URLs and how each is pinned>
- **Sandbox-breakout to the host / unsafe agent commands without adequate sandboxing**: <yes/no, citing the system prompt, tools, and sandbox declaration; note that breakouts targeting a contained inner sandbox are fine — only host reachability matters>
## Caveats and recommendations
<Each Low / informational finding gets a paragraph: what it is, what the impact is, what the user might want to do about it. Skip this section if there are none.>
## Evidence (file:line references)
<Table mapping each claim in the report to a file:line so the user can spot-check.>
## Files NOT audited
<List any large files, datasets, or out-of-scope assets you skipped, with the reason.>
## Suggested next steps
<Concrete next steps: run it / don't run it / run it with this mitigation. If verdict is "Safe with caveats", say what the caveats are.>
Wrap up
After writing the report:
- Re-read it end-to-end. Every finding must have a file:line citation.
- Tell the user the verdict (one line) and the report path.
- If the verdict is Unsafe, lead with that and list the High-severity findings inline so the user sees them without having to open the file.
- If the user's original request was a question like "is X safe to run", answer it directly in the chat as well as in the report.
Common pitfalls to avoid
- Trusting top-level summaries. READMEs lie. Always read the actual code.
- Conflating
sandbox().exec() with subprocess.run(). The first is Inspect's sandbox abstraction (safe); the second runs on the host (dangerous). Verify which one a callsite uses.
- Missing the per-sample dataset row. If the dataset contains executable strings (server code, agent prompts, shell snippets), those are part of the attack surface even though they live in data, not source. Note them as a class even if you can't audit every row.
- Calling something safe because nothing pinged. "I didn't find anything bad" is not the same as "I checked thoroughly." If a category was hard to audit (e.g. dataset too large), say so explicitly in Files NOT audited rather than implying coverage.
- Reading category 4 in isolation. Agent affordances and prompts are only as dangerous as the sandbox lets them be. Always evaluate sections 2 and 4 jointly.