Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

security-audit-eval

Name: Security Audit Eval
Author: UKGovernmentBEIS

// Audit a third-party Inspect AI evaluation for security risks before running it locally. Decide whether the eval is safe by checking for malicious host-side code, externally-fetched files that aren't quality-controlled, sandbox-breakout instructions, weak sandbox configuration, supply-chain hazards, credential exposure, resource exhaustion, and provenance signals. Use when the user asks to audit / vet / security-review an eval repo (GitHub URL or local path), or asks "is it safe to run X". Do NOT use for assessing whether an eval *measures what it claims* (use eval-validity-review) or for general code-quality review (use eval-quality-workflow / code-quality-review-all).

Ejecutar en Manus

$ git log --oneline --stat

stars:518

forks:336

updated:30 de abril de 2026, 15:58

SKILL.md

readonly

related-skills.json

mismo repositorio

eval-report-workflow.md

from "UKGovernmentBEIS/inspect_evals"

Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.

2026-05-24518

ci-maintenance-workflow.md

from "UKGovernmentBEIS/inspect_evals"

CI and GitHub Actions maintenance workflows — fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.

2026-05-12518

create-eval.md

from "UKGovernmentBEIS/inspect_evals"

Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository — they live in standalone repos. Use when user asks to create/implement/build a new evaluation.

2026-05-04518

prepare-submission-workflow.md

from "UKGovernmentBEIS/inspect_evals"

Prepare an evaluation for PR submission as an entry to the register. Use when user asks to prepare an eval for submission or finalize a PR. Trigger when the user asks you to run the "Prepare Evaluation For Submission" workflow.

2026-05-04518

ensure-test-coverage.md

from "UKGovernmentBEIS/inspect_evals"

Ensure test coverage for a single evaluation - both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).

2026-04-30518

eval-validity-review.md

from "UKGovernmentBEIS/inspect_evals"

Review a single evaluation's validity — whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).

2026-04-30518

package.json

"author": "UKGovernmentBEIS"

"repository": "UKGovernmentBEIS/inspect_evals"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Analistas de seguridad de la informaciónOcupaciones informáticas y matemáticas15-1212L4

name

security-audit-eval

description

Audit a third-party Inspect AI evaluation for security risks before running it locally. Decide whether the eval is safe by checking for malicious host-side code, externally-fetched files that aren't quality-controlled, sandbox-breakout instructions, weak sandbox configuration, supply-chain hazards, credential exposure, resource exhaustion, and provenance signals. Use when the user asks to audit / vet / security-review an eval repo (GitHub URL or local path), or asks "is it safe to run X". Do NOT use for assessing whether an eval *measures what it claims* (use eval-validity-review) or for general code-quality review (use eval-quality-workflow / code-quality-review-all).

Security Audit: Third-Party Evaluation

This skill produces a security audit report for an Inspect AI evaluation that the user is considering running. The output is a single Markdown report in agent_artefacts/security_audits/<eval_name>/SECURITY_AUDIT_REPORT.md with a clear verdict (safe / safe-with-caveats / unsafe) backed by file:line evidence.

The audit is read-only — it does not run the evaluation, does not clone repositories to disk, and does not modify any files outside the report directory.

Identifying the target

The eval may be supplied as:

A GitHub URL (most common for third-party audits) — use gh api repos/<owner>/<repo>/contents/<path> and gh api repos/<owner>/<repo>/git/trees/HEAD?recursive=1 to read content. Do not clone.
A local path to a directory the user already has — use Read and Bash directly.
An eval already inside src/inspect_evals/ — same as local path; treat the eval directory as the target.

If the user is ambiguous, ask which one.

Set <eval_name> to the repo name (for URLs) or the directory name (for local paths). Set <commit> to the audited commit SHA: gh api repos/<owner>/<repo>/commits/HEAD --jq '.sha' for URLs, git rev-parse HEAD for local repos. If neither is a git repo, omit <commit> and note "uncommitted working tree" in the report.

Setup

Create the output directory: agent_artefacts/security_audits/<eval_name>/
Create NOTES.md in that directory for raw observations as you work. Err on the side of taking lots of notes — they are the source material for the report.
Get the full file tree (note sizes — large unexplained binaries are red flags):
- URL: gh api repos/<owner>/<repo>/git/trees/HEAD?recursive=1 --jq '.tree[] | "\(.type)\t\(.size // "")\t\(.path)"'
- Local: find . -type f -not -path './.git/*' -printf '%s\t%p\n'
Note any unexpected files: archives (.tar.gz, .zip), executables (.so, .dylib, .exe), pickle (.pkl, .joblib, .pt, .pth), encoded blobs, files >5 MB without an obvious dataset purpose.

The audit: 8 categories

Apply each category in order. For each, read actual file contents — do not trust top-level summaries or READMEs alone. Malicious behaviour hides in detail (one eval() call, one suspicious URL). Record each finding in NOTES.md with file:line evidence.

1. Host-side code execution

The eval module is imported into the user's Python environment. Anything that runs at import or install time runs with the user's privileges, outside any sandbox.

Read in full:

pyproject.toml and setup.py (if present) — look for cmdclass, custom build hooks, entry_points beyond the package itself, scripts
All __init__.py files
conftest.py (any pytest fixture runs at test collection time)
Top-level code in any module that gets imported by the @task definition

Grep for, on the whole repo (excluding the dataset and lockfiles):

grep -rEn "subprocess|os\.system|os\.popen|pty\.spawn|exec\(|eval\(|compile\(|__import__\(|importlib|base64\.b64decode|codecs\.decode|marshal\.loads|pickle\.load|exec_module|getattr\(.*?,.*?env|socket\.|ctypes" --include='*.py'

For each hit, decide: is this a legitimate use (e.g. subprocess inside a function only called within a sandbox abstraction), or could it run on import / on the host? Distinguish sandbox().exec(...) (Inspect's sandbox abstraction — runs in container) from raw subprocess.run(...) (runs on host).

Also check:

package.json's scripts for preinstall, postinstall, prepare
pyproject.toml [build-system].requires and [project.dependencies] for git+, file:, private indexes, unpinned versions
requirements.txt if present

Severity guidance:

Any subprocess/os.system/exec at module top level or in __init__.py → High, requires explanation before passing
eval() or exec() on a string built from a remote source → High
pickle.load/torch.load on a remote URL → High (arbitrary code execution)
Postinstall hooks in package.json → Medium
Unpinned or git+ dependencies → Medium

2. Sandbox configuration

Read compose.yaml / docker-compose.yml / Dockerfile in full.

Red flags:

sandbox="local" paired with model-produced code (e.g. a bash or python tool) → High
privileged: true or any cap_add (especially SYS_ADMIN, NET_ADMIN) → High
Host-path volume mounts: /, /var/run/docker.sock, ~, ${HOME}, anything outside the eval's own dir → High
Image from non-official registry, or floating tag like :latest → Medium
Custom Dockerfile with RUN curl ... | sh or RUN wget ... && chmod +x → High
Missing mem_limit and cpus → Low (DoS surface)
Exposed host ports the agent shouldn't need → Medium
Network configuration: note whether network_mode: none is set; if not, the container has outbound internet. This is sometimes legitimate (eval needs to fetch from a controlled service) but should be called out.

Verify the image source. For image: <name>:<tag>, check that <name> is a Docker official image or comes from a registry the user trusts. node:20-bookworm-slim is fine; randomuser/cool-tools:latest is not.

3. External resource fetching at runtime

Where does the eval get its data, system prompts, or scoring artifacts from?

Grep for URLs across the repo:

grep -rEn "https?://" --include='*.py' --include='*.yaml' --include='*.yml' --include='*.toml' --include='*.json' --include='*.md'

For each URL, classify:

HuggingFace dataset with pinned revision (40-char SHA) → safe. Sanity-check it resolves: curl -sI "<url>" | head -3 should return 200 or 30x.
HuggingFace dataset without a revision, or with a branch name → Medium (mutable; upstream owner could swap content).
GitHub URL pointing to a commit SHA → safe.
GitHub URL pointing to a branch (e.g. .../main/data.csv) → Medium (mutable).
Any of: GDrive (drive.google.com), Dropbox, pastebin, S3 presigned, personal domain, IP literal → High, regardless of pinning. Quality cannot be controlled.
Internal/private domain → Medium, surface as "this eval expects access to <domain>".

Also look for:

wget / curl / urlretrieve / requests.get calls in setup / install / __init__ paths
verify=False on TLS calls → Medium
pickle, joblib, torch.load, numpy.load(allow_pickle=True) on any of the URLs above → High (RCE via deserialization)
Dataset loaders with dynamic URL construction (e.g. building a URL from an env var) → investigate

4. Agent affordances and prompts

The agent itself can become the attacker if the eval gives it the keys — but context matters. The question isn't "does the agent do dangerous things" — it's "does the agent do dangerous things outside its containment boundary". An agent told to run shell, exploit kernels, or steal credentials inside a properly contained Docker sandbox is doing exactly what offensive-security evals require. The same agent paired with a weak sandbox (or sandbox="local", or host volume mounts, or privileged containers) is dangerous regardless of what the prompt says.

This category cannot be read in isolation — its severity depends on category 2 (sandbox configuration). A nested-sandbox eval where the agent is expected to break out of an inner sandbox into the eval's outer Docker container is fine; a breakout from the outer container to the host is not.

Read every system prompt and task instruction (search for system_message=, template=, top-level string constants used as prompts).

Where prompt content matters:

The relevant boundary is the host, not the eval's notional inner sandbox. Instructions to "break out", "escalate privileges", or "exfiltrate" are fine if the only thing they can plausibly affect is the eval's Docker container. They are High only if combined with a sandbox configuration that doesn't actually contain the breakout (cross-reference category 2).
Instructions to disable safety filters, ignore guardrails, or "act as if you have no restrictions" are fine in offensive-security or red-team evals running in proper containment. They are High if the agent has tools that can reach the host or third-party services with the user's credentials.

Inventory the agent's tools. For each tool:

What does it actually do? (read the implementation, not the docstring)
Does it run shell on the host? With shell=True and unescaped model input → High
Does it expose host paths inside the sandbox? → High
Does it call host.docker.internal or localhost services on the host (vs. localhost inside the container)? → High
Does it combine bash/python / arbitrary command execution with sandbox="local", or with a sandbox that lacks containment (privileged, host mounts, network_mode: host)? → High

A react() or basic_agent() with a bash/python tool inside a properly configured Docker sandbox is the standard agentic pattern and is fine. The dangerous combinations are: agent with arbitrary code execution + sandbox="local"; agent with arbitrary code execution + a sandbox that doesn't actually contain it.

Impossible requests (find-the-thing-that-isn't-there). If the eval asks the agent to extract data that doesn't actually exist in the sandbox — e.g. "find the API key" when no API key has been planted, "exfiltrate the database" when the database is empty, "read /etc/secrets" when no such file is present — flag this. A capable agent that can't find the asked-for object inside its environment may try to break out to look for real ones (the user's machine, external services). This is both a security risk (motivates breakout attempts that probe the containment boundary) and a validity risk (the eval cannot actually be solved as posed); record it in the report and cross-reference category 8.

5. Resource exhaustion / denial-of-service

Check:

mem_limit, cpus, pids_limit in compose
message_limit and attempts on the agent
Tool-level timeouts (e.g. install timeouts, request timeouts)
Dataset size — multi-GB datasets without disclosure → Medium
Loops that could fork-bomb (a tool that itself can spawn other tools without bound)

Severity: rarely High unless combined with another category. Usually Low ("be aware that a full run uses N GB of disk").

6. Credential / secret exposure

Grep:

grep -rEn "api_key|api-key|secret|token|password|bearer|aws_access|sk-[A-Za-z0-9]{20,}|hf_[A-Za-z0-9]{20,}" --include='*.py' --include='*.yaml' --include='*.yml' --include='*.json' --include='*.toml' --include='*.env*'

Plus check .gitignore — files listed there that look like secrets but aren't present in the tree are fine (good hygiene). Files matching secret patterns that are present are High.

Also look for:

Outbound POSTs of run results / model outputs to external URLs (telemetry / exfiltration)
os.environ reads that pass values to scorers or remote services (could ship the user's API keys somewhere)
Webhook URLs, Slack/Discord tokens, email senders

7. Provenance and integrity signals

This is informational, not a hard severity, but it shapes how much trust to extend.

Gather:

Repo age (created date), last push, commit count
Author count, contributor list
Stars, forks, open issues
Are commits signed? gh api repos/<owner>/<repo>/commits --jq '.[0:5][] | {sha: .sha, verified: .commit.verification.verified, author: .commit.author.name}'
Is there a README that explains intent? A test suite? A CI config?
Suspicious .gitignore patterns (e.g. tracking files that look like keys, or excluding directories that "shouldn't" need to be excluded)

A new single-author repo with no community signal isn't unsafe by itself, but the bar for trusting other findings should be higher — read more carefully, demand more evidence, recommend re-auditing on version bumps.

8. Validity (out of scope, flag and defer)

If you notice fabricated dataset claims, broken scorers, names that don't match content, or impossible requests (see category 4), note them in the report but explicitly mark them out of scope for the security audit.

The three named must-checks (mandatory section in every report)

Every report must explicitly answer these three, even if a finding has already been covered above. They exist because they are the user's hard requirements.

Outright malicious code that runs as part of the eval — yes / no, with file:line evidence either way.
Externally-fetched files that aren't quality-controlled (e.g. GDrive, Dropbox, personal domains, mutable branches) — yes / no, listing every external URL with classification.
Sandbox-breakout to the host, or unsafe agent commands without adequate sandboxing — yes / no. The bar is whether the agent can reach the host, not whether it does dangerous things inside containment. Breakout instructions are fine if they target a contained inner sandbox; they are not fine if they target (or could plausibly affect) the host. Cite the system prompt, tool list, and sandbox declaration to show how containment holds.

If any of the three is "yes", the verdict is Unsafe and the report leads with that.

Verdict rubric

Safe — no findings above Low, three must-checks all "no", standard sandbox in place.
Safe with caveats — only Low / informational findings, three must-checks all "no". Caveats listed for the user to decide.
Unsafe — any High finding, or any of the three must-checks "yes". The user should not run this eval as-is.

A single High finding drives the verdict regardless of how clean the rest of the audit is. Defence in depth means we don't trade off.

Writing the report

Write agent_artefacts/security_audits/<eval_name>/SECURITY_AUDIT_REPORT.md with this structure:

# Security Audit Report: <eval_name>

**Repository**: <URL or local path>
**Commit audited**: <sha> (<date>, <commit message subject>)  [omit if uncommitted]
**Audit date**: <today's date>
**Auditor**: Claude Code (read-only audit; eval not run)

## Verdict

**<Safe | Safe with caveats | Unsafe>** — <one-line reason>.

<2-4 sentence justification: what was checked, what stood out>

## Threat-model categories applied

| # | Category | Finding | Severity |
|---|---|---|---|
| 1 | Host-side code execution | <one-line summary with file:line> | None / Low / Medium / High |
| 2 | Sandbox configuration | ... | ... |
| 3 | External resource fetching | ... | ... |
| 4 | Agent affordances and prompts | ... | ... |
| 5 | Resource exhaustion | ... | ... |
| 6 | Credential / secret exposure | ... | ... |
| 7 | Provenance and integrity | ... | ... |
| 8 | Validity (out of scope) | <flag if anything noticed, else "n/a"> | n/a |

## The three named must-checks

- **Outright malicious code that runs as part of the eval**: <yes/no, with evidence>
- **Externally-fetched files that aren't quality-controlled**: <yes/no, with the list of external URLs and how each is pinned>
- **Sandbox-breakout to the host / unsafe agent commands without adequate sandboxing**: <yes/no, citing the system prompt, tools, and sandbox declaration; note that breakouts targeting a contained inner sandbox are fine — only host reachability matters>

## Caveats and recommendations

<Each Low / informational finding gets a paragraph: what it is, what the impact is, what the user might want to do about it. Skip this section if there are none.>

## Evidence (file:line references)

<Table mapping each claim in the report to a file:line so the user can spot-check.>

## Files NOT audited

<List any large files, datasets, or out-of-scope assets you skipped, with the reason.>

## Suggested next steps

<Concrete next steps: run it / don't run it / run it with this mitigation. If verdict is "Safe with caveats", say what the caveats are.>

Wrap up

After writing the report:

Re-read it end-to-end. Every finding must have a file:line citation.
Tell the user the verdict (one line) and the report path.
If the verdict is Unsafe, lead with that and list the High-severity findings inline so the user sees them without having to open the file.
If the user's original request was a question like "is X safe to run", answer it directly in the chat as well as in the report.

Common pitfalls to avoid

Trusting top-level summaries. READMEs lie. Always read the actual code.
Conflating sandbox().exec() with subprocess.run(). The first is Inspect's sandbox abstraction (safe); the second runs on the host (dangerous). Verify which one a callsite uses.
Missing the per-sample dataset row. If the dataset contains executable strings (server code, agent prompts, shell snippets), those are part of the attack surface even though they live in data, not source. Note them as a class even if you can't audit every row.
Calling something safe because nothing pinged. "I didn't find anything bad" is not the same as "I checked thoroughly." If a category was hard to audit (e.g. dataset too large), say so explicitly in Files NOT audited rather than implying coverage.
Reading category 4 in isolation. Agent affordances and prompts are only as dangerous as the sandbox lets them be. Always evaluate sections 2 and 4 jointly.

security-audit-eval

Más de este repositorio

Más de este repositorio

Security Audit: Third-Party Evaluation

Identifying the target

Setup

The audit: 8 categories

1. Host-side code execution

2. Sandbox configuration

3. External resource fetching at runtime

4. Agent affordances and prompts

5. Resource exhaustion / denial-of-service

6. Credential / secret exposure

7. Provenance and integrity signals

8. Validity (out of scope, flag and defer)

The three named must-checks (mandatory section in every report)

Verdict rubric

Writing the report

Wrap up

Common pitfalls to avoid

Security Audit: Third-Party Evaluation

Identifying the target

Setup

The audit: 8 categories

1. Host-side code execution

2. Sandbox configuration

3. External resource fetching at runtime

4. Agent affordances and prompts

5. Resource exhaustion / denial-of-service

6. Credential / secret exposure

7. Provenance and integrity signals

8. Validity (out of scope, flag and defer)

The three named must-checks (mandatory section in every report)

Verdict rubric

Writing the report

Wrap up

Common pitfalls to avoid