with one click
skillsbench
SkillsBench contribution workflow. Use when: (1) Creating benchmark tasks, (2) Understanding repo structure, (3) Preparing PRs for task submission.
Menu
SkillsBench contribution workflow. Use when: (1) Creating benchmark tasks, (2) Understanding repo structure, (3) Preparing PRs for task submission.
| name | skillsbench |
| description | SkillsBench contribution workflow. Use when: (1) Creating benchmark tasks, (2) Understanding repo structure, (3) Preparing PRs for task submission. |
Benchmark evaluating how well AI agents use skills.
# 1. Install and initialize a task
# v1.1 docs target BenchFlow 0.6.2; uv installs the current stable CLI.
uv tool install benchflow
bench tasks init <task-id>
# 2. Write files (see CONTRIBUTING.md for templates)
# 3. Validate
bench tasks check tasks/my-task
bench eval run --tasks-dir tasks/my-task --agent oracle --sandbox docker # Must pass 100%
# 4. Test with agent (with skills)
bench eval run --tasks-dir tasks/my-task --agent claude-agent-acp \
--model <model> --skill-mode with-skill \
--skills-dir tasks/my-task/environment/skills/
# 5. Test WITHOUT skills
bench eval run --tasks-dir tasks/my-task --agent claude-agent-acp \
--model <model> --skill-mode no-skill
# 6. Submit PR (see PR template)
task.md prompt body and oracle/solve.sh must be human-authoredSkillsBench task authoring — walk a contributor from idea to submission-ready task following CONTRIBUTING.md and the task-implementation rubric. Use when the user wants to create a new SkillsBench task, scaffold a task from an existing workflow (notebook, Excel workbook, document, dataset), convert a prompt or a benchmark item into a SkillsBench task, write skills for a task, or prepare a SkillsBench PR. Pairs with `task-review` (run that as a self-check before submitting).
SkillsBench task PR review — classifies the task track (standard / research / multimodal), runs static policy checks against the track-specific rubric, benchmarks the task across oracle plus Claude and Codex (with and without skills), audits trajectories for cheating and skill invocation, and produces a `pr-N-task-timestamp-run.txt` review report alongside a `prN.zip` bundle of trajectories. Use when reviewing a SkillsBench task PR (by number, branch, or local task path), when the user asks to review a task, run benchmarks on a PR, audit a submission, classify a task as research or multimodal track, or prepare a comment to post on a SkillsBench PR.
Methodology for clause-by-clause review of a contract against a structured deviation policy ("playbook"). Covers how to walk a playbook, locate the matching provision in the contract, apply rule types (max-value, must-be-present, must-be-absent, acceptable-set, must-have-feature), classify the result (ok / risk / reject), choose the prescribed action, and ground each finding in a verbatim excerpt. Use whenever reviewing any contract — NDA, MSA, vendor DD questionnaire, lease, DPA — against a structured rules-based playbook.
Reference for the standard clauses found in commercial non-disclosure agreements (mutual and one-way) — what each clause does, the surface forms it appears in, and how to recognise it in unfamiliar drafting. Use when reviewing, comparing, or extracting provisions from any confidentiality / NDA / mutual NDA / standstill-and-confidentiality agreement.
Read Microsoft Excel (.xlsx) files robustly with `openpyxl` (or `pandas`). Covers multi-sheet workbooks, header rows, empty cells, merged cells, comma-separated list cells, and converting a sheet to a list-of-dicts the rest of your code can consume. Use when a task input or reference document is an `.xlsx` file rather than JSON/CSV.
Build unified multi-level category taxonomy from hierarchical product category paths from any e-commerce companies using embedding-based recursive clustering with intelligent category naming via weighted word frequency analysis.