behavioral-evals

Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.

Install command

npx skills add https://github.com/google-gemini/gemini-cli --skill behavioral-evals

Copy and paste this command into Claude Code to install the skill

Source

google-gemini/gemini-cli

Stars: 104,900

Forks: 13,989

Updated: April 7, 2026 at 19:26

SKILL.md

name: behavioral-evals
description: Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.

Behavioral Evals

Overview

Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.

🔄 Workflow Decision Tree

Does a prompt/tool change need validation?
  - No -> Use normal integration tests.
  - Yes -> Continue below.
Is it UI/interaction heavy?
  - Yes -> Use appEvalTest (AppRig). See creating.md.
  - No -> Use evalTest (TestRig). See creating.md.
Is it a new test?
  - Yes -> Set the policy to USUALLY_PASSES.
  - No -> Set the policy to ALWAYS_PASSES (locks in the behavior as a regression test).
Are you fixing a failure or promoting a test?
  - Fixing -> See fixing.md.
  - Promoting -> See promoting.md.
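
The first three questions of the tree can be read as a small pure function. The sketch below is illustrative only: `chooseEvalSetup`, `EvalChange`, `EvalSetup`, and their field names are hypothetical and are not part of the repository. Only the names appEvalTest, evalTest, USUALLY_PASSES, and ALWAYS_PASSES come from this guide.

```typescript
// Hypothetical helper: encodes the decision tree as data. Not a real API.
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";
type Helper = "appEvalTest" | "evalTest";

interface EvalChange {
  needsBehavioralValidation: boolean; // does a prompt/tool change need validation?
  uiHeavy: boolean; // is the scenario UI/interaction heavy?
  isNewTest: boolean; // is this a brand-new eval?
}

interface EvalSetup {
  helper: Helper;
  policy: Policy;
}

// Returns null when normal integration tests are sufficient.
function chooseEvalSetup(change: EvalChange): EvalSetup | null {
  if (!change.needsBehavioralValidation) {
    return null;
  }
  return {
    helper: change.uiHeavy ? "appEvalTest" : "evalTest",
    policy: change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}

const setup = chooseEvalSetup({
  needsBehavioralValidation: true,
  uiHeavy: false,
  isNewTest: true,
});
console.log(setup);
```

The fourth question (fixing versus promoting) is a pointer to a procedure rather than a configuration choice, so it is left out of the function.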

📋 Quick Checklist

1. Set Up Workspace

Seed the workspace with the files the scenario needs, using the files object, so the test simulates a realistic project (e.g., a Node.js project with a package.json).

Details in creating.md
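
As a rough illustration of seeding, the sketch below assumes the files object maps workspace-relative paths to file contents. That shape is an assumption; creating.md describes what the eval helpers actually accept. The project contents and the `validateSeed` helper are hypothetical.

```typescript
// Assumption: `files` maps workspace-relative paths to file contents.
// The exact shape accepted by the eval helpers is documented in creating.md.
const files: Record<string, string> = {
  "package.json": JSON.stringify(
    { name: "sample-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.js": "export const add = (a, b) => a + b;\n",
  "README.md": "# sample-app\n",
};

// Hypothetical sanity check: confirms the seed contains a parseable manifest
// before it is handed to a test.
function validateSeed(seed: Record<string, string>): string[] {
  const problems: string[] = [];
  if (!("package.json" in seed)) {
    problems.push("missing package.json");
  } else {
    try {
      JSON.parse(seed["package.json"]);
    } catch {
      problems.push("package.json is not valid JSON");
    }
  }
  return problems;
}

console.log(validateSeed(files));
```

Keeping the seed small and valid makes it easier to attribute a failing eval to the agent's decisions rather than to a broken fixture.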

2. Write Assertions

Audit agent decisions using rig.setBreakpoint() (AppRig only), or by verifying the index (position) of the expected tool calls in the logs returned by rig.readToolLogs().

Details in creating.md
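
Index verification can be pictured as comparing the positions of tool calls in the log. The sketch below works on a hypothetical `ToolLogEntry` shape and made-up tool names (`read_file`, `write_file`). The real structure returned by rig.readToolLogs() may differ, so treat this as an outline of the idea, not as the rig's API.

```typescript
// Hypothetical shape of one tool-log entry; the real structure returned by
// rig.readToolLogs() may differ (see creating.md).
interface ToolLogEntry {
  toolName: string;
  args?: Record<string, unknown>;
}

// Returns the index of the first call to `toolName`, or -1 if it was never called.
function firstCallIndex(logs: ToolLogEntry[], toolName: string): number {
  return logs.findIndex((entry) => entry.toolName === toolName);
}

// True only if both tools were called and `earlier` first appears before `later`.
function calledBefore(
  logs: ToolLogEntry[],
  earlier: string,
  later: string,
): boolean {
  const a = firstCallIndex(logs, earlier);
  const b = firstCallIndex(logs, later);
  return a !== -1 && b !== -1 && a < b;
}

// Example data standing in for real tool logs.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file", args: { path: "package.json" } },
  { toolName: "write_file", args: { path: "src/index.js" } },
];

console.log(calledBefore(sampleLogs, "read_file", "write_file"));
```

Asserting on order rather than on exact log contents keeps the check focused on the decision being evaluated (for example, reading before writing).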

3. Verify

Run individual tests locally with Vitest, and confirm they pass consistently before relying on CI workflows.

See evals/README.md for running commands.

📦 Bundled Resources

Detailed procedural guides:

creating.md: Assertion strategies, Rig selection, Mock MCPs.
fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
promoting.md: Candidate identification criteria and threshold guidelines.
