trustworthy-agents-framework

Five-principle framework for building and governing trustworthy AI agents. Covers human control, value alignment, secure interactions, transparency, and privacy in agent architecture design.

Install command

npx skills add https://github.com/hiyenwong/ai_collection --skill trustworthy-agents-framework

Copy and paste this command into Claude Code to install the skill

Source

hiyenwong/ai_collection

Stars: 1

Forks: 0

Updated: June 4, 2026 at 02:00

SKILL.md

name	trustworthy-agents-framework
description	Five-principle framework for building and governing trustworthy AI agents. Covers human control, value alignment, secure interactions, transparency, and privacy in agent architecture design.

Overview

Comprehensive framework for designing, deploying, and governing trustworthy AI agents. Establishes five core principles that must be satisfied for AI agents to be considered trustworthy: human control, value alignment, secure interactions, transparency, and privacy. Addresses prompt injection risks, tool-use security, and multi-agent coordination safety.

Architecture

Agent Model: Core LLM with safety guardrails and constitutional constraints
Agent Harness: Execution environment managing tool access, state, and action logging
Tools: External APIs and functions with permission-based access control
Environment: Sandboxed execution context preventing unauthorized data access
Agent Behavior Layers:
- Self-directed loop: plans, acts, observes, adjusts, repeats
- Behavior depends on model + harness + tools + environment working together
Five Core Principles:
- Human Control: Humans retain ultimate decision authority
  - Permission tiers: always allow, needs approval, block
  - Plan Mode: shows intended plan for review before execution
- Value Alignment: Agent behavior consistent with human values
  - Training on ambiguous situations reinforces pausing over assuming
  - Constitution reinforces "raising concerns, seeking clarification, or declining to proceed"
- Secure Interactions: Protection against prompt injection, tool misuse, data exfiltration
  - Multi-layer defenses: training, monitoring, red-teaming
- Transparency: Agent actions and reasoning are auditable and explainable
- Privacy: Agent respects data boundaries and minimizes information exposure
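
The permission tiers and Plan Mode listed under Human Control can be sketched as a small gate inside the harness. This is a minimal illustration, not the framework's implementation: the `Harness`, `ToolCall`, and `Tier` names, the policy table, and the `approve` callback are all hypothetical, and defaulting unknown tools to "block" is an assumption made here for least privilege.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Tier(Enum):
    ALWAYS_ALLOW = "always allow"
    NEEDS_APPROVAL = "needs approval"
    BLOCK = "block"


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class Harness:
    # Hypothetical policy table: tool name -> permission tier.
    policy: Dict[str, Tier]
    # Assumed interface: asks a human to approve one call, returns True or False.
    approve: Callable[[ToolCall], bool]
    log: List[str] = field(default_factory=list)

    def tier_for(self, call: ToolCall) -> Tier:
        # Tools missing from the policy are treated as blocked (assumption).
        return self.policy.get(call.tool, Tier.BLOCK)

    def review_plan(self, plan: List[ToolCall]) -> List[str]:
        # Plan Mode: describe each intended step and its tier without executing.
        return [f"{c.tool} -> {self.tier_for(c).value}" for c in plan]

    def execute(self, call: ToolCall, tools: Dict[str, Callable[..., object]]):
        tier = self.tier_for(call)
        if tier is Tier.BLOCK:
            self.log.append(f"blocked {call.tool}")
            raise PermissionError(f"{call.tool} is blocked")
        if tier is Tier.NEEDS_APPROVAL and not self.approve(call):
            self.log.append(f"denied {call.tool}")
            raise PermissionError(f"{call.tool} was not approved")
        self.log.append(f"ran {call.tool}")
        return tools[call.tool](**call.args)
```

In this sketch every outcome (ran, denied, blocked) is appended to `log`, so the same gate that enforces human control also feeds the transparency principle.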

Key Findings

Prompt injection remains the highest-risk attack vector for agent deployment
Agent behavior depends on all four layers (model, harness, tools, environment) working together
Claude checks in roughly twice as often on complex tasks as on simple tasks
Tool-use security requires explicit permission models, not implicit trust
Multi-agent systems introduce emergent risks not present in single-agent designs
Open standards like MCP (Model Context Protocol, donated to Linux Foundation) improve interoperability but expand attack surface
Auditability must be built into agent architecture, not bolted on after deployment

Ecosystem Recommendations

Benchmarks: Rigorous, standardized ways to compare agent systems on prompt injection resistance and uncertainty surfacing
Evidence Sharing: Publishing how agents are used and where they struggle
Open Standards: Protocols like MCP (Model Context Protocol, donated to Linux Foundation) improve interoperability but expand attack surface

Runtime Risk Management (Pre-Action Safety Layer)

See the actuarial-runtime-ai-agents skill for a mathematical framework that operationalizes the "Secure Interactions" principle at the action level:

Pre-action risk tolls: Every side-effect-bearing action carries a counterfactual risk toll computed against a safe default, replacing post-hoc liability with pre-transaction underwriting
Budget gating theorem: Translates risk tolerance into executed-action budget guarantees with high-probability bounds
No-splitting property: Prevents agents from gaming the system by decomposing large actions into smaller sub-actions
Underwriting boundary design: Determines the gaming-resistance of the entire system

This connects governance-level principles (human control, value alignment) to mathematical runtime guarantees on individual agent actions.
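
The idea of charging a toll before an action runs can be illustrated with a toy budget gate. This is only a sketch under stated assumptions: the formulas of the actuarial-runtime-ai-agents skill are not reproduced in this document, so `risk_toll`, `BudgetGate`, and the linear `expected_loss` model below are hypothetical stand-ins. The toll is taken as the expected loss of acting minus the expected loss of the safe default, and the budget is cumulative.

```python
from dataclasses import dataclass
from typing import List


def risk_toll(action_loss: float, default_loss: float) -> float:
    # Counterfactual toll: extra expected loss relative to the safe default.
    return max(0.0, action_loss - default_loss)


@dataclass
class BudgetGate:
    budget: float        # total risk the operator is willing to underwrite
    spent: float = 0.0   # tolls charged so far

    def try_execute(self, toll: float) -> bool:
        # Charge the toll before the action runs; refuse if it would exceed the budget.
        if self.spent + toll > self.budget:
            return False
        self.spent += toll
        return True


def expected_loss(amount: float) -> float:
    # Hypothetical loss model: 1% of the transferred amount.
    return amount / 100


gate = BudgetGate(budget=8.0)

# One transfer of 1000 carries a toll of 10.0, which exceeds the budget.
whole: bool = gate.try_execute(risk_toll(expected_loss(1000), 0.0))

# Splitting it into four transfers of 250 (toll 2.5 each) does not help:
# the tolls add up to the same 10.0, so the fourth piece is refused.
parts: List[bool] = [
    gate.try_execute(risk_toll(expected_loss(250), 0.0)) for _ in range(4)
]
```

Here the split is caught only because the toy toll is additive and the budget is tracked across actions. A toll that shrank faster than the action size would let an agent slip a large action through in pieces, which is the failure the no-splitting property is meant to rule out.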

Methodology Steps

Threat Modeling: Identify attack vectors specific to agent deployment context
Principle Mapping: Map each of the five trustworthiness principles to concrete implementation requirements
Architecture Design: Design agent model, harness, tools, and environment with security boundaries
Access Control: Implement least-privilege tool access with explicit permission grants
Audit Logging: Build comprehensive action and reasoning logging into agent harness
Red Team Testing: Systematically test for prompt injection, tool misuse, and data leakage
Deployment Monitoring: Continuously monitor agent behavior for drift from trustworthiness principles
Incident Response: Establish procedures for agent behavior rollback and human takeover
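
For the Audit Logging step, one way to make an action log tamper-evident is to chain each entry to the hash of the previous one. The sketch below is an assumption about how a harness might do this, not something the framework prescribes: the `AuditLog` class, its field names, and the all-zero genesis hash are hypothetical, and timestamps are left out to keep the example deterministic.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Hash the previous entry's hash together with a canonical JSON payload.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self.entries: List[dict] = []

    def record(self, actor: str, action: str, reasoning: str) -> dict:
        # Append an entry whose hash covers both its content and its predecessor.
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        payload = {"actor": actor, "action": action, "reasoning": reasoning}
        entry = {**payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute the chain; any edited or reordered entry breaks it.
        prev_hash = GENESIS
        for e in self.entries:
            payload = {k: e[k] for k in ("actor", "action", "reasoning")}
            if e["prev"] != prev_hash or e["hash"] != _digest(prev_hash, payload):
                return False
            prev_hash = e["hash"]
        return True


log = AuditLog()
log.record("agent", "read_file(report.txt)", "user asked for a summary")
log.record("agent", "send_email(draft)", "approved by reviewer")
intact = log.verify()

log.entries[0]["action"] = "read_file(secrets.txt)"  # simulate tampering
after_tamper = log.verify()
```

Recording the reasoning next to each action keeps the log useful for the transparency principle as well as for incident response, since a reviewer can see why a step was taken and confirm the record was not altered afterwards.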

Applications

Enterprise AI agent deployment
Multi-agent system governance
Agent security assessment
Tool-use safety design
AI agent compliance and auditing
Prompt injection defense

Code Availability

Framework based on Anthropic research. MCP standard is open source.

Activation Keywords

trustworthy agents, AI governance, prompt injection, human control, value alignment, secure interactions, agent transparency, agent privacy, MCP, agent security, multi-agent safety

