judge-trajectory

Updated April 10, 2026 at 15:57

Evaluate agent trajectories, tool-use traces, intermediate artifacts, runtime failures, and side effects when process quality is part of the verdict. Use this whenever the judging task involves agent runs, tool calls, planning quality, web or code traces, process reliability, or any comparison where the final answer alone is not enough, even if the user only says judge the run, compare traces, or evaluate the agent workflow.

Source

Tyler-R-Kendrick

Tyler-R-Kendrick/copilot-auto-training

SKILL.md

name	judge-trajectory
description	Evaluate agent trajectories, tool-use traces, intermediate artifacts, runtime failures, and side effects when process quality is part of the verdict. Use this whenever the judging task involves agent runs, tool calls, planning quality, web or code traces, process reliability, or any comparison where the final answer alone is not enough, even if the user only says judge the run, compare traces, or evaluate the agent workflow.
license	MIT
compatibility	Works in agents that support the Agent Skills standard. The `judge` agent should load this skill through the agent-skills MCP server when process evidence matters.
metadata	{"author":"your-org","version":"0.1.0"}

Trajectory Judging

Use this skill when the decisive evidence lives in the process: plans, tool calls, intermediate artifacts, failure handling, gathered evidence, or side effects.

Read references/trajectory-techniques.md when you need the benchmark rationale for process-aware judging, trajectory rubrics, verifier-backed evidence gathering, and skepticism toward narrated reasoning.

When to use this skill

Agent-run evaluation where trajectories, tool traces, or intermediate artifacts matter.
Web, code, RAG, or tool-use comparisons where process reliability is part of the verdict.
Candidate comparisons where runtime failures, evidence-gathering quality, or side effects should influence the score.

Do not use this skill on its own for clean, outcome-only response comparisons. In those cases, switch to an outcome-focused judging contract instead.

Required inputs

Candidate trajectories, tool traces, transcripts, failure logs, or intermediate artifacts.
The baseline trajectory when available.
Any task contract, reference, criteria, benchmark notes, or end-state artifacts needed to judge whether the process achieved the right goal.
Outcome artifacts when the verdict needs both process and end-state quality.

Trajectory judging workflow

Lock a process-aware rubric before judging. Keep it to 4 to 7 dimensions and define explicit pass, partial, or fail boundaries.
Build a trajectory evidence ledger from plans, tool calls, retrieved evidence, intermediate artifacts, side effects, failure logs, and final outputs.
Score every candidate against the same rubric. Make process quality first-class evidence instead of treating it as optional commentary.
Separate operational failures from quality failures. Placeholder mismatches, broken tool usage, missing evidence gathering, and unhandled exceptions should appear explicitly in the verdict.
Use final outcomes as one dimension, not the whole verdict, when process quality clearly matters.
Run a robustness check before finalizing. Watch for order effects, benchmark overfitting, and unjustified trust in chain-of-thought or self-explanations.
Return a concise decision package that preserves the process rubric, decisive trace evidence, rejected-candidate failure modes, and calibrated confidence.
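
The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's implementation: the names `LedgerEntry`, `Candidate`, `score`, `operational_failures`, `pick_winner`, and `order_robust` are hypothetical, and the 2/1/0 point mapping for pass, partial, and fail is an assumed convention. The sketch keeps operational failures in the ledger as their own flag so they surface explicitly instead of being folded into a quality score, and it treats the order-effect check as re-running any judge callable on the reversed candidate list.

```python
from dataclasses import dataclass, field

# Assumed numeric mapping for the pass / partial / fail boundaries.
LEVEL_POINTS = {"pass": 2, "partial": 1, "fail": 0}


@dataclass
class LedgerEntry:
    step: int                          # position in the trace
    kind: str                          # e.g. "plan", "tool_call", "artifact", "failure"
    summary: str                       # what the trace shows, not the agent's own narration
    operational_failure: bool = False  # broken tool usage, unhandled exception, ...


@dataclass
class Candidate:
    name: str
    ledger: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)  # dimension -> "pass" | "partial" | "fail"


def score(candidate, rubric):
    """Sum rubric points; refuse to score if any locked dimension is missing."""
    missing = [d for d in rubric if d not in candidate.scores]
    if missing:
        raise ValueError(f"{candidate.name} is unscored on: {missing}")
    return sum(LEVEL_POINTS[candidate.scores[d]] for d in rubric)


def operational_failures(candidate):
    """List operational failures separately so the verdict can name them."""
    return [e.summary for e in candidate.ledger if e.operational_failure]


def pick_winner(candidates, rubric):
    """Highest rubric total wins; ties go to fewer operational failures."""
    return max(
        candidates,
        key=lambda c: (score(c, rubric), -len(operational_failures(c))),
    ).name


def order_robust(judge, candidates):
    """True if `judge` names the same winner for forward and reversed order."""
    return judge(candidates) == judge(list(reversed(candidates)))
```

A caller would lock `rubric` as a tuple of dimension names first, fill each candidate's ledger and scores from the traces, and only then call `pick_winner`. Note that an exact tie on both keys is resolved by list position here, which is precisely the kind of case `order_robust` is meant to flag.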

Default trajectory dimensions

Plan suitability.
Evidence gathering or retrieval quality.
Tool correctness and tool sequencing.
Failure handling and recovery.
Side effects or state-change quality.
Final outcome quality.
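
One way to hold these defaults is as a locked rubric object. The sketch below is illustrative only: the `Rubric` class, the `DEFAULT_DIMENSIONS` and `LEVELS` names, and the slug-style identifiers are assumptions, not part of the skill or its template asset. It enforces the 4-to-7 dimension limit and the pass, partial, and fail levels named in the workflow, and it is frozen so the rubric cannot drift once judging starts.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the six default dimensions listed above.
DEFAULT_DIMENSIONS = (
    "plan_suitability",
    "evidence_gathering",
    "tool_correctness_and_sequencing",
    "failure_handling_and_recovery",
    "side_effects",
    "final_outcome",
)

LEVELS = ("pass", "partial", "fail")


@dataclass(frozen=True)
class Rubric:
    """Immutable rubric: attributes cannot be reassigned after creation."""
    dimensions: tuple = DEFAULT_DIMENSIONS

    def __post_init__(self):
        if not 4 <= len(self.dimensions) <= 7:
            raise ValueError("rubric must have 4 to 7 dimensions")
        if len(set(self.dimensions)) != len(self.dimensions):
            raise ValueError("rubric dimensions must be unique")

    def validate(self, scores):
        """Check that every dimension is scored with a known level."""
        missing = sorted(set(self.dimensions) - set(scores))
        if missing:
            raise ValueError(f"unscored dimensions: {missing}")
        bad = {d: s for d, s in scores.items() if s not in LEVELS}
        if bad:
            raise ValueError(f"unknown levels: {bad}")
```

Calling `Rubric()` yields the six defaults; passing a custom tuple of dimension names works the same way as long as it stays within the 4-to-7 range.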

Output package

Selected candidate and margin.
Locked trajectory rubric.
Decisive trace evidence summary.
Main weaknesses and concrete failure modes in rejected candidates.
Confidence or uncertainty note.
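
The five items above map naturally onto a small record type. The following is a hedged sketch rather than a prescribed schema: `DecisionPackage`, `build_package`, and every field name are hypothetical, and defining the margin as the winner's rubric total minus the runner-up's is an assumption, since the skill does not say how the margin is computed.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class DecisionPackage:
    selected: str                  # selected candidate
    margin: int                    # assumed: winner total minus runner-up total
    rubric: tuple                  # the locked trajectory rubric
    decisive_evidence: list        # short trace-evidence summaries
    rejected_failure_modes: dict   # rejected candidate -> list of failure modes
    confidence: str                # confidence or uncertainty note


def build_package(totals, rubric, decisive_evidence, rejected_failure_modes, confidence):
    """Assemble the package from per-candidate rubric totals (name -> int)."""
    if len(totals) < 2:
        raise ValueError("need at least two candidates to report a margin")
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (winner, top), (_, runner_up) = ranked[0], ranked[1]
    return DecisionPackage(
        selected=winner,
        margin=top - runner_up,
        rubric=tuple(rubric),
        decisive_evidence=list(decisive_evidence),
        # Keep failure modes only for candidates that were not selected.
        rejected_failure_modes={
            name: modes
            for name, modes in rejected_failure_modes.items()
            if name != winner
        },
        confidence=confidence,
    )
```

Because the record is a plain dataclass, `json.dumps(asdict(package))` gives a serializable form that can be stored alongside the traces it summarizes.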

Assets

assets/trajectory-rubric-template.md provides a reusable rubric and evidence-ledger template for process-aware judging.
