trajectory-review

Name: Trajectory Review
Author: majiayu000

// Post-hoc diagnosis of a failed agent trajectory. Classifies the first unrecoverable step into one of nine failure categories (plan adherence, hallucinated information, invalid tool call, misread tool output, intent–plan mismatch, under-specified intent, unsupported intent, guardrail trigger, system failure) and produces an evidence-backed root-cause report.

stars: 20

forks: 2

updated: May 22, 2026 at 12:10

package.json

"author": "majiayu000"

"repository": "majiayu000/vibeguard"

name	trajectory-review
description	Post-hoc diagnosis of a failed agent trajectory. Classifies the first unrecoverable step into one of nine failure categories (plan adherence, hallucinated information, invalid tool call, misread tool output, intent–plan mismatch, under-specified intent, unsupported intent, guardrail trigger, system failure) and produces an evidence-backed root-cause report.

Trajectory Review

Overview

When an agent run fails, the failure mode is rarely "the model is bad". It is usually one of a small set of recurring problems on the trajectory: the agent skipped a planned step, invented a fact, called a tool wrong, misread a tool's output, or pursued the wrong subgoal entirely. Output-only review cannot distinguish these — they all surface as "the answer was wrong".

This skill takes a captured trajectory (tool calls, intermediate outputs, final response), locates the first unrecoverable step, classifies it into one of nine categories, and reports the root cause with citations into the trajectory.

The taxonomy and four-stage diagnostic procedure are adapted from Microsoft Research's AgentRx framework (2026-04). The classes themselves are stable across agent stacks; the diagnostic stages are how this skill operates inside a Claude Code or Codex session.

When to Activate

An agent run produced a wrong or incomplete result and you have the trajectory.
A user reports "the agent is broken" and a postmortem is needed.
A regression appeared after a model upgrade and you need to know whether it is a model issue or a harness issue.
A new capability shipped and you want to characterize the failure modes that remain.
A user says "review the trajectory", "diagnose this run", or "why did the agent fail".

Do not use this skill to evaluate a passing run. For "evals pass but I don't trust it", use the W-18 three-axis evaluation framing instead.

The nine failure categories

| ID | Category | Recognition signal |
|----|----------|--------------------|
| F1 | Plan adherence failure | A required step in the stated plan is missing from the trajectory, or an unplanned step appears |
| F2 | Hallucinated information | The trajectory cites a fact, file, function, or value that the tool outputs and prior context never produced |
| F3 | Invalid tool invocation | A tool call has malformed arguments, wrong types, missing required fields, or an unsupported method |
| F4 | Misread tool output | The tool returned correctly, but the agent's next step uses a value that is not in the output, or interprets a list as a single item |
| F5 | Intent–plan mismatch | The plan addresses a different goal than the user's request (e.g. user asks to debug, agent plans to refactor) |
| F6 | Under-specified intent | The user's request lacks information the agent needs; the agent guesses rather than asking |
| F7 | Unsupported intent | No available tool can do what the user wants and the agent does not say so |
| F8 | Guardrail triggered | A safety, permissions, or rate-limit guardrail blocked the action and the agent did not surface that |
| F9 | System failure | An external endpoint, network call, or runtime crashed and the agent treated the empty response as a valid one |

The same trajectory can show multiple categories. The skill reports them all but identifies which one is the first unrecoverable step — the point past which the run could not have produced the right answer regardless of what came after.
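As a rough illustration of "report them all, but single out the first unrecoverable one", the sketch below encodes the nine categories as an enum and picks the earliest finding flagged as unrecoverable. The `Finding` record and the `first_unrecoverable` helper are hypothetical names introduced here, not part of the skill.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    """The nine categories from the table above."""
    F1 = "Plan adherence failure"
    F2 = "Hallucinated information"
    F3 = "Invalid tool invocation"
    F4 = "Misread tool output"
    F5 = "Intent-plan mismatch"
    F6 = "Under-specified intent"
    F7 = "Unsupported intent"
    F8 = "Guardrail triggered"
    F9 = "System failure"


@dataclass(frozen=True)
class Finding:
    """Hypothetical record: one observed problem on the trajectory."""
    step_index: int
    failure_class: FailureClass
    unrecoverable: bool  # True if the run could not succeed past this step


def first_unrecoverable(findings: list[Finding]) -> Optional[Finding]:
    """Return the earliest unrecoverable finding, or None if there is none."""
    candidates = [f for f in findings if f.unrecoverable]
    if not candidates:
        return None
    return min(candidates, key=lambda f: f.step_index)
```

All findings stay in the report; only the one returned by `first_unrecoverable` is named as the root cause.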

Four-stage diagnostic procedure

Stage 1 — Trajectory normalization

Convert whatever was captured (chat transcript, tool-call log, JSONL events, screen recording transcript) into a uniform sequence of (step_index, role, action, payload, observed_output) records. If part of the trajectory is missing (for example, the user provided only the final answer), say so and stop. Do not invent the missing trajectory.
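A minimal sketch of this normalization for the JSONL case might look like the following. The event field names (`role`, `action`, `payload`, `output`) and the `final_answer` action label are assumptions made for illustration; a real capture format will differ and needs its own mapping.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepRecord:
    step_index: int
    role: str
    action: str
    payload: Any
    observed_output: Optional[Any]


class MissingTrajectoryError(ValueError):
    """Raised when there is nothing to diagnose beyond a final answer."""


def normalize_jsonl(text: str) -> list[StepRecord]:
    """Turn JSONL events (assumed field names) into uniform step records."""
    records: list[StepRecord] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        records.append(
            StepRecord(
                step_index=len(records) + 1,
                role=event["role"],
                action=event["action"],
                payload=event.get("payload"),
                observed_output=event.get("output"),
            )
        )
    # Stop instead of inventing steps: an empty capture or a capture that
    # holds only the final answer cannot be diagnosed.
    if all(r.action == "final_answer" for r in records):
        raise MissingTrajectoryError("only the final answer was captured")
    return records
```

Raising an error rather than returning an empty list keeps the "say so and stop" rule explicit for the caller.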

Stage 2 — Constraint synthesis

For each tool used in the trajectory, restate the contract the tool enforced or should have enforced: required arguments, allowed values, declared post-conditions. Source these from the tool's schema if available, otherwise from the project's AGENTS.md / CLAUDE.md declarations.

For the user's request, restate the goal as a checklist of intermediate states the trajectory must reach.

This stage is where most diagnoses become possible — once the contracts are explicit, F3, F4, F8, and F9 become mechanical to detect.
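To show why explicit contracts make F3 mechanical to detect, here is a small sketch of a contract record and a call checker. The `ToolContract` shape and the `check_call` helper are hypothetical; in practice the contract would be derived from the tool's real schema or from the AGENTS.md / CLAUDE.md declarations.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolContract:
    """Hypothetical restatement of what a tool requires."""
    name: str
    required_args: dict[str, type]  # argument name -> expected type
    allowed_values: dict[str, tuple] = field(default_factory=dict)


def check_call(contract: ToolContract, args: dict[str, Any]) -> list[str]:
    """Return the contract violations for one tool call (empty list = ok)."""
    violations: list[str] = []
    for arg, expected in contract.required_args.items():
        if arg not in args:
            violations.append(f"missing required argument: {arg}")
        elif not isinstance(args[arg], expected):
            violations.append(
                f"wrong type for {arg}: expected {expected.__name__}, "
                f"got {type(args[arg]).__name__}"
            )
    for arg, allowed in contract.allowed_values.items():
        if arg in args and args[arg] not in allowed:
            violations.append(f"value not allowed for {arg}: {args[arg]!r}")
    return violations
```

A non-empty result from `check_call` is direct evidence for an F3 candidate at that step; post-conditions would need a similar check against the observed output.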

Stage 3 — Guarded evaluation (per step)

Walk the trajectory step by step. For each step, evaluate it against:

- the prior step's observed output (does this step depend on a value that was actually produced?)
- the tool contract (does this call respect the schema?)
- the plan declared earlier in the trajectory (does this step appear in the plan, or is it unplanned?)
- the user's goal checklist (does this step advance any required intermediate state?)

Mark each step as ok | warn | fail, with the specific check that failed. Do not jump ahead; the first fail is the first unrecoverable step.
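The walk described in this stage can be sketched as a loop that runs every check on every step, keeps the worst outcome, and stops at the first fail. The `Step` and `Verdict` types and the check signature are assumptions for illustration; the checks themselves would be supplied by the reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    index: int
    action: str
    args: dict


@dataclass
class Verdict:
    index: int
    result: str  # "ok", "warn", or "fail"
    failed_check: Optional[str]  # name of the check behind a warn or fail


# A check sees the current step and all prior steps, and returns
# "ok", "warn", or "fail".
Check = Callable[[Step, list], str]

SEVERITY = {"ok": 0, "warn": 1, "fail": 2}


def evaluate(steps: list[Step], checks: dict[str, Check]) -> list[Verdict]:
    """Evaluate steps in order and stop at the first fail."""
    verdicts: list[Verdict] = []
    for i, step in enumerate(steps):
        worst, worst_name = "ok", None
        for name, check in checks.items():
            outcome = check(step, steps[:i])
            if SEVERITY[outcome] > SEVERITY[worst]:
                worst, worst_name = outcome, name
        verdicts.append(Verdict(step.index, worst, worst_name))
        if worst == "fail":
            break  # the first fail is the first unrecoverable step
    return verdicts
```

Stopping at the first fail is a deliberate choice in this sketch: steps after it are downstream consequences and are not evaluated.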

Stage 4 — Classification and root-cause attribution

For the first fail step, assign an F-class. Cite:

- the step index
- the failed check from stage 3
- the contract or plan element that was violated
- one or two earlier steps that contributed (for example, an F4 misread is often caused by a prior over-summarization)

If the first fail is genuinely a system-level fault (F9), say so without escalating to a deeper class. The bias here matters: classifying everything as "model hallucination" hides harness bugs.
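One way to keep an attribution honest is to make the citation fields mandatory in the data structure itself. The `RootCause` record below is a hypothetical sketch that rejects a class outside F1 to F9, an empty evidence string, more than two contributing steps, or a contributing step that is not earlier than the failing one.

```python
from dataclasses import dataclass, field

VALID_CLASSES = {f"F{i}" for i in range(1, 10)}  # "F1" .. "F9"


@dataclass
class RootCause:
    failure_class: str
    step_index: int
    failed_check: str
    violated_element: str  # contract or plan element that was violated
    evidence: str  # verbatim citation from the trajectory
    contributing_steps: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.failure_class not in VALID_CLASSES:
            raise ValueError(f"unknown failure class: {self.failure_class}")
        if not self.evidence.strip():
            raise ValueError("a classification needs a citation")
        if len(self.contributing_steps) > 2:
            raise ValueError("cite at most two contributing steps")
        if any(s >= self.step_index for s in self.contributing_steps):
            raise ValueError("contributing steps must precede the failed step")
```

The record does not decide the class; it only refuses to hold one that lacks the evidence this stage asks for.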

Output format

# Trajectory review — <run id or label>

## Trajectory
- captured stages: <plan | tool calls | outputs | final answer> — <complete | partial>
- step count: N

## Tool contracts (synthesized)
- <tool name>: <args>, <constraints>, <post-conditions>
- ...

## Goal checklist
1. <required intermediate state>
2. ...

## Step-by-step evaluation
| step | action | check | result |
|------|--------|-------|--------|
| 1 | ... | ... | ok |
| 2 | ... | ... | warn |
| 3 | ... | ... | fail (first unrecoverable) |

## Root cause
- Class: <F1–F9>
- First unrecoverable step: <index>
- Failed check: <name>
- Contributing prior steps: <indices>
- Evidence: <verbatim citation from the trajectory>

## Recommendations
- <smallest harness or prompt change that prevents this class on this trajectory>
- <rule, guard, or eval case that would have caught it>
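The step-by-step evaluation table in the template above could be produced from evaluation results with a small renderer such as the one below. The function name and the tuple layout of `rows` are assumptions for illustration; only the column headers and the "fail (first unrecoverable)" label come from the template.

```python
def _cell(value: object) -> str:
    """Escape pipe characters so a value cannot break the table layout."""
    return str(value).replace("|", "\\|")


def render_step_table(rows: list[tuple]) -> str:
    """Render (step, action, check, result) tuples as a markdown table."""
    lines = [
        "| step | action | check | result |",
        "|------|--------|-------|--------|",
    ]
    first_fail_seen = False
    for step, action, check, result in rows:
        label = result
        if result == "fail" and not first_fail_seen:
            # Only the first fail is labeled as the unrecoverable step.
            label = "fail (first unrecoverable)"
            first_fail_seen = True
        lines.append(
            f"| {_cell(step)} | {_cell(action)} | {_cell(check)} | {label} |"
        )
    return "\n".join(lines)
```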

Boundaries

This skill diagnoses one trajectory at a time. For aggregate analysis across many runs, use a separate batch tool. Do not generalize a single trajectory's class to a system-wide claim.
It does not rerun the trajectory. The diagnosis is on what was captured.
It does not rewrite the agent's prompt or skills. Recommendations are descriptive; implementation is a separate explicit ask.
For a passing trajectory whose path concerns you anyway, switch to W-18 three-axis evaluation rather than running this skill.

Red Flags

Marking the last failed step instead of the first unrecoverable one. The last step is usually a downstream consequence.
Defaulting to F2 (hallucination) without checking F4 (misread). They look identical in the final answer but require opposite fixes.
Classifying an F9 system failure as F1 plan adherence because the agent retried oddly after the timeout. The retry behavior is a symptom, not the cause.
Producing a class with no citation. Every F-class assignment must point to a specific step and contract.
Building a multi-step reasoning chain on top of a step that was already marked fail. The classification stops at the first unrecoverable step.

Checklist

Inventory the available trajectory inputs before assigning a failure class.
Identify the first unrecoverable step, not the noisiest downstream symptom.
Cite the concrete event or missing event that supports each classification.
Separate recommendations for the agent, harness, and task prompt.

Related rules

W-01 — no fixes without root cause. The first unrecoverable step is the root cause; downstream symptoms are not.
W-15 — low-information loop detection. If the trajectory shows three rounds of shrinking diff with no progress, the F-class is more likely F1 or F5 than F2.
W-18 — evaluations must validate path. The nine-class taxonomy is what an axis-1 (tool selection) and axis-2 (step adherence) eval would assert against.
SEC-12 — silent drift in MCP tool descriptions. If F4 (misread tool output) recurs across trajectories, audit the MCP tool descriptions before blaming the model.
