| name | assess |
| license | MIT |
| compatibility | Claude Code 2.1.76+. Requires memory MCP server. |
| description | Assesses and rates quality 0-10 across multiple dimensions (correctness, maintainability, security, performance, testability, simplicity) with pros/cons analysis. Compares against project conventions and prior decisions from memory. Produces structured evaluation reports with actionable improvement suggestions. Use when evaluating code, designs, architectures, or comparing alternative approaches. |
| context | fork |
| version | 1.7.0 |
| author | OrchestKit |
| tags | ["assessment","evaluation","quality","comparison","pros-cons","rating"] |
| user-invocable | true |
| allowed-tools | ["AskUserQuestion","Read","Grep","Glob","Task","TaskCreate","TaskUpdate","TaskList","ToolSearch","mcp__memory__search_nodes","Bash"] |
| skills | ["code-review-playbook","quality-gates","architecture-decision-record","memory","chain-patterns"] |
| argument-hint | [code-path-or-topic] [--render=markdown|json-render|both] [--effort=low|medium|high|xhigh] |
| complexity | high |
| persuasion-type | guidance |
| effort | high |
| model | sonnet |
| hooks | {"PreToolUse":[{"matcher":"Read","command":"${CLAUDE_PLUGIN_ROOT}/hooks/bin/run-hook.mjs skill/assessment-baseline-loader","once":true}]} |
| metadata | {"category":"document-asset-creation","mcp-server":"memory"} |
| triggers | {"keywords":["assess","asses","rate","evaluate","grade","score","compare","how good","how bad","red flags","trade-offs","pros and cons","good enough"],"examples":["rate this code from 0 to 10","is this approach good enough for production?","evaluate the trade-offs between Redis vs Postgres"],"anti-triggers":["fix","implement","build","test","commit","review pr","explore"]} |
Assess
Comprehensive assessment skill for answering "is this good?" with structured evaluation, scoring, and actionable recommendations.
Quick Start
/ork:assess backend/app/services/auth.py
/ork:assess our caching strategy
/ork:assess --model=opus the current database schema
/ork:assess frontend/src/components/Dashboard
Effort levels (CC 2.1.111+ adds xhigh)
| Effort | Behavior |
|---|---|
| low / medium | Subset of dimensions, faster turnaround |
| high (default) | All seven dimensions with pros/cons |
| xhigh (Opus 4.7 only) | All seven dimensions + one additional assessor pass focused on uncertainty/caveats; emits confidence per dimension |
xhigh silently falls back to high on non-Opus-4.7 models. /ork:doctor warns when xhigh is used without Opus 4.7.
Argument Resolution
TARGET = "$ARGUMENTS"
MODEL_OVERRIDE = None
for token in "$ARGUMENTS".split():
if token.startswith("--model="):
MODEL_OVERRIDE = token.split("=", 1)[1]
TARGET = TARGET.replace(token, "").strip()
Pass MODEL_OVERRIDE to all Agent() calls via model=MODEL_OVERRIDE when set. Accepts symbolic names (opus, sonnet, haiku) or full IDs (claude-opus-4-6) per CC 2.1.74.
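A minimal sketch of how the override propagates into a Phase 2 spawn; the prompt text is illustrative, and Agent() stands for whichever spawn pattern agent-spawn-definitions.md selects:
spawn_kwargs = {"model": MODEL_OVERRIDE} if MODEL_OVERRIDE else {}   # omit to use the session default
Agent(prompt=f"Score correctness for: {TARGET}", **spawn_kwargs)     # illustrative prompt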
Effort detection (CC 2.1.120+)
${CLAUDE_EFFORT} is the primary signal; CC 2.1.120 sets this env var from /effort or the model picker. An --effort= token in $ARGUMENTS explicitly overrides it, and is the only path on older CC versions where the env var is unset.
EFFORT = os.environ.get("CLAUDE_EFFORT")
for token in "$ARGUMENTS".split():
if token.startswith("--effort="):
EFFORT = token.split("=", 1)[1]
TARGET = TARGET.replace(token, "").strip()
EFFORT = EFFORT or "high"
Use EFFORT to gate dimension count, agent count, and the optional xhigh uncertainty pass — see "Effort levels" table above. On CC < 2.1.120 the env var is unset; the explicit --effort= override is the only path. /ork:doctor warns when xhigh is requested without Opus 4.7.
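A sketch of that gating in the same pseudocode style; the low/medium subset and the model check are illustrative, and the authoritative dimension list lives in quality-model.md:
ALL_DIMENSIONS = ["correctness", "maintainability", "security", "performance",
                  "testability", "architecture", "documentation"]
dimensions = ALL_DIMENSIONS if EFFORT in ("high", "xhigh") else ["correctness", "maintainability", "testability"]  # illustrative subset
if EFFORT == "xhigh" and not model_is_opus_4_7:   # model_is_opus_4_7: assumed capability check
    EFFORT = "high"                               # silent fallback; /ork:doctor surfaces the warning
run_uncertainty_pass = (EFFORT == "xhigh")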
STEP -1: MCP Probe + Resume Check
Load: Read("${CLAUDE_PLUGIN_ROOT}/skills/chain-patterns/references/mcp-detection.md")
probe_memory = ToolSearch(query="select:mcp__memory__search_nodes")
Write(".claude/chain/capabilities.json", {
"memory": probe_memory.found,
"skill": "assess",
"timestamp": now()
})
state = Read(".claude/chain/state.json")
if state.skill == "assess" and state.status == "in_progress":
last_handoff = Read(f".claude/chain/{state.last_handoff}")
Phase Handoffs
| Phase | Handoff File | Contents |
|---|---|---|
| 0 | 00-intent.json | Dimensions, target, mode |
| 1 | 01-baseline.json | Initial codebase scan results |
| 2 | 02-evaluation.json | Per-dimension scores + evidence |
| 3 | 03-report.json | Final report, grade, recommendations |
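A sketch of the handoff write after a phase completes, assuming the layout above; payload fields other than those named in the table are illustrative:
Write(".claude/chain/02-evaluation.json", {
    "phase": 2,
    "scores": {"correctness": 8.0, "security": 6.5},    # illustrative values
    "evidence": {"security": ["src/api/auth.ts:42"]}
})
Write(".claude/chain/state.json", {"skill": "assess", "status": "in_progress",
                                   "last_handoff": "02-evaluation.json"})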
STEP 0: Verify User Intent with AskUserQuestion
BEFORE creating tasks, clarify assessment dimensions:
AskUserQuestion(
questions=[{
"question": "What dimensions to assess?",
"header": "Dimensions",
"options": [
{"label": "Full assessment (Recommended)", "description": "All dimensions: quality, maintainability, security, performance", "markdown": "```\nFull Assessment (7 phases)\n──────────────────────────\n Dimensions scored 0-10:\n ┌─────────────────────────────┐\n │ Correctness ████████░░ │\n │ Maintainability ██████░░░░ │\n │ Security █████████░ │\n │ Performance ███████░░░ │\n │ Testability ██████░░░░ │\n │ Architecture ████████░░ │\n │ Documentation █████░░░░░ │\n └─────────────────────────────┘\n + Pros/cons + alternatives\n + Effort estimates + report\n Agents: 4 parallel evaluators\n```"},
{"label": "Code quality only", "description": "Readability, complexity, best practices", "markdown": "```\nCode Quality Focus\n──────────────────\n Dimensions scored 0-10:\n ┌─────────────────────────────┐\n │ Correctness ████████░░ │\n │ Maintainability ██████░░░░ │\n │ Testability ██████░░░░ │\n └─────────────────────────────┘\n Skip: security, performance\n Agents: 1 code-quality-reviewer\n Output: Score + best practice gaps\n```"},
{"label": "Security focus", "description": "Vulnerabilities, attack surface, compliance", "markdown": "```\nSecurity Focus\n──────────────\n ┌──────────────────────────┐\n │ OWASP Top 10 check │\n │ Dependency CVE scan │\n │ Auth/AuthZ flow review │\n │ Data flow tracing │\n │ Secrets detection │\n └──────────────────────────┘\n Agent: security-auditor\n Output: Vuln list + severity\n + remediation steps\n```"},
{"label": "Quick score", "description": "Just give me a 0-10 score with brief notes", "markdown": "```\nQuick Score\n───────────\n Single pass, ~2 min:\n\n Read target ──▶ Score ──▶ Done\n 7.2/10\n\n Output:\n ├── Composite score (0-10)\n ├── Grade (A-F)\n ├── 3 strengths\n └── 3 improvements\n No agents, no deep analysis\n```"}
],
"multiSelect": false
}]
)
Based on the answer, adjust the workflow (sketched after this list):
- Full assessment: All 7 phases, parallel agents
- Code quality only: Skip security and performance phases
- Security focus: Prioritize security-auditor agent
- Quick score: Single pass, brief output
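A sketch of that mapping; option labels match the AskUserQuestion above, and the config keys are illustrative:
workflow_by_answer = {
    "Full assessment (Recommended)": {"phases": "all", "parallel_agents": 4},
    "Code quality only":             {"skip": ["security", "performance"], "parallel_agents": 1},
    "Security focus":                {"lead_agent": "security-auditor"},
    "Quick score":                   {"phases": "single pass", "parallel_agents": 0},
}
plan = workflow_by_answer[answer.label]   # answer.label: assumed shape of the AskUserQuestion result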
STEP 0b: Select Orchestration Mode
Load details: Read("${CLAUDE_SKILL_DIR}/references/orchestration-mode.md") for env var check logic, Agent Teams vs Task Tool comparison, and mode selection rules.
Task Management (CC 2.1.16)
TaskCreate(
subject="Assess: {target}",
description="Comprehensive evaluation with quality scores and recommendations",
activeForm="Assessing {target}"
)
TaskCreate(subject="Understand target and gather context", activeForm="Understanding target")
TaskCreate(subject="Discover scope and build file list", activeForm="Discovering scope")
TaskCreate(subject="Rate quality across 7 dimensions", activeForm="Rating quality")
TaskCreate(subject="Analyze pros and cons", activeForm="Analyzing pros/cons")
TaskCreate(subject="Compare alternatives", activeForm="Comparing alternatives")
TaskCreate(subject="Generate improvement suggestions", activeForm="Generating suggestions")
TaskCreate(subject="Compile assessment report", activeForm="Compiling report")
TaskUpdate(taskId="3", addBlockedBy=["2"])
TaskUpdate(taskId="4", addBlockedBy=["3"])
TaskUpdate(taskId="5", addBlockedBy=["4"])
TaskUpdate(taskId="6", addBlockedBy=["4"])
TaskUpdate(taskId="7", addBlockedBy=["5", "6"])
TaskUpdate(taskId="8", addBlockedBy=["7"])
TaskUpdate(taskId="2", status="in_progress")   # mark before starting each phase
TaskUpdate(taskId="2", status="completed")     # mark once the phase's output is written
What This Skill Answers
| Question | How It's Answered |
|---|---|
| "Is this good?" | Quality score 0-10 with reasoning |
| "What are the trade-offs?" | Structured pros/cons list |
| "Should we change this?" | Improvement suggestions with effort |
| "What are the alternatives?" | Comparison with scores |
| "Where should we focus?" | Prioritized recommendations |
Workflow Overview
| Phase | Activities | Output |
|---|---|---|
| 1. Target Understanding | Read code/design, identify scope | Context summary |
| 1.5. Scope Discovery | Build bounded file list | Scoped file list |
| 2. Quality Rating | 7-dimension scoring (0-10) | Scores with reasoning |
| 3. Pros/Cons Analysis | Strengths and weaknesses | Balanced evaluation |
| 4. Alternative Comparison | Score alternatives | Comparison matrix |
| 5. Improvement Suggestions | Actionable recommendations | Prioritized list |
| 6. Effort Estimation | Time and complexity estimates | Effort breakdown |
| 7. Assessment Report | Compile findings | Final report |
Phase 1: Target Understanding
Identify what's being assessed and gather context:
Read(file_path=TARGET)
Grep(pattern=TARGET, output_mode="files_with_matches")
mcp__memory__search_nodes(query=TARGET)
Phase 1.5: Scope Discovery
Load Read("${CLAUDE_SKILL_DIR}/references/scope-discovery.md") for the full file discovery, limit application (MAX 30 files), and sampling priority logic. Always include the scoped file list in every agent prompt.
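A sketch of the bounding step; the glob and the prioritize() helper are stand-ins for the discovery and sampling-priority logic in scope-discovery.md:
candidates = Glob(pattern=f"{TARGET}/**/*")          # illustrative glob
scoped_files = prioritize(candidates)[:30]           # MAX 30 files, priority order per scope-discovery.md
# scoped_files is included verbatim in every Phase 2 agent prompt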
Progressive Output (CC 2.1.76)
Output results incrementally as each evaluation phase completes:
| After Phase | Show User |
|---|---|
| 1. Target Understanding | Scope summary, file list, context |
| 1.5. Scope Discovery | Bounded file list (max 30 files) |
| 2. Quality Rating | Each dimension's score as the evaluating agent returns |
| 3. Pros/Cons | Balanced evaluation summary |
For Phase 2 parallel agents, show each dimension's score as soon as the evaluating agent returns — don't wait for all 4 agents. If any dimension scores below 4/10, flag it immediately as a priority concern requiring user attention.
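Sketched as a loop; phase2_results_as_they_arrive() and show() are illustrative stand-ins for streaming agent results and surfacing them to the user:
for result in phase2_results_as_they_arrive():            # stream per-agent results, don't batch
    show(f"{result['dimension']}: {result['score']}/10")
    if result["score"] < 4:
        show(f"Priority concern: {result['dimension']} is below 4/10")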
Phase 2: Quality Rating (7 Dimensions)
Rate each dimension 0-10 with weighted composite score. Load Read("${CLAUDE_PLUGIN_ROOT}/skills/quality-gates/references/unified-scoring-framework.md") for dimensions, weights, grade interpretation, and per-dimension criteria. Load Read("${CLAUDE_SKILL_DIR}/references/quality-model.md") for assess-specific overrides.
Load Read("${CLAUDE_SKILL_DIR}/references/agent-spawn-definitions.md") for Task Tool mode spawn patterns and Agent Teams alternative.
Composite Score: Weighted average of all 7 dimensions (see quality-model.md).
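For orientation only, the composite is a weighted average along these lines; the weights below are placeholders, and the real values come from the unified scoring framework and quality-model.md:
WEIGHTS = {"correctness": 0.25, "security": 0.20, "maintainability": 0.15, "performance": 0.10,
           "testability": 0.10, "architecture": 0.10, "documentation": 0.10}   # placeholder weights
composite = sum(scores[d] * w for d, w in WEIGHTS.items()) / sum(WEIGHTS.values())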
Phases 3-7: Analysis, Comparison & Report
Load Read("${CLAUDE_SKILL_DIR}/references/phase-templates.md") for output templates for pros/cons, alternatives, improvements, effort, and the final report.
See also: Read("${CLAUDE_SKILL_DIR}/references/alternative-analysis.md") | Read("${CLAUDE_SKILL_DIR}/references/improvement-prioritization.md")
Phase 7b: Emit Dashboard Spec (json-render)
Parse --render= from $ARGUMENTS. Default is both.
| Mode | Behavior |
|---|---|
| markdown | Current behavior: markdown assessment report only. No spec emitted. |
| json-render | Emit .claude/chain/assess-dashboard.json only. Skip markdown report. |
| both | Emit spec and markdown. Default: the human reads the report, downstream skills parse the spec. |
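Parsing mirrors the --model/--effort handling above (a sketch):
RENDER = "both"
for token in "$ARGUMENTS".split():
    if token.startswith("--render="):
        RENDER = token.split("=", 1)[1]
        TARGET = TARGET.replace(token, "").strip()
emit_spec = RENDER in ("json-render", "both")
emit_markdown = RENDER in ("markdown", "both")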
When emitting a spec:
- Load format and catalog: Read("${CLAUDE_SKILL_DIR}/references/dashboard-spec.md"). Example: references/dashboard-example.json.
- Build the spec using only catalog types: Card, StatGrid, DataTable, StatusBadge, BarMeter, Markdown. Top-level fields composite (number) and grade (string) are required for assess specs.
- One BarMeter per dimension scored. The verdict element is a StatusBadge with status success/warning/error mapped from grade (A/B → success, C → warning, D/F → error); see the sketch after this list.
- Write to .claude/chain/assess-dashboard.json with compact JSON.
- Validate before declaring success: node "${CLAUDE_SKILL_DIR}/scripts/render-spec.mjs" .claude/chain/assess-dashboard.json --check. If validation fails, fall back to markdown-only and surface the error. Never write a partial spec.
- For --render=both, render the markdown view from the spec: node "${CLAUDE_SKILL_DIR}/scripts/render-spec.mjs" .claude/chain/assess-dashboard.json. This keeps the JSON spec and the markdown report in sync.
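A sketch of the assembly; the element constructors and the top-level shape beyond composite and grade are illustrative, the exact schema is in references/dashboard-spec.md:
def grade_to_status(grade):
    return {"A": "success", "B": "success", "C": "warning"}.get(grade, "error")   # D/F map to error

elements = [BarMeter(label=d, value=scores[d], max=10) for d in dimensions]       # one per dimension
elements.append(StatusBadge(label="Verdict", status=grade_to_status(grade)))
Write(".claude/chain/assess-dashboard.json",
      {"composite": composite, "grade": grade, "elements": elements})             # compact JSON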
xhigh effort: when effort=xhigh is active, add a sibling Markdown element per dimension containing confidence and caveats from the uncertainty pass. List it in the dimensions Card's children alongside the BarMeter. See references/dashboard-spec.md for the exact pattern.
Downstream consumption: /ork:implement reads .claude/chain/assess-dashboard.json and pulls the lowest-scoring dimension and high-priority improvements (effort ≤ 2 AND impact ≥ 4) without parsing markdown tables. Measured: the assess spec is ≈830 tokens vs a ~3500-token markdown report for the same content.
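What that consumption amounts to, sketched with the thresholds above; the dimensions and improvements field names are assumptions, the real ones are defined in dashboard-spec.md:
spec = Read(".claude/chain/assess-dashboard.json")             # parsed spec
weakest = min(spec["dimensions"], key=lambda d: d["score"])    # lowest-scoring dimension (field name assumed)
quick_wins = [i for i in spec["improvements"]                  # field name assumed
              if i["effort"] <= 2 and i["impact"] >= 4]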
Self-Reported Uncertainty (Opus 4.7 only, xhigh effort)
Opus 4.7 is materially better than 4.6 at honestly reporting its own limits. When xhigh effort is active, enrich each dimension's rating with a confidence level and a list of caveats — things the model couldn't verify, assumptions it relied on, or cases it didn't test.
Output schema per dimension (JSON):
{
"dimension": "security",
"score": 7.2,
"confidence": "medium",
"caveats": [
"Didn't execute the SQL queries against a real DB to confirm parameterization",
"Assumed NODE_ENV=production in deployment; didn't verify CI config",
"Reviewed 12 of 15 handlers; remaining 3 deferred by scope filter"
],
"evidence": ["src/api/auth.ts:42", "src/middleware/guard.ts:88"]
}
Rules:
- Do not use confidence as an auto-gate. It's a signal for the human reader, not a pass/fail threshold.
- caveats must be specific. "Didn't check X" with file paths beats "uncertainty about security".
- If a caveat is cheap to resolve, resolve it instead of recording it. Caveats are for things that genuinely can't be verified within the skill's scope (e.g., production runtime behavior, future input patterns).
- Composite score still computes from score only (not weighted by confidence) to keep the number comparable across runs.
Grade Interpretation
Load Read("${CLAUDE_PLUGIN_ROOT}/skills/quality-gates/references/unified-scoring-framework.md") for grade thresholds and scoring criteria.
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| 7 dimensions | Comprehensive coverage | All quality aspects without overwhelming |
| 0-10 scale | Industry standard | Easy to understand and compare |
| Parallel assessment | 4 agents (7 dimensions) | Fast, thorough evaluation |
| Effort/Impact scoring | 1-5 scale | Simple prioritization math |
Rules Quick Reference
| Rule | Impact | What It Covers |
|---|---|---|
| complexity-metrics (load ${CLAUDE_SKILL_DIR}/rules/complexity-metrics.md) | HIGH | 7-criterion scoring (1-5), complexity levels, thresholds |
| complexity-breakdown (load ${CLAUDE_SKILL_DIR}/rules/complexity-breakdown.md) | HIGH | Task decomposition strategies, risk assessment |
Related Skills
- ork:verify - Post-implementation verification
- ork:code-review-playbook - Code review patterns
- ork:quality-gates - Task complexity assessment, gate patterns
Version: 1.7.0 (April 2026) — ${CLAUDE_EFFORT} env var as primary effort signal (CC 2.1.120, #1540)