evaluation

Updated March 1, 2026 at 22:51

Reference templates for Codex evaluation. Used by build/improve orchestrators — not executed directly.

Source

Objective-Arts

Objective-Arts/lens

SKILL.md

name	evaluation
description	Reference templates for Codex evaluation. Used by build/improve orchestrators — not executed directly.

Evaluation Reference

Templates and formats for the Phase 8 evaluation loop. The orchestrator in /build and /improve reads these templates and injects them into single-purpose agents.

This file is NOT executed directly. The orchestrator owns the score-fix-report loop.

Rubric Loading

1. Read .claude/rubric/AUTO-DETECT.md for the detection table.
2. Always load .claude/rubric/base.md and .claude/rubric/product-quality.md.
3. Auto-detect domains: check the target files against the detection table and load the matching domain rubrics.
4. Combine everything loaded into {RUBRIC_CRITERIA}.

If a rubric file doesn't exist, skip it and continue.
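
The loading steps above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the function name `build_rubric_criteria` and the `domain_rubrics` parameter are hypothetical, and because the format of the detection table is not described here, the matched domain rubric paths are simply passed in by the caller.

```python
from pathlib import Path

# Rubrics that are always loaded, as listed in the steps above.
BASE_RUBRICS = [".claude/rubric/base.md", ".claude/rubric/product-quality.md"]


def build_rubric_criteria(project_root, domain_rubrics):
    """Join the contents of every rubric file that exists.

    `domain_rubrics` is a list of project-relative paths that the caller
    selected from the detection table (hypothetical input; the table's
    format is not specified in this reference).
    """
    sections = []
    for relative_path in BASE_RUBRICS + list(domain_rubrics):
        rubric_file = Path(project_root) / relative_path
        if not rubric_file.is_file():
            continue  # a missing rubric is skipped, not treated as an error
        sections.append(rubric_file.read_text(encoding="utf-8").strip())
    # The joined text is what would be substituted for {RUBRIC_CRITERIA}.
    return "\n\n".join(sections)
```

Skipping missing files inside the loop keeps the "skip it and continue" rule in one place, so a project without domain rubrics still gets the base criteria.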

Scorecard Prompt

The orchestrator injects this into the SCORE agent's codex exec command:

cd {TARGET} && codex exec -s read-only -o /tmp/lens-eval-scores.md "PRODUCTION READINESS SCORECARD

Score this codebase 1-10 on each dimension. No partial credit — round to
the nearest integer. A 5 means acceptable for production. Below 5 means
you would block the PR. Above 5 means you would approve with confidence.

Also check against these criteria:
{RUBRIC_CRITERIA}

1. SECURITY (1-10)
   Injection, traversal, secrets, trust boundaries, input validation

2. STRUCTURE (1-10)
   Single responsibility, file organization, dependency direction,
   interface clarity, no god objects

3. ERROR HANDLING (1-10)
   Cause chains preserved, no swallowed errors, explicit failure paths,
   no log-and-continue

4. NAMING (1-10)
   Intent-revealing names, no abbreviations, no generic names (data,
   result, info, item), consistent vocabulary

5. COMPLEXITY (1-10)
   Function length, nesting depth, branching factor, parameter count,
   cognitive load per function

6. TYPE SAFETY (1-10)
   No any, proper narrowing, discriminated unions where appropriate,
   inference used correctly

7. TESTABILITY (1-10)
   Pure functions, injectable dependencies, observable behavior,
   no hidden state

OUTPUT FORMAT (strict — one line per dimension, then total):

SECURITY: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
STRUCTURE: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
ERROR_HANDLING: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
NAMING: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
COMPLEXITY: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
TYPE_SAFETY: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line
TESTABILITY: N/10 — one sentence justification. Top 3 weakest files: file:line, file:line, file:line

TOTAL: NN/70

Do not explain the scoring system. Do not add caveats. Score and justify." 2>&1
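
Because the output format is strict, the score lines can be read back with a small parser. The sketch below is one possible approach, assuming the SCORE agent follows the format exactly; `parse_scores` and `SCORE_LINE` are illustrative names, not part of lens.

```python
import re

# Dimension keys exactly as they appear in the strict output format.
DIMENSIONS = [
    "SECURITY", "STRUCTURE", "ERROR_HANDLING", "NAMING",
    "COMPLEXITY", "TYPE_SAFETY", "TESTABILITY",
]

# Matches the start of a line such as "SECURITY: 7/10 ...".
# The "TOTAL: NN/70" line does not match because it is out of 70.
SCORE_LINE = re.compile(r"^\s*([A-Z_]+):\s*(\d{1,2})/10\b")


def parse_scores(output_text):
    """Return {dimension: score}; raise ValueError on bad or missing lines."""
    scores = {}
    for line in output_text.splitlines():
        match = SCORE_LINE.match(line)
        if match is None or match.group(1) not in DIMENSIONS:
            continue  # justification text, blank lines, TOTAL line
        score = int(match.group(2))
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range for {match.group(1)}: {score}")
        scores[match.group(1)] = score
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError("missing dimensions: " + ", ".join(missing))
    return scores
```

Recomputing the total from the parsed scores, rather than trusting the TOTAL line, avoids carrying over an arithmetic slip from the agent.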

Scoreboard Format

The orchestrator prints this after parsing SCORE agent output:

EVAL_SCORES (iteration {N}):
  Security:       {N}/10
  Structure:      {N}/10
  Error Handling: {N}/10
  Naming:         {N}/10
  Complexity:     {N}/10
  Type Safety:    {N}/10
  Testability:    {N}/10
  TOTAL:          {NN}/70
  Below 9:        {list of dimensions below 9, or "none"}
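
As an illustration of how the scoreboard might be produced from parsed scores, here is a short sketch. The name `format_scoreboard` and its `threshold` parameter are assumptions made for this example; the default of 9 mirrors the "Below 9" line above.

```python
# Output-format keys mapped to the labels used in the scoreboard.
LABELS = {
    "SECURITY": "Security",
    "STRUCTURE": "Structure",
    "ERROR_HANDLING": "Error Handling",
    "NAMING": "Naming",
    "COMPLEXITY": "Complexity",
    "TYPE_SAFETY": "Type Safety",
    "TESTABILITY": "Testability",
}


def format_scoreboard(iteration, scores, threshold=9):
    """Render the EVAL_SCORES block for one iteration."""
    lines = [f"EVAL_SCORES (iteration {iteration}):"]
    for key, label in LABELS.items():
        name = label + ":"
        lines.append(f"  {name:<16}{scores[key]}/10")  # 16-column label field
    total = sum(scores[key] for key in LABELS)
    total_name = "TOTAL:"
    lines.append(f"  {total_name:<16}{total}/70")
    below = [label for key, label in LABELS.items() if scores[key] < threshold]
    below_name = f"Below {threshold}:"
    below_text = ", ".join(below) if below else "none"
    lines.append(f"  {below_name:<16}{below_text}")
    return "\n".join(lines)
```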

Report Template

The report agent replaces .claude/eval-report.md with:

# Eval Report — {TARGET}

**Date:** {ISO date}
**Evaluator:** Codex
**Iterations:** {N}

## Scores

| Dimension | Initial | Final |
|-----------|---------|-------|
| Security | N/10 | N/10 |
| Structure | N/10 | N/10 |
| Error Handling | N/10 | N/10 |
| Naming | N/10 | N/10 |
| Complexity | N/10 | N/10 |
| Type Safety | N/10 | N/10 |
| Testability | N/10 | N/10 |
| **Total** | **NN/70** | **NN/70** |

## Fixes Applied ({count})

| # | Dimension | File | Fix |
|---|-----------|------|-----|
| 1 | {dim} | {file:line} | {what was fixed} |
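
The two tables in the template could be filled in along these lines. This is only a sketch under stated assumptions: `render_scores_table` and `render_fixes_table` are hypothetical helpers, scores are dictionaries keyed by the output-format dimension names, and each fix is a `(dimension, location, description)` tuple.

```python
# Output-format keys mapped to the labels used in the report tables.
LABELS = {
    "SECURITY": "Security",
    "STRUCTURE": "Structure",
    "ERROR_HANDLING": "Error Handling",
    "NAMING": "Naming",
    "COMPLEXITY": "Complexity",
    "TYPE_SAFETY": "Type Safety",
    "TESTABILITY": "Testability",
}


def render_scores_table(initial, final):
    """Build the Scores table from first-iteration and last-iteration scores."""
    rows = ["| Dimension | Initial | Final |", "|-----------|---------|-------|"]
    for key, label in LABELS.items():
        rows.append(f"| {label} | {initial[key]}/10 | {final[key]}/10 |")
    initial_total = sum(initial[key] for key in LABELS)
    final_total = sum(final[key] for key in LABELS)
    rows.append(f"| **Total** | **{initial_total}/70** | **{final_total}/70** |")
    return "\n".join(rows)


def render_fixes_table(fixes):
    """Build the Fixes Applied section; `fixes` is a list of 3-tuples."""
    rows = [
        f"## Fixes Applied ({len(fixes)})",
        "",
        "| # | Dimension | File | Fix |",
        "|---|-----------|------|-----|",
    ]
    for number, (dimension, location, description) in enumerate(fixes, start=1):
        rows.append(f"| {number} | {dimension} | {location} | {description} |")
    return "\n".join(rows)
```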

Known pitfalls are maintained in canon/pitfalls/SKILL.md. If you discover a new recurring pattern during evaluation, note it in the report — it can be added to the pitfalls canon in a future release.
