code-quality-assessment
// Quantitative and qualitative code quality assessment with prioritized refactoring recommendations
| name | code-quality-assessment |
| description | Quantitative and qualitative code quality assessment with prioritized refactoring recommendations |
REPORT_FILE = specs/architecture/code_quality_assessment.md
Target directories: If the target directory for REPORT_FILE does not exist, create it. The assessment is the first artifact that justifies the directory's existence.
Recommended Tools: Make sure you've read ~/.liza/AGENT_TOOLS.md. list_directory_tree and codebase_search (fast, token-efficient semantic search) may be especially useful.
Quality is not a binary. Measure it, grade it, and direct investment where it will compound.
Invoked for: periodic health checks, pre-refactoring assessment, onboarding orientation, or explicit code quality evaluation.
Metrics Collection → Subsystem Analysis → Synthesis → Recommendations
Templates anchor cognition. Complete each phase before the next. The skill is a measurement framework to apply to what you found, not boxes to fill with platitudes.
Report format: See references/report-format.md for the output template.
| Mode | When | Scope |
|---|---|---|
| Full Assessment | First assessment, periodic health check, major milestone | All phases, all sections, full report |
| Targeted Assessment | Evaluate specific subsystem(s) or concern area | Scoped metrics + analysis for named components only |
| Reassessment | After refactoring or significant changes | Delta comparison against previous REPORT_FILE |
| Enrichment | Improve coverage of existing assessment | Independent analysis → merge → verify → update |
| Quick Health Check | Verify existing findings still hold | Metrics refresh + finding verification only |
Phase applicability:
| Mode | Phase 1 (Metrics) | Phase 2 (Subsystem Analysis) | Phase 3 (Synthesis) | Phase 4 (Recs) | Output |
|---|---|---|---|---|---|
| Full Assessment | ✓ Complete | ✓ Complete | ✓ Complete | ✓ Complete | New REPORT_FILE |
| Targeted Assessment | Scoped | Scoped | Scoped | Scoped | Targeted section in REPORT_FILE |
| Reassessment | ✓ Fresh | Delta comparison | Update | Update | Revised REPORT_FILE |
| Enrichment | ✓ Fresh (independent) | Update (add to existing) | Update | Update | Revised REPORT_FILE |
| Quick Health Check | Refresh only | Verify only | — Skip | — Skip | Updated metrics + verification notes |
Mode selection (first match wins):
Use complete process: Phase 1 → Phase 2 → Phase 3 → Phase 4.
Time Budget: Phase 1 (Metrics) ~30% of effort. Phase 2 (Subsystem Analysis) ~40%. Phases 3+4 (Synthesis + Recommendations) ~30%. Most missed findings come from rushed metrics collection — especially the File Size Distribution scan. If you're tempted to skip ahead, you're under-investing in discovery.
Pairing checkpoint: After Phase 1, present the Metrics Dashboard and identified subsystems before proceeding to Phase 2. This catches scope gaps early (missed languages, wrong LOC counts, missing subsystems).
Default output: REPORT_FILE (when no output path is specified and the file does not exist yet).
Scope to named subsystems only. Collect metrics only for the targeted components.
Requires existing REPORT_FILE.
*(reassessment YYYY-MM-DD)*
Same anti-anchoring protocol as the software-architecture-review skill.
First check: Verify REPORT_FILE exists. If it doesn't, this is Full Assessment, not Enrichment.
Header check (BEFORE discovery): Read REPORT_FILE until you find the Mode: line to extract:
⚠️ CRITICAL: You MUST NOT read REPORT_FILE findings until Step 2. Reading findings early causes anchoring — you'll confirm existing findings instead of discovering new ones.
Process:
Independent Analysis (Phase 1 + Phase 2) — Complete as if no report exists. Explore the codebase fresh. Hold findings in memory. Do not read the existing report.
Merge Phase — Only now read REPORT_FILE. Compare your fresh findings against it.
Verification — For each finding in the existing report, verify:
Gap Analysis — List:
Update — Revise REPORT_FILE with:
*(pass N)* or *(pass N, [lens] lens)*
Time Budget: Independent Analysis (step 1) should be at least as thorough as merge + verification combined.
Each enrichment pass uses a different primary lens. Continue from the previous pass's lens.
Lenses:
Rotation order: Complexity → Dependencies → CI/Build → Test Coverage → Documentation → (wrap to Complexity)
The first 3 passes cover the highest-value lenses (Complexity, Dependencies, CI/Build) as primary.
How to apply: During Phase 1+2, start with your primary lens. Spend ~40% of discovery time on it before broadening. The leading lens gets deepest attention while context is fresh.
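The rotation can be sketched as a small lookup. This is a minimal illustration, not part of the skill itself: the helper name `next_lens` is hypothetical, and starting at Complexity when no previous pass exists is inferred from the statement that the first 3 passes cover Complexity, Dependencies, and CI/Build.

```python
from typing import Optional

# Rotation order as listed in the skill text.
LENSES = ["Complexity", "Dependencies", "CI/Build", "Test Coverage", "Documentation"]


def next_lens(previous_lens: Optional[str]) -> str:
    """Return the primary lens for the next enrichment pass (hypothetical helper)."""
    if previous_lens is None:
        # Assumption: the first pass starts at the head of the rotation.
        return LENSES[0]
    index = LENSES.index(previous_lens)  # raises ValueError for an unknown lens
    return LENSES[(index + 1) % len(LENSES)]  # wrap back to Complexity
```

An unknown lens name raises `ValueError` rather than silently restarting the rotation, so a typo in the report header surfaces immediately.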
Complexity lens — systematic scan: When Complexity is your primary lens:
Structural scan (start here): Run the LOC scan from Phase 1.3. Flag ALL files >500 LOC as potential god classes. For each, investigate before merge phase.
Design-level scan (after structural): For each complex function or file identified, ask "why is this complex?" before recommending how to fix it:
The structural scan finds where complexity lives. The design-level scan identifies what kind of complexity it is. Different kinds need different remedies — recommending extraction for a design problem addresses the symptom, not the cause.
Recommended: Run enrichment 3 times for solid coverage. Additional passes provide diminishing returns.
⚠️ MANDATORY after 3+ passes: If pass number ≥ 3, present options before proceeding:
Pass [N] exists ([previous lens] lens). Per skill, 3 passes provide solid coverage.
Options:
1. Pass [N+1] Enrichment ([next lens] lens) — full independent discovery + merge
2. Reassessment — fresh metrics + delta comparison
3. Quick Health Check — verify existing findings still hold
Which approach?
Fastest mode. No new discovery — verification only.
Quantitative backbone of the assessment. Language-agnostic.
Scan for manifest files to determine primary language(s):
| Manifest | Language |
|---|---|
| go.mod | Go |
| package.json | JavaScript/TypeScript |
| pyproject.toml, requirements.txt, setup.py | Python |
| Cargo.toml | Rust |
| pom.xml, build.gradle, build.gradle.kts | Java/Kotlin |
| *.csproj, *.sln | C#/.NET |
| Gemfile | Ruby |
| mix.exs | Elixir |
Multi-language projects: collect metrics per language, report the primary language first.
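As a rough sketch of the manifest scan, the table above can be turned into a lookup. The function name `detect_languages` and the skipped directory names are assumptions made for illustration; the manifest-to-language mapping comes from the table.

```python
from pathlib import Path

# Manifest file names from the table above.
MANIFESTS = {
    "go.mod": "Go",
    "package.json": "JavaScript/TypeScript",
    "pyproject.toml": "Python",
    "requirements.txt": "Python",
    "setup.py": "Python",
    "Cargo.toml": "Rust",
    "pom.xml": "Java/Kotlin",
    "build.gradle": "Java/Kotlin",
    "build.gradle.kts": "Java/Kotlin",
    "Gemfile": "Ruby",
    "mix.exs": "Elixir",
}
# *.csproj and *.sln are matched by suffix rather than by full name.
SUFFIX_MANIFESTS = {".csproj": "C#/.NET", ".sln": "C#/.NET"}
# Assumption: vendored and build directories should not count as project manifests.
SKIP_DIRS = {"node_modules", "vendor", "target", ".git", "venv", ".venv"}


def detect_languages(root):
    """Return the sorted set of languages whose manifests appear under root."""
    root = Path(root)
    found = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        if any(part in SKIP_DIRS for part in relative.parts):
            continue
        language = MANIFESTS.get(path.name) or SUFFIX_MANIFESTS.get(path.suffix)
        if language:
            found.add(language)
    return sorted(found)
```

The result is alphabetical, not ranked. Deciding which language is primary still requires the LOC counts collected in the next step.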
Collect per language (prefer cloc, scc, or tokei when available; fall back to wc -l):
Language-specific collection hints:
| Language | Production LOC | Test Files | Test Count | Dependencies |
|---|---|---|---|---|
| Go | *.go excluding *_test.go | *_test.go | rg "func Test" | require blocks in go.mod |
| Python | *.py excluding test_*, *_test.py | test_*.py, *_test.py | rg "def test_" | pyproject.toml / requirements.txt |
| TS/JS | *.ts, *.js excluding *.test.*, *.spec.*, node_modules/ | *.test.*, *.spec.* | `rg 'it\(\|test\('` | dependencies in package.json |
| Rust | *.rs excluding tests/ | tests/, #[cfg(test)] modules | rg '#\[test\]' | [dependencies] in Cargo.toml |
These are approximations for order-of-magnitude assessment, not precision tooling.
Identify files exceeding thresholds. List top 20 largest files:
# Go
find . -name "*.go" ! -name "*_test.go" ! -path "*/vendor/*" -exec wc -l {} + | sort -rn | head -20
# Python
find . -name "*.py" ! -path "*/__pycache__/*" ! -path "*/venv/*" ! -path "*/.venv/*" -exec wc -l {} + | sort -rn | head -20
# TypeScript/JavaScript
find . \( -name "*.ts" -o -name "*.tsx" -o -name "*.js" -o -name "*.jsx" \) ! -path "*/node_modules/*" -print0 | xargs -0 wc -l | sort -rn | head -20
# Rust
find . -name "*.rs" ! -path "*/target/*" -exec wc -l {} + | sort -rn | head -20
# Or use cloc/scc/tokei if available — they handle exclusions automatically
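When none of those tools is available, the same scan can be approximated in Python. This is a sketch under stated assumptions: the names `largest_files` and `size_flag` are hypothetical, the excluded directories mirror the find commands above, and the 500 and 800 LOC thresholds come from the benchmark table. Unlike the Go command, it does not exclude test files.

```python
from pathlib import Path

# Thresholds from the benchmark table: >500 LOC is a god class candidate,
# >800 LOC almost certainly needs splitting.
GOD_CLASS_LOC = 500
SPLIT_LOC = 800
# Directory exclusions mirror the find commands above; extend per project.
EXCLUDED_DIRS = {"vendor", "node_modules", "target", "venv", ".venv", "__pycache__", ".git"}


def largest_files(root, suffixes, limit=20):
    """Return up to `limit` (loc, relative_path) pairs, largest first."""
    root = Path(root)
    sizes = []
    for path in root.rglob("*"):
        if path.suffix not in suffixes or not path.is_file():
            continue
        relative = path.relative_to(root)
        if any(part in EXCLUDED_DIRS for part in relative.parts):
            continue
        with path.open("rb") as handle:
            loc = sum(1 for _ in handle)  # physical lines, comparable to wc -l
        sizes.append((loc, relative.as_posix()))
    sizes.sort(reverse=True)
    return sizes[:limit]


def size_flag(loc):
    """Map a line count onto the benchmark thresholds."""
    if loc > SPLIT_LOC:
        return "needs splitting"
    if loc > GOD_CLASS_LOC:
        return "god class candidate"
    return "ok"
```

For example, `largest_files(".", {".py"})` lists the 20 largest Python files. Unlike `wc -l` with multiple arguments, the result carries no aggregate `total` row to filter out.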
.github/workflows/, .gitlab-ci.yml, Jenkinsfile, Makefile, etc.
.pre-commit-config.yaml
Scan for patterns that indicate quality discipline or gaps:
Magic literals: Search for hardcoded string values used in control flow, event dispatch, identity comparison, or configuration. Language-specific patterns:
| Language | Scan Approach |
|---|---|
| Go | rg -n '"[a-z_]*"' -g '*.go' -g '!*_test.go' filtered to control flow contexts |
| Python | rg -n "['\"]\w+['\"]" -g '*.py' in if/match/dispatch contexts |
| TS/JS | rg -n "['\"]\w+['\"]" -g '*.ts' in switch/if/event contexts |
Not every string literal is a magic value. Focus on:
Provenance classification — for each category of magic literal found, classify:
| Category | Example | Fix |
|---|---|---|
| System constant | Event name, error code | Extract to typed constant |
| Configuration value | Default port, timeout | Extract to config with default |
| User-supplied identity | Agent ID, workspace name | Resolve from runtime state — a constant doesn't fix this |
The third category is the most severe: a hardcoded value that should be dynamic means the system
silently assumes a specific runtime configuration. A const only consolidates the assumption;
it doesn't fix it.
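The three fixes differ in kind, which a small sketch may make concrete. Every name below (`EVENT_TASK_DONE`, `DEFAULT_TIMEOUT_SECONDS`, `RuntimeState`, the `"coder-1"` literal) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# System constant: the value is fixed by the system, so a named constant is the fix.
EVENT_TASK_DONE = "task_done"

# Configuration value: a default that callers can override.
DEFAULT_TIMEOUT_SECONDS = 30


def load_timeout(config):
    """Read the timeout from config, falling back to the default."""
    return config.get("timeout_seconds", DEFAULT_TIMEOUT_SECONDS)


@dataclass
class RuntimeState:
    agent_id: str  # user-supplied identity, known only at runtime


def is_own_event(event, state):
    """True when the event is a completion event for this agent.

    A magic-literal version would compare event["agent"] == "coder-1",
    silently assuming one specific deployment. Moving "coder-1" into a
    constant would keep that assumption; resolving it from state removes it.
    """
    return event["type"] == EVENT_TASK_DONE and event["agent"] == state.agent_id
```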
Suppression markers: Count nolint, noqa, @ts-ignore, # type: ignore, eslint-disable
Panic/exit calls: panic(), os.Exit(), process.exit(), sys.exit() in non-main code
Untyped escape hatches: interface{}, any, Any, object in production code
TODO/FIXME/HACK: Count and assess whether tracked or abandoned
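Counting these markers is mechanical. The sketch below covers suppression markers and TODO/FIXME/HACK for a single source string. The regular expressions are approximations chosen for this example, and the function name `count_markers` is hypothetical. Whether a TODO is tracked or abandoned still needs human judgment.

```python
import re
from collections import Counter

# Approximate patterns for the suppression markers listed above.
SUPPRESSION_PATTERNS = {
    "nolint": re.compile(r"//\s*nolint\b"),
    "noqa": re.compile(r"#\s*noqa\b"),
    "@ts-ignore": re.compile(r"@ts-ignore\b"),
    "type: ignore": re.compile(r"#\s*type:\s*ignore\b"),
    "eslint-disable": re.compile(r"eslint-disable"),
}
TODO_PATTERN = re.compile(r"\b(TODO|FIXME|HACK)\b")


def count_markers(source):
    """Count suppression and TODO-style markers in one source string."""
    counts = Counter()
    for line in source.splitlines():
        for name, pattern in SUPPRESSION_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1  # at most one hit per marker kind per line
        for match in TODO_PATTERN.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

Summing the counters across files gives the totals for the dashboard; the per-file counters show where suppressions cluster.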
Assemble findings into the dashboard format from references/report-format.md.
Qualitative assessment of each component.
Walk the project structure. Identify natural subsystem boundaries:
In Pairing mode: confirm identified subsystems before proceeding.
Each subsystem gets a 1–5 star rating:
| Stars | Meaning |
|---|---|
| ★★★★★ | Exemplary — clean, well-tested, minimal concerns |
| ★★★★☆ | Strong — solid engineering with minor concerns |
| ★★★☆☆ | Adequate — functional but meaningful improvement opportunities |
| ★★☆☆☆ | Concerning — significant issues affecting maintainability or reliability |
| ★☆☆☆☆ | Critical — serious problems requiring immediate attention |
Rating dimensions (weight equally unless context justifies otherwise):
For each subsystem:
### [Subsystem Name] (`path/`) ★★★★☆
**Strengths:**
- [Specific strength with evidence]
**Concerns:**
- [Specific concern with evidence — file names, LOC counts, concrete observations]
Discipline: Every strength and concern must cite evidence. "Well-tested" is not a strength — "2.5:1 test-to-production ratio with table-driven subtests" is. "Large file" is not a concern — "handlers.go at 918 LOC mixing 30+ handler functions" is.
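One way to keep ratings consistent across subsystems is to score each dimension from 1 to 5 and combine them. The sketch below assumes a weighted mean rounded half-up; the skill only says to weight dimensions equally by default, so the aggregation rule, the function name `star_rating`, and the dimension names in the usage note are all assumptions.

```python
def star_rating(dimension_scores, weights=None):
    """Combine per-dimension scores (1-5) into a five-character star string.

    Assumption: a weighted mean, rounded half-up, with equal weights by default.
    """
    if not dimension_scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        weights = {name: 1.0 for name in dimension_scores}
    total_weight = sum(weights[name] for name in dimension_scores)
    mean = sum(score * weights[name] for name, score in dimension_scores.items()) / total_weight
    stars = min(5, max(1, int(mean + 0.5)))  # round half-up, clamp to 1..5
    return "★" * stars + "☆" * (5 - stars)
```

For instance, `star_rating({"testing": 5, "structure": 4, "error_handling": 3})` averages to 4 and yields a four-star rating. The number supports the evidence; it does not replace it.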
After subsystem analysis, assess these quality dimensions that span subsystems. Include each section only when there is substantive content — an empty section is worse than an absent one.
Testing & Quality Infrastructure:
Pre-Commit & CI Pipeline:
Documentation & Specifications:
Aggregate findings into overall assessment.
One paragraph: what the project is and its overall engineering quality.
Key Strengths (3–5 bullet points): The most impactful positive patterns. Synthesize — don't enumerate. "Clean architecture" is not a strength; "Strict layer separation with dependencies pointing inward — no infrastructure types leak into domain" is.
Areas for Improvement (3–5 bullet points): The most impactful concerns. Same evidence discipline.
| Grade | Meaning |
|---|---|
| A+ | Exceptional — exemplary across all dimensions; teaching reference quality |
| A | Excellent — strong across all dimensions; concerns are minor |
| A- | Excellent with concerns — strong foundation, meaningful structural or coverage gaps |
| B+ | Good — solid engineering; several areas need attention |
| B | Good with gaps — functional and maintainable; notable testing or structural gaps |
| B- | Adequate — works but shows systematic underinvestment in quality |
| C+ | Below expectations — multiple significant concerns; improvement needed before scaling |
| C | Concerning — serious quality issues affecting reliability or maintainability |
| C- | Poor — widespread problems; refactoring required before feature work |
| D | Critical — fundamental issues; quality debt threatens project viability |
| F | Failing — unmaintainable; rebuild considerations warranted |
Grading discipline:
Prioritized refactoring roadmap.
| Priority | Criteria | Typical Actions |
|---|---|---|
| P1: High Impact / Low Risk | Structural improvements that don't change behavior. Clear, safe, high ROI. | File splits, module extraction, grouping, adding missing CI gates, extracting typed constants from magic literals |
| P2: Medium Impact / Medium Risk | Quality improvements requiring broader changes. | Coverage enforcement, test additions, API cleanup, dependency updates, design pattern introduction (strategy, declarative registration), resolving hardcoded identities |
| P3: Strategic / Long-term | Investments that compound over time. May require architecture changes. | Fuzz testing, spec-code automation, tooling, major decompositions |
For each recommendation:
#### N.M [Title]
- **What**: [Specific files/components to change and how]
- **Risk**: [Low / Medium / High] — [rationale]
- **Impact**: [What improves and why it matters]
- **Depends on**: [Other recommendations, if any]
Every recommendation must trace to a finding in Phase 2 or Phase 3. No generic "best practices" without project-specific justification.
ISSUES_FILE = specs/architecture/architectural-issues.md
Significant findings (subsystem concerns rated ★★☆☆☆ or below, P1 recommendations, cross-cutting concerns) should be persisted to ISSUES_FILE for long-term tracking.
Persistence format:
### [Issue Title]
**Skill:** code-quality-assessment
**Category:** [Subsystem concern / Cross-cutting / RECOMMENDATION]
**Issue:** [Description]
**Implication:** [Why it matters]
**Direction:** [Suggested approach, if any]
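If issues are written programmatically, the template above maps onto a small record type. This is an illustrative sketch: the `QualityIssue` class and `render_issue` function are hypothetical names, and omitting the Direction line when no approach is suggested is an interpretation of "if any".

```python
from dataclasses import dataclass


@dataclass
class QualityIssue:
    title: str
    category: str      # Subsystem concern / Cross-cutting / RECOMMENDATION
    issue: str
    implication: str
    direction: str = ""  # optional, per "if any" in the template


def render_issue(item):
    """Render one issue in the persistence format shown above."""
    lines = [
        f"### {item.title}",
        "**Skill:** code-quality-assessment",
        f"**Category:** {item.category}",
        f"**Issue:** {item.issue}",
        f"**Implication:** {item.implication}",
    ]
    if item.direction:
        lines.append(f"**Direction:** {item.direction}")
    return "\n".join(lines) + "\n"
```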
What to persist:
What NOT to persist:
This skill assesses the whole repository, not individual diffs. Scope constraints apply to persistence (what gets written to ISSUES_FILE), not to analysis (what gets examined).
Liza mode (multi-agent):
Pairing mode:
Pairing mode: Before saving findings to ISSUES_FILE:
Found [N] quality issues worth persisting:
1. [Issue title] — [one-line summary]
2. ...
Save to specs/architecture/architectural-issues.md? (y/n/select specific)
Wait for user confirmation.
Liza mode: Save automatically after assessment completion.
| Skill | Relationship |
|---|---|
| software-architecture-review | Complementary. Code quality assesses health metrics and grading; architecture review assesses structural patterns, smells, and dependency direction. For deeper structural analysis, invoke architecture review. |
| testing | Downstream. Testing skill provides detailed test methodology; code quality assessment provides the bird's-eye view of testing adequacy. |
| clean-code | Downstream. Code quality assessment identifies refactoring targets; clean-code executes the transformations. |
| code-review | Orthogonal. Code review evaluates diffs; code quality evaluates the whole codebase. Assessment findings provide context for reviewers. |
Healthy ranges for calibrating assessments. These are norms, not targets.
| Metric | Healthy Range | Warning Signs |
|---|---|---|
| Test-to-production ratio | 0.5:1 – 3:1 | <0.3:1 (undertested), >5:1 (possibly testing implementation details) |
| Max file LOC | <500 | >500 (god class candidate), >800 (almost certainly needs splitting) |
| Direct dependencies | Varies by ecosystem | >50 for a focused tool; >200 for any project |
| CI coverage enforcement | Present | Absent when test ratio is healthy (culture without enforcement) |
| TODOs in production code | 0 ideal | >10 untracked (deferred maintenance) |
| Pre-commit hooks | Present | None configured in a team project |
| Magic literals in dispatch | 0 (all typed constants) | >5 untyped strings in control flow (typo risk, no IDE support) |
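As a worked example of applying one row, the test-to-production ratio can be classified as below. The table leaves the bands between 0.3 and 0.5 and between 3 and 5 unlabelled; calling them "borderline" is an assumption of this sketch, as is the function name.

```python
def classify_test_ratio(test_loc, production_loc):
    """Classify the test-to-production LOC ratio against the benchmark row."""
    if production_loc <= 0:
        raise ValueError("production_loc must be positive")
    ratio = test_loc / production_loc
    if ratio < 0.3:
        return "undertested"
    if ratio > 5:
        return "possibly testing implementation details"
    if 0.5 <= ratio <= 3:
        return "healthy"
    # Assumption: the table does not name the in-between bands.
    return "borderline"
```

A project with 2,500 test LOC against 1,000 production LOC (2.5:1) lands in the healthy band.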
Calibration varies by ecosystem. Anchoring to wrong norms produces misleading grades.
| Language | Typical Test Approach | Dependency Norms | File Size Norms |
|---|---|---|---|
| Go | Table-driven t.Run() subtests; stdlib testing | Minimal (stdlib-first) | Packages <500 LOC typical |
| Python | pytest; fixtures-heavy | pip ecosystem; moderate deps normal | Modules <300 LOC typical |
| TypeScript/JS | Jest/Vitest; mock-heavy | npm ecosystem; high dep count normal | Components <200 LOC typical |
| Rust | #[test] modules; property testing via proptest | Cargo ecosystem; moderate deps | Modules <500 LOC typical |
| Java/Kotlin | JUnit; Spring test | Maven/Gradle; high deps normal | Classes <300 LOC typical |
| C#/.NET | xUnit/NUnit; mock-heavy | NuGet; moderate deps | Classes <300 LOC typical |
| Ruby | RSpec/Minitest; fixtures | Bundler; moderate deps | Classes <200 LOC typical |
| Elixir | ExUnit; doctests | Hex; minimal deps typical | Modules <300 LOC typical |
A Go project with 50 dependencies is notable; a TypeScript project with 50 is unremarkable. Grade relative to ecosystem.