| name | deterministic-metric-design |
| description | Inventing deterministic metrics — turning a fuzzy property like 'maintainability', 'risk', or 'how reducible this code is' into a deterministic, computable number an agent can trust and optimize. Covers the path from construct to adoption — operationalizing the construct, confronting computability limits (Kolmogorov, Rice) with sound proxies, picking the right measurement scale, proving properties (monotonicity, invariance, the Weyuker/Briand axioms), guaranteeing determinism, establishing construct validity (not just LOC in disguise), and hardening against Goodhart-style gaming when an agent optimizes the metric. Trigger when designing, reviewing, or validating a quantitative metric, score, measure, or index — and even when the user doesn't say 'metric' but wants to quantify, score, rank, or measure code/behavior, build a deterministic optimization target, or invent a measure for something previously unquantified (e.g., behavior-preserving codebase-size reduction). |
dot-skills Deterministic Metric Design Best Practices
Design metrics that are deterministic, computable, provable, and valid — measures an agent can trust and optimize against without gaming them. The 44 rules across 8 categories take a metric from a fuzzy construct to an adoptable, machine-checkable number: define the construct, confront computability limits with sound proxies, ground it in measurement theory, prove its properties, pin its determinism, validate it empirically, harden it against optimization pressure, and package it for adoption.
A running example threads through every category: a deterministic measure of behavior-preserving codebase-size reduction (shrinking code without changing how the app works). It is a useful stress test because the exact quantity is provably out of reach: Kolmogorov complexity is uncomputable, and program equivalence is undecidable (a consequence of Rice's theorem). The whole craft therefore lies in building a deterministic, tractable proxy with a proven guarantee.
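As a rough illustration of what such a proxy might look like, the sketch below counts AST nodes in Python source. This is an assumption made for illustration, not the proxy the rules prescribe: the names `ast_node_count` and `size_reduction` are hypothetical, and the sketch only covers the size half of the problem. Behavior preservation would have to be established separately, for example by a test suite.

```python
import ast

def ast_node_count(source: str) -> int:
    """Count AST nodes in Python source.

    Comments and whitespace are not part of the AST, so the count is
    invariant under reformatting. It is deterministic for a fixed
    Python version; the grammar (and so the count) can change between
    versions, so a real metric would pin the interpreter version.
    """
    tree = ast.parse(source)
    return sum(1 for _ in ast.walk(tree))

def size_reduction(before: str, after: str) -> int:
    """Hypothetical proxy: how many AST nodes the rewrite removed."""
    return ast_node_count(before) - ast_node_count(after)

# Hypothetical before/after pair: the rewrite drops a redundant local.
BEFORE = "def f(a):\n    b = a  # temp\n    return b\n"
AFTER = "def f(a):\n    return a\n"

if __name__ == "__main__":
    print(size_reduction(BEFORE, AFTER))
```

The node count is a computable stand-in, not the uncomputable ideal, and it makes no claim that `BEFORE` and `AFTER` behave the same.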
This is the measurement-design layer that the *-algorithms skills apply (Big-O, NDCG, cyclomatic, MoJoFM) but never teach.
When to Apply
Use this skill when:
- Designing a new metric, score, or index — or reviewing someone's proposed metric for rigor
- Asked to "quantify", "measure", "score", or "rank" a property that has no agreed measure yet
- Building a deterministic optimization target an agent will push on (e.g., reduce code size without changing behavior)
- Auditing an existing metric that "feels off" — it suspiciously tracks LOC, jumps between runs, or gets gamed
- Turning a research idea or formula into something computable, reproducible, and adoptable
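One way to audit the "suspiciously tracks LOC" symptom from the list above is to rank-correlate the candidate metric with line counts over a sample of files. The sketch below implements Spearman's rank correlation with the standard library only. The sample values are invented for illustration, and what counts as "too close to LOC" is a judgment call this sketch does not settle.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical sample: lines of code and a candidate metric for four files.
loc = [120, 45, 300, 80]
candidate = [130, 50, 310, 70]

if __name__ == "__main__":
    print(spearman(loc, candidate))
```

A rho near 1 suggests the candidate orders files the same way LOC does, which is a prompt to ask what extra information the metric carries.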
Workflow: Define → Make Computable → Prove → Validate → Harden
The categories are ordered by cascade severity — an upstream mistake poisons everything below it. Work top-down, and jump straight to a category using this table:
Each reference file is a {category}-{slug}.md containing: WHY it matters, an Incorrect example with the failure annotated, a Correct example with the minimal fix, and a reference. The incorrect/correct examples are metric definitions and procedures, not application code — the contrast is a badly-designed measure versus the fixed one.
Rule Categories by Priority
| # | Category | Prefix | Impact | Rules |
|---|---|---|---|---|
| 1 | Construct Definition & Operationalization | def- | CRITICAL | 6 |
| 2 | Computability & Tractability | comp- | CRITICAL | 7 |
| 3 | Measurement-Theoretic Foundations | meas- | HIGH | 5 |
| 4 | Proof of Metric Properties | prop- | HIGH | 6 |
| 5 | Determinism & Reproducibility | det- | HIGH | 5 |
| 6 | Construct Validity & Calibration | valid- | MEDIUM-HIGH | 6 |
| 7 | Optimization Safety & Anti-Gaming | game- | MEDIUM | 5 |
| 8 | Aggregation, Reporting & Adoption | agg- | LOW-MEDIUM | 4 |
See references/_sections.md for the full ordering rationale.
Quick Reference
1. Construct Definition & Operationalization (CRITICAL)
2. Computability & Tractability (CRITICAL)
3. Measurement-Theoretic Foundations (HIGH)
4. Proof of Metric Properties (HIGH)
5. Determinism & Reproducibility (HIGH)
6. Construct Validity & Calibration (MEDIUM-HIGH)
7. Optimization Safety & Anti-Gaming (MEDIUM)
8. Aggregation, Reporting & Adoption (LOW-MEDIUM)
How to Use
- Identify where you are with the Workflow table and open the matching first rule.
- Work the categories top-down: def- and comp- are CRITICAL because a fuzzy construct or an uncomputable ideal makes everything downstream noise or unusable.
- When proposing or critiquing a metric, quote the rule by file path so reviewers can check the reasoning.
- For a new metric, produce a one-page spec naming: construct, proxy, scale + unit + zero, proven properties, determinism guarantees, validity evidence, guardrails, and version — one line per category here.
- See references/_sections.md for ordering rationale and assets/templates/_template.md when adding rules.
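The one-page spec described in the list above could be kept machine-checkable with a small record type. The sketch below is one possible shape, assuming one field per item named in that bullet. The class name `MetricSpec`, the `missing` helper, and every example value are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical one-page metric spec; frozen so a published spec cannot drift."""
    construct: str
    proxy: str
    scale: str
    unit: str
    zero: str
    proven_properties: tuple[str, ...]
    determinism_guarantees: tuple[str, ...]
    validity_evidence: tuple[str, ...]
    guardrails: tuple[str, ...]
    version: str

    def missing(self) -> list[str]:
        """Names of fields left empty (empty string or empty tuple)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# Illustrative values only; none of them come from the rule files.
draft = MetricSpec(
    construct="behavior-preserving codebase-size reduction",
    proxy="AST node count delta",
    scale="ratio",
    unit="AST nodes",
    zero="no nodes removed",
    proven_properties=("invariant under reformatting",),
    determinism_guarantees=("pinned parser version",),
    validity_evidence=(),  # not yet collected
    guardrails=("test suite must pass before and after",),
    version="0.1.0",
)

if __name__ == "__main__":
    print(draft.missing())
```

A reviewer can then reject a draft whose `missing()` list is non-empty before arguing about the metric itself.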
Reference Files
Related Skills
- same-results-less-code, code-simplifier, complexity-optimizer, knip-deadcode: prescriptive code-reduction skills. This skill supplies the measurement layer they lack: a deterministic, behavior-preserving reduction metric to target and verify.
- algorithmic-complexity-review, computer-science-algorithms: apply existing measures (Big-O). This skill teaches how to design new ones.
- opensearch-function-scoring-algorithms: applied ranking metrics (NDCG, A/B tests). This skill is the foundational methodology beneath its eval- category.