| name | validating-claim-verify |
| description | Actively test and verify specific claims about correctness, safety, or behavior through experiments and observable evidence. Use when: (1) asked to verify a claim about code behavior, correctness, or performance, (2) statements about behavior appear without verification evidence or test results, (3) language includes modal uncertainty markers like "should work" where verification is possible, (4) discussion analyzes implementation through reading rather than execution and observation. |
Move from hypothesis to verified fact by designing and running experiments that test specific claims about code behavior.
When making claims about behavior, correctness, or performance, treat them as hypotheses requiring experimental verification rather than facts established through static analysis alone.
Identify the specific claim - Articulate exactly what assertion is being made about the code's behavior, whether that's "this function handles empty arrays correctly," "this component renders without errors," or "this system processes requests within 200ms." Vague claims like "this should work" resist testing and therefore remain forever unverified.
Assess verifiability - Determine whether the claim can be actively tested through experiments or must remain an assumption based on static analysis. Claims about runtime behavior, correctness under specific inputs, and performance characteristics invite verification, while claims about future maintainability or hypothetical scenarios often resist it.
Design the experiment - Specify exactly what test, observation, or execution would prove or disprove the claim, choosing the verification method that matches its scope: isolated tests for specific behaviors, integrated tests for interactions between parts, observation for system-level patterns, or cross-system monitoring for platform-wide claims.
Run the experiment - Execute the test, perform the observation, examine the results. Mental simulation produces plausible predictions; direct observation produces evidence.
Interpret the evidence - Examine what the experiment actually showed, distinguishing between claims fully supported by evidence, claims partially supported requiring qualification, and claims contradicted by results. Adjust assertions to match the strength of evidence rather than the strength of initial intuition.
Document confidence level - State both the conclusion and its basis, making clear whether reporting verified behavior from direct observation, inferred behavior from analysis, or assumed behavior requiring validation. Claims then carry appropriate epistemic weight and readers know what's proven versus what's plausible.
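The six steps above can be sketched as a minimal verification loop. All names here (`Claim`, `Evidence`, `runExperiment`) are illustrative, not a real API; the claim about `parseInt` is a trivially checkable stand-in.

```typescript
// Minimal sketch of the claim-verification loop: identify, assess,
// design, run, interpret, document. Names are hypothetical.

type Claim = { statement: string; verifiable: boolean };
type Evidence = { supported: boolean; observation: string };

// Steps 1-2: articulate the specific claim and assess verifiability.
const claim: Claim = {
  statement: "parseInt('42') returns the number 42",
  verifiable: true, // runtime behavior: can be tested by direct execution
};

// Steps 3-4: design and run the experiment.
function runExperiment(c: Claim): Evidence {
  if (!c.verifiable) {
    return { supported: false, observation: "unverifiable: remains an assumption" };
  }
  const result = parseInt("42", 10); // direct observation, not mental simulation
  return {
    supported: result === 42,
    observation: `parseInt('42') returned ${result}`,
  };
}

// Steps 5-6: interpret the evidence and document the confidence level.
const evidence = runExperiment(claim);
const report = evidence.supported
  ? `Verified by direct observation: ${evidence.observation}`
  : `Contradicted by evidence: ${evidence.observation}`;
console.log(report);
```

The point of the sketch is the shape of the report: it states both the conclusion and its basis, so the reader knows the claim was executed, not merely reasoned about.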
I'm reviewing a PR that adds parseDate(input: string) and the author claims "it handles all edge cases correctly." That's a claim requiring verification. I identify specific sub-claims: handles empty strings, handles invalid formats, handles boundary dates like year 9999 or year 1. I assess these are verifiable through unit tests. Looking at the test file, I find tests for valid ISO dates but none for empty string, malformed input, or boundary years. I cannot verify the claim from tests alone, so I run the function directly in a Node REPL: parseDate('') throws an unhandled error, parseDate('invalid') returns an Invalid Date object, parseDate('9999-12-31') works. My experiment shows the claim is false - edge cases are not all handled correctly. I document: "Verified through manual testing that the function handles valid dates and far-future dates, but crashes on empty strings and returns Invalid Date for malformed input instead of throwing a clear error. The claim about handling all edge cases is contradicted by evidence."
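The REPL experiment above can be sketched as a script. This `parseDate` is a hypothetical stand-in mirroring the behavior described in the example (the PR's actual implementation is not shown); the checks record each sub-claim's observed outcome.

```typescript
// Hypothetical stand-in for the PR's parseDate, mirroring the observed
// behavior: valid ISO dates parse, empty string throws, malformed input
// yields an Invalid Date rather than a clear error.
function parseDate(input: string): Date {
  if (input.length === 0) {
    // Unhandled edge case in the PR: empty input crashes.
    throw new TypeError("Cannot parse empty input");
  }
  return new Date(input); // malformed strings produce Invalid Date, not an error
}

// The experiment: execute against each sub-claim, record the observation.
let emptyThrows = false;
try {
  parseDate("");
} catch {
  emptyThrows = true; // observed: crash on empty string
}

const malformed = parseDate("invalid");
const malformedIsInvalid = Number.isNaN(malformed.getTime()); // Invalid Date, no error thrown

const boundary = parseDate("9999-12-31");
const boundaryWorks = boundary.getUTCFullYear() === 9999; // far-future date parses

console.log({ emptyThrows, malformedIsInvalid, boundaryWorks });
```

Each boolean maps directly to a sub-claim, which is what makes the composite claim "handles all edge cases" falsifiable: any single false value contradicts it.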
I'm implementing a checkout flow where PaymentForm submits to PaymentService, and I claim "the form correctly handles payment failures." This is verifiable through integration testing. I design an experiment: mock the PaymentService to return a failure, trigger form submission, observe whether the form shows error UI and allows retry. I write the integration test and run it. The test fails - the form shows a loading spinner forever when payment fails because it only handles the success callback. Running the experiment revealed my claim was wrong. I fix the form to handle both success and failure, run the test again, and it passes. Now I can verify the claim: "Confirmed through integration testing that PaymentForm handles payment failures by displaying error messages and enabling retry. Evidence: test suite includes mock failure scenarios and all pass."
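The fixed submit handler and its failure-path experiment can be sketched as follows. `PaymentService`, `submitPayment`, and the form-state values are illustrative names standing in for the real checkout code, which is not shown.

```typescript
// Sketch of the integration experiment with a mocked PaymentService.
// All names are hypothetical stand-ins for the checkout flow described.

type PaymentResult = { ok: boolean; error?: string };
type FormState = "idle" | "loading" | "error" | "success";

interface PaymentService {
  charge(amountCents: number): Promise<PaymentResult>;
}

// The fixed handler: resolves the loading state on BOTH outcomes,
// instead of only wiring up the success callback.
async function submitPayment(
  service: PaymentService,
  amountCents: number
): Promise<FormState> {
  let state: FormState = "loading";
  try {
    const result = await service.charge(amountCents);
    state = result.ok ? "success" : "error"; // failure path now handled
  } catch {
    state = "error"; // transport failures also surface as retryable errors
  }
  return state;
}

// Experiment: mock the service to fail, observe the resulting form state.
const failingService: PaymentService = {
  charge: async () => ({ ok: false, error: "card_declined" }),
};

submitPayment(failingService, 1999).then((state) => {
  console.log(state); // an "error" state here, not a spinner stuck in "loading"
});
```

The mock is the experiment's control: by forcing the failure branch, the test observes the behavior the original code never exercised.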
I'm proposing a caching layer and claim "this will reduce database load by 80% for read-heavy endpoints." This is verifiable through production monitoring or load testing. I design an experiment: deploy the cache to staging with production-like traffic, measure database query rate before and after, compare results. I run a load test simulating 1000 requests/second against the staging environment. Before cache: 950 database queries/second. After cache: 400 database queries/second. That's a 58% reduction, not 80%. The experiment shows my claim overstated the impact. I revise based on evidence: "Load testing in staging shows the cache reduces database queries by approximately 60% under read-heavy load (1000 req/s), with potential for higher reduction as cache warms up. Evidence: staging metrics over 30-minute test period, 95% cache hit rate achieved."
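The arithmetic behind the revised claim can be checked directly from the staging numbers quoted above (950 queries/second before the cache, 400 after):

```typescript
// Quick check of the staging load-test numbers quoted above.
const queriesBefore = 950; // db queries/second without cache
const queriesAfter = 400;  // db queries/second with cache

const reductionPct = ((queriesBefore - queriesAfter) / queriesBefore) * 100;
console.log(reductionPct.toFixed(1)); // prints "57.9" - roughly 60%, not 80%
```

Doing this arithmetic explicitly is part of interpreting the evidence: the measured 57.9% supports a qualified "approximately 60%" claim and contradicts the original 80% claim.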