Run any Skill in Manus with one click

grade-tests

Grades a specified set of test methods individually and produces a concise table mapping each test (fully-qualified name) to a letter grade (A–F), a score band, and a one-line note — designed to be posted as a PR comment. Use when the caller wants per-test feedback on a curated list of methods (for example, the new or modified tests in a pull request), not a suite-wide audit. Polyglot: .NET (MSTest/xUnit/NUnit/TUnit), Python (pytest/unittest), TS/JS (Jest/Vitest/Mocha/node:test), Java (JUnit/TestNG), Go, Ruby (RSpec/Minitest), Rust, Swift (XCTest/Swift Testing), Kotlin (JUnit/Kotest), PowerShell (Pester), C++ (GoogleTest/Catch2/doctest). Input is a list of test methods (or method bodies / file+line spans); output is a compact markdown table plus a short summary. DO NOT USE FOR: full suite audits (use test-quality-auditor agent or test-anti-patterns), writing new tests (use code-testing-generator agent or writing-mstest-tests), fixing failures, or measuring code coverage.

Run Skill in Manus

Stars3,323

Forks245

UpdatedJune 5, 2026 at 15:42

Source

dotnet

dotnet/skills

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

SKILL.md

readonly

More from this repository

same repository

code-testing-agent

dotnet/skills

Generates and writes new unit tests for any programming language — scaffolds .NET test projects, pytest suites, Vitest/Jest suites, Go test files, and JUnit suites, and configures coverage tooling (coverlet, pytest-cov, @vitest/coverage-v8) as part of test generation. Use when asked to generate tests, generate pytest tests, generate Vitest tests, write unit tests, add tests, improve coverage, comprehensive tests, or scaffold a new test project or suite for an app, service, library, REST API, blueprint, or package — including project-wide, multi-file test generation across services, repositories, routes, and modules. Supports C#/.NET, Python (pytest, Flask/Django), TypeScript/JavaScript (Vitest, Jest, Mocha), Go, Rust, Java (JUnit). Runs a research, planning, and implementation pipeline so tests compile and pass. DO NOT USE FOR: running existing tests (use run-tests); analyzing existing coverage reports (use coverage-analysis or crap-score); MSTest modernization (use writing-mstest-tests).

2026-06-053.3k

migrate-xunit-to-mstest

dotnet/skills

Migrate .NET test projects from xUnit.net (v2 or v3) to MSTest v4. USE FOR: convert/migrate xUnit tests to MSTest, replace xunit/xunit.v3 packages, port [Fact]/[Theory]/[InlineData]/[MemberData]/[ClassData] to [TestMethod]/[DataRow]/[DynamicData], port Assert.Equal/True/Throws/ThrowsAsync to Assert.AreEqual/IsTrue/ThrowsExactly/ThrowsExactlyAsync, port IClassFixture/ ICollectionFixture/IDisposable/IAsyncLifetime/ITestOutputHelper/[Trait]/[Fact(Skip)] to MSTest equivalents, preserve xUnit parallel-class default via [assembly: Parallelize(Scope = ClassLevel)], remove xunit.runner.json. DO NOT USE FOR: xUnit v2 -> v3 upgrade (use migrate-xunit-to-xunit-v3); MSTest -> xUnit, NUnit/TUnit -> MSTest (no skills exist); MSTest version upgrades (use migrate-mstest-v1v2-to-v3 or migrate-mstest-v3-to-v4); VSTest <-> MTP only (use migrate-vstest-to-mtp); general .NET upgrades.

2026-06-053.3k

writing-mstest-tests

dotnet/skills

Write new MSTest unit tests and fix existing MSTest code using MSTest 3.x/4.x modern APIs and best practices. USE FOR: write or create MSTest unit tests, fix or modernize MSTest assertions, better MSTest assertion than Assert.IsTrue, replace hard cast with MSTest type assertion, MSTest assertion APIs (IsInstanceOfType, Contains, ContainsSingle, HasCount, IsEmpty, IsNotEmpty, DoesNotContain, StartsWith, EndsWith, MatchesRegex, IsGreaterThan, IsInRange, IsNull), fix swapped Assert.AreEqual arguments, replace ExpectedException with Assert.Throws, data-driven tests (DataRow, DynamicData, ValueTuples), test lifecycle (sealed classes, TestInitialize, TestCleanup), async tests and cancellation tokens, test parallelization (Parallelize / DoNotParallelize), MSTest.Sdk project setup. DO NOT USE FOR: broad test quality audits (use test-anti-patterns), running tests (use run-tests), MSTest version migration (use migrate-mstest-v1v2-to-v3 or migrate-mstest-v3-to-v4), xUnit/NUnit/TUnit, or non-.NET languages.

2026-06-043.3k

code-testing-extensions

dotnet/skills

Provides file paths to language-specific extension files for the code-testing pipeline. Call this skill to discover available extension guidance files (e.g., dotnet.md for .NET, cpp.md for C++). Do not use directly — invoked by code-testing agents and skills that need language-specific references.

2026-06-023.3k

assertion-quality

dotnet/skills

Analyzes the variety and depth of assertions across test suites in any language. Use when the user asks to evaluate assertion quality, find shallow testing, identify assertion-free tests (no assertions or only trivial ones like Assert.IsNotNull / expect(x).toBeTruthy() / assert x is not None), flag self-referential or tautological assertions (output equals input on identity/round-trip operations), measure assertion coverage diversity, or audit whether tests verify different facets of correctness. Produces metrics and actionable recommendations. Polyglot: .NET (MSTest/xUnit/NUnit/TUnit), Python (pytest/unittest), TS/JS (Jest/Vitest/Mocha/Jasmine/node:test), Java (JUnit/TestNG), Go, Ruby (RSpec/Minitest), Rust, Swift (XCTest/Swift Testing), Kotlin (JUnit/Kotest), PowerShell (Pester), C++ (GoogleTest/Catch2/doctest). DO NOT USE FOR: writing new tests (use code-testing-agent, or writing-mstest-tests for MSTest), anti-patterns like flakiness or duplication (use test-anti-patterns), fixing assertions.

2026-06-013.3k

test-analysis-extensions

dotnet/skills

Provides file paths to language-specific reference files for the test ANALYSIS skills (assertion-quality, test-anti-patterns, test-gap-analysis, test-smell-detection, test-tagging). Call this skill to discover available extension files (e.g., dotnet.md for .NET/MSTest/xUnit/NUnit/TUnit, python.md for pytest/unittest, typescript.md for Jest/Vitest/Mocha, java.md for JUnit/TestNG, etc.). Do not use directly — invoked by the test-quality-auditor agent and polyglot analysis skills that need framework-specific lookup tables (test markers, assertion APIs, skip annotations, sleep patterns, mystery guest indicators, integration markers, setup/teardown, tag-support capability).

2026-06-013.3k

name	grade-tests
description	Grades a specified set of test methods individually and produces a concise table mapping each test (fully-qualified name) to a letter grade (A–F), a score band, and a one-line note — designed to be posted as a PR comment. Use when the caller wants per-test feedback on a curated list of methods (for example, the new or modified tests in a pull request), not a suite-wide audit. Polyglot: .NET (MSTest/xUnit/NUnit/TUnit), Python (pytest/unittest), TS/JS (Jest/Vitest/Mocha/node:test), Java (JUnit/TestNG), Go, Ruby (RSpec/Minitest), Rust, Swift (XCTest/Swift Testing), Kotlin (JUnit/Kotest), PowerShell (Pester), C++ (GoogleTest/Catch2/doctest). Input is a list of test methods (or method bodies / file+line spans); output is a compact markdown table plus a short summary. DO NOT USE FOR: full suite audits (use test-quality-auditor agent or test-anti-patterns), writing new tests (use code-testing-generator agent or writing-mstest-tests), fixing failures, or measuring code coverage.
license	MIT

Grade Tests

Grade a curated list of test methods and produce a compact, PR-comment-friendly report: one row per test method with a letter grade, a score band, and a one-line note explaining the grade. The skill does not discover tests on its own — the caller (typically a PR automation workflow or a human reviewer holding a specific list) provides the test methods to grade.

Language-specific guidance: Call the test-analysis-extensions skill to discover available extension files, then read the file matching the target codebase's language and framework (e.g., extensions/dotnet.md, extensions/python.md, extensions/typescript.md, extensions/go.md). You MUST read the relevant extension file before scoring assertions or anti-patterns, because assertion APIs and idiomatic patterns differ significantly across frameworks.

Why a Per-Test Grade

Suite-wide audits (test-anti-patterns, assertion-quality, test-smell-detection) produce excellent diagnostic reports, but they are hard to consume as a short PR comment. Reviewers of a PR mostly want to know: for the tests this PR adds or changes, are they good? This skill answers that question with a one-row-per-test verdict that fits in a comment table.

When to Use

A PR automation workflow needs to post a comment grading the tests introduced or modified in a pull request.
A reviewer has a specific list of tests (a file, a class, a method list, or a diff hunk) and wants a per-test verdict rather than a suite report.
A maintainer wants to triage which of N tests in a contribution deserve follow-up improvements.

When Not to Use

The caller wants a full suite audit or comparative metrics — use test-anti-patterns (pragmatic) or test-smell-detection (formal) and let the test-quality-auditor agent orchestrate.
The caller wants to write new tests — use code-testing-generator (any language) or writing-mstest-tests (MSTest specifically).
The caller wants to measure code coverage or CRAP scores — use coverage-analysis or crap-score (.NET only).
The caller wants to fix issues directly in test code — invoke the appropriate editing skill.
No specific list of tests is provided. Do not try to grade every test in the workspace; ask the caller for an explicit list or scope.

Inputs

Input	Required	Description
Test methods	Yes	A scope to grade. Provide one of: (a) an explicit list of test method names (fully-qualified, e.g. `Namespace.ClassName.TestMethodName`); (b) one or more file paths plus an explicit instruction to grade every test declared in those files; or (c) a diff hunk / PR identifier whose changed tests should be graded. File paths are recommended but optional when method names are unambiguous in the workspace. Ambiguous requests like "grade my tests" with no scope are rejected up-front (see Step 0); this skill is for curated input and does not auto-grade an entire workspace.
Test bodies / spans	Recommended	The exact source lines for each test method. If omitted, read them from the listed files.
Production code	No	The code under test, for judging whether assertions cover the meaningful behaviors. When unavailable, mark relevant findings as "Unverified" rather than guessing.
Diff context	No	When grading PR changes, the unified diff for each test method helps focus on what actually changed.

Step 0: Validate the input

Before doing anything else, check that the caller provided one of:

An explicit list of test method names, or
One or more file paths plus an explicit instruction to grade every test declared in those files (e.g., "grade every test in OrderTests.cs"), or
A diff hunk or PR identifier whose changed tests should be graded.

If the request is ambiguous (e.g., "Grade my tests", "Are these tests any good?" with no scope, "Review the test suite"), do not load extensions, do not read files, and do not grade anything. Reply with a short message asking the caller to provide an explicit list / file(s) / diff, and optionally point them at test-quality-auditor agent or test-anti-patterns skill for full-suite analysis. Stop there.

Workflow

Step 1: Detect language and load extension

Identify the target codebase's language and test framework from the file extensions and the test method markers in the provided list. Call the test-analysis-extensions skill and read the matching extension file (e.g., extensions/dotnet.md for MSTest/xUnit/NUnit/TUnit, extensions/python.md for pytest, extensions/typescript.md for Jest/Vitest, extensions/go.md for the standard testing package). If the input contains tests from multiple languages, load each relevant extension and grade each test using its language's conventions.

Step 2: Resolve the test bodies

For each entry in the input list:

If the test body is provided inline, use it directly.
Otherwise read the file at the given path and locate the method by its fully-qualified name. Capture the full method body, including attributes / decorators / fixtures and any helper code that the test calls.
If a method cannot be found, record it as N/A — method not found and continue. Never invent a body to grade.

Step 3: Score each test

Start every test at grade A (score band 90–100), then apply deductions strictly for observable issues in the captured body. Do not deduct for hypothetical concerns (e.g., "could have more negative assertions") unless the production code clearly demands them and the production code is available.

Three sub-dimensions

Compute three sub-grades (each A–F) that together drive the overall grade.

A. Assertion strength

Read the loaded language extension's assertion API list and classify every assertion in the test body. Score from highest to lowest:

Sub-grade	Pattern
A	At least one meaningful value assertion (equality / structural / exception / state) plus, where appropriate, additional checks (negative, type, collection contents). Mock-call verifications (`Verify`, `toHaveBeenCalledWith`, `Should -Invoke`) and bare assertion forms (pytest `assert`, Go `if got != want { t.Errorf(...) }`, Rust `assert!()`) count as real assertions.
B	One clear meaningful assertion that verifies the behavior under test.
C	Only trivial assertions (single `IsNotNull` / `toBeDefined` / `assert x is not None`), or assertions that check a single field while the operation produces a richer result.
D	One self-referential / tautological assertion (`Assert.AreEqual(x, x)`, `assert dto.name == dto.name`, round-trip identity without a non-trivial input), or broad exception assertions (`Assert.ThrowsException<Exception>`).
F	No assertions at all; all assertions are always-true literals (`Assert.IsTrue(true)`, `assert True`, `expect(true).toBe(true)`) — these verify nothing and are equivalent to having no assertions; or all assertions are silently un-awaited (e.g., `expect(promise).resolves.toBe(x)` without `await`/`return`, async TUnit/xUnit `Assert.ThrowsAsync` without `await`, pytest-asyncio with un-awaited coroutine).

Exception tests (Assert.ThrowsException<T>, pytest.raises, expect(fn).toThrow, assertThrows, #[should_panic], Should -Throw, EXPECT_THROW) are complete on their own — do not require additional assertions.

B. Structure & focus

Sub-grade	Pattern
A	Clear Arrange-Act-Assert (or Given-When-Then) separation. Single behavior under test. Body under ~30 lines. Setup uses framework conventions.
B	One mild structural issue (slightly long body, missing blank lines between phases) but intent is clear.
C	Multiple behaviors mixed in one test, or AAA phases interleaved enough to slow comprehension.
D	Conditional logic in the test (`if`/`switch` driving assertions) — except for idiomatic Go/Rust table-driven sub-test loops; or test relies on previous test state (ordering dependency).
F	Test exceeds ~60 lines and verifies multiple unrelated behaviors; or shares mutable state with other tests through statics/globals without reset.

C. Anti-pattern hygiene

Scan against the catalog below. The Anti-pattern sub-grade is computed in two passes and combined deterministically:

Hard ceiling pass. Every Critical or High finding sets a maximum sub-grade (F, D, or C as labeled). Take the worst ceiling across all matched Critical/High findings — these do not accumulate (a single F finding caps the sub-grade at F regardless of how many other Critical/High findings are present).
Medium-deduction pass. Start from A, then for each Medium finding deduct one sub-grade level (A→B, B→C, C→D, D→F). These do accumulate across findings.

The final Anti-pattern sub-grade is the worse of the two passes (i.e., min(hard_ceiling, A − medium_count)). Low findings never affect the grade — mention them in the note only.

Examples (Critical/High and Medium counts → Anti-pattern sub-grade):

Zero Critical/High, 1 Medium → B (A − 1)
Zero Critical/High, 3 Medium → D (A − 3)
One C-ceiling (e.g., over-mocking), 0 Medium → C
One C-ceiling, 2 Medium → D (min(C, A − 2 = C) = C, but a third Medium would tip to D)
One F-finding (e.g., swallowed exception) plus any number of Medium → F

Critical (drop straight to F or D)

No assertions at all → F (also drives Assertion sub-grade to F)
Swallowed exceptions: try { … } catch { } (.NET), bare except: pass (Python), try { … } catch (e) {} (JS/TS/Java), defer recover() without re-panic (Go), rescue StandardError with no assertion (Ruby), empty catch (Kotlin/Swift) → F
Assert-in-catch pattern (Assert.Fail(ex.Message) instead of Assert.ThrowsException) → D
Always-true literal assertions (Assert.IsTrue(true), assert True, expect(true).toBe(true)) → F (verifies nothing; also drives Assertion sub-grade to F)
Self-referential / tautological assertions on bound values (Assert.AreEqual(x, x), assert dto.name == dto.name) → D
Commented-out assertions → D

High (drop one or two sub-grades)

Wall-clock sleep used for synchronization: Thread.Sleep, Task.Delay, time.sleep, setTimeout-based wait, Thread.sleep, time.Sleep, sleep, std::thread::sleep, Start-Sleep, std::this_thread::sleep_for (in a unit test) → D
Unseeded randomness, wall-clock reads without abstraction (DateTime.Now, datetime.now(), Date.now(), System.currentTimeMillis(), time.Now(), Time.now, Instant::now(), Get-Date, system_clock::now) → D
Hard-coded environment-dependent paths (C:\…, /tmp/…, network hosts) → D
Ordering dependency on mutable static / package globals → D
Broad exception assertion (Assert.ThrowsException<Exception>, pytest.raises(Exception), expect(fn).toThrow(Error) without matcher, #[should_panic] without expected = "…", Should -Throw without -ExpectedMessage, EXPECT_ANY_THROW) → C
Over-mocking: more mock setup lines than test logic, or verifying exact call sequences instead of outcomes → C
Implementation coupling: reflection on private members, casting to internal types to access state → C

Medium (drop one sub-grade)

Poor name: Test1, TestMethod, test, single-word name that says nothing about scenario or expected outcome (judge against the language extension's convention) → drop one sub-grade
Magic values: unexplained 42, "foo", 0x1234 in arrange/assert without naming or comment → drop one sub-grade
Giant test (>30 lines covering a single behavior) → drop one sub-grade
Assertion messages that just repeat the assertion text → drop one sub-grade
Missing AAA / GWT separation when the test is non-trivial → drop one sub-grade

Low (note only, no deduction)

Unused setup/teardown hooks; print debugging left in (Console.WriteLine, print, console.log, System.out.println, fmt.Println, puts, dbg!, Write-Host, std::cout); inconsistent naming versus siblings; leftover TODO comments. Mention in the note column but do not deduct.

Combining sub-grades

Convert sub-grades to numeric points: A=4, B=3, C=2, D=1, F=0.

Overall score band = weighted average: 0.45 × Assertion + 0.30 × Anti-pattern + 0.25 × Structure
Map to letter:
- ≥ 3.5 → A (band 90–100)
- ≥ 2.8 → B (band 80–89)
- ≥ 2.0 → C (band 70–79)
- ≥ 1.2 → D (band 60–69)
- < 1.2 → F (band 0–59)
The overall grade is capped at the worst sub-grade — if any sub-grade is F, the overall grade is F; if the worst sub-grade is D, the overall grade is at most D; and so on. A test that fails on any one dimension cannot earn a higher overall grade than that dimension.

Report the letter grade and the score band (not a single 0–100 number). False precision invites bikeshedding; bands keep the conversation focused on the rubric.

Step 4: Build the note

The note column is one short sentence (target ≤ 120 characters). State the single most important reason for the grade. Examples:

A (90–100): Clear AAA structure; equality + exception assertions on the public contract.
B (80–89): Good assertion variety, mildly long body — consider splitting into per-condition tests.
C (70–79): Only checks IsNotNull on the result; no value verification.
D (60–69): Self-referential assertion: round-trip identity verifies plumbing, not transformation.
F (0–59): No assertions — test executes the method but never verifies anything.

If a test gets A with no notable issues, the note may simply be No issues found. — do not invent weaknesses to justify the grade.

Step 5: Report

Produce two sections.

1. Summary

A short paragraph (2–4 sentences) covering: total tests graded, grade distribution, most common issue, and the single most important recommendation.

2. Per-test table

| Test | Grade | Band | Notes |
|------|-------|------|-------|
| `Namespace.ClassName.Test_Method_Condition_Expected` | A | 90–100 | Clear AAA; equality + exception assertions. |
| `Namespace.ClassName.Test_Other` | C | 70–79 | Only `IsNotNull` — no value verification. |
| `Namespace.ClassName.Test_Old` | F | 0–59 | No assertions. |

Caps and ordering:

If the table would exceed 50 rows, show all tests graded below B first (worst to best), then a sample of the best tests, and wrap any overflow in a collapsed <details> block.
Within the same grade, order by file path then by method name for determinism.
If the diff context is provided, prefix each test name with a (new) or (modified) marker.

If multiple languages are present, produce one table per language and prefix each section with the language name and framework.

Validation

Common Pitfalls

Pitfall	Solution
Grading every test in the workspace when no list is provided	Ask the caller for the explicit list; this skill is for curated input.
Inflating deductions to justify the grade	Start at A; deduct only for observable issues.
Penalizing exception tests for low assertion count	Exception assertions are complete on their own.
Treating `IsNotNull` before a value assertion as trivial	Only flag when the null check is the only assertion.
Treating any Boolean assertion as effectively assertion-free	Only always-true literals (`Assert.IsTrue(true)`, `assert True`) are; meaningful `Assert.IsTrue(result.IsValid)` is a real assertion.
Flagging Go/Rust table-driven loops as conditional logic	They are idiomatic; do not deduct.
Treating pytest bare `assert` or Go `if got != want { t.Error… }` as missing-framework	Both are canonical; count in the correct assertion category.
Penalizing tests when production code is unavailable	Mark concerns about uncovered behaviors as `Unverified` and do not deduct.
Using a fake-precise score (e.g., 87/100)	Use the score band only — 90–100, 80–89, 70–79, 60–69, 0–59.
Spilling a 500-row table into a PR comment	Apply the row cap from Step 5; collapse extras into `<details>`.
Re-reporting an existing finding three times under different categories	Pick the most fitting category and report once.
Inventing weaknesses for A-grade tests to make the note "balanced"	If a test is clean, the note may simply read `No issues found.`