llm-judge

Evaluate any system against its requirements. Trigger only when the user explicitly asks for a "judge report" or similar evaluation. Follow the structured process in SKILL.md to produce a comprehensive, evidence-based assessment. Do not use with other skills or for tasks other than system evaluation.

Install command

npx skills add https://github.com/Czarnak/llm-mas-experiments --skill llm-judge

Copy this command and paste it into Claude Code to install the skill.

Source

Czarnak/llm-mas-experiments

Stars: 0

Forks: 0

Updated: May 22, 2026 at 18:18

SKILL.md

name	llm-judge
description	Evaluate any system against its requirements. Trigger only when the user explicitly asks for a "judge report" or similar evaluation. Follow the structured process in SKILL.md to produce a comprehensive, evidence-based assessment. Do not use with other skills or for tasks other than system evaluation.
disable-model-invocation	true

Judge Skill

Purpose

Perform an independent, evidence-based evaluation of a system against its stated requirements. The judge does not propose redesigns or regenerate artifacts — it only reports what it finds. Every finding must be grounded in the provided materials.

Works for:

Conceptual-only designs (architecture documents, GAIA models, system specs)
Mixed systems (design documents + code)
Code-only projects

Output Schema

Produce a JudgeReport with the following fields. Emit it as structured JSON at the end of your analysis.

JudgeReport {
  summary:                  str          # 2–4 sentences, overall verdict
  recommendation:           "pass" | "needs_revision" | "fail"
  scores:                   JudgeScores
  findings:                 JudgeFinding[]
  missing_requirements:     str[]        # requirements not addressed at all
  logical_inconsistencies:  str[]        # internal contradictions in the system
  implementation_risks:     str[]        # risks if system is built/deployed as-is
}

JudgeScores {
  requirements_fit:                 int  # 0–10  How well the system addresses all stated requirements
  logical_consistency:              int  # 0–10  Internal coherence; no contradictions or gaps
  agent_communication_correctness:  int  # 0–10  For multi-agent systems: communication logic is complete and correct
  overall:                          int  # 0–10  Holistic score; not an arithmetic average
}

JudgeFinding {
  category:       str                              # Area being evaluated (e.g. "Security", "Requirements", "API Design")
  severity:       "low" | "medium" | "high" | "critical"
  description:    str                              # Specific issue or strength found
  evidence:       str                              # Short quote or reference from the materials
}

Mapping note for GAIA / multi-agent systems: component_coverage maps to agent_service_coverage and interface_logic maps to communication_logic in the original JudgeReport schema. Use the original field names when writing JSON if the consuming system expects them.
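
For consumers that want to build or check a report in code, the schema above could be mirrored roughly as follows. This is a minimal sketch using Python dataclasses; the class layout and the `to_json` helper are illustrative assumptions, and only the field names, value ranges, and allowed strings come from the schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Allowed values, taken from the schema above.
RECOMMENDATIONS = ("pass", "needs_revision", "fail")
SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class JudgeScores:
    requirements_fit: int
    logical_consistency: int
    agent_communication_correctness: int
    overall: int

    def __post_init__(self) -> None:
        # Every score in the schema is an integer from 0 to 10.
        for name, value in asdict(self).items():
            if not isinstance(value, int) or not 0 <= value <= 10:
                raise ValueError(f"{name} must be an int in 0..10, got {value!r}")


@dataclass
class JudgeFinding:
    category: str
    severity: str
    description: str
    evidence: str

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


@dataclass
class JudgeReport:
    summary: str
    recommendation: str
    scores: JudgeScores
    findings: List[JudgeFinding] = field(default_factory=list)
    missing_requirements: List[str] = field(default_factory=list)
    logical_inconsistencies: List[str] = field(default_factory=list)
    implementation_risks: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError(f"unknown recommendation: {self.recommendation!r}")

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses, keeping field order.
        return json.dumps(asdict(self), indent=2)
```

Validating in `__post_init__` means a malformed report fails at construction time rather than when it is serialized.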

Evaluation Process

Work through the five phases in order. Do not skip a phase even if materials seem thin — note what is absent instead.

Phase 1 — Intake & Classification

Read every provided artifact: requirements list, description, design documents, architecture diagrams, code files, test files.
Classify the system:
- conceptual-only — no code present
- design + code — both design docs and implementation exist
- code-only — implementation with no separate design docs
Detect programming languages if code is present.

If code is present, load the language-specific checklist before Phase 4:

Language	File to read
Python	`references/review-checklist/python.md`
C++	`references/review-checklist/cpp.md`
C#	`references/review-checklist/csharp.md`
Kotlin	`references/review-checklist/kotlin.md`
Go	`references/review-checklist/golang.md`

Load only the files for languages actually present. Do not load all of them.
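
The "load only what is present" rule can be sketched as a lookup from detected languages to checklist paths. The checklist paths below are the ones in the table; the extension map and the `checklists_for` helper are assumptions for illustration, since the skill does not say how languages are detected.

```python
from pathlib import Path
from typing import Iterable, List

# Checklist paths as listed in the table above.
CHECKLISTS = {
    "Python": "references/review-checklist/python.md",
    "C++": "references/review-checklist/cpp.md",
    "C#": "references/review-checklist/csharp.md",
    "Kotlin": "references/review-checklist/kotlin.md",
    "Go": "references/review-checklist/golang.md",
}

# Assumption: detection by file extension. This map is illustrative only.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".cpp": "C++",
    ".cc": "C++",
    ".hpp": "C++",
    ".cs": "C#",
    ".kt": "Kotlin",
    ".go": "Go",
}


def checklists_for(paths: Iterable[str]) -> List[str]:
    """Return checklist files only for languages that actually occur."""
    languages = set()
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
        if language is not None:
            languages.add(language)
    return sorted(CHECKLISTS[language] for language in languages)
```

Files with unrecognized extensions (for example documentation) simply contribute no checklist.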

Phase 2 — Requirements Analysis

List every stated requirement (functional and non-functional).
For each requirement, determine:
- Satisfied — explicit evidence in the design or code
- Partially satisfied — addressed but incomplete
- Missing — no evidence of it being addressed
Note conflicting or ambiguous requirements.
Populate missing_requirements with requirements that are absent entirely.
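
One way to keep the three statuses apart while filling `missing_requirements` is a small enum, sketched below. The `Status` enum and the helper function are hypothetical names; only the three status labels and the target field come from the text above.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    SATISFIED = "satisfied"            # explicit evidence in design or code
    PARTIAL = "partially satisfied"    # addressed but incomplete
    MISSING = "missing"                # no evidence of it being addressed


def collect_missing(statuses: Dict[str, Status]) -> List[str]:
    """Return only the requirements that are absent entirely."""
    return [req for req, status in statuses.items() if status is Status.MISSING]
```

Partially satisfied requirements are deliberately left out of the result, because the field is reserved for requirements that are not addressed at all.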

Phase 3 — System Consistency & Coverage

Check the system as a whole, independent of individual requirements:

Logical consistency

Are component responsibilities clearly defined and non-overlapping?
Do data flows make sense end-to-end?
Are there circular dependencies, undefined states, or unreachable paths?
Are assumptions stated and reasonable?

Component coverage (agents / services / modules)

Does every required capability have a responsible component?
Are there orphaned components with no role?

Agent communication correctness (for multi-agent systems)

Are agent roles justified and non-redundant?
Are all necessary communication paths between agents defined?
Are message formats and protocols specified?
Are error paths and fallbacks defined on interfaces?
Are there any communication paths that are undefined or ambiguous?

Interface logic (communication / APIs / events)

Are all communication paths between components defined?
Are request/response contracts specified?
Are error paths and fallbacks defined on interfaces?
For multi-agent systems: are message formats, protocols, and orchestration logic clear?

Implementation alignment (skip for conceptual-only, score null)

Does the code structure match the described architecture?
Are all described components implemented?
Does the implementation introduce components or behaviors not in the design?

Populate logical_inconsistencies with contradictions or structural gaps found.

Phase 4 — Code Review (skip for conceptual-only)

Apply the rules below in severity order. Load the language-specific reference from Phase 1 before starting this phase.

Use the confidence filter:

Confidence	Action
>80%	Include the finding
50–80%	Omit unless security-related
<50%	Always omit

Consolidate: multiple instances of the same issue → one finding with count.
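
The confidence table and the consolidation rule can be read as a filter followed by a count. The sketch below is one such reading; the `Candidate` type, the percentage scale as a float, and the output wording are assumptions, and it models only the table, not any other rule in this phase.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    issue: str                      # short label identifying the issue
    confidence: float               # reviewer confidence in percent, 0-100
    security_related: bool = False


def keep(candidate: Candidate) -> bool:
    """Apply the confidence table: >80 include, 50-80 security only, <50 omit."""
    if candidate.confidence > 80:
        return True
    if candidate.confidence >= 50:
        return candidate.security_related
    return False


def consolidate(candidates: Iterable[Candidate]) -> List[str]:
    """Collapse repeated issues into one entry that carries a count."""
    counts = Counter(c.issue for c in candidates if keep(c))
    return [
        f"{issue} ({count} instances)" if count > 1 else issue
        for issue, count in counts.items()
    ]
```

A confidence of exactly 80 falls into the middle band here, matching the table's ">80%" boundary.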

CRITICAL — Security (flag unconditionally, no confidence threshold)

Hardcoded credentials — keys/passwords/tokens in source → env vars or secret manager
SQL injection — string concatenation in queries → parameterized queries
Command injection — user input in subprocess/shell → validate + shell=False
Path traversal — user-controlled paths unsanitized → resolve + prefix check
Auth bypass — missing auth checks on protected routes
Insecure deserialization — pickle/eval/exec on untrusted data
Secrets in logs — logging tokens, passwords, or PII
XSS — unescaped user input rendered in HTML
CSRF — state-changing endpoints without CSRF protection
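
Two of the remedies named above, parameterized queries and "resolve + prefix check", look roughly like this in Python. This is a sketch using only the standard library; the `users` table, `find_user`, and `safe_join` are hypothetical names chosen for the example.

```python
import sqlite3
from pathlib import Path
from typing import List, Tuple


def find_user(conn: sqlite3.Connection, name: str) -> List[Tuple[int, str]]:
    # The ? placeholder makes the driver bind the value, so the input is
    # never interpreted as SQL, unlike string concatenation.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


def safe_join(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir and reject anything that escapes it."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when target is not inside base.
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes base directory: {user_path!r}") from None
    return target
```

Resolving both paths before comparing them is what defeats `..` segments and absolute paths supplied by the user.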

CRITICAL — Error Handling

Empty except/catch — except: pass or catch {} swallows failures silently
Missing resource cleanup — no with/using/try-finally for files, sockets, DB connections
Blocking async — sync I/O or .result()/.wait() inside an async context
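
As a point of comparison for reviewers, the three error-handling rules above might be satisfied as in the following sketch. The `load_config` functions are hypothetical, and `asyncio.to_thread` assumes Python 3.9 or later.

```python
import asyncio
import json
import logging

logger = logging.getLogger(__name__)


def load_config(path: str) -> dict:
    try:
        # `with` closes the file even if json.load raises.
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except json.JSONDecodeError:
        # Log and re-raise: the failure stays visible to the caller
        # instead of being swallowed by an empty except block.
        logger.exception("invalid JSON in %s", path)
        raise


async def load_config_async(path: str) -> dict:
    # The blocking file read runs in a worker thread, so the event loop
    # is not stalled by synchronous I/O.
    return await asyncio.to_thread(load_config, path)
```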

HIGH — Code Quality

Large functions (>50 lines) → extract helpers
Large files (>500 lines) → extract modules
Deep nesting (>4 levels) → use early returns
Dead code — commented-out blocks, unused imports, unreachable branches
Debug artifacts — print(), console.log, debugger left in production paths
Magic numbers — unexplained numeric literals → named constants
Missing cancellation/timeout — async or HTTP calls with no timeout or cancellation support

HIGH — File Hygiene

Excessive trailing blank lines (>5 at end of file)
Duplicate function/class definitions at module scope — silent overwrite in Python, dead code elsewhere
AI-generated header comments that conflict with project style rules

MEDIUM — Performance

String concatenation in loops → "".join() or equivalent
Unnecessary re-computation inside loops that could be hoisted
Missing caching for repeated identical external calls
Wrong collection type (list where set/dict lookup would be O(1))
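
Two of these performance items have short Python counterparts, shown here as a sketch; both function names are made up for the example.

```python
from typing import Iterable, List


def render_row(values: Iterable[object]) -> str:
    # A single join builds the string once instead of re-allocating it
    # on every loop iteration, as repeated += would.
    return ",".join(str(value) for value in values)


def find_unknown(ids: Iterable[int], known_ids: Iterable[int]) -> List[int]:
    # Converting to a set once gives O(1) average membership checks;
    # testing against a list would scan it for every id.
    known = set(known_ids)
    return [item for item in ids if item not in known]
```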

LOW — Best Practices

TODO/FIXME without a linked ticket or issue number
Missing docstrings on public APIs
Poor naming (single-letter variables in non-trivial contexts)
Missing type hints on public function signatures

Code Quality Principles (check holistically)

Readability first — code is read more than written; prefer clarity over cleverness
Single responsibility — every function, class, and module does one thing
Explicit over implicit — no hidden defaults or surprising side effects
Fail loudly — errors surface immediately; never silently swallowed
No commented-out code in commits
No hardcoded configuration — environment variables for all configuration values
Dependencies pinned — requirements.txt, pyproject.toml, or equivalent

Populate implementation_risks with code-level issues that could cause failures in production.

Phase 5 — Score & Report

Score each dimension using the guide below, then write the summary and determine the recommendation.

Scoring Guide

requirements_fit

Score	Meaning
9–10	All requirements clearly addressed with evidence
7–8	Most requirements addressed; minor gaps only
5–6	Several requirements partially addressed or thin
3–4	Multiple requirements missing or contradicted
0–2	Fundamental requirements unmet

logical_consistency

Score	Meaning
9–10	No contradictions; all flows and responsibilities are coherent
7–8	Minor ambiguities; no structural contradictions
5–6	Notable gaps or unclear responsibilities
3–4	Contradictions between components or stated goals
0–2	System is internally incoherent

agent_communication_correctness

Score	Meaning
9–10	Agents communicate accurately, completely, and consistently across all relevant scenarios. Messages follow the expected protocol, preserve necessary context, and enable reliable coordination without avoidable misunderstandings
7–8	Agents exchange clear, relevant, and mostly complete information according to the expected protocol. Minor communication issues may occur but do not significantly affect task completion
5–6	Agents communicate well enough for the main workflow to proceed, but some information loss, ambiguity, protocol deviations, or coordination issues occur in secondary scenarios or edge cases
3–4	Agents exchange some relevant information, but messages are often incomplete, ambiguous, malformed, delayed, or inconsistent with the expected protocol, causing frequent coordination failures
0–2	Agents fail to exchange necessary information, send incompatible or unusable messages, or misunderstand each other in ways that prevent the system from functioning correctly

overall

A holistic score reflecting the system's readiness. Not an arithmetic average: weight requirements_fit and logical_consistency more heavily, and penalize any CRITICAL findings by at least 2 points regardless of other scores.

Recommendation Rules

Recommendation	Condition
`pass`	No CRITICAL findings AND overall ≥ 7
`needs_revision`	No CRITICAL findings AND overall 4–6, OR HIGH findings present
`fail`	Any CRITICAL finding present, OR overall ≤ 3
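
The table's conditions can overlap: a system with HIGH findings and overall ≥ 7 matches both the `pass` and `needs_revision` rows. The sketch below resolves this by checking the strictest outcome first. That precedence is an assumption, not something the table states, and `recommend` is a hypothetical helper.

```python
from typing import Iterable


def recommend(overall: int, severities: Iterable[str]) -> str:
    """Map the overall score and finding severities to a recommendation.

    Assumption: rows are evaluated strictest-first (fail, then
    needs_revision, then pass) to resolve overlapping conditions.
    """
    found = {severity.lower() for severity in severities}
    if "critical" in found or overall <= 3:
        return "fail"
    if "high" in found or overall <= 6:
        return "needs_revision"
    return "pass"
```

Under this reading, `pass` is only reachable with no CRITICAL or HIGH findings and an overall score of at least 7.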

Output Format

After completing all phases, produce:

A brief prose walkthrough of your analysis (optional, aids transparency).
The JudgeReport as a JSON block.

{
  "summary": "...",
  "recommendation": "pass | needs_revision | fail",
  "scores": {
    "requirements_fit": 0,
    "logical_consistency": 0,
    "agent_communication_correctness": 0,
    "overall": 0
  },
  "findings": [
    {
      "category": "...",
      "severity": "critical | high | medium | low",
      "description": "...",
      "evidence": "..."
    }
  ],
  "missing_requirements": [],
  "logical_inconsistencies": [],
  "implementation_risks": []
}
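
A consuming system may want to sanity-check the emitted JSON before using it. The following is a minimal sketch of such a check, written under the assumption that the report arrives as a JSON string; `validate_report` and its messages are hypothetical and not part of the skill.

```python
import json
from typing import List

REQUIRED_KEYS = (
    "summary", "recommendation", "scores", "findings",
    "missing_requirements", "logical_inconsistencies", "implementation_risks",
)
SCORE_KEYS = (
    "requirements_fit", "logical_consistency",
    "agent_communication_correctness", "overall",
)
RECOMMENDATIONS = {"pass", "needs_revision", "fail"}
SEVERITIES = {"critical", "high", "medium", "low"}


def validate_report(text: str) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    report = json.loads(text)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in report]

    if report.get("recommendation") not in RECOMMENDATIONS:
        problems.append("recommendation must be pass, needs_revision, or fail")

    scores = report.get("scores", {})
    for key in SCORE_KEYS:
        value = scores.get(key)
        if not isinstance(value, int) or not 0 <= value <= 10:
            problems.append(f"scores.{key} must be an int in 0..10")

    for index, finding in enumerate(report.get("findings", [])):
        if finding.get("severity") not in SEVERITIES:
            problems.append(f"findings[{index}].severity is not a known level")
        # Every finding is expected to cite evidence from the materials.
        if not str(finding.get("evidence", "")).strip():
            problems.append(f"findings[{index}] has no evidence")

    return problems
```

Returning all problems at once, rather than raising on the first one, makes it easier to report every defect in a single pass.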

Evaluation Constraints

Report only. Do not propose regeneration, rewrite, or redesign steps.
Ground every criticism. Each finding must cite evidence from the provided materials — a requirement text, a code line, a document excerpt.
Prefer concrete over general. "Function process_data has no error handling on line 42" beats "error handling is missing."
No phantom findings. Do not invent problems not evidenced in the materials. If information is absent, note the absence rather than guessing intent.
Acknowledge strengths. Use low-severity positive findings to document what the system does well — findings are not exclusively negative.