Run any Skill in Manus with one click

$pwd:

evaluation

Name: Evaluation
Author: ProfSynapse

// Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.

Run Skill in Manus

$ git log --oneline --stat

stars:23

forks:3

updated:April 24, 2026 at 17:56

File Explorer

6 files

SKILL.md

readonly

name	evaluation
description	Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.
allowed-tools	Read, Bash, Write, Grep, Glob

Model Evaluation

Config-first evaluation framework for testing model responses against YAML-defined correctness assertions.

The evaluator does not hardcode a specific tool family, manager id, wrapper name, or behavior rule as correctness. Scenarios define the prompt and the acceptable response shape directly under correct.

Quick Reference

Task	Command
Interactive menu	`./run.sh` then Evaluate
Tool CLI eval	`python -m Evaluator.cli --backend vllm --model MODEL --scenario tool_prompts.yaml --host 127.0.0.1 --port 8011`
Full configured eval	`python -m Evaluator.cli --backend lmstudio --model MODEL --preset full`
Quick smoke test	`python -m Evaluator.cli --backend lmstudio --model MODEL --preset quick`
Tag filter	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --tags storageManager`
Dry run config load	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --dry-run`
Eval with environment runtime	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --env-backend local`
Eval with LLM judge	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --judge --judge-rubrics tool_call_quality`
Eval + upload to HF	`python -m Evaluator.cli --backend unsloth --model PATH --upload-to-hf user/model`

Status System

Status	Meaning	When
PASS	Configured checks passed	`correct` assertions passed, and optional environment/judge checks passed
FAIL	Configured checks failed or request errored	No `correct.any` path matched, required environment checks failed, judge failed, or backend errored

Schema/structural validation may still be reported for debugging, but it is not the source of task correctness. Correctness belongs in scenario YAML.

Key Directories

Evaluator/ - Core evaluation code
Evaluator/config/scenarios/ - YAML test scenarios
Evaluator/config/tool_schema.yaml - Current CLI wrapper/tool schema metadata
Evaluator/config/rubrics/ - LLM-as-judge rubrics
Evaluator/results/ - Evaluation output JSON and Markdown

Progressive Reference

Reference	When to Load	Path
CLI Commands	Running evaluations, all flags and examples	`reference/cli-commands.md`
Scenario Authoring	Writing or modifying YAML test scenarios	`reference/scenario-authoring.md`
Backends	Configuring vLLM, LM Studio, Ollama, Unsloth, and others	`reference/backends.md`
Results & Metrics	Interpreting JSON/Markdown output and failures	`reference/results-metrics.md`
Presets & Tags	Using presets and tag filters	`reference/presets-tags.md`

Active Scenario Pattern

Every test should define what counts as correct:

tests:
  - id: storage_copy_runbook
    question: Copy the incident runbook into a template file.
    tags: [storageManager, single-tool]
    system: |
      <session_context>
      sessionId: "session_eval"
      workspaceId: "ws_eval"
      </session_context>
    correct:
      any:
        - name: copy_cli
          assertions:
            - type: jsonpath_equals
              path: $.tool_calls[0].name
              value: useTools
            - type: jsonpath_regex
              path: $.tool_calls[0].arguments.tool
              pattern: '^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)'

Use correct.any for multiple valid answers, such as command by id or by name. Use correct.all or nested all/any/not assertions for stricter structures.

Response View

Assertions query a generic response view. This is syntax normalization only:

$.raw preserves the raw assistant response.
$.content is assistant text.
$.content_json is parsed JSON content when content is JSON.
$.tool_calls is a normalized list of emitted tool calls.
OpenAI-style function.arguments JSON strings are parsed into objects.
Plain text blocks like tool_call: useTools plus arguments: {...} are parsed into the same view.

The response view must not map CLI commands to old manager tool ids or decide correctness. Scenario YAML decides what is correct.

Tips

Keep all task-specific expectations in YAML under correct.
Do not add evaluator code for a specific tool, wrapper, or use case.
Prefer regex or JSONPath assertions for tool CLI commands, because shell quoting and argument order can vary.
If a schema allows equivalent forms, represent them as separate correct.any paths.
Use --limit and --tags for fast iteration.
Use --validate-context only when the scenario includes context fields that should be structurally checked.
Use --env-backend local or e2b only when you need runtime execution checks beyond response correctness.

related-skills.json

same repository

fine-tuning.md

from "ProfSynapse/Synaptic-Tuner"

Complete reference for the fine-tuning pipeline (SFT, KTO, GRPO), cloud HF Jobs workflows, autonomous experiment search, checkpoint evaluation, and LoRA surgery. Covers training CLI flags, YAML configuration, model presets, dataset requirements, LoRA settings, training monitoring, hyperparameter search, and post-training optimization. Use when training models, configuring training runs, choosing hyperparameters, running cloud experiments, inspecting HF jobs, or troubleshooting training issues. This skill is about USING the training system via CLI and YAML — never modifying source code.

2026-05-2923

synthetic-data-generation.md

from "ProfSynapse/Synaptic-Tuner"

Complete reference for the SynthChat synthetic dataset generation system. Covers CLI commands (generate, improve, validate), scenario YAML authoring, rubric YAML authoring, settings configuration, evaluation, and full workflow. Use when generating datasets, writing rubrics/scenarios, configuring models/workers, improving dataset quality, or running evaluations. This skill is about USING the system via CLI and YAML — never modifying source code.

2026-05-2923

case-studies.md

from "ProfSynapse/Synaptic-Tuner"

End-to-end case studies showing how to implement the full training pipeline for different skill types. Covers three complete worked examples — tool-calling training, essay-style training, and agentic search (RAG agent) training — demonstrating dataset design, synthetic generation, validation, fine-tuning, evaluation, and iteration. Use when onboarding to the project, understanding how all components fit together, explaining the pipeline to others, or planning a new training capability. This skill is about UNDERSTANDING the system holistically — reference the other skills for specific CLI commands.

2026-05-2923

upload-deployment.md

from "ProfSynapse/Synaptic-Tuner"

Complete reference for model upload and deployment. Covers HuggingFace upload, save strategies (LoRA, merged 16-bit, merged 4-bit), GGUF conversion, model merging, model cards, and the full upload workflow. Use when uploading models, creating GGUF files, merging LoRA adapters, or deploying to HuggingFace. This skill is about USING the upload/deployment tools via CLI — never modifying source code.

2026-05-0123

research-reporting.md

from "ProfSynapse/Synaptic-Tuner"

Create structured research notes from experiment runs and analysis artifacts. Use when creating a note at run launch, updating it as training/evaluation/loss stages finish, summarizing a finished run, comparing experiment outcomes, extracting hypotheses from eval/loss artifacts, or proposing next-run actions grounded in `.tracking/experiments/<id>/analysis/` outputs. This skill is about turning repo-native experiment evidence into stable, machine-readable markdown.

2026-04-0223

dataset-publishing.md

from "ProfSynapse/Synaptic-Tuner"

Publish local dataset artifacts to a Hugging Face dataset repo. Use when uploading a JSONL dataset, pushing a filtered dataset variant, syncing a matching .metadata.json sidecar, or renaming a dataset file in the target repo. This skill is about USING the checked-in dataset publish script via CLI — never ad hoc Python.

2026-03-2223

package.json

"author": "ProfSynapse"

"repository": "ProfSynapse/Synaptic-Tuner"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Data ScientistsComputer and Mathematical Occupations15-2051L4

name	evaluation
description	Complete reference for the config-first model evaluation system. Covers the Evaluator CLI, assertion-driven YAML scenarios, response views, backend configuration, presets, scoring, LLM-as-judge, model comparison, and HuggingFace integration. Use when evaluating models, writing test prompts, comparing training runs, or interpreting eval results. This skill is about USING the evaluation system via CLI and YAML.
allowed-tools	Read, Bash, Write, Grep, Glob

Model Evaluation

Config-first evaluation framework for testing model responses against YAML-defined correctness assertions.

Quick Reference

Task	Command
Interactive menu	`./run.sh` then Evaluate
Tool CLI eval	`python -m Evaluator.cli --backend vllm --model MODEL --scenario tool_prompts.yaml --host 127.0.0.1 --port 8011`
Full configured eval	`python -m Evaluator.cli --backend lmstudio --model MODEL --preset full`
Quick smoke test	`python -m Evaluator.cli --backend lmstudio --model MODEL --preset quick`
Tag filter	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --tags storageManager`
Dry run config load	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --dry-run`
Eval with environment runtime	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --env-backend local`
Eval with LLM judge	`python -m Evaluator.cli --backend lmstudio --model MODEL --scenario tool_prompts.yaml --judge --judge-rubrics tool_call_quality`
Eval + upload to HF	`python -m Evaluator.cli --backend unsloth --model PATH --upload-to-hf user/model`

Status System

Status	Meaning	When
PASS	Configured checks passed	`correct` assertions passed, and optional environment/judge checks passed
FAIL	Configured checks failed or request errored	No `correct.any` path matched, required environment checks failed, judge failed, or backend errored

Schema/structural validation may still be reported for debugging, but it is not the source of task correctness. Correctness belongs in scenario YAML.

Key Directories

Evaluator/ - Core evaluation code
Evaluator/config/scenarios/ - YAML test scenarios
Evaluator/config/tool_schema.yaml - Current CLI wrapper/tool schema metadata
Evaluator/config/rubrics/ - LLM-as-judge rubrics
Evaluator/results/ - Evaluation output JSON and Markdown

Progressive Reference

Reference	When to Load	Path
CLI Commands	Running evaluations, all flags and examples	`reference/cli-commands.md`
Scenario Authoring	Writing or modifying YAML test scenarios	`reference/scenario-authoring.md`
Backends	Configuring vLLM, LM Studio, Ollama, Unsloth, and others	`reference/backends.md`
Results & Metrics	Interpreting JSON/Markdown output and failures	`reference/results-metrics.md`
Presets & Tags	Using presets and tag filters	`reference/presets-tags.md`

Active Scenario Pattern

Every test should define what counts as correct:

tests:
  - id: storage_copy_runbook
    question: Copy the incident runbook into a template file.
    tags: [storageManager, single-tool]
    system: |
      <session_context>
      sessionId: "session_eval"
      workspaceId: "ws_eval"
      </session_context>
    correct:
      any:
        - name: copy_cli
          assertions:
            - type: jsonpath_equals
              path: $.tool_calls[0].name
              value: useTools
            - type: jsonpath_regex
              path: $.tool_calls[0].arguments.tool
              pattern: '^storage copy\b(?=.*Incident-Response\.md)(?=.*Incident-Response-Template\.md)'

Use correct.any for multiple valid answers, such as command by id or by name. Use correct.all or nested all/any/not assertions for stricter structures.

Response View

Assertions query a generic response view. This is syntax normalization only:

$.raw preserves the raw assistant response.
$.content is assistant text.
$.content_json is parsed JSON content when content is JSON.
$.tool_calls is a normalized list of emitted tool calls.
OpenAI-style function.arguments JSON strings are parsed into objects.
Plain text blocks like tool_call: useTools plus arguments: {...} are parsed into the same view.

The response view must not map CLI commands to old manager tool ids or decide correctness. Scenario YAML decides what is correct.

Tips

Keep all task-specific expectations in YAML under correct.
Do not add evaluator code for a specific tool, wrapper, or use case.
Prefer regex or JSONPath assertions for tool CLI commands, because shell quoting and argument order can vary.
If a schema allows equivalent forms, represent them as separate correct.any paths.
Use --limit and --tags for fast iteration.
Use --validate-context only when the scenario includes context fields that should be structurally checked.
Use --env-backend local or e2b only when you need runtime execution checks beyond response correctness.

evaluation

Model Evaluation

Quick Reference

Status System

Key Directories

Progressive Reference

Active Scenario Pattern

Response View

Tips

More from this repository

More from this repository

Model Evaluation

Quick Reference

Status System

Key Directories

Progressive Reference

Active Scenario Pattern

Response View

Tips