evaluate-model

Name: Evaluate Model
Author: xvirobotics

Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.

stars:41

forks:6

updated:February 23, 2026 at 03:44

SKILL.md

name	evaluate-model
description	Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.
user-invocable	true
context	fork
allowed-tools	Bash, Read, Grep, Write
argument-hint	[checkpoint-path] e.g. checkpoints/best_model.pt

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.

Dynamic Context

- Current branch: !git branch --show-current
- Available checkpoints: !ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"
- Test data: !ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"
- Latest metrics: !ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"
- Config files: !ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"

Checkpoint Selection

If the user provided a checkpoint path as an argument, use it: $ARGUMENTS

Otherwise, find the latest checkpoint:

1. Look for checkpoints/best_model.pt or checkpoints/best_model.pth
2. If not found, find the most recently modified .pt or .pth file in checkpoints/
3. If no checkpoints exist, report the error and stop
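
The fallback order above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the function name `find_checkpoint` and its `ckpt_dir` parameter are hypothetical, and only the file names and extensions come from the steps above.

```python
from pathlib import Path
from typing import Optional


def find_checkpoint(ckpt_dir: str = "checkpoints") -> Optional[Path]:
    """Return the preferred checkpoint, or None if none exist (hypothetical helper)."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None
    # 1. Prefer an explicitly named best model.
    for name in ("best_model.pt", "best_model.pth"):
        candidate = root / name
        if candidate.is_file():
            return candidate
    # 2. Otherwise fall back to the most recently modified .pt/.pth file.
    candidates = [
        p for p in root.iterdir()
        if p.is_file() and p.suffix in (".pt", ".pth")
    ]
    if not candidates:
        # 3. Nothing to evaluate; the caller should report the error and stop.
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

A caller would treat a `None` result as the "report the error and stop" case.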

Evaluation Process

Step 1: Load and Verify Checkpoint

Verify the checkpoint file exists and can be loaded:

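python3 -c "
import torch

# weights_only=False unpickles arbitrary Python objects, so only load
# checkpoints from a source you trust.
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
if not isinstance(ckpt, dict):
    # Some projects save the bare model or state object instead of a dict.
    raise SystemExit(f'Expected a dict checkpoint, got {type(ckpt).__name__}')
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"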

Report the checkpoint metadata: epoch, stored metric, config used.

Step 2: Run Evaluation Script

Execute the evaluation:

python3 -m src.models.evaluation.evaluate \
    --checkpoint "$CHECKPOINT_PATH" \
    --data-dir data/features/ \
    --output-dir reports/ \
    --config configs/experiment.yaml

If that module path does not exist in this project, try these alternative entry points:

python3 src/evaluation/evaluate.py --checkpoint "$CHECKPOINT_PATH"
python3 evaluate.py --checkpoint "$CHECKPOINT_PATH" --test-data data/features/test.parquet
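
Trying the entry points in order can be sketched as a small helper. The function name `run_first_successful` is hypothetical, and the sketch assumes each candidate is given as an argument list for `subprocess.run`. Note that a non-zero exit code can also mean the script exists but failed for a real reason, so the output of each failed attempt should still be inspected.

```python
import subprocess
from typing import List, Optional, Sequence


def run_first_successful(commands: Sequence[List[str]]) -> Optional[List[str]]:
    """Run each command in order; return the first that exits with code 0."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, check=False)
        except FileNotFoundError:
            # The executable itself is missing; move on to the next candidate.
            continue
        if result.returncode == 0:
            return cmd
    # Every candidate failed; the caller should report this and stop.
    return None
```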

Step 3: Collect Metrics

After evaluation completes, read the metrics output. Look for the metrics JSON file:

cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null

If no JSON file was generated, parse metrics from the script's stdout.
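
One way to parse metrics from stdout is a line-by-line regular expression. The sketch below assumes the script prints one metric per line as `name: value` or `name = value`; that format is an assumption, not something the evaluation script is known to produce, and `parse_metrics` is a hypothetical helper name.

```python
import re
from typing import Dict

# Matches lines such as "Accuracy: 0.9123" or "F1 (macro) = 0.88" (assumed format).
_METRIC_LINE = re.compile(
    r"^\s*([A-Za-z][\w \-()]*?)\s*[:=]\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(stdout: str) -> Dict[str, float]:
    """Extract numeric metrics from captured stdout, keyed by lower-cased name."""
    metrics: Dict[str, float] = {}
    for line in stdout.splitlines():
        match = _METRIC_LINE.match(line)
        if match:
            metrics[match.group(1).strip().lower()] = float(match.group(2))
    return metrics
```

Lines that do not fit the pattern, such as progress messages, are ignored rather than treated as errors.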

Step 4: Generate Confusion Matrix

If the evaluation script did not generate a confusion matrix plot, print the matrix and per-class precision and recall from the saved metrics:

python3 -c "
import json
from pathlib import Path

import numpy as np

# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if not metrics_path.exists():
    raise SystemExit(f'{metrics_path} not found')
metrics = json.loads(metrics_path.read_text())
if 'confusion_matrix' not in metrics:
    raise SystemExit('No confusion_matrix key in metrics')
cm = np.array(metrics['confusion_matrix'])
print('Confusion Matrix:')
print(cm)
print()
# Per-class metrics. Assumes rows are true labels and columns are predictions
# (the scikit-learn convention); swap the two sums if the matrix is transposed.
for i in range(cm.shape[0]):
    precision = cm[i, i] / max(cm[:, i].sum(), 1)
    recall = cm[i, i] / max(cm[i, :].sum(), 1)
    print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"

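The per-class figures can also be rolled up into the macro averages that the summary report lists. The sketch below is an illustration under two assumptions: rows of the matrix are true labels and columns are predicted labels, and the helper name `summarize_confusion_matrix` is hypothetical. It uses plain Python lists so it runs without NumPy.

```python
from typing import Dict, Sequence


def summarize_confusion_matrix(cm: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Compute accuracy and macro precision/recall/F1 (rows = true, cols = predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted as class i
        actual = sum(cm[i])                          # row sum: truly class i
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    correct = sum(cm[i][i] for i in range(n))
    return {
        "accuracy": correct / total if total else 0.0,
        "precision_macro": sum(precisions) / n if n else 0.0,
        "recall_macro": sum(recalls) / n if n else 0.0,
        "f1_macro": sum(f1s) / n if n else 0.0,
    }
```

Classes with no predictions or no true samples contribute 0.0 rather than raising a division error, which mirrors the `max(..., 1)` guard in the step above.
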
Step 5: Compare with Baseline

If previous metrics exist, load and compare:

1. Find the most recent previous metrics file (excluding the one just generated)
2. Compute deltas for each metric
3. Flag any metric regressions (where current is worse than previous)
4. Highlight improvements
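
The delta computation and regression flagging can be sketched as follows. The helper name `compare_metrics`, the `lower_is_better` set, and the metric names in it are hypothetical; the sketch assumes both metrics files have already been loaded into flat dictionaries of floats.

```python
from typing import Dict, FrozenSet

# Hypothetical names of metrics where a smaller value is an improvement.
LOWER_IS_BETTER: FrozenSet[str] = frozenset({"loss", "error_rate"})


def compare_metrics(
    previous: Dict[str, float],
    current: Dict[str, float],
    lower_is_better: FrozenSet[str] = LOWER_IS_BETTER,
    tolerance: float = 0.0,
) -> Dict[str, Dict[str, object]]:
    """Return per-metric deltas and a status for metrics present in both runs."""
    rows: Dict[str, Dict[str, object]] = {}
    for name in sorted(previous.keys() & current.keys()):
        delta = current[name] - previous[name]
        # Flip the sign so that a positive value always means "better".
        signed = -delta if name in lower_is_better else delta
        if signed > tolerance:
            status = "improved"
        elif signed < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        rows[name] = {
            "previous": previous[name],
            "current": current[name],
            "delta": delta,
            "status": status,
        }
    return rows
```

Metrics that appear in only one of the two files are skipped, since no delta can be computed for them.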

Step 6: Generate Summary Report

Produce a structured evaluation report:

## Model Evaluation Report

### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]

### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |

### Confusion Matrix
[confusion matrix table or reference to plot]

### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |

### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]

Write this report to reports/evaluation_report.md.
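
As an illustration, the "Test Set Metrics" table in the template can be rendered from a dictionary of metric values. The helper name `render_metrics_table` is hypothetical; the four-decimal format follows the X.XXXX placeholders in the template.

```python
from typing import Dict


def render_metrics_table(metrics: Dict[str, float]) -> str:
    """Render metrics as a markdown table with values to four decimal places."""
    lines = ["| Metric | Value |", "|--------|-------|"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.4f} |")
    return "\n".join(lines)
```

The returned string can be placed under the "### Test Set Metrics" heading before the report is written to disk.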

Error Handling

- If the checkpoint cannot be loaded: check for a PyTorch version mismatch and report the error
- If test data is missing: report which files are expected and where to find them
- If CUDA is not available: run evaluation on CPU (slower, but it should still work)
- If metrics computation fails: report the specific error and which metric caused it

package.json

"author": "xvirobotics"

"repository": "xvirobotics/metaskill"

