evaluate-model
Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.
| name | evaluate-model |
| description | Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint. |
| user-invocable | true |
| context | fork |
| allowed-tools | Bash, Read, Grep, Write |
| argument-hint | [checkpoint-path] e.g. checkpoints/best_model.pt |
You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.
Current branch: !git branch --show-current
Available checkpoints: !ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"
Test data: !ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"
Latest metrics: !ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 | grep . || echo "No previous metrics found"
Config files: !ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"
If the user provided a checkpoint path as an argument, use it: $ARGUMENTS
Otherwise, find the latest checkpoint:
- Look for checkpoints/best_model.pt or checkpoints/best_model.pth
- Otherwise, use the most recent .pt or .pth file in checkpoints/
Verify the checkpoint file exists and can be loaded:
python3 -c "
import torch
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"
Report the checkpoint metadata: epoch, stored metric, config used.
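The checkpoint selection rule described above can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the function name `find_checkpoint` and its parameters are hypothetical, and "latest" is assumed to mean the most recently modified file.

```python
from pathlib import Path
from typing import Optional


def find_checkpoint(ckpt_dir: str = "checkpoints", explicit: str = "") -> Optional[Path]:
    """Pick a checkpoint: explicit argument, then best_model.*, then newest file.

    Hypothetical helper; assumes 'latest' means most recently modified.
    """
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None

    root = Path(ckpt_dir)
    if not root.is_dir():
        return None

    # Prefer a conventionally named best checkpoint when one exists.
    for name in ("best_model.pt", "best_model.pth"):
        candidate = root / name
        if candidate.is_file():
            return candidate

    # Otherwise fall back to the most recently modified .pt or .pth file.
    candidates = [p for p in root.iterdir() if p.is_file() and p.suffix in {".pt", ".pth"}]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(find_checkpoint())
```

Returning `None` instead of raising lets the caller decide how to report a missing checkpoint.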
Execute the evaluation:
python3 -m src.models.evaluation.evaluate \
--checkpoint $CHECKPOINT_PATH \
--data-dir data/features/ \
--output-dir reports/ \
--config configs/experiment.yaml
Alternative patterns to try if the above fails:
python3 src/evaluation/evaluate.py --checkpoint $CHECKPOINT_PATH
python3 evaluate.py --checkpoint $CHECKPOINT_PATH --test-data data/features/test.parquet
After evaluation completes, read the metrics output. Look for the metrics JSON file:
cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null
If no JSON file was generated, parse metrics from the script's stdout.
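Parsing metrics from stdout depends entirely on what the evaluation script prints. The sketch below assumes one metric per line in a `name: value` or `name = value` form, which is an assumption rather than something this project guarantees. The helper name `parse_metrics` is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "accuracy: 0.9312" or "F1 (macro) = 0.88".
# Assumed format: one metric per line, a name, then ':' or '=', then a number.
_METRIC_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z][\w ()\-]*?)\s*[:=]\s*"
    r"(?P<value>[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(stdout: str) -> Dict[str, float]:
    """Collect numeric metrics from captured script output, skipping other lines."""
    metrics: Dict[str, float] = {}
    for line in stdout.splitlines():
        match = _METRIC_LINE.match(line)
        if match:
            key = match.group("name").strip().lower()
            metrics[key] = float(match.group("value"))
    return metrics
```

Lines that do not fit the pattern, such as progress messages, are ignored, so the result may be empty if the script uses a different output format.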
If the evaluation script did not generate a confusion matrix plot, print the matrix and per-class metrics from the metrics file instead:
python3 -c "
import json
import numpy as np
from pathlib import Path
# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if metrics_path.exists():
    metrics = json.loads(metrics_path.read_text())
    if 'confusion_matrix' in metrics:
        cm = np.array(metrics['confusion_matrix'])
        print('Confusion Matrix:')
        print(cm)
        print()
        # Print per-class metrics. Assumes rows are true labels and columns are
        # predictions (the scikit-learn convention); swap the two sums otherwise.
        for i, row in enumerate(cm):
            recall = row[i] / max(row.sum(), 1)
            precision = row[i] / max(cm[:, i].sum(), 1)
            print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"
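The macro-averaged figures that the report asks for can be derived from the same confusion matrix. The following is a sketch in plain Python, assuming rows are true labels and columns are predictions. It takes macro F1 as the mean of per-class F1 scores. The function name `macro_metrics` is hypothetical.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus macro precision, recall, and F1 from a square confusion matrix.

    Assumes cm[i][j] counts samples with true class i predicted as class j.
    """
    n = len(cm)
    if n == 0:
        return {"accuracy": 0.0, "precision_macro": 0.0, "recall_macro": 0.0, "f1_macro": 0.0}

    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # row sum: true class i
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted class i
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    correct = sum(cm[i][i] for i in range(n))
    return {
        "accuracy": correct / total if total else 0.0,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }
```

Classes with no predictions or no true samples contribute 0 to the averages here, which is one common choice but not the only one.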
If previous metrics exist, load and compare:
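One way to do the comparison is sketched below, assuming both runs store their metrics as flat JSON objects with numeric values. The function names and the choice to compare only metrics present in both files are illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple

Row = Tuple[str, float, float, float]


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def load_metrics(path: str) -> Dict[str, Any]:
    """Read a metrics JSON file (assumed to hold a flat object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def compare_metrics(previous: Dict[str, Any], current: Dict[str, Any]) -> List[Row]:
    """Return (name, previous, current, delta) for numeric metrics in both runs."""
    rows: List[Row] = []
    for name in sorted(previous.keys() & current.keys()):
        prev, curr = previous[name], current[name]
        if _is_number(prev) and _is_number(curr):
            rows.append((name, prev, curr, curr - prev))
    return rows


def to_markdown(rows: List[Row]) -> str:
    """Render the comparison in the table layout used by the report."""
    lines = [
        "| Metric | Previous | Current | Delta |",
        "|--------|----------|---------|-------|",
    ]
    for name, prev, curr, delta in rows:
        lines.append(f"| {name} | {prev:.4f} | {curr:.4f} | {delta:+.4f} |")
    return "\n".join(lines)
```

Non-numeric entries such as a stored confusion matrix are skipped, so the table contains only scalar metrics.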
Produce a structured evaluation report:
## Model Evaluation Report
### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]
### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |
### Confusion Matrix
[confusion matrix table or reference to plot]
### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |
### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]
Write this report to reports/evaluation_report.md.
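As a rough illustration of this last step, the sketch below renders the checkpoint and test-set sections of the template and writes them to disk. The helper names and argument shapes are hypothetical, and the remaining sections (confusion matrix, comparison, observations) are left out for brevity.

```python
from pathlib import Path
from typing import Dict


def render_report(checkpoint: str, epoch: object, config: str, metrics: Dict[str, float]) -> str:
    """Build the first two sections of the evaluation report as markdown."""
    lines = [
        "## Model Evaluation Report",
        "### Checkpoint",
        f"- Path: {checkpoint}",
        f"- Epoch: {epoch}",
        f"- Training config: {config}",
        "### Test Set Metrics",
        "| Metric | Value |",
        "|--------|-------|",
    ]
    lines += [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
    return "\n".join(lines) + "\n"


def write_report(text: str, path: str = "reports/evaluation_report.md") -> Path:
    """Write the report, creating the output directory if it is missing."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text, encoding="utf-8")
    return out
```

Keeping rendering separate from writing makes it easy to inspect the text before it is saved.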