تشغيل أي مهارة في Manus بنقرة واحدة

$pwd:

compare

Name: Compare
Author: fcakyon

// Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

تشغيل في Manus

$ git log --oneline --stat

stars:٢٥٠

forks:٢٤

updated:٣٠ أبريل ٢٠٢٦ في ١١:٠٣

SKILL.md

readonly

name	compare
description	Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

Compare: same-epoch run comparison across trackers

The most common comparison error is reporting "run A is 4 percentage points behind baseline" when run A is at epoch 11 of 100 and the baseline number is from epoch 100. The student is still training; the comparison is meaningless. This skill enforces same-epoch alignment.

The agentic Stop hook routes here from reason when an assistant reports a delta without aligning the runs.

When to run

The user just said any of:

"compare run A to baseline / to run B"
"is my run improving / catching up / falling behind"
"rank these experiments"
"X vs Y wandb / neptune"
"track lag against baseline"

Auto-detect the tracker

Check in this order:

WANDB_API_KEY env var set, or wandb imports in the project → wandb
NEPTUNE_API_TOKEN env var set → neptune
MLFLOW_TRACKING_URI env var set, or mlruns/ dir present → mlflow
runs/ or lightning_logs/ dir present → tensorboard
*results*.json / *meta*.json files in run dirs → local file format

If none, ask the user where metrics live before guessing.

The protocol

1. Identify the runs

Get full names (no shortcodes). If the user says "fvs-fm vs the baseline", clarify:

which fvs-fm run (project + entity + run-id)
which baseline (full run name; baselines often have several variants)

2. Fetch metric history (not just final value)

You need the full curve, not the last reported value. Final-value-only comparisons hide convergence dynamics.

For wandb:

import wandb
api = wandb.Api()
run = api.run("entity/project/run-id")
history = run.history(samples=10000)  # full history, not just summary

For tensorboard, parse the event files (tensorboard.backend.event_processing.event_accumulator.EventAccumulator).

For neptune / mlflow, use their respective APIs.

3. Find the student's current step

The student is the run still in progress (or the one being evaluated). Get its current epoch / step from the latest history row.

4. Slice the baseline at the same step

This is the critical step. The baseline went all the way to (say) epoch 100. The student is at epoch 11. Pull the baseline's metrics at epoch 11, not at epoch 100.

student_step = student_history['epoch'].max()
baseline_at_same_step = baseline_history[baseline_history['epoch'] == student_step]

If the baseline doesn't have an exactly-matching step, interpolate or pick the nearest. State which.

5. Separate proxy metrics from downstream

Most ML pipelines have a proxy metric (cheap, computed during training, kNN accuracy on features, loss, perplexity) and a target downstream metric (expensive, computed periodically or only at the end, finetuned linear probe accuracy, downstream task F1).

The proxy is for tracking convergence; the target is what the project is actually optimizing. Reporting only the proxy can mislead, a run that lags on kNN may close the gap on downstream finetune. Report both, separately:

                        | student (ep 11) | baseline (ep 11) | delta |
| proxy (kNN top-1)     | 36.4%           | 38.9%            | -2.5  |
| downstream (linear)   | not yet         | 42.1%            | n/a   |

If the user only has proxy data, say so explicitly. Never declare a winner from proxy alone.

6. Run names in output

In every line of the report, use full run names. Never cs-ad vs fvs-fm; always phase1-7src-conv-s-adaptor-mlp vs phase1-7src-fastvit-s-featmap-mlp. Future-you reading this will not remember the shortcode.

Anti-patterns

"X is behind baseline by 4pp": without saying at what step. Almost always wrong.
"X has converged": without showing the last 5 epochs of the curve.
"Best run is Y": based on a metric that was logged differently across runs (different reduction, different eval set).
Single-seed comparison treated as definitive. Note variance if known; otherwise label as single-seed.

Output

Compact comparison table per metric pair (proxy + downstream). Each row aligned at the student's current step. Each cell traceable to a specific tracker run-id and step. End with one or two sentences interpreting the comparison, student is on track to catch up at step N, projected from current slope is a useful framing; student is winning / losing is rarely warranted before convergence.

related-skills.json

نفس المستودع

debug.md

from "fcakyon/phd-skills"

Evidence-before-action diagnosis of failing ML experiments. Probes the system before guessing causes, process list, dmesg, GPU stats, log scrollback, checkpoint state, then states a hypothesis as a hypothesis and runs a smoke before claiming a root cause. Use when the user asks why a run is failing, diverging, OOMing, hanging, slow, producing weird metrics, has crashed, or asks to debug, diagnose, troubleshoot, or investigate a training issue.

2026-04-30250

launch.md

from "fcakyon/phd-skills"

Pre-flight checklist for long-running ML training jobs covering config diff, run naming, path verification, monitoring setup, and restart-cleanup. Use when the user asks to launch, kick off, start, restart, or kill a training run, or mentions launching a multi-hour or multi-day GPU job (python train, accelerate launch, torchrun, deepspeed, sbatch, tmux training).

2026-04-30250

reproduce.md

from "fcakyon/phd-skills"

End-to-end paper reproduction from arxiv URL through smoke runs to replication experiments. Handles missing or partial official code, missing training scripts, missing hyperparameters, and private datasets via similar-public-dataset substitution. Use when the user asks to reproduce, implement, replicate, or re-run a paper from scratch, or pastes an arxiv URL with reproduction intent.

2026-04-30250

dataset-curation.md

from "fcakyon/phd-skills"

Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

2026-03-12250

experiment-design.md

from "fcakyon/phd-skills"

Use when the user wants to design experiments, plan ablation studies, structure baselines, or create incremental evaluation strategies. Triggers on phrases like "design ablation", "plan experiment", "what experiments should I run", "baseline comparison", or "experiment matrix".

2026-03-12250

latex-setup.md

from "fcakyon/phd-skills"

Use when the user wants to set up or troubleshoot a LaTeX environment, choose between biber and bibtex, install packages for a specific venue template, or configure compilation. Triggers on phrases like "setup latex", "biber vs bibtex", "latex compilation error", "install latex packages", "venue template", or "texlive setup".

2026-03-12250

package.json

"author": "fcakyon"

"repository": "fcakyon/phd-skills"

فتح مستودع GitHub عرض مستودعات المنشئ

$ install --global

$ download --local

تشغيل في Manus

$ useful --forSOC

علماء البياناتمهن الحاسوب والرياضيات15-2051L4

name	compare
description	Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

Compare: same-epoch run comparison across trackers

The agentic Stop hook routes here from reason when an assistant reports a delta without aligning the runs.

When to run

The user just said any of:

"compare run A to baseline / to run B"
"is my run improving / catching up / falling behind"
"rank these experiments"
"X vs Y wandb / neptune"
"track lag against baseline"

Auto-detect the tracker

Check in this order:

WANDB_API_KEY env var set, or wandb imports in the project → wandb
NEPTUNE_API_TOKEN env var set → neptune
MLFLOW_TRACKING_URI env var set, or mlruns/ dir present → mlflow
runs/ or lightning_logs/ dir present → tensorboard
*results*.json / *meta*.json files in run dirs → local file format

If none, ask the user where metrics live before guessing.

The protocol

1. Identify the runs

Get full names (no shortcodes). If the user says "fvs-fm vs the baseline", clarify:

which fvs-fm run (project + entity + run-id)
which baseline (full run name; baselines often have several variants)

2. Fetch metric history (not just final value)

You need the full curve, not the last reported value. Final-value-only comparisons hide convergence dynamics.

For wandb:

import wandb
api = wandb.Api()
run = api.run("entity/project/run-id")
history = run.history(samples=10000)  # full history, not just summary

For tensorboard, parse the event files (tensorboard.backend.event_processing.event_accumulator.EventAccumulator).

For neptune / mlflow, use their respective APIs.

3. Find the student's current step

The student is the run still in progress (or the one being evaluated). Get its current epoch / step from the latest history row.

4. Slice the baseline at the same step

This is the critical step. The baseline went all the way to (say) epoch 100. The student is at epoch 11. Pull the baseline's metrics at epoch 11, not at epoch 100.

student_step = student_history['epoch'].max()
baseline_at_same_step = baseline_history[baseline_history['epoch'] == student_step]

If the baseline doesn't have an exactly-matching step, interpolate or pick the nearest. State which.

5. Separate proxy metrics from downstream

                        | student (ep 11) | baseline (ep 11) | delta |
| proxy (kNN top-1)     | 36.4%           | 38.9%            | -2.5  |
| downstream (linear)   | not yet         | 42.1%            | n/a   |

If the user only has proxy data, say so explicitly. Never declare a winner from proxy alone.

6. Run names in output

Anti-patterns

"X is behind baseline by 4pp": without saying at what step. Almost always wrong.
"X has converged": without showing the last 5 epochs of the curve.
"Best run is Y": based on a metric that was logged differently across runs (different reduction, different eval set).
Single-seed comparison treated as definitive. Note variance if known; otherwise label as single-seed.

compare

Compare: same-epoch run comparison across trackers

When to run

Auto-detect the tracker

The protocol

1. Identify the runs

2. Fetch metric history (not just final value)

3. Find the student's current step

4. Slice the baseline at the same step

5. Separate proxy metrics from downstream

6. Run names in output

Anti-patterns

Output

المزيد من هذا المستودع

Compare: same-epoch run comparison across trackers

When to run

Auto-detect the tracker

The protocol

1. Identify the runs

2. Fetch metric history (not just final value)

3. Find the student's current step

4. Slice the baseline at the same step

5. Separate proxy metrics from downstream

6. Run names in output

Anti-patterns

Output

المزيد من هذا المستودع