تشغيل أي مهارة في Manus بنقرة واحدة

$pwd:

monitor

Name: Monitor
Author: NVIDIA

// Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job <slurm_job_id>", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.

تشغيل في Manus

$ git log --oneline --stat

stars:٢٬٧٤٩

forks:٤٠٥

updated:٢١ مايو ٢٠٢٦ في ٢٢:١٦

SKILL.md

readonly

related-skills.json

نفس المستودع

evaluation.md

from "NVIDIA/Model-Optimizer"

Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results).

2026-05-222.7k

compare-results.md

from "NVIDIA/Model-Optimizer"

Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).

2026-05-212.7k

deployment.md

from "NVIDIA/Model-Optimizer"

Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).

2026-05-212.7k

launching-evals.md

from "NVIDIA/Model-Optimizer"

Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.

2026-05-212.7k

ptq.md

from "NVIDIA/Model-Optimizer"

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

2026-05-212.7k

release-cherry-pick.md

from "NVIDIA/Model-Optimizer"

Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label. Use when asked to "cherry-pick PRs for release/X.Y.Z", "pick PRs to release branch", or "cherry-pick labeled PRs".

2026-04-272.7k

package.json

"author": "NVIDIA"

"repository": "NVIDIA/Model-Optimizer"

فتح مستودع GitHub عرض مستودعات المنشئ

$ install --global

$ download --local

تشغيل في Manus

$ useful --forSOC

مديرو الشبكات وأنظمة الحاسوبمهن الحاسوب والرياضيات15-1244L4

name	monitor
description	Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job <slurm_job_id>", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.

Job Monitor

Monitor jobs submitted to SLURM clusters — PTQ quantization, NEL evaluation, model deployment, or raw SLURM jobs.

When to use

Auto-monitor — another skill (PTQ, evaluation, deployment) just submitted a job. Register the job and set up monitoring immediately.
User-initiated — user asks about a job status. Check the current session registry first; if the job is not registered there, use the discovery steps below.

Job Registry

Active jobs are tracked in per-session registries under .claude/agents/. This avoids multiple agents clobbering one shared registry when they run at the same time.

Use the current agent session id as <session_id>:

Claude Code: $CLAUDE_CODE_SESSION_ID, or the session_id field from hook input
Codex: $CODEX_THREAD_ID
If no session id is available, create a stable id for the current terminal session and reuse it for every job registered by that agent

Registry layout:

.claude/agents/
  <session_id>/
    active_jobs.json

Each session's active_jobs.json is a JSON array:

[
  {
    "type": "nel",
    "id": "<invocation_id or slurm_job_id>",
    "host": "<cluster_hostname>",
    "user": "<ssh_user>",
    "submitted": "YYYY-MM-DD HH:MM",
    "description": "<what this job does>",
    "last_status": "<last known status>",
    "owner": {
      "agent": "claude-code|codex|manual",
      "session_id": "<session_id>"
    }
  }
]

type is one of: nel, slurm, launcher.

On Job Submission

Every time a job is submitted (by any skill or manually):

Add an entry to .claude/agents/<session_id>/active_jobs.json. Create the session directory and file if they don't exist.
Start a durable monitor (if one isn't already watching the registry) that polls this session's registered jobs until they reach terminal status. Prefer the Claude Code Monitor tool when it is available: write a small watcher that reads .claude/agents/<session_id>/active_jobs.json, checks every job with the appropriate method below, prints state-change events, updates last_status, removes terminal jobs from the session registry, and exits when no active jobs remain for this session.

The monitor should terminate naturally when every registered job has reached a terminal state. If the Monitor tool is not available in the current harness, run an equivalent background process that implements the same loop and lets the agent resume/restart when the process exits.

Always do both steps. Don't try to predict job duration.

On Monitor Event / Status Check

Whether triggered by monitor output or by the user asking "check status":

Read the registry from .claude/agents/<session_id>/active_jobs.json
Check each job using the appropriate method (see below)
Report only state changes — compare against last_status in registry
Update last_status in the session registry
Remove completed jobs — any job in a terminal state (COMPLETED, FAILED, CANCELLED, KILLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, PREEMPTED, BOOT_FAIL, DEADLINE)
If no active jobs remain — let the monitor exit

How to Check Each Job Type

Each check method has its own status vocabulary. A watcher that mixes them (e.g. uses SLURM's COMPLETED terminal-state regex against nel status output) will silently never fire terminal transitions. Always match against the vocabulary of the source you're polling.

NEL jobs (`type: nel`)

Check: nel status <id>.

extract_nel_state() {
  local jid="$1" nel_bin="${NEL:-nel}" output state_col
  output=$("$nel_bin" status "$jid" 2>&1)
  state_col=$(echo "$output" \
    | awk -F'|' -v prefix="$jid." 'index($1, prefix) == 1 { print $2; exit }')
  [ -z "$state_col" ] && state_col="$output"
  echo "$state_col" \
    | LC_ALL=C tr '[:lower:]' '[:upper:]' \
    | awk 'match($0, /(PENDING|RUNNING|SUCCESS|FAILED|KILLED|ERROR|NOT[[:space:]]+FOUND)/) { print substr($0, RSTART, RLENGTH); exit }' \
    | sed 's/[[:space:]][[:space:]]*/ /g'
}

is_nel_terminal() {
  case "$(extract_nel_state "$1")" in
    SUCCESS|FAILED|KILLED|ERROR|"NOT FOUND") return 0 ;;
    *) return 1 ;;
  esac
}

On completion: nel info <id> to fetch results.
On failure: nel info <id> --logs then inspect server/client/SLURM logs via SSH.

Launcher jobs (`type: launcher`)

Check: Tail the launcher's background output file for key events.
Key events: experiment ID, SLURM job ID, container import, calibration progress, export path, final status.
On failure: Look for Traceback, Error, or FAILED in the output.

Raw SLURM jobs (`type: slurm`)

Check: sacct; use sacct for the termination check because squeue can lag in COMPLETING after sacct reports a terminal state.

extract_slurm_state() {
  local jid="$1" host="$2"
  ssh "$host" "sacct -j $jid -X --format=State --noheader -P 2>/dev/null | head -1" \
    | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
    | sed 's/^CANCELLED by .*/CANCELLED/'
}

is_slurm_terminal() {
  case "$(extract_slurm_state "$1" "$2")" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|PREEMPTED|BOOT_FAIL|DEADLINE) return 0 ;;
    *) return 1 ;;
  esac
}

On completion: ssh <host> "sacct -j <id> --format=State,ExitCode,Elapsed -n".
On failure: Check the job's output log file.

Identifying Jobs (user-initiated, no ID given)

When the user asks about a job without specifying an ID, check in order:

.claude/agents/<current_session_id>/active_jobs.json — current agent's jobs
nel ls runs --since 1d — recent NEL runs
ssh <host> "squeue -u <user>" — active SLURM jobs
ls -lt tools/launcher/experiments/cicd/ | head -10 — recent launcher experiments

Reporting Guidelines

Report state changes proactively — PENDING → RUNNING, or job completes
Aggregate multiple jobs — "2 of 4 completed (MMLU-Pro: 42.3%, GSM8K: 67.1%), 1 running, 1 pending"
Summarize, don't echo — interpret events ("Calibration complete, exporting checkpoint") not raw logs
On failure, diagnose immediately — check logs and report root cause without waiting for user to ask
Minimize noise — don't report "still running" unless the user is actively asking

monitor

المزيد من هذا المستودع

Job Monitor

When to use

Job Registry

On Job Submission

On Monitor Event / Status Check

How to Check Each Job Type

NEL jobs (type: nel)

Launcher jobs (type: launcher)

Raw SLURM jobs (type: slurm)

Identifying Jobs (user-initiated, no ID given)

Reporting Guidelines

Job Monitor

When to use

Job Registry

On Job Submission

On Monitor Event / Status Check

How to Check Each Job Type

NEL jobs (type: nel)

Launcher jobs (type: launcher)

Raw SLURM jobs (type: slurm)

Identifying Jobs (user-initiated, no ID given)

Reporting Guidelines

المزيد من هذا المستودع

NEL jobs (`type: nel`)

Launcher jobs (`type: launcher`)

Raw SLURM jobs (`type: slurm`)

NEL jobs (`type: nel`)

Launcher jobs (`type: launcher`)

Raw SLURM jobs (`type: slurm`)