
launch-hf-job

Name: Launch Hf Job
Author: mybigday

Submit a training/inference script to Hugging Face Jobs (`hf jobs run`). Triggered when the user wants to run training in the cloud, scale beyond local hardware, or kick off a multi-hour fine-tune. Enforces pre-flight checks: `hub_model_id` required, a timeout of at least 2h for training, and single-job validation before any batch.


Updated: 28 April 2026, 08:03

SKILL.md



launch-hf-job — Hugging Face Jobs submission

Wrapper: scripts/launch_hf_job.py. It refuses to launch unless every pre-flight requirement below is satisfied.

Required, every time

`--hub-model-id you/repo`: job storage is ephemeral, so without `push_to_hub` the trained model is permanently lost when the job ends.
`--timeout` of at least 2h for training. The default is 4h; scale it to the model size.
The reference script must already have been smoke-tested locally (or in a tiny GPU sandbox) on the smallest sane subset.
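
The three requirements above could be checked along these lines before anything is submitted. This is only a sketch: `parse_timeout`, `preflight`, and the `smoke_tested` flag are hypothetical names for illustration, and the accepted duration suffixes (`s`, `m`, `h`, `d`) are an assumption; the actual checks in scripts/launch_hf_job.py are not shown in this document.

```python
import re
from typing import List, Optional

MIN_TRAINING_TIMEOUT_S = 2 * 3600  # the 2h floor stated above

# Assumed duration suffixes; a bare number is treated as seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeout(text: str) -> int:
    """Convert a duration such as '4h', '90m' or '7200' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised timeout: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNIT_SECONDS[unit or "s"])


def preflight(hub_model_id: Optional[str], timeout: str = "4h",
              smoke_tested: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the job may launch."""
    problems = []
    if not hub_model_id or not re.fullmatch(r"[\w.-]+/[\w.-]+", hub_model_id):
        problems.append("hub_model_id must look like 'user/repo'")
    if parse_timeout(timeout) < MIN_TRAINING_TIMEOUT_S:
        problems.append("timeout must be at least 2h for training")
    if not smoke_tested:
        problems.append("script has not been smoke-tested on a small subset")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report every missing requirement in a single pass.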

Hardware sizing

| Params | Flavor | Min timeout |
| --- | --- | --- |
| 1–3B | `a10g-largex2` | 2h |
| 7–13B | `a100-large` | 6h |
| 30B+ | `l40sx4` or `a100x4` | 12h |
| 70B+ | `a100x8` | 24h |

a10g-small and a10g-large have the same 24 GB GPU — only CPU/RAM differ. Pick large for the extra system RAM.
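
As a rough illustration, the sizing table can be expressed as a lookup. The function name `suggest_sizing` is hypothetical, and the handling of sizes the table does not list is an assumption: models under 1B use the smallest tier, and sizes that fall between rows (3–7B, 13–30B) round up to the next tier.

```python
from typing import NamedTuple, Tuple


class Sizing(NamedTuple):
    flavors: Tuple[str, ...]  # acceptable flavors, in the order the table lists them
    min_timeout_hours: int


def suggest_sizing(params_billion: float) -> Sizing:
    """Map a parameter count (in billions) to a row of the sizing table."""
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    if params_billion <= 3:   # 1-3B row; sub-1B models assumed to fit here too
        return Sizing(("a10g-largex2",), 2)
    if params_billion <= 13:  # 7-13B row; 3-7B assumed to round up
        return Sizing(("a100-large",), 6)
    if params_billion < 70:   # 30B+ row; 13-30B assumed to round up
        return Sizing(("l40sx4", "a100x4"), 12)
    return Sizing(("a100x8",), 24)  # 70B+ row
```

For example, `suggest_sizing(7)` yields the `a100-large` row with a 6h minimum timeout.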

Batch / ablation jobs

Submit ONE job first. Read the logs (hf jobs logs <id>) until you see the loss come down for at least 10 logging steps. THEN submit the rest.

Never submit a sweep all at once — one bug kills every run identically.
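
One way to make the "loss comes down for at least 10 logging steps" check less subjective is to parse the loss values out of the saved log text. The sketch below assumes the training script prints dict-style log lines containing `'loss': <float>`; the half-versus-half comparison is a simple heuristic chosen for illustration, not something the wrapper is documented to do.

```python
import re
from statistics import mean
from typing import List

# Matches "'loss': 1.234" but not "'eval_loss'" or "'train_loss'".
_LOSS_RE = re.compile(r"'loss':\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def extract_losses(log_text: str) -> List[float]:
    """Pull every logged training loss out of raw job log text."""
    return [float(value) for value in _LOSS_RE.findall(log_text)]


def loss_is_coming_down(losses: List[float], min_steps: int = 10) -> bool:
    """True once there are enough logging steps and the later half averages lower."""
    if len(losses) < min_steps:
        return False
    half = len(losses) // 2
    return mean(losses[half:]) < mean(losses[:half])
```

Feed it the text captured from `hf jobs logs <id>` and only submit the remaining jobs once it returns `True`; still read the logs yourself for warnings the loss curve cannot show.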

Common failure modes

| Symptom | Fix |
| --- | --- |
| Job killed at 30 minutes | You forgot `--timeout`. |
| Trained model gone | `push_to_hub=False` or `hub_model_id` missing. |
| `flash-attn` ImportError | Add `--extra-dep flash-attn`. |
| `KeyError: 'chosen'` | Audit the dataset; you sent SFT data to DPO. |
| Gated dataset prompt | `HF_TOKEN` lacks scope; regenerate it with read access. |
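
The `KeyError: 'chosen'` row can be caught before submission by comparing the dataset's column names with what preference training needs. A minimal sketch, assuming the explicit `chosen` / `rejected` column layout; `missing_dpo_columns` is a hypothetical helper, not part of the wrapper.

```python
from typing import Iterable, List

# Columns a preference dataset is assumed to need for DPO.
REQUIRED_DPO_COLUMNS = ("chosen", "rejected")


def missing_dpo_columns(column_names: Iterable[str]) -> List[str]:
    """Return the required preference columns that the dataset lacks."""
    present = set(column_names)
    return [name for name in REQUIRED_DPO_COLUMNS if name not in present]
```

A dataset whose only column is `messages` is SFT-style data and would return `["chosen", "rejected"]`.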

Example

python scripts/launch_hf_job.py \
    --script scripts/train_sft.py \
    --config configs/my-run.yaml \
    --hardware a10g-largex2 \
    --timeout 4h \
    --hub-model-id my-user/smollm2-360m-capybara-sft

OOM recovery in HF Jobs

Same playbook as local OOM (see train-sft skill): halve micro-batch, double grad-accum, gradient_checkpointing=True, bf16=True, adamw_8bit, then upgrade flavor. Never silently switch SFT → LoRA or drop max_seq_length.
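
The first two steps of that playbook keep the effective batch size constant, which is why they come before anything that changes the training recipe. A small sketch of the arithmetic, using the `per_device_train_batch_size` and `gradient_accumulation_steps` argument names from transformers `TrainingArguments`; the helper `halve_micro_batch` itself is hypothetical.

```python
from typing import Tuple


def halve_micro_batch(per_device_train_batch_size: int,
                      gradient_accumulation_steps: int) -> Tuple[int, int]:
    """Halve the micro-batch and double grad-accum, preserving their product."""
    if per_device_train_batch_size < 2 or per_device_train_batch_size % 2:
        # Cannot halve evenly: move on to the next step of the playbook.
        raise ValueError("micro-batch cannot be halved evenly")
    return per_device_train_batch_size // 2, gradient_accumulation_steps * 2
```

Going from a micro-batch of 8 with 2 accumulation steps to 4 with 4 keeps 16 samples per optimizer step on each device while roughly halving activation memory.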

related-skills.json

Same repository

rocm-strix-halo.md

from "mybigday/ml-intern-kit"

Set up training and inference on AMD Ryzen AI Max+ 395 / Strix Halo (gfx1151, RDNA 3.5) with TheRock nightly ROCm wheels. Triggered when the host has gfx1151, when `rocminfo` shows Strix Halo, or when the user mentions Strix Halo / Ryzen AI Max / gfx1151 / 128GB unified memory.

2026-04-28

env-bootstrap.md

from "mybigday/ml-intern-kit"

Recreate the ml-intern-kit Python environment on a new machine (laptop, rented GPU box, Docker, fresh checkout). Triggered when the user is on a new host, sees ImportError on a core dep (torch/transformers/trl/peft/accelerate), or wants to install flash-attn / unsloth / bitsandbytes after the fact.

2026-04-28

eval-model.md

from "mybigday/ml-intern-kit"

Evaluate a trained or downloaded language model with `lm-eval-harness` standard tasks (arc, hellaswag, gsm8k, mmlu, truthfulqa, ifeval, ...). Triggered when the user wants to benchmark, eval, or compare a model — pre- or post-training.

2026-04-28

inspect-dataset.md

from "mybigday/ml-intern-kit"

Audit a Hugging Face dataset before training to confirm splits, columns, format, sample rows, distributions, and duplicates. Triggered before any training/fine-tuning script runs, when a user mentions a new dataset, or when you hit a KeyError / format mismatch in a training job.

2026-04-28

research-recipe.md

from "mybigday/ml-intern-kit"

Run a literature-first crawl before writing ANY ML training/fine-tuning/inference code. Spawns an Explore sub-agent that mines papers, citation graphs, methodology sections, and matched HF datasets to produce a ranked list of training recipes attributed to specific published results. Triggered when the user asks to fine-tune, train, or improve a model, or when the user names a task/benchmark and you need a recipe before coding.

2026-04-28

train-dpo.md

from "mybigday/ml-intern-kit"

Direct Preference Optimization (DPO) fine-tune with TRL `DPOTrainer`. Triggered when the user wants to align a model on preferences / pairwise comparisons / chosen-vs-rejected data, or improve an existing SFT checkpoint with a preference dataset.

2026-04-28

package.json

"author": "mybigday"

"repository": "mybigday/ml-intern-kit"

