Name: Hpc Python
Author: JoaquinCampo

// Python HPC patterns — threading/multiprocessing/async, CUDA streams, latency hiding, PyTorch DDP. Use when writing performance-sensitive Python code, distributed training, or parallel data processing.


updated:March 19, 2026 at 16:29


package.json

"author": "JoaquinCampo"

"repository": "JoaquinCampo/Skills"


name	hpc-python
description	Python HPC patterns — threading/multiprocessing/async, CUDA streams, latency hiding, PyTorch DDP. Use when writing performance-sensitive Python code, distributed training, or parallel data processing.
user-invocable	true

HPC Python

Python parallelism, PyTorch DDP, and CPU/GPU latency hiding patterns.

Extension Files

| File | Content |
|---|---|
| `parallel-python.md` | Threading vs multiprocessing vs asyncio decision tree, GIL rules, CUDA+fork safety |
| `latency-hiding.md` | CUDA streams, double buffering, compute/comm overlap, CUDAGraphs, async checkpoint |
| `pytorch-ddp.md` | DDP internals, gradient buckets, common bugs, mixed precision, DistributedSampler |
| `preload-caching.md` | Three-level caching: L1 file (disk/memmap/shm), L2 function (lru_cache/dedup/index), L3 variable (buffer/GPU cache/warmup/KV cache) |
| `torch-compile.md` | torch.compile modes, graph break diagnosis/fixes, reading generated Triton as starting point for hand-tuning |
| `benchmarking.md` | Correct GPU timing (CUDA events, torch.utils.benchmark.Timer), warmup, common pitfalls, what to measure |
| `dataloader.md` | DataLoader params (num_workers/pin_memory/prefetch_factor), dataset patterns, data formats, collation, worker issues |

Quick Decision Tree

What is the bottleneck?
├─ I/O bound         → threading (ThreadPoolExecutor) or asyncio
├─ CPU bound         → multiprocessing (mp.Pool, fork BEFORE CUDA!)
├─ GPU bound         → batch inputs, don't parallelize
├─ Mixed CPU→GPU     → pipeline + CUDA streams (see latency-hiding.md)
└─ DDP communication → tune bucket_cap_mb, use model.no_sync()
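
For the I/O-bound branch of the tree, a minimal sketch with `ThreadPoolExecutor` might look like the following. `fake_fetch` and `fetch_all` are hypothetical names, and `time.sleep` stands in for a real blocking call such as a network request or file read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(item_id: int) -> int:
    """Hypothetical stand-in for a blocking I/O call (network, disk)."""
    time.sleep(0.05)  # sleeping releases the GIL, as real blocking I/O does
    return item_id * 2


def fetch_all(item_ids, max_workers: int = 8):
    """Run the blocking calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, item_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    results = fetch_all(range(8))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

Threads help here only because the workers spend their time waiting. The same pattern applied to CPU-bound work would serialize on the GIL, which is why the tree sends that case to multiprocessing.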

Critical Rules

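mp.Pool BEFORE CUDA: Create the multiprocessing pool before any torch.cuda call. A forked child cannot reuse the parent's CUDA context, so forking after CUDA initialization fails or hangs. If CUDA must be initialized first, use the spawn start method.
Never .item() in loops: Each call forces a GPU-to-CPU synchronization. Accumulate on the GPU and transfer only the final result.
pin_memory=True: Required for non_blocking=True host-to-device transfers to actually be asynchronous; copies from pageable memory generally block the host.
DDP no_sync(): Wrap the non-final micro-batches of gradient accumulation in model.no_sync() to skip the redundant AllReduce.
find_unused_parameters=True: Enable only when some parameters receive no gradient in a forward pass; it adds an extra traversal of the autograd graph every iteration.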
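
The `no_sync()` rule can be sketched as a small accumulation loop. This is an illustration under assumptions, not the skill's own code: `accumulation_context` and `train_with_accumulation` are hypothetical helpers, `model` is assumed to be wrapped in `torch.nn.parallel.DistributedDataParallel`, and the code is duck-typed so it imports nothing from PyTorch.

```python
import contextlib


def accumulation_context(model, is_sync_step: bool):
    """Pick the context for one micro-batch.

    On accumulation steps, DDP's no_sync() skips the gradient AllReduce.
    On the final micro-batch a no-op context lets DDP synchronize as usual.
    Modules without no_sync() (a plain nn.Module) always get the no-op context.
    """
    if is_sync_step or not hasattr(model, "no_sync"):
        return contextlib.nullcontext()
    return model.no_sync()


def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps: int):
    """Step the optimizer once every `accum_steps` micro-batches.

    Trailing micro-batches that do not fill a full window are not stepped here.
    """
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        is_sync_step = (i + 1) % accum_steps == 0
        with accumulation_context(model, is_sync_step):
            # Forward and backward both run inside the context; DDP decides
            # whether to synchronize based on the forward pass as well.
            loss = loss_fn(model(inputs), targets) / accum_steps
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```

With `accum_steps=4`, three of every four backward passes skip communication, and the fourth reduces the accumulated gradients in one AllReduce.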

Review Checklist (Python)

□ No .cpu()/.item()/.numpy() in hot loops?
□ DataLoader: num_workers > 0, pin_memory=True?
□ mp.Pool created before CUDA init?
□ Threading used only for I/O, not CPU-bound work?
□ DDP gradient accumulation uses no_sync()?
□ CUDA streams used for transfer/compute overlap?
□ Pre-allocated buffers reused (not created per iteration)?
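
As an illustration of the first checklist item, the sketch below keeps a running loss on the device and calls `.item()` once, after the loop. `mean_loss` is a hypothetical helper, and the model and batches in the usage section are placeholders; the code falls back to CPU when CUDA is unavailable.

```python
import torch


def mean_loss(model, loss_fn, batches, device):
    """Average per-batch loss with a single device-to-host transfer."""
    total = torch.zeros((), device=device)  # running sum stays on the device
    count = 0
    with torch.no_grad():
        for inputs, targets in batches:
            inputs = inputs.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)
            total += loss_fn(model(inputs), targets)  # no .item() in the loop
            count += 1
    return (total / max(count, 1)).item()  # the only synchronization point


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(4, 1).to(device)
    batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
    print(mean_loss(model, torch.nn.functional.mse_loss, batches, device))
```

Calling `.item()` per batch would instead stall the host on every iteration until the GPU finished that batch's work.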
