hpc-python
Python HPC patterns: threading/multiprocessing/async, CUDA streams, latency hiding, PyTorch DDP. Use when writing performance-sensitive Python code, distributed training, or parallel data processing.
| name | hpc-python |
| description | Python HPC patterns — threading/multiprocessing/async, CUDA streams, latency hiding, PyTorch DDP. Use when writing performance-sensitive Python code, distributed training, or parallel data processing. |
| user-invocable | true |
Python parallelism, PyTorch DDP, and CPU/GPU latency hiding patterns.
| File | Content |
|---|---|
| parallel-python.md | Threading vs multiprocessing vs asyncio decision tree, GIL rules, CUDA+fork safety |
| latency-hiding.md | CUDA streams, double buffering, compute/comm overlap, CUDAGraphs, async checkpoint |
| pytorch-ddp.md | DDP internals, gradient buckets, common bugs, mixed precision, DistributedSampler |
| preload-caching.md | Three-level caching: L1 file (disk/memmap/shm), L2 function (lru_cache/dedup/index), L3 variable (buffer/GPU cache/warmup/KV cache) |
| torch-compile.md | torch.compile modes, graph break diagnosis/fixes, reading generated Triton as starting point for hand-tuning |
| benchmarking.md | Correct GPU timing (CUDA events, torch.utils.benchmark.Timer), warmup, common pitfalls, what to measure |
| dataloader.md | DataLoader params (num_workers/pin_memory/prefetch_factor), dataset patterns, data formats, collation, worker issues |
What is the bottleneck?
├─ I/O bound → threading (ThreadPoolExecutor) or asyncio
├─ CPU bound → multiprocessing (mp.Pool, fork BEFORE CUDA!)
├─ GPU bound → batch inputs, don't parallelize
├─ Mixed CPU→GPU → pipeline + CUDA streams (see latency-hiding.md)
└─ DDP communication → tune bucket_cap_mb, use model.no_sync()
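
The first two branches of the tree can be sketched with the standard library alone. This is a minimal illustration, not the skill's own code: `fake_io` and `cpu_heavy` are hypothetical stand-ins for real workloads, and `ProcessPoolExecutor` from `concurrent.futures` is used here in place of `mp.Pool` as an assumed equivalent for the CPU-bound branch.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_io(delay):
    """Hypothetical stand-in for an I/O-bound call; sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def cpu_heavy(n):
    """Hypothetical stand-in for CPU-bound work; a pure-Python loop holds the GIL."""
    return sum(i * i for i in range(n))


def run_io_bound(delays, max_workers=8):
    # Threads suffice here: each worker spends its time waiting, not computing.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_io, delays))


def run_cpu_bound(sizes, max_workers=4):
    # Separate processes sidestep the GIL. Create the pool before any CUDA
    # initialization in the parent process (the "fork BEFORE CUDA" rule).
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cpu_heavy, sizes))
```

On platforms where the default start method is not fork, `run_cpu_bound` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.
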
- Fork before CUDA: Create mp.Pool before any torch.cuda call (fork+CUDA = deadlock)
- .item() in loops: Accumulate on GPU, transfer final result only
- pin_memory=True: Required for non_blocking=True transfers to actually be async
- no_sync(): Use during gradient accumulation steps to avoid redundant AllReduce
- find_unused_parameters=True: Only when needed; it is expensive

□ No .cpu()/.item()/.numpy() in hot loops?
□ DataLoader: num_workers > 0, pin_memory=True?
□ mp.Pool created before CUDA init?
□ Threading used only for I/O, not CPU-bound work?
□ DDP gradient accumulation uses no_sync()?
□ CUDA streams used for transfer/compute overlap?
□ Pre-allocated buffers reused (not created per iteration)?
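
The no_sync() item can be sketched as a gradient-accumulation loop that synchronizes only on the last micro-step of each window. This is a rough illustration under stated assumptions: `sync_context` and `train_with_accumulation` are hypothetical helper names, `model` is assumed to be a `torch.nn.parallel.DistributedDataParallel` instance (which exposes `no_sync()`), and the loop is duck-typed so that it also runs with a plain, non-DDP model.

```python
import contextlib


def sync_context(model, is_last_micro_step):
    """Skip gradient AllReduce on every micro-step except the last one.

    Assumes a DDP-wrapped model exposing no_sync(); any other object
    falls back to a no-op context manager.
    """
    if not is_last_micro_step and hasattr(model, "no_sync"):
        return model.no_sync()
    return contextlib.nullcontext()


def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps):
    """Hypothetical loop: one optimizer step per `accum_steps` micro-batches."""
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches, start=1):
        is_last = step % accum_steps == 0
        with sync_context(model, is_last):
            # Scale so the accumulated gradient matches one large batch.
            loss = loss_fn(model(inputs), targets) / accum_steps
            # Under no_sync() gradients accumulate locally; the backward pass
            # on the last micro-step triggers the AllReduce.
            loss.backward()
        if is_last:
            optimizer.step()
            optimizer.zero_grad()
    # Note: trailing micro-batches that do not fill a complete window are
    # accumulated but never applied in this sketch.
```

With `accum_steps=4`, three of every four backward passes run inside `no_sync()`, so communication happens once per optimizer step instead of once per micro-batch.
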