Name: Hf Generate Internals
Author: JoaquinCampo

Description: HF Transformers generate() internals: scores vs logits, LogitsProcessors, KV cache, StoppingCriteria, chat templates. Use when code calls model.generate(), output_scores, output_logits, return_dict_in_generate, GenerateDecoderOnlyOutput, LogitsProcessor, StoppingCriteria, past_key_values, DynamicCache, apply_chat_template, do_sample, or num_beams.

Updated: March 23, 2026, 21:16

SKILL.md


name	hf-generate-internals
description	HF Transformers generate() internals — scores vs logits, LogitsProcessors, KV cache, StoppingCriteria, chat templates. Use when code calls model.generate(), output_scores, output_logits, return_dict_in_generate, GenerateDecoderOnlyOutput, LogitsProcessor, StoppingCriteria, past_key_values, DynamicCache, apply_chat_template, do_sample, or num_beams.

HuggingFace Transformers Generation Internals

Reference for HF Transformers generation pipeline internals, verified against transformers 5.3.0 source code. Covers the full path from model.generate() to output tensors.

Extension Files

| File | Content |
| --- | --- |
| `references/generate-loop.md` | `generate()` entry point, mode dispatch, `_sample()` loop structure, token selection, prefill |
| `references/scores-and-logits.md` | `output_scores` vs `output_logits`, tensor shapes, float32 conversion, `GenerateDecoderOnlyOutput`, memory, `compute_transition_scores()` |
| `references/logits-processors.md` | `LogitsProcessorList`, processing order, always-active vs sampling-only processors, key implementations |
| `references/kv-cache.md` | Cache class hierarchy, `DynamicCache`, attention mask / position_id updates, kvpress interaction |
| `references/stopping-criteria.md` | `StoppingCriteria` interface, built-in criteria, custom stopping, when checked in the loop |
| `references/chat-templates.md` | `apply_chat_template()`, Jinja2 rendering, `add_generation_prompt`, `continue_final_message`, tokenization patterns |

Quick Decision Tree

What do you need from generation?
│
├─ Per-token logit features    → output_scores=True (see scores-and-logits.md)
│  └─ Need truly raw logits?   → output_logits=True (bypasses LogitsProcessors)
│
├─ Custom stopping logic       → StoppingCriteria subclass (see stopping-criteria.md)
│
├─ Modify token probabilities  → LogitsProcessor subclass (see logits-processors.md)
│
├─ KV cache compression/debug  → past_key_values / DynamicCache (see kv-cache.md)
│
├─ Chat-formatted prompts      → tokenizer.apply_chat_template() (see chat-templates.md)
│
└─ Understand the loop itself  → _sample() internals (see generate-loop.md)

Critical Facts

scores != logits: .scores are post-LogitsProcessor; .logits are raw model output. Both are pre-softmax. With do_sample=False and no active processors (neither custom ones nor config-enabled ones such as repetition_penalty), they're identical.
Greedy and sampling share _sample(): In v5.3.0, there is no separate _greedy_search(). The do_sample flag controls behavior inside the same method.
Always float32: Logits are converted to float32 before processing (line 2754), regardless of model precision (float16/bfloat16).
Score shape: Each element of outputs.scores is (batch_size, vocab_size), one tensor per generated token. With beam search the first dimension is batch_size * num_beams.
Memory cost: For Qwen2.5-7B (vocab=152064), 512 tokens of scores cost 297 MiB (about 311 MB) per batch element in float32 (152064 × 512 × 4 bytes).
Sampling processors are gated: Temperature, top-k, and top-p warpers are only added when do_sample=True. With greedy decoding, scores pass through the pipeline unmodified unless an always-active processor (for example repetition_penalty) is configured.
Left-pad required: Decoder-only models require left-padding for batched generation. HF warns if right-padding is detected.
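
The memory figure above can be reproduced with a short back-of-the-envelope calculation. The helper name `score_memory_bytes` is hypothetical; it only multiplies out the tensor dimensions, assuming one float32 tensor of shape (batch_size, vocab_size) is kept per generated token.

```python
def score_memory_bytes(vocab_size: int, num_tokens: int,
                       batch_size: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed to keep one (batch_size, vocab_size) tensor per generated token."""
    return vocab_size * num_tokens * batch_size * bytes_per_value


# Values taken from the Qwen2.5-7B example: vocab=152064, 512 tokens, float32.
total = score_memory_bytes(vocab_size=152064, num_tokens=512)
mib = total / 2**20   # binary mebibytes
mb = total / 10**6    # decimal megabytes
print(f"{total} bytes = {mib:.1f} MiB = {mb:.1f} MB")
# prints: 311427072 bytes = 297.0 MiB = 311.4 MB
```

Requesting both `output_scores` and `output_logits` keeps two such lists, so the cost roughly doubles whenever a processor returns a new tensor instead of the raw one.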

Review Checklist

□ Using return_dict_in_generate=True with output_scores=True?
□ Indexing scores correctly? scores[t] is (batch, vocab), not (vocab,)
□ Know whether scores are pre- or post-processor for your use case?
□ Not holding all score tensors on GPU unnecessarily? (move to CPU or process inline)
□ Padding direction correct for batched generation? (left-pad for decoder-only)
□ StoppingCriteria returns BoolTensor of shape (batch,)?
□ Custom LogitsProcessor signature: (input_ids: LongTensor, scores: FloatTensor) -> FloatTensor?
□ Chat template rendered with add_generation_prompt=True for generation?

package.json

"author": "JoaquinCampo"

"repository": "JoaquinCampo/Skills"
