Name: Adaptbpe General Purpose Specialized
Author: ndpvt-web

Adapt general-purpose BPE tokenizers into domain- or language-specialized tokenizers using the AdaptBPE post-training strategy. Replaces low-utility tokens with high-frequency domain-specific tokens to improve tokenization efficiency without retraining from scratch. Trigger phrases: "adapt tokenizer to domain", "specialize BPE for medical text", "optimize tokenizer for French", "reduce token fertility for code", "adapt vocabulary for legal documents", "domain-specific tokenizer".

Updated: February 13, 2026 at 13:35

Repository: ndpvt-web/arxiv-claude-skills

AdaptBPE: Adapting General-Purpose BPE Tokenizers to Specialized Domains and Languages

This skill enables Claude to guide users through adapting an existing BPE tokenizer (e.g., from GPT-2, LLaMA, Mistral) to a specific domain (medical, legal, code) or language by selectively replacing low-utility tokens with domain-relevant alternatives. The technique, from Liyanage & Yvon (EACL 2026), avoids retraining tokenizers from scratch. Instead, it scores every token in the existing vocabulary by its utility on an adaptation corpus, prunes the least useful tokens, and fills the freed slots with high-frequency subword units from the target domain -- all while keeping vocabulary size constant.

When to Use

When a user wants to improve tokenization efficiency for a specialized domain (medical, legal, scientific, financial text) using an existing LLM tokenizer
When a user is deploying an LLM to a low-resource or morphologically rich language and the default tokenizer over-fragments words in that language
When a user observes high token fertility (many tokens per word) on their target corpus and wants to reduce it
When a user asks how to adapt a HuggingFace tokenizer for a specific use case without training one from scratch
When a user wants to compress domain-specific text more efficiently to fit longer contexts into the same token budget
When a user is fine-tuning an LLM on domain data and wants the tokenizer to match the domain vocabulary

Key Technique

The Problem. Standard BPE tokenizers are trained on broad web-crawl data and allocate vocabulary slots to tokens that cover general English text well. When applied to specialized domains (e.g., biomedical literature with terms like "electroencephalography") or underrepresented languages, the tokenizer fragments words into many small pieces. This increases sequence length, wastes compute, and can degrade model quality.

The AdaptBPE Solution. Rather than training a new tokenizer from scratch (which would require retraining all model embeddings), AdaptBPE performs a vocabulary swap:
(1) Score every token in the existing vocabulary by its utility on the target domain corpus -- utility combines token frequency with how much compression it provides.
(2) Identify low-utility tokens -- those rarely used or providing little compression on domain text.
(3) Run BPE merge learning on the adaptation corpus to discover candidate replacement tokens.
(4) Replace the lowest-utility tokens with the highest-utility new candidates, keeping total vocabulary size fixed.
The result is a tokenizer that shares most of its vocabulary with the original (preserving pretrained embeddings) but swaps out dead weight for domain-relevant subwords.

Why It Works. Most general-purpose tokenizers have a long tail of tokens that encode rare Unicode sequences, emoji combinations, or whitespace patterns unused in any given domain. These can be safely replaced. The adapted tokenizer then produces shorter token sequences on domain text (lower fertility), which means faster inference, longer effective context windows, and better alignment between token boundaries and meaningful linguistic units.
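The four steps above can be sketched on a toy vocabulary. This is an illustrative sketch, not the paper's algorithm: it assumes the frequency-times-length utility proxy used later in this skill, and the names `swap_vocabulary`, `base_counts`, `candidate_counts`, and `protected` are hypothetical.

```python
def utility(token: str, count: int) -> int:
    """Proxy utility: occurrences times characters covered per occurrence."""
    return count * len(token)


def swap_vocabulary(vocab, base_counts, candidate_counts, k, protected=frozenset()):
    """Replace the k lowest-utility tokens with the k best new candidates.

    vocab: dict token -> id
    base_counts: token -> count on the adaptation corpus (missing means 0)
    candidate_counts: new token -> count on the same corpus
    Freed ids are reused, so the vocabulary size never changes.
    """
    removable = [t for t in vocab if t not in protected]
    removable.sort(key=lambda t: (utility(t, base_counts.get(t, 0)), vocab[t]))
    new_tokens = [t for t in candidate_counts if t not in vocab]
    new_tokens.sort(key=lambda t: -utility(t, candidate_counts[t]))
    k = min(k, len(removable), len(new_tokens))
    adapted = dict(vocab)
    for old, new in zip(removable[:k], new_tokens[:k]):
        adapted[new] = adapted.pop(old)  # the new token inherits the freed id
    return adapted


# Toy data (made up): the emoji pair and the ideographic spaces never occur
# in the domain corpus, so they are the first tokens to go.
vocab = {"<eos>": 0, "a": 1, "the": 2, "ing": 3, "😀😀": 4, "\u3000\u3000": 5}
base_counts = {"a": 900, "the": 400, "ing": 250}
candidate_counts = {"cardio": 120, "emia": 90, "the": 400}  # "the" already exists
adapted = swap_vocabulary(vocab, base_counts, candidate_counts, k=2,
                          protected={"<eos>", "a"})
print(adapted)
```

Reusing the freed ids is one simple way to keep the embedding matrix shape unchanged; the rows at those ids still need to be re-initialized for the new tokens.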

Step-by-Step Workflow

1. Select the base tokenizer. Load the pretrained tokenizer you want to adapt (e.g., AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")). Record its vocabulary size V and the full merge list.
2. Prepare the adaptation corpus. Collect 1-10 million tokens of representative text from the target domain or language. Clean it (remove boilerplate, deduplicate). This corpus drives both utility scoring and new token discovery.
3. Tokenize the adaptation corpus with the base tokenizer. Run the base tokenizer over the entire adaptation corpus and collect token frequency counts. Compute fertility (average tokens per whitespace-delimited word) and proportion of continued words (fraction of words split into 2+ tokens) as baseline metrics.
4. Score token utility. For each token t in the vocabulary, compute a utility score combining: (a) frequency of t in the tokenized adaptation corpus, and (b) the compression contribution of t (how many characters it covers per occurrence). Tokens with zero or near-zero frequency on the domain corpus are prime candidates for replacement. Rank all tokens by utility ascending.
5. Select tokens to replace. Choose the bottom K tokens by utility score for removal. A typical starting point is K = 5-15% of vocabulary size. Exclude special tokens (<bos>, <eos>, <pad>, <unk>) and single-byte tokens from removal -- these must be preserved for correctness.
6. Learn new merge candidates from the adaptation corpus. Run standard BPE training on the raw adaptation corpus to generate a fresh merge list. Extract candidate tokens that are NOT already in the base vocabulary. Rank these candidates by their frequency in the adaptation corpus.
7. Swap tokens. Replace the K lowest-utility tokens with the top K new candidates. For each replacement, update the vocabulary mapping (token-to-ID), add the corresponding merge rule to the tokenizer's merge list, and drop any merge rule that produces or consumes a removed token, since a merge that references a token missing from the vocabulary leaves the tokenizer inconsistent.
8. Re-initialize embeddings for new tokens. For each newly added token, initialize its embedding as the mean of the embeddings of its constituent subtokens in the original model. This gives the model a reasonable starting point for the new token.
9. Validate the adapted tokenizer. Re-tokenize the adaptation corpus with the new tokenizer. Measure fertility and proportion of continued words. Verify that fertility decreased and single-token word coverage increased. Spot-check domain-specific terms to confirm they are tokenized into fewer pieces.
10. Fine-tune the model (optional but recommended). Run a short continued pretraining or fine-tuning pass on domain data so the model learns proper contextual representations for newly added tokens. Even a few hundred steps can significantly improve downstream performance.
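The two metrics used for the baseline and for validation can be sketched as follows. This is a minimal illustration under stated assumptions: `tokenization_metrics` and `toy_tokenize` are hypothetical helpers, and the greedy longest-match segmenter only stands in for a real BPE tokenizer so the example runs without downloading a model.

```python
def tokenization_metrics(texts, tokenize):
    """Return (fertility, proportion of continued words) over `texts`.

    `tokenize` maps one whitespace-delimited word to its list of tokens.
    """
    n_words = n_tokens = n_continued = 0
    for text in texts:
        for word in text.split():
            pieces = tokenize(word)
            n_words += 1
            n_tokens += len(pieces)
            n_continued += len(pieces) > 1  # word was split into 2+ tokens
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words, n_continued / n_words


def toy_tokenize(word, vocab):
    """Greedy longest-match segmentation; a stand-in for a real BPE tokenizer."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:  # fall back to one character
                pieces.append(word[i:j])
                i = j
                break
    return pieces


corpus = ["cardiomyopathy and anemia", "anemia and neuropathy"]
base_vocab = {"and", "an", "card", "io", "my", "op", "athy", "em", "ia", "ne", "ur"}
adapted_vocab = base_vocab | {"cardio", "myo", "pathy", "anemia", "neuro"}

before = tokenization_metrics(corpus, lambda w: toy_tokenize(w, base_vocab))
after = tokenization_metrics(corpus, lambda w: toy_tokenize(w, adapted_vocab))
print(f"before: fertility={before[0]:.2f}, continued={before[1]:.2f}")
print(f"after:  fertility={after[0]:.2f}, continued={after[1]:.2f}")
```

With a real tokenizer, `tokenize` could wrap its per-word tokenization call; both metrics should be computed on the same corpus before and after the swap so the comparison is fair.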

Concrete Examples

Example 1: Adapting LLaMA tokenizer for biomedical text

User: "I'm fine-tuning LLaMA-3-8B on PubMed abstracts. The tokenizer splits medical terms into too many pieces. How can I adapt it?"

Approach:

Load the LLaMA-3 tokenizer (128K vocab) and a sample of 5M tokens from PubMed abstracts.
Tokenize the corpus and compute fertility. Expect ~1.8 tokens/word vs ~1.4 for general English.
Score all 128K tokens by frequency on PubMed. Identify tokens like emoji sequences, CJK character combinations, and rare Unicode blocks that never appear in biomedical text.
Run BPE on raw PubMed text to discover medical subwords: cardio, pathy, neur, oscopy, pharma, glyc, emia, etc.
Replace the bottom 10K low-utility tokens with the top 10K medical candidates.
Initialize new embeddings by averaging constituent subtoken embeddings.
Validate: fertility drops to ~1.45, "electroencephalography" goes from 6 tokens to 3.

Output:

import collections

from transformers import AutoTokenizer

# Step 1: Load base tokenizer (gated repository: accept the license on the Hub first)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
vocab = tokenizer.get_vocab()  # token string -> id
V = len(vocab)

# Step 2: Load adaptation corpus
with open("pubmed_abstracts.txt", encoding="utf-8") as f:
    corpus = f.read()

# Step 3: Tokenize and count frequencies
# (add_special_tokens=False keeps BOS/EOS out of the counts)
token_ids = tokenizer.encode(corpus, add_special_tokens=False)
freq = collections.Counter(token_ids)

# Step 4: Score utility (frequency * length of the stored token string).
# In a byte-level BPE vocabulary each byte is stored as one character,
# so len(token_str) is the number of bytes the token covers.
# Tokens that never occur on the domain corpus get utility 0.
utility = {tid: freq.get(tid, 0) * len(token_str)
           for token_str, tid in vocab.items()}

# Step 5: Rank by utility ascending and select the bottom K,
# excluding special/added tokens and single-byte tokens
protected_ids = set(tokenizer.all_special_ids) | set(tokenizer.get_added_vocab().values())
removable = [tid for token_str, tid in vocab.items()
             if tid not in protected_ids and len(token_str) > 1]
removable.sort(key=lambda tid: utility[tid])
K = min(len(removable), int(0.10 * V))
to_remove = removable[:K]

print(f"Replacing {K} tokens ({100*K/V:.1f}% of vocabulary)")
print(f"Baseline fertility: {len(token_ids) / len(corpus.split()):.3f}")
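The snippet above stops after selecting tokens to remove. The merge-learning step that follows (discovering replacement candidates on the adaptation corpus) can be sketched with a tiny character-level BPE trainer. This is a teaching sketch under stated assumptions, not the paper's implementation: `learn_bpe_merges` and `new_candidates` are hypothetical names, the corpus is made up, and a real pipeline would work at the byte level with the base tokenizer's pre-tokenization rules.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words.

    Returns the merge list [(left, right), ...] in the order learned.
    """
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent adjacent pair; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges


def new_candidates(merges, base_vocab):
    """Tokens created by `merges` that the base vocabulary lacks, in learned order."""
    seen, out = set(), []
    for left, right in merges:
        token = left + right
        if token not in base_vocab and token not in seen:
            seen.add(token)
            out.append(token)
    return out


words = ["emia"] * 3 + ["edema"] * 2
merges = learn_bpe_merges(words, num_merges=3)
candidates = new_candidates(merges, base_vocab={"e", "m", "i", "a", "d", "em"})
print(merges)
print(candidates)
```

Because BPE always merges the currently most frequent pair, the learned order is only a rough proxy for candidate frequency; re-counting each candidate on the adaptation corpus gives the ranking the workflow asks for.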

Example 2: Specializing a tokenizer for French

User: "Our GPT-2 based model tokenizes French poorly -- words like 'aujourd'hui' and 'c'est' get fragmented badly. Can we fix the tokenizer?"

Approach:

Load the GPT-2 tokenizer (50,257 vocab) and a French corpus (e.g., French Wikipedia, 10M tokens).
Tokenize and measure: expect fertility ~1.9 for French vs ~1.3 for English.
Identify English-centric tokens with zero French utility: tokens encoding English contractions ("n't", "'ve"), rare English compounds, ASCII art patterns.
Run BPE on the French corpus to discover French-specific merges: aujourd, qu', c'est, ment, tion, common French affixes.
Swap ~3,000 tokens (6% of vocab). Prioritize tokens that capture French contractions and frequent morphological patterns.
Validate: French fertility drops from 1.9 to ~1.5. "aujourd'hui" tokenizes as 1--2 tokens instead of 4.

Output:

from transformers import AutoTokenizer

# Measure before/after fertility for French text
test_sentences = [
    "Aujourd'hui, c'est une journée magnifique.",
    "L'électroménager est en promotion.",
    "Qu'est-ce que vous en pensez?"
]

def report_fertility(tok, label):
    """Print token count and fertility (tokens per whitespace word) per sentence."""
    print(f"=== {label} ===")
    for sent in test_sentences:
        tokens = tok.tokenize(sent)
        words = sent.split()
        print(f"  '{sent}' -> {len(tokens)} tokens, "
              f"fertility={len(tokens)/len(words):.2f}")
        print(f"  Tokens: {tokens}")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
report_fertility(tokenizer, "Before Adaptation")

# After adaptation: adapted_tokenizer is the tokenizer produced by the swap
# step; it is not defined in this snippet.
# report_fertility(adapted_tokenizer, "After Adaptation")

Example 3: Adapting for source code tokenization

User: "I want to use a general LLM for code review, but the tokenizer wastes tokens on common programming patterns. How do I adapt it?"

Approach:

Load base tokenizer and collect 5M tokens of Python/JS/Rust source code.
Common programming patterns fragmented by general tokenizers: def __init__, self., import, return, function, const, async/await.
Identify low-utility general tokens: tokens for natural language phrases, literary vocabulary, rare proper nouns.
Run BPE on the code corpus. Discover code-specific tokens: self., __init__, return, import, function, common variable patterns like _id, _name.
Replace ~5% of vocabulary.
Validate: code fertility drops, common 3-token patterns become 1-token, effective context window for code increases by ~15%.

Output:

Before: def __init__(self, name):  -> ['def', ' __', 'init', '__(', 'self', ',', ' name', '):']  (8 tokens)
After:  def __init__(self, name):  -> ['def', ' __init__', '(self', ',', ' name', '):']           (6 tokens)

Before fertility on Python corpus: 1.72
After  fertility on Python corpus: 1.41
Token savings: ~18% fewer tokens for equivalent code
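Token savings and context-window gain are related but not the same number, which is easy to confuse when reporting results. The arithmetic below uses the two fertility figures from the output above; the function names are hypothetical.

```python
def token_savings(fertility_before, fertility_after):
    """Fraction of tokens saved when encoding the same text."""
    return 1 - fertility_after / fertility_before


def context_gain(fertility_before, fertility_after):
    """Relative increase in words that fit into a fixed token budget."""
    return fertility_before / fertility_after - 1


savings = token_savings(1.72, 1.41)
gain = context_gain(1.72, 1.41)
print(f"token savings: {savings:.1%}, extra text per context window: {gain:.1%}")
# prints: token savings: 18.0%, extra text per context window: 22.0%
```

Saving 18% of the tokens means each remaining token carries more text, so the same window holds roughly 22% more words, assuming the fertility measured on the corpus is representative of the text placed in the context.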

Best Practices

Do: Start conservatively -- replace 5% of vocabulary first, measure improvement, then iterate. Small swaps with high-frequency replacements give the best ROI.
Do: Always preserve single-byte tokens (byte-level fallback) and all special tokens. Removing these breaks the tokenizer's ability to encode arbitrary input.
Do: Initialize new token embeddings as the mean of their subtoken constituents. This gives the model a warm start and dramatically reduces the fine-tuning needed.
Do: Measure both fertility (tokens/word) and downstream task performance. Lower fertility alone does not guarantee better model quality -- validate end-to-end.
Avoid: Replacing tokens that appear frequently in the base model's pretraining distribution, even if rare in your domain. These tokens may be important for the model's general capabilities you still want to retain.
Avoid: Adapting more than 15% of the vocabulary without significant continued pretraining. Large vocabulary changes destabilize the model's learned representations.
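The mean-of-subtokens initialization recommended above can be sketched with plain Python lists so it runs without a model. The table values and the helper name `mean_init` are made up for illustration; with a real model the same averaging would be applied to rows of the embedding matrix.

```python
def mean_init(embeddings, subtoken_ids):
    """Average the rows of `embeddings` (equal-length vectors) at `subtoken_ids`."""
    if not subtoken_ids:
        raise ValueError("a new token needs at least one constituent subtoken")
    rows = [embeddings[i] for i in subtoken_ids]
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy 4-token, 3-dimensional embedding table (values are made up).
embeddings = [
    [0.0, 0.0, 0.0],  # id 0: low-utility token being replaced
    [1.0, 2.0, 3.0],  # id 1: "card"
    [3.0, 2.0, 1.0],  # id 2: "io"
    [5.0, 5.0, 5.0],  # id 3: unrelated token
]

# Assume the base tokenizer splits the new token "cardio" into ids [1, 2].
# Compute the mean from the ORIGINAL table before overwriting any row.
new_vector = mean_init(embeddings, [1, 2])
embeddings[0] = new_vector  # the new token reuses the freed id 0
print(new_vector)
```

Two details matter in practice. All new vectors should be computed from a copy of the original table, because a constituent subtoken may itself be among the rows being overwritten. If the model does not tie input and output embeddings, the output projection rows at the reused ids need the same treatment.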

Error Handling

Problem	Cause	Solution
Tokenizer produces `<unk>` after adaptation	Single-byte fallback tokens were removed	Restore all 256 byte-level tokens; never remove them
Model quality degrades after adaptation	Too many tokens replaced or embeddings not initialized	Reduce K, use mean-subtoken embedding init, add fine-tuning steps
Fertility barely improves	Replacement candidates overlap with existing tokens	Verify new candidates are genuinely absent from original vocab; lower the merge threshold
Merge list inconsistencies	New merge rules conflict with existing ones	Insert new merges at the end of the merge list (lowest priority); let existing merges take precedence
OOV errors on general text	Over-specialized vocabulary lost general coverage	Keep K below 10% for models that must handle mixed-domain input
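Several of the problems in the table can be caught before any model is loaded by checking the adapted vocabulary and merge list for consistency. The checker below is a hypothetical sketch over plain Python structures (`vocab` as a token-to-id dict, `merges` as an ordered list of pairs); it is not an API of any tokenizer library.

```python
def check_adapted_tokenizer(vocab, merges, byte_tokens, special_tokens):
    """Return a list of problems; an empty list means all checks passed.

    vocab: dict token -> id
    merges: ordered list of (left, right) pairs
    byte_tokens / special_tokens: tokens that must survive adaptation
    """
    problems = []
    missing_bytes = [t for t in byte_tokens if t not in vocab]
    if missing_bytes:
        problems.append(f"{len(missing_bytes)} byte-level tokens missing")
    missing_special = [t for t in special_tokens if t not in vocab]
    if missing_special:
        problems.append(f"special tokens missing: {missing_special}")
    if len(set(vocab.values())) != len(vocab):
        problems.append("duplicate token ids")
    for left, right in merges:
        # Both inputs and the merged result must exist in the vocabulary.
        for tok in (left, right, left + right):
            if tok not in vocab:
                problems.append(
                    f"merge ({left!r}, {right!r}) references missing token {tok!r}")
                break
    return problems


good_vocab = {"<eos>": 0, "a": 1, "b": 2, "ab": 3}
broken_vocab = {"<eos>": 0, "a": 1, "ab": 3}  # single-byte token "b" was removed
merges = [("a", "b")]

ok = check_adapted_tokenizer(good_vocab, merges, ["a", "b"], ["<eos>"])
broken = check_adapted_tokenizer(broken_vocab, merges, ["a", "b"], ["<eos>"])
print(ok)
print(broken)
```

Running a check like this right after the swap, and again after saving and reloading the tokenizer files, localizes a failure to the vocabulary, the merge list, or the serialization step.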

Limitations

Requires continued pretraining or fine-tuning. Swapping vocabulary tokens without any model adaptation gives tokenization improvements but can hurt generation quality. Budget for at least a short fine-tuning pass.
Not a substitute for training a domain tokenizer from scratch. If you are training a model from scratch for a specific domain, train the tokenizer natively on domain data. AdaptBPE is specifically for adapting existing pretrained models.
Diminishing returns past ~10-15% replacement. The more tokens you swap, the more you diverge from the pretrained model's learned representations. Beyond 15%, consider whether a full retrain is more appropriate.
Does not fix architectural limitations. If the model's context window is the bottleneck, reduced fertility helps but cannot compensate for fundamental architecture constraints.
Evaluation on downstream tasks is essential. Fertility is a proxy metric. A tokenizer with lower fertility but poor token boundaries (splitting mid-morpheme) can perform worse than a higher-fertility tokenizer with linguistically sound splits.
Language-specific considerations. For agglutinative languages (Turkish, Finnish) or logographic scripts (Chinese, Japanese), the adaptation dynamics differ. Test carefully -- character-level tokens in CJK may already be optimal.

Reference

Liyanage, V. & Yvon, F. (2026). AdaptBPE: From General Purpose to Specialized Tokenizers. EACL 2026. arXiv:2601.21665

Key sections to consult: the utility scoring formula (Section 3), the token replacement algorithm (Algorithm 1), and the multilingual evaluation (Section 5) showing fertility improvements across domains and languages.
