| name | huggingface-tokenizers |
| description | Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["tokenizers","transformers","datasets"] |
| platforms | ["linux","macos","windows"] |
| metadata | {"hermes":{"tags":["Tokenization","HuggingFace","BPE","WordPiece","Unigram","Fast Tokenization","Rust","Custom Tokenizer","Alignment Tracking","Production"]}} |
HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when:
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI's BPE tokenizer for GPT models
- transformers AutoTokenizer: When you only need to load a pretrained tokenizer (fast tokenizers use this library internally)
Quick start
Installation
# Core library only
pip install tokenizers
# With transformers integration
pip install tokenizers transformers
Load pretrained tokenizer
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)
print(output.ids)
text = tokenizer.decode(output.ids)
print(text)
Train custom BPE tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)
tokenizer.save("my-tokenizer.json")
Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
Batch encoding with padding
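# pad_id must match the ID of "[PAD]" in the vocabulary (3 for the tokenizer trained above)
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
# With no fixed length set, shorter sequences are padded to the longest in the batch
for encoding in encodings:
    print(encoding.ids)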
Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find the most frequent pair of adjacent symbols
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
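The merge loop above can be sketched in plain Python. This is a toy illustration only, not the library's Rust implementation: the function name `train_bpe`, the toy word counts, and the tie-breaking rule (first pair seen wins) are all assumptions made for the example.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Toy BPE trainer: word_counts maps a word to its corpus frequency."""
    # Start from a character-level split of every word.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
bpe_merges, bpe_corpus = train_bpe(toy_counts, num_merges=3)
print(bpe_merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the suffix "est" has become a single symbol, which is the kind of frequent subword a real vocabulary would also tend to pick up.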
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Handles OOV words well (breaks into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages
Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
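To make the merge-order trade-off concrete, here is a small sketch that replays a list of merges on one word. The helper `apply_merges` is hypothetical and written for this example; it assumes merges are applied one at a time in the order they were learned.

```python
def apply_merges(word, merges):
    """Encode one word by replaying learned merges in training order."""
    symbols = list(word)
    for left, right in merges:
        i, out = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Same two merges, different order, different tokenization.
print(apply_merges("abc", [("a", "b"), ("b", "c")]))  # ['ab', 'c']
print(apply_merges("abc", [("b", "c"), ("a", "b")]))  # ['a', 'bc']
```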
WordPiece
How it works:
- Start with character vocabulary
- Score merge pairs:
frequency(pair) / (frequency(first) × frequency(second))
- Merge highest scoring pair
- Repeat until vocabulary size reached
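The scoring rule above can be checked on a toy corpus. The sketch below is an illustration under stated assumptions (the function name `wordpiece_pair_scores` and the word counts are made up for the example); it only computes the scores for one round and does not perform the merge.

```python
from collections import Counter

def wordpiece_pair_scores(word_counts):
    """Score adjacent pairs as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq, pair_freq = Counter(), Counter()
    for word, count in word_counts.items():
        # The first character is bare; later characters carry the "##" prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for s in symbols:
            symbol_freq[s] += count
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += count
    return {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

toy = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores = wordpiece_pair_scores(toy)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # ('##g', '##s') 0.05
```

Note that the most frequent pair here is ("##u", "##g") with 20 occurrences, yet it scores only 1/36 because "##u" is common on its own. The rarer pair ("##g", "##s") wins, which is the main behavioral difference from BPE.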
Used by: BERT, DistilBERT, MobileBERT
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
Advantages:
- Prioritizes pairs that occur together more often than their individual frequencies would predict, rather than simply the most frequent pairs
- Used successfully in BERT (state-of-the-art results)
Trade-offs:
- Unknown words become [UNK] if no subword match is found
- Saves only the final vocabulary, not merge rules; encoding uses a greedy longest-match-first lookup
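The [UNK] behavior follows from how a word is matched against the vocabulary at encoding time. Below is a minimal sketch of greedy longest-match-first splitting; `wordpiece_encode` and the toy vocabulary are hypothetical and exist only for this illustration.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word against a fixed vocab."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"hug", "##s", "##g", "h", "##u", "p", "##un"}
print(wordpiece_encode("hugs", toy_vocab))  # ['hug', '##s']
print(wordpiece_encode("pun", toy_vocab))   # ['p', '##un']
print(wordpiece_encode("bun", toy_vocab))   # ['[UNK]']
```

The last call shows the trade-off: a single missing character ("b") turns the entire word into one unknown token.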
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts
Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
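The "most likely tokenization" idea can be illustrated with a small Viterbi search over a fixed unigram vocabulary. This is a sketch under assumptions: the probabilities in `toy_probs` are invented, `viterbi_segment` is a hypothetical helper, and the training step that prunes the vocabulary is not shown.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of text under a unigram model."""
    n = len(text)
    # best[i] = (score of the best split of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no segmentation covers the whole string
    # Walk the back-pointers from the end to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

toy_probs = {"un": 0.1, "happy": 0.05, "u": 0.02, "n": 0.02,
             "h": 0.01, "a": 0.03, "p": 0.02, "y": 0.02, "unhappy": 0.001}
toy_log_probs = {tok: math.log(p) for tok, p in toy_probs.items()}
print(viterbi_segment("unhappy", toy_log_probs))  # ['un', 'happy']
```

Here "un" + "happy" has probability 0.1 × 0.05 = 0.005, which beats the single token "unhappy" at 0.001, so the two-token split is chosen.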
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
Normalization
Clean and standardize text:
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
tokenizer.normalizer = Sequence([
    NFD(),
    Lowercase(),
    StripAccents()
])
Common normalizers:
NFD, NFC, NFKD, NFKC - Unicode normalization forms
Lowercase() - Convert to lowercase
StripAccents() - Remove accents (é → e); apply after NFD so accents are separate code points
Strip() - Remove leading and trailing whitespace
Replace(pattern, content) - Replace a string or regex pattern
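To see what the NFD → Lowercase → StripAccents sequence does to a string, here is a standard-library approximation. It is not the library's code: the function `normalize` is hypothetical, and treating Unicode category "Mn" as "accent" is an assumption that covers common Latin diacritics.

```python
import unicodedata

def normalize(text):
    """Approximate NFD -> Lowercase -> StripAccents using only the stdlib."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # After NFD, accents are separate combining marks (category "Mn").
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

print(normalize("Héllò hôw are ü?"))  # hello how are u?
```

The order matters: stripping accents before decomposition would leave precomposed characters such as "é" untouched.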
Pre-tokenization
Split text into word-like units:
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])
Common pre-tokenizers:
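Whitespace() - Split into runs of word characters and runs of punctuation, dropping whitespace (use WhitespaceSplit() to split on whitespace only)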
ByteLevel() - GPT-2 style byte-level splitting
Punctuation() - Isolate punctuation
Digits(individual_digits=True) - Split digits individually
Metaspace() - Replace spaces with ▁ (SentencePiece style)
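Pre-tokenizers return pieces together with their character offsets, which is what later makes alignment tracking possible. The sketch below mimics a word-and-punctuation split with a regular expression; the pattern and the helper `pre_tokenize` are assumptions for illustration, not the library's internals.

```python
import re

# Assumed pattern: runs of word characters, or runs of punctuation.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Split text into (piece, (start, end)) pairs, keeping character offsets."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]

print(pre_tokenize("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```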
Post-processing
Add special tokens for model input:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
Common patterns:
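# GPT-2 style: append the end-of-text token
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)
# RoBERTa style: wrap in <s> ... </s>, with a double </s> between pair segments
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)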
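Conceptually, a template is expanded by substituting the encoded sequences for `$A` and `$B` and looking up IDs for the special tokens. The sketch below shows that expansion on plain ID lists. `apply_template` is a hypothetical helper, and it ignores type IDs (the `:0` / `:1` suffixes the real template syntax supports).

```python
def apply_template(template, special_ids, seq_a, seq_b=None):
    """Expand a template such as "[CLS] $A [SEP]" into a flat list of ids."""
    ids = []
    for item in template.split():
        if item == "$A":
            ids.extend(seq_a)
        elif item == "$B":
            if seq_b is None:
                raise ValueError("template uses $B but no second sequence given")
            ids.extend(seq_b)
        else:
            # Anything else is treated as a special token name.
            ids.append(special_ids[item])
    return ids

special = {"[CLS]": 1, "[SEP]": 2}
print(apply_template("[CLS] $A [SEP]", special, [7, 8, 9]))
# [1, 7, 8, 9, 2]
print(apply_template("[CLS] $A [SEP] $B [SEP]", special, [7, 8], [5]))
# [1, 7, 8, 2, 5, 2]
```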
Alignment tracking
Track token positions in original text:
text = "Hello, world!"
output = tokenizer.encode(text)
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    # Offsets are character positions in the original string
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
Use cases:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
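For the use cases above, the usual step is to map a character span (for example an annotated entity) onto token indices. Here is a minimal sketch. The offsets list is hand-written to resemble what an encoding of "Hello, world!" with [CLS]/[SEP] might return, including the assumption that special tokens carry (0, 0) offsets; `tokens_in_span` is a hypothetical helper.

```python
def tokens_in_span(offsets, span_start, span_end):
    """Return indices of tokens whose [start, end) overlaps the character span."""
    return [
        i for i, (start, end) in enumerate(offsets)
        if start < span_end and end > span_start and end > start
    ]

# Hypothetical offsets, shaped like Encoding.offsets for "Hello, world!".
text = "Hello, world!"
offsets = [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)]
entity = (7, 12)  # character span of "world"
hits = tokens_in_span(offsets, *entity)
print(hits, [text[offsets[i][0]:offsets[i][1]] for i in hits])
# [3] ['world']
```

The `end > start` check skips zero-width entries, so special tokens never receive a label.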
Integration with transformers
Load with AutoTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.is_fast)
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))
Convert custom tokenizer to transformers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast
# Placeholder: in practice, train this tokenizer or load it with Tokenizer.from_file()
tokenizer = Tokenizer(BPE())
tokenizer.save("my-tokenizer.json")
# Wrap the saved file so it behaves like any other transformers tokenizer
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # requires PyTorch
)
Common patterns
Train from iterator (large datasets)
from datasets import load_dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
def batch_iterator(batch_size=1000):
    # Yield lists of raw text, one batch at a time
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]
# Assumes `tokenizer` and `trainer` were created as in the sections above
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)
)
Performance: Processes 1GB in ~10-20 minutes
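When the corpus is a stream rather than an indexable dataset (for example lines read from a file), the same batching idea can be written against any iterable. The generator below is a generic sketch; `batched` is a hypothetical helper name, and the generated lines stand in for real text.

```python
from itertools import islice

def batched(lines, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

sizes = [len(b) for b in batched((f"line {i}" for i in range(2500)), 1000)]
print(sizes)  # [1000, 1000, 500]
```

Because only one batch is held in memory at a time, this shape suits corpora that do not fit in RAM.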
Enable truncation and padding
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512
)
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))
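The effect of combining truncation with fixed-length padding can be shown on plain ID lists. This is a simplified sketch, assuming right-side truncation and padding and ignoring special tokens; `pad_and_truncate` is a hypothetical helper, not a library function.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Right-truncate then right-pad each id list; also build attention masks."""
    padded, masks = [], []
    for ids in batch_ids:
        kept = ids[:max_length]
        n_pad = max_length - len(kept)
        padded.append(kept + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(kept) + [0] * n_pad)
    return padded, masks

ids, mask = pad_and_truncate([[5, 6], [7, 8, 9, 10, 11]], max_length=4, pad_id=3)
print(ids)   # [[5, 6, 3, 3], [7, 8, 9, 10]]
print(mask)  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Every output row has exactly `max_length` entries, which is why the example above prints 512 regardless of input length.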
Multi-processing
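from multiprocessing import Pool
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
def encode_batch(texts):
    return tokenizer.encode_batch(texts)
# The __main__ guard is required on platforms that spawn workers (Windows, macOS)
if __name__ == "__main__":
    # `corpus` is assumed to be a list of strings loaded elsewhere
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    with Pool(8) as pool:
        results = pool.map(encode_batch, chunks)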
Speedup: 5-8× with 8 cores
Performance benchmarks
Training speed
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
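The throughput and speedup columns follow directly from the timings in the table. The short check below reproduces them, assuming 1 GB is treated as 1000 MB for round numbers.

```python
corpus_mb = 1000               # treating 1 GB as 1000 MB for round numbers
pure_python_seconds = 20 * 60  # ~20 minutes
hf_seconds = 15                # ~15 seconds

pure_python_mb_per_min = corpus_mb / (pure_python_seconds / 60)
hf_gb_per_min = corpus_mb / 1000 / (hf_seconds / 60)
speedup = pure_python_seconds / hf_seconds

print(pure_python_mb_per_min, hf_gb_per_min, speedup)  # 50.0 4.0 80.0
```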
Memory usage
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Supported models
Pre-trained tokenizers available via from_pretrained():
BERT family:
bert-base-uncased, bert-large-cased
distilbert-base-uncased
roberta-base, roberta-large
GPT family:
gpt2, gpt2-medium, gpt2-large
distilgpt2
T5 family:
t5-small, t5-base, t5-large
google/flan-t5-xxl
Other:
facebook/bart-base, facebook/mbart-large-cc25
albert-base-v2, albert-xlarge-v2
xlm-roberta-base, xlm-roberta-large
Browse all: https://huggingface.co/models?library=tokenizers