name	tokenizer-checker
description	Validate a HuggingFace tokenizer with OpenVINO Tokenizers and OpenVINO GenAI. Use when: checking if a tokenizer converts and works correctly, verifying tokenizer/detokenizer accuracy, testing normalization steps, checking GenAI Tokenizer compatibility.
argument-hint	model_id (e.g. zai-org/GLM-4.7)

OpenVINO Tokenizer Checker

Validates that a HuggingFace tokenizer converts to OpenVINO correctly, that its encoding, decoding, and normalization outputs match the HuggingFace originals, and that it works with the OpenVINO GenAI Tokenizer.

When to Use

Verify a HuggingFace tokenizer converts to OpenVINO and matches HF outputs
Check if a newly supported tokenizer works end-to-end with OpenVINO GenAI
Diagnose which test categories (English, multilingual, emoji, whitespace) fail
Test normalization steps individually to isolate mismatches

Inputs

The user must provide:

model_id: HuggingFace model identifier or local path (e.g. zai-org/GLM-4.7)

Optional flags the user may request (pass through to the CLI):

--trust-remote-code — required for some models with custom tokenizer code
--subfolder — tokenizer subfolder inside a HuggingFace repo or local model directory (used when tokenizer is in a subfolder)
--no-detokenizer — skip detokenizer conversion and testing
--use-sentencepiece-backend — use SentencePiece backend during conversion
--no-special-tokens — encode without special tokens
--no-skip-special-tokens — decode keeping special tokens
--skip-missing-outputs — ignore HF outputs absent in OV result (e.g. token_type_ids)
--use-fast-false — load the legacy (slow) tokenizer
--max-length — max length for conversion and HF truncation checks (default: None)
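
Because flags are passed through to the CLI verbatim, it can help to check them against the list above before running anything. The following is a minimal sketch, not part of the tool: the helper name `validate_flags` is hypothetical, and the split between boolean flags and flags that take a value (`--subfolder`, `--max-length`) is an assumption inferred from the descriptions above.

```python
# Flags listed in this document. Which ones take a value is an assumption.
BOOLEAN_FLAGS = {
    "--trust-remote-code",
    "--no-detokenizer",
    "--use-sentencepiece-backend",
    "--no-special-tokens",
    "--no-skip-special-tokens",
    "--skip-missing-outputs",
    "--use-fast-false",
}
VALUE_FLAGS = {"--subfolder", "--max-length"}  # assumed to be followed by one value


def validate_flags(flags):
    """Return the flags unchanged, or raise ValueError on an unknown or incomplete flag."""
    i = 0
    while i < len(flags):
        flag = flags[i]
        if flag in BOOLEAN_FLAGS:
            i += 1
        elif flag in VALUE_FLAGS:
            # A value flag must be followed by something that is not another flag.
            if i + 1 >= len(flags) or flags[i + 1].startswith("--"):
                raise ValueError(f"{flag} expects a value")
            i += 2
        else:
            raise ValueError(f"unknown flag: {flag}")
    return list(flags)
```

Rejecting unknown flags up front keeps the invocation limited to options this document describes.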

Prerequisites

Activate the Python virtual environment before running any commands.

Locate the virtual environment — check for common directories at the repository root: .venv/, venv/, env/. Use list_dir to find it. If none is found, ask the user for its location.
Activate based on the current platform:
- Linux/macOS: source <venv_path>/bin/activate
- Windows (cmd): <venv_path>\Scripts\activate.bat
- Windows (PowerShell): <venv_path>\Scripts\Activate.ps1
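
The lookup and activation rules above can be sketched in Python as follows. This is an illustration only: the function names `find_venv` and `activation_command` are hypothetical, and the sketch only builds the activation command string. It does not activate anything itself.

```python
import sys
from pathlib import Path

# Common virtual environment directory names, in the order listed above.
CANDIDATES = (".venv", "venv", "env")


def find_venv(repo_root):
    """Return the first candidate directory that exists under repo_root, or None."""
    for name in CANDIDATES:
        path = Path(repo_root) / name
        if path.is_dir():
            return path
    return None  # caller should ask the user for the location


def activation_command(venv_path, platform=sys.platform, powershell=False):
    """Build the activation command string for the given platform."""
    if platform.startswith("win"):
        script = "Activate.ps1" if powershell else "activate.bat"
        return f"{venv_path}\\Scripts\\{script}"
    # Linux and macOS share the same activation script.
    return f"source {venv_path}/bin/activate"
```

When `find_venv` returns `None`, the procedure falls back to asking the user, as described above.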

Procedure

Step 1: Run the tokenizer check

Run from the repository root:

openvino_tokenizers check <model_id> [flags]

This executes:

[1/5] Load HF tokenizer — downloads and loads the tokenizer via AutoTokenizer.from_pretrained
[2/5] Convert to OpenVINO — converts to OV tokenizer + detokenizer models
[3/5] Test against 31 strings — compares HF vs OV encode/decode on English, multilingual, emoji, and edge-case strings
[4/5] GenAI Tokenizer encode + decode — tests openvino_genai.Tokenizer encode/decode with and without special tokens (skipped if openvino_genai is not installed)
[5/5] GenAI padding + pair inputs — checks batch padding and pair-input behaviour. For tokenizers-backend tokenizers (PreTrainedTokenizerFast / TokenizersBackend), mismatches are reported as errors and affect the exit code. For other tokenizers, mismatches are reported as warnings only (skipped if openvino_genai is not installed)
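
If the command needs to be scripted, one possible wrapper is sketched below. It only shells out to the `openvino_tokenizers` CLI in the argument order shown above (`<model_id>` first, then flags). The helper names `build_command` and `run` are hypothetical and not part of the tool.

```python
import subprocess


def build_command(subcommand, model_id, flags=()):
    """Assemble the argv list; model_id is passed through exactly as provided."""
    if subcommand not in ("check", "check_normalization"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    return ["openvino_tokenizers", subcommand, model_id, *flags]


def run(subcommand, model_id, flags=()):
    """Run the CLI in the current directory and return (exit_code, combined output)."""
    proc = subprocess.run(
        build_command(subcommand, model_id, flags),
        capture_output=True,  # keep stdout/stderr for the report
        text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr
```

Passing an argv list rather than a shell string avoids quoting problems with model identifiers and local paths.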

[Optional] Step 2: Run the normalization check

Run this step only if the [3/5] Test against 31 strings step of the check command reports mismatches:

openvino_tokenizers check_normalization <model_id> [flags]

This executes:

[1/3] Load HF tokenizer — same as above
[2/3] Parse normalizer pipeline — extracts individual normalizer steps from tokenizer.json and prints the HF → OV mapping
[3/3] Test normalizer steps — tests each normalizer step independently, then tests the full stacked pipeline

Step 3: Interpret Results

Both commands print ✓ / ✗ per step and exit with code 0 (all passed) or 1 (any failure).

Pass criteria:

Exit code 0 for each command
All test strings matched in step [3/5] of check
All normalizer steps matched in step [3/3] of check_normalization
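
The exit-code part of these criteria reduces to a small rule, sketched here under the assumption that the exit codes of both commands have already been collected. The function name `overall_passed` is hypothetical.

```python
def overall_passed(check_exit_code, normalization_exit_code=None):
    """True only if every command that was actually run exited with code 0."""
    codes = [check_exit_code]
    if normalization_exit_code is not None:  # the normalization check is optional
        codes.append(normalization_exit_code)
    return all(code == 0 for code in codes)
```

Leaving `normalization_exit_code` as `None` covers the case where the optional normalization check was never run.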

Failure output includes:

The input string that failed
Expected (HF) vs actual (OV) values — token IDs, decoded text, or normalized text
Shape mismatches, value mismatches, or missing output keys

Step [5/5] results of check (padding + pair inputs):

Batch padding mismatches across different configurations (longest, max_length, left/right padding)
Pair-input encode mismatches
For tokenizers-backend tokenizers (PreTrainedTokenizerFast / TokenizersBackend): these are errors that affect the exit code
For other tokenizers (e.g. SentencePiece-only): these are warnings that do NOT affect the exit code but should be reported
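
The severity rule above can be written down as a lookup. This is a simplified sketch: matching on the exact class name is an assumption made for illustration, and the CLI's own detection logic may differ. The name `step5_severity` is hypothetical.

```python
# Class names the text above treats as tokenizers-backend tokenizers.
TOKENIZERS_BACKEND_CLASSES = {"PreTrainedTokenizerFast", "TokenizersBackend"}


def step5_severity(tokenizer_class_name):
    """Return 'error' if a [5/5] mismatch affects the exit code, else 'warning'."""
    if tokenizer_class_name in TOKENIZERS_BACKEND_CLASSES:
        return "error"
    return "warning"  # still worth reporting, but does not change the exit code
```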

Step 4: Report Results

Provide a structured report to the user:

If all steps pass:

State that the tokenizer is fully compatible
Note whether GenAI steps were tested or skipped (if openvino_genai is not installed)
Note any warnings from step [5/5] of check if present

If any step fails, build a failure report covering:

Which step failed — conversion, tokenizer comparison, detokenizer, GenAI encode/decode, or normalization
Which string categories failed — identify patterns:
- English strings only → basic tokenization issue
- Multilingual strings → Unicode/encoding issue
- Emoji strings → multi-byte / surrogate handling issue
- Empty/whitespace strings → edge-case handling issue
- All strings → fundamental conversion issue
Nature of the mismatch — token ID mismatch, shape mismatch, missing output key, decode mismatch, or normalization mismatch
Normalization isolation — if check_normalization identifies a specific normalizer step as the root cause, report which step type (e.g. NFC, Lowercase, Precompiled) and its parameters
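
The category patterns listed above can be turned into a small lookup when assembling the failure report. A minimal sketch follows; the category keys (`english`, `multilingual`, `emoji`, `whitespace`) and the function name `diagnose` are hypothetical labels chosen for illustration, not identifiers printed by the CLI.

```python
# Mapping from failed string category to the likely cause named above.
DIAGNOSES = {
    "english": "basic tokenization issue",
    "multilingual": "Unicode/encoding issue",
    "emoji": "multi-byte / surrogate handling issue",
    "whitespace": "edge-case handling issue",
}


def diagnose(failed_categories):
    """Return the likely causes for the given set of failed categories."""
    failed = set(failed_categories)
    if failed and failed >= set(DIAGNOSES):
        # Every category failed: point at the conversion itself.
        return ["fundamental conversion issue"]
    # Keep a stable order; categories not in the table are ignored.
    return [cause for name, cause in DIAGNOSES.items() if name in failed]
```

For example, `diagnose(["emoji"])` yields only the surrogate-handling hint, while passing all four categories collapses to the single conversion-level diagnosis.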

Security

NEVER install any packages. Assume the environment is pre-configured.
NEVER modify model_id — pass it exactly as provided by the user.
NEVER call internal Python functions directly — always use the openvino_tokenizers CLI commands.
