Name: Tokenizer Diagnostics
Author: openvinotoolkit

// Diagnose tokenizer conversion issues for OpenVINO Tokenizers. Use when: tokenizer-checker reports failures, need to pinpoint root cause location (Python conversion vs C++ operation), identify which pipeline stage diverges, determine whether to use tokenizer-fix-python or tokenizer-fix-cpp skill.

stars: 52

forks: 56

updated: 2026-05-08 12:39

package.json

"author": "openvinotoolkit"

"repository": "openvinotoolkit/openvino_tokenizers"

name	tokenizer-diagnostics
description	Diagnose tokenizer conversion issues for OpenVINO Tokenizers. Use when: tokenizer-checker reports failures, need to pinpoint root cause location (Python conversion vs C++ operation), identify which pipeline stage diverges, determine whether to use tokenizer-fix-python or tokenizer-fix-cpp skill.
argument-hint	model_id (e.g. zai-org/GLM-4.7)

OpenVINO Tokenizer Diagnostics

Pinpoints the root cause of tokenizer conversion failures by analyzing the pipeline stage-by-stage. Determines whether the issue is in the Python conversion layer or the C++ operation implementation, and identifies the exact pipeline stage that diverges.

When to Use

The tokenizer-checker skill reported status: FAIL
You need to understand where a tokenizer mismatch originates before fixing it
You want to see how the HF tokenizer.json pipeline maps to OV pipeline steps
You need to identify unsupported normalizer/pre-tokenizer/decoder types

Inputs

Required:

model_id: Hugging Face model identifier or local path (e.g. meta-llama/Meta-Llama-3-8B)

Optional (from tokenizer-checker result):

failure_types — helps focus the diagnosis (e.g. [token_id_mismatch], [conversion_error])
failing_categories — narrows which test strings to inspect
CLI flags: --trust-remote-code, --subfolder, --max-length, --use-fast-false

Prerequisites

Activate the Python virtual environment before running any commands.

Locate the virtual environment — check for common directories at the repository root: .venv/, venv/, env/. Use list_dir to find it. If none is found, ask the user for its location.
Activate based on the current platform:
- Linux/macOS: source <venv_path>/bin/activate
- Windows (cmd): <venv_path>\Scripts\activate.bat
- Windows (PowerShell): <venv_path>\Scripts\Activate.ps1
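
The lookup and activation rules above could be sketched as follows. This is a minimal illustration rather than part of the skill: the helper names `find_venv` and `activation_command` are hypothetical, and the candidate directory names are simply the ones listed above.

```python
import sys
from pathlib import Path
from typing import Optional

# Candidate directory names, taken from the list above.
VENV_CANDIDATES = (".venv", "venv", "env")


def find_venv(repo_root: Path) -> Optional[Path]:
    """Return the first candidate directory that exists under repo_root, else None."""
    for name in VENV_CANDIDATES:
        candidate = repo_root / name
        if candidate.is_dir():
            return candidate
    return None  # caller should ask the user for the location


def activation_command(venv: Path, platform: str = sys.platform, shell: str = "cmd") -> str:
    """Build the activation command for the platform (hypothetical helper)."""
    if platform.startswith("win"):
        script = "Activate.ps1" if shell == "powershell" else "activate.bat"
        return f"{venv}\\Scripts\\{script}"
    return f"source {venv}/bin/activate"
```

Returning `None` instead of raising keeps the "ask the user" fallback in the caller's hands.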

Procedure

Step 1: Run the diagnose CLI

Run from the repository root:

openvino_tokenizers diagnose <model_id> [flags]

This executes 5 steps:

[1/5] Load HF tokenizer — downloads and loads via AutoTokenizer.from_pretrained
[2/5] Map pipeline — extracts tokenizer.json sections (normalizer, pre_tokenizer, model, post_processor, decoder) and maps each HF step to its OV equivalent. Flags unsupported types with ⚠ UNSUPPORTED.
[3/5] Test normalization — tests each normalizer step individually, then tests the combined pipeline. Reuses check_normalization logic.
[4/5] Test pre-tokenization — compares HF backend_tokenizer.pre_tokenizer output with OV pre-tokenization behavior.
[5/5] Full pipeline comparison — runs full encode + decode comparison to identify the first point of divergence.

The command prints a Diagnosis Summary at the end with structured fields.
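
When the CLI is driven from a script, the invocation could look like the sketch below. The function names are hypothetical, and the sketch assumes nothing about the CLI's exit code, which is why it does not raise on a non-zero status. Only the command shape shown above and the flags listed under Inputs are used.

```python
import subprocess
from typing import List, Sequence


def build_diagnose_command(model_id: str, flags: Sequence[str] = ()) -> List[str]:
    """Assemble the argv for the diagnose CLI; model_id is passed through unchanged."""
    return ["openvino_tokenizers", "diagnose", model_id, *flags]


def run_diagnose(model_id: str, flags: Sequence[str] = ()) -> str:
    """Run the CLI in the current directory and return stdout followed by stderr."""
    result = subprocess.run(
        build_diagnose_command(model_id, flags),
        capture_output=True,
        text=True,
        check=False,  # a failing tokenizer is an expected outcome here
    )
    return result.stdout + result.stderr
```

Passing the arguments as a list avoids shell quoting issues when `model_id` is a local path containing spaces.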

Step 2: Run normalization check (if needed)

If stage [3/5] of the diagnose output shows normalization failures, run the dedicated normalization checker for more detail:

openvino_tokenizers check_normalization <model_id> [flags]

This gives per-step HF→OV mapping with detailed mismatch output for each normalizer step.

Step 3: Inspect the pipeline mapping

If stage [2/5] of the diagnose output flags unsupported types or the pipeline mapping reveals gaps, inspect the relevant code:

For unsupported types — check whether the type exists in the appropriate map in hf_parser.py:

TransformersTokenizerPipelineParser.normalizers_map — normalizer types
TransformersTokenizerPipelineParser.pre_tokenization_map — pre-tokenizer types
TransformersTokenizerPipelineParser.post_tokenization_map — post-processor types
TransformersTokenizerPipelineParser.decoding_map — decoder types
Tokenization model types are checked in the tokenization_model() method
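
To see which type names to look up in those maps, the step types can be read directly from tokenizer.json. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, `Sequence` wrappers are expanded only one level deep, and the `supported` set must be supplied by the caller (for example from the keys of the relevant map, assuming that map is keyed by HF type name).

```python
import json
from typing import Dict, Iterable, List

# Key that holds the child steps of a "Sequence" entry in each tokenizer.json section.
SEQUENCE_KEYS = {
    "normalizer": "normalizers",
    "pre_tokenizer": "pretokenizers",
    "post_processor": "processors",
    "decoder": "decoders",
}


def section_types(tokenizer_json: dict) -> Dict[str, List[str]]:
    """List the step types per section, expanding Sequence wrappers one level."""
    result: Dict[str, List[str]] = {}
    for section, child_key in SEQUENCE_KEYS.items():
        node = tokenizer_json.get(section)
        if node is None:
            result[section] = []  # section absent or null
        elif node.get("type") == "Sequence":
            result[section] = [child["type"] for child in node.get(child_key, [])]
        else:
            result[section] = [node["type"]]
    model = tokenizer_json.get("model") or {}
    result["model"] = [model["type"]] if "type" in model else []
    return result


def missing_types(types: Iterable[str], supported: set) -> List[str]:
    """Return the types that are not in the caller-supplied supported set."""
    return [t for t in types if t not in supported]


def load_section_types(path: str) -> Dict[str, List[str]]:
    """Read a tokenizer.json file from disk and summarize its step types."""
    with open(path, encoding="utf-8") as f:
        return section_types(json.load(f))
```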

For conversion errors — read the traceback from diagnose output. Common patterns:

OVTypeError: ... type '...' is not supported → missing map entry (Python fix)
KeyError in parse_* functions → unexpected tokenizer.json structure (Python fix)
Conversion succeeds but outputs differ → C++ operation bug or incorrect Python step parameters
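
The first two patterns can be matched mechanically. The sketch below is a rough illustration, not the diagnose tool's own logic: `classify_traceback` is a hypothetical helper, and the regular expressions assume the message shapes quoted above. A `None` result means neither pattern matched, which covers the third case where conversion succeeds but outputs differ.

```python
import re
from typing import Optional, Tuple

# Message shapes assumed from the patterns listed above.
OV_TYPE_ERROR = re.compile(r"OVTypeError:.*type '.*' is not supported")
KEY_ERROR_LINE = re.compile(r"^KeyError\b", re.MULTILINE)
PARSE_FRAME = re.compile(r"\bin parse_\w+")


def classify_traceback(text: str) -> Optional[Tuple[str, str]]:
    """Map a traceback to (fix location, reason), or None if no pattern matches."""
    if OV_TYPE_ERROR.search(text):
        return ("python", "missing map entry")
    if KEY_ERROR_LINE.search(text) and PARSE_FRAME.search(text):
        return ("python", "unexpected tokenizer.json structure")
    return None
```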

Step 4: Determine root cause location

Use the Diagnosis Summary printed by the diagnose command in Step 1:

| Summary Field | Interpretation |
| --- | --- |
| `root_cause_location: python` | Fix needed in `hf_parser.py` or `tokenizer_pipeline.py` |
| `root_cause_location: cpp` | Fix needed in C++ operation under `src/` |
| `root_cause_location: both` | Fix Python first, then C++ |
| `root_cause_location: none` | No issues found |
| `unsupported_types: [X, Y]` | Types X, Y need new handlers in `hf_parser.py` |
| `affected_stages: [normalization]` | Issue isolated to normalizer operations |
| `affected_stages: [encode]` | Token ID mismatch; could be pre-tokenizer, tokenizer model, or post-processor |
| `affected_stages: [decode]` | Detokenizer issue; check decoder pipeline |

Decision rules:

Unsupported types exist → root_cause_location: python. The type needs a new handler in the parser map and possibly a new pipeline step class.
Normalization fails, full pipeline also fails → root_cause_location: cpp. The Python mapping is correct but the C++ operation produces wrong results.
Normalization passes, full pipeline fails → root_cause_location: python. The issue is in pre-tokenization, tokenization model, post-processing, or decoding pipeline construction.
Only normalization fails → root_cause_location: cpp. An individual normalizer step behaves differently in C++ than in HF.
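
Read in order, the four rules reduce to a small function. This is a sketch of the rules as written, with a hypothetical function name and with failure counts as inputs. The rules do not say when `both` applies, so the sketch never returns it.

```python
from typing import Sequence


def root_cause_location(
    unsupported_types: Sequence[str],
    normalization_failures: int,
    full_pipeline_failures: int,
) -> str:
    """Apply the decision rules in order and return python, cpp, or none."""
    if unsupported_types:
        return "python"  # a parser map entry is missing
    if normalization_failures > 0:
        return "cpp"  # covers both normalization-failure rules
    if full_pipeline_failures > 0:
        return "python"  # normalization passed, later pipeline construction is off
    return "none"
```

For example, `root_cause_location([], 0, 3)` yields `python`, because normalization passed while the full pipeline failed.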

Step 5: Produce diagnosis report

After all analysis, produce a structured report:

## Diagnosis
- root_cause_location: python | cpp | both | none
- affected_stages: [<list of affected stages>]
- unsupported_types: [<list of unsupported HF types>]
- normalization_failures: <count>
- pre_tokenization_failures: <count>
- full_pipeline_failures: <count>
- description: <human-readable summary of the root cause>
- suggested_fix_skill: tokenizer-fix-python | tokenizer-fix-cpp | none
- details: |
    <copy the relevant diagnostic output, including pipeline mapping,
     failing test strings, and mismatch details>
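
The report could be filled in programmatically along these lines. The `Diagnosis` class and `render_report` function are hypothetical. Mapping `both` to tokenizer-fix-python is an assumption based on the "Fix Python first, then C++" guidance in Step 4.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: "both" starts with the Python fixer, per "Fix Python first, then C++".
FIX_SKILL = {
    "python": "tokenizer-fix-python",
    "cpp": "tokenizer-fix-cpp",
    "both": "tokenizer-fix-python",
    "none": "none",
}


@dataclass
class Diagnosis:
    root_cause_location: str
    affected_stages: List[str] = field(default_factory=list)
    unsupported_types: List[str] = field(default_factory=list)
    normalization_failures: int = 0
    pre_tokenization_failures: int = 0
    full_pipeline_failures: int = 0
    description: str = ""
    details: str = ""


def render_report(d: Diagnosis) -> str:
    """Render the report in the field order of the template above."""
    lines = [
        "## Diagnosis",
        f"- root_cause_location: {d.root_cause_location}",
        f"- affected_stages: [{', '.join(d.affected_stages)}]",
        f"- unsupported_types: [{', '.join(d.unsupported_types)}]",
        f"- normalization_failures: {d.normalization_failures}",
        f"- pre_tokenization_failures: {d.pre_tokenization_failures}",
        f"- full_pipeline_failures: {d.full_pipeline_failures}",
        f"- description: {d.description}",
        f"- suggested_fix_skill: {FIX_SKILL[d.root_cause_location]}",
        "- details: |",
    ]
    # Indent the copied diagnostic output so it stays inside the block scalar.
    lines.extend("    " + line for line in d.details.splitlines())
    return "\n".join(lines)
```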

Step 6: Generate minimal reproducer (when applicable)

If the issue is well-isolated, create a minimal Python script that demonstrates the mismatch. Use this template:

#!/usr/bin/env python3
"""Minimal reproducer for <model_id> tokenizer mismatch in <stage>."""
import numpy as np
from transformers import AutoTokenizer
from openvino import Core
from openvino_tokenizers import convert_tokenizer

# Load the HF tokenizer and convert it to OpenVINO tokenizer/detokenizer models.
# The detokenizer model is converted here but not exercised by this template.
hf_tok = AutoTokenizer.from_pretrained("<model_id>")
ov_tok_model, ov_detok_model = convert_tokenizer(hf_tok, with_detokenizer=True)
ov_tok = Core().compile_model(ov_tok_model)

# Tokenize the same input with both implementations
test_string = "<failing_input>"
hf_out = hf_tok([test_string], return_tensors="np", truncation=True)
ov_out = ov_tok([test_string])

# Compare the token IDs side by side
print(f"HF ids:  {hf_out['input_ids'].tolist()}")
print(f"OV ids:  {ov_out['input_ids'].tolist()}")
print(f"Match:   {np.array_equal(hf_out['input_ids'], ov_out['input_ids'])}")

Save the reproducer so it can be passed to the fixer skill or reviewed by a human.

Key Code References

CLI diagnose tool: python/openvino_tokenizers/cli_tools/diagnose_tokenizer.py
CLI normalization check: python/openvino_tokenizers/cli_tools/check_normalization.py
HF parser & maps: python/openvino_tokenizers/hf_parser.py → TransformersTokenizerPipelineParser
Pipeline step classes: python/openvino_tokenizers/tokenizer_pipeline.py
C++ operations: src/*.cpp / src/*.hpp

Security

NEVER install any packages. Assume the environment is pre-configured.
NEVER modify model_id — pass it exactly as provided by the user.
