一键导入
benchmark-translate
// Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate.
// Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate.
Run QA tests using agent-browser and post results to the qabot dashboard. Interactive mode helps craft fixtures. With a fixture provided (or for automated runs like clawdbot releases), executes tests and reports results. Use when user says "qa test", "run qabot", "/qabot", or when running automated QA.
Integrate a new blockchain as a second-class citizen in ShapeShift Web. HDWallet packages live in the monorepo under packages/hdwallet-*. Covers everything from HDWallet native/Ledger support to Web chain adapter, asset generation, and feature flags. Activates when user wants to add basic support for a new blockchain.
Integrate new DEX aggregators, swappers, or bridge protocols (like Bebop, Portals, Jupiter, 0x, 1inch, etc.) into ShapeShift Web. Activates when user wants to add, integrate, or implement support for a new swapper. Guides through research, implementation, and testing following established patterns. (project)
Create a new qabot E2E test fixture interactively. Guides the user through defining test steps, expected outcomes, and saves the YAML fixture file. Use when user says "create fixture", "new test", "qabot fixture", or "/qabot-fixture".
Translate new/changed English UI strings into all supported languages using a translate-review-refine pipeline. Invoke with /translate to detect untranslated strings and produce high-quality translations for de, es, fr, ja, pt, ru, tr, uk, zh.
Comprehensive React and Next.js performance optimization guide with 40+ rules for eliminating waterfalls, optimizing bundles, and improving rendering. Use when optimizing React apps, reviewing performance, or refactoring components.
| name | benchmark-translate |
| description | Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate. |
| allowed-tools | Read, Write, Edit, Grep, Glob, Bash(node *), Bash(git checkout*), Bash(git diff*), Bash(git status*), Bash(git rev-parse*), Task, Skill, AskUserQuestion |
Measures the quality of the /translate skill by comparing its output against existing human translations. Uses stratified key selection with a fixed/rotating split, LLM judges, and programmatic validation to produce a comprehensive quality report with regression tracking across all 9 supported locales.
All benchmark data lives in scripts/translations/benchmark/ (gitignored):
| File | Purpose |
|---|---|
testKeys.json | Selected test keys with categories and fixed flag |
coreKeys.json | Persistent core key set (stable across runs) |
ground-truth.json | Captured human translations before removal |
report.json | Latest benchmark report (becomes baseline on next run) |
baseline.json | Previous report (auto-copied by setup.js) |
node .claude/skills/benchmark-translate/scripts/select-keys.js [--count N] [--core N]
Selects N keys (default 150) stratified across 6 categories: glossary-term, financial-error, single-word, interpolation, defi-jargon, general. Validates all selected keys exist in en + all 9 locales.
Fixed/rotating split:
--core N (default 100): Number of fixed core keys for stable regression tracking--count N (default 150): Total keys (core + rotating)coreKeys.json exists: loads it, validates keys still exist in all locales, tops up if neededcoreKeys.json doesn't exist: selects core keys via stratified sampling and saves themtestKeys.json has "fixed": true (core) or "fixed": false (rotating)Outputs scripts/translations/benchmark/testKeys.json.
node .claude/skills/benchmark-translate/scripts/setup.js
report.json exists from a previous run, copies it to baseline.jsontestKeys.json, captures ground truth translations for all 9 localesground-truth.json/translate can regenerate themInvoke the /translate skill using the Skill tool. This regenerates the removed keys through the full translate-review-refine pipeline.
Launch 9 sub-agents in 3 waves of 3 (matching /translate's wave structure) using the Task tool. Each sub-agent receives the locale info, all key triplets, and glossary terms.
Wave 1: de, es, fr Wave 2: pt, ru, tr Wave 3: ja, uk, zh
For each locale, use this prompt:
You are an expert multilingual localization quality assessor for a cryptocurrency/DeFi application.
Rate translations from English into {LANGUAGE_NAME} on a 1-5 scale.
1 = Wrong/misleading meaning
2 = Significant issues (wrong register, missing nuance)
3 = Acceptable but could be more natural
4 = Good, natural, accurate
5 = Excellent, indistinguishable from professional native translation
Check: meaning preservation, naturalness, register ({REGISTER}), UI conciseness,
glossary compliance (these stay English: {NEVER_TRANSLATE_TERMS}),
placeholder integrity (%{...} preserved), DeFi terminology conventions.
Rate each translation INDEPENDENTLY. Community translations can contain errors.
Input: JSON array of {key, english, human, skill}
{ITEMS_JSON}
Output: Return ONLY a JSON array of objects with these exact fields:
{key, humanScore, skillScore, humanJustification, skillJustification, preferenceNote}
Scores must be integers 1-5. Justifications should be 1-2 sentences. preferenceNote should say which is better and why, or "tie" if equal.
Locale info for prompt substitution:
| Locale | Language | Register |
|---|---|---|
de | German | Formal (Sie) |
es | Spanish | Informal (tú) |
fr | French | Formal (vous) |
ja | Japanese | Polite (です/ます) |
pt | Portuguese | Informal (você) |
ru | Russian | Formal (вы) |
tr | Turkish | Formal (siz) |
uk | Ukrainian | Formal (ви) |
zh | Chinese (Simplified) | Neutral/formal |
Building the items array for each locale:
scripts/translations/benchmark/ground-truth.jsonsrc/assets/translations/{locale}/main.json{ key: dottedPath, english: groundTruth.english[key], human: groundTruth.groundTruth[locale][key], skill: getValueFromLocaleFile(key) }Getting never-translate terms: Read src/assets/translations/glossary.json, collect all keys where value is null (excluding _meta).
Each sub-agent must write its output to /tmp/{locale}-judge-scores.json. Parse the JSON array from the sub-agent's response and write it to that path.
node .claude/skills/benchmark-translate/scripts/compile-report.js
Loads judge scores from /tmp/{locale}-judge-scores.json, runs programmatic validation (including Cyrillic script check for ru/uk), computes summary stats, and writes scripts/translations/benchmark/report.json. If baseline.json exists, includes regression deltas. Report includes coreSummary and rotatingSummary alongside the overall summary.
node .claude/skills/benchmark-translate/scripts/restore.js
Restores locale files via git checkout --, verifies no diff remains.
Read the compile output (printed to stdout) and present to the user: