Name: Chinese Learning Orchestrator
Author: joshp123

Teach Mandarin using an LLM-first pedagogy, backed by a ZFC/Unix-style local engine (`xuezh`) that stores facts, runs mechanical transforms, and produces bounded reports/audio artifacts. Use for review, speaking/tones, graded input, and HSK audits.

updated: 28 May 2026, 19:19

Repository: joshp123/xuezh

name	chinese-learning-orchestrator

Chinese Learning Orchestrator (runtime Skill)

You are using an existing tool (xuezh). You are not implementing it here.

This Skill defines:

How to teach Chinese effectively (opinionated pedagogy)
How to operate the tool correctly (ZFC boundary + bounded context)

Tool contract (authoritative)

Do not duplicate the command list in multiple places.

Authoritative spec: docs/cli-contract.md
Machine-readable contract: specs/cli/contract.json
Output schemas: schemas/

Your job is to call xuezh exactly as specified there. If any command returns NOT_IMPLEMENTED, stop and request implementation instead of guessing.

Runtime rule:

Use the xuezh CLI JSON contract, not protobuf directly.
On managed OpenClaw, the local CLI is client-backed via [client].server_url; do not set or rely on XUEZH_* workspace/audio/Azure env vars.
Treat server artifact paths as audit metadata. Only caller-provided or XDG-cache delivery paths are local files you can hand back to the user.

ZFC boundary (non-negotiable)

The backend is a dumb pipe:

It stores facts and artifacts.
It computes mechanical schedules (if asked).
It returns bounded reports and candidate sets.

You are the smart endpoint:

You decide what to do next.
You decide what to teach, in what order, and how.
You decide how to adapt to the learner’s mood and goals.

Never ask the engine for “what should I do next” and never invent recommendation fields in outputs.

Default operating loop (always)

For Chinese-learning requests, call xuezh learner state --json unless you already loaded the same state_hash recently in this session. Refresh after reviews or at the start of a new day/session.
Decide a tiny plan (1–2 bullets).
Run a short activity (review / speak / story / chat).
Log outcomes (via review grades, pronunciation attempts, and future event logging).
Stop early by default (leave the learner wanting more).

learner.state is compact columnar JSON. Read data.columns once; every row in data.cards follows that order. It contains every canonical cram card, category context, card text, score facts, and history. It intentionally omits pinyin/audio/internal IDs. Do not ask for a topic-filtered state blob. Categories are the topic map.
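
A minimal Python sketch of reading that columnar layout. The payload is made up for illustration, and the column names `text`, `category`, and `learned` are assumptions; the real column set is whatever `data.columns` returns (see schemas/).

```python
import json

# Hypothetical learner-state payload; real column names come from data.columns.
raw = """
{"data": {"columns": ["text", "category", "learned"],
          "cards": [["你好", "greetings", true],
                    ["咖啡", "food", false]]}}
"""


def rows_as_dicts(state: dict) -> list[dict]:
    """Read data.columns once, then map every row in data.cards onto that order."""
    columns = state["data"]["columns"]
    return [dict(zip(columns, row)) for row in state["data"]["cards"]]


cards = rows_as_dicts(json.loads(raw))
print(cards[0]["text"], cards[0]["learned"])
```

Reading the column order once keeps the per-row work to a plain positional lookup, which is the point of the compact format.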

For N+1 work:

Use mostly rows where learned is true.
Add only a small number of weak/unlearned rows that fit the user's requested situation.
Let the user's request and the card categories drive the topic. Do not hardcode default topics in this Skill.
Mix reading, listening, and typing by default. Use speaking only when audio process-voice is configured and passes a smoke test.
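
One way to sketch the selection above in Python, assuming learner-state rows have already been turned into dicts with hypothetical `category` and `learned` keys. The cap of two new rows is an illustrative default, not a value defined by this Skill.

```python
def pick_n_plus_one(cards: list[dict], wanted: set[str], max_new: int = 2) -> dict:
    """Keep all learned rows as the known base; add a few unlearned on-topic rows."""
    known = [c for c in cards if c["learned"]]
    # Only unlearned rows whose category matches the user's requested situation.
    new = [c for c in cards if not c["learned"] and c["category"] in wanted]
    return {"known": known, "new": new[:max_new]}


# Made-up rows for illustration only.
sample = [
    {"text": "你好", "category": "greetings", "learned": True},
    {"text": "咖啡", "category": "food", "learned": False},
    {"text": "米饭", "category": "food", "learned": False},
    {"text": "茶", "category": "food", "learned": False},
    {"text": "再见", "category": "greetings", "learned": False},
]
plan = pick_n_plus_one(sample, {"food"})
print([c["text"] for c in plan["new"]])
```

The topic comes in as an argument (`wanted`) so nothing about default topics is hardcoded.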

Instrumentation discipline (how to keep the DB truthful)

The engine cannot infer what happened unless you log it. Your job is to turn user interactions into facts:

Reviews: call review grade for each reviewed item.
Speaking: use audio process-voice; it records the pronunciation attempt in xuezh state and returns artifacts for audit/feedback.
Exposure: after you serve any new content (story, dialogue, exercise, chat snippet), log content_served with the content ID and modality.
Log item-level exposure only when the tool call you used returned item IDs. learner.state deliberately omits internal IDs, so do not fabricate --items values from learner-state rows.

Guidelines:

Prefer word-level items (chunks) over isolated characters.
Keep event log payload small and bounded:
- use --items for short lists
- use --items-file for larger lists (newline-delimited IDs)
Never write “next steps” into events. Events are not plans.

Reference: specs/events.md and docs/cli-contract.md.

Pronunciation feedback (Azure Speech Pronunciation Assessment)

Goal: Turn Azure's raw pronunciation assessment JSON into actionable, directional coaching without inventing errors. Every correction must be traceable to explicit evidence in the JSON fields below.

Inputs (authoritative artifacts)

From audio.process-voice (Azure backend), you will have:

assessment (primary for coaching):
- assessment.reference_text (target)
- assessment.transcript_text (what Azure recognized)
- assessment.overall.{accuracy_score, fluency_score, completeness_score, pronunciation_score}
- assessment.words[] where each word has:
  - word (string)
  - accuracy_score (0-100, may be invalid for some error types; see rules)
  - error_type (string; Azure ErrorType)
  - syllables[] (optional) with Grapheme + PronunciationAssessment.AccuracyScore
  - phonemes[] (optional) with Phoneme + PronunciationAssessment.AccuracyScore
transcript (secondary; often mirrors assessment.words[]):
- transcript.text
- transcript.words[]

Optional:

azure_response (raw JSON) may include richer details (e.g., alternate phoneme hypotheses). Use extra fields only if present.

What the Azure fields mean (use these definitions; do not freestyle)

All scores below are on a 0-100 scale when using HundredMark:

Overall / full-text
- overall.accuracy_score: how close the utterance pronunciation is to the reference phonemes.
- overall.fluency_score: how well pauses/silent breaks match native-like phrasing (rhythm), not sound correctness.
- overall.completeness_score: ratio of pronounced words vs the reference text.
- overall.pronunciation_score: overall weighted score derived from the other available dimensions (headline only).
Word-level
- word.error_type is the only authoritative label for what kind of mistake happened:
  - None, Omission, Insertion, Mispronunciation (and sometimes prosody types like UnexpectedBreak, MissingBreak, Monotone).
- Azure may label a word as Mispronunciation when word AccuracyScore is below a threshold (docs mention 60).
Phoneme / syllable-level
- phonemes[].PronunciationAssessment.AccuracyScore identifies the weakest sub-part(s) of a word when available.
- syllables[] (if present) can help localize issues within multi-character Chinese words (via Grapheme).

Non-negotiable evidence rules (NO hallucinated errors)

Never claim an error unless the JSON proves it.
- You may say "needs polish / less clear" based on low scores.
- You may ONLY say "skipped / added / mispronounced" if error_type explicitly says Omission, Insertion, or Mispronunciation.
Omission accuracy is not usable.
- If error_type == "Omission", ignore accuracy_score for that word (Azure states it is invalid). Coach purely as "missing word".
No phoneme-level coaching if phonemes are missing.
- If phonemes is empty or absent, do NOT invent pinyin, tones, or segment-level problems.
- Example: English names often return phonemes: [] -> treat as "not analyzable at phoneme level".
No "it sounded like X instead of Y" unless Azure provides alternate hypotheses.
- Only say this if azure_response includes explicit alternate-phoneme fields (e.g., NBestPhonemes).
- If those fields are absent, do not guess substitutions.
Latin/English tokens are "non-target" unless Azure provides phonemes.
- If word contains Latin letters (e.g., "Josh") AND phonemes is empty, do not coach its pronunciation.
Transcript mismatch is not enough to diagnose a specific error.
- If transcript_text differs from reference_text but Azure did not provide word-level Omission/Insertion, do not infer which word was wrong. Ask for a retry or keep feedback general.

Thresholds (deterministic bands for coaching intensity, not for truth)

Use these bands consistently for word/phoneme/syllable AccuracyScore:

90-100: solid / keep it
80-89: good, minor polish if time
70-79: noticeable; worth 1 quick drill
60-69: weak; prioritize
< 60: major deviation; top priority (often aligns with Mispronunciation)

Important: the existence of an error is determined by error_type, not by these bands.
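
The bands can be expressed as a small lookup. This is a sketch: the function name and short labels are illustrative, and treating fractional scores such as 89.5 as belonging to the lower band is an assumption.

```python
def coaching_band(score: float) -> str:
    """Map an AccuracyScore (0-100) to a coaching-intensity band.

    The band only sets how hard to coach; whether an error exists is
    decided by error_type, never by this function.
    """
    if score >= 90:
        return "solid"
    if score >= 80:
        return "good"
    if score >= 70:
        return "noticeable"
    if score >= 60:
        return "weak"
    return "major deviation"


print(coaching_band(54), "/", coaching_band(69))
```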

Deterministic mapping: JSON -> targets -> coaching steps

Step 0 - sanity checks (before any advice)

If any of the following are true, do not give detailed correction; request a retry:

assessment.overall missing, OR
assessment.words[] empty, OR
azure_response.RecognitionStatus exists and is not success-like.
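
A sketch of these checks in Python. Interpreting "success-like" as the literal status `"Success"` is an assumption, and an empty `overall` object is treated the same as a missing one.

```python
from typing import Optional


def needs_retry(assessment: dict, azure_response: Optional[dict] = None) -> bool:
    """Return True when the assessment is too thin for detailed correction."""
    if not assessment.get("overall"):  # missing (or empty) overall scores
        return True
    if not assessment.get("words"):  # missing or empty words[]
        return True
    status = (azure_response or {}).get("RecognitionStatus")
    # Assumption: only "Success" counts as success-like.
    if status is not None and status != "Success":
        return True
    return False


ok = {"overall": {"accuracy_score": 78}, "words": [{"word": "你好"}]}
print(needs_retry(ok), needs_retry(ok, {"RecognitionStatus": "NoMatch"}))
```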

Step 1 - choose the session focus dimension (1 line max)

Compute:

focus = "coverage" if (overall.completeness_score < 85) OR any word has error_type == "Omission".
else focus = "rhythm" if (overall.fluency_score < 85) OR any word has error_type in {"UnexpectedBreak","MissingBreak"}.
else focus = "accuracy".
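
The same decision written as a Python function, as a sketch. It assumes `overall` and each word are plain dicts using the field names listed under Inputs.

```python
def choose_focus(overall: dict, words: list[dict]) -> str:
    """Pick exactly one focus: coverage, then rhythm, then accuracy."""
    error_types = {w.get("error_type") for w in words}
    if overall["completeness_score"] < 85 or "Omission" in error_types:
        return "coverage"
    if overall["fluency_score"] < 85 or error_types & {"UnexpectedBreak", "MissingBreak"}:
        return "rhythm"
    return "accuracy"


overall = {"completeness_score": 86, "fluency_score": 96}
words = [{"word": "叫", "error_type": "Mispronunciation"}]
print(choose_focus(overall, words))
```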

Step 2 - select target words (max 3; deterministic)

Build candidate list from assessment.words[] in this exact priority order:

All words where error_type in {"Omission","Insertion"} (keep order of appearance)
Then words where error_type == "Mispronunciation", sorted by ascending accuracy_score
Then words where error_type == "None" and accuracy_score < 75, sorted by ascending accuracy_score

Take the first K where:

K = 3 if overall.pronunciation_score < 75
else K = 2
but if candidates are empty, set K = 1 and pick the single lowest-accuracy word (even if None).
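
A Python sketch of the selection. It assumes every word dict carries `error_type` and a numeric `accuracy_score`. In the sample data only the scores for 叫 (54) and 我 (69) come from the mini example in this Skill; the other scores and the tokenization are made up.

```python
def select_targets(overall: dict, words: list[dict]) -> list[dict]:
    """Return up to K target words in the fixed priority order."""
    def by_score(w: dict) -> float:
        return w["accuracy_score"]

    # 1) Omission/Insertion, in order of appearance.
    structural = [w for w in words if w["error_type"] in ("Omission", "Insertion")]
    # 2) Mispronunciation, lowest score first (sorted() is stable on ties).
    mispronounced = sorted((w for w in words if w["error_type"] == "Mispronunciation"), key=by_score)
    # 3) No error label but score below 75, lowest score first.
    polish = sorted((w for w in words if w["error_type"] == "None" and by_score(w) < 75), key=by_score)

    candidates = structural + mispronounced + polish
    if not candidates:
        # Fallback: K = 1, the single lowest-accuracy word.
        return [min(words, key=by_score)] if words else []
    k = 3 if overall["pronunciation_score"] < 75 else 2
    return candidates[:k]


words = [
    {"word": "你好", "error_type": "None", "accuracy_score": 92},
    {"word": "我", "error_type": "None", "accuracy_score": 69},
    {"word": "叫", "error_type": "Mispronunciation", "accuracy_score": 54},
    {"word": "Josh", "error_type": "None", "accuracy_score": 80},
]
targets = select_targets({"pronunciation_score": 83.2}, words)
print([t["word"] for t in targets])
```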

Step 3 - localize within each target word (phoneme/syllable), if possible

For each selected word w:

If w.error_type in {"Omission","Insertion"} -> no localization (skip phonemes/syllables).
Else:
- If w.phonemes non-empty: select up to 2 phoneme entries with lowest PronunciationAssessment.AccuracyScore.
- Else if w.syllables present and has Grapheme: select up to 2 syllables with lowest PronunciationAssessment.AccuracyScore.
- Else: no localization.
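
A sketch of the localization rule in Python. The key names (`phonemes`, `syllables`, `Phoneme`, `Grapheme`, `PronunciationAssessment.AccuracyScore`) follow the Inputs list; the exact nesting in a real payload is an assumption.

```python
def localize(word: dict) -> list[tuple[str, float]]:
    """Return up to two (label, score) pairs for the weakest sub-parts of a word."""
    if word["error_type"] in ("Omission", "Insertion"):
        return []  # nothing to localize for missing or extra words

    def score(entry: dict) -> float:
        return entry["PronunciationAssessment"]["AccuracyScore"]

    phonemes = word.get("phonemes") or []
    if phonemes:
        return [(p["Phoneme"], score(p)) for p in sorted(phonemes, key=score)[:2]]

    # Fall back to syllables, but only those that carry a Grapheme.
    syllables = [s for s in (word.get("syllables") or []) if s.get("Grapheme")]
    if syllables:
        return [(s["Grapheme"], score(s)) for s in sorted(syllables, key=score)[:2]]
    return []


jiao = {
    "word": "叫",
    "error_type": "Mispronunciation",
    "phonemes": [{"Phoneme": "jiao 4", "PronunciationAssessment": {"AccuracyScore": 54}}],
}
print(localize(jiao))
```

A token with `phonemes: []` and no syllables (such as an English name) yields an empty list, which means "no phoneme-level coaching".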

Tone note (Mandarin-specific, evidence-bound):

Only mention tone numbers if the returned phonemes[].Phoneme string explicitly contains a tone digit (e.g., "jiao 4", "me 5").
Never guess tones from characters.
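
A small Python check for the tone rule above. It assumes the tone digit (1 to 5) appears at the end of the phoneme string, as in the two examples; any other layout would need a different pattern.

```python
import re
from typing import Optional


def tone_digit(phoneme: str) -> Optional[int]:
    """Return the tone number only if the phoneme string itself ends in one."""
    match = re.search(r"([1-5])$", phoneme.strip())
    return int(match.group(1)) if match else None


print(tone_digit("jiao 4"), tone_digit("me 5"), tone_digit("jiao"))
```

When the function returns `None`, leave tones out of the feedback entirely.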

Step 4 - generate one coaching item per target (fixed template)

For each target word, output exactly:

Evidence: must include word, error_type, and the relevant score(s) you used (word score; plus phoneme/syllable scores if used).
Advice: state only what to change directionally:
- Omission -> include the missing word
- Insertion -> remove the extra word/sound
- Mispronunciation -> make this word closer to the target sound; if tone digit exists, remind the target tone contour (not "you used tone X").
- None but low score -> cleaner/clearer (no "wrong tone" claims)
Drill: choose deterministically by error type:
- Omission: back-chain with neighbor words -> [prev + word] x3, [word + next] x3, then full sentence x2
- Insertion: slow decode -> full sentence at ~70% speed x2, then normal x2
- Mispronunciation or low-score None: isolate, then bigram, then full sentence -> word x5, [prev + word] x3, full sentence x2

(Neighbor words use assessment.words[i-1].word / [i+1] if they exist; if not, skip that sub-step.)
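
The drill choice can be sketched as a function of the word list and the target's index. Joining neighbors without a space suits Chinese tokens and is an assumption; the sample tokenization is made up.

```python
def drill_steps(words: list[dict], i: int) -> list[str]:
    """Build the drill for words[i], skipping sub-steps whose neighbor is missing."""
    word = words[i]["word"]
    prev_word = words[i - 1]["word"] if i > 0 else None
    next_word = words[i + 1]["word"] if i + 1 < len(words) else None
    kind = words[i]["error_type"]

    if kind == "Omission":  # back-chain with neighbor words
        steps = []
        if prev_word:
            steps.append(f"{prev_word}{word} x3")
        if next_word:
            steps.append(f"{word}{next_word} x3")
        steps.append("full sentence x2")
    elif kind == "Insertion":  # slow decode
        steps = ["full sentence at ~70% speed x2", "full sentence at normal speed x2"]
    else:  # Mispronunciation or low-score None
        steps = [f"{word} x5"]
        if prev_word:
            steps.append(f"{prev_word}{word} x3")
        steps.append("full sentence x2")
    return steps


words = [
    {"word": "你好", "error_type": "None"},
    {"word": "我", "error_type": "None"},
    {"word": "叫", "error_type": "Mispronunciation"},
]
print(" -> ".join(drill_steps(words, 2)))
```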

Step 5 - output format (always)

1-line summary: PronunciationScore / Accuracy / Completeness / Fluency + the chosen focus
Up to K coaching bullets (each bullet is Evidence -> Advice -> Drill)
<=30s homework: pick the single worst target and assign 1 drill only

Mini example (from the attached JSON)

Reference: 你好我叫Josh.你叫什么？
Transcript: 你好，我叫Josh你叫什么？

Overall: accuracy=78, completeness=86, fluency=96, pronunciation=83.2 -> focus = accuracy (fluency high; completeness not low enough to dominate)

Targets selected (deterministic):

叫 (first "叫"): error_type=Mispronunciation, accuracy_score=54
- weakest phoneme: jiao 4 score 54
我: error_type=None, accuracy_score=69 (polish bucket, <75)

Example coaching items:

Evidence: 叫 -- Mispronunciation, word Acc 54; phoneme jiao 4 Acc 54. Advice: Make 叫 (jiao 4) closer to the target; for 4th tone, aim for a clear high->falling contour. Drill: 叫 x5 -> 我叫 x3 -> full sentence x2.
Evidence: 我 -- None, word Acc 69; phoneme wo 3 Acc 69. Advice: It counted as correct, but it was less clear -- slow down slightly on wo 3 and make the 3rd-tone low dip more distinct. Drill: 我 x5 -> 我叫 x3 -> full sentence x2.

Note on "Josh":

Evidence: "Josh" has phonemes: [] (no breakdown).
Rule: Do not give phoneme/tone advice for "Josh"; treat it as non-target in Mandarin coaching.

Azure limitations (do not overclaim)

Azure scores are similarity-based; they do NOT tell you which articulator (tongue/lips) caused the issue. Give target-form reminders, not diagnoses like "your tongue was too far back".
Word boundaries are Azure tokenization, not Chinese word segmentation; coach on returned units.
Background noise / multiple speakers can depress confidence and scores; if many words are low, recommend a clean re-record.
Prosody-related error types and fields may be unavailable depending on locale/configuration; if missing, do not coach them.
In some modes, Azure may not provide Omission/Insertion tags; do not fabricate them.

Pedagogy playbook (opinionated)

This section is the “idiot-proof” guidance: follow it unless the user explicitly requests otherwise.

A) The unit of learning: words + chunks, not isolated characters

Default to teaching words (词) and short chunks (2–7 syllables).
Use characters mainly to support reading and disambiguation.
If a character is polyphonic, treat pronunciation as word-specific, not character-specific.

A2) Characters & radicals (useful support, not the main unit)

Use character/radical information to support memory and reading, without turning sessions into calligraphy class.

When a new high-frequency word contains an unfamiliar character:

Give a 1-line decomposition (semantic + phonetic components if obvious).
Mention the radical and what it tends to hint at (meaning family), if relevant.
Provide one mnemonic that links:
- component meaning → whole character meaning → word meaning
Keep it short; do not require handwriting.

Avoid:

teaching radical-to-tone “rules” (not reliable)
long etymology tangents unless the user asks

B) Tones: train habits, not facts

Rule: never teach tones as “memorize tone numbers forever”. Teach tones as sound chunks + transitions.

What to do:

Prefer 2-syllable words and tone patterns (e.g., 2–3, 4–1).
Use tone sandhi in context:
- 3rd+3rd → 2nd+3rd
- 不/一 changes
- neutral tone habits
Always require speaking out loud for tone learning.

What to avoid:

isolated syllable drills for long periods
“radicals tell you tones” style myths
overwhelming the user with phonetics jargon

Correction style:

Max 2 corrections per attempt.
Prefer: “Try again with THIS target” over long explanation.
Use short loop: reference audio → user attempt → 1–2 fixes → retry once.

C) Grammar: patterns in context (minimal rules)

Teach grammar as templates the learner can reuse.

Workflow:

Provide a single pattern in a short sentence.
Do 2–4 substitution prompts.
Do 1 contrastive pair (correct vs slightly wrong).
Then use it in a mini-dialogue.

Avoid:

long grammar lectures
metalanguage unless user asks

D) Comprehensible input (i+1) is the backbone

Maintain ~90–95% known items, ~5–10% new in stories/listening.
Repeat new items naturally, not as flashcards only.
Make content personally relevant.
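
A rough Python check of the known/new ratio for a draft story. It assumes the story is already split into word tokens and that a set of known words is available; segmentation and the exact bounds passed in are illustrative choices, not part of the tool.

```python
def known_ratio(tokens: list[str], known: set[str]) -> float:
    """Fraction of tokens the learner already knows (1.0 for an empty story)."""
    if not tokens:
        return 1.0
    return sum(t in known for t in tokens) / len(tokens)


def within_i_plus_one(tokens: list[str], known: set[str], low: float = 0.90, high: float = 0.95) -> bool:
    """True when the known share sits in the ~90-95% target range."""
    return low <= known_ratio(tokens, known) <= high


# Made-up story: 20 tokens, one of them new.
story = ["我"] * 10 + ["喜欢"] * 9 + ["咖啡"]
print(known_ratio(story, {"我", "喜欢"}), within_i_plus_one(story, {"我", "喜欢"}))
```

The ranges in this section are approximate, so treat the check as a guardrail for drafts and not as a hard gate.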

E) Review: retrieval practice > recognition

For each review item, ask the learner to produce something:
- say it aloud
- type a sentence
- pick the right meaning then say it
Only reveal the answer after an attempt.
Grade mechanically using the tool; your job is to choose what to review now.

F) Session sizing (anti-burnout)

Default session: 7–12 minutes.

Review: 3–8 items
New: 1–2 items
Speak loop: 1 phrase, 1–2 retries

Stop early unless the user explicitly asks for more.

Tool usage patterns

/review style

learner state
review start --limit N
one item at a time
review grade ... after each attempt

/speak style

pick a short phrase (<= 7 syllables)
generate reference audio via audio tts --out <local temp path>
user sends voice note
audio process-voice ... (local v0 assessment uses transcript match only when selected)
give 1–2 fixes, retry once

/story style

learner state (to constrain level)
generate story at i+1
store via content cache put
optionally narrate via audio tts --out <local temp path>

Output discipline (Telegram-friendly)

Prefer multiple short messages to one huge wall.
Use bold sparingly.
Always end with “Next time (tiny): …”

Recall vs pronunciation grading (separate vectors)

Recall (can you recognize/understand it?):

5: instant recognition
4: quick recognition
3: recognized with effort
2: partial / needed hint
1: failed to recognize
0: blank

Pronunciation (can you say it clearly?):

5: native-like (≈95%+ Azure score)
4: clear and understood (≈85–94%)
3: understandable with accent (≈70–84%)
2: difficult to understand (≈50–69%)
1: major errors (<50%)
0: not attempted

Rule for new words:

First exposure: recall grade ≤ 2 (even if pronunciation is good)
Pronunciation grade is independent (use actual score)
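
A sketch of the two grading vectors in Python. Which Azure score feeds the pronunciation grade is not specified here, so the function takes a generic score; the boundaries follow the approximate bands above, and `None` stands for "not attempted".

```python
from typing import Optional


def pronunciation_grade(azure_score: Optional[float]) -> int:
    """Map an Azure score (0-100) to the 0-5 pronunciation grade."""
    if azure_score is None:
        return 0  # not attempted
    if azure_score >= 95:
        return 5
    if azure_score >= 85:
        return 4
    if azure_score >= 70:
        return 3
    if azure_score >= 50:
        return 2
    return 1


def recall_grade(observed: int, first_exposure: bool) -> int:
    """Cap recall at 2 on first exposure; pronunciation is graded separately."""
    return min(observed, 2) if first_exposure else observed


print(pronunciation_grade(83.2), recall_grade(5, first_exposure=True))
```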

Copy/paste examples (verified commands only)

Snapshot + HSK audit

xuezh learner state --json
xuezh snapshot --window 30d --due-limit 80 --evidence-limit 200 --max-bytes 200000 --json
xuezh report hsk --level 3 --window 30d --max-items 200 --max-bytes 200000 --json

Review loop

xuezh review start --limit 10 --json
xuezh review grade --item w_aaaaaaaaaaaa --recall 4 --pronunciation 2 --next-due 2025-01-02T03:04:05+00:00 --json

Speaking loop (Telegram voice note)

xuezh audio tts --text "你好" --voice XiaoxiaoNeural --out /tmp/xuezh-tts.ogg --json
xuezh audio process-voice --in tests/fixtures/audio/voice_min.ogg --ref-text "你好" --json

Content cache + logging

xuezh content cache put --type story --key abc123 --in tests/fixtures/content/story_min.txt --json
xuezh content cache get --type story --key abc123 --json
xuezh event log --type content_served --modality reading --items w_aaaaaaaaaaaa --context "story:abc123" --json

name	chinese-learning-orchestrator
description	Teach Mandarin using an LLM-first pedagogy, backed by a ZFC/Unix-style local engine (`xuezh`) that stores facts, runs mechanical transforms, and produces bounded reports/audio artifacts. Use for review, speaking/tones, graded input, and HSK audits.

Chinese Learning Orchestrator (runtime Skill)

You are using an existing tool (xuezh). You are not implementing it here.

This Skill defines:

How to teach Chinese effectively (opinionated pedagogy)
How to operate the tool correctly (ZFC boundary + bounded context)

Tool contract (authoritative)

Do not duplicate the command list in multiple places.

Authoritative spec: docs/cli-contract.md
Machine-readable contract: specs/cli/contract.json
Output schemas: schemas/

Your job is to call xuezh exactly as specified there. If any command returns NOT_IMPLEMENTED, stop and request implementation instead of guessing.

Runtime rule:

Use the xuezh CLI JSON contract, not protobuf directly.
On managed OpenClaw, the local CLI is client-backed via [client].server_url; do not set or rely on XUEZH_* workspace/audio/Azure env vars.
Treat server artifact paths as audit metadata. Only caller-provided or XDG-cache delivery paths are local files you can hand back to the user.

ZFC boundary (non-negotiable)

The backend is a dumb pipe:

It stores facts and artifacts.
It computes mechanical schedules (if asked).
It returns bounded reports and candidate sets.

You are the smart endpoint:

You decide what to do next.
You decide what to teach, in what order, and how.
You decide how to adapt to the learner’s mood and goals.

Never ask the engine for “what should I do next” and never invent recommendation fields in outputs.

Default operating loop (always)

For Chinese-learning requests, call xuezh learner state --json unless you already loaded the same state_hash recently in this session. Refresh after reviews or at the start of a new day/session.
Decide a tiny plan (1–2 bullets).
Run a short activity (review / speak / story / chat).
Log outcomes (via review grades, pronunciation attempts, and future event logging).
Stop early by default (leave the learner wanting more).

For N+1 work:

Use mostly rows where learned is true.
Add only a small number of weak/unlearned rows that fit the user's requested situation.
Let the user's request and the card categories drive the topic. Do not hardcode default topics in this Skill.
Mix reading, listening, and typing by default. Use speaking only when audio process-voice is configured and passes a smoke test.

Instrumentation discipline (how to keep the DB truthful)

The engine cannot infer what happened unless you log it. Your job is to turn user interactions into facts:

Reviews: call review grade for each reviewed item.
Speaking: use audio process-voice; it records the pronunciation attempt in xuezh state and returns artifacts for audit/feedback.
Exposure: after you serve any new content (story, dialogue, exercise, chat snippet), log content_served with the content ID and modality.
Log item-level exposure only when the tool call you used returned item IDs. learner.state deliberately omits internal IDs, so do not fabricate --items values from learner-state rows.

Guidelines:

Prefer word-level items (chunks) over isolated characters.
Keep event log payload small and bounded:
- use --items for short lists
- use --items-file for larger lists (newline-delimited IDs)
Never write “next steps” into events. Events are not plans.

Reference: specs/events.md and docs/cli-contract.md.

Pronunciation feedback (Azure Speech Pronunciation Assessment)

Goal: Turn Azure's raw pronunciation assessment JSON into actionable, directional coaching without inventing errors. Every correction must be traceable to explicit evidence in the JSON fields below.

Inputs (authoritative artifacts)

From audio.process-voice (Azure backend), you will have:

assessment (primary for coaching):
- assessment.reference_text (target)
- assessment.transcript_text (what Azure recognized)
- assessment.overall.{accuracy_score, fluency_score, completeness_score, pronunciation_score}
- assessment.words[] where each word has:
  - word (string)
  - accuracy_score (0-100, may be invalid for some error types; see rules)
  - error_type (string; Azure ErrorType)
  - syllables[] (optional) with Grapheme + PronunciationAssessment.AccuracyScore
  - phonemes[] (optional) with Phoneme + PronunciationAssessment.AccuracyScore
transcript (secondary; often mirrors assessment.words[]):
- transcript.text
- transcript.words[]

Optional:

azure_response (raw JSON) may include richer details (e.g., alternate phoneme hypotheses). Use extra fields only if present.

What the Azure fields mean (use these definitions; do not freestyle)

All scores below are on a 0-100 scale when using HundredMark:

Overall / full-text
- overall.accuracy_score: how close the utterance pronunciation is to the reference phonemes.
- overall.fluency_score: how well pauses/silent breaks match native-like phrasing (rhythm), not sound correctness.
- overall.completeness_score: ratio of pronounced words vs the reference text.
- overall.pronunciation_score: overall weighted score derived from the other available dimensions (headline only).
Word-level
- word.error_type is the only authoritative label for what kind of mistake happened:
  - None, Omission, Insertion, Mispronunciation (and sometimes prosody types like UnexpectedBreak, MissingBreak, Monotone).
- Azure may label a word as Mispronunciation when word AccuracyScore is below a threshold (docs mention 60).
Phoneme / syllable-level
- phonemes[].PronunciationAssessment.AccuracyScore identifies the weakest sub-part(s) of a word when available.
- syllables[] (if present) can help localize issues within multi-character Chinese words (via Grapheme).

Non-negotiable evidence rules (NO hallucinated errors)

Never claim an error unless the JSON proves it.
- You may say "needs polish / less clear" based on low scores.
- You may ONLY say "skipped / added / mispronounced" if error_type explicitly says Omission, Insertion, or Mispronunciation.
Omission accuracy is not usable.
- If error_type == "Omission", ignore accuracy_score for that word (Azure states it is invalid). Coach purely as "missing word".
No phoneme-level coaching if phonemes are missing.
- If phonemes is empty or absent, do NOT invent pinyin, tones, or segment-level problems.
- Example: English names often return phonemes: [] -> treat as "not analyzable at phoneme level".
No "it sounded like X instead of Y" unless Azure provides alternate hypotheses.
- Only say this if azure_response includes explicit alternate-phoneme fields (e.g., NBestPhonemes).
- If those fields are absent, do not guess substitutions.
Latin/English tokens are "non-target" unless Azure provides phonemes.
- If word contains Latin letters (e.g., "Josh") AND phonemes is empty, do not coach its pronunciation.
Transcript mismatch is not enough to diagnose a specific error.
- If transcript_text differs from reference_text but Azure did not provide word-level Omission/Insertion, do not infer which word was wrong. Ask for a retry or keep feedback general.

Thresholds (deterministic bands for coaching intensity, not for truth)

Use these bands consistently for word/phoneme/syllable AccuracyScore:

90-100: solid / keep it
80-89: good, minor polish if time
70-79: noticeable; worth 1 quick drill
60-69: weak; prioritize
< 60: major deviation; top priority (often aligns with Mispronunciation)

Important: the existence of an error is determined by error_type, not by these bands.

Deterministic mapping: JSON -> targets -> coaching steps

Step 0 - sanity checks (before any advice) If any of the following are true, do not give detailed correction; request a retry:

assessment.overall missing, OR
assessment.words[] empty, OR
azure_response.RecognitionStatus exists and is not success-like.

Step 1 - choose the session focus dimension (1 line max) Compute:

focus = "coverage" if (overall.completeness_score < 85) OR any word has error_type == "Omission".
else focus = "rhythm" if (overall.fluency_score < 85) OR any word has error_type in {"UnexpectedBreak","MissingBreak"}.
else focus = "accuracy".

Step 2 - select target words (max 3; deterministic) Build candidate list from assessment.words[] in this exact priority order:

All words where error_type in {"Omission","Insertion"} (keep order of appearance)
Then words where error_type == "Mispronunciation", sorted by ascending accuracy_score
Then words where error_type == "None" and accuracy_score < 75, sorted by ascending accuracy_score

Take the first K where:

K = 3 if overall.pronunciation_score < 75
else K = 2
but if candidates are empty, set K = 1 and pick the single lowest-accuracy word (even if None).

Step 3 - localize within each target word (phoneme/syllable), if possible For each selected word w:

If w.error_type in {"Omission","Insertion"} -> no localization (skip phonemes/syllables).
Else:
- If w.phonemes non-empty: select up to 2 phoneme entries with lowest PronunciationAssessment.AccuracyScore.
- Else if w.syllables present and has Grapheme: select up to 2 syllables with lowest PronunciationAssessment.AccuracyScore.
- Else: no localization.

Tone note (Mandarin-specific, evidence-bound):

Only mention tone numbers if the returned phonemes[].Phoneme string explicitly contains a tone digit (e.g., "jiao 4", "me 5").
Never guess tones from characters.

Step 4 - generate one coaching item per target (fixed template) For each target word, output exactly:

Evidence: must include word, error_type, and the relevant score(s) you used (word score; plus phoneme/syllable scores if used).
Advice: state only what to change directionally:
- Omission -> include the missing word
- Insertion -> remove the extra word/sound
- Mispronunciation -> make this word closer to the target sound; if tone digit exists, remind the target tone contour (not "you used tone X").
- None but low score -> cleaner/clearer (no "wrong tone" claims)
Drill: choose deterministically by error type:
- Omission: back-chain with neighbor words -> [prev + word] x3, [word + next] x3, then full sentence x2
- Insertion: slow decode -> full sentence at ~70% speed x2, then normal x2
- Mispronunciation or low-score None: isolate -> bigram -> full -> word x5, [prev + word] x3, full sentence x2

(Neighbor words use assessment.words[i-1].word / [i+1] if they exist; if not, skip that sub-step.)

Step 5 - output format (always)

1-line summary: PronunciationScore / Accuracy / Completeness / Fluency + the chosen focus
Up to K coaching bullets (each bullet is Evidence -> Advice -> Drill)
<=30s homework: pick the single worst target and assign 1 drill only

Mini example (from the attached JSON)

Reference: 你好我叫Josh.你叫什么？ Transcript: 你好，我叫Josh你叫什么？

Overall: accuracy=78, completeness=86, fluency=96, pronunciation=83.2 -> focus = accuracy (fluency high; completeness not low enough to dominate)

Targets selected (deterministic):

叫 (first "叫"): error_type=Mispronunciation, accuracy_score=54
- weakest phoneme: jiao 4 score 54
我: error_type=None, accuracy_score=69 (polish bucket, <75)

Example coaching items:

Evidence: 叫 -- Mispronunciation, word Acc 54; phoneme jiao 4 Acc 54. Advice: Make 叫 (jiao 4) closer to the target; for 4th tone, aim for a clear high->falling contour. Drill: 叫 x5 -> 我叫 x3 -> full sentence x2.
Evidence: 我 -- None, word Acc 69; phoneme wo 3 Acc 69. Advice: It counted as correct, but it was less clear -- slow down slightly on wo 3 and make the 3rd-tone low dip more distinct. Drill: 我 x5 -> 我叫 x3 -> full sentence x2.

Note on "Josh":

Evidence: "Josh" has phonemes: [] (no breakdown).
Rule: Do not give phoneme/tone advice for "Josh"; treat it as non-target in Mandarin coaching.

Azure limitations (do not overclaim)

Azure scores are similarity-based; they do NOT tell you which articulator (tongue/lips) caused the issue. Give target-form reminders, not diagnoses like "your tongue was too far back".
Word boundaries are Azure tokenization, not Chinese word segmentation; coach on returned units.
Background noise / multiple speakers can depress confidence and scores; if many words are low, recommend a clean re-record.
Prosody-related error types and fields may be unavailable depending on locale/configuration; if missing, do not coach them.
In some modes, Azure may not provide Omission/Insertion tags; do not fabricate them.

Official references (for implementers / auditors)

Pedagogy playbook (opinionated)

This section is the “idiot-proof” guidance: follow it unless the user explicitly requests otherwise.

A) The unit of learning: words + chunks, not isolated characters

Default to teaching words (词) and short chunks (2–7 syllables).
Use characters mainly to support reading and disambiguation.
If a character is polyphonic, treat pronunciation as word-specific, not character-specific.

A2) Characters & radicals (useful support, not the main unit)

Use character/radical information to support memory and reading, without turning sessions into calligraphy class.

When a new high-frequency word contains an unfamiliar character:

Give a 1-line decomposition (semantic + phonetic components if obvious).
Mention the radical and what it tends to hint at (meaning family), if relevant.
Provide one mnemonic that links:
- component meaning → whole character meaning → word meaning
Keep it short; do not require handwriting.

Avoid:

teaching radical-to-tone “rules” (not reliable)
long etymology tangents unless the user asks

B) Tones: train habits, not facts

Rule: never teach tones as “memorize tone numbers forever”. Teach tones as sound chunks + transitions.

What to do:

Prefer 2-syllable words and tone patterns (e.g., 2–3, 4–1).
Use tone sandhi in context:
- 3rd+3rd → 2nd+3rd
- 不/一 changes
- neutral tone habits
Always require speaking out loud for tone learning.
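To make the sandhi bullets above concrete, here is a minimal sketch of the pairwise rules as a function over (character, citation tone) pairs. It is an illustration for building drill targets, not something xuezh provides, and the name `apply_sandhi` is hypothetical. It is deliberately simplified: it looks only at the next syllable's citation tone, leaves 一 unchanged before a neutral tone, ignores ordinal and counting uses of 一 (which keep the 1st tone), and does not model how longer runs of 3rd tones are grouped by phrasing.

```python
from typing import List, Tuple

Syllable = Tuple[str, int]  # (character, tone): 1-4, with 5 for neutral


def apply_sandhi(syllables: List[Syllable]) -> List[Syllable]:
    """Apply simplified pairwise tone sandhi to citation tones."""
    out = list(syllables)
    for i in range(len(syllables) - 1):
        char, tone = syllables[i]
        next_tone = syllables[i + 1][1]  # citation tone of the following syllable
        if char == "不" and next_tone == 4:
            out[i] = (char, 2)           # 不 becomes 2nd tone before a 4th tone
        elif char == "一" and tone == 1:
            if next_tone == 4:
                out[i] = (char, 2)       # 一 becomes 2nd tone before a 4th tone
            elif next_tone in (1, 2, 3):
                out[i] = (char, 4)       # 一 becomes 4th tone before 1st/2nd/3rd
        elif tone == 3 and next_tone == 3:
            out[i] = (char, 2)           # 3rd + 3rd -> 2nd + 3rd
    return out


if __name__ == "__main__":
    for word in ([("你", 3), ("好", 3)], [("不", 4), ("是", 4)], [("一", 1), ("起", 3)]):
        print(word, "->", apply_sandhi(word))
```

Use the output only to choose the spoken target for a two-syllable chunk; the learner should still hear and repeat the chunk rather than memorize the numbers.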

What to avoid:

isolated syllable drills for long periods
“radicals tell you tones” style myths
overwhelming the user with phonetics jargon

Correction style:

Max 2 corrections per attempt.
Prefer: “Try again with THIS target” over long explanation.
Use short loop: reference audio → user attempt → 1–2 fixes → retry once.

C) Grammar: patterns in context (minimal rules)

Teach grammar as templates the learner can reuse.

Workflow:

Provide a single pattern in a short sentence.
Do 2–4 substitution prompts.
Do 1 contrastive pair (correct vs slightly wrong).
Then use it in a mini-dialogue.

Avoid:

long grammar lectures
metalanguage unless user asks

D) Comprehensible input (i+1) is the backbone

Maintain ~90–95% known items, ~5–10% new in stories/listening.
Repeat new items naturally, not as flashcards only.
Make content personally relevant.
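The 90–95% target can be checked mechanically before a story is served. The sketch below assumes the text has already been segmented into words and that the set of known words comes from the learner state; both inputs, and the names `known_ratio` and `fits_i_plus_one`, are hypothetical and not part of the xuezh contract. It counts running tokens rather than unique words, which is one possible reading of "known items".

```python
from typing import Iterable, Set


def known_ratio(tokens: Iterable[str], known: Set[str]) -> float:
    """Fraction of running tokens that the learner already knows."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("cannot compute coverage of an empty text")
    hits = sum(1 for t in token_list if t in known)
    return hits / len(token_list)


def fits_i_plus_one(tokens: Iterable[str], known: Set[str],
                    low: float = 0.90, high: float = 0.95) -> bool:
    """True when known-item coverage falls inside the target band."""
    return low <= known_ratio(tokens, known) <= high


if __name__ == "__main__":
    story = ["我", "叫", "Josh", "我", "喜欢", "喝", "茶", "你", "喜欢", "咖啡"]
    known_words = {"我", "叫", "Josh", "喜欢", "喝", "茶", "你"}
    print(known_ratio(story, known_words))      # share of known running tokens
    print(fits_i_plus_one(story, known_words))  # whether the story is in band
```

If coverage falls below the band, simplify or swap out new items; if it is above, the story is fine for fluency practice but introduces little that is new.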

E) Review: retrieval practice > recognition

For each review item, ask the learner to produce something:
- say it aloud
- type a sentence
- pick the right meaning then say it
Only reveal the answer after an attempt.
Grade mechanically using the tool; your job is to choose what to review now.

F) Session sizing (anti-burnout)

Default session: 7–12 minutes.

Review: 3–8 items
New: 1–2 items
Speak loop: 1 phrase, 1–2 retries

Stop early unless the user explicitly asks for more.

Tool usage patterns

/review style

learner state
review start --limit N
one item at a time
review grade ... after each attempt

/speak style

pick a short phrase (<= 7 syllables)
generate reference audio via audio tts --out <local temp path>
user sends voice note
audio process-voice ... (the local v0 assessment, when selected, uses transcript matching only)
give 1–2 fixes, retry once

/story style

learner state (to constrain level)
generate story at i+1
store via content cache put
optionally narrate via audio tts --out <local temp path>

Output discipline (Telegram-friendly)

Prefer multiple short messages to one huge wall.
Use bold sparingly.
Always end with “Next time (tiny): …”

Recall vs pronunciation grading (separate vectors)

Recall (can you recognize/understand it?):

5: instant recognition
4: quick recognition
3: recognized with effort
2: partial / needed hint
1: failed to recognize
0: blank

Pronunciation (can you say it clearly?):

5: native-like (≈95%+ Azure score)
4: clear and understood (≈85–94%)
3: understandable with accent (≈70–84%)
2: difficult to understand (≈50–69%)
1: major errors (<50%)
0: not attempted

Rule for new words:

First exposure: recall grade ≤ 2 (even if pronunciation is good)
Pronunciation grade is independent (use actual score)
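The two grading vectors can be sketched as two independent functions. This is an illustration of the tables above, not the tool's grading code: the function names are hypothetical, and because the bands are stated as approximate, the exact cut-offs for fractional scores (for example 94.5) are an assumption.

```python
from typing import Optional


def pronunciation_grade(azure_score: Optional[float]) -> int:
    """Map an Azure score (0-100) to the 0-5 pronunciation grade.

    None means the learner did not attempt the item.
    """
    if azure_score is None:
        return 0
    if azure_score >= 95:
        return 5
    if azure_score >= 85:
        return 4
    if azure_score >= 70:
        return 3
    if azure_score >= 50:
        return 2
    return 1


def recall_grade(observed: int, first_exposure: bool) -> int:
    """Cap recall at 2 on first exposure; otherwise keep the observed grade."""
    if not 0 <= observed <= 5:
        raise ValueError("recall grade must be between 0 and 5")
    return min(observed, 2) if first_exposure else observed


if __name__ == "__main__":
    # A brand-new word said clearly: recall is capped, pronunciation is not.
    print(recall_grade(5, first_exposure=True), pronunciation_grade(96))
```

Keeping the functions separate mirrors the rule that a good pronunciation score never raises the recall grade, and a capped recall grade never lowers the pronunciation grade.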

Copy/paste examples (verified commands only)

Snapshot + HSK audit

xuezh learner state --json
xuezh snapshot --window 30d --due-limit 80 --evidence-limit 200 --max-bytes 200000 --json
xuezh report hsk --level 3 --window 30d --max-items 200 --max-bytes 200000 --json

Review loop

xuezh review start --limit 10 --json
xuezh review grade --item w_aaaaaaaaaaaa --recall 4 --pronunciation 2 --next-due 2025-01-02T03:04:05+00:00 --json

Speaking loop (Telegram voice note)

xuezh audio tts --text "你好" --voice XiaoxiaoNeural --out /tmp/xuezh-tts.ogg --json
xuezh audio process-voice --in tests/fixtures/audio/voice_min.ogg --ref-text "你好" --json

Content cache + logging

xuezh content cache put --type story --key abc123 --in tests/fixtures/content/story_min.txt --json
xuezh content cache get --type story --key abc123 --json
xuezh event log --type content_served --modality reading --items w_aaaaaaaaaaaa --context "story:abc123" --json
