serve-cache-qa
// Diagnose and validate @mlxts/serve prompt-prefix cache behavior across Pi-style agent sessions, Gemma/Qwen dense and MoE checkpoints, exact-boundary caches, and multi-model serving.
| name | serve-cache-qa |
| description | Diagnose and validate @mlxts/serve prompt-prefix cache behavior across Pi-style agent sessions, Gemma/Qwen dense and MoE checkpoints, exact-boundary caches, and multi-model serving. |
Use this skill when changing or debugging @mlxts/serve prompt-prefix caching,
continuous scheduling, Pi/OpenAI-compatible agent loops, or Qwen/Gemma serving
regressions.
Prompt-prefix cache hits are completed retained snapshots. Cold concurrent requests cannot reuse an in-flight snapshot. Exact-boundary caches such as Qwen hybrid and Gemma layer-pattern caches can reuse only retained exact prompt boundaries unless the family snapshot explicitly supports shorter forks.
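The matching rule above can be sketched as follows. This is a minimal illustration, not the @mlxts/serve implementation: the `Snapshot` and `CacheHit` types, the `forkable` flag, and the `lookup` function are hypothetical names introduced here. It reads "exact boundary" as "the whole retained snapshot must be a prefix of the new prompt".

```typescript
// Hypothetical types for illustration; not the @mlxts/serve API.
type SnapshotState = "in-flight" | "retained";

interface Snapshot {
  tokens: number[];     // prompt tokens covered by this snapshot
  state: SnapshotState; // only "retained" snapshots are reusable
  forkable: boolean;    // true if the cache family supports shorter forks
}

interface CacheHit {
  snapshot: Snapshot;
  readTokens: number;   // prompt tokens served from the cache
}

// Length of the shared leading run of two token sequences.
function commonPrefixLength(a: number[], b: number[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Returns the hit that reuses the most prompt tokens, or null on a miss.
function lookup(prompt: number[], snapshots: Snapshot[]): CacheHit | null {
  let best: CacheHit | null = null;
  for (const snapshot of snapshots) {
    // An in-flight snapshot is never a hit, even if it covers the prompt.
    if (snapshot.state !== "retained") continue;
    const shared = commonPrefixLength(prompt, snapshot.tokens);
    // Exact-boundary: reusable only if the entire snapshot is a prompt prefix.
    // Forkable: any shared prefix can be reused.
    const usable = snapshot.forkable
      ? shared
      : shared === snapshot.tokens.length
        ? shared
        : 0;
    if (usable > 0 && (best === null || usable > best.readTokens)) {
      best = { snapshot, readTokens: usable };
    }
  }
  return best;
}
```

Under these assumptions, two cold concurrent requests with the same prefix both miss, because neither has produced a retained snapshot when the other is matched.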
Serve owns matching, retention, eviction, event accounting, and protocol usage.
Transformers owns cache state shape, layerKinds, snapshot/fork validity, and
disposal.
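
One way to picture this ownership split is the sketch below. The `FamilySnapshot` interface and `RetainedSnapshots` class are hypothetical, and the `CacheLayerKind` union lists only the values named in this skill; the real type may differ. The serve side retains, evicts, and counts, while fork validity and disposal are delegated to the family-owned object.

```typescript
// Values taken from the layer-kind patterns named in this skill;
// the real union may contain more members.
type CacheLayerKind = "full" | "sliding" | "linear-recurrent";

// Family-owned side (hypothetical interface): state shape, fork validity, disposal.
interface FamilySnapshot {
  readonly layerKinds: readonly CacheLayerKind[];
  readonly tokenCount: number;
  canForkAt(boundary: number): boolean;
  dispose(): void;
}

// Serve-owned side (hypothetical class): retention, eviction, and accounting only.
class RetainedSnapshots {
  private readonly entries: FamilySnapshot[] = [];
  private readonly capacity: number;
  evictions = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Retain a completed snapshot, evicting the oldest entries beyond capacity.
  retain(snapshot: FamilySnapshot): void {
    this.entries.push(snapshot);
    while (this.entries.length > this.capacity) {
      const oldest = this.entries.shift();
      if (oldest !== undefined) {
        oldest.dispose(); // how state is freed stays with the family
        this.evictions++;
      }
    }
  }

  // Serve asks for a boundary; the family decides whether it is valid.
  reusableAt(boundary: number): FamilySnapshot | undefined {
    return this.entries.find((snapshot) => snapshot.canForkAt(boundary));
  }
}
```

The eviction policy here is oldest-first purely for brevity; the skill does not state which policy serve actually uses.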
- `packages/serve/AGENTS.md`, `docs/runtime-safety.md`, and the serving entries in `MEMORY.md`.
- `[route] ... route=continuous|single ... model_type=...`
- `[cache] ... miss|hit|write ... read_tokens=... write_tokens=...`
- `generation_scheduler_phase` queued, prefill, admitted, first-token, and finished phases.
- `regression:agent-cache` JSON `requests[]` entries with client duration/TTFT/stream/cache usage and server route/cache/prefill summaries.

`CacheLayerKind`, not family names:

For a cache remediation, add the narrowest unit test that reproduces the product failure without real checkpoints, then run the focused package tests.
Minimum unit coverage:
- `["full", "sliding"]` `CacheLayerKind`
- `["linear-recurrent", "full"]` `CacheLayerKind`

Real checkpoint proof, when cached models and runtime budget are available:
Use `bun run regression:agent-cache -- --scenarios qwen-dense,gemma-dense,multi-dense`
for the automated dense proof. Add `--include-moe` when the Qwen/Gemma MoE
checkpoints and memory budget are available.
For reports involving two active Gemma MoE agents, escalate to
`bun run regression:agent-cache -- --scenarios gemma-moe --max-concurrent-requests 2 --report-json .tmp/agent-cache-regression/gemma-moe-concurrent.json`.
The cold A/B requests may both miss. The warm A/B replay and exact A replay must
hit retained prompt boundaries.
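
The expectation above can be checked mechanically against the report. The sketch below assumes a simplified entry shape: the `label` and `cacheReadTokens` fields and the `cold-`/`warm-`/`exact-` label prefixes are hypothetical stand-ins, since the actual `requests[]` schema is not shown here. Adapt the field names to the real report before use.

```typescript
// Hypothetical, simplified view of one requests[] entry in the report JSON.
interface RequestEntry {
  label: string;           // assumed labels such as "cold-a", "warm-b", "exact-a"
  cacheReadTokens: number; // assumed name for prompt tokens read from the cache
}

// Returns one message per request that was required to hit but did not.
// Cold requests are never flagged: both cold A and cold B are allowed to miss.
function findCacheViolations(requests: RequestEntry[]): string[] {
  const violations: string[] = [];
  for (const request of requests) {
    const mustHit =
      request.label.startsWith("warm-") || request.label.startsWith("exact-");
    if (mustHit && request.cacheReadTokens === 0) {
      violations.push(`${request.label}: expected a cache hit, read 0 tokens`);
    }
  }
  return violations;
}
```

An empty result means the run matches the stated contract; any entry in the list points at a warm or exact replay that fell back to a full prefill.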
Use cmux for Pi/server smokes so server logs and two client terminals remain
visible. Keep the in-process regression as the canonical server gate, and use
the cmux smoke to explain client-shape differences such as tool schemas,
thinking flags, or changed workspace prompt text. Heavy MLX commands remain
exclusive.
Runtime-sensitive cache changes need a docs/reviews/ artifact with:
- `## Files Reviewed` naming every changed runtime-sensitive file

Do not claim shared AGENTS-prefix reuse for exact-boundary caches unless there is an exact retained snapshot at that boundary or a family-owned cache backend that supports that fork.