Run any Skill in Manus with one click

mlx-serving

Stars2

Forks2

UpdatedMay 9, 2026 at 22:13

This skill should be used when the user asks about "MLX serving", "mlx_lm.server", "oMLX", "Apple Silicon LLM serving", or "local LLM on Mac" — and when troubleshooting symptoms like model fails to load, OOM during load or inference, server hangs or crashes at batch>1, tool calls returning as plaintext content, throughput regression, or choosing between mlx-lm and oMLX. Also applies to oMLX feature-flag tuning ("turboquant_kv", "dflash", "MTP", "specprefill", "thinking_budget", "max-concurrent-requests", "force_sampling"), OptiQ proxy for models exceeding RAM, Llama-4 ChunkedKVCache batch handling, Llama-3 tool-call JSON format ("name"/"parameters"), and bench-driven validation of serving configs. For Apple Silicon (M-series) only — not for cloud LLM hosting (Bedrock, OpenAI API, Anthropic API), not for non-MLX backends (llama.cpp, Ollama, vLLM), not for model training.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

AeyeOps

AeyeOps/aeo-skill-marketplace

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software DevelopersComputer and Mathematical Occupations·SOC 15-1252

File Explorer

5 files

SKILL.md

readonly

name	mlx-serving
version	1.0.0
description	This skill should be used when the user asks about "MLX serving", "mlx_lm.server", "oMLX", "Apple Silicon LLM serving", or "local LLM on Mac" — and when troubleshooting symptoms like model fails to load, OOM during load or inference, server hangs or crashes at batch>1, tool calls returning as plaintext content, throughput regression, or choosing between mlx-lm and oMLX. Also applies to oMLX feature-flag tuning ("turboquant_kv", "dflash", "MTP", "specprefill", "thinking_budget", "max-concurrent-requests", "force_sampling"), OptiQ proxy for models exceeding RAM, Llama-4 ChunkedKVCache batch handling, Llama-3 tool-call JSON format ("name"/"parameters"), and bench-driven validation of serving configs. For Apple Silicon (M-series) only — not for cloud LLM hosting (Bedrock, OpenAI API, Anthropic API), not for non-MLX backends (llama.cpp, Ollama, vLLM), not for model training.

Apple Silicon LLM Serving (MLX / oMLX)

Operational guidance for running and tuning local LLM inference on Apple Silicon (M-series). Covers two common backends: Apple's mlx-lm and the oMLX server built on top of MLX.

Both expose OpenAI-compatible HTTP APIs, so client code is portable. The differences are in features, packaging, and the failure modes you hit when pushing models near the limits of unified memory.

Pre-Flight: Verify Environment

Before debugging any serving issue, confirm the basics:

# Hardware: confirm Apple Silicon and unified-memory size
system_profiler SPHardwareDataType | grep -E "Model|Chip|Memory"

# Activate the venv where mlx-lm is installed before the next two probes,
# otherwise they hit system Python and may report nothing useful.
python -c "import mlx.core as mx; print(mx.__version__, mx.default_device())"
python -c "import mlx_lm; print(mlx_lm.__version__)"

# oMLX (DMG install): inspect available subcommands and run health checks
omlx --help
omlx diagnose menubar   # checks Tahoe ControlCenter visibility (oMLX-specific)

# Confirm a process is actually listening
lsof -nP -iTCP -sTCP:LISTEN | grep -E "omlx|mlx_lm"

Verify which backend is bound before debugging: both can serve OpenAI- compatible endpoints on user-chosen ports, and a client request hitting the wrong backend silently produces correct-looking but unrelated output.

Backend Decision: mlx-lm vs oMLX

Concern	mlx-lm	oMLX
Install	`pip install mlx-lm`	DMG (macOS app bundle)
Process model	One model per `mlx_lm.server` process	Built-in registry; multi-model
Config surface	Minimal CLI flags	Rich per-model JSON (`model_settings.json`)
Feature flags	`--temp`, `--top-p`, basic	turboquant_kv, dflash, MTP, specprefill, thinking budget
Tool calling	Format depends on chat template	Per-model parser (configurable)
Models > RAM	Not supported (OOM on load)	OptiQ proxy build (sensitivity-driven quant)
GUI	None	Native macOS app + menu bar
Default port	8080	8000
Best for	Library-first scripts, embedding in Python	Long-running daemon, multiple models, GUI ops

Skill bias: the depth here leans oMLX — feature flags, cache controls, upstream bug patterns. mlx-lm coverage is lighter and focuses on the shared substrate.

Pick mlx-lm when: you control the Python process, you want minimal surface area, you serve one model at a time, and the model fits comfortably in RAM.

Pick oMLX when: you need a daemon that survives terminal sessions, you serve multiple models behind one endpoint, you want feature flags (turboquant, dflash, MTP) without writing them yourself, or your model exceeds RAM and you need OptiQ proxy.

Both backends share the same fundamental constraints (unified memory, KV cache pressure, model architecture quirks), so most troubleshooting in this skill applies to either.

Symptom → Cause Triage

Symptom	Likely cause	First move
OOM during model load	Model + KV scratch > available RAM	Smaller quant; OptiQ proxy (oMLX); free other apps
OOM during long-context inference	KV cache growth	Enable `turboquant_kv`; set `--paged-ssd-cache-dir` for spill-to-SSD (CLI flag is the operative control; `dflash_*` keys are version-dependent overrides)
Crash / hang at batch > 1	Architecture cache shape mismatch	Pin `--max-concurrent-requests 1` (oMLX); see upstream-bug-patterns
Tool calls return as plaintext content	Parser doesn't recognize the model's tool-call JSON shape	See upstream-bug-patterns; check `tool_calls` field in response
Throughput regression after config change	Feature flag interaction	Bisect via bench, one flag at a time
Server hangs on startup	Large checkpoint loading from cold disk	Wait; tail `server.log`; check disk I/O via `iostat`
Wrong / garbled outputs	Quantization too aggressive, or wrong chat template	Reduce `turboquant_kv_bits` or disable; verify chat template matches model card
First request slow, rest fast	Cold prefix cache	Enable `specprefill`; warm cache with a dummy request
Output cut off mid-sentence	`max_tokens` too low; thinking budget exhausted	Raise `max_tokens`; check `thinking_budget_enabled`

For deeper triage on each symptom: references/symptoms.md.

Core Principles

Lower --max-concurrent-requests to 1 for batch-fragile architectures. oMLX defaults to --max-concurrent-requests 8. Higher concurrency exposes architecture-specific batch handling bugs (notably in cache implementations like ChunkedKVCache used by Llama-4). For affected architectures, pin concurrency to 1 at launch and bench upward only after confirming clean output at batch>1.
One flag at a time. Feature flags interact (e.g., dflash + turboquant_kv both touch the KV cache path). Changing two flags simultaneously and observing a regression makes attribution impossible. Bench between each change.
Bench, don't guess. "It feels faster" is not a tuning signal. Maintain a small bench suite (chat / coding / tool-calling correctness + throughput) and re-run it after every config change. See references/bench-methodology.md.
Match chat templates to model cards. Tool-calling and reasoning behavior depend on the chat template applied at request time. The model's HuggingFace card is authoritative.
Quantization is a curve, not a switch. turboquant_kv_bits=4.0 keeps most quality; lower bits trade quality for memory. Always include a quality cell in the bench suite — a config that passes throughput but fails coding can be worse than slower-but-correct.
OptiQ proxy is for the "model > RAM" case only. It builds a sensitivity-driven proxy of the model so per-layer quant decisions can be made without holding the full model in memory. It's not a general speedup; for in-RAM models it adds startup cost without runtime benefit.
Server-side bugs masquerade as config problems. When a symptom reproduces across multiple config combinations, suspect upstream rather than chasing more flags. See references/upstream-bug-patterns.md.

References

references/symptoms.md — symptom triage with diagnostics and fixes (OOM, batch>1 crashes, tool-call-as-text, throughput regression, log triage)
references/omlx-feature-flags.md — per-flag reference (turboquant_kv, dflash, MTP, specprefill, thinking budget, max-concurrent-requests, force_sampling) with interactions
references/bench-methodology.md — bench-it-don't-guess: suite design, backend-agnostic harness shape, scoring, change attribution
references/upstream-bug-patterns.md — when to suspect upstream vs config; two recurring bug patterns (cache-shape mismatch under batched scheduling; model tool-call format not parsed) with diagnostics and reporting guidance

What This Skill Does Not Cover

Cloud LLM hosting (Bedrock, OpenAI API, Anthropic API) — different surface entirely
Non-MLX local backends (llama.cpp, Ollama, vLLM) — overlapping problem space but different tooling
Model training, fine-tuning, or LoRA — see MLX training tutorials
General Python / HTTP debugging unrelated to MLX
Model card creation or HuggingFace upload workflows

More from this repository

same repository

cowork-migrate

AeyeOps/aeo-skill-marketplace

Migrate a Claude Cowork session from one Windows machine to another with full history, working file links, and no truncated-transcript rendering bug. Use this whenever the user mentions moving, importing, copying, or migrating a Cowork session/conversation/project between machines, or troubleshoots symptoms of a broken import like "session shows blank", "only the latest messages show", "scratchpad files don't open", "can't scroll past the last compaction", or "Loaded N messages (truncated via tail/compaction)" in the Cowork log. Covers orphan sessions on Windows to Windows under the same Cowork account. Handles the undocumented two-layer compact_boundary truncation filter in app.asar that silently clips imported transcripts. Does not handle Cowork Spaces/Projects, Linux/macOS, or cross-account migration.

2026-06-242

skill-creator

AeyeOps/aeo-skill-marketplace

Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.

2026-06-242

tailscale-macos-headscale

AeyeOps/aeo-skill-marketplace

Onboard a macOS host (Tahoe / macOS 26 and later) as a Tailscale client of a self-hosted headscale control plane. Covers Tailscale.app installation via Homebrew Cask, the NetworkExtension permission grants required for the daemon to start, the conflict that arises if the brew formula `tailscale` is also installed alongside the cask, how to use `tailscale up --login-server` with a headscale preauth key, the deep-link fallback flow when the CLI cannot reach the daemon, the headscale-specific gotcha that `headscale preauthkeys create --user <N>` expects a numeric user ID rather than a username on recent builds, and bidirectional reach verification once joined. Use when adding a macOS host to a headscale-controlled mesh, troubleshooting symptoms like "failed to connect to local tailscale service", Tailscale.app stuck on "Starting...", `tailscale up` hanging on "joining <coordinator>", a blank menu-bar icon after a fresh install, deciding between the Homebrew cask and formula distributions, or recovering from a st

2026-05-242

glinet-slate7

AeyeOps/aeo-skill-marketplace

Comprehensive reference for the GL-iNet Slate 7 travel router (model GL-BE3600, Wi-Fi 7). Covers hardware specs, 2.5G ports, touchscreen interface, full admin panel menu structure, VPN client setup (WireGuard/OpenVPN; NordVPN, Mullvad, Surfshark, and 30+ providers), WireGuard/OpenVPN server setup, AdGuard Home, Tor, Tailscale, DDNS, network modes (Router/AP/Extender/WDS/Drop-in Gateway), SSH/CLI access with command reference, factory reset, firmware update, and U-Boot bricked-device recovery. Also covers the JSON-RPC admin API at /rpc (challenge/response auth, module/method discovery, reusable bash helper), programmatic WireGuard server provisioning via the wg-server module (add_peer, generate_peer, settings, leak verification, local-only Endpoint pattern), and Linux client-side WG with overlay-VPN stacking — including the two leak modes that appear when running Tailscale on top of a full-tunnel WG client (fwmark 0x80000 bypass and wg-quick catch-all shadowing the tailnet routes) plus the wg-quick PostUp/PreD

2026-05-242

lima-vm-operations

AeyeOps/aeo-skill-marketplace

This skill should be used when the user asks about "Lima", "limactl", "lima.yaml", "lima start", "lima shell", "creating a Linux VM on Mac", "running Linux on Apple Silicon", "macOS Linux VM", "Apple Silicon VM", or wants to "install Lima", "configure a Lima VM", "edit lima config", "spin up an Ubuntu VM on my Mac", or "use Lima to run Docker on macOS". Also applies for "lima vmType vz", "lima vz vs qemu", "host.lima.internal", "socket_vmnet", "lima networking", "lima shared network", "lima bridged network", "virtiofs mount", "9p mount", "lima port forward", "lima mount writable", "limactl edit", "limactl validate", "limactl template", "lima Rosetta", "running x86 in lima", "lima debug startup", or any task involving spinning up, configuring, troubleshooting, or shelling into a Lima VM on an Apple Silicon Mac. Use this skill whenever Lima is mentioned even if the user doesn't explicitly ask for "help" — the right configuration choices (vz vs qemu, mount type, network mode) are non-obvious and easy to get wron

2026-05-092

architecture-docs

AeyeOps/aeo-skill-marketplace

Produce architecture documentation for technical audiences using established frameworks — the C4 model for system structure (Context / Container / Component / Code), Architectural Decision Records (ADRs) for capturing the why behind significant design choices, OpenAPI for HTTP API specifications, and system / component design documents for engineers who need to build against or reason about the architecture. Use when someone asks to document system architecture, write an ADR, capture an architectural decision, generate C4 diagrams (context / container / component), produce an OpenAPI spec, draft a component design doc, or document how services integrate. Trigger phrases include "create a C4 diagram", "write an ADR for X", "document the architecture", "specify the API", "OpenAPI spec for", "system design doc", "decision record", "architecture overview". Not for: end-user-facing docs (use the `aeo-docs:diataxis` skill instead — tutorials, how-tos, references, explanations live there), README content, or code-le

2026-05-082

name	mlx-serving
version	1.0.0
description	This skill should be used when the user asks about "MLX serving", "mlx_lm.server", "oMLX", "Apple Silicon LLM serving", or "local LLM on Mac" — and when troubleshooting symptoms like model fails to load, OOM during load or inference, server hangs or crashes at batch>1, tool calls returning as plaintext content, throughput regression, or choosing between mlx-lm and oMLX. Also applies to oMLX feature-flag tuning ("turboquant_kv", "dflash", "MTP", "specprefill", "thinking_budget", "max-concurrent-requests", "force_sampling"), OptiQ proxy for models exceeding RAM, Llama-4 ChunkedKVCache batch handling, Llama-3 tool-call JSON format ("name"/"parameters"), and bench-driven validation of serving configs. For Apple Silicon (M-series) only — not for cloud LLM hosting (Bedrock, OpenAI API, Anthropic API), not for non-MLX backends (llama.cpp, Ollama, vLLM), not for model training.

Apple Silicon LLM Serving (MLX / oMLX)

Operational guidance for running and tuning local LLM inference on Apple Silicon (M-series). Covers two common backends: Apple's mlx-lm and the oMLX server built on top of MLX.

Both expose OpenAI-compatible HTTP APIs, so client code is portable. The differences are in features, packaging, and the failure modes you hit when pushing models near the limits of unified memory.

Pre-Flight: Verify Environment

Before debugging any serving issue, confirm the basics:

# Hardware: confirm Apple Silicon and unified-memory size
system_profiler SPHardwareDataType | grep -E "Model|Chip|Memory"

# Activate the venv where mlx-lm is installed before the next two probes,
# otherwise they hit system Python and may report nothing useful.
python -c "import mlx.core as mx; print(mx.__version__, mx.default_device())"
python -c "import mlx_lm; print(mlx_lm.__version__)"

# oMLX (DMG install): inspect available subcommands and run health checks
omlx --help
omlx diagnose menubar   # checks Tahoe ControlCenter visibility (oMLX-specific)

# Confirm a process is actually listening
lsof -nP -iTCP -sTCP:LISTEN | grep -E "omlx|mlx_lm"

Backend Decision: mlx-lm vs oMLX

Concern	mlx-lm	oMLX
Install	`pip install mlx-lm`	DMG (macOS app bundle)
Process model	One model per `mlx_lm.server` process	Built-in registry; multi-model
Config surface	Minimal CLI flags	Rich per-model JSON (`model_settings.json`)
Feature flags	`--temp`, `--top-p`, basic	turboquant_kv, dflash, MTP, specprefill, thinking budget
Tool calling	Format depends on chat template	Per-model parser (configurable)
Models > RAM	Not supported (OOM on load)	OptiQ proxy build (sensitivity-driven quant)
GUI	None	Native macOS app + menu bar
Default port	8080	8000
Best for	Library-first scripts, embedding in Python	Long-running daemon, multiple models, GUI ops

Skill bias: the depth here leans oMLX — feature flags, cache controls, upstream bug patterns. mlx-lm coverage is lighter and focuses on the shared substrate.

Pick mlx-lm when: you control the Python process, you want minimal surface area, you serve one model at a time, and the model fits comfortably in RAM.

Both backends share the same fundamental constraints (unified memory, KV cache pressure, model architecture quirks), so most troubleshooting in this skill applies to either.

Symptom → Cause Triage

Symptom	Likely cause	First move
OOM during model load	Model + KV scratch > available RAM	Smaller quant; OptiQ proxy (oMLX); free other apps
OOM during long-context inference	KV cache growth	Enable `turboquant_kv`; set `--paged-ssd-cache-dir` for spill-to-SSD (CLI flag is the operative control; `dflash_*` keys are version-dependent overrides)
Crash / hang at batch > 1	Architecture cache shape mismatch	Pin `--max-concurrent-requests 1` (oMLX); see upstream-bug-patterns
Tool calls return as plaintext content	Parser doesn't recognize the model's tool-call JSON shape	See upstream-bug-patterns; check `tool_calls` field in response
Throughput regression after config change	Feature flag interaction	Bisect via bench, one flag at a time
Server hangs on startup	Large checkpoint loading from cold disk	Wait; tail `server.log`; check disk I/O via `iostat`
Wrong / garbled outputs	Quantization too aggressive, or wrong chat template	Reduce `turboquant_kv_bits` or disable; verify chat template matches model card
First request slow, rest fast	Cold prefix cache	Enable `specprefill`; warm cache with a dummy request
Output cut off mid-sentence	`max_tokens` too low; thinking budget exhausted	Raise `max_tokens`; check `thinking_budget_enabled`

For deeper triage on each symptom: references/symptoms.md.

Core Principles

Lower --max-concurrent-requests to 1 for batch-fragile architectures. oMLX defaults to --max-concurrent-requests 8. Higher concurrency exposes architecture-specific batch handling bugs (notably in cache implementations like ChunkedKVCache used by Llama-4). For affected architectures, pin concurrency to 1 at launch and bench upward only after confirming clean output at batch>1.
One flag at a time. Feature flags interact (e.g., dflash + turboquant_kv both touch the KV cache path). Changing two flags simultaneously and observing a regression makes attribution impossible. Bench between each change.
Bench, don't guess. "It feels faster" is not a tuning signal. Maintain a small bench suite (chat / coding / tool-calling correctness + throughput) and re-run it after every config change. See references/bench-methodology.md.
Match chat templates to model cards. Tool-calling and reasoning behavior depend on the chat template applied at request time. The model's HuggingFace card is authoritative.
Quantization is a curve, not a switch. turboquant_kv_bits=4.0 keeps most quality; lower bits trade quality for memory. Always include a quality cell in the bench suite — a config that passes throughput but fails coding can be worse than slower-but-correct.
OptiQ proxy is for the "model > RAM" case only. It builds a sensitivity-driven proxy of the model so per-layer quant decisions can be made without holding the full model in memory. It's not a general speedup; for in-RAM models it adds startup cost without runtime benefit.
Server-side bugs masquerade as config problems. When a symptom reproduces across multiple config combinations, suspect upstream rather than chasing more flags. See references/upstream-bug-patterns.md.

References

references/symptoms.md — symptom triage with diagnostics and fixes (OOM, batch>1 crashes, tool-call-as-text, throughput regression, log triage)
references/omlx-feature-flags.md — per-flag reference (turboquant_kv, dflash, MTP, specprefill, thinking budget, max-concurrent-requests, force_sampling) with interactions
references/bench-methodology.md — bench-it-don't-guess: suite design, backend-agnostic harness shape, scoring, change attribution
references/upstream-bug-patterns.md — when to suspect upstream vs config; two recurring bug patterns (cache-shape mismatch under batched scheduling; model tool-call format not parsed) with diagnostics and reporting guidance

What This Skill Does Not Cover

Cloud LLM hosting (Bedrock, OpenAI API, Anthropic API) — different surface entirely
Non-MLX local backends (llama.cpp, Ollama, vLLM) — overlapping problem space but different tooling
Model training, fine-tuning, or LoRA — see MLX training tutorials
General Python / HTTP debugging unrelated to MLX
Model card creation or HuggingFace upload workflows