mlx-apple-silicon-mlx

MLX-powered local AI — run LLMs, Stable Diffusion, speech-to-text, and embeddings natively on Apple Silicon via MLX. Ollama uses MLX for LLM inference, mflux uses MLX for Flux image generation, DiffusionKit uses MLX for Stable Diffusion 3, and Qwen3-ASR uses MLX for transcription. One fleet router coordinates all four across Mac Studio, Mac Mini, MacBook Pro.

Install command

npx skills add https://github.com/geeks-accelerator/ollama-herd --skill mlx-apple-silicon-mlx

Copy and paste this command into Claude Code to install the skill.

Source

geeks-accelerator/ollama-herd

Stars: 8

Forks: 0

Updated: April 27, 2026, 20:00

SKILL.md


name	mlx-apple-silicon-mlx
description	MLX-powered local AI — run LLMs, Stable Diffusion, speech-to-text, and embeddings natively on Apple Silicon via MLX. Ollama uses MLX for LLM inference, mflux uses MLX for Flux image generation, DiffusionKit uses MLX for Stable Diffusion 3, and Qwen3-ASR uses MLX for transcription. One fleet router coordinates all four across Mac Studio, Mac Mini, MacBook Pro.
version	1.0.0
homepage	https://github.com/geeks-accelerator/ollama-herd
metadata	{"openclaw":{"emoji":"bolt","requires":{"anyBins":["curl","wget"],"optionalBins":["python3","pip"]},"configPaths":["~/.fleet-manager/latency.db","~/.fleet-manager/logs/herd.jsonl"],"os":["darwin"]}}

MLX Local AI — Apple's ML Framework Powers Your Entire Fleet

Everything in this fleet runs on Apple's MLX framework. LLM inference, image generation, speech-to-text, embeddings — all MLX-native, all optimized for Apple Silicon's unified memory architecture.

The MLX stack

Capability	Tool	MLX usage
LLM inference	Ollama	MLX backend for model loading and inference on Apple Silicon
Image gen (Flux)	mflux	Pure MLX implementation of Flux diffusion models
Image gen (SD3)	DiffusionKit	MLX-native Stable Diffusion 3 and 3.5
Speech-to-text	Qwen3-ASR	MLX-accelerated audio transcription
Embeddings	Ollama	MLX backend for embedding model inference

One router. One framework. Four modalities. All local.

Setup

pip install ollama-herd    # PyPI: https://pypi.org/project/ollama-herd/
herd                       # start the router (port 11435)
herd-node                  # run on each device — finds the router automatically

# Install image generation backends
uv tool install mflux           # Flux models (~7s at 512px)
uv tool install diffusionkit    # Stable Diffusion 3/3.5

All tools leverage MLX for Metal-accelerated inference on Apple Silicon's GPU cores.

LLM inference via MLX

Ollama runs models using MLX on Apple Silicon. Because the CPU and GPU share unified memory, the entire model stays in one address space and weights are never copied across a PCIe bus.

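from openai import OpenAI

# Point the OpenAI client at the local herd router; the key is unused but the client requires one.
client = OpenAI(base_url="http://localhost:11435/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Explain MLX unified memory"}],
    stream=True,
)
# Print tokens as they arrive; skip any chunk that carries no choices.
for chunk in response:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()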

Image generation via MLX

Both mflux and DiffusionKit are pure MLX implementations — no PyTorch, no CUDA.

# Flux via mflux (fastest)
curl -o flux.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model": "z-image-turbo", "prompt": "a neural network visualization", "width": 1024, "height": 1024}'

# Stable Diffusion 3 via DiffusionKit
curl -o sd3.png http://localhost:11435/api/generate-image \
  -H "Content-Type: application/json" \
  -d '{"model": "sd3-medium", "prompt": "a circuit board landscape", "width": 1024, "height": 1024, "steps": 20}'
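The same two calls can be made from Python. The following is a minimal sketch using only the standard library; it assumes, as the `curl -o` usage above implies, that the endpoint returns the image bytes directly in the response body. The helper names `build_image_request` and `generate_image` are hypothetical and not part of Ollama Herd.

```python
import json
import urllib.request

ROUTER = "http://localhost:11435"  # router port from the Setup section


def build_image_request(model, prompt, width=1024, height=1024, steps=None):
    """Build a POST request with the same JSON fields as the curl examples."""
    payload = {"model": model, "prompt": prompt, "width": width, "height": height}
    if steps is not None:  # only the SD3 example above passes "steps"
        payload["steps"] = steps
    return urllib.request.Request(
        f"{ROUTER}/api/generate-image",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_image(out_path, model, prompt, timeout=300, **options):
    """Send the request and write the raw response body to out_path.

    Assumption: the body is the image file itself, not a JSON wrapper.
    Returns the number of bytes written.
    """
    request = build_image_request(model, prompt, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

With the router running, `generate_image("sd3.png", "sd3-medium", "a circuit board landscape", steps=20)` would mirror the second curl call.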

Speech-to-text via MLX

Qwen3-ASR transcribes audio using MLX acceleration.

curl http://localhost:11435/api/transcribe \
  -F "file=@meeting.wav" \
  -F "model=qwen3-asr"
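The `-F` flags above send a multipart/form-data upload. As a rough sketch of what that looks like without curl, the code below builds the multipart body by hand with the standard library. The helper names are hypothetical, and because the response format is not documented here, `transcribe` returns the raw response text rather than parsing it.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model="qwen3-asr", base_url="http://localhost:11435", timeout=300):
    """Upload an audio file to /api/transcribe and return the raw response text."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": model}, "file", os.path.basename(path), audio
    )
    request = urllib.request.Request(
        f"{base_url}/api/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe("meeting.wav")` would send the same two form fields as the curl command.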

Embeddings via MLX

Ollama embedding models run on the MLX backend.

curl http://localhost:11435/api/embed \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "Apple MLX framework for machine learning"}'
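Embeddings are usually compared rather than read directly. The sketch below requests vectors from the same endpoint and scores two of them with cosine similarity. It assumes the router passes through an Ollama-style response of the form `{"embeddings": [[...], ...]}`; that shape is an assumption here and should be checked against the API Reference. The function names are illustrative.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11435"):
    """POST to /api/embed and return one vector per input text.

    Assumption: the response is JSON with an "embeddings" list of vectors.
    """
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For example, `a, b = embed(["Apple MLX framework", "Metal GPU kernels"])` followed by `cosine_similarity(a, b)` gives a score between -1 and 1, with higher values meaning closer texts.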

Why MLX matters for local AI

Unified memory: model weights, activations, and KV cache share one memory pool, so there is no CPU-GPU transfer overhead.
Metal acceleration: MLX compiles to Metal shaders that run on Apple Silicon GPU cores (up to 80 on M3/M4 Ultra).
Lazy evaluation: MLX builds a computation graph and materializes arrays only when their values are needed, which avoids allocating memory for unused intermediates.
Dynamic shapes: changing input sizes does not trigger recompilation (unlike some CUDA frameworks).
Apple-maintained: MLX is developed by Apple's machine learning research team and tuned for Apple Silicon.

Fleet performance on Apple Silicon

Chip	GPU Cores	Memory	LLM Sweet Spot	Image Gen
M1	8	8-16GB	3-7B models	Slow
M2 Pro	19	32GB	14B models	Capable
M3 Max	40	128GB	70B models	Fast
M4 Ultra	80	256GB	120B+ models	Very fast
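The pairing of memory and model size in this table follows from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below works that out. The quantization level is not stated in the table, so the 4-bit figure used in the note afterwards is an assumption, and the `headroom` fraction is purely illustrative; KV cache and activations need additional memory beyond the weights.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion, bits_per_weight, memory_gb, headroom=0.75):
    """True if the weights fit within a fraction of unified memory.

    headroom is an illustrative assumption: it leaves room for the KV cache,
    activations, and the operating system. It is not a measured value.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) <= memory_gb * headroom
```

Under these assumptions a 70B model at 4 bits needs about 35 GB for weights and fits comfortably in 128 GB, while the same model at 16 bits needs about 140 GB and does not.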

Monitor your MLX fleet

# Fleet overview
curl -s http://localhost:11435/fleet/status | python3 -m json.tool

# Model recommendations based on your hardware
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

# Health checks
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Dashboard at http://localhost:11435/dashboard — see every node, every model, every queue in real time.
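For scripted monitoring, the three endpoints above can be polled together. The sketch below collects all three into one dictionary and records an error string for any endpoint that fails instead of stopping. It makes no assumption about the JSON schema of the responses; the `ENDPOINTS` keys and function names are hypothetical.

```python
import json
import urllib.request

# Paths taken from the curl commands above; the dictionary keys are illustrative.
ENDPOINTS = {
    "fleet": "/fleet/status",
    "recommendations": "/dashboard/api/recommendations",
    "health": "/dashboard/api/health",
}


def fetch_json(path, base_url="http://localhost:11435", timeout=10,
               opener=urllib.request.urlopen):
    """GET one endpoint and decode its JSON body."""
    with opener(f"{base_url}{path}", timeout=timeout) as response:
        return json.loads(response.read())


def snapshot(opener=urllib.request.urlopen):
    """Fetch every endpoint; store the error text instead of raising."""
    results = {}
    for name, path in ENDPOINTS.items():
        try:
            results[name] = fetch_json(path, opener=opener)
        except (OSError, ValueError) as exc:  # network failure or invalid JSON
            results[name] = {"error": str(exc)}
    return results
```

Printing `json.dumps(snapshot(), indent=2)` gives output comparable to piping each curl call through `python3 -m json.tool`. The `opener` parameter exists only so the function can be exercised without a running router.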

Full documentation

Agent Setup Guide — all 4 model types
Image Generation Guide — 3 backends
API Reference

Contribute

Ollama Herd is open source (MIT) and built on the MLX ecosystem. We welcome contributions:

Star on GitHub — helps others discover the project
Open an issue — bug reports, feature requests, questions
AI agents welcome — CLAUDE.md provides full architectural context. Fork, branch, PR.
The test suite has 964 tests (async Python) and runs in under 40 seconds, so regressions surface quickly.

Guardrails

No automatic downloads — all model pulls require explicit user confirmation.
Model deletion requires explicit user confirmation.
All requests stay local — no data leaves your network.
Never delete or modify files in ~/.fleet-manager/.
