| name | neuroskill-llm |
| description | NeuroSkill built-in LLM inference server — on-device OpenAI-compatible chat with llama.cpp, model catalog management (add/download/select/delete GGUF models), vision support (mmproj), streaming WebSocket and HTTP chat, automatic tool calling (bash, read, write, edit, web_search, web_fetch), GenParams tuning, and persistent chat history. Use when managing local LLM models, chatting, or integrating with the OpenAI-compatible API. |
NeuroSkill LLM — On-Device Inference Server
Control the built-in on-device LLM inference server (OpenAI-compatible, powered by llama.cpp).
All subcommands route to the WebSocket/HTTP API.
Subcommands
| Subcommand | Description |
|---|---|
| llm status | Server state (stopped/loading/running), model name, context window. Context size is auto-recommended based on model size and available GPU/RAM (minimum 8192 tokens, configurable in Settings → LLM → Context Size) |
| llm start | Load the active model and start the inference server |
| llm stop | Stop the server and free GPU/CPU memory |
| llm catalog | List all GGUF models with download states and active selections |
| llm add <repo> <filename> | Add an external HF model to the catalog and download it |
| llm add <hf-url> | Add from a full HuggingFace URL |
| llm add ... --mmproj <file> | Also add and download a vision projector from the same repo |
| llm select <filename> | Set the active text model |
| llm mmproj <filename\|none> | Set the active vision projector (or none to disable) |
| llm autoload-mmproj <on\|off> | Toggle auto-loading of vision projector on start |
| llm download <filename> | Download a model (fire-and-forget; poll catalog for progress) |
| llm pause <filename> | Pause an in-progress download |
| llm resume <filename> | Resume a paused download |
| llm cancel <filename> | Cancel an in-progress download |
| llm delete <filename> | Delete a locally-cached model file |
| llm downloads | List all downloads with status and progress |
| llm refresh | Re-probe the HF Hub cache for externally downloaded models |
| llm fit | Check which models fit in available RAM/VRAM |
| llm logs | Print the last 500 LLM server log lines |
| llm chat | Interactive multi-turn REPL — type exit to quit (WebSocket only) |
| llm chat "message" | Single-shot: send one message, stream the reply, and exit |
CLI Examples
npx neuroskill llm status
npx neuroskill llm start
npx neuroskill llm stop
npx neuroskill llm catalog
npx neuroskill llm catalog --json | jq '.entries[] | select(.state == "downloaded")'
npx neuroskill llm add bartowski/Phi-4-mini-reasoning-GGUF Phi-4-mini-reasoning-Q4_K_M.gguf
npx neuroskill llm add bartowski/Phi-4-mini-reasoning-GGUF Phi-4-mini-reasoning-Q4_K_M.gguf --mmproj mmproj-Phi-4-mini-reasoning-BF16.gguf
npx neuroskill llm select "Qwen_Qwen3.5-4B-Q4_K_M.gguf"
npx neuroskill llm mmproj "mmproj-Qwen_Qwen3.5-4B-BF16.gguf"
npx neuroskill llm mmproj none
npx neuroskill llm download "Qwen3-1.7B-Q4_K_M.gguf"
npx neuroskill llm delete "Qwen3-1.7B-Q4_K_M.gguf"
npx neuroskill llm fit
npx neuroskill llm chat
npx neuroskill llm chat --system "You are a concise EEG assistant."
npx neuroskill llm chat --temperature 0.3 --max-tokens 512
npx neuroskill llm chat "What EEG frequency bands are associated with meditation?"
npx neuroskill llm chat "Explain delta waves" --temperature 0.3 --max-tokens 256
npx neuroskill llm chat "What do you see?" --image screenshot.png
npx neuroskill llm chat "Compare these" --image session1.png --image session2.png
Interactive REPL Commands
| Command | Effect |
|---|---|
| /clear | Clear conversation history and start a new session (system prompt is kept) |
| /history | Print all messages in the current conversation |
| /image <path> | Stage an image file for the next message (can repeat for multiple images) |
| /images | Show count of currently staged images |
| /help | Show REPL command help |
| exit or quit | End the session |
Image staging example inside the REPL:
You: /image eeg_plot.png
✓ image staged: eeg_plot.png (1 pending)
You: What anomalies do you see in this EEG trace?
WebSocket Streaming Protocol — llm_chat
llm_chat returns multiple frames per request. Tokens stream as delta frames;
generation ends with done or error.
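// `ws` is an already-open WebSocket connection to the NeuroSkill server.
// Full form: pass the whole conversation as a messages array.
ws.send(JSON.stringify({
  command: "llm_chat",
  messages: [
    { role: "system", content: "You are a concise EEG assistant." },
    { role: "user", content: "What does high theta power indicate?" },
  ],
}));
// Shorthand: send a single user message.
ws.send(JSON.stringify({ command: "llm_chat", message: "Hello!" }));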
Server Frames
{ "command": "llm_chat", "type": "session", "session_id": 42 }
{ "command": "llm_chat", "type": "delta", "text": "High theta" }
{
"command": "llm_chat", "ok": true, "type": "done",
"finish_reason": "stop", "prompt_tokens": 42,
"completion_tokens": 87, "n_ctx": 16384, "session_id": 42
}
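
The frame sequence above can be folded into a single result on the client. The following is a minimal sketch, not the project's client code: createChatAccumulator is a hypothetical helper, and because the fields of the error frame are not documented here, the whole frame is stored unmodified.

```javascript
// Hypothetical helper: folds llm_chat frames into one result object.
// Frame shapes follow the session / delta / done examples shown above.
function createChatAccumulator() {
  const state = { sessionId: null, text: "", done: false, error: null, usage: null };

  return {
    state,
    // Feed one parsed frame; returns true once the stream has ended.
    push(frame) {
      if (frame.command !== "llm_chat") return state.done; // ignore other commands
      switch (frame.type) {
        case "session":
          state.sessionId = frame.session_id;
          break;
        case "delta":
          state.text += frame.text; // append streamed tokens in arrival order
          break;
        case "done":
          state.done = true;
          state.usage = {
            finishReason: frame.finish_reason,
            promptTokens: frame.prompt_tokens,
            completionTokens: frame.completion_tokens,
            nCtx: frame.n_ctx,
          };
          if (frame.session_id !== undefined) state.sessionId = frame.session_id;
          break;
        case "error":
          // Error payload fields are not documented above, so keep the raw frame.
          state.done = true;
          state.error = frame;
          break;
      }
      return state.done;
    },
  };
}
```

A message handler would typically parse each incoming frame with JSON.parse, pass it to push, and read state.text once push returns true.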
HTTP REST API
curl -s http://127.0.0.1:8375/llm/status | jq '{status, model_name, n_ctx}'
curl -s -X POST http://127.0.0.1:8375/llm/start
curl -s -X POST http://127.0.0.1:8375/llm/stop
curl -s http://127.0.0.1:8375/llm/catalog | jq '.entries[] | select(.state == "downloaded") | .filename'
curl -s -X POST http://127.0.0.1:8375/llm/chat \
-H "Content-Type: application/json" \
-d '{"message":"What is EEG coherence?"}' | jq '{text, finish_reason, session_id}'
curl -s -X POST http://127.0.0.1:8375/llm/chat \
-H "Content-Type: application/json" \
-d '{"message":"Can you elaborate?","session_id":42}'
curl -s -X POST http://127.0.0.1:8375/llm/chat \
-H "Content-Type: application/json" \
-d '{"message":"Summarize in one line.","system":"Be extremely brief.","temperature":0.3}'
curl -s -X POST http://127.0.0.1:8375/llm/chat \
-H "Content-Type: application/json" \
-d '{"message":"What do you see?","images":["data:image/png;base64,iVBORw0KGgo..."]}'
curl -s http://127.0.0.1:8375/llm/logs | jq '.logs[-10:]'
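
The images field in the chat request above takes data URLs. As a small sketch of how such a value could be produced in Node.js, the hypothetical helpers below (toImageDataUrl and buildImageChatBody are illustrative names, not part of NeuroSkill) base64-encode raw image bytes and build the JSON body. Only the message and images fields shown in the curl example are used.

```javascript
// Hypothetical helper: encode raw image bytes as a data URL.
// Uses Node's global Buffer; the MIME type defaults to PNG.
function toImageDataUrl(bytes, mimeType = "image/png") {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: JSON body for POST /llm/chat with attached images.
function buildImageChatBody(message, imageDataUrls) {
  return JSON.stringify({ message, images: imageDataUrls });
}
```

In practice the bytes would come from reading a file such as screenshot.png, and the resulting string would be sent as the request body with the Content-Type: application/json header.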
WebSocket Commands
| Command | Required | Optional | Description |
|---|---|---|---|
| llm_status | — | — | Server state + model info |
| llm_start | — | — | Start inference server |
| llm_stop | — | — | Stop server + free resources |
| llm_catalog | — | — | Full model catalog |
| llm_add_model | repo, filename | download, mmproj | Add external HF model |
| llm_select_model | filename | — | Set active text model |
| llm_select_mmproj | filename | — | Set active vision projector |
| llm_download | filename | — | Start model download |
| llm_pause_download | filename | — | Pause download |
| llm_resume_download | filename | — | Resume download |
| llm_cancel_download | filename | — | Cancel download |
| llm_delete | filename | — | Delete cached model |
| llm_downloads | — | — | List downloads with progress |
| llm_refresh_catalog | — | — | Re-probe HF cache |
| llm_hardware_fit | — | — | Check model fit in RAM/VRAM |
| llm_logs | — | — | Last 500 log entries |
| llm_chat | messages or message | GenParams | Streaming chat |
GenParams Reference
| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.8 | Sampling temperature (0 = deterministic) |
| top_k | int | 40 | Top-K sampling |
| top_p | float | 0.9 | Nucleus sampling threshold |
| repeat_penalty | float | 1.1 | Repetition penalty |
| seed | uint | 0xDEADBEEF | RNG seed for reproducible output |
| max_tokens | uint | 2048 | Maximum tokens to generate |
| thinking_budget | uint/null | 512 | Max tokens in <think> block (0 = skip, null = unlimited) |
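
To illustrate how these fields might be attached to a request, here is a minimal sketch that builds an llm_chat payload and rejects misspelled parameter names. It assumes GenParams fields sit at the top level of the payload, as temperature does in the HTTP example; buildChatPayload and GEN_PARAM_DEFAULTS are hypothetical names, and the defaults are simply copied from the table.

```javascript
// Defaults copied from the GenParams table; kept client-side for reference only.
const GEN_PARAM_DEFAULTS = Object.freeze({
  temperature: 0.8,
  top_k: 40,
  top_p: 0.9,
  repeat_penalty: 1.1,
  seed: 0xdeadbeef,
  max_tokens: 2048,
  thinking_budget: 512,
});

// Hypothetical helper: build an llm_chat payload with GenParams overrides.
// Assumption: overrides are top-level fields, as in the HTTP example.
function buildChatPayload(message, overrides = {}) {
  const payload = { command: "llm_chat", message };
  for (const [key, value] of Object.entries(overrides)) {
    if (!(key in GEN_PARAM_DEFAULTS)) {
      throw new Error(`Unknown GenParams field: ${key}`);
    }
    // null is meaningful for thinking_budget (unlimited), so it is passed through.
    payload[key] = value;
  }
  return payload;
}
```

Only the overridden fields are sent, so the server's own defaults apply to everything else. For example, buildChatPayload("Explain delta waves", { temperature: 0.3, max_tokens: 256 }) mirrors the CLI flags shown earlier.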
Built-in Tool Calling
When the LLM server is running, llm_chat supports automatic tool calling — the
model can invoke built-in tools and return results within the conversation.
| Tool | Description |
|---|---|
| date | Get current date/time metadata |
| location | Approximate public-IP geolocation |
| bash | Execute shell commands |
| read_file | Read file contents from disk |
| write_file | Write / create files |
| edit_file | Surgical find-and-replace edits |
| search_output | Navigate large tool outputs (paginated) |
| web_search | Search the web |
| web_fetch | Fetch a URL and return its content |
| skill | Query the NeuroSkill EEG API (status, sessions, labels, search, screenshots, hooks, DND, health, etc.) |
Tool calls are detected in the model's output, executed server-side, and results are
fed back automatically. The CLI llm chat handles this transparently.
The skill Tool
The skill tool lets the LLM query and control NeuroSkill™ directly. Commands are
sent as { "command": "<ws_command>", "args": { ... } } objects. Supported commands
include: status, sessions, session_metrics, search, compare, sleep,
sleep_schedule, sleep_schedule_set, label, search_labels, interactive_search,
search_screenshots, screenshots_around, screenshots_for_eeg, eeg_for_screenshots,
search_screenshots_vision, search_screenshots_by_image_b64, hooks_status,
hooks_get, hooks_suggest, hooks_log, dnd, notify, say, health_summary,
health_query, health_metric_types, health_sync, and others.
Argument coercion: If the LLM flattens arguments to the top level (e.g.
{"command":"search_screenshots","query":"today"} instead of properly nesting in args),
the tool system automatically wraps stray properties into the args object before validation.
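
The coercion described above can be pictured with a short sketch. This is an illustration of the idea only, not the server's implementation: coerceSkillCall is a hypothetical name, and the rule that explicitly nested args take precedence over a stray duplicate is an assumption.

```javascript
// Illustrative sketch: move stray top-level properties into `args`.
function coerceSkillCall(call) {
  const { command, args, ...stray } = call;
  // Treat anything other than a plain object as "no args supplied".
  const nested = args && typeof args === "object" && !Array.isArray(args) ? args : {};
  // Assumption: explicitly nested args win over stray duplicates.
  return { command, args: { ...stray, ...nested } };
}
```

With this sketch, {"command":"search_screenshots","query":"today"} becomes {"command":"search_screenshots","args":{"query":"today"}}, while a call that is already nested passes through unchanged.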
Blocked commands: For safety, the LLM cannot invoke self-management commands:
llm_start, llm_stop, llm_select_model, llm_select_mmproj, llm_add_model,
llm_download, llm_cancel_download, llm_pause_download, llm_resume_download,
llm_refresh_catalog, llm_logs, llm_delete, run_calibration, timer.
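
A client that relays tool calls may want to mirror this list so that blocked commands are rejected before they are sent. The sketch below is a hypothetical client-side guard built from the list above; the shape of the returned object is illustrative and does not reflect how the server reports a blocked command.

```javascript
// Blocked self-management commands, copied from the list above.
const BLOCKED_SKILL_COMMANDS = new Set([
  "llm_start", "llm_stop", "llm_select_model", "llm_select_mmproj", "llm_add_model",
  "llm_download", "llm_cancel_download", "llm_pause_download", "llm_resume_download",
  "llm_refresh_catalog", "llm_logs", "llm_delete", "run_calibration", "timer",
]);

// Hypothetical guard: decide whether a skill command may be forwarded.
function checkSkillCommand(command) {
  if (BLOCKED_SKILL_COMMANDS.has(command)) {
    return { allowed: false, reason: `"${command}" is blocked for LLM tool calls` };
  }
  return { allowed: true };
}
```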
Status formatting: When the LLM calls the status command, the result is
automatically formatted from raw JSON into a human-readable text block with clear
section headers (Device, Session, EEG Embeddings, Labels, Most Used Apps, Screenshots,
Signal Quality, Current Scores, Hooks, Sleep, Recording History) for easier comprehension.
Chat Persistence
All conversations are automatically persisted to SQLite
(~/.skill/chats/chat_history.sqlite):
- WebSocket / HTTP chats appear in the Chat window sidebar.
- The server returns a session_id for multi-turn persistence across connections.
- New sessions are auto-titled from the first user message.
- Tool calls executed during conversation are also persisted.
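
Continuing a persisted conversation comes down to echoing the returned session_id in later requests, as the HTTP follow-up example does. A minimal sketch of that bookkeeping is shown below; ChatSession is a hypothetical client-side class and uses only the message and session_id fields that appear in the HTTP examples.

```javascript
// Hypothetical client-side helper that remembers the session_id between turns.
class ChatSession {
  constructor() {
    this.sessionId = null; // unknown until the server assigns one
  }

  // Request body for POST /llm/chat; session_id is attached once it is known.
  nextBody(message) {
    const body = { message };
    if (this.sessionId !== null) body.session_id = this.sessionId;
    return body;
  }

  // Record the session_id from a parsed chat response.
  record(response) {
    if (response && response.session_id !== undefined) {
      this.sessionId = response.session_id;
    }
  }
}
```

The first call to nextBody omits session_id, so the server starts a new session; after record has seen a response, every later body carries the same id.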
Server Status Values
| status | Meaning |
|---|---|
| "stopped" | No model loaded |
| "loading" | Model being loaded from disk |
| "running" | Model ready; chat and /v1/* endpoints are live |
Download State Values
| state | Meaning |
|---|---|
| "not_downloaded" | Not present locally |
| "downloading" | In progress; check progress (0.0–1.0) |
| "downloaded" | Cached locally; ready to use |
| "cancelled" | Download was cancelled |
| "failed" | Download failed; status_msg has details |
The LLM server also exposes an OpenAI-compatible HTTP API at /v1/chat/completions,
/v1/completions, and /v1/embeddings. Use any OpenAI client library by pointing
it at http://127.0.0.1:<port>.
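
For callers that prefer plain HTTP over a client library, a request to /v1/chat/completions could be assembled as sketched below. The body follows the standard OpenAI chat-completions shape (model plus messages). The helper name buildChatCompletionRequest and the "local" model placeholder are assumptions, since this document does not say whether the server inspects the model field, and the port stays a parameter because it is given only as <port> above.

```javascript
// Hypothetical helper: build the URL and fetch options for /v1/chat/completions.
// Assumption: "local" is a placeholder model name; the value the server expects
// (if it checks the field at all) is not stated in this document.
function buildChatCompletionRequest(port, messages, model = "local") {
  return {
    url: `http://127.0.0.1:${port}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

The returned url and init can be passed straight to fetch(url, init) once the server status is "running".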