| name | gpu-lock |
| description | Use when about to launch any GPU/CUDA/PyTorch/Ollama workload, when wrapping a program that uses an LLM or image generator, when handling GPU coordination on Mandragora, or when working with the bots (im-gen, llm-via-telegram, crush, MCP server). Loads the full Mandragora gpu-lock contract — non-blocking respect-the-holder protocol, CLI verbs, library API, VRAM cleanup discipline, hard rules. |
gpu-lock — GPU coordination on Mandragora
This machine has one GPU (RTX 5070 Ti, 16 GB). Workloads assume exclusive access. The gpu-lock primitive enforces cooperative serialization with respect-the-holder semantics: a running holder is never interrupted; new arrivals fail fast and the caller decides what to do (retry, fall back, surface the busy state to the user).
When this applies
Anything that touches CUDA: PyTorch, TensorFlow, diffusers, Ollama clients (crush, llm-via-telegram, MCP server), nvidia-smi-monitored work, model loading, inference, fine-tuning, training, profiling.
Two ways in
CLI (any language):
```
gpu-lock run --name <yourname> --expect <seconds> -- <cmd>
```
Tries the fcntl mutex on /dev/shm/gpu-lock/gpu.lock without blocking. If free: runs `<cmd>`, releases on exit. If held: exits 75 (EX_TEMPFAIL) and prints {"error":"gpu-busy","holder":{...}} JSON to stderr. The running holder is never signalled.
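From a language without the library binding, the whole contract is the exit code plus the stderr JSON. A minimal sketch of the calling side in Python (the job name, expect value, and wrapped command are placeholders; only the exit-75/stderr behavior documented above is assumed):

```python
import json
import subprocess

# Wrap a command with gpu-lock; the holder is never signalled, we only observe.
proc = subprocess.run(
    ["gpu-lock", "run", "--name", "myjob", "--expect", "300",
     "--", "python", "train.py"],
    capture_output=True, text=True,
)
if proc.returncode == 75:  # EX_TEMPFAIL: GPU busy, wrapped command never ran
    info = json.loads(proc.stderr)
    print(f"GPU busy, holder: {info['holder'].get('name', '?')}")
else:
    proc.check_returncode()  # propagate real failures from the wrapped command
```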
Python library (in-tree projects):
```python
from gpu_lock import gpu_lock, GpuBusy
import torch

try:
    with gpu_lock.acquire("yourname", expected_seconds=N):
        for step in work:
            do_step(step)
        torch.cuda.empty_cache()
except GpuBusy as busy:
    print(f"GPU held by {busy.holder['name']}, ~{busy.expected_remaining():.0f}s remaining")
```
PYTHONPATH must include /etc/nixos/mandragora/.local/share/gpu-lock for the import to resolve. Mandragora-managed services already set this; ad-hoc scripts can either set it themselves or just use the CLI wrapper.
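One way an ad-hoc launcher can set it for a child process, as a sketch (my_script.py is a placeholder; existing PYTHONPATH entries are preserved):

```python
import os
import subprocess
import sys

# Prepend the canonical gpu-lock location so `from gpu_lock import ...`
# resolves in the child interpreter.
gpu_lock_dir = "/etc/nixos/mandragora/.local/share/gpu-lock"
env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(
    filter(None, [gpu_lock_dir, env.get("PYTHONPATH")])
)
subprocess.run([sys.executable, "my_script.py"], env=env, check=True)
```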
Hard rules
- PyTorch users MUST call `torch.cuda.empty_cache()` before exit. Without it, the caching allocator keeps the pages and the next holder sees a near-full GPU. Real incident on 2026-04-27: a Flux render left 13.7 GiB of cached pages, forcing every subsequent Ollama call into CPU offload mode (2+ minute response times until im-gen restarted).
- Respect execution. A running holder is not preempted, signalled, or killed. If `acquire` raises `GpuBusy`, your job is to back off: retry later, fall back to CPU, or tell the user (see the backoff sketch after this list). Do not `os.kill` the holder PID and do not spin in a tight loop on `acquire`.
- No `sys.path.insert(0, "/home/m/Projects/gpu-lock")`. That path is a bridge symlink for legacy callers and will go away in Phase 2 of the migration; use PYTHONPATH or the CLI wrapper.
- Ollama is currently outside the protocol. Crush, llm-via-telegram, and the local MCP server all hit `http://127.0.0.1:11434` directly without taking the lock. Until the Ollama-fronting proxy ships (Layer 2), expect interleaving; run `sudo systemctl stop ollama` manually before exclusive workloads if you need guaranteed isolation.
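What backing off can look like in practice, as a sketch built on the library API above (the attempt count and sleep floor are illustrative, not part of the contract):

```python
import time

from gpu_lock import gpu_lock, GpuBusy

def run_when_free(name, job, expected_seconds, attempts=5):
    """Politely retry: sleep roughly as long as the holder says it needs."""
    for _ in range(attempts):
        try:
            with gpu_lock.acquire(name, expected_seconds=expected_seconds):
                return job()
        except GpuBusy as busy:
            # Never signal or kill the holder; wait out its own estimate.
            time.sleep(max(busy.expected_remaining(), 5.0))
    raise RuntimeError("GPU stayed busy; fall back to CPU or surface it to the user")
```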
Inspection
- `gpu-lock status`: current holder (pid, name, held_for, expected_remaining)
- State files in `/dev/shm/gpu-lock/` (RAM-backed, auto-clears on reboot): `gpu.lock` (the fcntl mutex), `gpu.lock.holder` (holder JSON)
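For programmatic inspection without shelling out to `gpu-lock status`, the holder file can be read directly. A sketch, assuming only that `gpu.lock.holder` is JSON carrying at least the name and pid shown by `status`, and that the file is absent when the lock is free (see gpu.md for the authoritative layout):

```python
import json
from pathlib import Path

holder_file = Path("/dev/shm/gpu-lock/gpu.lock.holder")
try:
    holder = json.loads(holder_file.read_text())
    print(f"GPU held by {holder.get('name')} (pid {holder.get('pid')})")
except FileNotFoundError:
    # No holder file: nothing currently holds the lock.
    print("GPU lock is free")
```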
Long form
Full design rationale, storage layout, idiomatic patterns, what the tool deliberately does NOT do: /etc/nixos/mandragora/docs/gpu.md.