بنقرة واحدة
بنقرة واحدة
Manage recipe registries and create inference recipes
ALWAYS invoke this skill before running any sparkrun CLI commands. Never run sparkrun directly without loading this skill first. Covers launching, monitoring, stopping, and checking status of inference workloads on NVIDIA DGX Spark.
Manage recipe registries and create inference recipes
ALWAYS invoke this skill before running any sparkrun CLI commands. Never run sparkrun directly via Bash without loading this skill first. Covers launching, monitoring, stopping, and checking status of inference workloads on NVIDIA DGX Spark.
Install sparkrun and configure DGX Spark clusters
| name | setup |
| description | Install sparkrun and configure DGX Spark clusters |
<Use_When>
<Do_Not_Use_When>
# Ensure that uv is installed
uv --version
# uv can be installed with (IF NEEDED)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install sparkrun as a CLI tool
uvx sparkrun setup install
# Update sparkrun + registries (top-level shortcut)
sparkrun update
# Update to latest version (and registries)
sparkrun setup update
# Update sparkrun only (skip registry sync)
sparkrun setup update --no-update-registries
The wizard handles all setup steps in a single guided flow:
# Interactive wizard (auto-launches when no default cluster exists)
sparkrun setup wizard
# Pre-populate hosts and cluster name
sparkrun setup wizard --hosts <ip1>,<ip2> --cluster mylab
# Non-interactive (accept all defaults)
sparkrun setup wizard --yes --hosts <ip1>,<ip2>
# Dry-run preview
sparkrun setup wizard --dry-run --hosts <ip1>,<ip2>
The wizard performs these phases:
Running sparkrun setup with no subcommand auto-launches the wizard when no default cluster is configured.
Clusters are named host groups saved in ~/.config/sparkrun/clusters/.
# Create a cluster (first host = head node)
sparkrun cluster create <name> --hosts <ip1>,<ip2>,... [-d "description"] [--user <ssh_user>]
sparkrun cluster create <name> --hosts <ips> --transfer-mode push --transfer-interface cx7
# Set as default (used when --hosts/--cluster not specified)
sparkrun cluster set-default <name>
# View clusters
sparkrun cluster list
sparkrun cluster show <name>
sparkrun cluster default
# Modify
sparkrun cluster update <name> --hosts <new_hosts> [--user <user>] [-d "desc"]
sparkrun cluster update <name> --add-host 10.0.0.5
sparkrun cluster update <name> --add-host 10.0.0.5,10.0.0.6
sparkrun cluster update <name> --remove-host 10.0.0.2
sparkrun cluster update <name> --transfer-mode push --transfer-interface cx7
sparkrun cluster delete <name>
sparkrun cluster unset-default
| Option | Description |
|---|---|
--hosts, -H | Comma-separated host list |
--hosts-file | File with hosts (one per line) |
--user, -u | SSH username for this cluster |
--cache-dir | HuggingFace cache directory for this cluster |
--transfer-mode | Resource transfer mode (auto, local, push, delegated) |
--transfer-interface | Network interface for transfers (auto, cx7, mgmt) |
--add-host | Add host(s) to the cluster (repeatable, comma-ok) |
--remove-host | Remove host(s) from the cluster (repeatable, comma-ok) |
Multi-node inference requires passwordless SSH between all hosts. sparkrun bundles a mesh setup script.
# Set up SSH mesh across cluster hosts (interactive -- prompts for passwords)
sparkrun setup ssh --cluster <name>
sparkrun setup ssh --hosts <ip1>,<ip2> [--user <username>]
# Include extra hosts (e.g. control machine) in the mesh
sparkrun setup ssh --cluster <name> --extra-hosts <control_ip>
# Exclude the local machine from the mesh
sparkrun setup ssh --cluster <name> --no-include-self
# Dry-run to see what would happen
sparkrun setup ssh --cluster <name> --dry-run
IMPORTANT: The SSH setup script runs interactively (prompts for passwords on first connection). Do NOT capture its output -- let it pass through to the terminal.
Configure ConnectX-7 network interfaces on cluster hosts for high-speed transfers.
# Auto-detect CX7 interfaces and configure with defaults
sparkrun setup cx7 --cluster <name>
sparkrun setup cx7 --hosts <ip1>,<ip2>
# Override subnets
sparkrun setup cx7 --cluster <name> --subnet1 192.168.11.0/24 --subnet2 192.168.12.0/24
# Force reconfiguration and set MTU
sparkrun setup cx7 --cluster <name> --force --mtu 9000
# Dry-run
sparkrun setup cx7 --cluster <name> --dry-run
Requires passwordless sudo on target hosts. Will prompt for sudo password if needed.
Ensure the SSH user can run Docker commands without sudo.
sparkrun setup docker-group --cluster <name>
sparkrun setup docker-group --hosts <ip1>,<ip2> [--user <username>]
sparkrun setup docker-group --cluster <name> --dry-run
Fix file ownership in HuggingFace cache directories on cluster hosts.
# Fix permissions on default cache directory
sparkrun setup fix-permissions --cluster <name>
# Custom cache directory
sparkrun setup fix-permissions --cluster <name> --cache-dir /data/hf-cache
# Install sudoers entry for passwordless future runs
sparkrun setup fix-permissions --cluster <name> --save-sudo
# Dry-run
sparkrun setup fix-permissions --cluster <name> --dry-run
Drop the Linux page cache on cluster hosts to free memory for inference.
sparkrun setup clear-cache --cluster <name>
sparkrun setup clear-cache --cluster <name> --save-sudo
sparkrun setup clear-cache --cluster <name> --dry-run
Install earlyoom on cluster hosts to prevent system hangs from memory pressure.
sparkrun setup earlyoom --cluster <name>
sparkrun setup earlyoom --hosts <ip1>,<ip2> [--user <username>]
sparkrun setup earlyoom --cluster <name> --dry-run
Collect diagnostic information from cluster hosts (hidden command, useful for debugging).
sparkrun setup diagnose --cluster <name>
sparkrun setup diagnose --cluster <name> --output diag.json
sparkrun setup diagnose --cluster <name> --json # JSON to stdout
sparkrun setup diagnose --cluster <name> --sudo # include sudo-level checks
Config file: ~/.config/sparkrun/config.yaml
Key settings:
cluster.hosts: Default host list (used when no --hosts/--cluster given)ssh.user: Default SSH usernamessh.key: Path to SSH private keyssh.options: Additional SSH options listcache_dir: sparkrun cache directory (default: ~/.cache/sparkrun)hf_cache_dir: HuggingFace cache directory (default: ~/.cache/huggingface)<Tool_Usage>
Use the sparkrun_exec tool for all sparkrun commands.
When running SSH setup, the command is interactive and must be run with inherited stdio -- use the shell tool directly for sparkrun setup ssh instead of sparkrun_exec, as it requires terminal interaction.
</Tool_Usage>
<Important_Notes>
sparkrun setup ssh is interactive -- let it pass through to the terminaltensor_parallel maps to node countuv is the recommended Python package manager; install with curl -LsSf https://astral.sh/uv/install.sh | shsparkrun setup cx7 requires passwordless sudo; use --force to reconfigure already-valid hostssparkrun setup fix-permissions and clear-cache try non-interactive sudo first, then prompt if needed--save-sudo to install scoped sudoers entries for passwordless future runssparkrun update is a top-level shortcut that upgrades sparkrun (if uv-installed) and updates registries--transfer-mode options: auto (default), local (no transfer), push (head pushes to workers), delegated (workers pull)--transfer-interface options: auto (default), cx7 (use CX7 IPs), mgmt (use management IPs)--add-host / --remove-host for incremental cluster changes instead of replacing the full host list
</Important_Notes>Task: {{ARGUMENTS}}