system-performance-remediation

Updated May 8, 2026 at 12:40

Restore machine responsiveness via safe, selective process cleanup. Use when system unresponsive, high CPU/load average, IO pressure, filesystem cache bloat, memory pressure from btrfs/ext4, stuck tests, competing cargo builds, confused agents in loops, swap thrashing, disk full, systemd-oomd kills, or tmux/zellij session sprawl.

Source: oysteinkrog/dotfiles

System Performance Remediation

Core Principle: First, do no harm. Kill OBVIOUSLY useless processes before touching anything potentially useful.

The Whack-a-Mole Anti-Pattern: Killing child processes (cargo builds, tests) is POINTLESS if confused parent agents respawn them. Kill the confused agents, not their children.
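
Before killing anything, it can help to trace a build process back to the agent that spawned it. The following is a minimal POSIX sh sketch of that walk through /proc. The helper name `agent_ancestor`, the overridable `PROC_ROOT` variable, and the default `claude|codex` pattern are illustrative assumptions, not part of this skill's scripts.

```shell
# Sketch: print the PID of the nearest process (PID itself or an ancestor)
# whose command line matches an agent pattern.
# PROC_ROOT is a hypothetical override so the walk can be tried on a fake tree.
PROC_ROOT="${PROC_ROOT:-/proc}"

agent_ancestor() {
  pid="$1"
  pattern="${2:-claude|codex}"   # assumed agent command-line pattern
  while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    # cmdline is NUL-separated; convert it to a space-separated string
    cmd=$(tr '\0' ' ' 2>/dev/null < "$PROC_ROOT/$pid/cmdline") || return 1
    if printf '%s\n' "$cmd" | grep -Eq "$pattern"; then
      printf '%s\n' "$pid"
      return 0
    fi
    # Move one level up the process tree
    pid=$(awk '/^PPid:/ {print $2}' "$PROC_ROOT/$pid/status" 2>/dev/null)
  done
  return 1
}
```

Possible usage: `agent_ancestor "$(pgrep -n cc1plus)"` prints the agent PID to inspect, or returns non-zero if no matching ancestor exists.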

Quick Reference — Copy-Paste Commands

# === INSTANT DIAGNOSIS ===
uptime && nproc && cat /proc/pressure/cpu | head -1

# === ONE-LINER STATUS (includes swap + memory pressure) ===
echo "Load: $(uptime | awk -F'load average:' '{print $2}') / $(nproc) cores | Mem: $(free -h | awk '/Mem:/{print $3"/"$2}') | Swap: $(free -h | awk '/Swap:/{print $3"/"$2}') | Zombies: $(ps -eo stat | grep -c '^Z') | MemP: $(awk -F= '/some/{print $2}' /proc/pressure/memory | cut -d' ' -f1)%"

# === VM TUNING CHECK (catches cache bloat before it kills sessions) ===
sysctl vm.vfs_cache_pressure vm.min_free_kbytes && cat /proc/pressure/memory

# === FIND STUCK PROCESSES ===
ps -eo pid,etimes,pcpu,args --sort=-etimes | grep -E 'bun test|cargo test|vercel|git add' | awk '$2 > 3600'

# === FIND STALE GEMINI AGENTS (24+ hours) ===
ps -eo pid,etimes,pcpu,rss,args | grep 'bun.*gemini' | grep -v grep | awk '$2 > 86400 {print $1, int($2/3600)"h", $3"%", int($4/1024)"MB"}'

# === COUNT MCP SERVER BLOAT ===
ps aux | grep -E 'playwright|morphmcp' | grep -v grep | wc -l

# === FIND COMPETING BUILDS ===
ps aux | grep cc1plus | grep -oP 'target[^/]*/' | sort | uniq -c

# === FIND OLD AGENTS (16+ hours) ===
ps -eo pid,etimes,pcpu,args | grep -E 'claude --dangerously|codex --dangerously' | awk '$2 > 57600 {print $1, int($2/3600)"h", $3"%"}'

# === KILL OLD AGENTS (16+ hours) ===
ps -eo pid,etimes,args | grep -E 'claude --dangerously|codex --dangerously' | awk '$2 > 57600 {print $1}' | xargs -r kill

# === RENICE ALL COMPILATION ===
for pid in $(pgrep -f '/bin/cargo') $(pgrep cc1plus); do renice 19 -p $pid; ionice -c 3 -p $pid; done 2>/dev/null

# === ZELLIJ DEAD SESSION COUNT ===
zellij list-sessions 2>&1 | grep -c EXITED

Kill Hierarchy (Safest First)

Priority	Category	Examples	Risk
1	Zombies	Defunct processes (Z state)	Zero — already dead
2	Exited zellij/tmux sessions	`zellij delete-all-sessions`	Zero — already exited
3	Stuck tests	`bun test`, `cargo test` 12+ hours	Low — idempotent
4	Orphaned poll loops	zsh shells waiting on files that never appear	Low — wasted CPU
5	Stuck CLI	`vercel inspect`, `git add .` 5+ min	Low — restart-safe
6	Duplicate builds	Multiple `cargo check` same project	Low — keep newest
7	Old dev servers	`next dev`, `bun --hot` idle 24+ hours	Low — restart-safe
8	Stale gemini agents	`bun gemini` running 24+ hours	Medium — likely stuck
9	Old tmux sessions	`ntm-*` no activity	Medium — check first
10	Old agents	`claude`, `codex` 16+ hours	Medium — likely stuck
11	Active agents	`claude`, `codex` <16 hours	High — doing work
12	System processes	NEVER TOUCH	Forbidden

Protected Patterns (NEVER KILL)

systemd, sshd, dbus, cron, docker, containerd
postgres, mysql, redis, elasticsearch, nginx, caddy
wezterm-mux-server  ← ABSOLUTELY NEVER TOUCH — holds ALL agent sessions

SIGTERM vs SIGKILL

Some processes ignore SIGTERM. Always try SIGTERM first, wait 3s, escalate:

kill $PID; sleep 3; kill -0 $PID 2>/dev/null && kill -9 $PID

Known SIGTERM-ignorers: bun test — always needs SIGKILL after SIGTERM fails.
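
The escalation above can be wrapped in a small function so every kill path uses the same rule. This is a sketch in POSIX sh; `term_then_kill` is a hypothetical helper name, and the one-second polling loop is an assumption layered on the 3s wait described above.

```shell
# Sketch: SIGTERM first, escalate to SIGKILL only if the process survives.
# Usage: term_then_kill PID [WAIT_SECONDS]   (WAIT_SECONDS defaults to 3)
term_then_kill() {
  pid="$1"
  wait_s="${2:-3}"
  kill "$pid" 2>/dev/null || return 0        # already gone: nothing to do
  i=0
  while [ "$i" -lt "$wait_s" ]; do
    sleep 1
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    i=$((i + 1))
  done
  kill -9 "$pid" 2>/dev/null                 # still alive: force it
  return 0
}
```

Polling once per second returns as soon as the process exits instead of always sleeping the full wait.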

VM Tuning & Filesystem Cache Bloat (The Silent Killer)

Real-world incident (2026-02-23): On trj (499GB RAM, btrfs), vfs_cache_pressure=50 let btrfs inode/dentry caches balloon to 388GB page cache + 40GB slab. Memory pressure hit 18%. systemd-oomd killed user@1000.service, destroying the mux server and all 382 agent sessions instantly. The fix: vfs_cache_pressure=200 + min_free_kbytes=2GB + drop caches. Pressure dropped from 18% to 2.4% in minutes.

The Cache Bloat Pattern

High-RAM machines running many agents accumulate massive filesystem caches: the kernel holds on to dentries, inodes, and page cache (especially on btrfs). Tasks can then stall on memory even though RAM looks "free", because new allocations have to wait for the kernel to reclaim cache first.

Symptoms:

System feels sluggish despite free -h showing lots of "available" RAM
/proc/pressure/memory shows sustained avg10 > 5% (the key metric!)
kcompactd0 running at 2-5% CPU continuously
Slab cache (cat /proc/meminfo | grep Slab) is 20-40+ GB
vmstat 1 3 shows high si/so or bi/bo in the second and third samples (the first line is the average since boot)

Diagnose Cache Bloat

# 1. Check memory pressure (THE critical metric)
cat /proc/pressure/memory
# some avg10=18.78 → 18.78% of time tasks stalled on memory = BAD

# 2. Check VM tuning
sysctl vm.vfs_cache_pressure vm.min_free_kbytes

# 3. Check slab breakdown
sudo slabtop -o -s c | head -15
# Look for: btrfs_inode (GB), radix_tree_node (GB), dentry (GB), ext4_inode_cache (GB)

# 4. Check page cache vs actual usage
grep -E "Cached|Slab|SReclaimable|SUnreclaim|Dirty|MemAvail" /proc/meminfo

# 5. Check kcompactd (memory compaction daemon — should be ~0% CPU)
ps -o pid,pcpu,etime,cmd -p "$(pgrep -d, kcompactd)" 2>/dev/null   # -d, joins multiple kcompactd PIDs (NUMA) into one list
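
To put the slab numbers from step 4 in proportion, they can be expressed as a share of total RAM. A minimal sketch follows; `slab_pct` is a hypothetical helper that reads a meminfo-style file (values in kB) and defaults to /proc/meminfo.

```shell
# Sketch: report Slab and SReclaimable as a percentage of MemTotal.
slab_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^Slab:/         { slab = $2 }
    /^SReclaimable:/ { recl = $2 }
    END {
      if (total == 0) exit 1
      printf "slab=%.1f%% reclaimable=%.1f%%\n", 100 * slab / total, 100 * recl / total
    }
  ' "${1:-/proc/meminfo}"
}
```

Run `slab_pct` with no argument on a live machine, or pass a saved copy of /proc/meminfo to compare before and after a cache drop.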

Fix: Tune VM Parameters

Settings by filesystem and RAM size:

Machine Type	FS	vfs_cache_pressure	min_free_kbytes	Notes
499GB btrfs	btrfs	200	2GB (2097152)	btrfs caches are aggressive
251GB ext4	ext4	150	1-2GB (1048576-2097152)	ext4 is less cache-heavy
58GB ext4	ext4	150	512MB (524288)	VPS tier
29GB ext4	ext4	150	512MB (524288)	VPS tier
15GB ext4	ext4	150	256MB (262144)	Small VPS

# Apply immediately
sudo sysctl -w vm.vfs_cache_pressure=200 vm.min_free_kbytes=2097152

# Drop caches for immediate relief (only if pressure avg10 > 5%)
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

# Persist to sysctl conf
sudo tee /etc/sysctl.d/99-system-resource-protection.conf << 'EOF'
# Tuned for heavy agent workloads
vm.swappiness = 10
vm.vfs_cache_pressure = 200
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.min_free_kbytes = 2097152
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
vm.max_map_count = 2147483642
EOF

WARNING: The default vfs_cache_pressure=100 is dangerous on high-RAM btrfs machines. A value of 50 (often set by "desktop optimization" scripts) is even worse — it actively tells the kernel to hoard caches. Always check this on any machine that feels sluggish.
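
The settings table can be encoded as a lookup so fleet scripts apply consistent values. This is only a sketch: `vm_tuning` is a hypothetical helper, and the RAM cut-offs between table rows (400, 200, and 24 GB) are assumptions, since the table lists specific machines rather than ranges.

```shell
# Sketch: map filesystem + RAM size (GB) to sysctl settings from the table.
# Cut-offs between rows are assumed, not taken from the source.
vm_tuning() {
  fs="$1"; ram_gb="$2"
  case "$fs" in
    btrfs) vfs=200 ;;   # btrfs caches are aggressive
    *)     vfs=150 ;;   # ext4 and anything else
  esac
  if   [ "$ram_gb" -ge 400 ]; then min_free=2097152   # 2GB
  elif [ "$ram_gb" -ge 200 ]; then min_free=1048576   # 1GB (table allows 1-2GB)
  elif [ "$ram_gb" -ge 24 ];  then min_free=524288    # 512MB
  else                             min_free=262144    # 256MB
  fi
  printf 'vm.vfs_cache_pressure=%s vm.min_free_kbytes=%s\n' "$vfs" "$min_free"
}
```

Possible usage: `sudo sysctl -w $(vm_tuning btrfs 499)`, left unquoted on purpose so both assignments are passed as separate arguments.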

Fleet-Wide VM Audit

# Quick audit of VM tuning across fleet
for host in trj css csd vmi1152480 vmi1153651 vmi1156319 vmi1167313 vmi1227854 vmi1264463 vmi1293453; do
  echo -n "$host: " && ssh -o ConnectTimeout=5 $host \
    'printf "vfs=%s min_free=%sKB mem_pressure=%s\n" \
      $(sysctl -n vm.vfs_cache_pressure) \
      $(sysctl -n vm.min_free_kbytes) \
      $(awk -F= "/some/{print \$2}" /proc/pressure/memory | cut -d" " -f1)' 2>/dev/null || echo "UNREACHABLE"
done

systemd-oomd Protection (Preventing Session Massacres)

The worst-case scenario: systemd-oomd kills user@1000.service, which cascades to kill the wezterm-mux-server, destroying ALL agent sessions. This happened on trj when a single session peaked at 404GB and the user slice hit 496GB/536GB EffectiveMemoryMax.

Set Per-Session Memory Limits

Prevent any single session from consuming enough memory to trigger oomd:

# Cap individual sessions (prevents one runaway session from killing the user slice)
sudo mkdir -p /etc/systemd/system/session-.scope.d
sudo tee /etc/systemd/system/session-.scope.d/memory-limit.conf << 'EOF'
[Scope]
MemoryMax=64G
MemoryHigh=48G
EOF

# Cap the entire user slice (leave headroom for system)
# NOTE: Check for existing override.conf that might set MemoryMax=infinity
sudo mkdir -p /etc/systemd/system/user-1000.slice.d
# Edit existing override.conf if present, or create new:
sudo tee /etc/systemd/system/user-1000.slice.d/memory-limit.conf << 'EOF'
[Slice]
MemoryMax=460G
MemoryHigh=400G
EOF

sudo systemctl daemon-reload

CRITICAL: Check for pre-existing override files that set MemoryMax=infinity — these sort alphabetically after memory-limit.conf and will negate your limits. Consolidate all settings into a single file or name yours zz-memory-limit.conf.
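
The ordering problem can be checked before reloading. The sketch below reports which drop-in file in a single directory sets a key last in byte-wise name order. `effective_setting` is a hypothetical helper; it deliberately ignores drop-ins under /run and /usr/lib, so treat its answer as a hint and confirm with `systemctl show`.

```shell
# Sketch: which *.conf in DIR sets KEY last (lexical order)? Prints "file value".
# Runs in a subshell so the locale change does not leak into the caller.
effective_setting() (
  dir="$1"; key="$2"; winner=""
  LC_ALL=C; export LC_ALL                 # byte-wise glob ordering
  for f in "$dir"/*.conf; do
    [ -f "$f" ] || continue
    val=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$f" | tail -n 1)
    if [ -n "$val" ]; then
      winner="${f##*/} $val"              # later files overwrite earlier ones
    fi
  done
  [ -n "$winner" ] && printf '%s\n' "$winner"
)
```

Possible usage: `effective_setting /etc/systemd/system/user-1000.slice.d MemoryMax`. If the answer names `override.conf` with `infinity`, the limit in `memory-limit.conf` is not in effect.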

Verify Session Limits

# Check per-session limits
for scope in $(systemctl list-units --type=scope --state=running --no-legend --plain | awk '{print $1}' | grep '^session-'); do   # session-*.scope units belong to the system manager, not --user
  echo -n "$scope: " && systemctl show "$scope" -p MemoryMax 2>/dev/null
done

# Check user slice limit
systemctl show user-1000.slice -p MemoryMax -p MemoryHigh

Retain oomd Logs

By default, journald rotation can lose oomd kill logs. Ensure they survive:

sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/retain-oomd.conf << 'EOF'
[Journal]
SystemMaxUse=2G
SystemKeepFree=1G
MaxRetentionSec=7day
EOF
sudo systemctl restart systemd-journald

Fix resource-watchdog Service

If resource-watchdog.service is crash-looping with IOPRIO errors:

# Check status
systemctl --user status resource-watchdog.service

# The fix: IOSchedulingClass=realtime requires root — change to best-effort
# In ~/.config/systemd/user/resource-watchdog.service:
# IOSchedulingClass=realtime  →  IOSchedulingClass=best-effort

systemctl --user daemon-reload
systemctl --user restart resource-watchdog.service

Diagnosis

CPU Pressure (Critical for Sluggishness)

Load average can look "OK" while system feels sluggish. CPU pressure reveals the truth.

cat /proc/pressure/cpu
# some avg10=57.18 → 57% of time tasks waiting for CPU = BAD

Metric	Healthy	Warning	Critical
CPU pressure avg10	<10%	10-30%	>30%
IO pressure avg10	<5%	5-15%	>15%
Memory pressure avg10	<5%	5-20%	>20%
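
The thresholds in this table are easy to apply mechanically. A minimal sketch follows; `psi_avg10` and `psi_level` are hypothetical helper names, and treating a value exactly on a boundary as the lower severity is an assumption.

```shell
# Sketch: extract "some avg10" from a PSI file and classify it.
psi_avg10() {
  # $1 = PSI file, e.g. /proc/pressure/memory
  sed -n 's/^some avg10=\([0-9.]*\).*/\1/p' "$1"
}

psi_level() {
  # $1 = cpu|io|memory, $2 = avg10 value in percent
  case "$1" in
    cpu)    warn=10; crit=30 ;;
    io)     warn=5;  crit=15 ;;
    memory) warn=5;  crit=20 ;;
    *)      return 2 ;;
  esac
  awk -v v="$2" -v w="$warn" -v c="$crit" 'BEGIN {
    if (v > c)       print "critical"
    else if (v >= w) print "warning"
    else             print "healthy"
  }'
}
```

Possible usage: `psi_level memory "$(psi_avg10 /proc/pressure/memory)"`.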

Full Status Check (Linux)

uptime && nproc              # Load vs cores (danger: ratio > 1.5)
free -h                       # Memory (danger: available < 10%)
swapon --show                 # Swap config (danger: 0B total or near-full)
ps -eo stat | grep -c '^Z'   # Zombie count
cat /proc/sys/fs/file-nr     # File handles (allocated, free, max)
ps aux --sort=-%cpu | head -20 # Top CPU consumers
ps aux --sort=-%mem | head -10 # Top memory consumers
vmstat 1 3                    # IO wait, context switches, swap in/out
cat /proc/pressure/cpu        # CPU pressure
cat /proc/pressure/memory     # Memory pressure (THE key sluggishness metric)
cat /proc/pressure/io         # IO pressure
df -h / /data /tmp /data/tmp  # Disk space
sysctl vm.vfs_cache_pressure vm.min_free_kbytes  # VM tuning (cache bloat check)
grep -E "Slab|SReclaimable" /proc/meminfo         # Slab cache size

macOS Quick Status

top -l 1 -n 10 -stats pid,command,cpu,mem,state | head -20
memory_pressure               # System-wide free percentage: < 20% = warning, < 10% = critical

Swap & zram Management

The Swap Paradox

A machine can have 189GB free RAM yet feel sluggish because 30GB of process pages are stuck in swap from a past memory spike. The kernel doesn't proactively move pages back to RAM — they only fault back on access, causing latency spikes.

Symptom: Machine feels laggy, free -h shows plenty of available RAM but significant swap used.

Diagnose Swap Issues

swapon --show
free -h | grep Swap

Red flags:

Swap total = 0B — no safety net, OOM killer strikes without warning
Swap used >> 0 with lots of free RAM — past spike left pages in swap, causing latency
Swap near-full — next spike = OOM kill

Fix: Flush Swap Back to RAM

When swap is used but RAM is plentiful, flush to restore responsiveness:

# ONLY do this when free RAM >> swap used
sudo swapoff -a && sudo swapon -a
# Can take minutes for large swap usage (30GB ≈ 2-5 min)
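
The "free RAM >> swap used" precondition can be made explicit. This sketch reads `free -b` output on stdin and applies the same 2x margin that the Emergency Response section uses; `swap_flush_ok` is a hypothetical helper name.

```shell
# Sketch: succeed only if swap is in use AND available RAM exceeds 2x used swap.
# Expects the output of `free -b` on stdin.
swap_flush_ok() {
  awk '
    /^Mem:/  { avail = $7 }   # "available" column
    /^Swap:/ { used = $3 }    # "used" column
    END { exit !(used > 0 && avail > 2 * used) }
  '
}
```

Possible usage: `free -b | swap_flush_ok && sudo swapoff -a && sudo swapon -a`.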

Set Up zram Swap

zram creates compressed swap in RAM. Cold pages get compressed (2-3x ratio with lz4) instead of being evicted or hitting disk. Size at ~13% of total RAM.

RAM	Cores	zram Size	Algorithm
499GB	128	64GB	lzo-rle or lz4
251GB	64-128	32GB	lz4
62GB	16	8GB	lz4
31GB	8	4GB	lz4
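
For machines not listed in the table, the size can be derived from the ~13% rule. The sketch below rounds down to a power of two; that rounding is an assumption, chosen only because it reproduces the four rows above. `zram_size_gb` is a hypothetical helper name.

```shell
# Sketch: zram size in GB = ~13% of RAM, rounded down to a power of two.
zram_size_gb() {
  ram_gb="$1"
  target=$((ram_gb * 13 / 100))   # integer 13% of RAM
  size=1                          # never go below 1GB
  while [ $((size * 2)) -le "$target" ]; do
    size=$((size * 2))
  done
  printf '%s\n' "$size"
}
```

Possible usage: `echo "$(zram_size_gb 251)G" | sudo tee /sys/block/zram0/disksize`.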

# Activate immediately
sudo modprobe zram num_devices=1
echo lz4 | sudo tee /sys/block/zram0/comp_algorithm 2>/dev/null
echo 32G | sudo tee /sys/block/zram0/disksize   # Adjust per table above
sudo mkswap /dev/zram0
sudo swapon -p 100 /dev/zram0  # Priority 100 = prefer over disk swap

# Verify
swapon --show && zramctl

Persist zram Across Reboots

Machines silently lose zram after reboot without a systemd service:

sudo tee /etc/systemd/system/zram-swap.service > /dev/null << 'EOF'
[Unit]
Description=Configure zram swap
After=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/bash -c 'modprobe zram num_devices=1 && echo lz4 > /sys/block/zram0/comp_algorithm 2>/dev/null; echo 32G > /sys/block/zram0/disksize && mkswap /dev/zram0 && swapon -p 100 /dev/zram0'
ExecStop=/bin/bash -c 'swapoff /dev/zram0 2>/dev/null; echo 1 > /sys/block/zram0/reset 2>/dev/null'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable zram-swap.service

Adjust the echo 32G line to match the sizing table above for your machine.

Disk Space Cleanup

Common Disk Hogs on Agent Machines

df -h / /data /tmp /data/tmp
du -sh /tmp/* 2>/dev/null | sort -rh | head -20
du -sh /data/tmp/* 2>/dev/null | sort -rh | head -20

# Find stale Rust target dirs (>7 days old)
find /data -maxdepth 2 -type d \( -name 'cargo-target*' -o -name 'target-*' -o -name 'tmp-target-*' \) -mtime +7 2>/dev/null

# Large old files (>100MB, >3 days)
find /tmp /data/tmp -type f -size +100M -mtime +3 -exec ls -lh {} \; 2>/dev/null | head -30

Known Disk Hogs

Pattern	Location	Typical Size	Safe to Remove?
`rch_` / `rch-target-`	/tmp, /data/tmp	10-200GB each	Yes if not actively compiling
`rch_bolddesert` etc	/data/tmp	Up to 400GB	Yes — agent build caches
`codex_*_target`	/tmp	5-30GB each	Yes if idle
`tmp.XXXXXXXXXX` (mktemp)	/data/tmp	30-200GB	Check with `fuser` first
`cargo-target` / `target--build`	/data root	15-40GB each	Yes if >7 days old
`go-build*`	/data/tmp	1-5GB each	Yes — Go rebuilds automatically
`fsfs_index_snapshot*`	/tmp	1-2GB each	Yes

Before Removing: Check if In Use

fuser /data/tmp/suspect_dir 2>&1 | head -3
stat -c '%y' /data/tmp/suspect_dir   # Last modification time
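
Both checks can be combined into a single gate that runs before any `rm -rf`. This is a sketch: `removal_candidate` is a hypothetical helper, the 7-day default is borrowed from the table above, and like the `fuser` call above it only sees processes using the directory itself, not files deeper inside it.

```shell
# Sketch: succeed only if DIR is older than DAYS and no process is using it.
# Usage: removal_candidate DIR [DAYS]
removal_candidate() {
  dir="$1"; days="${2:-7}"
  [ -d "$dir" ] || return 1
  # -prune stops find from descending; prints DIR only if its mtime is old enough
  [ -n "$(find "$dir" -prune -mtime +"$days" 2>/dev/null)" ] || return 1
  # fuser exits 0 when at least one process is using the path
  if command -v fuser >/dev/null 2>&1 && fuser "$dir" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

Possible usage: `removal_candidate /data/tmp/suspect_dir 7 && echo "candidate for removal"`. The function only reports; deleting stays a separate, deliberate step.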

Zellij Session Cleanup

Agent swarms create hundreds of zellij sessions that pile up as EXITED. 268 exited sessions is common after a few weeks.

# Count exited sessions
zellij list-sessions 2>&1 | grep -c EXITED

# List active (non-exited) sessions
zellij list-sessions 2>&1 | grep -v EXITED | grep -v '^$'

# Delete ALL exited sessions (safe — they're already dead)
zellij delete-all-sessions --yes

# Delete a specific session
zellij delete-session SESSION_NAME

Exited sessions consume disk and memory for stored scrollback. Clean regularly.

NTM/Tmux Cleanup

# List sessions with ages
/usr/bin/tmux list-sessions -F '#{session_name} #{session_created}' | while read name created; do
  echo "$name: $(( ($(date +%s) - created) / 3600 ))h old"
done

# Kill test artifact sessions (safe)
/usr/bin/tmux list-sessions -F '#{session_name}' | grep -E '^ntm-(test|lifecycle|rapid)-' | \
  xargs -I{} /usr/bin/tmux kill-session -t {}

Orphaned Process Cleanup

Orphaned Poll Loops

Agent shells sometimes spawn zsh processes that poll forever for files that will never appear:

# Find orphaned zsh shells waiting on nonexistent files
ps aux | grep 'while.*sleep.*done' | grep -v grep
ps aux | grep 'zsh.*exit_code' | grep -v grep

# Find detached agent processes (no TTY = likely orphaned)
ps aux | grep -E 'claude|codex|cass' | grep -v grep | awk '$7 == "?" {print}'

MCP Server Bloat

Each claude/codex agent spawns ~4 MCP server processes (playwright npx, morphmcp npx, sh wrapper, playwright-mcp node). Stale agents leave these orphaned:

# Count MCP servers
ps aux | grep -E 'playwright|morphmcp' | grep -v grep | wc -l

# They die when their parent agent dies — kill the agent, not the MCP servers

Runaway `cass` / `am` Processes

These can burn 20%+ CPU each when stuck:

# Find long-running cass processes
ps -eo pid,etimes,pcpu,rss,comm | grep cass | awk '$2 > 86400 {print}'

Agent Swarm Meltdown Fix

The #1 cause of meltdowns: Multiple agents building the same project with different CARGO_TARGET_DIR.

Step 1: Detect Competing Builds

ps aux | grep cc1plus | grep -oP 'target[^/]*/' | sort | uniq -c
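
To turn that listing into a yes/no signal, count the distinct target directories. A minimal sketch, assuming compiler command lines arrive on stdin; `distinct_targets` is a hypothetical helper that uses the same pattern as the command above (with `grep -E` instead of `-P`).

```shell
# Sketch: count distinct target directories seen in command lines on stdin.
distinct_targets() {
  grep -oE 'target[^/]*/' | sort -u | wc -l | tr -d ' '
}
```

Possible usage: `[ "$(ps aux | grep cc1plus | distinct_targets)" -gt 1 ] && echo "competing builds detected"`.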

Step 2: Kill Confused OLD Agents (NOT the builds!)

ps -eo pid,etimes,pcpu,args | grep -E 'claude --dangerously|codex --dangerously' | \
  awk '$2 > 57600 {print $1, int($2/3600)"h"}' | \
  while read pid age; do echo "Killing old agent $pid ($age)"; kill $pid; done

Step 3: Renice Remaining Builds

for pid in $(pgrep -f '/bin/cargo') $(pgrep cc1plus); do
  renice 19 -p $pid; ionice -c 3 -p $pid
done 2>/dev/null

Step 4: Monitor Improvement

for i in 1 2 3; do
  sleep 10
  echo "$(date +%H:%M:%S) Load: $(awk '{print $1}' /proc/loadavg) | CPU pressure: $(awk -F= '/some/{print $2}' /proc/pressure/cpu | cut -d' ' -f1)%"
done

Fleet-Wide Triage

When managing multiple machines, triage sequentially via SSH (parallel SSH can cascade failures):

# Quick status across fleet (sequential to avoid SSH cascading failures)
for host in trj css csd vmi1149989 vmi1152480 vmi1153651 vmi1156319 vmi1167313 vmi1227854 vmi1264463 vmi1293453; do
  echo "=== $host ===" && ssh -o ConnectTimeout=10 $host \
    'df -h / /data 2>/dev/null | tail -n+2 && uptime && echo -n "zombies: " && ps -eo stat | grep -c "^Z"' 2>&1
  echo "---"
done

# Fleet VM tuning audit (catches cache bloat before it kills sessions)
for host in trj css csd vmi1149989 vmi1152480 vmi1153651 vmi1156319 vmi1167313 vmi1227854 vmi1264463 vmi1293453; do
  echo -n "$host: " && ssh -o ConnectTimeout=10 $host \
    'printf "vfs=%s min_free=%sKB mem_pressure=%s slab=%s\n" \
      $(sysctl -n vm.vfs_cache_pressure) \
      $(sysctl -n vm.min_free_kbytes) \
      $(awk -F= "/some/{print \$2}" /proc/pressure/memory | cut -d" " -f1) \
      $(grep SReclaimable /proc/meminfo | awk "{print \$2}")' 2>/dev/null || echo "UNREACHABLE"
done

# Fleet swap audit (machines silently lose swap after reboots)
for host in trj css csd vmi1149989 vmi1152480 vmi1153651 vmi1156319 vmi1167313 vmi1227854 vmi1264463 vmi1293453; do
  echo -n "$host: " && ssh -o ConnectTimeout=10 $host \
    "swapon --show --noheadings | wc -l; systemctl is-enabled zram-swap.service 2>/dev/null || echo 'NO_ZRAM_SERVICE'" 2>/dev/null || echo "UNREACHABLE"
done

# Find stuck bun tests across fleet
for host in trj css csd; do
  echo "=== $host ===" && ssh $host "ps -eo pid,etimes,args | grep 'bun test' | awk '\$2 > 43200 {print \$1, int(\$2/3600)\"h\"}'"
done

Kill Stuck Processes

Stuck Tests (12+ hours)

Note: bun test ignores SIGTERM — always escalate to SIGKILL.

for pid in $(ps -eo pid,etimes,args | grep 'bun test' | awk '$2 > 43200 {print $1}'); do
  kill $pid; sleep 3; kill -0 $pid 2>/dev/null && kill -9 $pid
done

Stuck Vercel/Git Commands

ps -eo pid,etimes,args | grep 'vercel' | awk '$2 > 600 {print $1}' | xargs -r kill
ps -eo pid,etimes,args | grep 'git add' | awk '$2 > 120 {print $1}' | xargs -r kill -9

Stale Gemini Agents (24+ hours)

ps -eo pid,etimes,args | grep 'bun.*gemini' | grep -v grep | awk '$2 > 86400 {print $1}' | xargs -r kill -9

Abandoned Dev Servers (24+ hours)

ps -eo pid,etimes,pcpu,args | grep -E 'next dev|bun --hot|vite' | \
  awk '$2 > 86400 && $3 < 1 {print $1}' | xargs -r kill
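
The pipelines in this section share one shape: filter `ps` output by age and command pattern. The sketch below factors that into a dry-run helper that prints PIDs without killing anything. `stale_pids` is a hypothetical name; it matches the pattern against the args column only, so PIDs and elapsed times cannot produce false matches.

```shell
# Sketch: print PIDs older than MIN_AGE seconds whose command line matches PATTERN.
# Expects `ps -eo pid,etimes,args` output on stdin.
stale_pids() {
  min_age="$1"; pattern="$2"
  awk -v min="$min_age" -v pat="$pattern" '
    $2 ~ /^[0-9]+$/ && $2 + 0 > min + 0 {
      args = $0
      sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+/, "", args)  # strip pid and etimes columns
      if (args ~ pat) print $1
    }
  '
}
```

Possible usage: review `ps -eo pid,etimes,args | stale_pids 43200 'bun test'` first, then pipe the same output into `xargs -r kill` once the list looks right.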

Emergency Response

When load > 2x CPU count:

# 1. Quick status
echo "LOAD: $(cat /proc/loadavg | awk '{print $1}')/$(nproc) cores"

# 2. Kill stuck tests (always safe — bun needs SIGKILL)
pgrep -f 'bun test' | while read pid; do
  age=$(ps -o etimes= -p $pid 2>/dev/null | tr -d ' ')
  [[ "$age" -gt 43200 ]] && kill -9 $pid && echo "Killed bun test $pid"
done

# 3. Kill stale gemini agents (24+ hours)
ps -eo pid,etimes,args | grep 'bun.*gemini' | grep -v grep | awk '$2 > 86400 {print $1}' | xargs -r kill -9

# 4. Delete exited zellij sessions
zellij delete-all-sessions --yes 2>/dev/null

# 5. Kill old Claude/Codex agents (16+ hours)
ps -eo pid,etimes,args | grep -E 'claude --dangerously|codex --dangerously' | awk '$2 > 57600 {print $1}' | xargs -r kill

# 6. Flush swap if used with free RAM available
used_swap=$(free -b | awk '/Swap:/{print $3}')
avail_ram=$(free -b | awk '/Mem:/{print $7}')
[[ $used_swap -gt 0 && $avail_ram -gt $((used_swap * 2)) ]] && sudo swapoff -a && sudo swapon -a && echo "Swap flushed"

# 7. Verify
sleep 5 && uptime && free -h | grep Swap
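
The trigger for this procedure (load > 2x CPU count) and the 1.5x danger ratio from the Full Status Check can be tested before running any of the steps. A minimal sketch; `load_state` is a hypothetical helper name.

```shell
# Sketch: classify 1-minute load relative to core count.
# Usage: load_state LOAD CORES
load_state() {
  awk -v l="$1" -v n="$2" 'BEGIN {
    if (n <= 0)       { print "unknown"; exit 1 }
    r = l / n
    if (r > 2)        print "emergency"
    else if (r > 1.5) print "danger"
    else              print "ok"
  }'
}
```

Possible usage: `load_state "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.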

RCH Health Check

rch daemon status --json 2>/dev/null | head -5  # Running?
rch workers probe --all                          # Workers reachable?
pkill -9 rchd; sleep 1; rch daemon start        # Quick fix

File Handle Exhaustion

cat /proc/sys/fs/file-nr                        # allocated, free, max
sudo sysctl -w fs.file-max=2097152              # Temporary increase
ulimit -n 65536                                  # Session increase
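
To judge whether raising the limit is warranted, the first and third fields of file-nr can be turned into a usage percentage. A minimal sketch; `fd_usage_pct` is a hypothetical helper that reads a file-nr style file and defaults to /proc/sys/fs/file-nr.

```shell
# Sketch: allocated file handles as an integer percentage of the system maximum.
fd_usage_pct() {
  awk '{ if ($3 > 0) printf "%d\n", 100 * $1 / $3; else exit 1 }' "${1:-/proc/sys/fs/file-nr}"
}
```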

Memory Relief

Quick Cache Drop (Safe)

free -h
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"  # Drops clean page cache, dentries, inodes
free -h

Full Memory Pressure Remediation

When memory pressure is sustained (avg10 > 5%), the issue is usually VM tuning, not a single process:

# 1. Check pressure
cat /proc/pressure/memory
# some avg10=18.78 → 18.78% stalled = BAD

# 2. Fix VM tuning (the root cause on most agent machines)
sudo sysctl -w vm.vfs_cache_pressure=200 vm.min_free_kbytes=2097152

# 3. Drop caches for immediate relief
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

# 4. Verify improvement (wait 30s, check again)
sleep 30 && cat /proc/pressure/memory
# avg10 should be dropping rapidly

# 5. Persist the fix
# Edit /etc/sysctl.d/99-system-resource-protection.conf

Slab Cache Investigation

When free -h shows huge buff/cache but no single process is responsible:

# What's in the slab cache?
sudo slabtop -o -s c | head -15

# Common offenders on btrfs:
# - btrfs_inode (7+ GB)      → vfs_cache_pressure too low
# - radix_tree_node (11+ GB) → page cache metadata
# - kmalloc-rnd-08-128       → kernel allocations

# Common offenders on ext4:
# - buffer_head (2+ GB)      → ext4 block metadata
# - ext4_inode_cache (1+ GB) → ext4 inode caching

Validation Checklist

uptime                                          # Load < nproc
free -h                                         # Available increased
cat /proc/pressure/memory                       # avg10 < 5% (THE key metric)
sysctl vm.vfs_cache_pressure vm.min_free_kbytes # vfs >= 150, min_free >= 256MB
swapon --show                                   # Swap present and not full
zramctl                                         # zram active with compression
ps -eo stat | grep -c '^Z'                      # No zombies
zellij list-sessions 2>&1 | grep -c EXITED      # No exited sessions
df -h / /data                                   # Disk not critically full
cat /proc/sys/fs/file-nr                        # FDs normalized
systemctl --user status resource-watchdog 2>/dev/null  # Not crash-looping
grep -E "Slab|SReclaimable" /proc/meminfo       # Slab not bloated (< 10% of RAM)

References

Topic	Reference
Agent swarm patterns	AGENT-SWARM-PATTERNS.md
Common scenarios	COMMON-SCENARIOS.md
Detailed triage steps	DETAILED-TRIAGE.md
Detailed diagnostics	DIAGNOSIS.md
Escalation ladder	ESCALATION.md
Prevention guide	PREVENTION.md
PT integration	PT-INTEGRATION.md
RCH integration	RCH-INTEGRATION.md
Wezterm recovery	WEZTERM-RECOVERY.md
Diagnostic script	scripts/diagnose-system.sh
Related skills	`ntm`, `vibing-with-ntm`, `rch`
