incident-response
Runbook for mass outages — prod down, en-masse client disconnects. Order of triage, who to notify, how to roll back.
| name | incident-response |
| description | Runbook for mass outages — prod down, en-masse client disconnects. Order of triage, who to notify, how to roll back. |
| type | prompt |
| whenToUse | User reports a mass outage (лежит "it's down", не работает у всех "not working for anyone", массово отваливаются "mass disconnects", прод down "prod is down", outage, срочно "urgent") or anything implying an incident affecting many users at once |
Before touching anything, answer: is this one user, some users, or everyone?
# How many users are paid/demo and supposedly active right now?
docker compose -f /opt/vpn-bot/docker-compose.yml exec -T vpn-bot python3 -c "
import sqlite3
c = sqlite3.connect('/var/lib/vpn-bot/bot.db')
for r in c.execute(\"SELECT status, COUNT(*) FROM users GROUP BY status\"): print(r)
"
# Is anyone connected to Xray right now?
ss -tn dst :443 | head -10
# -H drops the header line so the count is connections only
ss -Htn '( sport = :443 )' | wc -l
# Health endpoint
curl -s http://127.0.0.1:8080/health | head -c 200
If only 1-2 users complain → it's a user-specific issue, use the vpn-ops skill. If half or more complain → it's an incident.
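The scope rule above can be sketched as a small helper. This is a minimal sketch and not part of the bot: the name `classify_scope` is hypothetical, "half" is assumed to mean half of the currently active users, and the "unclear" middle band is an assumption, since the runbook only defines the two ends.

```python
def classify_scope(complaints: int, active_users: int) -> str:
    """Map a complaint count to a triage decision.

    Hypothetical helper. Thresholds follow the rule of thumb above:
    half or more of active users is an incident, 1-2 complaints is
    user-specific. Anything in between is reported as 'unclear'
    because the runbook does not define it (assumption).
    """
    if complaints < 1 or active_users < 1:
        raise ValueError("complaints and active_users must both be >= 1")
    # Check the incident condition first so that 2 complaints out of
    # 4 active users still counts as an incident.
    if complaints * 2 >= active_users:
        return "incident"
    if complaints <= 2:
        return "user-specific"
    return "unclear"
```

For example, `classify_scope(1, 80)` yields `"user-specific"`, while `classify_scope(40, 80)` yields `"incident"`.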
1. Exit node containers up? docker compose -f /opt/vpn-bot/docker-compose.yml ps
2. Xray inbound listening? ss -tlnp | grep :443
3. Reality params unchanged? docker exec 3x-ui cat /etc/x-ui/x-ui.json | python3 -m json.tool | grep -E 'publicKey|shortIds|serverName'
4. Entry node reachable? ssh entry-node 'uptime'
5. Entry iptables DNAT present? ssh entry-node 'iptables -t nat -S PREROUTING | grep 443'
6. Cert valid for the dashboard? echo | openssl s_client -connect <dashboard-host>:9443 2>/dev/null | openssl x509 -noout -dates
7. DNS resolving the entry host? dig +short <dashboard-host>
8. Disk full? df -h / /var/lib/docker
9. OOM killer fired? dmesg -T | grep -iE 'killed|oom' | tail -5
Stop at the first failed step — fix that, re-test, don't continue down the list with a known broken upstream.
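The "stop at the first failed step" rule can be sketched as a loop that runs checks in order and short-circuits. The names `first_failed_check` and `shell_ok` are hypothetical, and treating a non-zero exit status as failure is an assumption; some of the checks above (for example the cert dates) still need a human to read the output.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def shell_ok(command: str) -> bool:
    """Return True if the shell command exits with status 0 (assumed success signal)."""
    return subprocess.run(command, shell=True, capture_output=True).returncode == 0


def first_failed_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Later checks are not executed once one fails, mirroring the rule to
    fix the broken upstream before moving on. Returns None if all pass.
    """
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run counts as failed
        if not ok:
            return name
    return None


# Example wiring with two commands taken from the checklist above:
# first_failed_check([
#     ("xray inbound listening", lambda: shell_ok("ss -tlnp | grep -q :443")),
#     ("entry node reachable", lambda: shell_ok("ssh entry-node 'uptime'")),
# ])
```

Returning only the first failure keeps attention on one broken layer at a time instead of producing a full report of downstream symptoms.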
If the incident lasts >5 minutes, broadcast to users. The dashboard has a Broadcast UI, but from CLI:
# Mint admin token
TOKEN=$(cd /opt/vpn-bot && python3 -c "
import os; from dotenv import load_dotenv; load_dotenv()
from bot.utils.admin_token import make_admin_token
print(make_admin_token(os.environ['BOT_TOKEN'], '1652899'))
")
# Preview first (confirm=false). The text reads: "Technical work is under way, recovery within 15 minutes."
curl -s -X POST "http://127.0.0.1:8080/api/admin/broadcast?admin_token=$TOKEN" \
-H 'Content-Type: application/json' \
-d '{"text":"⚠️ Ведутся технические работы, восстановление в течение 15 минут.","audience":"active","confirm":false}'
# Send for real — ONLY after the user OKs the preview
curl -s -X POST "http://127.0.0.1:8080/api/admin/broadcast?admin_token=$TOKEN" \
-H 'Content-Type: application/json' \
-d '{"text":"...","audience":"active","confirm":true}'
Never send a broadcast without the user's explicit "OK, отправляй" ("OK, send it"). Each send notifies ~80 paid users, so a typo in the text is a public mistake.
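The preview-then-confirm flow can be sketched as a payload builder that defaults to a preview. This is a minimal sketch: `broadcast_payload` and its `user_approved` parameter are hypothetical names, and the field names are copied from the curl examples above rather than from the API's source.

```python
import json


def broadcast_payload(text: str, user_approved: bool = False,
                      audience: str = "active") -> str:
    """Build the JSON body for the broadcast endpoint.

    confirm is true only when the caller states that the user has
    explicitly approved the preview; otherwise the request stays a
    preview (confirm=false). Hypothetical helper, not part of the bot.
    """
    if not text.strip():
        raise ValueError("broadcast text must not be empty")
    body = {"text": text, "audience": audience, "confirm": bool(user_approved)}
    # ensure_ascii=False keeps Cyrillic and emoji readable in the body
    return json.dumps(body, ensure_ascii=False)
```

Defaulting `user_approved` to `False` means a forgotten argument produces a harmless preview rather than a real send.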
In order of preference (least → most destructive):
1. Restart the bot container. Fixes ~half of incidents that are stuck-state or memory-leak related.
docker compose -f /opt/vpn-bot/docker-compose.yml restart vpn-bot
2. Revert the last commit and redeploy. If the incident started right after docker compose up --build:
cd /opt/vpn-bot
git log --oneline -5
git revert <bad-sha> --no-edit
git push kimi-origin main
docker compose up -d --build vpn-bot
3. Restore from yesterday's backup tarball. Only after the user confirms:
ls -lt /opt/backups/*.tar.gz | head -3
# Show the user, get OK, then:
# ...stop containers, swap volume contents, restart...
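The backup listing step (`ls -lt ... | head -3`) can be sketched in Python for use from a script. The function name `newest_backups` is hypothetical; it only lists candidates and deliberately does not perform the restore, which the runbook leaves elided and gated on user confirmation.

```python
from pathlib import Path
from typing import List


def newest_backups(backup_dir: str, limit: int = 3) -> List[Path]:
    """Return up to `limit` *.tar.gz files in backup_dir, newest first by mtime."""
    tarballs = Path(backup_dir).glob("*.tar.gz")
    return sorted(tarballs, key=lambda p: p.stat().st_mtime, reverse=True)[:limit]


# Example: show the candidates to the user before touching anything.
# for path in newest_backups("/opt/backups"):
#     print(path)
```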
| Symptom | Most likely cause | First check |
|---|---|---|
| All users disconnected, Xray container down | OOM, restart loop | `dmesg -T \| tail`, `docker compose logs vpn-bot --tail 50` |
| New keys "fail to connect", old ones work | sid / pbk env passthrough broken in docker-compose | `docker exec vpn-bot env \| grep -E 'SID_VALUE\|REALITY'` |
| Bot polls but doesn't respond | Telegram rate limit or BOT_TOKEN revoked | `docker compose logs vpn-bot \| grep -iE '429\|401\|forbidden'` |
| Dashboard 502 | Caddy can't reach :8080 | `journalctl -u caddy -n 30`, `curl -s http://127.0.0.1:8080/health` |
| Mass disconnect every ~6 minutes | Entry iptables NAT timeout shorter than client keepalive | `ssh entry-node 'sysctl net.netfilter.nf_conntrack_*timeout*'` |
| Subscription panel HTTP 500 | Schema drift between database.py and prod DB | `docker compose logs vpn-bot \| grep 'no such column'` |
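Two of the log-based checks in the table can be sketched as a signature scan over log lines. The patterns come from the "First check" column; the function name `likely_causes` and the idea of mapping a match straight to a cause are illustrative assumptions. Bare numbers such as 429 or 401 can also appear in timestamps or IDs, so a match is a hint and not a diagnosis.

```python
import re
from typing import List

# Signatures taken from the "First check" column above; pairing each
# with a cause label is an illustrative assumption.
SIGNATURES = [
    (re.compile(r"429|401|forbidden", re.IGNORECASE),
     "Telegram rate limit or BOT_TOKEN revoked"),
    (re.compile(r"no such column"),
     "Schema drift between database.py and prod DB"),
]


def likely_causes(log_lines: List[str]) -> List[str]:
    """Return the distinct causes whose signature appears in the log, in table order."""
    found = []
    for pattern, cause in SIGNATURES:
        if any(pattern.search(line) for line in log_lines):
            found.append(cause)
    return found
```

Feeding it the output of `docker compose logs vpn-bot`, split into lines, gives a quick shortlist of table rows to check first.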
When the incident is closed:
Don't skip this: incidents repeat when the post-mortem is skipped.
- docker system prune (wipes images mid-restart).
- git push --force (can break the deploy on entry if entry pulls too).
- kimi-bridge (you'd lose your own conversation context).
- systemctl restart docker (kills both vpn-bot and 3x-ui at once, longer downtime).