tokenkey-online-traffic-profile

Name: Tokenkey Online Traffic Profile
Author: youxuanxue

// Read-only TokenKey production/edge traffic-profiling workflow. Reconstructs per-minute request-traffic series for the past N hours per account — base RPM (request-start minute), sticky vs non-sticky (load-balance) RPM split, active sessions (idle-window), and peak concurrency — then compares each against its cap (base_rpm / rpm_sticky_buffer / max_sessions / concurrency) and flags which limit is being touched. Use when asked to profile online traffic, see per-minute RPM/session/concurrency, validate the admin account-card gauges (concurrency 1/8, $/window cost, sessions 16/30, RPM 3/28), or explain "no available accounts" / throttling without ad-hoc command guessing.

stars:2

forks:0

updated:May 23, 2026 at 13:40

related-skills.json

same repository

tokenkey-stage0-edge-lightsail-expansion.md

from "youxuanxue/sub2api"

End-to-end runbook for adding a TokenKey Stage0 Edge gateway on AWS Lightsail (parallel to the EC2/CFN path): register the edge in deploy/aws/lightsail/edge-targets-lightsail.json, ensure the one-time Lightsail IAM addon + GHCR PAT are in place, provision via deploy-edge-lightsail-stage0.yml, point DNS, smoke, and upgrade/rollback. EC2/CFN remains the default Edge path; this skill covers the Lightsail parallel path only.

2026-05-23 2

tokenkey-stage0-edge-lightsail-ip-rotation.md

from "youxuanxue/sub2api"

Rotate the egress Static IP of a TokenKey Stage0 Lightsail Edge (uk1-ls / us1-ls / fra1-ls / sg1-ls) when the live IP has been risk-blocked ("polluted") by Anthropic / OpenAI / Google. Mirrors the EC2 EIP rotation posture: a single primitive (ops/lightsail/rotate-static-ip.sh) swaps the Static IP, the operator updates Porkbun DNS, and external verification runs from a clean-egress host. No CloudFormation drift step because Lightsail Edge is not CloudFormation-owned.

2026-05-23 2

tokenkey-stage0-edge-ip-rotation.md

from "youxuanxue/sub2api"

Rotate / replace the egress Elastic IP of a TokenKey Stage0 edge (uk1/us1/sg1/fra1/…) when the live IP has been risk-blocked ("polluted") by an upstream API (Anthropic / OpenAI / Google). Drives the single canonical path: a workflow_dispatch of deploy-edge-stage0.yml with operation=rotate_egress_ip, which does a CFN-native UpdateStack — no detach, no IMPORT, no drift class. Auto-allocates a clean candidate (checked against edge-polluted-ips.json), swaps via CFN, verifies SSM Online + outbound IP + Anthropic/OpenAI/Google pollution probe from the edge itself, and auto-reverts on a polluted result. The only operator step that remains is the DNS A-record update at Porkbun (and committing the retired IP into edge-polluted-ips.json).

2026-05-23 2

tokenkey-anthropic-oauth-config.md

from "youxuanxue/sub2api"

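TokenKey Anthropic config write pipeline (snapshot → check → plan → apply → verify). **Three write surfaces**, all orchestrated by the same script ops/anthropic/manage-anthropic-config.py, and all following "SQL derived from JSON, no static templates, the operator writes no SQL": (A) the tier baseline of edge anthropic OAuth accounts (account fields such as concurrency / base_rpm / sticky_buffer / max_sessions), sourced from anthropic-oauth-stability-baselines-tiered.json; the same transaction updates the concurrency of users.id=1 to the sum of concurrency over the anthropic accounts with schedulable=true in that edge's database. (B) credentials.pool_mode + pool_mode_retry_count of the prod anthropic api-key mirror stubs (those whose base_url has the form api-*.tokenkey.dev), sourced from anthropic-stub-pool-baselines.json. (C) the prod stub concurrency mirror (plan-concurrency-mirror): a four-hop cascade that aligns edge users.id=1, the corresponding prod stub.concurrency, and prod users.id=1 to "Σ concurrency of schedulable=true anthropic accounts"; values are derived from live state and no new baseline JSON is introduced; the stub↔edge link is matched stably by the domain field in edge-targets.json, not inferred. group.rpm_limit is not written by this pipeline; it is set directly and independently in the admin UI.

2026-05-23 2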

tokenkey-anthropic-oauth-priority-by-window.md

from "youxuanxue/sub2api"

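TokenKey pipeline that reorders the priority of Anthropic OAuth accounts across all deployable edges (snapshot → plan → apply → verify). It scores each account by how much of its current 5h/7d usage window remains and reorders priority within the same stability tier (smaller wins): the more that remains, the smaller the priority value (scheduled first). It writes **only** the single field accounts.priority and does not touch the tier baseline, group.rpm_limit, or credentials. Orchestrated by the single script ops/anthropic/rebalance-anthropic-priority.py, with the write fixed in 1 SQL template.

2026-05-23 2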

tokenkey-online-log-troubleshooting.md

from "youxuanxue/sub2api"

Read-only TokenKey production/edge troubleshooting workflow for querying live logs, ops_error_logs, Docker containers, SSM targets, CI/deploy runs, and turning evidence into a stable root-cause summary without ad-hoc command guessing.

2026-05-23 2

package.json

"author": "youxuanxue"

"repository": "youxuanxue/sub2api"

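TokenKey: Online Request Traffic Profile (per-minute RPM / sticky / session / concurrency)

Turns the question "what did traffic and cap hits look like for a given account/edge over the past N hours" into a stable, read-only reconstruction workflow. It targets three things: traffic trends, cross-checking the four gauges on the admin account card, and attributing no available accounts / throttling.

The CLAUDE.md at the repository root is the authoritative rulebook. This skill is read-only: it runs only docker logs / psql SELECT / read-only redis-cli commands / aws ... describe|get|send-command (read-only scripts). Any config write, change to max_sessions/base_rpm, restart, or deployment requires separate explicit confirmation and must be handed to a write-side skill (tokenkey-anthropic-oauth-config and similar).

Environment identification (prod/edge instance resolution, container names, SSM execution, writing times in both UTC and local time, preferring small outputs) is identical to tokenkey-online-log-troubleshooting. This skill reuses its §1/§2/§3 rather than repeating them; only the parts specific to traffic profiling are written below.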

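Determinism baseline (mechanical vs. real judgment)

Self-audited against the dev-rules section "skill / command determinism baseline" in rules/dev-rules-convention.mdc.

Step	Type	Carried by
Resolve target (region / instance_id)	Mechanical	`deploy/aws/stage0/resolve-edge-target.py` + describe-stacks
SSM base64 delivery + send + poll (probe-caps.sh / probe-traffic-logs.sh / profile-traffic.py are all sent through it)	Mechanical	`ops/observability/run-probe.sh`
caps + unschedulable evidence + Redis snapshot + error clustering for the last 2h	Mechanical	`ops/observability/probe-caps.sh` (one JSON object per line, via `row_to_json`)
Pull access log + sticky.scheduler_entry → /tmp/acc.txt / /tmp/sse.txt	Mechanical	`ops/observability/probe-traffic-logs.sh`
Per-minute reconstruction of RPM / sticky / activeSess / conc	Mechanical	`ops/observability/profile-traffic.py`
Historical cost-window accumulation (calibrating the 5h gauge)	Mechanical	psql, derived from `usage_logs` (SQL at the end of SKILL §3)
The 8 trap patterns in §0 (base_rpm misjudgment, column-number trap, chained failure of mirror accounts, activeSess upper bound)	Judgment	prompt (architecture + judgment from past incidents)
§4 interpretation rules (which cap was hit)	Judgment	prompt (depends on the 503 / sticky vs non-sticky symptoms in the same period)
Two-hop attribution for mirror accounts (prod cooldown → real cause on the edge)	Judgment	prompt (requires the edge profile for the same period; a single hop is not a conclusion)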

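Invocation parameters

/tokenkey-online-traffic-profile target=<prod|edge:<id>|all-edges|domain> [hours=<N, default 1>] [minutes=<M>] [account=<id|name|all, default all>] [model=<name>] [path=/v1/messages] [allow_planned=false]

Parameter	Meaning
`target`	`prod`, `edge:us1`/`edge:uk1`/…, `all-edges` (= every edge with `deployable:true` in `edge-targets.json`; resolve first, then run them one by one; currently only us1 in practice), or a domain name. Determines region/instance.
`hours`	Look-back in hours. Note that docker logs only cover the container's `Up` duration: run `docker ps` first to see how long `tokenkey` has been up; logs older than that do not exist.
`minutes`	Sub-hour window; when the user says "the past 30 minutes", use `minutes=30`, which translates directly to `docker logs --since 32m` (2 extra minutes of buffer so that boundary minutes filtered by `completed_at` are complete). When `minutes` is given, `hours` is ignored.
`account`	Account id or name; with `all`, list the schedulable accounts of that platform first, then profile them.

Defaults: hours=1, account=all, path=/v1/messages, mode=read-only, bucket=minute. Planned edges are not queried unless allow_planned=true. Only minute buckets are supported at present; if a coarser bucket such as 5-min is needed, roll up the per-minute output instead of trying to shortcut it through FMT (strftime cannot express a 5-min bucket).

Resolving target=all-edges: the schedulable set = the entries with deployable:true in deploy/aws/stage0/edge-targets.json (use resolve-edge-target.py or read the JSON directly). Do not profile planned edges with deployable:false (uk1/sg1/fra1, etc.) unless allow_planned=true. Under the current matrix, all-edges resolves to us1 only.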
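The window and bucket rules above can be sketched in a few lines of Python. This is only an illustration: `since_arg` and `rollup` are hypothetical helper names, not functions from the repository, and the input to `rollup` is assumed to be a plain mapping from `HH:MM` to a request count.

```python
from collections import defaultdict


def since_arg(hours=1, minutes=None, buffer_min=2):
    """Value for `docker logs --since`.

    `minutes` wins over `hours` and gets `buffer_min` extra minutes so that
    boundary minutes are complete (minutes=30 becomes "32m").
    """
    if minutes is not None:
        return f"{minutes + buffer_min}m"
    return f"{hours}h"


def rollup(per_minute, width=5):
    """Sum per-minute request counts {"HH:MM": n} into `width`-minute buckets."""
    buckets = defaultdict(int)
    for hhmm, count in per_minute.items():
        hour, minute = map(int, hhmm.split(":"))
        bucket_start = minute - minute % width
        buckets[f"{hour:02d}:{bucket_start:02d}"] += count
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    print(since_arg(minutes=30))
    print(rollup({"01:03": 2, "01:04": 3, "01:05": 1}))
```

Summing is the natural aggregate for request counts; for a quantity such as peak RPM, taking the maximum inside each bucket would be the better choice.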

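0) Why per-minute reconstruction is required instead of just reading the gauges

The four numbers on the admin account card are instantaneous gauges that mostly read Redis and keep no per-minute history:

Card gauge	Redis key	History retained
🎛 Concurrency `cur/concurrency`	`concurrency:account:{id}` (zset of active slots)	❌ Current only; `wait:account:{id}` holds the waiting slots
💲 Window cost `$x/$limit`	`window_cost:account:{id}` (string; a cache bounded by the 5h window, ≠ a simple trailing sum)	❌ Current only; the underlying `usage_logs` has per-request cost
👥 Sessions `cur/max_sessions`	`session_limit:account:{id}` (zset; entries expire after `session_idle_timeout_minutes`, default 5, overridable via extra)	❌ Current only
🕐 RPM `cur/base_rpm [T]`	`rpm:{id}:{unixMinute}`, TTL=120s	❌ Only the last ~2 minutes

Conclusion: apart from the "current snapshot", per-minute values for the past N hours can only be reconstructed from the access log (http request completed) + sticky.scheduler_entry + usage_logs. This is the core of this skill.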
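As a small illustration of the key layout in the table, the sketch below builds the Redis key names for one account. `gauge_keys` is a hypothetical helper, and it assumes that `{unixMinute}` means Unix seconds floor-divided by 60, which the table does not state explicitly.

```python
def gauge_keys(account_id: int, now_unix: int) -> dict:
    """Redis keys behind the four card gauges for one account.

    Assumption: {unixMinute} is Unix seconds floor-divided by 60.
    """
    return {
        "concurrency": f"concurrency:account:{account_id}",
        "wait": f"wait:account:{account_id}",
        "window_cost": f"window_cost:account:{account_id}",
        "sessions": f"session_limit:account:{account_id}",
        "rpm": f"rpm:{account_id}:{now_unix // 60}",
    }


if __name__ == "__main__":
    for gauge, key in gauge_keys(4, 1_779_498_785).items():
        print(f"{gauge}={key}")
```

Because the RPM key changes every minute and expires after 120 s, a key for any minute older than about two minutes is simply gone, which is why the history has to come from logs.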

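Pitfalls already hit:

1. grep -c '429'/'529' produces false positives: it also matches substrings inside UUIDs, body_bytes, and latency_ms. To establish real upstream rate limiting/overload, parse the JSON or match rate_limit_error/overloaded_error; do not count bare numbers.
2. An instantaneous gauge ≤ cap does not mean the cap was never hit historically (the peak has passed, or the config was changed afterwards). Always reconstruct, and confirm the value the cap had during the incident window (e.g. max_sessions was changed from 16 to 30).
3. When an account is judged unavailable / no available accounts, any of the three local caps (concurrency/max_sessions/base_rpm) can trigger it without leaving a dedicated log, and the Debug-level sticky.layer* logs are off by default on prod; only reconstructed data can tell them apart.
4. Do not presume base_rpm first (the first version of this skill made exactly that misjudgment during troubleshooting). Rules of thumb:
- If, in the minute of no available, RPM<base_rpm and conc<concurrency (503 even under low load) → almost certainly the session side: compute global active sessions vs Σ(max_sessions).
- If the symptom is "sticky requests 200, non-sticky (new session / sticky miss) 503" → it is either the RPM yellow zone or sessions full; use the RPM series to tell them apart: RPM≥base means yellow zone, RPM<base means session.
- Only when some minute's RPM is truly ≥base_rpm does the base_rpm yellow/red zone come into question.
5. The reconstructed activeSess is an upper bound, not evidence of hitting the cap (symmetric to pitfall 2). §3 counts active sessions by de-duplicating session_hash over a trailing IDLE_MIN window, which is usually wider than the real zset's expiry behavior, so activeSess is often higher than the current sum of ZCARD session_limit:account:* and can even exceed Σ(max_sessions). activeSess>Σmax alone cannot establish a session cap hit; two further conditions must hold at the same time: "there really were 503 / no available in that period" and "the symptom is sticky 200, non-sticky failing". With zero 503s, activeSess crossing the line only means the session dimension has the least headroom and is worth watching, not that the cap was hit. To check: compare against the current ZCARD sum (the live ground truth) and the real failure count in that period.
6. Field-source confusion + the column-counting trap (hit on site on 2026-05-23). In the accounts table, half of the cap fields are top-level columns (concurrency / schedulable / rate_limited_at / rate_limit_reset_at / overload_until / temp_unschedulable_until / temp_unschedulable_reason / session_window_* / error_message) and half live in the extra JSON (base_rpm / rpm_strategy / rpm_sticky_buffer / max_sessions / session_idle_timeout_minutes / window_cost_limit / stability_tier). A same-named lookup such as extra->>'concurrency' returns NULL; accounts.concurrency must be used. The more dangerous failure mode: psql -t -A -F'|' prints 20+ columns as a purely positional |-separated string with no header, and counting columns by eye is almost guaranteed to go wrong (column 20, window_cost_limit=1500, was once read as column 18, max_sessions, flipping the conclusion from "session cap hit" to "upstream 503"). Hard rule: every cap / unschedulable-evidence query in this skill must use the codified row_to_json SQL given in §1, whose output looks like {"id":4,"max_sessions":"100","window_cost_limit":"1500",...}; the field name is glued to the value, so misreading a column is physically impossible. Free-form multi-column pipe-delimited SELECTs are forbidden. Downstream display must also be key=value only; free text read by column position, such as "a4: 28/20/100/8/1500/l5", is forbidden.
7. Chained failure / mirror accounts. On prod, the anthropic Key accounts named cc-<edge>-oauth (e.g. cc-us1-oauth / cc-uk1-oauth) have credentials whose upstream points at the corresponding edge domain (api-<edge>.tokenkey.dev). Any 5xx / no available accounts on the edge side is passed back to prod as an upstream 503; the anthropic_upstream_error keyword threshold rule in the prod routing layer counts these transient 503s, and once the threshold is reached (default 3/3) it writes temp_unschedulable_until on that prod account (tier-based cooldown, commonly 10m), at which point the admin UI shows the yellow "temporarily unschedulable" badge. Attribution rule: when prod shows temp_unschedulable_reason.matched_keyword='anthropic_upstream_error', the real cause is in the edge profile for the same period, not a local prod cap; the case is only closed after switching to the corresponding edge and running §1+§3. Treating the prod cooldown as the root cause = missing an edge capacity problem.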
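The first pitfall is easy to reproduce. The sketch below contrasts a naive substring count with a count that parses each log line as JSON. The sample lines and the `msg` field are made up for illustration; only `status_code`, `latency_ms`, `body_bytes`, and `account_id` are field names mentioned in this document.

```python
import json

# Hypothetical access-log lines; only the last one is a real 429.
SAMPLE = [
    '{"msg":"http request completed","status_code":200,"latency_ms":1429,"account_id":4}',
    '{"msg":"http request completed","status_code":200,"body_bytes":52900,"account_id":4}',
    '{"msg":"http request completed","status_code":429,"latency_ms":80,"account_id":4}',
]


def naive_count(lines, needle):
    """What `grep -c` does: count lines containing the bare substring."""
    return sum(needle in line for line in lines)


def status_count(lines, code):
    """Count lines whose parsed status_code field equals `code`."""
    total = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if record.get("status_code") == code:
            total += 1
    return total


if __name__ == "__main__":
    print("naive 429:", naive_count(SAMPLE, "429"))
    print("parsed 429:", status_count(SAMPLE, 429))
```

The naive count also picks up `latency_ms":1429`, which is exactly the kind of false positive described above.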

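1) First capture cap config + unschedulable evidence + current snapshot (call probe-caps.sh)

Field sources (see pitfall 6 for details):

Purpose	Field	Source
Identity	`id name platform type status`	Top-level columns of `accounts`
Scheduling switch / immediate concurrency cap	`schedulable` / `concurrency`	Top-level columns of `accounts` (not extra)
Temporarily unschedulable (admin "yellow badge")	`temp_unschedulable_until` / `temp_unschedulable_reason`(jsonb)	Top-level columns of `accounts`
Upstream error state	`rate_limited_at` / `rate_limit_reset_at` / `overload_until` / `error_message`	Top-level columns of `accounts`
Session window (used by some platforms)	`session_window_status` / `session_window_start` / `session_window_end`	Top-level columns of `accounts`
RPM cap	`base_rpm` / `rpm_strategy` / `rpm_sticky_buffer`	`accounts.extra` (jsonb)
Session cap	`max_sessions` / `session_idle_timeout_minutes`	`accounts.extra` (jsonb)
Cost-window cap (≠ number of sessions!)	`window_cost_limit`	`accounts.extra` (jsonb)
Stability tier	`stability_tier`	`accounts.extra` (jsonb)

window_cost_limit is the upper bound of the 5h cost window (unit USD or cents, depending on config), not max_sessions. This is where pitfall 6 blew up on site: both values are integers in the tens to thousands, so they are extremely easy to confuse when columns are misaligned.

Key fields of the temp_unschedulable_reason jsonb: matched_keyword (anthropic_upstream_error / rate_limit / …), until_unix, triggered_at_unix, status_code, error_message, rule_index, and the tier-based cooldown duration (written inside the error_message text, e.g. cooldown=10m0s tier=2).

The three RPM zones (code: Account.CheckRPMSchedulability / isAccountSchedulableForRPM):

buffer = rpm_sticky_buffer (if set), else concurrency + max_sessions, with a floor of base_rpm/5.
Green zone, RPM < base_rpm → any request can be scheduled.
Yellow zone, base_rpm ≤ RPM < base_rpm+buffer → sticky only (the non-sticky load-balancing path skips the account; see the isAccountSchedulableForRPM(acc,false) call).
Red zone, RPM ≥ base_rpm+buffer → not schedulable at all. With rpm_strategy=sticky_exempt there is no red zone.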
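The zone rules can be written down as a small Python sketch. It is not the Go implementation behind Account.CheckRPMSchedulability; the function names are hypothetical, and two details are assumptions: the base_rpm/5 floor is applied to the fallback value only, using integer division, and any rpm_strategy other than sticky_exempt keeps the red zone.

```python
def sticky_buffer(base_rpm, concurrency, max_sessions, rpm_sticky_buffer=None):
    """Buffer above base_rpm in which only sticky requests are scheduled.

    Assumption: the base_rpm/5 floor applies to the fallback value only.
    """
    if rpm_sticky_buffer:  # set and non-zero
        return rpm_sticky_buffer
    return max(concurrency + max_sessions, base_rpm // 5)


def rpm_zone(rpm, base_rpm, buffer, rpm_strategy="tiered"):
    """Classify one minute's RPM as green, yellow, or red."""
    if rpm < base_rpm:
        return "green"
    if rpm_strategy == "sticky_exempt" or rpm < base_rpm + buffer:
        return "yellow"
    return "red"


def schedulable(zone, is_sticky):
    """Green admits everything, yellow admits sticky only, red admits nothing."""
    return zone == "green" or (zone == "yellow" and is_sticky)


if __name__ == "__main__":
    buffer = sticky_buffer(base_rpm=28, concurrency=10, max_sessions=100,
                           rpm_sticky_buffer=20)
    for rpm in (27, 28, 47, 48):
        zone = rpm_zone(rpm, 28, buffer)
        print(f"rpm={rpm} zone={zone} non_sticky_ok={schedulable(zone, False)}")
```

With base_rpm=28 and rpm_sticky_buffer=20, the yellow zone spans 28 to 47 requests per minute and the red zone starts at 48.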

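schedulable=false (the gray "paused" switch in the admin UI) ≠ temp_unschedulable_until > now() (the yellow "temporarily unschedulable" badge in the admin UI): the former is switched off by hand, the latter is set automatically by a threshold rule. Attribution must keep them apart.

1.1 Calling probe-caps.sh (mechanical capture, zero prose SQL)

Codified in ops/observability/probe-caps.sh (dev-rules baseline "mechanization beats prompt inference": steps that can be mechanized are carried by scripts, and the prompt only describes the call interface and the real judgment). The script runs psql + redis-cli on the remote host and prints the sections below:

Section	Form	How to parse
`docker ps` block	Text table	By eye, or grep the container name
caps + unschedulable evidence	One JSON object per line (`row_to_json`)	`jq '.max_sessions'` / `json.loads`, reading values by field name
Redis snapshot	`redis_snapshot acct=N conc=N sess=N wait=N wcost=… rpm_now=N`	grep `field_name=`
ops_error_logs, last 2h	One JSON object per line	Same as the caps rows

The field name sits right next to the value, so miscounting columns is physically impossible. This is what enforces the hard constraint from pitfall 6.

Invocation (runs remotely inside SSM; everything is delivered through run-probe.sh):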

# prod and edge both go through the same wrapper; it handles region/instance resolution + base64 delivery + send + poll
bash ops/observability/run-probe.sh \
  --target prod \
  --script ops/observability/probe-caps.sh \
  --env PLATFORM=anthropic \
  --env ERR_HOURS=2

# Same for an edge (planned edges need ALLOW_PLANNED=1)
bash ops/observability/run-probe.sh \
  --target edge:us1 \
  --script ops/observability/probe-caps.sh \
  --env PLATFORM=anthropic

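Environment variables (the contract at the top of the script): PLATFORM (default anthropic), ERR_HOURS (default 2), ERR_LIMIT (default 150). New fields are added once, in the script only; the SKILL text no longer needs to be synced afterwards. Hand-written base64 delivery / send-command calls are forbidden: all drift points are consolidated inside run-probe.sh.

redis-cli stderr noise pitfall (observed): the container sets REDISCLI_AUTH, so even without -a, redis-cli may still write AUTH failed: ERR AUTH <password> called without any password configured to stderr. This is harmless noise: StandardOutputContent is correct; do not declare failure just because StandardErrorContent is non-empty.

1.2 Hard rules for reading the results (the execution side of pitfall 6)

caps rows: one JSON object per line. To get a field, use jq '.max_sessions' directly, or python3 -c "import sys,json;[print(json.loads(l).get('max_sessions')) for l in sys.stdin if l.strip()]". Counting columns by eye is forbidden.
redis_snapshot rows: grep sess= cannot be misaligned.
In the report to the user, every cap listed must use the "field_name: value" format (see §5). Free text read by column position, such as a4: 10/28/20/100/8/1500/l5, is forbidden; it is the second entry point for the pitfall 6 failure.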
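A minimal sketch of reading both line shapes by field name follows. The helper names are hypothetical, and the sample lines are shortened versions of the shapes quoted in this document rather than real probe output.

```python
import json


def parse_caps_line(line):
    """One row_to_json line -> dict keyed by field name."""
    return json.loads(line)


def parse_snapshot_line(line):
    """'redis_snapshot acct=4 conc=0 ...' -> (tag, {field: value})."""
    tag, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return tag, fields


def render(fields, names):
    """Emit the requested fields as key=value, never by position."""
    return " ".join(f"{name}={fields[name]}" for name in names)


if __name__ == "__main__":
    caps = parse_caps_line('{"id":4,"max_sessions":"100","window_cost_limit":"1500"}')
    tag, snap = parse_snapshot_line(
        "redis_snapshot acct=4 conc=0 sess=12 wait=0 wcost=- rpm_now=0")
    print(render(caps, ["id", "max_sessions", "window_cost_limit"]))
    print(tag, render(snap, ["acct", "sess"]))
```

Values that come from `extra` arrive as JSON strings (for example "100"), so they are kept as text here; converting them to integers is left to the caller.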

2) Pull the access log (call probe-traffic-logs.sh)

Codified in ops/observability/probe-traffic-logs.sh. What it does:

docker logs $CONTAINER --since $SINCE | grep -F 'http request completed' | grep -F "$PATH_KEY" > /tmp/acc.txt
docker logs $CONTAINER --since $SINCE | grep -F 'sticky.scheduler_entry'                        > /tmp/sse.txt

It also prints one line, probe_traffic_logs container=… since=… path_key=… acc_lines=N sse_lines=N; when both counts are 0 it writes a WARN line to stderr (hinting at log-format drift / a wrong container name / SINCE exceeding the docker retention window).

Environment: SINCE (default 1h; in minutes mode pass $((MIN+2))m), PATH_KEY (default /v1/messages), CONTAINER (default tokenkey).

Key fields of the http request completed JSON: account_id, path, status_code, latency_ms, completed_at (UTC, ...Z). sticky.scheduler_entry JSON: session_hash, sticky_account_id (>0 = sticky hit, 0 = no binding, goes through load balancing), sticky_source (prefetch/…), excluded_count (number of accounts pre-excluded by cooldown).
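The sticky vs non-sticky split follows directly from sticky_account_id. The sketch below counts both kinds from sticky.scheduler_entry lines; `sticky_split` is a hypothetical name, the sample lines are invented, and it assumes each log line contains one JSON object starting at the first `{`.

```python
import json
from collections import Counter


def sticky_split(lines):
    """Count sticky hits vs load-balanced entries by sticky_account_id."""
    counts = Counter()
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue  # no JSON payload on this line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        is_sticky = int(record.get("sticky_account_id") or 0) > 0
        counts["sticky" if is_sticky else "non_sticky"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"msg":"sticky.scheduler_entry","session_hash":"a1","sticky_account_id":4}',
        '{"msg":"sticky.scheduler_entry","session_hash":"b2","sticky_account_id":0}',
        '{"msg":"sticky.scheduler_entry","session_hash":"c3","sticky_account_id":0}',
    ]
    counts = sticky_split(sample)
    print(f"sticky={counts['sticky']} non_sticky={counts['non_sticky']}")
```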

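3) Per-minute reconstruction (call profile-traffic.py)

Codified in ops/observability/profile-traffic.py. It reads /tmp/acc.txt + /tmp/sse.txt and prints one line per minute:

min(UTC) | aN  :rpm/sR/conc/ok/bad … | nonStk actSess(g)

At the end, one line per account: acctN totals reqs=… rpm_max=… conc_max=… statuses={…}.

Delivery: same as §1.1, through ops/observability/run-probe.sh, which wraps base64 delivery + remote script fetch + env injection; hand-writing the full base64 / send-command chain is forbidden.

ACCTS / IDLE_MIN are derived remotely via psql (the operator does not fill them in by hand):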

PSQL='docker exec tokenkey-postgres psql -U tokenkey -d tokenkey -X -A -t'
ACCTS=$($PSQL -c "SELECT string_agg(id::text, ',' ORDER BY id) FROM accounts WHERE platform='anthropic' AND schedulable AND status='active';")
IDLE_MIN=$($PSQL -c "SELECT COALESCE(MAX(NULLIF(extra->>'session_idle_timeout_minutes','')::int), 5) FROM accounts WHERE platform='anthropic' AND schedulable AND status='active';")
ACCTS=$ACCTS IDLE_MIN=$IDLE_MIN python3 /tmp/profile-traffic.py

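The derivation + call above is a thin remote shell wrapper, delivered by the run-probe.sh mentioned in §1.1 (the script is transferred from a local file to the remote /tmp/). If this thin wrapper ends up being reused often, it should be extracted into a standalone driver script under ops/observability/ and added to the §1.1 tool table along with it; until then, do not treat this derivation prose as another contract.

env contract (the script docstring is the ground truth; only the key points are listed here):

env	Purpose	Default
`ACCTS`	Comma-separated account ids (required)	(none)
`IDLE_MIN`	Trailing window for session activity (minutes), = MAX(account.idle_min)	5
`PATH_KEY`	Path filter (must match §2)	`/v1/messages`
`FMT`	`strftime` format of the output time column (changes only the display format, not the bucket granularity; the bucket is always one minute)	`%H:%M`

In the per-minute table, ok / bad are the numbers of requests that finished with 200 / non-200, assigned to a minute by request start; rpm > ok in a minute means that minute had failures. For the status distribution over the whole window, see the statuses={…} dictionary at the end. Do not treat whole-window totals as per-minute values (an older template fell into this trap).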
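To make the reconstruction concrete, here is a simplified sketch of the idea, not the contents of profile-traffic.py. It assumes request start can be recovered as completed_at minus latency_ms, reports a single peak concurrency for the whole input instead of one per minute, and uses hypothetical function names throughout.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta


def parse_utc(stamp):
    """Parse an ISO timestamp ending in Z as an aware UTC datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def profile(access_lines, fmt="%H:%M"):
    """Return ({minute: {rpm, ok, bad}}, peak_concurrency)."""
    rows = defaultdict(lambda: {"rpm": 0, "ok": 0, "bad": 0})
    events = []  # (time, +1) at request start, (time, -1) at completion
    for line in access_lines:
        record = json.loads(line)
        end = parse_utc(record["completed_at"])
        # Assumption: start = completed_at - latency_ms.
        start = end - timedelta(milliseconds=record["latency_ms"])
        row = rows[start.strftime(fmt)]
        row["rpm"] += 1
        row["ok" if record["status_code"] == 200 else "bad"] += 1
        events += [(start, 1), (end, -1)]
    peak = current = 0
    # At equal timestamps -1 sorts before +1, so back-to-back requests
    # are not counted as overlapping.
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return dict(rows), peak


def active_sessions(entries, minute, idle_min=5):
    """Distinct session hashes seen in the window (minute - idle_min, minute]."""
    lower = minute - timedelta(minutes=idle_min)
    return len({h for seen_at, h in entries if lower < seen_at <= minute})


if __name__ == "__main__":
    lines = [
        '{"completed_at":"2026-05-23T01:13:05Z","latency_ms":10000,"status_code":200}',
        '{"completed_at":"2026-05-23T01:13:30Z","latency_ms":5000,"status_code":503}',
        '{"completed_at":"2026-05-23T01:13:40Z","latency_ms":20000,"status_code":200}',
    ]
    table, peak = profile(lines)
    for minute, row in sorted(table.items()):
        print(minute, " ".join(f"{k}={v}" for k, v in row.items()))
    print(f"conc_max={peak}")
```

The trailing window in `active_sessions` is the reason the reconstructed session count is an upper bound: a session stays counted for the full IDLE_MIN minutes after its last sighting.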

Per-minute cost (DB, separate SQL; the window_cost gauge is a 5h-window cache, so per-minute values come from usage_logs):

SELECT to_char(date_trunc('minute',created_at),'HH24:MI') min_utc, account_id,
       count(*) reqs, round(sum(total_cost),4) cost
FROM usage_logs
WHERE account_id = ANY($IDS) AND created_at >= now()-interval '$HOURS hours'
GROUP BY 1,2 ORDER BY 1,2;
-- 5h cumulative window, to calibrate the card's $ gauge:
SELECT account_id, round(sum(total_cost),2) cost_5h FROM usage_logs
WHERE account_id = ANY($IDS) AND created_at >= now()-interval '5 hours' GROUP BY 1;

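4) Interpretation rules (which parameter hit its cap)

Observation	Verdict
In some minute `rpm ≥ base_rpm` and non-sticky requests fail/`no available`	base_rpm yellow/red zone: non-sticky requests are squeezed out by the RPM gate.
`peak_conc ≈ concurrency` and new requests get 429 / queue timeout	concurrency cap hit (`Concurrency limit exceeded`, or `umq`/wait timeout).
`global activeSess > Σ(max_sessions)`, and there really were 503/`no available` in that period, and sticky is 200 while non-sticky fails	max_sessions cap hit: new sessions are rejected by `checkAndRegisterSession` (`ErrNoAvailableAccounts`, gateway_service.go ~line 2121), while already-bound sessions pass on a `ZSCORE` hit.
`activeSess > Σ(max_sessions)` but zero 503 (all 200)	Not hit: `activeSess` is the IDLE-window upper bound and is higher than the live `ZCARD` sum (see §0 pitfall 5). Conclusion = the session dimension has the least headroom and is worth watching, not a cap hit; check the current `ZCARD` sum and the real failure count.
RPM<base, conc<max, sess not saturated, yet still 503	Look upstream: parse the JSON for `rate_limit_error`/`overloaded_error`/`cooldown`/`rate_limit_reset_at`; do not count bare 429/529.
`nonStickyRpm` high, `activeSess` close to Σmax	A single CLI spawns many short sessions → the session side tops out first (typical when the edge has only 2 accounts).
prod account `temp_unschedulable_until > now()` and `temp_unschedulable_reason.matched_keyword='anthropic_upstream_error'`	Chained failure: prod accumulates the 503s passed back from the edge in its local threshold rule and cools down automatically. This is not the root cause: switch to the corresponding edge and run §1+§3 to find the real cause on the edge (max_sessions / concurrency / a real upstream 503). Responsibility for attribution lies with the edge profile for the same period.

The key inequality for judging session saturation: global active sessions > Σ(max_sessions over schedulable accounts) ⇒ some new session must miss. Always use the max_sessions value from the incident window (it may have been adjusted afterwards).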
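The table can be condensed into a rough first-pass triage function. This is only a sketch: the document treats this step as judgment, so the result is a starting hypothesis, not a verdict. The function name, the dictionary keys, and the order of the checks are all assumptions, and "≈" is approximated with ">=".

```python
def limit_touched(minute):
    """First-pass guess at which cap one minute of evidence points to."""
    if minute["failures"] == 0:
        # activeSess above the sum of max_sessions with zero 503 is not a hit.
        return "none"
    if minute["sticky_ok_nonsticky_fail"]:
        if minute["rpm"] >= minute["base_rpm"]:
            return "base_rpm"
        if minute["active_sess"] > minute["sum_max_sessions"]:
            return "max_sessions"
    if minute["peak_conc"] >= minute["concurrency"]:
        return "concurrency"
    return "upstream"


if __name__ == "__main__":
    # Numbers taken from the 01:09 evidence line in the output template.
    evidence = {
        "rpm": 15, "base_rpm": 28,
        "peak_conc": 3, "concurrency": 10,
        "active_sess": 102, "sum_max_sessions": 100,
        "failures": 3, "sticky_ok_nonsticky_fail": True,
    }
    print("limit_touched:", limit_touched(evidence))
```

The peak_conc value in the usage example is invented for illustration, since the template does not give one for that minute.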

4.1 Chained failure / mirror account identification

If an anthropic Key account on prod is named like cc-<edge>-oauth (e.g. cc-us1-oauth → api-us1.tokenkey.dev), the attribution path must be two-hop:

[edge real cause: max_sessions / concurrency / base_rpm / real upstream 503]
       │
       ▼ 503 passed through (upstream_status_code=503, body="no available accounts", etc.)
[prod routing layer, anthropic_upstream_error keyword threshold rule: count N/N]
       │
       ▼ writes accounts.temp_unschedulable_until = now()+cooldown(tier-based, commonly 10m)
[admin UI: yellow "temporarily unschedulable" badge]

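To confirm whether it is a mirror account: SELECT credentials->>'base_url' FROM accounts WHERE id=<prod_acct_id>; (or credentials->>'endpoint', depending on the field name). A base_url pointing at api-<edge>.tokenkey.dev/* means it is a mirror.

Operationally: first run §1 on prod to get temp_unschedulable_reason, derive the incident minute on the edge from triggered_at_unix (same second-level precision), then run §1+§3 on the edge and compare actSess / sRPM / nonStk / ZCARD-now / max_sessions / concurrency for those minutes; only then can the real cause be determined.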
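Turning triggered_at_unix into the UTC minutes to inspect on the edge is a one-liner worth getting right. In the sketch below, `incident_minutes` and its `before`/`after` defaults are hypothetical; the look-back exists because the threshold rule fires only after several 503s have accumulated, so the edge-side cause starts before the trigger.

```python
from datetime import datetime, timedelta, timezone


def incident_minutes(triggered_at_unix, before=5, after=1, fmt="%H:%M"):
    """UTC minute labels around the prod trigger time, oldest first."""
    trigger = datetime.fromtimestamp(triggered_at_unix, tz=timezone.utc)
    trigger = trigger.replace(second=0, microsecond=0)
    return [
        (trigger + timedelta(minutes=offset)).strftime(fmt)
        for offset in range(-before, after + 1)
    ]


if __name__ == "__main__":
    example = int(datetime(2026, 5, 23, 1, 13, 5, tzinfo=timezone.utc).timestamp())
    print(incident_minutes(example, before=2, after=1))
```

The labels use the same `%H:%M` default as FMT, so they can be matched against the per-minute table directly.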

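5) Output template

key=value / JSON is mandatory: every number must sit right next to its field name. Column-position free text (such as a4: 10/28/20/100/8/1500/l5) is forbidden; otherwise pitfall 6 will recur.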

target=<...>  time_window_utc=<..>..<..>  time_window_local=<..>

accounts:                                # one account per entry, field_name: value; no unnamed multi-value lines
- id=4 name=am-us-ec2-5-1-b status=active schedulable=true
    concurrency=10  base_rpm=28  rpm_strategy=tiered  rpm_sticky_buffer=20
    max_sessions=100  idle_min=8  window_cost_limit=1500  tier=l5
    session_window_status=allowed
    session_window_start=2026-05-22T21:00:00Z  session_window_end=2026-05-23T02:00:00Z
    temp_unschedulable_until=- temp_unschedulable_reason.matched_keyword=-

caps_snapshot(redis now):
- acct=4 conc=0 sess=12 wait=0 wcost=- rpm_now=0

peaks(past Nh):
- acct=4 rpm_max=19@01:13 conc_max=- activeSess_max(global)=108@01:10 reqs=N 5h_cost=$N
limit_touched: <base_rpm | concurrency | max_sessions | upstream | chained-from:<edge>>  confidence high|med|low
evidence:
- 01:09 UTC sRPM=9 nonStk=6 totalReq=15(<base=28) actSess=102(>max=100) 503=3 → max_sessions cap hit
- 01:14 UTC sRPM=13 nonStk=14 totalReq=27(≈base=28) actSess=88(<max=100) 503=3 → base_rpm yellow zone / conc borderline

per_minute_table: <see §3 output; if too long, write it to a file under $CLAUDE_JOB_DIR and return only a summary + path>

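Every number in the report must be greppable verbatim in the SSM stdout, either as field_name=value or as JSON "field_name":value. If it cannot be grepped, it was read off a column position; go back and re-capture with the codified script from §1.1.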
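This grep rule can be checked mechanically before a report goes out. The sketch below is one possible check, with a hypothetical function name; it accepts both the key=value form and the JSON form, including JSON values that are quoted strings.

```python
import re


def ungrounded_fields(report_pairs, stdout):
    """Return the report fields whose name/value pair is absent from stdout."""
    missing = []
    for name, value in report_pairs.items():
        text = re.escape(str(value))
        key_value = re.compile(
            r"(?<![\w.])" + re.escape(name) + "=" + text + r"(?![\w.])")
        json_form = re.compile(
            '"' + re.escape(name) + r'"\s*:\s*"?' + text + r'"?(?![\w.])')
        if not (key_value.search(stdout) or json_form.search(stdout)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    stdout = (
        '{"id":4,"max_sessions":"100","window_cost_limit":"1500"}\n'
        "redis_snapshot acct=4 conc=0 sess=12 wait=0 wcost=- rpm_now=0"
    )
    print(ungrounded_fields({"max_sessions": 100, "sess": 12}, stdout))
    print(ungrounded_fields({"max_sessions": 1500}, stdout))
```

A non-empty result means the report contains a value that was not read by field name, which is the situation the rule above is meant to catch.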

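If evidence is insufficient (e.g. docker logs do not cover the whole window, or the account has no traffic), say so and shorten hours or switch target; do not extrapolate. Mirror accounts (§4.1) must be given a two-hop attribution; a single-hop report counts as unfinished.

6) Handoff

If a cap needs changing (base_rpm/max_sessions/concurrency/priority) → do not write it within this skill; output a plan and hand it to tokenkey-anthropic-oauth-config (tier baseline / account fields) or the admin UI (group.rpm_limit is set independently). For similar problems where insufficient redundancy leaves requests without an account and produces 503s, see the ops memory and tokenkey-online-log-troubleshooting.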
