ci-runner-utilization

Name: Ci Runner Utilization
Author: camunda

// Detect CI runner underutilization and give downsizing recommendations for cost savings. Queries BigQuery CPU/memory metrics from self-hosted runners in camunda/camunda, identifies overprovisioned jobs, and suggests smaller runner types. Use when asked about CI costs, runner sizing, resource waste, underutilization, or right-sizing runners.


stars: 4,151

forks: 778

updated: 28 May 2026, 10:50

name	ci-runner-utilization
description	Detect CI runner underutilization and give downsizing recommendations for cost savings. Queries BigQuery CPU/memory metrics from self-hosted runners in camunda/camunda, identifies overprovisioned jobs, and suggests smaller runner types. Use when asked about CI costs, runner sizing, resource waste, underutilization, or right-sizing runners.

CI Runner Utilization & Downsizing Analysis

Analyzes CPU and memory utilization of self-hosted CI runners in the camunda/camunda repository to find overprovisioned jobs and recommend cheaper runner types.

What costs money and what doesn't

Self-hosted runners cost money: they are Kubernetes pods on GCP or AWS, billed by core-hour. Their runner_type starts with gcp- (e.g., gcp-perf-core-16-default) or aws-. More cores mean higher cost. Downsizing from 16 to 8 cores roughly halves the per-job compute cost, provided the job's duration stays about the same.

GitHub-hosted runners are free for public repos — jobs on ubuntu-latest / ubuntu-slim have runner_type = NULL in BigQuery. Ignore them entirely for cost optimization.

CPU is the expensive resource. Memory is proportional to cores and much cheaper per unit. Focus downsizing decisions on CPU utilization; only check memory to ensure a smaller runner won't OOM.

Perf runners cost more than standard runners — gcp-perf-core-N uses faster CPUs than gcp-core-N. Only suggest downgrading perf→standard if the job doesn't need fast CPUs (e.g., linting, static analysis, artifact uploads).

Longrunning runners cost more than default — -longrunning has higher durability guarantees and costs more. Only needed for jobs that genuinely run long or are release-critical.
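
Because billing is by core-hour, the relative saving from a downsize can be estimated without knowing any prices. The sketch below is illustrative only: the `core_hours` helper is hypothetical, and it assumes the job's duration (as reported in `build_duration_milliseconds`) does not change on the smaller runner, which needs to be verified on a real run.

```python
def core_hours(cores: int, duration_ms: int) -> float:
    """Core-hours consumed by one job run: cores x wall-clock hours."""
    return cores * duration_ms / 3_600_000


# A 30-minute job on 16 cores versus the same job on 8 cores.
# Assumption: the duration stays at 30 minutes after downsizing.
thirty_minutes_ms = 30 * 60 * 1000
before = core_hours(16, thirty_minutes_ms)
after = core_hours(8, thirty_minutes_ms)
saving_ratio = 1 - after / before  # fraction of core-hours saved

print(before, after, saving_ratio)
```

If the job slows down on fewer cores, the saving shrinks accordingly, so compare durations before and after rather than relying on the core count alone.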

Runner type naming convention

Format: {cloud}[-{tier}]-core-{cores}-{durability} (the tier segment and its hyphen are omitted for standard runners)

| Component | Values |
| --- | --- |
| cloud | `gcp`, `aws` |
| tier | `perf` (fast CPU, more expensive) or absent (standard) |
| cores | `2`, `4`, `8`, `16` (number of vCPUs) |
| durability | `default` (cheap, preemptible), `release` / `longrunning` (expensive, durable) |

The available self-hosted runner types are listed at https://github.com/camunda/infra-global-github-actions/blob/main/actionlint/actionlint.yaml

Downsizing follows the same family: gcp-perf-core-16-default → gcp-perf-core-8-default.
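
The naming convention can be parsed mechanically, which makes it easy to derive the same-family label for a different core count. This is a minimal sketch, assuming every label follows the format above exactly; `RunnerType`, `parse_runner_type`, and `with_cores` are hypothetical helper names, and a generated label should still be checked against the list in actionlint.yaml before use.

```python
import re
from typing import NamedTuple, Optional


class RunnerType(NamedTuple):
    cloud: str
    tier: Optional[str]  # "perf" or None for standard runners
    cores: int
    durability: str


# Assumption: labels match {cloud}[-{tier}]-core-{cores}-{durability} exactly.
_PATTERN = re.compile(
    r"^(?P<cloud>gcp|aws)-(?:(?P<tier>perf)-)?core-(?P<cores>\d+)-(?P<durability>[a-z]+)$"
)


def parse_runner_type(label: str) -> RunnerType:
    """Split a runner label into its components, or raise ValueError."""
    m = _PATTERN.match(label)
    if m is None:
        raise ValueError(f"unrecognized runner type: {label!r}")
    return RunnerType(m["cloud"], m["tier"], int(m["cores"]), m["durability"])


def with_cores(runner: RunnerType, cores: int) -> str:
    """Build the label of the same family with a different core count."""
    tier = f"{runner.tier}-" if runner.tier else ""
    return f"{runner.cloud}-{tier}core-{cores}-{runner.durability}"


current = parse_runner_type("gcp-perf-core-16-default")
print(with_cores(current, 8))
```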

Prerequisites

- bq CLI authenticated with access to project ci-30-162810
  - Verify: bq query --use_legacy_sql=false 'SELECT 1'
- Data is in ci-30-162810.prod_ci_analytics.build_status_v2 (90-day retention)
- CPU/memory metrics were added on 2026-05-18; data availability starts from that date

How to analyze

Step 1: Identify underutilized self-hosted jobs

This query finds jobs whose CPU p95 never exceeds 50% of the runner's capacity in any recorded run (MAX(cpu_usage_ratio_p95) <= 0.5), grouped by job name and runner type. Only self-hosted runners (non-NULL runner_type starting with gcp- or aws-) are included.

bq query --use_legacy_sql=false --format=prettyjson '
SELECT
  job_name,
  runner_type,
  COUNT(*) AS samples,
  ROUND(AVG(cpu_usage_ratio_p95), 3) AS avg_cpu_p95,
  ROUND(MAX(cpu_usage_ratio_p95), 3) AS max_cpu_p95,
  ROUND(AVG(memory_usage_ratio_p95), 3) AS avg_mem_p95,
  ROUND(MAX(memory_usage_ratio_p95), 3) AS max_mem_p95
FROM `ci-30-162810.prod_ci_analytics.build_status_v2`
WHERE cpu_usage_ratio_p95 IS NOT NULL
  AND ci_url LIKE "%camunda/camunda%"
  AND runner_type IS NOT NULL
  AND (runner_type LIKE "gcp-%" OR runner_type LIKE "aws-%")
GROUP BY job_name, runner_type
HAVING MAX(cpu_usage_ratio_p95) <= 0.5
ORDER BY max_cpu_p95 ASC
'
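
To feed the query result into further analysis, the JSON output can be loaded into typed records. The sketch below assumes that `--format=prettyjson` prints a JSON array of objects keyed by the column aliases, with numeric values possibly encoded as strings; `load_rows` is a hypothetical helper, and the sample row (including the job name) is made up for illustration, not real data.

```python
import json


def load_rows(bq_json: str) -> list[dict]:
    """Convert bq JSON output into rows with numeric fields as numbers."""
    return [
        {
            "job_name": r["job_name"],
            "runner_type": r["runner_type"],
            "samples": int(r["samples"]),
            # float() accepts both "0.21" and 0.21, whichever bq emits.
            "max_cpu_p95": float(r["max_cpu_p95"]),
            "max_mem_p95": float(r["max_mem_p95"]),
        }
        for r in json.loads(bq_json)
    ]


# Illustrative input only; values and job name are hypothetical.
SAMPLE = """
[
  {"job_name": "example-lint-job", "runner_type": "gcp-perf-core-16-default",
   "samples": "120", "avg_cpu_p95": "0.15", "max_cpu_p95": "0.21",
   "avg_mem_p95": "0.2", "max_mem_p95": "0.3"}
]
"""
rows = load_rows(SAMPLE)
print(rows[0]["job_name"], rows[0]["max_cpu_p95"])
```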

Step 2: Get full utilization picture (all self-hosted jobs)

This shows all jobs sorted by CPU usage so you can see the full spectrum and identify the boundary between "needs downsizing" and "correctly sized":

bq query --use_legacy_sql=false --format=csv --max_rows=200 '
SELECT
  job_name,
  runner_type,
  COUNT(*) AS samples,
  ROUND(AVG(cpu_usage_ratio_p95), 3) AS avg_cpu_p95,
  ROUND(MAX(cpu_usage_ratio_p95), 3) AS max_cpu_p95,
  ROUND(AVG(memory_usage_ratio_p95), 3) AS avg_mem_p95,
  ROUND(MAX(memory_usage_ratio_p95), 3) AS max_mem_p95
FROM `ci-30-162810.prod_ci_analytics.build_status_v2`
WHERE cpu_usage_ratio_p95 IS NOT NULL
  AND ci_url LIKE "%camunda/camunda%"
  AND runner_type IS NOT NULL
  AND (runner_type LIKE "gcp-%" OR runner_type LIKE "aws-%")
GROUP BY job_name, runner_type
ORDER BY max_cpu_p95 ASC
'

Step 3: Check runner type distribution

Understand which runner types carry the most jobs and runs:

bq query --use_legacy_sql=false --format=prettyjson '
SELECT
  runner_type,
  COUNT(DISTINCT job_name) AS distinct_jobs,
  COUNT(*) AS total_runs,
  ROUND(AVG(cpu_usage_ratio_p95), 3) AS overall_avg_cpu_p95
FROM `ci-30-162810.prod_ci_analytics.build_status_v2`
WHERE ci_url LIKE "%camunda/camunda%"
  AND cpu_usage_ratio_p95 IS NOT NULL
  AND runner_type IS NOT NULL
GROUP BY runner_type
ORDER BY total_runs DESC
'

Step 4: Deep-dive a specific job (time series)

When you want to see if a job's usage is stable or has spikes over time:

bq query --use_legacy_sql=false --format=prettyjson '
SELECT
  report_time,
  job_name,
  runner_type,
  ROUND(cpu_usage_ratio_p95, 3) AS cpu_p95,
  ROUND(memory_usage_ratio_p95, 3) AS mem_p95,
  build_status
FROM `ci-30-162810.prod_ci_analytics.build_status_v2`
WHERE ci_url LIKE "%camunda/camunda%"
  AND job_name = "REPLACE_WITH_JOB_NAME"
  AND cpu_usage_ratio_p95 IS NOT NULL
ORDER BY report_time DESC
LIMIT 50
'

How to interpret results and make recommendations

Utilization metrics

- cpu_usage_ratio_p95: 95th-percentile CPU usage as a fraction of the container's CPU limit (0.0–1.0). A value of 0.25 on a 16-core runner means the job used ~4 cores at p95.
- memory_usage_ratio_p95: Same for memory. Check this to ensure a smaller runner won't OOM.
- Always use MAX(cpu_usage_ratio_p95) across runs, not just the average: you need to handle the worst case, not the typical case.
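
A small worked example shows why the maximum matters. The per-run values below are hypothetical; `effective_cores` is an illustrative helper that converts a ratio into cores used.

```python
from statistics import mean


def effective_cores(cpu_ratio: float, runner_cores: int) -> float:
    """Cores actually used at the given utilization ratio."""
    return cpu_ratio * runner_cores


# Hypothetical cpu_usage_ratio_p95 values from three runs of one job
# on a 16-core runner: two quiet runs and one busy run.
runs = [0.20, 0.22, 0.60]

avg_ratio = mean(runs)  # about 0.34: looks like a downsizing candidate
max_ratio = max(runs)   # 0.60: the busy run would not fit in 8 cores

print(effective_cores(avg_ratio, 16))  # about 5.4 cores on average
print(effective_cores(max_ratio, 16))  # about 9.6 cores in the worst run
```

Judged by the average, this job appears to fit an 8-core runner; judged by the worst run, it does not.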

Decision framework

| Max CPU p95 | Action | Confidence |
| --- | --- | --- |
| ≤ 25% | Downsize by 4x (16→4 cores) or 2x (8→4, 4→2) | High |
| 25–50% | Downsize by 2x (16→8, 8→4) | High |
| 50–65% | Borderline: downsize only with ≥50 samples | Medium |
| 65–80% | Keep current size | n/a |
| 80–100% | Correctly sized or consider upsizing | n/a |
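
The table can be expressed as a simple lookup. This is a sketch, not a definitive rule engine: `cpu_action` is a hypothetical name, and because adjacent ranges in the table share their endpoints, assigning each boundary value to the lower band is an assumption made here.

```python
def cpu_action(max_cpu_p95: float, samples: int) -> str:
    """Map the worst-case CPU p95 ratio to the action from the table.

    Assumption: a boundary value (0.25, 0.50, ...) belongs to the lower band.
    """
    if max_cpu_p95 <= 0.25:
        return "downsize by up to 4x"
    if max_cpu_p95 <= 0.50:
        return "downsize by 2x"
    if max_cpu_p95 <= 0.65:
        # Borderline band: act only when there is enough data.
        return "downsize (borderline)" if samples >= 50 else "keep (borderline, too few samples)"
    if max_cpu_p95 <= 0.80:
        return "keep current size"
    return "correctly sized or consider upsizing"


print(cpu_action(0.18, 200))
print(cpu_action(0.58, 20))
```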

Memory safety check

Before recommending a downsize, verify max_mem_p95:

If max_mem_p95 < 0.5 on the current runner, halving cores (and thus memory) keeps peak usage below the new limit.
A 4x step (16→4) quarters the memory, so it needs max_mem_p95 < 0.25. Between 0.25 and 0.5, step down one size only (16→8 instead of 16→4).
If max_mem_p95 ≥ 0.5, even halving would risk OOM. Keep the current runner.
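
The memory check amounts to projecting the observed ratio onto the smaller runner's limit. The sketch below assumes memory scales linearly with core count within a runner family, as stated earlier; `projected_mem_ratio` and `memory_safe` are hypothetical helpers, and actual memory limits should be confirmed from the runner definitions.

```python
def projected_mem_ratio(max_mem_p95: float, current_cores: int, target_cores: int) -> float:
    """Expected memory ratio on the target runner.

    Assumption: memory is proportional to cores, so halving the cores
    doubles the ratio for the same absolute memory usage.
    """
    return max_mem_p95 * current_cores / target_cores


def memory_safe(max_mem_p95: float, current_cores: int, target_cores: int) -> bool:
    """True if the projected peak stays below the target's memory limit."""
    return projected_mem_ratio(max_mem_p95, current_cores, target_cores) < 1.0


print(memory_safe(0.40, 16, 8))  # projected ratio about 0.8
print(memory_safe(0.40, 16, 4))  # projected ratio about 1.6
```

A projected ratio close to 1.0 leaves almost no margin, so treating values just under the limit as safe is a judgment call.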

If no suitable runner type can be found, suggest creating new runner types.

Sample count matters

- ≥ 100 samples: High confidence. Safe to act on.
- 30–99 samples: Medium confidence. Recommend with a note to monitor.
- < 30 samples: Low confidence. Flag for future review; don't act yet.
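
These thresholds translate directly into a small helper. In this sketch, `sample_confidence` is a hypothetical name, and counting exactly 100 samples as high confidence is an assumption about the boundary.

```python
def sample_confidence(samples: int) -> str:
    """Classify how much trust to place in a job's utilization numbers."""
    if samples >= 100:
        return "high"    # safe to act on
    if samples >= 30:
        return "medium"  # recommend, with a note to monitor
    return "low"         # flag for future review, don't act yet


for n in (12, 45, 250):
    print(n, sample_confidence(n))
```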

Forming the recommendation

For each underutilized job:

1. Note the current runner_type and extract the core count.
2. Multiply max_cpu_p95 by the core count to get effective cores used.
3. Find the smallest available runner type (same family) that provides ≥1.5x the effective cores.
4. Check memory won't OOM on the smaller runner.
5. State: job name, current runner, suggested runner, CPU headroom, memory headroom, sample count.

Example: A job with max_cpu_p95 = 0.25 on gcp-perf-core-8-default uses ~2 effective cores. A gcp-perf-core-4-default (4 cores) gives 2x headroom → recommend it.
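
The core-count part of these steps can be sketched as follows. `recommend_cores` is a hypothetical helper, the core sizes come from the naming-convention table, and the memory check still has to be applied separately to whatever this returns.

```python
AVAILABLE_CORES = (2, 4, 8, 16)  # sizes listed in the naming convention


def recommend_cores(max_cpu_p95: float, current_cores: int, headroom: float = 1.5) -> int:
    """Smallest available core count giving >= headroom x effective cores.

    Never returns more than the current size; upsizing is out of scope here.
    """
    effective = max_cpu_p95 * current_cores
    needed = effective * headroom
    for cores in sorted(AVAILABLE_CORES):
        if cores >= needed:
            return min(cores, current_cores)
    return current_cores


# The example above: 0.25 on 8 cores -> 2 effective cores -> 4-core runner.
print(recommend_cores(0.25, 8))
# Same ratio on 16 cores -> 4 effective cores, 6 needed -> 8-core runner.
print(recommend_cores(0.25, 16))
```

Note that the 1.5x headroom rule can be more conservative than the decision table: at a ratio of 0.25 on a 16-core runner it selects 8 cores rather than 4. When the two disagree, choosing the larger of the two suggestions is the safer reading.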


Implementing the recommendation

For each underutilized job:

1. Ask the user for confirmation to apply the recommendation.
2. Find the GitHub Actions workflow YAML file that contains the job, and adjust its runs-on: label.
3. Offer to commit and push the changes to a pull request, and observe the CI runtime behavior on that PR.
4. Confirm that job run times on the PR do not increase meaningfully.
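
The label change in the workflow file can be sketched as a text substitution. This is a rough illustration under several assumptions: `replace_runs_on` is a hypothetical helper, it only handles the plain scalar form `runs-on: <label>` (not lists, quoted values, or expressions), and it rewrites every job in the file that uses the old label, so the resulting diff must be reviewed to confirm only the intended job changed. The workflow snippet and job names are made up.

```python
import re


def replace_runs_on(workflow_text: str, old_label: str, new_label: str) -> tuple[str, int]:
    """Swap old_label for new_label on plain `runs-on:` lines.

    Returns the updated text and the number of lines changed.
    """
    pattern = re.compile(
        r"^(?P<prefix>[ \t]*runs-on:[ \t]*)"
        + re.escape(old_label)
        + r"(?P<suffix>[ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    return pattern.subn(lambda m: m["prefix"] + new_label + m["suffix"], workflow_text)


# Hypothetical workflow content for illustration.
EXAMPLE = """\
jobs:
  lint:
    runs-on: gcp-perf-core-16-default
    steps: []
  build:
    runs-on: gcp-perf-core-16-longrunning
    steps: []
"""

updated, changed = replace_runs_on(EXAMPLE, "gcp-perf-core-16-default", "gcp-perf-core-8-default")
print(changed)
print(updated)
```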

BigQuery table schema reference

Table: ci-30-162810.prod_ci_analytics.build_status_v2 (90-day retention)

| Column | Type | Description |
| --- | --- | --- |
| `report_time` | TIMESTAMP | When the row was submitted |
| `ci_url` | STRING | `https://github.com/{owner}/{repo}` |
| `workflow_name` | STRING | GitHub Actions workflow name |
| `job_name` | STRING | Job identifier |
| `build_id` | STRING | `{run_id}/{attempt}` |
| `build_trigger` | STRING | Event name (push, pull_request, schedule, etc.) |
| `build_status` | STRING | success, failed, cancelled |
| `build_ref` | STRING | Git ref |
| `build_base_ref` | STRING | Target branch (PRs/merge queue) |
| `build_head_ref` | STRING | Source branch (PRs) |
| `build_duration_milliseconds` | INTEGER | Job duration |
| `runner_name` | STRING | Runner hostname |
| `runner_arch` | STRING | CPU architecture (x86_64, aarch64) |
| `runner_os` | STRING | OS (linux, windows) |
| `runner_type` | STRING | Self-hosted runner label (NULL for GitHub-hosted) |
| `cpu_usage_ratio_avg` | FLOAT64 | Average CPU utilization (0.0–1.0) |
| `cpu_usage_ratio_p95` | FLOAT64 | 95th percentile CPU utilization (0.0–1.0) |
| `memory_usage_ratio_avg` | FLOAT64 | Average memory utilization (0.0–1.0) |
| `memory_usage_ratio_p95` | FLOAT64 | 95th percentile memory utilization (0.0–1.0) |
| `user_reason` | STRING | User-provided failure reason |
| `user_description` | STRING | User-provided details |

Data collection pipeline

- The start-build-monitor action starts a background monitor (5s polling) that collects CPU/memory from cgroups v2/v1 or /proc/.
- The submit-build-status action stops the monitor, aggregates stats (avg, p95) via AWK, reads runner_type from /home/runner/.camunda-arc-runner-info/runs-on, and POSTs to BigQuery.
- Metrics are normalized ratios relative to the container's CPU/memory limits.

Source: camunda/infra-global-github-actions/start-build-monitor/ and camunda/infra-global-github-actions/submit-build-status/
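
For intuition about what the avg and p95 columns represent, the aggregation of polled samples can be sketched as below. This is not the actual pipeline code, which uses AWK: `aggregate` is a hypothetical function, and the nearest-rank percentile method is an assumption, since the source does not say which percentile definition the action uses.

```python
def aggregate(ratios: list[float]) -> dict[str, float]:
    """Average and nearest-rank 95th percentile of per-poll usage ratios."""
    if not ratios:
        raise ValueError("no samples collected")
    ordered = sorted(ratios)
    n = len(ordered)
    # Nearest-rank p95: 1-based rank ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {"avg": sum(ordered) / n, "p95": ordered[rank - 1]}


# Hypothetical job: 19 quiet polls and one short burst.
samples = [0.10] * 19 + [0.90]
stats = aggregate(samples)
print(stats)
```

With 20 polls, a single burst does not move the p95, which is one reason the analysis looks at the maximum p95 across many runs rather than a single run.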
