pipeline-investigation

Name: Pipeline Investigation
Author: dotnet

Investigate AzDO pipeline failures beyond Helix: build errors, infra tooling crashes, validation test flakiness, artifact cascade failures. USE FOR: "why did the unified-build fail", "what's breaking the pipeline", "how often does this failure occur", "drill into build task logs", "1ES scan failures", "SourcelinkTests flaky", "NetAnalyzers build error", analyzing AzDO build timelines and task logs, failure frequency/trend analysis. DO NOT USE FOR: Helix test failures (use helix-investigation), CI status overview (use ci-analysis), codeflow PRs (use flow-analysis). INVOKES: AzDO, Helix, and binlog MCP tools, az CLI for internal auth, gh CLI.

stars: 8

forks: 15

updated: 17 April 2026 at 19:34

Same repository

flow-analysis.md

from "dotnet/arcade-skills"

Analyze VMR codeflow health using maestro MCP tools and GitHub MCP tools. USE FOR: investigating stale codeflow PRs, checking if fixes have flowed through the VMR pipeline, debugging dependency update issues, checking overall flow status for a repo, diagnosing why backflow PRs are missing or blocked, subscription health, build freshness, URLs containing dotnet-maestro or "Source code updates from dotnet/dotnet". DO NOT USE FOR: CI build failures (use ci-analysis skill), code review (use code-review skill), general PR investigation without codeflow context, tracing whether a specific commit/PR has reached another repo (use flow-tracing skill). INVOKES: maestro and GitHub MCP tools, flow-health.cs script.

2026-05-22

flow-tracing.md

from "dotnet/arcade-skills"

Trace dependency flow across .NET repos through the VMR pipeline. USE FOR: checking if a PR/commit from repo A has reached repo B, finding what runtime SHA is in an SDK build, tracing dependency versions through the VMR, checking if a commit is included in an SDK build, decoding SDK version strings, "has my fix reached runtime", "did roslyn#80873 flow to runtime", "what SHA is in SDK version X", cross-repo dependency tracing, mapping SDK versions to VMR commits. DO NOT USE FOR: codeflow PR health or staleness (use flow-analysis skill), CI build failures (use ci-analysis skill). INVOKES: maestro and GitHub MCP tools, Get-SdkVersionTrace.ps1 script.

2026-05-22

ci-analysis.md

from "dotnet/arcade-skills"

Analyze CI build and test status from Azure DevOps and Helix for dotnet repository PRs. Use when checking CI status, investigating failures, determining if a PR is ready to merge, or given URLs containing dev.azure.com or helix.dot.net. Also use when asked "why is CI red", "test failures", "retry CI", "rerun tests", "is CI green", "build failed", "checks failing", or "flaky tests". DO NOT USE FOR: investigating stale codeflow PRs or dependency update health, tracing whether a commit has flowed from one repo to another, reviewing code changes for correctness or style.

2026-05-22

known-issue-history.md

from "dotnet/arcade-skills"

Analyze historical failure rates for Known Build Error issues by mining the edit history of issue bodies. Use when asked "when did this last fail", "failure history", "failure rate", "is this issue still active", "flaky test history", "known issue activity", or "most active known issues".

2026-05-22

binskim-analysis.md

from "dotnet/arcade-skills"

Investigate BinSkim SDL findings from official pipelines — understand Guardian filtering, compare raw vs merged SARIF, decode portal results, and determine fix ownership. Use when asked about SDL scan results, portal findings, Guardian filtering, rule meanings, or discrepancies between local and official results. Also use when asked "why does the portal show X", "what's filtered", "explain Guardian", "investigate SDL findings", "portal BA2008", "binskim failures in pipeline", or "what rules are required". DO NOT USE FOR: running BinSkim locally (use binskim-scan), source code analysis (use CodeQL), or credential scanning (use CredScan).

2026-05-21

binskim-scan.md

from "dotnet/arcade-skills"

Run BinSkim binary security analysis locally against a dotnet repository. Use when asked to scan binaries, check BinSkim compliance, verify a fix for a rule violation, or run a local SDL scan. Also use when asked "run binskim", "binary security scan", "scan binaries", "check binskim", "verify my fix", "repro BA2008 locally", or "verify BA2008 fix". DO NOT USE FOR: investigating official pipeline results or portal findings (use binskim-analysis), source code analysis (use CodeQL), credential scanning (use CredScan), or general build/test failures (use ci-analysis).

2026-04-17

package.json

"author": "dotnet"

"repository": "dotnet/arcade-skills"

Pipeline Investigation

Investigate AzDO pipeline failures that aren't Helix test failures — build errors, infrastructure tooling crashes, validation test flakiness, and artifact cascade failures. Complements helix-investigation by covering everything else in the pipeline.

When to Use This Skill

User has an AzDO build URL with a non-Helix failure (build step, validation, infra task)
User asks "why did the pipeline fail" and the failure is in a build/scan/validation task
User wants to know how often a specific failure occurs (frequency/trend analysis)
User sees exit code null, 1ES PT errors, or MSBuild failures
User wants to understand artifact cascade failures ("missing artifacts from prior build")
User asks about SourcelinkTests, Binary Analysis Scan, or installer validation failures

Output Formats

This skill produces two distinct report types. Match the format to the request:

Health Assessment ("pipeline health", "are builds passing", "pipeline status")

Follow references/health-assessment-format.md. Output MUST include the first two tables below; the third is conditional:

1. Failed Builds Table: every failed build, classified and investigated. Columns: | Build | Type | Source | Failure Detail |
   - Type: Rolling, Forward Flow, or Other PR (classify via gh pr view)
   - Source: Rolling → branch name. Forward Flow → target ← source-repo. Other PR → short description.
2. Summary Table: pass/fail breakdown by build type. Columns: | Type | Completed | ✅ Pass | ❌ Fail | Pass Rate |
3. Failure Trends Table (conditional: include when 3+ builds are in scope and at least one pattern recurs; cap at top 5). Columns: | Pattern | Hits | Window | Status |
   - Status: ❌ No issue filed, ✅ Fix merged, 🔄 Known issue (link), ⏳ Fix in progress

See the reference for build classification rules, branch filtering, and codeflow analysis methodology.
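As a rough illustration of the Summary Table, the sketch below tallies pass/fail counts per build type and renders the rows. It assumes each build has already been classified into a hypothetical `type` field, and it treats every result other than `succeeded` as a failure; the reference file may define this differently, so treat both choices as assumptions.

```python
from collections import defaultdict

def summary_rows(builds):
    """Render Summary Table rows from already-classified builds."""
    totals = defaultdict(lambda: {"pass": 0, "fail": 0})
    for b in builds:
        # Assumption: anything other than "succeeded" counts as a failure.
        key = "pass" if b["result"] == "succeeded" else "fail"
        totals[b["type"]][key] += 1
    rows = [
        "| Type | Completed | ✅ Pass | ❌ Fail | Pass Rate |",
        "|---|---|---|---|---|",
    ]
    for build_type in sorted(totals):
        passed = totals[build_type]["pass"]
        failed = totals[build_type]["fail"]
        rate = passed / (passed + failed)
        rows.append(f"| {build_type} | {passed + failed} | {passed} | {failed} | {rate:.0%} |")
    return rows

# Hypothetical, made-up input for illustration only.
example_builds = [
    {"type": "Rolling", "result": "succeeded"},
    {"type": "Rolling", "result": "failed"},
    {"type": "Rolling", "result": "succeeded"},
    {"type": "Forward Flow", "result": "failed"},
]
print("\n".join(summary_rows(example_builds)))
```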

Save the report: After presenting the health assessment, save it to reviews/pipeline-health-<slug>-YYYY-MM-DD-HHMM.md in the repo, where <slug> identifies the pipeline or scope (e.g., unified-build, runtime-ci). Include the timestamp to ensure uniqueness across multiple investigations per day. Include a Methodology section at the end documenting data sources and classification approach.
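The naming rule above can be sketched as a small helper. The function name `report_path` is hypothetical; only the path pattern comes from this skill.

```python
from datetime import datetime

def report_path(slug, when):
    """Build reviews/pipeline-health-<slug>-YYYY-MM-DD-HHMM.md for a given timestamp."""
    return f"reviews/pipeline-health-{slug}-{when:%Y-%m-%d-%H%M}.md"

# Illustrative call with a made-up timestamp.
print(report_path("unified-build", datetime(2026, 4, 17, 19, 34)))
```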

Individual Failure Investigation ("why did this build fail", single build URL)

Use Step 7's numbered format: failure category, frequency, root cause, affected branches, owner, existing issues, recommended action.

Prerequisites

AzDO MCP tools (ado-* prefix) for querying builds and timelines on public projects
Binlog MCP tools (mcp-binlog-tool-*) for analyzing MSBuild binary logs from build artifacts
az account get-access-token or azureauth ado token for authenticated REST API access to dnceng/internal
curl for downloading task logs and build artifacts
gh CLI for searching related issues and source code

Authentication

The dnceng/internal project requires authentication. Two methods:

# Method 1: Azure CLI (preferred — works in most environments)
TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)

# Method 2: azureauth tool
TOKEN=$(azureauth ado token --output token --prompt-hint "copilot-cli")

# Usage
curl -s -H "Authorization: Bearer $TOKEN" "https://dev.azure.com/dnceng/internal/_apis/build/builds/{buildId}/timeline?api-version=7.1"

⚠️ Tokens expire. Re-acquire if you get a 401 or redirect to a sign-in page.

⚠️ AzDO MCP tools do NOT work against dnceng/internal. Use curl with Bearer token for all internal queries.

Workflow

Step 1: Get the build timeline

Given a build ID or URL, query the timeline to find all failed records:

https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/timeline?api-version=7.1

Filter records by result == "failed" or result == "succeededWithIssues", with type == "Task". Don't skip succeededWithIssues — these contain real failures (signing validation errors, Component Governance warnings) that didn't block the overall job. For health assessments, also include builds with overall result partiallySucceeded.

Each failed or warning task has:

name — the task that failed
parentId — links to the parent Job record (tells you which leg)
issues — error messages with type and message fields
log.url — URL to download the full task log
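The filter described in this step can be sketched as follows, assuming the timeline JSON has already been downloaded and parsed. The helper name `failed_tasks` and the sample values are hypothetical; the field names (`records`, `id`, `parentId`, `type`, `result`, `issues`, `log.url`) are the ones listed above.

```python
def failed_tasks(timeline):
    """Return failed or warning Task records, annotated with the parent job name."""
    records = timeline.get("records", [])
    names_by_id = {r["id"]: r.get("name") for r in records}
    interesting = {"failed", "succeededWithIssues"}
    rows = []
    for r in records:
        if r.get("type") == "Task" and r.get("result") in interesting:
            rows.append({
                "task": r.get("name"),
                "job": names_by_id.get(r.get("parentId")),  # which leg failed
                "result": r["result"],
                "messages": [i.get("message") for i in r.get("issues") or []],
                "log_url": (r.get("log") or {}).get("url"),
            })
    return rows

# Hypothetical sample shaped like a timeline response (all values are made up).
sample = {
    "records": [
        {"id": "job-1", "parentId": None, "type": "Job",
         "name": "SB_CentOS_Build", "result": "failed"},
        {"id": "t-1", "parentId": "job-1", "type": "Task", "name": "Build",
         "result": "failed",
         "issues": [{"type": "error", "message": "error MSB3073"}],
         "log": {"url": "https://example.invalid/logs/12"}},
        {"id": "t-2", "parentId": "job-1", "type": "Task",
         "name": "Checkout", "result": "succeeded"},
        {"id": "t-3", "parentId": "job-1", "type": "Task",
         "name": "Component Governance", "result": "succeededWithIssues",
         "issues": None, "log": None},
    ]
}

for row in failed_tasks(sample):
    print(row["job"], row["task"], row["result"])
```

The Job record itself is skipped even though its result is `failed`, because only `Task` records carry the log URL and issue messages needed for the next steps.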

Step 2: Categorize the failure

| Pattern | Category | Owner |
|---|---|---|
| `Binary Analysis Scan` task, `exit code null` | 1ES infra tooling crash | 1ES PT team |
| `MSB3073` / `NetAnalyzers.Package.csproj` error | Build error | Source code / SDK |
| `Validate installer packages`: expected vs actual package list mismatch | Installer validation regex | dotnet/installer |
| `Run Tests` with `SourcelinkTests.VerifySourcelinks` | Sourcelink validation | dotnet/dotnet |
| `Run Tests` with `The task has timed out` across many `SB__Validation_` legs | Run Tests timeout epidemic | dotnet/dotnet |
| `Run Tests` with locale test failure (e.g., `tr-TR`) | Scenario test bug | dotnet/templating |
| `Build` with `IBCMerge` / PGO error in `Windows_Pgo_*` | PGO optimization failure | dotnet/runtime |
| `Download Previous Build` with "Artifact not found" | Artifact cascade failure | Prior build failed first |
| `Build` with `curl` / `tar` / download failure (exit code 2, "not recoverable") | External resource fetch failure | Retry first; if persistent, check URL/version |
| `Build` with `npm ci` ETIMEDOUT / ECONNREFUSED | npm network timeout | Transient; shell retry wrapper needed |
| `Build` task with compilation errors | Source code build break | Varies by component |
| `Crossgen.targets` with `exit code 57005` (0xDEAD) | crossgen2 fatal crash | dotnet/runtime (area-crossgen2-coreclr) |
| ESRP `MacSignFailed` / `FailDoNotRetry` / notarization errors | Signing/notarization failure | dotnet/sdk or ESRP team |
| `exit code null` on cross-compilation legs in containers | Container OOM kill | Infrastructure (pool/container config) |
| Build start == finish, HTTP 204 timeline, `validationResults` has errors | YAML pre-flight rejection | PR author (pipeline YAML is invalid) |
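A lookup over the table above might be sketched like this. It encodes only a subset of the rows, matches on the task name plus the issue text, and checks rules in order; the `RULES` list and the `categorize` helper are illustrative assumptions, not part of the skill.

```python
import re

# (task name regex, message regex, category, owner), checked in order.
# Only a subset of the table rows; extend as needed.
RULES = [
    (r"Binary Analysis Scan", r"exit code null",
     "1ES infra tooling crash", "1ES PT team"),
    (r"Download Previous Build", r"Artifact not found",
     "Artifact cascade failure", "Prior build failed first"),
    (r"Run Tests", r"SourcelinkTests\.VerifySourcelinks",
     "Sourcelink validation", "dotnet/dotnet"),
    (r".*", r"MSB3073|NetAnalyzers\.Package\.csproj",
     "Build error", "Source code / SDK"),
    (r".*", r"exit code 57005",
     "crossgen2 fatal crash", "dotnet/runtime (area-crossgen2-coreclr)"),
    (r".*", r"MacSignFailed|FailDoNotRetry",
     "Signing/notarization failure", "dotnet/sdk or ESRP team"),
]

def categorize(task_name, message):
    """Return (category, owner) for the first matching rule."""
    for task_re, message_re, category, owner in RULES:
        if re.search(task_re, task_name) and re.search(message_re, message):
            return category, owner
    return "Uncategorized", None

print(categorize("Binary Analysis Scan", "Process exited with exit code null"))
```

Task-specific rules come first so that an `exit code null` in the scan task is not swallowed by a more general rule added later.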

Routing: AzDO tests vs Helix tests. If the test runs directly as an AzDO task (e.g., Run Tests in SB_*_Validation_* legs — SourcelinkTests, scenario tests), it's pipeline-investigation. If the test is submitted to Helix (has a Helix job ID, work items, console logs at helix.dot.net), use helix-investigation.

Step 3: Download and analyze task logs

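# Re-acquire a token if needed (either method from the Authentication section works)
TOKEN=$(azureauth ado token --output token --prompt-hint "copilot-cli")
# Download the task log from the record's log.url
curl -s -H "Authorization: Bearer $TOKEN" "{log_url}" > /tmp/task-{buildId}.log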

Key patterns to search for:

exit code null — process terminated by signal (OOM, timeout). Check for memory-related warnings before the crash.
exit code 57005 (0xDEAD) — crossgen2 fatal error. Check which assembly was being compiled. Intermittent crashes need crash dumps for diagnosis.
[ERROR] / ##[error] — explicit error messages from the task
[WARNING] floods — excessive warnings (e.g., 1000+ hardlink failures) indicate resource exhaustion
MSBuild error codes — MSB3073 (command failed), NETSDK* (SDK errors), NU1105 (invalid target framework — often transient forward flow)
MacSignFailed / FailDoNotRetry — ESRP signing rejection. FailDoNotRetry means deterministic content failure, not transient.
Stack traces — .NET exceptions in test output
ETIMEDOUT / ECONNREFUSED — TCP-level network failures in npm/curl that aren't retried by default
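One way to apply the pattern list above to a downloaded log is sketched below. The regexes are a loose transcription of the bullets, and the 1000-line flood threshold is an assumption borrowed from the "1000+ hardlink failures" example rather than a documented limit.

```python
import re
from collections import Counter

PATTERNS = {
    "exit_code_null": re.compile(r"exit code null"),
    "crossgen2_crash": re.compile(r"exit code 57005"),
    "error": re.compile(r"\[ERROR\]|##\[error\]"),
    "warning": re.compile(r"\[WARNING\]"),
    "msbuild_code": re.compile(r"\b(MSB\d{4}|NETSDK\d{4}|NU\d{4})\b"),
    "signing": re.compile(r"MacSignFailed|FailDoNotRetry"),
    "network": re.compile(r"ETIMEDOUT|ECONNREFUSED"),
}

def scan_log(lines, flood_threshold=1000):
    """Count pattern hits and record the first line number of each."""
    counts = Counter()
    first_hit = {}
    for lineno, line in enumerate(lines, start=1):
        for name, rx in PATTERNS.items():
            if rx.search(line):
                counts[name] += 1
                first_hit.setdefault(name, lineno)
    return {
        "counts": dict(counts),
        "first_hit": first_hit,
        # Assumed threshold: a warning flood suggests resource exhaustion.
        "warning_flood": counts["warning"] >= flood_threshold,
    }

# Made-up log lines for illustration.
example_log = [
    "##[error]Process completed with exit code null",
    "[WARNING] hardlink failed",
    "error MSB3073: command exited with code 1",
    "done",
]
print(scan_log(example_log))
```

To scan a real log, pass an open file handle, for example `scan_log(open(path, encoding="utf-8", errors="replace"))`.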

Step 4: Determine if it's a one-off or recurring

Query failed builds over a time range and count how many hit the same failure:

# Get failed builds for a pipeline definition
# Note: \$top is escaped so the shell does not expand it as a variable inside double quotes
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds?definitions={defId}&resultFilter=failed&minTime={iso8601}&\$top=200&api-version=7.1"

💡 Rolling builds give cleaner signal. Add &reasonFilter=schedule to filter to scheduled builds only. PR builds include broken branches that inflate failure counts and obscure systemic trends.
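When the query is assembled in code rather than typed into a shell, the `$top` parameter needs no escaping and the optional `reasonFilter` is easy to toggle. A minimal sketch, assuming the same query parameters as the curl call above; the helper name and the definition ID `123` are placeholders.

```python
from urllib.parse import urlencode

def failed_builds_url(org, project, definition_id, min_time,
                      scheduled_only=False, top=200):
    """Build the builds-list URL for failed builds of one pipeline definition."""
    params = {
        "definitions": definition_id,
        "resultFilter": "failed",
        "minTime": min_time,
        "$top": top,
        "api-version": "7.1",
    }
    if scheduled_only:
        # Restrict to scheduled (rolling) builds for a cleaner signal.
        params["reasonFilter"] = "schedule"
    base = f"https://dev.azure.com/{org}/{project}/_apis/build/builds?"
    # Keep "$" and ":" literal so the URL reads like the curl example.
    return base + urlencode(params, safe="$:")

# 123 is a placeholder; discover the real definition ID at runtime.
print(failed_builds_url("dnceng-public", "public", 123,
                        "2026-04-01T00:00:00Z", scheduled_only=True))
```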

For each build, check its timeline for the same failure pattern. Track in a SQL table:

CREATE TABLE pipeline_failures (
  build_id INT, branch TEXT, queued TEXT, failure_category TEXT, 
  failed_task TEXT, notes TEXT
);

Look for:

Burst pattern — many occurrences in a short window → something changed (tooling update, artifact size growth)
Steady trickle — consistent low rate → chronic issue
Single occurrence: likely transient. Summarize it, but deeper investigation is rarely warranted.
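The three patterns can be told apart mechanically once the failure timestamps are collected. The sketch below uses assumed thresholds (three or more hits inside 24 hours counts as a burst); these numbers are illustrative, not defined by this skill. Timestamps are assumed to be ISO 8601 strings with an explicit offset such as `+00:00`.

```python
from datetime import datetime, timedelta

def occurrence_pattern(timestamps, burst_count=3,
                       burst_window=timedelta(hours=24)):
    """Classify failure timestamps as burst, steady trickle, or single occurrence."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if not times:
        return "no occurrences"
    if len(times) == 1:
        return "single occurrence"
    # Burst: any run of `burst_count` consecutive hits fits inside the window.
    for i in range(len(times) - burst_count + 1):
        if times[i + burst_count - 1] - times[i] <= burst_window:
            return "burst"
    return "steady trickle"

# Made-up timestamps for illustration.
print(occurrence_pattern([
    "2026-04-10T08:00:00+00:00",
    "2026-04-10T12:00:00+00:00",
    "2026-04-10T20:00:00+00:00",
]))
```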

Step 5: Check for known issues

Search the relevant repo for existing issues:

gh search issues "{failure_pattern}" --repo dotnet/dotnet --limit 5

If found, check if the issue is being worked on. If not found and the failure is recurring, consider filing one.

Step 6: Root cause analysis

For non-obvious failures, pull up the source code of the failing test or tool:

# Fetch a file through the GitHub contents API and decode its base64 payload
gh api "repos/{owner}/{repo}/contents/{path}" --jq '.content' | base64 -d

Look for:

Unbounded parallelism — Parallel.ForEach without MaxDegreeOfParallelism
Tight timeouts — processes timing out under contention
Resource assumptions — disk space, memory, network that may not hold in CI
Non-determinism — race conditions, order-dependent assertions

Step 6b: Binlog analysis for deep investigation

When task logs don't reveal enough, download the build's binlog artifacts for detailed MSBuild analysis:

# List build artifacts to find binlogs
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds/{buildId}/artifacts?api-version=7.1"

# Download the artifact zip
curl -s -H "Authorization: Bearer $TOKEN" -o /tmp/logs.zip "{downloadUrl}"
unzip -o /tmp/logs.zip "*/Build.binlog" -d /tmp/binlogs/

Then use the binlog MCP tools:

load_binlog → get_diagnostics for errors/warnings
search_binlog to find specific patterns (e.g., MacSignFailed, crossgen2)
get_task_info to see exact command lines and parameters for failed tasks
list_tasks_in_target to see all tasks in a target (e.g., 126 Crossgen tasks, which one failed?)

💡 Binlogs capture the full MSBuild execution including command lines, environment variables, and interleaved output that task logs lose. Essential for signing, crossgen2, and SBRP failures.

Step 7: Report findings (individual failure investigations only)

For health assessments, use the Output Formats section above instead.

Provide:

Failure category — which pattern from the table above
Frequency — how often in the last N days, trending up/down/stable
Root cause — with evidence from logs and/or source code
Affected branches — main only, or also release branches
Owner — who should fix it (1ES, dotnet/dotnet, SDK team, etc.)
Existing issues — links to any filed issues
Recommended action — file issue, retry, wait for fix, etc.

Artifact Cascade Failures

When a build leg fails, downstream legs that depend on its artifacts will also fail with "Artifact not found." This creates a cascade:

SB_CentOS_Build (fails: NetAnalyzers error)
  → SB_CentOS_Validation (fails: missing artifacts)
  → SB_Fedora_Offline (fails: missing artifacts)

Don't investigate the cascade — find the root failure. Look for the first failed task chronologically, or filter out "Download Previous Build" failures to find the real cause.
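The advice above amounts to dropping "Download Previous Build" failures and taking the earliest remaining one. A minimal sketch, assuming failed timeline records that carry `name` and `finishTime`, with timestamps in one consistent ISO 8601 format so they sort as strings; the helper name and sample data are hypothetical.

```python
def root_failure(failed_records):
    """Return the earliest failed record that is not a cascade symptom."""
    candidates = [
        r for r in failed_records
        if r.get("name") != "Download Previous Build"
    ]
    if not candidates:
        return None  # only cascade symptoms here; look at the prior build
    return min(candidates, key=lambda r: r["finishTime"])

# Made-up records mirroring the cascade shown above.
example_failures = [
    {"job": "SB_CentOS_Validation", "name": "Download Previous Build",
     "finishTime": "2026-04-10T10:20:00Z"},
    {"job": "SB_CentOS_Build", "name": "Build",
     "finishTime": "2026-04-10T10:05:00Z"},
    {"job": "SB_Fedora_Offline", "name": "Download Previous Build",
     "finishTime": "2026-04-10T10:21:00Z"},
]
print(root_failure(example_failures)["job"])
```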

Category-Specific Investigation Techniques

For detailed investigation techniques for each failure category in the table above, see references/investigation-techniques.md. Covers: ESRP signing/notarization, container OOM, crossgen2 crashes, YAML pre-flight rejections, and network transient failures.

Fix Verification

When a fix is merged, verify it's tested by checking builds queued after the merge — not builds that finished after it.

Query all build statuses

Always include in-progress and not-started builds, not just completed ones. The AzDO builds API requires explicit statusFilter to return active builds.

⚠️ The AzDO API rejects combining completed with inProgress/notStarted in a single statusFilter. You must make separate calls and merge the results:

# Completed builds
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds?definitions={defId}&branchName=refs/heads/{branch}&statusFilter=completed&api-version=7.1"

# Active builds (in-progress and queued)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://dev.azure.com/{org}/{project}/_apis/build/builds?definitions={defId}&branchName=refs/heads/{branch}&statusFilter=inProgress,notStarted&api-version=7.1"

If you only query completed builds, you'll miss active builds that are currently testing the fix and incorrectly conclude "not tested yet."
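Merging the two responses can be sketched as below, assuming each call's `value` array has been parsed into a list of build objects with `id`, `status`, and `queueTime`. The helper name is hypothetical. Fetching active builds first and completed builds second is a suggestion: a build that finishes between the two calls then appears in both lists instead of neither.

```python
def merge_builds(*result_sets):
    """Merge build lists, de-duplicating by id; later lists win on conflict."""
    by_id = {}
    for builds in result_sets:
        for build in builds:
            by_id[build["id"]] = build
    return sorted(by_id.values(), key=lambda b: b["queueTime"])

# Made-up responses for illustration.
active = [
    {"id": 11, "status": "inProgress", "queueTime": "2026-04-10T11:00:00Z"},
    {"id": 12, "status": "notStarted", "queueTime": "2026-04-10T12:00:00Z"},
]
completed = [
    {"id": 10, "status": "completed", "queueTime": "2026-04-10T09:00:00Z"},
    {"id": 11, "status": "completed", "queueTime": "2026-04-10T11:00:00Z"},
]
for build in merge_builds(active, completed):
    print(build["id"], build["status"])
```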

Multi-branch fixes

Fixes often need backporting to multiple release branches. For each branch:

Find the fix PR (or backport PR) and its merge time
Query builds on that branch with all statuses
Compare each build's queueTime to the merge time
Partition into pre-merge (doesn't have fix) and post-merge (has fix)

⚠️ batchedCI builds pick up source at queue time. A build queued before a PR merges runs against old source, even if it finishes hours later.
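The per-branch comparison can be sketched as a partition on `queueTime`. This assumes timestamps are ISO 8601 strings with an explicit offset; the helper name and sample values are hypothetical. A build queued at exactly the merge time is conservatively treated as pre-merge.

```python
from datetime import datetime

def partition_by_merge(builds, merge_time):
    """Split build ids into (pre_merge, post_merge) by queue time."""
    merged_at = datetime.fromisoformat(merge_time)
    pre, post = [], []
    for build in builds:
        queued = datetime.fromisoformat(build["queueTime"])
        # Queue time decides which source the build ran against.
        (post if queued > merged_at else pre).append(build["id"])
    return pre, post

# Made-up builds for illustration.
example_branch_builds = [
    {"id": 1, "queueTime": "2026-04-10T09:00:00+00:00"},
    {"id": 2, "queueTime": "2026-04-10T11:30:00+00:00"},
    {"id": 3, "queueTime": "2026-04-10T10:00:00+00:00"},
]
print(partition_by_merge(example_branch_builds, "2026-04-10T10:00:00+00:00"))
```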

Stop Signals

Stop after categorizing if it's a known issue with an existing bug. Link to it and move on.
Stop frequency analysis after 14 days of data. Longer trends are rarely actionable.
Stop investigating cascades — always trace back to the root failure.
Present single-occurrence failures — even one-offs deserve a summary. Let the user decide whether to dig deeper or move on.

Anti-Patterns

🚨 Don't confuse cascade failures with root causes. "Missing artifacts" means a prior leg failed — find that leg.

🚨 Don't assume exit code null is a code bug. It usually means the process was terminated externally (OOM, signal). Look for resource exhaustion in the log. In containers, check memory limits.

🚨 Don't investigate all 100+ failed builds. Sample 3-5 spread across the time range to confirm the pattern, then count occurrences.
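Picking a spread-out sample can be done with evenly spaced indices over the time-ordered build list. A minimal sketch; the helper name is hypothetical and the build IDs are made up.

```python
def sample_spread(build_ids, k=5):
    """Pick up to k items evenly spaced across an ordered list, including both ends."""
    n = len(build_ids)
    if n <= k:
        return list(build_ids)
    if k == 1:
        return [build_ids[0]]
    # Integer arithmetic keeps the indices deterministic and distinct.
    return [build_ids[i * (n - 1) // (k - 1)] for i in range(k)]

# 100 made-up build ids, oldest first.
print(sample_spread(list(range(1000, 1100)), k=5))
```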

🚨 Don't assume ESRP FailDoNotRetry is transient. It means the binary content is deterministically invalid. Investigate the source commits, don't retry the build.

🚨 Don't skip infrastructure failures. Container OOM, agent timeouts, signing failures, and network issues deserve the same depth of investigation as code failures. They often have actionable fixes (memory tuning, retry wrappers, pool changes).

Discovering Pipeline Definitions

Never hardcode definition IDs — they vary across projects and can change. Discover them at runtime:

# Public pipelines (AzDO MCP tools work here too)
curl -s "https://dev.azure.com/dnceng-public/public/_apis/build/definitions?name=dotnet-unified-build&api-version=7.1" \
  | jq '.value[] | {id, name, project: .project.name}'

# Internal pipelines (requires auth)
TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://dev.azure.com/dnceng/internal/_apis/build/definitions?name=dotnet-unified-build&api-version=7.1" \
  | jq '.value[] | {id, name, project: .project.name}'

If querying internal fails (401/redirect), fall back to public-only analysis and note the limitation in the report.

Health Assessment Report Format

Required format for pipeline health assessments. See Output Formats section above for routing, and references/health-assessment-format.md for the complete specification including build classification rules, branch filtering, and codeflow analysis methodology.
