| name | copilot-review-analyst |
| description | Analyze GitHub Copilot code review effectiveness across Android Auth repositories. Collects all Copilot inline review comments via GitHub API, classifies them as helpful/not-helpful/unresolved through reply analysis, diff verification, and AI-assisted classification, then generates a report with per-repo and per-engineer statistics. Use this skill when asked to "analyze Copilot reviews", "measure Copilot review effectiveness", "generate Copilot review report", "how helpful are Copilot reviews", "run review analysis", or any request to measure/report on GitHub Copilot code review quality and adoption. |
Copilot Review Analyst
Analyze GitHub Copilot code review effectiveness across the Android Auth repositories by collecting all inline review comments, classifying each one, and producing a comprehensive report.
Prerequisites
- GitHub CLI (
gh) authenticated with an EMU account (*_microsoft) which has access to all three repos (public and private). The script auto-detects the EMU account from gh auth status.
- The
analyze.ps1 script automatically switches to the EMU account at startup and restores the original account on completion — no manual auth switching needed
- Output directory:
~/.copilot-review-analysis/ for final artifacts, $env:TEMP\copilot-review-analysis\ for intermediate data
Repository Configuration
Default repos (update in scripts if changed):
| Label | Slug | Auth |
|---|
| common | AzureAD/microsoft-authentication-library-common-for-android | EMU (also accessible via personal) |
| msal | AzureAD/microsoft-authentication-library-for-android | EMU (also accessible via personal) |
| broker | identity-authnz-teams/ad-accounts-for-android | EMU only |
Analysis Pipeline
The analysis runs in 5 sequential phases. Scripts and templates are bundled in this skill:
- Scripts:
scripts/ (3 core pipeline scripts)
- Assets:
assets/ (report templates — Markdown, HTML, Outlook HTML)
- References:
references/ (classification rules, report formatting guide)
Phase 1: Data Collection
Script: scripts/analyze.ps1
Collect all Copilot inline review comments from human-authored PRs:
# Default: last 60 days
.\.github\skills\copilot-review-analyst\scripts\analyze.ps1
# Custom date range:
.\.github\skills\copilot-review-analyst\scripts\analyze.ps1 -StartDate "2026-01-23"
Parameters:
-StartDate — Start date for PR search (default: 60 days ago). Format: YYYY-MM-DD
-OutputDir — Output directory (default: $env:TEMP\copilot-review-analysis)
What it does:
- Fetch all PRs created after
-StartDate via gh pr list
- Filter out bot-authored PRs (Copilot, copilot-swe-agent, dependabot, github-actions)
- For each PR, call
repos/{slug}/pulls/{prNum}/comments to get inline comments
- Filter to Copilot comments (user.login = "Copilot") that are top-level (not replies)
- Find human replies to each Copilot comment (matched via
in_reply_to_id)
- Record whether each comment has a reply and capture the reply text (no classification at this stage)
Outputs:
$env:TEMP\copilot-review-analysis\raw_results.json — all comments with HasReply flag and raw reply text
$env:TEMP\copilot-review-analysis\review_summaries.json — PR-level summary comments (for reference)
Phase 2: Diff-Level Verification
Script: scripts/precise.ps1
For every comment with no reply, verify whether the engineer silently acted on the feedback:
.\.github\skills\copilot-review-analyst\scripts\precise.ps1
What it does:
- Load
raw_results.json, filter to comments where HasReply = false
- For each comment, get the commit SHA it was left on and the PR head SHA
- Use
repos/{slug}/compare/{commitA}...{commitB} to get the diff
- For suggestion blocks: extract code tokens, check if they appear as
+ lines in the diff
- For prose comments: check if diff hunk line ranges overlap the comment's line range (±5 line tolerance)
Verdicts assigned:
suggestion-applied — suggestion tokens match diff + lines overlap
suggestion-likely-applied — tokens match but lines don't overlap
exact-lines-modified — prose comment's lines were modified
lines-modified-different-fix — nearby lines modified, different code
file-changed-elsewhere — file modified but at different lines
file-changed-no-line-info — file modified but comment had no line number
file-not-changed — file untouched after the comment
no-subsequent-commits — PR merged without any commits after the review
Output: $env:TEMP\copilot-review-analysis\precise.json
Phase 3: AI Reply Classification
This phase is performed by the agent (you) in conversation. Classify every replied comment by reading the full Copilot comment and engineer reply in context.
See references/classification-rules.md for detailed guidance on what counts as helpful vs not-helpful.
Process:
- Load
raw_results.json and filter to HasReply -eq true
- For each comment, read the
CommentBody (Copilot's feedback) and HumanReplyText (engineer's reply)
- Classify as
helpful or not-helpful based on the engineer's intent (read the full reply, don't keyword-match)
- Also review
file-changed-elsewhere and file-changed-no-line-info verdicts from Phase 2 to identify re-audit flips
Outputs (write to $env:TEMP\copilot-review-analysis\):
reply-verdicts.json — { "commentId": "helpful"|"not-helpful", ... } for every replied comment
reaudit-flips.json — { "reauditFlipKeys": ["repo/prNum/filePattern", ...] } for no-reply comments with strong evidence
See references/manual-audit-template.json for the schema.
Phase 4: Final Classification
Script: scripts/final-classification.ps1
Merge all results into a single authoritative dataset:
.\.github\skills\copilot-review-analyst\scripts\final-classification.ps1 `
-AccountMapFile ".github\skills\copilot-review-analyst\references\account-map.json" `
-ReplyVerdictsFile "$env:TEMP\copilot-review-analysis\reply-verdicts.json" `
-ReauditFlipsFile "$env:TEMP\copilot-review-analysis\reaudit-flips.json"
Parameters:
-OutputDir — Directory with raw_results.json and precise.json (default: $env:TEMP\copilot-review-analysis)
-AccountMapFile — Path to JSON mapping GitHub logins to display names. See references/account-map.json.
-ReplyVerdictsFile — Path to JSON with Phase 3 AI verdicts for replied comments. If omitted, replied comments are "unknown".
-ReauditFlipsFile — Path to JSON with Phase 3 re-audit flips for no-reply comments. If omitted, file-changed-elsewhere defaults to "not-helpful".
What it does:
- Load
raw_results.json and precise.json
- Load account mapping, reply verdicts (Phase 3), and re-audit flips (Phase 3)
- For replied comments: use the AI verdict from
reply-verdicts.json
- For no-reply comments: use the Phase 2 diff verdict, with re-audit overrides
- Produce per-engineer and per-repo statistics
Output: $env:TEMP\copilot-review-analysis\final_classification.json
Phase 4.5: Append History
Script: scripts/append-history.ps1
Append a snapshot of the current run to the persistent history file for trend tracking across runs:
.\.github\skills\copilot-review-analyst\scripts\append-history.ps1 `
-PeriodStart "2026-01-24" `
-PeriodEnd "2026-03-25"
Parameters:
-PeriodStart — First day of the analysis period (YYYY-MM-DD). This is the -StartDate that was passed to analyze.ps1 in Phase 1. Required.
-PeriodEnd — Last day of the analysis period (YYYY-MM-DD). Defaults to today.
-InputDir — Directory with final_classification.json and precise.json (default: $env:TEMP\copilot-review-analysis)
-HistoryFile — Path to persistent history JSON (default: ~/.copilot-review-analysis/history.json)
Important: Periods are non-overlapping and variable-length. Each run covers a distinct period — e.g., the first run might cover 60 days, subsequent runs might cover 2 weeks each. The -PeriodStart for a new run should be the day after the previous run's -PeriodEnd. The script automatically deduplicates entries with the same period.
What it does:
- Load
final_classification.json and precise.json
- Compute aggregate stats: response rate, three-way breakdown, per-repo, per-engineer
- Compute normalized volume:
comments/week for fair comparison across different period lengths
- Load existing
history.json (or create empty if first run)
- Append current snapshot; replace if same period already exists
- Save sorted by period (newest first)
Output: ~/.copilot-review-analysis/history.json
The history file is an array of snapshots. Each snapshot contains:
- Period metadata:
runDate, periodStart, periodEnd, periodDays
- Volume:
total, commentsPerWeek, reviewedPRs, avgCommentsPerPR
- Rates:
responseRate, helpful.pct, notHelpful.pct, unresolved.pct, repliedHelpfulRate
- Per-repo breakdown:
repos.{broker,common,msal}.{comments, responseRate, helpfulPct, ...}
- Per-engineer breakdown:
engineers.{name}.{comments, responseRate, helpfulPct}
Phase 5: Report Generation
Generate both Markdown and Outlook-compatible HTML reports.
Style/structure references (in assets/ — these contain data from the Jan-Mar 2026 analysis and serve as structural templates, NOT to be copied verbatim):
assets/Copilot-Code-Review-Effectiveness-Report.md — Markdown reference (~270 lines, ~3300 words)
assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html — Outlook HTML reference (~430 lines, ~42KB)
assets/Copilot-Code-Review-Effectiveness-Report.html — Standard HTML reference
Important: The asset templates contain hardcoded numbers (557 comments, specific percentages, engineer names, etc.) from the first analysis. For each new run, generate fresh reports using the same section structure and formatting patterns but with statistics computed from final_classification.json.
MANDATORY: Read Templates Before Generating
You MUST read both the Markdown and Outlook HTML template files in full before generating any report. Do not generate from memory or from the section table in report-formatting.md alone — the templates contain critical patterns that are not captured in the section list:
- Narrative prose paragraphs between every table explaining the significance of the data (not just "here's a table")
- Callout boxes (blue for insights, yellow for warnings) that frame key findings
- Background section with team context and Copilot enablement framing
- "At a Glance" summary cards with a detailed adoption callout underneath
- Scope metric cards (Human PRs, PRs reviewed, Total comments, Avg per PR)
- Full Copilot comment text in examples (not truncated snippets)
- Explanatory paragraphs after examples describing why the comment was helpful/unhelpful
- Detailed Recommendations section with prose paragraphs (not bullet points)
- Bar chart visualizations (response rate bar, helpfulness verdict bar, per-repo bars) in HTML
Quality Gate
After generating each report, compare its dimensions against the template:
- Markdown: Must be ≥250 lines and ≥3000 words. If significantly smaller, the report is missing narrative depth.
- Outlook HTML: Must be ≥400 lines and ≥35KB. If significantly smaller, the report is missing visual elements or prose.
If a report doesn't meet these thresholds, re-read the template and identify what's missing before saving.
Generate two versions of each report:
- Team-internal — uses real engineer names (from account map). For the team.
- Org-wide — anonymizes engineers as "Engineer A", "Engineer B", etc., sorted by helpfulness descending. For sharing outside the team.
Process:
- Load
final_classification.json
- Read full template files (
assets/Copilot-Code-Review-Effectiveness-Report.md and assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html) — do NOT skip this step
- Compute aggregate statistics (total, per-repo, per-engineer, three-way breakdown)
- Load
raw_results.json to collect full comment text for 4-5 helpful and 4-5 unhelpful examples
- Load
~/.copilot-review-analysis/history.json for trend data. If ≥2 entries exist, generate a Trend section comparing the current run with the previous run(s). See references/report-formatting.md for trend section formatting rules.
- Generate reports matching the template's section structure, narrative depth, and visual formatting
- Verify dimensions (word count, line count) against the quality gate thresholds
- Save to
~/.copilot-review-analysis/:
Copilot-Code-Review-Effectiveness-Report.md (team, real names)
Copilot-Code-Review-Effectiveness-Report-Anonymous.md (org-wide)
Copilot-Code-Review-Effectiveness-Report-Outlook.html (team, real names)
Copilot-Code-Review-Effectiveness-Report-Outlook-Anonymous.html (org-wide)
See references/report-formatting.md for the report structure and Outlook HTML formatting rules.
Key Principle: "Unresolved" ≠ "Not Helpful"
Comments with no reply and no diff evidence are Unresolved, not assumed unhelpful. This is a critical distinction:
- Confirmed Helpful = positive evidence (explicit acknowledgment OR verified fix in diff)
- Confirmed Not Helpful = positive evidence (explicit dismissal with stated reason OR comment on stale code)
- Unresolved = insufficient evidence either way (engineer never engaged)
The final-classification.ps1 script classifies no-response/no-diff-evidence comments as "not-helpful" for conservative stats, but the report should present the three-way breakdown to be honest about uncertainty.
Copilot Comment Identification
- Copilot inline review comments use
user.login = "Copilot"
- Legacy bot:
copilot-pull-request-reviewer[bot]
- Bot PR authors to exclude:
app/copilot-swe-agent, dependabot[bot], github-actions[bot]
- Only count top-level comments (
in_reply_to_id is null/0), not Copilot's own replies
Rate Limiting
The scripts call the GitHub API heavily. Built-in mitigations:
- PR comment caching (fetched once per PR)
- Diff caching (fetched once per commit range)
- Sleep every 15 PRs (300ms) and every 25 diff checks (200ms)
- Use
--paginate for repos with many comments
If hitting rate limits, increase sleep intervals or use GH_TOKEN with higher rate limits.