Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

data-security-analysis

Name: Data Security Analysis
Author: SCStelz

// Analyze data security events, sensitive information type (SIT) access, sensitivity label access, DLP matches, or Purview insider risk activity. Triggers on keywords like "data security", "sensitive information type", "SIT access", "DLP events", "DataSecurityEvents", "EDM access", "credit card access", "insider risk activity", "Purview data security", "sensitivity label", "label downgrade", "label change", "Copilot label exposure". Queries DataSecurityEvents in Advanced Hunting to produce SIT and label access analysis: volume breakdowns, user drill-downs, file inventories, action type distribution, DLP correlation, label change tracking, Copilot label exposure, temporal patterns, and risk-ranked user summaries. Inline chat or markdown output. Designed for large environments (100k+ users) with tiered drill-down.

In Manus ausführen

$ git log --oneline --stat

stars:204

forks:51

updated:7. Mai 2026 um 15:32

Datei-Explorer

2 Dateien

SKILL.md

readonly

related-skills.json

gleiches Repository

threat-pulse.md

from "SCStelz/security-investigator"

Recommended starting point for new users and daily SOC operations. 15-minute broad security scan across 7 domains (incidents, identity, NHI, endpoint, email, admin/cloud, exposure) producing a Threat Pulse Dashboard with drill-down recommendations to specialized skills. Trigger on getting-started questions like "where do I start", "what can you do", "help me investigate".

2026-05-29204

kql-query-authoring.md

from "SCStelz/security-investigator"

Use this skill when asked to write, create, or help with KQL (Kusto Query Language) queries for Microsoft Sentinel, Defender XDR, or Azure Data Explorer. Triggers on keywords like "write KQL", "create KQL query", "help with KQL", "query [table]", "KQL for [scenario]", or when a user requests queries for specific data analysis scenarios. This skill uses schema validation, Microsoft Learn documentation, and community examples to generate production-ready KQL queries.

2026-05-26204

app-registration-posture.md

from "SCStelz/security-investigator"

Audit Entra ID app registration and service principal security posture. Triggers on keywords like "app registration posture", "service principal permissions", "dangerous app permissions", "app ownership", "app credential abuse", "SPN lateral movement", "app consent grant", "overprivileged apps", "cross-tenant SPN", "app registration kill chain", "app persistence", "credential add chain", "Graph API permissions audit". Combines Graph API current-state inventory (dangerous permissions, ownership, credential hygiene) with KQL chain detection (AuditLogs, AADServicePrincipalSignInLogs, AADUserRiskEvents, MicrosoftGraphActivityLogs) for posture assessment covering permission concentration, owner risk, credential hygiene, cross-tenant exposure, and active abuse signals. Includes 5-dimension App Permission Risk Score. Inline chat or markdown output.

2026-05-07204

email-threat-posture.md

from "SCStelz/security-investigator"

Generate email threat protection reports and assess email security posture. Triggers on keywords like "email threat report", "email security posture", "phishing report", "MDO report", "Defender for Office 365 report", "ZAP effectiveness", "Safe Links report", "DMARC report", "spam report", "email volume report". Queries EmailEvents, EmailPostDeliveryEvents, UrlClickEvents, and EmailAttachmentInfo in Advanced Hunting for a posture assessment covering inbound mail flow, threat composition, phishing detection, email authentication (DMARC/DKIM/SPF), post-delivery remediation (ZAP), Safe Links click protection, attachment analysis, detection method effectiveness, and delivery disposition. Supports inline chat, markdown file, and SVG dashboard output.

2026-05-07204

identity-posture.md

from "SCStelz/security-investigator"

Audit identity security posture across the organization. Triggers on keywords like "identity posture", "identity security report", "account hygiene", "stale accounts", "privileged accounts", "password posture", "identity providers", "multi-provider identity", "identity sprawl", "service accounts", "deleted accounts with roles", "cross-IdP", "honeytoken", "sensitive accounts". Queries IdentityAccountInfo in Advanced Hunting (enriched with IdentityInfo and IdentityLogonEvents) for a posture assessment covering account inventory by provider, privileged account audit, stale/deleted account hygiene, password posture, risk distribution, multi-provider identity linking, MDI tag analysis, and department-level insights. Inline chat or markdown output.

2026-05-07204

mitre-coverage-report.md

from "SCStelz/security-investigator"

MITRE ATT&CK Coverage Report — YAML-driven PowerShell pipeline gathers analytic rule MITRE tags, custom detection techniques, SOC Optimization recommendations, and alert/incident operational data via az rest/az monitor/Graph API, writes a deterministic scratchpad, LLM renders the report. Covers tactic-level coverage matrix, technique-level drill-down with rule mapping, coverage gap identification, SOC Optimization threat scenario alignment, untagged rule remediation, ICS/OT technique tracking, and MITRE Coverage Score (5 weighted dimensions). Inline chat and markdown file output.

2026-04-17204

package.json

"author": "SCStelz"

"repository": "SCStelz/security-investigator"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

InformationssicherheitsanalystenInformatik- und Mathematikberufe15-1212L4

Jeden Skill mit einem Klick ausführen

name	data-security-analysis
description	Analyze data security events, sensitive information type (SIT) access, sensitivity label access, DLP matches, or Purview insider risk activity. Triggers on keywords like "data security", "sensitive information type", "SIT access", "DLP events", "DataSecurityEvents", "EDM access", "credit card access", "insider risk activity", "Purview data security", "sensitivity label", "label downgrade", "label change", "Copilot label exposure". Queries DataSecurityEvents in Advanced Hunting to produce SIT and label access analysis: volume breakdowns, user drill-downs, file inventories, action type distribution, DLP correlation, label change tracking, Copilot label exposure, temporal patterns, and risk-ranked user summaries. Inline chat or markdown output. Designed for large environments (100k+ users) with tiered drill-down.
threat_pulse_domains	["admin","cloud"]
drill_down_prompt	Analyze data security events — SIT access patterns, label changes, DLP policy matches

Data Security Events Analysis — Instructions

Purpose

This skill analyzes DataSecurityEvents (Microsoft Purview Insider Risk Management / DLP telemetry) to answer questions about who accessed documents containing sensitive information types (SITs) and/or sensitivity labels — including EDM (Exact Data Match), built-in SITs (credit cards, SSNs, etc.), trainable classifiers, and Microsoft Purview sensitivity labels (Confidential, Highly Confidential, custom labels, etc.).

Primary Table: DataSecurityEvents (Defender XDR Advanced Hunting)

Use Case	Example Question
SIT access audit	"Who accessed files with credit card numbers in the last 30 days?"
EDM monitoring	"Show me all access to documents matching our EDM SIT"
DLP event analysis	"What DLP policy matches occurred this week?"
Insider risk triage	"Which users have the most sensitive data interactions?"
SIT landscape overview	"What sensitive information types exist in our environment?"
Sensitivity label audit	"Who accessed Highly Confidential labeled documents?"
Label change tracking	"Show me all label downgrades in the last 30 days"
Copilot label exposure	"What labeled documents did Copilot access in risky interactions?"

📑 TABLE OF CONTENTS

Critical Workflow Rules - Start here!
SIT GUID Mapping Strategy - How SIT GUIDs are resolved to names
Label GUID Mapping Strategy - How sensitivity label GUIDs are resolved to names
Output Modes - Inline chat vs. Markdown file
Quick Start - 8-step execution pattern
Execution Workflow - 6-phase analysis process
Sample KQL Queries - Validated query patterns (Queries 1-16d)
Report Template - Rendering rules (15 rules) + output format specification
Known Pitfalls - Table quirks and edge cases (27 entries)
Error Handling - Troubleshooting guide
SVG Dashboard Generation - Visual dashboard from report data

Investigation shortcuts:

DLP/exfiltration incident entities (TP Q1): Q3 (top users by SIT volume) → Q6 (DLP policy matches) → Q9 (single-user SIT profile for each incident entity) → Q10b (file-based spikes)
High-volume mailbox API access (TP Q9): Q9 (single-user SIT profile for API actors) → Q4 (top files accessed) → Q10b (file-based spikes) → Q6 (DLP policy matches)
Risky identity with data access (TP Q3): Q9 (single-user SIT profile) → Q4 (top files) → Q13 (label downgrade/changes by user)
Copilot sensitive data exposure (TP Q1 Copilot incidents, or TP Q10 AppRegistration with AI keywords): Q16a (Copilot SIT landscape + agent/human split) → Q16b (top human users, high-priority SITs) → Q16d (prompt-only risk signal)
Label compliance / downgrade alert (TP Q1 label-related incidents): Q13 (label changes) → Q15 (label-only events) → Q14 (Copilot label exposure)
Tenant-wide data security posture (standalone, no TP trigger): Full Phase 1–5 workflow

⛔ Shortcut Default Rule: When a matching shortcut exists for the investigation context, use it — don't run the full workflow. Only run the full Phase 1-5 sequence when the user explicitly requests "full analysis", "comprehensive", or "tenant-wide overview". Shortcuts render only the report sections relevant to their query chain (plus Executive Summary and Recommendations, always).

When invoked from a parent skill (threat-pulse, incident-investigation, user-investigation):

Inherit the workspace selection from the parent investigation context
Skip output mode prompts — default to inline chat (the parent skill controls the final output format)
Match the TP Q# trigger to the shortcuts above and execute that chain with entity substitution
Use 30d lookback (AH default) unless the parent specifies otherwise

⚠️ CRITICAL WORKFLOW RULES - READ FIRST ⚠️

Before starting ANY data security analysis:

ALWAYS use RunAdvancedHuntingQuery — DataSecurityEvents is an Advanced Hunting table, NOT available in Sentinel Data Lake
ALWAYS run Query 1 (SIT Discovery) first — establishes which SITs are active and builds the GUID-to-Name mapping
ALWAYS use summarize aggressively — this table can have 600k+ rows in 30 days even in mid-size tenants. NEVER retrieve raw rows except for targeted samples
ALWAYS pre-filter with has before mv-expand on SensitiveInfoTypeInfo — the has "<GUID>" filter avoids expensive expansion on non-matching rows
ALWAYS use tostring() + double parse_json() for SensitiveInfoTypeInfo — it's Collection(String), not native dynamic
NEVER report SIT GUIDs without attempting name resolution — use the mapping strategy below
ALWAYS ask for output mode if not specified: inline chat or markdown file
Prerequisite: DataSecurityEvents requires Insider Risk Management opt-in to share data with Defender XDR. If the table returns 0 rows or "table not found", inform the user of this requirement
ALWAYS run the Label Coverage Assessment (Query 11 quick stats variant) during Phase 1 to determine if this environment has significant label usage. Adapt the report accordingly (see Rule 11)
NEVER report sensitivity label GUIDs without attempting name resolution — use the label mapping strategy below
ALWAYS use split() on SensitivityLabelId — this column can contain comma-separated GUIDs (one per sub-entity), not a single GUID

⛔ PROHIBITED ACTIONS

Action	Status
Querying DataSecurityEvents via `mcp_sentinel-data_query_lake`	❌ PROHIBITED — AH-only table
Retrieving raw rows without `summarize` or `take` limit	❌ PROHIBITED — table is massive
Reporting SIT GUIDs without name resolution attempt	❌ PROHIBITED
Reporting sensitivity label GUIDs without name resolution attempt	❌ PROHIBITED
Running `mv-expand` on SensitiveInfoTypeInfo without pre-filtering with `has`	❌ PROHIBITED — performance killer at scale
Assuming `SensitiveInfoTypeInfo` is native dynamic	❌ PROHIBITED — it's `Collection(String)`, requires double-parse

SIT GUID Mapping Strategy

The Problem

DataSecurityEvents.SensitiveInfoTypeInfo contains SIT GUIDs, not human-readable names. SIT GUIDs fall into three categories:

Category	Resolvable via KQL?	Example
Built-in Microsoft SITs	✅ Yes — use embedded mapping	`50842eb7-...-b085` → "Credit Card Number"
Custom/EDM SITs	❌ No — org-specific GUIDs	`b28fcea1-...-9291` → "Project Obsidian" (custom)
Trainable Classifiers (ML)	❌ No — `ClassifierType: "MLModel"`	`77a140be-...-7560` → unknown ML classifier

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known SIT Mapping (instant, no auth)

The query library below includes a datatable of the most common Microsoft SIT GUIDs encountered in production environments. This covers ~90% of detections in typical tenants.

Tier 2: User-Provided Custom SIT Mapping (config-driven)

If the user has custom/EDM SITs, they can provide a mapping in config.json under a sit_mapping key:

{
  "sit_mapping": {
    "<custom-sit-guid-1>": "Your Custom SIT Name",
    "<custom-sit-guid-2>": "Your EDM SIT Name"
  }
}

At skill startup: Check if config.json has a sit_mapping section. If yes, merge it into the KQL datatable for name resolution.

Tier 3: PowerShell Resolution (optional, on-demand)

If unresolved GUIDs remain after Tier 1+2, offer to resolve them via PowerShell:

"I found N SIT GUIDs that aren't in the built-in mapping. Would you like me to resolve them via Get-DlpSensitiveInformationType? This requires an active Security & Compliance PowerShell session (Connect-IPPSSession)."

If the user agrees:

# Requires: Install-Module ExchangeOnlineManagement
# Requires: Connect-IPPSSession -UserPrincipalName <UPN>
Get-DlpSensitiveInformationType -Identity "<GUID>" | Select-Object Name, Id, Publisher

After resolution: Offer to save the mapping to config.json for future runs.

Post-Resolution Persistence (MANDATORY)

After Tier 3 PowerShell resolution completes, always offer to persist the resolved GUIDs:

"I resolved N SIT GUIDs via PowerShell. Would you like me to save these to config.json under sit_mapping so future runs resolve them automatically via Tier 2?"

If the user agrees, read the current config.json, add/merge a sit_mapping object with the resolved GUIDs, and write it back. Format:

{
  "sit_mapping": {
    "<guid>": "<resolved-name>",
    "<guid>": "<resolved-name>"
  }
}

Why this matters: Without persistence, every new session re-encounters the same unresolved GUIDs. The first report in a workspace should resolve and persist; subsequent runs benefit automatically.

Trainable Classifiers

GUIDs with ClassifierType: "MLModel" are trainable classifiers and may not resolve via Get-DlpSensitiveInformationType. Display them as:

[ML Classifier] <GUID> if unresolved
Check if the GUID appears in the well-known mapping (some trainable classifiers have known GUIDs)

Label GUID Mapping Strategy

The Problem

DataSecurityEvents has 4 label-related columns, all containing sensitivity label GUIDs (not names):

Column	Type	Content
`SensitivityLabelId`	string	Label on the document at event time. Can contain comma-separated GUIDs (one per sub-entity)
`PreviousSensitivityLabelId`	string	Previous label — only populated on label-change events (downgrade, removal)
`SharepointSiteSensitivityLabelId`	string	Label on the SharePoint site (not the document)
`RiskyAIUsageSensitivityLabelsInfo`	Collection(String)	Labels on resources Copilot accessed in risky AI events — JSON array of objects with `SubEntityId`, `SubEntityName`, `SensitivityLabelId`

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known Label Mapping (instant, no auth)

The query library includes a datatable of Microsoft default sensitivity labels (the defa4170-* GUID family). All 12 default labels — including parent labels — use the deterministic pattern defa4170-0d19-0005-XXXX-bc88714345d2, confirmed across multiple tenants.

⚠️ Important: Microsoft does not publish default label GUIDs in official documentation. The GUID pattern is confirmed via Get-Label on default-configuration tenants. Older tenants may have renamed default labels (e.g., "Non-business" instead of "Personal", "Internal exception" instead of "Anyone (unrestricted)") or replaced default parent label GUIDs with random tenant-specific ones. Always validate with Get-Label (Tier 3) when accuracy matters.

Default Label GUID Pattern: defa4170-0d19-0005-XXXX-bc88714345d2 — complete mapping (12 labels, priority-ordered):

GUID suffix	Priority	Default Name	Parent
`0000`	0	Personal	(top-level)
`0001`	1	Public	(top-level)
`0002`	2	General	(top-level)
`0003`	3	Anyone (unrestricted)	General
`0004`	4	All Employees (unrestricted)	General
`0005`	5	Confidential	(top-level, parent)
`0006`	6	Anyone (unrestricted)	Confidential
`0007`	7	All Employees	Confidential
`0008`	8	Trusted People	Confidential
`0009`	9	Highly Confidential	(top-level, parent)
`000a`	10	All Employees	Highly Confidential
`000b`	11	Specified People	Highly Confidential

Older/customized tenants: Admins may have renamed default labels or deleted and recreated parent labels with random GUIDs. If Get-Label returns a different GUID for "Confidential" or "Highly Confidential" (not matching defa4170-*), the tenant has custom parent labels — add them via config.json (Tier 2).

Tier 2: User-Provided Custom Label Mapping (config-driven)

Custom labels (org-created) have random GUIDs. Users can provide a mapping in config.json:

{
  "label_mapping": {
    "<custom-label-guid-1>": "Your Custom Label",
    "<sub-label-guid>": "Sub-Label Name|Parent Label Name",
    "<parent-label-guid>": "Confidential"
  }
}

Value format: "LabelName" for top-level labels, "LabelName|ParentName" (pipe-delimited) for sub-labels. When building the KQL datatable, split on | to populate LabelName and LabelParent columns.

Renamed defaults: If a tenant has renamed default labels (e.g., defa4170...0000 → "Non-business" instead of "Personal"), include the renamed GUID in label_mapping — Tier 2 entries override Tier 1 defaults.

At skill startup: Check if config.json has a label_mapping section. If yes, merge it into the KQL datatable for name resolution. Tier 2 entries take precedence over Tier 1 defaults for the same GUID.

Tier 3: PowerShell Resolution (optional, on-demand)

If unresolved label GUIDs remain after Tier 1+2, offer to resolve them via PowerShell:

"I found N label GUIDs that aren't in the built-in mapping. Would you like me to resolve them via Get-Label? This requires an active Security & Compliance PowerShell session (Connect-IPPSSession)."

If the user agrees:

# Requires: Install-Module ExchangeOnlineManagement
# Requires: Connect-IPPSSession -UserPrincipalName <UPN>
Get-Label | Select-Object DisplayName, @{N='LabelGuid';E={$_.Guid.ToString()}}, ParentLabelDisplayName | Format-List

After resolution: Offer to save the mapping to config.json under label_mapping for future runs (same persistence pattern as SIT mapping).

Key difference from SIT resolution: Labels use Get-Label (not Get-DlpSensitiveInformationType). The cmdlet returns ALL labels at once — no need to query by individual GUID.

Output Modes

ASK the user which they prefer if not explicitly specified. Both may be selected.

Mode 1: Inline Chat Summary (Default)

Render analysis directly in chat
Includes summary tables, top-N breakdowns, risk-ranked user list
Best for quick review and follow-up questions

Mode 2: Markdown File Report

Save to reports/data-security/DataSecurity_Analysis_<scope>_<timestamp>.md
Full detail including all phases, temporal charts, file inventories
Use create_file tool — NEVER use terminal commands for file output
Filename pattern: DataSecurity_Analysis_<scope>_YYYYMMDD_HHMMSS.md
- <scope> = tenant_wide, sit_<SITname>, user_<username>, etc.

Quick Start (TL;DR)

Determine scope → Tenant-wide overview? Specific SIT? Specific user? Specific label? Time range?
Check config.json → Look for sit_mapping and label_mapping sections for custom name resolution
Run Phase 1 → Query 1 (SIT Discovery) + Label Coverage Assessment (quick stats variant) to determine environment maturity
Run Phase 2 → Queries 2-5 (breakdowns by action type, user, file, time)
Run Phase 2.5 → Queries 16a-16d (Copilot SIT exposure analysis) — conditional on Copilot volume (see Phase 2.5 trigger)
Run Phase 3 → Queries 6-8 (DLP correlation, workload, SIT drill-down), Query 10b (file-based spikes)
Run Phase 4 → Queries 11-15 (label landscape, label-based user ranking, label downgrades, Copilot label exposure, label-only events) — depth depends on label coverage (see Rule 11)
Output Results → Render in selected mode(s), offer PowerShell resolution for unknowns

Execution Workflow

Phase 1: Discovery & Mapping (always run first)

Goal: Establish what SITs and labels exist in the data, their volume, and resolve GUIDs to names.

Run Query 1 (SIT Discovery) — returns top SIT GUIDs with hit counts
Run Label Coverage Assessment (quick stats from Query 11 comment block) — returns label vs SIT coverage percentages
Apply Tier 1 mapping (embedded datatable) to resolve known SIT and label GUIDs
Check config.json for Tier 2 mapping (sit_mapping + label_mapping) to resolve custom GUIDs
Flag any remaining unresolved GUIDs for optional Tier 3 (PowerShell)
Present the SIT + label landscape to the user before proceeding
Determine label analysis depth based on coverage (see Rule 11)

Phase 2: Breakdown Analysis

Goal: Decompose SIT access patterns by multiple dimensions.

Run these queries in parallel where possible:

Query	Dimension	Purpose
Query 2	Action Type	What operations triggered SIT detections (file read, download, copy, Copilot response, etc.)
Query 3	User Ranking	Top users by SIT interaction volume — risk-ranked
Query 4	File Inventory	Top files/documents containing the most SIT detections
Query 5	Temporal Pattern	Daily/hourly volume trend to spot spikes

Phase 2.5: Copilot SIT Exposure Analysis (conditional on Copilot volume)

Trigger: Run this phase when Copilot/AI events exceed 30% of total volume (determined from Query 2 Action Type breakdown or Query 7 Workload breakdown).

Goal: Decompose Copilot SIT interactions by priority tier, identify users prompting high-value SITs into Copilot, separate service account noise from human risk signals, and estimate real interaction counts (correcting for row multiplication).

Key insight: Each Copilot interaction generates ~2-3 DSE rows on average (up to 35 for complex exchanges) because Purview creates separate rows for prompt SIT matches, response SIT matches, and compound agent interactions. Raw event counts must be corrected for this multiplier when reporting interaction volumes.

Query	Purpose
Query 16a	Copilot SIT Landscape — Which SITs fire in Copilot interactions, classified by priority tier (High/Medium/Low)
Query 16b	Top Users by High-Priority SIT in Copilot — Risk-ranked users excluding service accounts
Query 16c	Daily Temporal Trend by SIT Category — Spot adoption vs risk pattern changes over time
Query 16d	Prompt-Only Human Users — Users typing sensitive data INTO Copilot (primary risk signal), excluding service accounts and responses

Service Account Filtering: Automated service accounts (e.g., Security Copilot agents, Purview agents) can generate 50-70% of all Copilot events. These accounts typically follow patterns like securitycopilotagentuser-*, svc-*, or system-generated UPN prefixes with GUIDs. Query 16a quantifies the agent vs human split; Queries 16b-16d exclude agents to surface human risk.

Priority SIT Classification: SITs are classified into tiers for Copilot risk assessment:

Tier	SIT Categories	Risk Rationale
🔴 High	Credit Card Numbers, SSNs, Azure/Cloud Credentials, Employee HR Data (custom)	Direct financial, identity, or infrastructure exposure
🟡 Medium	Project code names (custom), Employee IDs (custom)	Business-sensitive but not directly exploitable
🔵 Low	All Full Names, IP Addresses, Physical Addresses, Medical Terms	High-volume, low-specificity — noise in Copilot context

Custom SIT classification: Organizations should classify their custom/EDM SITs into these tiers. If config.json has a sit_priority section (mapping GUID → tier), use it. Otherwise, classify custom SITs as 🔴 High by default (conservative).

Phase 3: Deep Dive (conditional on scope)

Scenario	Run These
Tenant-wide overview	Query 6 (DLP policy matches), Query 7 (Workload breakdown)
Specific SIT investigation	Query 8 (Single-SIT deep dive with full user/file/action breakdown)
Specific user investigation	Query 9 (Single-user SIT access profile)
Anomaly detection	Query 10b (file-based spikes — PRIMARY), Query 10 (overall spikes — secondary, includes Copilot)

Phase 4: Sensitivity Label Analysis (conditional on coverage)

Goal: Analyze sensitivity label access patterns, label changes, and Copilot label exposure.

Run the Label Coverage Assessment first (Phase 1, step 2). Then adapt depth per Rule 11:

Label Coverage	Analysis Depth	Run These
≥5% of events have labels (label-mature environment)	Full label analysis — dedicated report sections	Query 11 (label landscape), Query 12 (label-based user ranking), Query 13 (label changes), Query 14 (Copilot label exposure), Query 15 (label-only events)
1-5% of events have labels (emerging label environment)	Summary label section — condensed into one section	Query 11 (label landscape), Query 13 (label changes)
<1% of events have labels (SIT-dominant environment)	Brief note only — mention label presence in Scope & Limitations	Label coverage stats from Phase 1 assessment only
User asks specifically about labels	Full label analysis regardless of coverage percentage	All label queries (11-15)

Phase 5: Report Generation

Render findings using the Report Template below.

Sample KQL Queries

Well-Known SIT GUID Mapping (datatable)

Use this let block as a prefix for any query that needs name resolution. It covers the most common Microsoft SITs plus placeholders for custom SITs from config.json.

// Well-known SIT GUID mapping — covers ~90% of typical detections
// Add custom/EDM SIT GUIDs from config.json sit_mapping section
let SITMapping = datatable(SITId: string, SITName: string) [
    // ── Financial ──
    "50842eb7-edc8-4019-85dd-5a5c1f2bb085", "Credit Card Number",
    "cb353f78-2b72-4c3c-8827-92ebe4f69fdf", "ABA Routing Number",
    "78e09124-f2c3-4656-b32a-c1a132cd2711", "Brazil CPF Number",
    // ── Identity / PII ──
    "a44669fe-0d48-453d-a9b1-2cc83f2cba77", "U.S. Social Security Number (SSN)",
    "a7dd5e5f-e7f9-4626-a2c6-86a8cb6830d2", "IP Address v4",
    "1daa4ad5-e2dd-4ca4-a788-54722c09efb2", "IP Address",
    "50b8b56b-4ef8-44c2-a924-03374f5831ce", "All Full Names",
    "8548332d-6d71-41f8-97db-cc3b5fa544e6", "All Physical Addresses",
    "44aa44f2-63d1-41df-af0d-970283ac41e2", "U.S. Physical Addresses",
    "d1d18c85-1203-46f5-b32f-2d6309de4e5b", "Australia Physical Addresses",
    "6fa57f91-314a-4561-8248-7ab921957448", "Philippines Passport Number",
    "d0001c83-e72f-4360-98d3-f5a41dc5a380", "Indonesia Passport Number",
    // ── Healthcare ──
    "065bdd91-ef07-40d3-b8a4-0aea722eaa49", "All Medical Terms And Conditions",
    "17066377-466d-43ff-997f-c9240414021c", "Diseases",
    "f6dc2d17-3549-41e2-af29-ae1846ae9542", "Types Of Medication",
    "ee05bb9c-7b87-42e1-9987-446b243245d5", "Lab Test Terms",
    // ── Azure / Cloud secrets ──
    "0f587d92-eb28-44a9-bd1c-90f2892b47aa", "Azure DocumentDB Auth Key",
    "ce1a126d-186f-4700-8c0c-486157b953fd", "Azure SQL Connection String",
    "0b34bec3-d5d6-4974-b7b0-dcdb5c90c29d", "Azure IoT Connection String",
    "c7bc98e8-551a-4c35-a92d-d2c8cda714a7", "Azure Storage Account Key",
    "095a7e6c-efd8-46d5-af7b-5298d53a49fc", "Azure Redis Cache Connection String",
    // ─── ADD CUSTOM / EDM SITs FROM config.json sit_mapping HERE ───
    // Example: "<your-edm-guid>", "Your EDM SIT Name",
    "END_MARKER", "END_MARKER"
];

Instructions: When building queries, read config.json for sit_mapping entries and insert them into the datatable above, replacing the END_MARKER row. If no custom mapping exists, remove the END_MARKER row.

Query 1: SIT Discovery — Active SIT Landscape

Purpose: Find all active SIT GUIDs, their volume, and classify them.

// Query 1: SIT Discovery — What SITs are active in this environment?
// Adjust timespan as needed (default: 30d)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend ClassifierType = tostring(SITJson.ClassifierType)
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    TotalEvents = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    AvgConfidence = avg(SITConfidence),
    MaxConfidence = max(SITConfidence),
    ClassifierTypes = make_set(ClassifierType)
    by SITId
| order by TotalEvents desc
| take 50

Post-processing: Join results with the SITMapping datatable to resolve names. Flag any GUIDs not in the mapping as "Unknown — custom/EDM SIT" or "[ML Classifier]" based on ClassifierTypes.

Query 2: Action Type Breakdown

Purpose: Break down SIT detections by what operation triggered them.

// Query 2: Action Type Breakdown — What operations trigger SIT detections?
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId)
    by ActionType
| order by EventCount desc

Query 3: Top Users by SIT Interaction Volume

Purpose: Risk-rank users by sensitive data interaction volume. Designed for 100k+ user environments.

// Query 3: Top 50 Users by SIT access volume
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    TotalEvents = count(),
    DistinctSITs = dcount(tostring(parse_json(tostring(parse_json(tostring(SensitiveInfoTypeInfo))[0])).SensitiveInfoTypeId)),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by TotalEvents desc
| take 50

Query 4: Top Files by SIT Detection Count

Purpose: Identify the most sensitive documents — files with the most SIT detections across access events.

// Query 4: Top 30 Files by SIT detection frequency
// Excludes system/operational files (DLPCache, EBWebView) that are Defender operational reads, not user-initiated
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| where isnotempty(ObjectId)
| where ObjectId !has "DLPCache" and ObjectId !has "EBWebView" and ObjectId !has "\\ProgramData\\Microsoft\\Windows Defender\\"
| summarize 
    AccessCount = count(),
    DistinctUsers = dcount(AccountUpn),
    ActionTypes = make_set(ActionType),
    LastAccessed = max(Timestamp)
    by ObjectId
| order by AccessCount desc
| take 30

Query 5: Temporal Pattern — Daily SIT Event Volume

Purpose: Detect volume spikes or anomalies in SIT-related activity over time.

// Query 5: Daily SIT event volume trend — includes file-based column for spike attribution (Rule 10)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    DailyEvents = count(), 
    FileEvents = countif(Workload !in ("Copilot", "ConnectedAIApp")),
    DistinctUsers = dcount(AccountUpn) 
    by Day = bin(Timestamp, 1d)
| order by Day asc

Query 6: DLP Policy Match Correlation

Purpose: Show DLP policy matches alongside SIT detections — which policies fired and how often.

// Query 6: DLP Policy Match breakdown
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(DlpPolicyMatchInfo)
| extend DlpInfo = parse_json(DlpPolicyMatchInfo)
| mv-expand DlpPolicy = DlpInfo
| extend PolicyName = tostring(DlpPolicy.PolicyName)
| extend PolicyId = tostring(DlpPolicy.PolicyId)
| summarize 
    MatchCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType)
    by PolicyName
| order by MatchCount desc

Query 7: Workload Breakdown

Purpose: Where is sensitive data being accessed — SharePoint, OneDrive, Exchange, Teams, Endpoints, Copilot?

// Query 7: Workload distribution of SIT events
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId)
    by Workload
| order by EventCount desc

Query 8: Single-SIT Deep Dive

Purpose: Full breakdown for a specific SIT GUID — who accessed it, which files, what operations, over what time period.

Usage: Replace <TARGET_SIT_GUID> with the specific SIT GUID to investigate (e.g., an EDM SIT GUID).

// Query 8: Single-SIT deep dive — replace GUID
let targetSIT = "<TARGET_SIT_GUID>";
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| where SensitiveInfoTypeInfo has targetSIT
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId == targetSIT
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    AccessCount = count(),
    AvgConfidence = avg(SITConfidence),
    TotalSITInstances = sum(SITCount),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn, ObjectId
| order by AccessCount desc
| take 100

Query 9: Single-User SIT Access Profile

Purpose: Complete SIT interaction profile for a specific user — what SITs they accessed, which files, operations, and when.

Usage: Replace <TARGET_UPN> with the user's UPN.

// Query 9: Single-user SIT access profile
let targetUser = "<TARGET_UPN>";
DataSecurityEvents
| where Timestamp > ago(30d)
| where AccountUpn =~ targetUser
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    AccessCount = count(),
    DistinctFiles = dcount(ObjectId),
    AvgConfidence = avg(SITConfidence),
    ActionTypes = make_set(ActionType),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by SITId
| order by AccessCount desc

Query 10: Anomaly Detection — Users with SIT Access Spikes

Purpose: Compare each user's recent 7-day SIT activity against their 30-day daily average to detect sudden spikes. Designed for 100k+ user environments.

// Query 10: SIT access spike detection (7d recent vs 23d baseline) — ALL events
// NOTE: This includes Copilot events. For file-based-only spikes, use Query 10b below.
let baseline = DataSecurityEvents
| where Timestamp between (ago(30d) .. ago(7d))
| where isnotempty(SensitiveInfoTypeInfo)
| summarize BaselineTotal = count() by AccountUpn
| extend BaselineDailyAvg = round(BaselineTotal / 23.0, 1); // 23 days in baseline window
let recent = DataSecurityEvents
| where Timestamp > ago(7d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize RecentTotal = count() by AccountUpn
| extend RecentDailyAvg = round(RecentTotal / 7.0, 1);
recent
| join kind=inner baseline on AccountUpn
| extend SpikeRatio = round(RecentDailyAvg / BaselineDailyAvg, 2)
| where SpikeRatio > 2.0 and RecentTotal > 20 and BaselineTotal >= 10
| project AccountUpn, BaselineDailyAvg, RecentDailyAvg, SpikeRatio, BaselineTotal, RecentTotal
| order by SpikeRatio desc
| take 30

Query 10b: File-Based-Only Spike Detection (Excludes Copilot)

Purpose: Same as Query 10 but excludes Copilot and ConnectedAIApp events to surface actual file access spikes. This is the primary risk signal — Copilot spikes often just reflect adoption changes.

// Query 10b: File-based SIT access spike detection (excludes Copilot/AI events)
let CopilotActionTypes = dynamic(["Risky prompt entered in Copilot", "Sensitive response received in Copilot",
    "Risky prompt entered in connected AI apps", "Sensitive response received in connected AI apps"]);
let baseline = DataSecurityEvents
| where Timestamp between (ago(30d) .. ago(7d))
| where isnotempty(SensitiveInfoTypeInfo)
| where not(ActionType has_any (CopilotActionTypes))
| where Workload !in ("Copilot", "ConnectedAIApp")
| summarize BaselineTotal = count() by AccountUpn
| extend BaselineDailyAvg = round(BaselineTotal / 23.0, 1);
let recent = DataSecurityEvents
| where Timestamp > ago(7d)
| where isnotempty(SensitiveInfoTypeInfo)
| where not(ActionType has_any (CopilotActionTypes))
| where Workload !in ("Copilot", "ConnectedAIApp")
| summarize RecentTotal = count() by AccountUpn
| extend RecentDailyAvg = round(RecentTotal / 7.0, 1);
recent
| join kind=inner baseline on AccountUpn
| extend SpikeRatio = round(RecentDailyAvg / BaselineDailyAvg, 2)
| where SpikeRatio > 2.0 and RecentTotal > 10 and BaselineTotal >= 10
| project AccountUpn, BaselineDailyAvg, RecentDailyAvg, SpikeRatio, BaselineTotal, RecentTotal
| order by SpikeRatio desc
| take 30

Well-Known Label GUID Mapping (datatable)

Use this let block as a prefix for label queries that need name resolution. It covers the Microsoft default sensitivity labels plus placeholder slots for custom labels from config.json.

// Microsoft default sensitivity label GUID mapping — all 12 labels
// Confirmed via Get-Label on default-configuration tenants
// Older tenants may have renamed labels or replaced parent GUIDs — validate with Get-Label if needed
// Add custom labels from config.json label_mapping section
let LabelMapping = datatable(LabelId: string, LabelName: string, LabelParent: string) [
    // ── Microsoft Default Labels (defa4170-0d19-0005-XXXX-bc88714345d2 family) ──
    // Priority 0-11, sequential GUID suffixes
    "defa4170-0d19-0005-0000-bc88714345d2", "Personal", "",
    "defa4170-0d19-0005-0001-bc88714345d2", "Public", "",
    "defa4170-0d19-0005-0002-bc88714345d2", "General", "",
    "defa4170-0d19-0005-0003-bc88714345d2", "Anyone (unrestricted)", "General",
    "defa4170-0d19-0005-0004-bc88714345d2", "All Employees (unrestricted)", "General",
    "defa4170-0d19-0005-0005-bc88714345d2", "Confidential", "",
    "defa4170-0d19-0005-0006-bc88714345d2", "Anyone (unrestricted)", "Confidential",
    "defa4170-0d19-0005-0007-bc88714345d2", "All Employees", "Confidential",
    "defa4170-0d19-0005-0008-bc88714345d2", "Trusted People", "Confidential",
    "defa4170-0d19-0005-0009-bc88714345d2", "Highly Confidential", "",
    "defa4170-0d19-0005-000a-bc88714345d2", "All Employees", "Highly Confidential",
    "defa4170-0d19-0005-000b-bc88714345d2", "Specified People", "Highly Confidential",
    // ─── ADD CUSTOM LABELS FROM config.json label_mapping HERE ───
    // Example: "<your-custom-label-guid>", "Your Custom Label", "Parent Label",
    "END_MARKER", "END_MARKER", "END_MARKER"
];

Instructions: When building queries, read config.json for label_mapping entries and insert them into the datatable above, replacing the END_MARKER row. Older/customized tenants: If Get-Label shows parent labels (Confidential, Highly Confidential) with random GUIDs instead of defa4170-*, the admin has recreated them — add the tenant-specific GUIDs from config.json label_mapping or Get-Label. If resolved sub-label names differ from the datatable (e.g., "Non-business" vs "Personal"), prefer the Get-Label name for that tenant.

Query 11: Label Coverage Overview — Sensitivity Label Landscape

Purpose: Discover which sensitivity labels appear in the data, their volume, and resolve GUIDs to names. Also includes a quick stats variant for Phase 1 coverage assessment.

Quick Stats Variant (run first in Phase 1):

// Label Coverage Assessment — run in Phase 1 to determine label analysis depth
DataSecurityEvents
| where Timestamp > ago(30d)
| summarize
    TotalEvents = count(),
    WithSIT = countif(isnotempty(SensitiveInfoTypeInfo) and SensitiveInfoTypeInfo != "[]"),
    WithLabel = countif(isnotempty(SensitivityLabelId)),
    WithPrevLabel = countif(isnotempty(PreviousSensitivityLabelId)),
    LabelOnly_NoSIT = countif(isnotempty(SensitivityLabelId) and (isempty(SensitiveInfoTypeInfo) or SensitiveInfoTypeInfo == "[]")),
    SIT_WithLabel = countif(isnotempty(SensitivityLabelId) and isnotempty(SensitiveInfoTypeInfo) and SensitiveInfoTypeInfo != "[]")

Full Label Landscape Query:

// Query 11: Label Landscape — which sensitivity labels appear and how often
// Prefix with LabelMapping datatable from above
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| extend LabelIds = split(SensitivityLabelId, ",")
| mv-expand LabelIdRaw = LabelIds
| extend LabelId = tostring(trim(" ", tostring(LabelIdRaw)))
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType)
    by LabelDisplay, LabelId
| order by EventCount desc

Post-processing: Flag any [Unknown] GUIDs for Tier 2/3 resolution. The LabelDisplay column renders as "Parent / Child" for sub-labels (e.g., "Highly Confidential / Project Obsidian") and just the label name for top-level labels.

Query 12: Top Users by Labeled Document Access (File-Based)

Purpose: Risk-rank users by labeled document access volume, excluding Copilot/AI events. This is the label-dimension equivalent of Query 3.

// Query 12: Top users by labeled document access (file-based only)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| where ActionType !has "Copilot" and Workload !in ("Copilot", "ConnectedAIApp")
| summarize
    EventCount = count(),
    DistinctLabels = dcount(SensitivityLabelId),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by EventCount desc
| take 30

Query 13: Label Downgrade & Change Tracking

Purpose: Find all events where a sensitivity label was downgraded, removed, or changed. Critical for detecting policy circumvention or insider risk.

// Query 13: Label downgrade/removal events — detect label circumvention
// Prefix with LabelMapping datatable to resolve both current and previous label GUIDs
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(PreviousSensitivityLabelId)
| extend CurrentLabelId = SensitivityLabelId
| extend PrevLabelId = PreviousSensitivityLabelId
| lookup kind=leftouter (LabelMapping | project LabelId, CurrentLabelName=LabelName, CurrentParent=LabelParent) on $left.CurrentLabelId == $right.LabelId
| lookup kind=leftouter (LabelMapping | project LabelId, PrevLabelName=LabelName, PrevParent=LabelParent) on $left.PrevLabelId == $right.LabelId
| extend CurrentDisplay = iff(isempty(CurrentLabelName), iff(isempty(CurrentLabelId), "[Removed]", strcat("[Unknown] ", CurrentLabelId)),
    iff(isempty(CurrentParent), CurrentLabelName, strcat(CurrentParent, " / ", CurrentLabelName)))
| extend PrevDisplay = iff(isempty(PrevLabelName), strcat("[Unknown] ", PrevLabelId),
    iff(isempty(PrevParent), PrevLabelName, strcat(PrevParent, " / ", PrevLabelName)))
| project Timestamp, ActionType, AccountUpn, ObjectId,
    PreviousLabel = PrevDisplay, CurrentLabel = CurrentDisplay, Workload
| order by Timestamp desc

Key ActionTypes in label change events:

Label downgraded on a file — label lowered (e.g., HC → Confidential)
Label removed from a file — label stripped entirely
Label on file downgraded or removed, then file accessed by Copilot — label reduced AND Copilot accessed the now-less-protected file

Query 14: Copilot Label Exposure — Labeled Resources Accessed by Copilot

Purpose: Identify which sensitivity-labeled documents Copilot accessed during risky AI interactions. This surfaces data exposure risk where Copilot may be surfacing Highly Confidential content.

// Query 14: Copilot label exposure — what labeled docs did Copilot access in risky interactions?
// Prefix with LabelMapping datatable
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(RiskyAIUsageSensitivityLabelsInfo)
| where tostring(RiskyAIUsageSensitivityLabelsInfo) !has "null"
    or (tostring(RiskyAIUsageSensitivityLabelsInfo) has "null" and tostring(RiskyAIUsageSensitivityLabelsInfo) has "SensitivityLabelId")
| mv-expand LabelEntry = parse_json(tostring(RiskyAIUsageSensitivityLabelsInfo))
| extend LabelJson = parse_json(tostring(LabelEntry))
| extend SubEntityName = tostring(LabelJson.SubEntityName)
| extend LabelId = tostring(LabelJson.SensitivityLabelId)
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    SubEntities = make_set(SubEntityName),
    ActionTypes = make_set(ActionType)
    by LabelDisplay, LabelId
| order by EventCount desc

SubEntityName values:

ResponseAccessedResource — a labeled document that Copilot cited in its response
Response — the Copilot response itself that was flagged

Query 15: Label-Only Events — Events Triggered Purely by Label (No SIT Content Match)

Purpose: Find events where the trigger was the sensitivity label alone, not SIT content detection. These represent label-aware DLP/IRM policy matches.

// Query 15: Label-only events — triggered by label, not SIT content
// Prefix with LabelMapping datatable
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| where isempty(SensitiveInfoTypeInfo) or SensitiveInfoTypeInfo == "[]"
| extend LabelIds = split(SensitivityLabelId, ",")
| mv-expand LabelIdRaw = LabelIds
| extend LabelId = tostring(trim(" ", tostring(LabelIdRaw)))
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload)
    by LabelDisplay, LabelId
| order by EventCount desc

Why this matters: In label-mature environments, this query can surface significant activity that the SIT-only queries completely miss. If a document has a "Highly Confidential" label but no SIT content (e.g., manually labeled strategic document), it only appears here.

Query 16a: Copilot SIT Landscape with Priority Tiers

Purpose: Break down which SITs fire in Copilot interactions, classify by priority tier, and quantify service account vs human split. This is the entry point for Phase 2.5.

Prerequisite: Merge the well-known SIT GUID mapping datatable (above) with any config.json sit_mapping entries before running.

// Query 16a: Copilot SIT Landscape — priority-tiered breakdown with agent/human split
// Prefix with SITMapping datatable
// ── SIT Priority Classification (canonical definition — Queries 16c/16d reference this) ──
let HighPrioritySITs = dynamic([
    "50842eb7-edc8-4019-85dd-5a5c1f2bb085",  // Credit Card Number
    "a44669fe-0d48-453d-a9b1-2cc83f2cba77",  // U.S. SSN
    "0f587d92-eb28-44a9-bd1c-90f2892b47aa",  // Azure DocumentDB Auth Key
    "ce1a126d-186f-4700-8c0c-486157b953fd",  // Azure SQL Connection String
    "0b34bec3-d5d6-4974-b7b0-dcdb5c90c29d",  // Azure IoT Connection String
    "c7bc98e8-551a-4c35-a92d-d2c8cda714a7",  // Azure Storage Account Key
    "095a7e6c-efd8-46d5-af7b-5298d53a49fc"   // Azure Redis Cache Connection String
    // ── ADD credential/HR SIT GUIDs from config.json sit_mapping HERE ──
]);
let MediumPrioritySITs = dynamic([
    // ── ADD project/employee ID SIT GUIDs from config.json sit_mapping HERE ──
]);
// ── Service account regex (update with org-specific patterns) ──
let ServiceAccountPattern = @"^(securitycopilotagentuser-|svc-)";
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend IsAgent = AccountUpn matches regex ServiceAccountPattern
| summarize
    TotalEvents = count(),
    AgentEvents = countif(IsAgent),
    HumanEvents = countif(not(IsAgent) and isnotempty(AccountUpn)),
    HumanUsers = dcountif(AccountUpn, not(IsAgent) and isnotempty(AccountUpn)),
    PromptEvents = countif(ActionType has "prompt"),
    ResponseEvents = countif(ActionType has "response")
    by SITId
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| extend PriorityTier = case(
    SITId in (HighPrioritySITs), "🔴 High",
    SITId in (MediumPrioritySITs), "🟡 Medium",
    "🔵 Low")
| project SITDisplay, PriorityTier, TotalEvents, AgentEvents, HumanEvents, HumanUsers, PromptEvents, ResponseEvents, SITId
| order by TotalEvents desc

Post-processing:

Populate HighPrioritySITs and MediumPrioritySITs arrays with credential, HR, and custom SIT GUIDs from config.json sit_mapping. Any SIT not in either array defaults to Low.
If config.json has a sit_priority section (GUID → tier mapping), use it to override the default classification.
Calculate Agent % of total — if > 50%, flag prominently in report ("⚠️ N% of Copilot SIT events are from automated service accounts").
Unknown SITs ([Unknown]) should be classified as 🔴 High by default (conservative — unknown custom SITs may be high-value EDM/exact data match).

Query 16b: Top Users by High-Priority SIT in Copilot (Excluding Service Accounts)

Purpose: Risk-rank human users whose Copilot interactions triggered high-priority SIT detections. Excludes automated service accounts.

// Query 16b: Top 20 human users by high-priority SIT in Copilot interactions
// Prefix with SITMapping datatable and HighPrioritySITs + ServiceAccountPattern from Query 16a
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId in (HighPrioritySITs)
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| summarize
    Events = count(),
    DistinctHighSITs = dcount(SITId),
    SITNames = make_set(SITDisplay),
    PromptEvents = countif(ActionType has "prompt"),
    ResponseEvents = countif(ActionType has "response"),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by Events desc
| take 20

Query 16c: Daily Copilot SIT Trend by Priority Category

Purpose: Track daily Copilot SIT detection volume by priority tier to distinguish adoption changes from risk spikes.

// Query 16c: Daily Copilot SIT trend by priority category (human users only)
// ── Copy HighPrioritySITs from Query 16a (canonical list: CCN, SSN, Azure credentials + config.json custom) ──
// ── Copy ServiceAccountPattern from Query 16a ──
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend PriorityTier = iff(SITId in (HighPrioritySITs), "High", "Low")
| summarize Events = count() by Day = bin(Timestamp, 1d), PriorityTier
| order by Day asc, PriorityTier asc

Post-processing: Render as a dual-line chart or table with High vs Low columns per day. Spikes in the High tier warrant investigation; spikes in Low tier alone are typically noise from broad SITs (All Full Names, IP Addresses) and can be noted but not escalated.

Query 16d: Prompt-Only Analysis — Human Users Typing Sensitive Data INTO Copilot

Purpose: The primary risk signal — users who typed sensitive data into Copilot prompts (not just receiving sensitive responses). Excludes service accounts and response-only events.

// Query 16d: Prompt-only human users with high-priority SIT detections
// This is the key risk signal: sensitive data entered BY the user INTO Copilot
// ── Copy HighPrioritySITs and ServiceAccountPattern from Query 16a ──
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "prompt"  // Prompts only — user-initiated risk
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId in (HighPrioritySITs)
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| summarize
    PromptEvents = count(),
    DistinctHighSITs = dcount(SITId),
    SITNames = make_set(SITDisplay),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by PromptEvents desc
| take 20

Why prompts matter more than responses:

"Risky prompt entered in Copilot" = User typed/pasted sensitive data (SSN, credit card, credentials) into a Copilot prompt. This is a behavioral risk — the user chose to share sensitive data with AI.
"Sensitive response received in Copilot" = Copilot retrieved and surfaced sensitive data from grounding sources (SharePoint, OneDrive, email). This is an access/configuration risk — overpermissioned data exposure.
Prompt events are the stronger signal for insider risk and user coaching interventions.

Report Template

Report Rendering Rules

These rules are MANDATORY for all report output (inline chat and markdown file). Follow strictly.

Rule 1: Risk Rating Scale

When assigning risk levels to users in the file-based user ranking, use this hierarchy:

Risk Level	Evidence Required
Critical	"Files collected and exfiltrated" ActionType present — confirmed insider risk exfiltration signal OR mass exfiltration pattern: ≥1,000 distinct files to removable media + file deletions within a ≤48-hour window (volume-based escalation even without the IRM-labeled ActionType)
High	Exfiltration signals below Critical thresholds (e.g., USB copies < 1,000 files without deletion pattern) OR sustained high DLP alert volume (top 2-3 by events/files)
Medium	Broad SIT diversity (10+ SIT types) OR cross-workload activity (3+ workloads) OR external domain WITHOUT explicit exfiltration signal
Low	Single-workload, moderate volume, no exfiltration or anomaly signals

⛔ PROHIBITED: Rating a user with "Files collected and exfiltrated" as Medium or Low. This ActionType is always High or Critical. ⛔ PROHIBITED: Rating a user with ≥1,000 files USB-copied + deletion in ≤48h as anything below Critical.

Rule 2: Executive Summary Uses File-Based Metrics Only

The Executive Summary MUST cite file-based (non-Copilot) event counts and file counts for user risk descriptions. Never cite overall metrics that include Copilot volume — this inflates perceived risk.

Context	Cite	Example
✅ File-based risk summary	Non-Copilot events, non-Copilot files	"u3087 generated 211 file-based events across 32 files"
❌ Inflated overall metrics	Total events including Copilot	~~"u1812 — 294 total events including 185 files"~~

Rule 3: Top Users Overall Section — Copilot Compression

When Copilot events exceed 80% of total volume:

Do NOT render a standalone "Top Users Overall" section dominated by Copilot service accounts
Instead, include a brief note: "Top overall users are dominated by Copilot service accounts/heavy Copilot users — see file-based user ranking below for actual data access risk."
If any users in the top-10 overall have multi-workload activity (Copilot + file operations), call them out in a single sentence rather than a full table

Rule 4: Copilot Count Reconciliation

When reporting Copilot vs file-based splits, ensure the counts reconcile across sections:

Action Type breakdown Copilot total = Workload breakdown Copilot total
If they differ (e.g., Connected AI App events counted differently), annotate the delta
Show the reconciliation in the Action Type section: "Copilot interactions: N events (Action Types: Risky prompt X + Sensitive response Y + Combined Z = N)"

Rule 5: Scope & Limitations Section (Required for Markdown Reports)

Markdown file reports MUST include a Scope & Limitations section immediately after the Executive Summary. Include:

## Scope & Limitations

| Consideration | Detail |
|--------------|--------|
| **Data Source** | DataSecurityEvents (Defender XDR Advanced Hunting) — requires Insider Risk Management opt-in to share data with Defender XDR |
| **Coverage** | SIT detections + sensitivity label events — files with neither SIT content match nor sensitivity label do NOT appear in this data |
| **Label Coverage** | X% of events have sensitivity labels — label analysis depth adjusted accordingly (see Rule 11) |
| **Retention** | 30-day Advanced Hunting retention |
| **ML Classifiers** | N trainable classifier GUIDs could not be resolved — see Unresolved SIT GUIDs section |
| **Copilot Volume** | Copilot events represent X% of total volume and are separated from file-based analysis throughout this report |

Fill in the actual values for N and X% from the query results.

Rule 6: SIT Landscape Table Integrity

Each GUID must appear exactly once in the SIT Landscape table — one row per GUID, no exceptions
NEVER group multiple GUIDs into a single row with slash-separated values (e.g., 1e883268/d2cdc387/bf6e0b84...). Even "copy" variants of the same SIT that share identical metrics MUST be separate rows
After all GUID resolution tiers complete, deduplicate by GUID — if conflicts exist, prefer the most specific resolution (Tier 3 PowerShell > Tier 2 config > Tier 1 embedded)
Group the table by category: Custom/Organization SITs, Built-in Microsoft SITs, ML Classifiers (Unresolvable)
Do NOT include a GUID under two different names
The total distinct SIT count cited in the Executive Summary must equal the number of rows in the SIT Landscape tables (sum of all category sub-tables)
Post-render verification (MANDATORY): After building all SIT Landscape sub-tables, count the total rows. If the exec summary cites a different number, update the exec summary to match. Format: "N active SIT types" where N = sum of rows across Custom, Built-in, and Unresolved sub-tables

⛔ PROHIBITED: Bundling GUIDs like "Credit Card Number copy (x3)" with 6,966 ea. — each of the 3 GUIDs must be its own row with its own exact counts ⛔ PROHIBITED: Exec summary citing a SIT count that doesn't match the actual row count in the SIT Landscape tables

Rule 7: Spike Detection — File-Based Primary, Overall Secondary

When rendering spike alerts:

Primary: Always run and display Query 10b (file-based-only spikes) as the main spike alert section. This surfaces actual sensitive data access spikes.
Secondary: Run Query 10 (all events) only if the user requests overall spikes or if there are interesting patterns worth noting. Include a clear note that these spikes are predominantly Copilot-driven.
If only running one spike query, always prefer Query 10b.
In the report, label sections clearly: "File-Based SIT Access Spikes" vs "Overall SIT Access Spikes (incl. Copilot)".

Rule 8: Top Files — Exclude System/Operational Files

The Top Files section must exclude Defender for Endpoint operational file reads:

C:\ProgramData\Microsoft\Windows Defender\DLPCache\* — DLP label metadata cache reads
*\EBWebView\* — Edge WebView browser cache
Any path matching \ProgramData\Microsoft\ that is clearly a system/cache path

If system files appear in results despite the query filter, separate them into a "System/Operational Files" subsection below the main "User-Accessed Files" list.

Rule 9: Risk Rating Consistency — Exec Summary Must Match User Table

Every user mentioned in the Executive Summary MUST use the same risk rating as in the File-Based Top Users table. If the table says 🔴 Critical, the exec summary must say Critical (and vice versa).

After building the File-Based Top Users table (the source of truth), cross-check every user mention in the exec summary
If there is a conflict, the User Table rating wins — update the exec summary to match
Never rate a user differently in two sections of the same report

⛔ PROHIBITED: Exec summary says "High" while user table says "Critical" for the same user (or any other mismatch).

Rule 10: Temporal Pattern — Include File-Based Event Column

The Temporal Pattern (daily volume) section MUST include a File Events column alongside the total. Without this, daily spikes appear alarming when they may be entirely Copilot-driven.

Column	Required	Source
Date	✅	`bin(Timestamp, 1d)`
Daily Events	✅	Total `count()`
File Events	✅	`countif(Workload !in ("Copilot", "ConnectedAIApp"))`
Distinct Users	✅	`dcount(AccountUpn)`
Notable	✅	Annotation for spikes

When annotating spikes (🔴), clarify whether the spike is Copilot-driven or file-driven:

"🔴 Major spike — file-driven" (if File Events also spike)
"🟡 Copilot adoption spike — file activity normal" (if only total spikes but File Events are stable)

Use Query 5 (updated) which returns both columns.

Rule 11: Adaptive Label Analysis Depth

The depth of sensitivity label analysis MUST be determined dynamically based on the Phase 1 Label Coverage Assessment. This prevents wasting queries in SIT-dominant environments while ensuring full coverage in label-mature environments.

Label Coverage	Report Behavior
≥5% labeled events	Full dedicated label sections: Label Landscape table, Label-Based Top Users, Label Changes, Copilot Label Exposure, Label-Only Events. These render as peer sections alongside SIT analysis
1-5% labeled events	Condensed "Sensitivity Labels" section: Label Landscape table + Label Changes only. Other label dimensions mentioned in summary notes
<1% labeled events	Brief note in Scope & Limitations: "Sensitivity label data is sparse (<1% of events) — environment is SIT-dominant. N events had labels; see label coverage stats below." No dedicated label sections unless user asks
User explicitly asks about labels	Full label sections regardless of coverage percentage

The Label Coverage Assessment also determines the overall report framing:

SIT-dominant (<5% labels): Report title/framing stays "SIT Access Analysis" with label addendum
Dual signal (5-50% labels): Report framing becomes "Data Security Analysis (SIT + Labels)"
Label-dominant (>50% labels): Report framing becomes "Data Security Analysis" with labels as primary signal and SIT as secondary

⛔ PROHIBITED: Running all 5 label queries (11-15) when coverage is <1% and user didn't ask about labels — this wastes API calls and clutters the report.

Rule 12: Label Display Format — Always Show Parent/Child Hierarchy

When rendering sensitivity label names, ALWAYS show the parent-child hierarchy using "/" notation:

Raw Label	Correct Display	Incorrect Display
Sub-label under Highly Confidential	"Highly Confidential / All Employees"	"All Employees"
Sub-label under Confidential	"Confidential / All employees"	"All employees"
Top-level label	"General"	"General" (correct as-is)
Unknown label	"[Unknown] abc12345..."	blank or omitted

This prevents ambiguity — "All Employees" exists under BOTH Confidential and Highly Confidential as different labels with different GUIDs.

⛔ PROHIBITED: Displaying sub-label names without their parent (e.g., just "Specific people" — which parent?).

Rule 13: Service Account Separation in Copilot Analysis

All Copilot SIT analysis MUST separate automated service accounts from human users. Service accounts (Security Copilot agents, Purview agents, etc.) can generate 50-70% of Copilot event volume and must not inflate human risk metrics.

Requirement	Detail
Identify service accounts	Filter: `AccountUpn matches regex ServiceAccountPattern` (defined in Query 16a: `@"^(securitycopilotagentuser-\|svc-)"`). Check Query 16a results for additional org-specific patterns (any account with >10K events and a GUID-like UPN prefix is likely automated)
Report agent volume separately	Include a summary line: "⚠️ N automated service accounts generated X events (Y% of Copilot volume) — excluded from human risk analysis below"
Never mix in user rankings	Queries 16b-16d exclude agents by default. If rendering an overall Copilot user table, agents go in a separate subsection
Power-user flagging	After agent exclusion, if any single human user accounts for >20% of remaining Copilot events, flag them: "⚠️ Power user — may indicate automated workflow via Copilot"

⛔ PROHIBITED: Including automated service accounts in human user risk rankings for Copilot SIT analysis. ⛔ PROHIBITED: Reporting raw Copilot event counts as "user interactions" without noting the ~2-3x row multiplication factor.

Rule 14: Copilot Event Row Multiplication Awareness

Each Copilot interaction generates multiple DSE rows (average ~2-3x, up to 35x for complex exchanges). Purview creates separate rows for:

The prompt (if it contains a SIT match) — "Risky prompt entered in Copilot"
The response (if grounding data contains SIT matches) — "Sensitive response received in Copilot"
Compound events — agent interactions, SharePoint access, multiple simultaneous conditions

Reporting Requirement	Format
Raw event counts	"N Copilot SIT events (raw DSE rows)"
Estimated interactions	"~N/2.5 estimated real interactions" (use the ratio from the environment if calculated)
User daily rates	If citing per-user daily rates, note whether raw or estimated: "~X interactions/day (estimated from Y raw events)"

This rule prevents stakeholders from seeing "270K Copilot events" and panicking when the actual interaction count is ~95K.

Rule 15: Copilot SIT Report Section (Phase 2.5 Output)

When Phase 2.5 is triggered (Copilot >30% of events), render a dedicated "Copilot SIT Exposure" section with:

Service Account vs Human Split — "X events from N service accounts (Y%), Z events from N human users (Y%)" with the agent exclusion note
SIT Priority Landscape — Table from Query 16a showing SITs by priority tier (High/Medium/Low) with human-only metrics
High-Priority SIT Users — Top 10 human users from Query 16b with prompt vs response breakdown
Prompt Risk Signal — Top 10 users from Query 16d who typed high-priority SITs into Copilot prompts — this is the primary actionable finding
Temporal Trend — Daily trend from Query 16c (optional — include if spikes are interesting)

If Copilot events are <30% of total, skip Phase 2.5 entirely and note in the Copilot section: "Copilot events represent X% of total volume — below threshold for dedicated analysis. See Action Type breakdown for Copilot summary."

Inline Chat Format

## 📊 Data Security Events Analysis
**Scope:** <Tenant-wide / SIT: <name> / User: <UPN>>
**Time Range:** <start> to <end>
**Total Events:** <N> | **Distinct Users:** <N> | **Distinct Files:** <N>

### SIT Landscape
| # | SIT Name | GUID (short) | Events | Users | Files | Classifier |
|---|----------|-------------|--------|-------|-------|------------|
| 1 | Credit Card Number | 50842eb7... | 7,255 | 46 | 346 | Content |
| 2 | All Full Names | 50b8b56b... | 128,957 | 1,475 | 119 | Content |
| ... | | | | | | |

### Action Type Breakdown
| Action Type | Events | Users | Files |
|-------------|--------|-------|-------|
| Sensitive response received in Copilot | 228,564 | ... | ... |
| Risky prompt entered in Copilot | 390,905 | ... | ... |
| ... | | | |

### 🔴 Top Users by SIT Volume (Risk-Ranked)
| # | User | Total Events | Distinct SITs | Distinct Files | Last Active |
|---|------|-------------|---------------|----------------|-------------|
| 1 | user@domain.com | 12,345 | 8 | 42 | 2026-03-16 |
| ... | | | | | |

### 🏷️ Sensitivity Label Landscape (if ≥1% coverage)
| # | Label | GUID (short) | Events | Users | Files |
|---|-------|-------------|--------|-------|-------|
| 1 | Highly Confidential / All Employees | defa4170...000a | 62 | 9 | 31 |
| 2 | General | defa4170...0002 | 55 | 12 | 32 |
| ... | | | | | |

### ⚠️ Label Changes (if any PreviousSensitivityLabelId events exist)
| Timestamp | User | File | Previous Label | Current Label | Action |
|-----------|------|------|---------------|--------------|--------|
| 2026-03-11 | user@domain.com | doc.docx | HC / Project X | Confidential / All employees | Label downgraded |
| ... | | | | | |

### 🤖 Copilot SIT Exposure (if Copilot >30% of events)
**Service Accounts:** N agents generated X events (Y%) — excluded from analysis below
**Row Multiplication:** ~2-3 DSE rows per real interaction (est. ~Z real interactions)

| # | SIT Name | Priority | Human Events | Human Users | Prompts | Responses |
|---|----------|----------|-------------|-------------|---------|----------|
| 1 | Credit Card Number | 🔴 High | 1,234 | 89 | 456 | 778 |
| 2 | All Full Names | 🔵 Low | 45,678 | 1,200 | 12,345 | 33,333 |
| ... | | | | | | |

**🔴 Prompt Risk — Users Typing High-Priority SITs Into Copilot:**
| # | User | Prompt Events | SIT Types | Last Active |
|---|------|-------------|-----------|-------------|
| 1 | user@domain.com | 142 | SSN, Credit Card | 2026-03-16 |
| ... | | | | |

### ⚠️ SIT Access Spike Alerts
| User | Baseline (daily avg) | Recent (daily avg) | Spike Ratio | Status |
|------|---------------------|-------------------|-------------|--------|
| user@domain.com | 5.2 | 47.1 | 9.06x | 🔴 Spike |
| ... | | | | |

### Unresolved SIT GUIDs
<List of GUIDs not in mapping — offer PowerShell resolution>

Markdown File Format

Same structure as inline, wrapped in proper markdown with:

Report metadata header (generated timestamp, scope, tool versions)
Scope & Limitations section immediately after Executive Summary (see Rule 5 above)
Each section as H2/H3
File-based user ranking as the primary risk section (NOT the Copilot-dominated overall ranking)
DLP Policy Match section with DefaultRule explanation for empty PolicyName entries
Sensitivity Label sections (if coverage ≥1% or user requested): Label Landscape, Label-Based Top Users, Label Changes, Copilot Label Exposure
Code fences for any raw data samples
Save to: reports/data-security/DataSecurity_Analysis_<scope>_YYYYMMDD_HHMMSS.md

Known Pitfalls

Pitfall	Detail	Mitigation
`SensitiveInfoTypeInfo` is `Collection(String)`, not dynamic	Each element is a JSON string requiring double-parse: `parse_json(tostring(SensitiveInfoTypeInfo))` to expand array, then `parse_json(tostring(element))` to access fields	Always double-parse. Direct dot-notation fails silently
Massive row counts	600k+ rows in 30 days for mid-size tenants; millions for 100k+ user orgs	ALWAYS `summarize` first. NEVER retrieve raw rows without `take` limit
`mv-expand` is expensive	Expanding SensitiveInfoTypeInfo across 600k rows without pre-filtering is extremely slow	Pre-filter with `where SensitiveInfoTypeInfo has "<GUID>"` before `mv-expand`
Copilot dominates event volume	90%+ of events can be Copilot-related. See Rules 2-3 for report handling and Phase 2.5 for dedicated Copilot analysis	Filter to specific `ActionType` values for file access. Use `Workload !in ("Copilot", "ConnectedAIApp")` for file-only analysis
Trainable classifiers (MLModel) don't resolve	GUIDs with `ClassifierType: "MLModel"` may not exist in `Get-DlpSensitiveInformationType`	Display as `[ML Classifier] <GUID>` and note in report
SIT GUID is per-SIT, not per-detection	Multiple documents can match the same SIT GUID — the GUID identifies the SIT type, not the specific match	Use `Count` and `Confidence` fields from SITJson for detection-level detail
`ObjectId` can be empty	Copilot interaction events may not have an ObjectId (no specific file)	Filter `isnotempty(ObjectId)` for file-specific analysis
IRM opt-in required	DataSecurityEvents is populated by Insider Risk Management. No opt-in = empty table	Check for 0 results and explain the prerequisite
Table schema evolves	DataSecurityEvents is in Preview — column names and availability may change	Run `getschema` if queries fail with column resolution errors
`DlpPolicyMatchInfo` is sparse	Only ~0.5% of rows have DLP policy match data (the rest are IRM-only detections)	Don't assume all SIT events have DLP data; they're independent signals
Duplicate GUID in SIT mapping	One GUID can only resolve to one SIT name. If a GUID appears in both the embedded `datatable` and a Tier 2/3 resolution with a different name, the result will have duplicate rows with conflicting names. This can happen when a built-in SIT GUID overlaps with a custom SIT copy, or when PowerShell returns a different display name than the embedded mapping	After resolving all GUIDs, deduplicate by GUID before rendering the SIT Landscape table. If a GUID maps to multiple names, prefer the Tier 3 (PowerShell) name over Tier 1 (embedded). Never show the same GUID on two rows with different names
Empty `PolicyName` = DefaultRule ("Always audit")	DLP alerts with empty/null `PolicyName` are typically generated by the built-in `DefaultRule` that fires when "Always audit file activity for devices" is enabled (ON by default). These are NOT orphaned or misconfigured policies — they are the expected result of the default audit setting	In the DLP Policy Match section, explain: "Events with empty PolicyName are generated by the built-in DefaultRule, which audits all monitored file types (Office, PDF, CSV) on onboarded devices when 'Always audit file activity for devices' is enabled (default: ON). No explicit DLP policy is required for these events."
Compound ActionType values	Some events have ActionType values that combine multiple labels (e.g., "Risky prompt entered in Copilot, Sensitive response received in Copilot" or "Sensitive info shared on Teams, DLP Rule Matched"). These are literal string values from the data, not aggregation artifacts	Display compound ActionTypes exactly as they appear in the data. Do NOT split or merge them — they represent events where multiple conditions were met simultaneously
System/operational files dominate Top Files	Files under `C:\ProgramData\Microsoft\Windows Defender\DLPCache\RMSLabels\` and `\EBWebView\` are Defender for Endpoint reading sensitivity label metadata — NOT user-initiated file access. These can dominate 90%+ of the Top Files list	Query 4 filters these paths. If they still appear, separate into a "System/Operational Files" subsection. Never present DLPCache reads as user data access risk
Localized SIT names in CloudAppEvents	CloudAppEvents `DLPRuleMatch` events include SIT names, but names appear in the user's locale (e.g., "የዱቤ ካርድ ቁጥር" instead of "Credit Card Number" for Amharic users). Same GUID can map to different name strings depending on locale	Always aggregate by SIT GUID, never by SIT name. Use the GUID-to-name mapping (Tier 1/2/3) for canonical English names. This applies when cross-referencing CloudAppEvents with DataSecurityEvents
Browsing events are not files — separate from Top Files	ActionTypes like "Generative AI websites browsed" and "Gambling websites browsed" reference URLs, not files. They have no `ObjectId` file path — only a URL domain. Including them under "Top Files" is misleading	Render browsing/URL events in a separate subsection (e.g., "External / AI Service Access") below Top Files. Never mix URL-based events into the file ranking tables. Title the Top Files section accurately ("Top Files" not "Top Files & URLs")
Temporal spike annotations must reference known users	When annotating daily spikes in the Temporal Pattern table (e.g., "🔴 Major spike — u625 SharePoint batch"), the attributed user MUST appear elsewhere in the report — in the File-Based Top Users table, a drill-down section, or at minimum a footnote. Referencing a user that exists nowhere else in the report violates the evidence-based analysis rule and creates an unverifiable claim	Before annotating a spike with a user attribution, verify the user appears in the Top Users ranking. If they don't make the top-10 but are the spike driver, either: (a) add them to the user table with a note "included due to temporal spike attribution", or (b) use a generic annotation ("🔴 Major spike — file-driven") without naming the user
`SensitivityLabelId` can contain comma-separated GUIDs	When an event involves multiple sub-entities (e.g., Copilot accessing multiple labeled resources), the `SensitivityLabelId` column contains comma-separated GUIDs like `aaaa-...,aaaa-...,bbbb-...`. This is NOT a single GUID	Always `split(SensitivityLabelId, ",")` then `mv-expand` to handle multi-GUID values. Querying with `== "<GUID>"` will miss events; use `has "<GUID>"` for filtering or `split()` + `mv-expand` for enumeration
`RiskyAIUsageSensitivityLabelsInfo` is mostly `[null]`	The column is populated on 90%+ of Copilot events, but almost all contain `[null]` (a JSON array with a literal null element). Only events where Copilot actually accessed labeled resources have real data	Filter with `tostring(RiskyAIUsageSensitivityLabelsInfo) !has "null"` to exclude the null-dominated rows. The `isnotempty()` check alone is insufficient — `[null]` passes it
Label GUIDs have no `ClassifierType` field	Unlike SIT GUIDs which have `ClassifierType: "Content"` or `"MLModel"`, label GUIDs have no built-in type indicator. Resolution requires external lookup (datatable, config, or PowerShell `Get-Label`)	Use the Label GUID Mapping Strategy (3 tiers). Unknown labels display as `[Unknown] <GUID>`
Default label GUIDs are deterministic but not officially documented	All 12 default labels (including parents) use `defa4170-0d19-0005-XXXX-bc88714345d2` with sequential suffixes `0000`–`000b` (confirmed via `Get-Label` on default-configuration tenants). However, Microsoft does not publish these GUIDs in official docs. Older/customized tenants may have: (a) renamed default labels (e.g., "Non-business" instead of "Personal"), (b) replaced parent label GUIDs with random tenant-specific ones, or (c) added custom sub-labels that break the sequential pattern	The embedded datatable includes all 12 confirmed default GUIDs. For tenants with custom parent GUIDs, add them via `config.json` `label_mapping`. If `Get-Label` returns different names for a `defa4170` GUID, prefer the `Get-Label` name for that tenant
`PreviousSensitivityLabelId` can equal `SensitivityLabelId`	On "Label removed from a file" events, both fields may contain the same GUID (the label that was removed). The current label after removal may be empty	Check `ActionType` to distinguish: "Label downgraded" = actual label change; "Label removed" = label stripped (current may be empty); "Label on file downgraded or removed, then file accessed by Copilot" = compound event with Copilot exposure
Label-only events require label-aware DLP/IRM policies	Events with labels but no SIT content (Query 15) only appear if a DLP or IRM policy explicitly targets sensitivity labels. Environments with no label-based policies will see zero label-only events even if documents are labeled	If Query 15 returns 0 results AND labels exist (Query 11 > 0), this indicates no label-based DLP/IRM policies are configured — mention this as a gap in the report
Copilot service accounts inflate event counts	Automated service accounts can generate 50-70% of all Copilot events. See Rule 13 for full requirements	Filter: `where not(AccountUpn matches regex ServiceAccountPattern)` (defined in Query 16a). Run Query 16a to quantify agent vs human split
Copilot row multiplication (~2-3x per interaction)	Each interaction generates ~2-3 DSE rows (prompt + response + compound events), up to 35 for complex exchanges. See Rule 14 for reporting requirements	Estimate real interactions as `raw_events / 2.5`. Always include multiplier context when reporting Copilot volumes
Correct field name is `SensitiveInfoTypeId` (not `SensitiveType`)	Inside the `SensitiveInfoTypeInfo` JSON, the SIT GUID field is named `SensitiveInfoTypeId`. LLMs frequently generate `SensitiveType` or `SensitiveTypeId` — both are wrong and return null/empty when accessed	Always use `tostring(SITJson.SensitiveInfoTypeId)`. Other valid fields: `Confidence` (int), `ClassifierType` (string), `Count` (int), `SubEntityName` (string — "Prompt" or "Response" in Copilot events)
`SubEntityName` distinguishes Prompt vs Response in Copilot SIT events	Inside `SensitiveInfoTypeInfo` JSON, `SubEntityName` contains "Prompt" or "Response" — indicating whether the SIT was detected in the user's prompt or Copilot's response. This is more reliable than relying on `ActionType` text matching for prompt/response classification within individual SIT matches	Use `tostring(SITJson.SubEntityName)` when doing per-SIT prompt/response breakdowns. `ActionType` classifies the event-level action; `SubEntityName` classifies the per-SIT-match context

Error Handling

Error	Cause	Resolution
`Failed to resolve table 'DataSecurityEvents'`	Table not available — IRM not opted in, or not connected to Defender XDR	Inform user: "DataSecurityEvents requires Microsoft Purview Insider Risk Management opt-in to share data with Defender XDR."
`0 results` for SensitiveInfoTypeInfo queries	No SIT detections in timeframe, or SIT detection not enabled in IRM policies	Widen time range; check if IRM policies include SIT detection
`Failed to resolve column 'ObjectName'`	Schema changed or column renamed	Use `ObjectId` instead (confirmed available). Run `getschema` to verify current schema
PowerShell `Get-DlpSensitiveInformationType` fails	Not connected to IPPS session	Run `Connect-IPPSSession -UserPrincipalName <UPN>` first
`The term 'Get-DlpSensitiveInformationType' is not recognized`	Module not installed or IPPS session in different terminal	`Install-Module ExchangeOnlineManagement` then `Connect-IPPSSession` in the same terminal session
`The term 'Get-Label' is not recognized`	Not connected to IPPS session or module not loaded	Same as above — `Connect-IPPSSession` provides both `Get-DlpSensitiveInformationType` and `Get-Label`
Label GUIDs all show as `[Unknown]`	Default label datatable doesn't match target tenant (parent GUIDs vary)	Resolve via `Get-Label` and persist to `config.json` `label_mapping`

File Access Action Types Reference

When the user specifically asks about who opened/accessed/downloaded documents, filter to these ActionTypes:

ActionType	Meaning
`Sensitive File read`	File opened on endpoint (Defender for Endpoint)
`File accessed on SPO`	File opened in SharePoint Online / OneDrive
`File downloaded from SharePoint`	File downloaded from SPO/OneDrive
`File copied to Removable media`	File copied to USB/removable storage
`File upload to cloud`	File uploaded to cloud storage
`Sensitive file created`	New file created with sensitive content
`File Archived`	File moved to archive
`Text copied to clipboard from sensitive file`	Clipboard copy from sensitive doc

Copilot-related ActionTypes (separate category — AI interaction, not direct file access):

ActionType	Meaning
`Sensitive response received in Copilot`	Copilot surfaced content matching a SIT
`Risky prompt entered in Copilot`	User prompt triggered risk detection
`Sensitive response received in Copilot;Agent generating sensitive responses`	Copilot agent generated response containing SIT matches
`Risky prompt entered in Copilot;Sensitive response received in Copilot`	Both prompt and response contained SIT matches
`Risky prompt entered in Copilot;Sensitive response received in Copilot;Exposing agent to risky prompts;Agent generating sensitive responses`	Compound agent event — prompt + response + agent risk
`Risky prompt entered in Copilot;Exposing agent to risky prompts`	User exposed a Copilot agent to a risky prompt

Label-related ActionTypes (sensitivity label events):

ActionType	Meaning
`Label downgraded on a file`	Sensitivity label lowered (e.g., HC → Confidential)
`Label removed from a file`	Sensitivity label stripped entirely
`Label on file downgraded or removed, then file accessed by Copilot`	Label reduced AND Copilot subsequently accessed the file

DLP ActionTypes:

ActionType	Meaning
`Generated High severity DLP alerts`	DLP policy triggered a high-severity alert
`DLP Rule Matched`	DLP rule matched (may be combined with other types)

SVG Dashboard Generation

After generating a Data Security Events analysis report (markdown file output), an SVG dashboard can be created using the shared SVG rendering skill.

Trigger: User asks "generate an SVG dashboard from the report" or "visualize this report"

Workflow:

Read this skill's svg-widgets.yaml (widget manifest — defines layout, colors, field mapping)
Read .github/skills/svg-dashboard/SKILL.md (rendering rules — component library, quality standards)
Extract data from the completed report using data_sources.field_mapping_notes
Render SVG → save as {report_basename}_dashboard.svg in the same directory

Layout: 5 rows — title banner, KPI cards (events/users/files/file%/copilot%/label%), top SITs bar chart + workload donut, risk-ranked users table + file action bars, assessment banner + recommendations.

name	data-security-analysis
description	Analyze data security events, sensitive information type (SIT) access, sensitivity label access, DLP matches, or Purview insider risk activity. Triggers on keywords like "data security", "sensitive information type", "SIT access", "DLP events", "DataSecurityEvents", "EDM access", "credit card access", "insider risk activity", "Purview data security", "sensitivity label", "label downgrade", "label change", "Copilot label exposure". Queries DataSecurityEvents in Advanced Hunting to produce SIT and label access analysis: volume breakdowns, user drill-downs, file inventories, action type distribution, DLP correlation, label change tracking, Copilot label exposure, temporal patterns, and risk-ranked user summaries. Inline chat or markdown output. Designed for large environments (100k+ users) with tiered drill-down.
threat_pulse_domains	["admin","cloud"]
drill_down_prompt	Analyze data security events — SIT access patterns, label changes, DLP policy matches

Data Security Events Analysis — Instructions

Purpose

Primary Table: DataSecurityEvents (Defender XDR Advanced Hunting)

Use Case	Example Question
SIT access audit	"Who accessed files with credit card numbers in the last 30 days?"
EDM monitoring	"Show me all access to documents matching our EDM SIT"
DLP event analysis	"What DLP policy matches occurred this week?"
Insider risk triage	"Which users have the most sensitive data interactions?"
SIT landscape overview	"What sensitive information types exist in our environment?"
Sensitivity label audit	"Who accessed Highly Confidential labeled documents?"
Label change tracking	"Show me all label downgrades in the last 30 days"
Copilot label exposure	"What labeled documents did Copilot access in risky interactions?"

📑 TABLE OF CONTENTS

Critical Workflow Rules - Start here!
SIT GUID Mapping Strategy - How SIT GUIDs are resolved to names
Label GUID Mapping Strategy - How sensitivity label GUIDs are resolved to names
Output Modes - Inline chat vs. Markdown file
Quick Start - 8-step execution pattern
Execution Workflow - 6-phase analysis process
Sample KQL Queries - Validated query patterns (Queries 1-16d)
Report Template - Rendering rules (15 rules) + output format specification
Known Pitfalls - Table quirks and edge cases (27 entries)
Error Handling - Troubleshooting guide
SVG Dashboard Generation - Visual dashboard from report data

Investigation shortcuts:

DLP/exfiltration incident entities (TP Q1): Q3 (top users by SIT volume) → Q6 (DLP policy matches) → Q9 (single-user SIT profile for each incident entity) → Q10b (file-based spikes)
High-volume mailbox API access (TP Q9): Q9 (single-user SIT profile for API actors) → Q4 (top files accessed) → Q10b (file-based spikes) → Q6 (DLP policy matches)
Risky identity with data access (TP Q3): Q9 (single-user SIT profile) → Q4 (top files) → Q13 (label downgrade/changes by user)
Copilot sensitive data exposure (TP Q1 Copilot incidents, or TP Q10 AppRegistration with AI keywords): Q16a (Copilot SIT landscape + agent/human split) → Q16b (top human users, high-priority SITs) → Q16d (prompt-only risk signal)
Label compliance / downgrade alert (TP Q1 label-related incidents): Q13 (label changes) → Q15 (label-only events) → Q14 (Copilot label exposure)
Tenant-wide data security posture (standalone, no TP trigger): Full Phase 1–5 workflow

⛔ Shortcut Default Rule: When a matching shortcut exists for the investigation context, use it — don't run the full workflow. Only run the full Phase 1-5 sequence when the user explicitly requests "full analysis", "comprehensive", or "tenant-wide overview". Shortcuts render only the report sections relevant to their query chain (plus Executive Summary and Recommendations, always).

When invoked from a parent skill (threat-pulse, incident-investigation, user-investigation):

Inherit the workspace selection from the parent investigation context
Skip output mode prompts — default to inline chat (the parent skill controls the final output format)
Match the TP Q# trigger to the shortcuts above and execute that chain with entity substitution
Use 30d lookback (AH default) unless the parent specifies otherwise

⚠️ CRITICAL WORKFLOW RULES - READ FIRST ⚠️

Before starting ANY data security analysis:

ALWAYS use RunAdvancedHuntingQuery — DataSecurityEvents is an Advanced Hunting table, NOT available in Sentinel Data Lake
ALWAYS run Query 1 (SIT Discovery) first — establishes which SITs are active and builds the GUID-to-Name mapping
ALWAYS use summarize aggressively — this table can have 600k+ rows in 30 days even in mid-size tenants. NEVER retrieve raw rows except for targeted samples
ALWAYS pre-filter with has before mv-expand on SensitiveInfoTypeInfo — the has "<GUID>" filter avoids expensive expansion on non-matching rows
ALWAYS use tostring() + double parse_json() for SensitiveInfoTypeInfo — it's Collection(String), not native dynamic
NEVER report SIT GUIDs without attempting name resolution — use the mapping strategy below
ALWAYS ask for output mode if not specified: inline chat or markdown file
Prerequisite: DataSecurityEvents requires Insider Risk Management opt-in to share data with Defender XDR. If the table returns 0 rows or "table not found", inform the user of this requirement
ALWAYS run the Label Coverage Assessment (Query 11 quick stats variant) during Phase 1 to determine if this environment has significant label usage. Adapt the report accordingly (see Rule 11)
NEVER report sensitivity label GUIDs without attempting name resolution — use the label mapping strategy below
ALWAYS use split() on SensitivityLabelId — this column can contain comma-separated GUIDs (one per sub-entity), not a single GUID

⛔ PROHIBITED ACTIONS

Action	Status
Querying DataSecurityEvents via `mcp_sentinel-data_query_lake`	❌ PROHIBITED — AH-only table
Retrieving raw rows without `summarize` or `take` limit	❌ PROHIBITED — table is massive
Reporting SIT GUIDs without name resolution attempt	❌ PROHIBITED
Reporting sensitivity label GUIDs without name resolution attempt	❌ PROHIBITED
Running `mv-expand` on SensitiveInfoTypeInfo without pre-filtering with `has`	❌ PROHIBITED — performance killer at scale
Assuming `SensitiveInfoTypeInfo` is native dynamic	❌ PROHIBITED — it's `Collection(String)`, requires double-parse

SIT GUID Mapping Strategy

The Problem

DataSecurityEvents.SensitiveInfoTypeInfo contains SIT GUIDs, not human-readable names. SIT GUIDs fall into three categories:

Category	Resolvable via KQL?	Example
Built-in Microsoft SITs	✅ Yes — use embedded mapping	`50842eb7-...-b085` → "Credit Card Number"
Custom/EDM SITs	❌ No — org-specific GUIDs	`b28fcea1-...-9291` → "Project Obsidian" (custom)
Trainable Classifiers (ML)	❌ No — `ClassifierType: "MLModel"`	`77a140be-...-7560` → unknown ML classifier

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known SIT Mapping (instant, no auth)

The query library below includes a datatable of the most common Microsoft SIT GUIDs encountered in production environments. This covers ~90% of detections in typical tenants.

Tier 2: User-Provided Custom SIT Mapping (config-driven)

If the user has custom/EDM SITs, they can provide a mapping in config.json under a sit_mapping key:

{
  "sit_mapping": {
    "<custom-sit-guid-1>": "Your Custom SIT Name",
    "<custom-sit-guid-2>": "Your EDM SIT Name"
  }
}

At skill startup: Check if config.json has a sit_mapping section. If yes, merge it into the KQL datatable for name resolution.

Tier 3: PowerShell Resolution (optional, on-demand)

If unresolved GUIDs remain after Tier 1+2, offer to resolve them via PowerShell:

"I found N SIT GUIDs that aren't in the built-in mapping. Would you like me to resolve them via Get-DlpSensitiveInformationType? This requires an active Security & Compliance PowerShell session (Connect-IPPSSession)."

If the user agrees:

# Requires: Install-Module ExchangeOnlineManagement
# Requires: Connect-IPPSSession -UserPrincipalName <UPN>
Get-DlpSensitiveInformationType -Identity "<GUID>" | Select-Object Name, Id, Publisher

After resolution: Offer to save the mapping to config.json for future runs.

Post-Resolution Persistence (MANDATORY)

After Tier 3 PowerShell resolution completes, always offer to persist the resolved GUIDs:

"I resolved N SIT GUIDs via PowerShell. Would you like me to save these to config.json under sit_mapping so future runs resolve them automatically via Tier 2?"

If the user agrees, read the current config.json, add/merge a sit_mapping object with the resolved GUIDs, and write it back. Format:

{
  "sit_mapping": {
    "<guid>": "<resolved-name>",
    "<guid>": "<resolved-name>"
  }
}

Trainable Classifiers

GUIDs with ClassifierType: "MLModel" are trainable classifiers and may not resolve via Get-DlpSensitiveInformationType. Display them as:

[ML Classifier] <GUID> if unresolved
Check if the GUID appears in the well-known mapping (some trainable classifiers have known GUIDs)

Label GUID Mapping Strategy

The Problem

DataSecurityEvents has 4 label-related columns, all containing sensitivity label GUIDs (not names):

Column	Type	Content
`SensitivityLabelId`	string	Label on the document at event time. Can contain comma-separated GUIDs (one per sub-entity)
`PreviousSensitivityLabelId`	string	Previous label — only populated on label-change events (downgrade, removal)
`SharepointSiteSensitivityLabelId`	string	Label on the SharePoint site (not the document)
`RiskyAIUsageSensitivityLabelsInfo`	Collection(String)	Labels on resources Copilot accessed in risky AI events — JSON array of objects with `SubEntityId`, `SubEntityName`, `SensitivityLabelId`

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known Label Mapping (instant, no auth)

⚠️ Important: Microsoft does not publish default label GUIDs in official documentation. The GUID pattern is confirmed via Get-Label on default-configuration tenants. Older tenants may have renamed default labels (e.g., "Non-business" instead of "Personal", "Internal exception" instead of "Anyone (unrestricted)") or replaced default parent label GUIDs with random tenant-specific ones. Always validate with Get-Label (Tier 3) when accuracy matters.

Default Label GUID Pattern: defa4170-0d19-0005-XXXX-bc88714345d2 — complete mapping (12 labels, priority-ordered):

GUID suffix	Priority	Default Name	Parent
`0000`	0	Personal	(top-level)
`0001`	1	Public	(top-level)
`0002`	2	General	(top-level)
`0003`	3	Anyone (unrestricted)	General
`0004`	4	All Employees (unrestricted)	General
`0005`	5	Confidential	(top-level, parent)
`0006`	6	Anyone (unrestricted)	Confidential
`0007`	7	All Employees	Confidential
`0008`	8	Trusted People	Confidential
`0009`	9	Highly Confidential	(top-level, parent)
`000a`	10	All Employees	Highly Confidential
`000b`	11	Specified People	Highly Confidential

Older/customized tenants: Admins may have renamed default labels or deleted and recreated parent labels with random GUIDs. If Get-Label returns a different GUID for "Confidential" or "Highly Confidential" (not matching defa4170-*), the tenant has custom parent labels — add them via config.json (Tier 2).

Tier 2: User-Provided Custom Label Mapping (config-driven)

Custom labels (org-created) have random GUIDs. Users can provide a mapping in config.json:

{
  "label_mapping": {
    "<custom-label-guid-1>": "Your Custom Label",
    "<sub-label-guid>": "Sub-Label Name|Parent Label Name",
    "<parent-label-guid>": "Confidential"
  }
}

Tier 3: PowerShell Resolution (optional, on-demand)

If unresolved label GUIDs remain after Tier 1+2, offer to resolve them via PowerShell:

"I found N label GUIDs that aren't in the built-in mapping. Would you like me to resolve them via Get-Label? This requires an active Security & Compliance PowerShell session (Connect-IPPSSession)."

If the user agrees:

# Requires: Install-Module ExchangeOnlineManagement
# Requires: Connect-IPPSSession -UserPrincipalName <UPN>
Get-Label | Select-Object DisplayName, @{N='LabelGuid';E={$_.Guid.ToString()}}, ParentLabelDisplayName | Format-List

After resolution: Offer to save the mapping to config.json under label_mapping for future runs (same persistence pattern as SIT mapping).

Key difference from SIT resolution: Labels use Get-Label (not Get-DlpSensitiveInformationType). The cmdlet returns ALL labels at once — no need to query by individual GUID.

Output Modes

ASK the user which they prefer if not explicitly specified. Both may be selected.

Mode 1: Inline Chat Summary (Default)

Render analysis directly in chat
Includes summary tables, top-N breakdowns, risk-ranked user list
Best for quick review and follow-up questions

Mode 2: Markdown File Report

Save to reports/data-security/DataSecurity_Analysis_<scope>_<timestamp>.md
Full detail including all phases, temporal charts, file inventories
Use create_file tool — NEVER use terminal commands for file output
Filename pattern: DataSecurity_Analysis_<scope>_YYYYMMDD_HHMMSS.md
- <scope> = tenant_wide, sit_<SITname>, user_<username>, etc.

Quick Start (TL;DR)

Determine scope → Tenant-wide overview? Specific SIT? Specific user? Specific label? Time range?
Check config.json → Look for sit_mapping and label_mapping sections for custom name resolution
Run Phase 1 → Query 1 (SIT Discovery) + Label Coverage Assessment (quick stats variant) to determine environment maturity
Run Phase 2 → Queries 2-5 (breakdowns by action type, user, file, time)
Run Phase 2.5 → Queries 16a-16d (Copilot SIT exposure analysis) — conditional on Copilot volume (see Phase 2.5 trigger)
Run Phase 3 → Queries 6-8 (DLP correlation, workload, SIT drill-down), Query 10b (file-based spikes)
Run Phase 4 → Queries 11-15 (label landscape, label-based user ranking, label downgrades, Copilot label exposure, label-only events) — depth depends on label coverage (see Rule 11)
Output Results → Render in selected mode(s), offer PowerShell resolution for unknowns

Execution Workflow

Phase 1: Discovery & Mapping (always run first)

Goal: Establish what SITs and labels exist in the data, their volume, and resolve GUIDs to names.

Run Query 1 (SIT Discovery) — returns top SIT GUIDs with hit counts
Run Label Coverage Assessment (quick stats from Query 11 comment block) — returns label vs SIT coverage percentages
Apply Tier 1 mapping (embedded datatable) to resolve known SIT and label GUIDs
Check config.json for Tier 2 mapping (sit_mapping + label_mapping) to resolve custom GUIDs
Flag any remaining unresolved GUIDs for optional Tier 3 (PowerShell)
Present the SIT + label landscape to the user before proceeding
Determine label analysis depth based on coverage (see Rule 11)

Phase 2: Breakdown Analysis

Goal: Decompose SIT access patterns by multiple dimensions.

Run these queries in parallel where possible:

Query	Dimension	Purpose
Query 2	Action Type	What operations triggered SIT detections (file read, download, copy, Copilot response, etc.)
Query 3	User Ranking	Top users by SIT interaction volume — risk-ranked
Query 4	File Inventory	Top files/documents containing the most SIT detections
Query 5	Temporal Pattern	Daily/hourly volume trend to spot spikes

Phase 2.5: Copilot SIT Exposure Analysis (conditional on Copilot volume)

Trigger: Run this phase when Copilot/AI events exceed 30% of total volume (determined from Query 2 Action Type breakdown or Query 7 Workload breakdown).

Query	Purpose
Query 16a	Copilot SIT Landscape — Which SITs fire in Copilot interactions, classified by priority tier (High/Medium/Low)
Query 16b	Top Users by High-Priority SIT in Copilot — Risk-ranked users excluding service accounts
Query 16c	Daily Temporal Trend by SIT Category — Spot adoption vs risk pattern changes over time
Query 16d	Prompt-Only Human Users — Users typing sensitive data INTO Copilot (primary risk signal), excluding service accounts and responses

Priority SIT Classification: SITs are classified into tiers for Copilot risk assessment:

Tier	SIT Categories	Risk Rationale
🔴 High	Credit Card Numbers, SSNs, Azure/Cloud Credentials, Employee HR Data (custom)	Direct financial, identity, or infrastructure exposure
🟡 Medium	Project code names (custom), Employee IDs (custom)	Business-sensitive but not directly exploitable
🔵 Low	All Full Names, IP Addresses, Physical Addresses, Medical Terms	High-volume, low-specificity — noise in Copilot context

Custom SIT classification: Organizations should classify their custom/EDM SITs into these tiers. If config.json has a sit_priority section (mapping GUID → tier), use it. Otherwise, classify custom SITs as 🔴 High by default (conservative).

Phase 3: Deep Dive (conditional on scope)

Scenario	Run These
Tenant-wide overview	Query 6 (DLP policy matches), Query 7 (Workload breakdown)
Specific SIT investigation	Query 8 (Single-SIT deep dive with full user/file/action breakdown)
Specific user investigation	Query 9 (Single-user SIT access profile)
Anomaly detection	Query 10b (file-based spikes — PRIMARY), Query 10 (overall spikes — secondary, includes Copilot)

Phase 4: Sensitivity Label Analysis (conditional on coverage)

Goal: Analyze sensitivity label access patterns, label changes, and Copilot label exposure.

Run the Label Coverage Assessment first (Phase 1, step 2). Then adapt depth per Rule 11:

Label Coverage	Analysis Depth	Run These
≥5% of events have labels (label-mature environment)	Full label analysis — dedicated report sections	Query 11 (label landscape), Query 12 (label-based user ranking), Query 13 (label changes), Query 14 (Copilot label exposure), Query 15 (label-only events)
1-5% of events have labels (emerging label environment)	Summary label section — condensed into one section	Query 11 (label landscape), Query 13 (label changes)
<1% of events have labels (SIT-dominant environment)	Brief note only — mention label presence in Scope & Limitations	Label coverage stats from Phase 1 assessment only
User asks specifically about labels	Full label analysis regardless of coverage percentage	All label queries (11-15)

Phase 5: Report Generation

Render findings using the Report Template below.

Sample KQL Queries

Well-Known SIT GUID Mapping (datatable)

Use this let block as a prefix for any query that needs name resolution. It covers the most common Microsoft SITs plus placeholders for custom SITs from config.json.

// Well-known SIT GUID mapping — covers ~90% of typical detections
// Add custom/EDM SIT GUIDs from config.json sit_mapping section
let SITMapping = datatable(SITId: string, SITName: string) [
    // ── Financial ──
    "50842eb7-edc8-4019-85dd-5a5c1f2bb085", "Credit Card Number",
    "cb353f78-2b72-4c3c-8827-92ebe4f69fdf", "ABA Routing Number",
    "78e09124-f2c3-4656-b32a-c1a132cd2711", "Brazil CPF Number",
    // ── Identity / PII ──
    "a44669fe-0d48-453d-a9b1-2cc83f2cba77", "U.S. Social Security Number (SSN)",
    "a7dd5e5f-e7f9-4626-a2c6-86a8cb6830d2", "IP Address v4",
    "1daa4ad5-e2dd-4ca4-a788-54722c09efb2", "IP Address",
    "50b8b56b-4ef8-44c2-a924-03374f5831ce", "All Full Names",
    "8548332d-6d71-41f8-97db-cc3b5fa544e6", "All Physical Addresses",
    "44aa44f2-63d1-41df-af0d-970283ac41e2", "U.S. Physical Addresses",
    "d1d18c85-1203-46f5-b32f-2d6309de4e5b", "Australia Physical Addresses",
    "6fa57f91-314a-4561-8248-7ab921957448", "Philippines Passport Number",
    "d0001c83-e72f-4360-98d3-f5a41dc5a380", "Indonesia Passport Number",
    // ── Healthcare ──
    "065bdd91-ef07-40d3-b8a4-0aea722eaa49", "All Medical Terms And Conditions",
    "17066377-466d-43ff-997f-c9240414021c", "Diseases",
    "f6dc2d17-3549-41e2-af29-ae1846ae9542", "Types Of Medication",
    "ee05bb9c-7b87-42e1-9987-446b243245d5", "Lab Test Terms",
    // ── Azure / Cloud secrets ──
    "0f587d92-eb28-44a9-bd1c-90f2892b47aa", "Azure DocumentDB Auth Key",
    "ce1a126d-186f-4700-8c0c-486157b953fd", "Azure SQL Connection String",
    "0b34bec3-d5d6-4974-b7b0-dcdb5c90c29d", "Azure IoT Connection String",
    "c7bc98e8-551a-4c35-a92d-d2c8cda714a7", "Azure Storage Account Key",
    "095a7e6c-efd8-46d5-af7b-5298d53a49fc", "Azure Redis Cache Connection String",
    // ─── ADD CUSTOM / EDM SITs FROM config.json sit_mapping HERE ───
    // Example: "<your-edm-guid>", "Your EDM SIT Name",
    "END_MARKER", "END_MARKER"
];

Instructions: When building queries, read config.json for sit_mapping entries and insert them into the datatable above, replacing the END_MARKER row. If no custom mapping exists, remove the END_MARKER row.

Query 1: SIT Discovery — Active SIT Landscape

Purpose: Find all active SIT GUIDs, their volume, and classify them.

// Query 1: SIT Discovery — What SITs are active in this environment?
// Adjust timespan as needed (default: 30d)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend ClassifierType = tostring(SITJson.ClassifierType)
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    TotalEvents = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    AvgConfidence = avg(SITConfidence),
    MaxConfidence = max(SITConfidence),
    ClassifierTypes = make_set(ClassifierType)
    by SITId
| order by TotalEvents desc
| take 50

Post-processing: Join results with the SITMapping datatable to resolve names. Flag any GUIDs not in the mapping as "Unknown — custom/EDM SIT" or "[ML Classifier]" based on ClassifierTypes.

Query 2: Action Type Breakdown

Purpose: Break down SIT detections by what operation triggered them.

// Query 2: Action Type Breakdown — What operations trigger SIT detections?
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId)
    by ActionType
| order by EventCount desc

Query 3: Top Users by SIT Interaction Volume

Purpose: Risk-rank users by sensitive data interaction volume. Designed for 100k+ user environments.

// Query 3: Top 50 Users by SIT access volume
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    TotalEvents = count(),
    DistinctSITs = dcount(tostring(parse_json(tostring(parse_json(tostring(SensitiveInfoTypeInfo))[0])).SensitiveInfoTypeId)),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by TotalEvents desc
| take 50

Query 4: Top Files by SIT Detection Count

Purpose: Identify the most sensitive documents — files with the most SIT detections across access events.

// Query 4: Top 30 Files by SIT detection frequency
// Excludes system/operational files (DLPCache, EBWebView) that are Defender operational reads, not user-initiated
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| where isnotempty(ObjectId)
| where ObjectId !has "DLPCache" and ObjectId !has "EBWebView" and ObjectId !has "\\ProgramData\\Microsoft\\Windows Defender\\"
| summarize 
    AccessCount = count(),
    DistinctUsers = dcount(AccountUpn),
    ActionTypes = make_set(ActionType),
    LastAccessed = max(Timestamp)
    by ObjectId
| order by AccessCount desc
| take 30

Query 5: Temporal Pattern — Daily SIT Event Volume

Purpose: Detect volume spikes or anomalies in SIT-related activity over time.

// Query 5: Daily SIT event volume trend — includes file-based column for spike attribution (Rule 10)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    DailyEvents = count(), 
    FileEvents = countif(Workload !in ("Copilot", "ConnectedAIApp")),
    DistinctUsers = dcount(AccountUpn) 
    by Day = bin(Timestamp, 1d)
| order by Day asc

Query 6: DLP Policy Match Correlation

Purpose: Show DLP policy matches alongside SIT detections — which policies fired and how often.

// Query 6: DLP Policy Match breakdown
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(DlpPolicyMatchInfo)
| extend DlpInfo = parse_json(DlpPolicyMatchInfo)
| mv-expand DlpPolicy = DlpInfo
| extend PolicyName = tostring(DlpPolicy.PolicyName)
| extend PolicyId = tostring(DlpPolicy.PolicyId)
| summarize 
    MatchCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType)
    by PolicyName
| order by MatchCount desc

Query 7: Workload Breakdown

Purpose: Where is sensitive data being accessed — SharePoint, OneDrive, Exchange, Teams, Endpoints, Copilot?

// Query 7: Workload distribution of SIT events
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize 
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId)
    by Workload
| order by EventCount desc

Query 8: Single-SIT Deep Dive

Purpose: Full breakdown for a specific SIT GUID — who accessed it, which files, what operations, over what time period.

Usage: Replace <TARGET_SIT_GUID> with the specific SIT GUID to investigate (e.g., an EDM SIT GUID).

// Query 8: Single-SIT deep dive — replace GUID
let targetSIT = "<TARGET_SIT_GUID>";
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitiveInfoTypeInfo)
| where SensitiveInfoTypeInfo has targetSIT
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId == targetSIT
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    AccessCount = count(),
    AvgConfidence = avg(SITConfidence),
    TotalSITInstances = sum(SITCount),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn, ObjectId
| order by AccessCount desc
| take 100

Query 9: Single-User SIT Access Profile

Purpose: Complete SIT interaction profile for a specific user — what SITs they accessed, which files, operations, and when.

Usage: Replace <TARGET_UPN> with the user's UPN.

// Query 9: Single-user SIT access profile
let targetUser = "<TARGET_UPN>";
DataSecurityEvents
| where Timestamp > ago(30d)
| where AccountUpn =~ targetUser
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend SITConfidence = toint(SITJson.Confidence)
| extend SITCount = toint(SITJson.Count)
| summarize 
    AccessCount = count(),
    DistinctFiles = dcount(ObjectId),
    AvgConfidence = avg(SITConfidence),
    ActionTypes = make_set(ActionType),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by SITId
| order by AccessCount desc

Query 10: Anomaly Detection — Users with SIT Access Spikes

Purpose: Compare each user's recent 7-day SIT activity against their 30-day daily average to detect sudden spikes. Designed for 100k+ user environments.

// Query 10: SIT access spike detection (7d recent vs 23d baseline) — ALL events
// NOTE: This includes Copilot events. For file-based-only spikes, use Query 10b below.
let baseline = DataSecurityEvents
| where Timestamp between (ago(30d) .. ago(7d))
| where isnotempty(SensitiveInfoTypeInfo)
| summarize BaselineTotal = count() by AccountUpn
| extend BaselineDailyAvg = round(BaselineTotal / 23.0, 1); // 23 days in baseline window
let recent = DataSecurityEvents
| where Timestamp > ago(7d)
| where isnotempty(SensitiveInfoTypeInfo)
| summarize RecentTotal = count() by AccountUpn
| extend RecentDailyAvg = round(RecentTotal / 7.0, 1);
recent
| join kind=inner baseline on AccountUpn
| extend SpikeRatio = round(RecentDailyAvg / BaselineDailyAvg, 2)
| where SpikeRatio > 2.0 and RecentTotal > 20 and BaselineTotal >= 10
| project AccountUpn, BaselineDailyAvg, RecentDailyAvg, SpikeRatio, BaselineTotal, RecentTotal
| order by SpikeRatio desc
| take 30

Query 10b: File-Based-Only Spike Detection (Excludes Copilot)

// Query 10b: File-based SIT access spike detection (excludes Copilot/AI events)
let CopilotActionTypes = dynamic(["Risky prompt entered in Copilot", "Sensitive response received in Copilot",
    "Risky prompt entered in connected AI apps", "Sensitive response received in connected AI apps"]);
let baseline = DataSecurityEvents
| where Timestamp between (ago(30d) .. ago(7d))
| where isnotempty(SensitiveInfoTypeInfo)
| where not(ActionType has_any (CopilotActionTypes))
| where Workload !in ("Copilot", "ConnectedAIApp")
| summarize BaselineTotal = count() by AccountUpn
| extend BaselineDailyAvg = round(BaselineTotal / 23.0, 1);
let recent = DataSecurityEvents
| where Timestamp > ago(7d)
| where isnotempty(SensitiveInfoTypeInfo)
| where not(ActionType has_any (CopilotActionTypes))
| where Workload !in ("Copilot", "ConnectedAIApp")
| summarize RecentTotal = count() by AccountUpn
| extend RecentDailyAvg = round(RecentTotal / 7.0, 1);
recent
| join kind=inner baseline on AccountUpn
| extend SpikeRatio = round(RecentDailyAvg / BaselineDailyAvg, 2)
| where SpikeRatio > 2.0 and RecentTotal > 10 and BaselineTotal >= 10
| project AccountUpn, BaselineDailyAvg, RecentDailyAvg, SpikeRatio, BaselineTotal, RecentTotal
| order by SpikeRatio desc
| take 30

Well-Known Label GUID Mapping (datatable)

Use this let block as a prefix for label queries that need name resolution. It covers the Microsoft default sensitivity labels plus placeholder slots for custom labels from config.json.

// Microsoft default sensitivity label GUID mapping — all 12 labels
// Confirmed via Get-Label on default-configuration tenants
// Older tenants may have renamed labels or replaced parent GUIDs — validate with Get-Label if needed
// Add custom labels from config.json label_mapping section
let LabelMapping = datatable(LabelId: string, LabelName: string, LabelParent: string) [
    // ── Microsoft Default Labels (defa4170-0d19-0005-XXXX-bc88714345d2 family) ──
    // Priority 0-11, sequential GUID suffixes
    "defa4170-0d19-0005-0000-bc88714345d2", "Personal", "",
    "defa4170-0d19-0005-0001-bc88714345d2", "Public", "",
    "defa4170-0d19-0005-0002-bc88714345d2", "General", "",
    "defa4170-0d19-0005-0003-bc88714345d2", "Anyone (unrestricted)", "General",
    "defa4170-0d19-0005-0004-bc88714345d2", "All Employees (unrestricted)", "General",
    "defa4170-0d19-0005-0005-bc88714345d2", "Confidential", "",
    "defa4170-0d19-0005-0006-bc88714345d2", "Anyone (unrestricted)", "Confidential",
    "defa4170-0d19-0005-0007-bc88714345d2", "All Employees", "Confidential",
    "defa4170-0d19-0005-0008-bc88714345d2", "Trusted People", "Confidential",
    "defa4170-0d19-0005-0009-bc88714345d2", "Highly Confidential", "",
    "defa4170-0d19-0005-000a-bc88714345d2", "All Employees", "Highly Confidential",
    "defa4170-0d19-0005-000b-bc88714345d2", "Specified People", "Highly Confidential",
    // ─── ADD CUSTOM LABELS FROM config.json label_mapping HERE ───
    // Example: "<your-custom-label-guid>", "Your Custom Label", "Parent Label",
    "END_MARKER", "END_MARKER", "END_MARKER"
];

Instructions: When building queries, read config.json for label_mapping entries and insert them into the datatable above, replacing the END_MARKER row. Older/customized tenants: If Get-Label shows parent labels (Confidential, Highly Confidential) with random GUIDs instead of defa4170-*, the admin has recreated them — add the tenant-specific GUIDs from config.json label_mapping or Get-Label. If resolved sub-label names differ from the datatable (e.g., "Non-business" vs "Personal"), prefer the Get-Label name for that tenant.

Query 11: Label Coverage Overview — Sensitivity Label Landscape

Purpose: Discover which sensitivity labels appear in the data, their volume, and resolve GUIDs to names. Also includes a quick stats variant for Phase 1 coverage assessment.

Quick Stats Variant (run first in Phase 1):

// Label Coverage Assessment — run in Phase 1 to determine label analysis depth
DataSecurityEvents
| where Timestamp > ago(30d)
| summarize
    TotalEvents = count(),
    WithSIT = countif(isnotempty(SensitiveInfoTypeInfo) and SensitiveInfoTypeInfo != "[]"),
    WithLabel = countif(isnotempty(SensitivityLabelId)),
    WithPrevLabel = countif(isnotempty(PreviousSensitivityLabelId)),
    LabelOnly_NoSIT = countif(isnotempty(SensitivityLabelId) and (isempty(SensitiveInfoTypeInfo) or SensitiveInfoTypeInfo == "[]")),
    SIT_WithLabel = countif(isnotempty(SensitivityLabelId) and isnotempty(SensitiveInfoTypeInfo) and SensitiveInfoTypeInfo != "[]")

Full Label Landscape Query:

// Query 11: Label Landscape — which sensitivity labels appear and how often
// Prefix with LabelMapping datatable from above
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| extend LabelIds = split(SensitivityLabelId, ",")
| mv-expand LabelIdRaw = LabelIds
| extend LabelId = tostring(trim(" ", tostring(LabelIdRaw)))
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType)
    by LabelDisplay, LabelId
| order by EventCount desc

Query 12: Top Users by Labeled Document Access (File-Based)

Purpose: Risk-rank users by labeled document access volume, excluding Copilot/AI events. This is the label-dimension equivalent of Query 3.

// Query 12: Top users by labeled document access (file-based only)
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| where ActionType !has "Copilot" and Workload !in ("Copilot", "ConnectedAIApp")
| summarize
    EventCount = count(),
    DistinctLabels = dcount(SensitivityLabelId),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by EventCount desc
| take 30

Query 13: Label Downgrade & Change Tracking

Purpose: Find all events where a sensitivity label was downgraded, removed, or changed. Critical for detecting policy circumvention or insider risk.

// Query 13: Label downgrade/removal events — detect label circumvention
// Prefix with LabelMapping datatable to resolve both current and previous label GUIDs
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(PreviousSensitivityLabelId)
| extend CurrentLabelId = SensitivityLabelId
| extend PrevLabelId = PreviousSensitivityLabelId
| lookup kind=leftouter (LabelMapping | project LabelId, CurrentLabelName=LabelName, CurrentParent=LabelParent) on $left.CurrentLabelId == $right.LabelId
| lookup kind=leftouter (LabelMapping | project LabelId, PrevLabelName=LabelName, PrevParent=LabelParent) on $left.PrevLabelId == $right.LabelId
| extend CurrentDisplay = iff(isempty(CurrentLabelName), iff(isempty(CurrentLabelId), "[Removed]", strcat("[Unknown] ", CurrentLabelId)),
    iff(isempty(CurrentParent), CurrentLabelName, strcat(CurrentParent, " / ", CurrentLabelName)))
| extend PrevDisplay = iff(isempty(PrevLabelName), strcat("[Unknown] ", PrevLabelId),
    iff(isempty(PrevParent), PrevLabelName, strcat(PrevParent, " / ", PrevLabelName)))
| project Timestamp, ActionType, AccountUpn, ObjectId,
    PreviousLabel = PrevDisplay, CurrentLabel = CurrentDisplay, Workload
| order by Timestamp desc

Key ActionTypes in label change events:

Label downgraded on a file — label lowered (e.g., HC → Confidential)
Label removed from a file — label stripped entirely
Label on file downgraded or removed, then file accessed by Copilot — label reduced AND Copilot accessed the now-less-protected file

Query 14: Copilot Label Exposure — Labeled Resources Accessed by Copilot

Purpose: Identify which sensitivity-labeled documents Copilot accessed during risky AI interactions. This surfaces data exposure risk where Copilot may be surfacing Highly Confidential content.

// Query 14: Copilot label exposure — what labeled docs did Copilot access in risky interactions?
// Prefix with LabelMapping datatable
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(RiskyAIUsageSensitivityLabelsInfo)
| where tostring(RiskyAIUsageSensitivityLabelsInfo) !has "null"
    or (tostring(RiskyAIUsageSensitivityLabelsInfo) has "null" and tostring(RiskyAIUsageSensitivityLabelsInfo) has "SensitivityLabelId")
| mv-expand LabelEntry = parse_json(tostring(RiskyAIUsageSensitivityLabelsInfo))
| extend LabelJson = parse_json(tostring(LabelEntry))
| extend SubEntityName = tostring(LabelJson.SubEntityName)
| extend LabelId = tostring(LabelJson.SensitivityLabelId)
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    SubEntities = make_set(SubEntityName),
    ActionTypes = make_set(ActionType)
    by LabelDisplay, LabelId
| order by EventCount desc

SubEntityName values:

ResponseAccessedResource — a labeled document that Copilot cited in its response
Response — the Copilot response itself that was flagged

Query 15: Label-Only Events — Events Triggered Purely by Label (No SIT Content Match)

Purpose: Find events where the trigger was the sensitivity label alone, not SIT content detection. These represent label-aware DLP/IRM policy matches.

// Query 15: Label-only events — triggered by label, not SIT content
// Prefix with LabelMapping datatable
DataSecurityEvents
| where Timestamp > ago(30d)
| where isnotempty(SensitivityLabelId)
| where isempty(SensitiveInfoTypeInfo) or SensitiveInfoTypeInfo == "[]"
| extend LabelIds = split(SensitivityLabelId, ",")
| mv-expand LabelIdRaw = LabelIds
| extend LabelId = tostring(trim(" ", tostring(LabelIdRaw)))
| where isnotempty(LabelId)
| lookup kind=leftouter LabelMapping on LabelId
| extend LabelDisplay = iff(isempty(LabelName) or LabelName == "END_MARKER",
    strcat("[Unknown] ", LabelId),
    iff(isempty(LabelParent), LabelName, strcat(LabelParent, " / ", LabelName)))
| summarize
    EventCount = count(),
    DistinctUsers = dcount(AccountUpn),
    DistinctFiles = dcount(ObjectId),
    ActionTypes = make_set(ActionType),
    Workloads = make_set(Workload)
    by LabelDisplay, LabelId
| order by EventCount desc

Query 16a: Copilot SIT Landscape with Priority Tiers

Purpose: Break down which SITs fire in Copilot interactions, classify by priority tier, and quantify service account vs human split. This is the entry point for Phase 2.5.

Prerequisite: Merge the well-known SIT GUID mapping datatable (above) with any config.json sit_mapping entries before running.

// Query 16a: Copilot SIT Landscape — priority-tiered breakdown with agent/human split
// Prefix with SITMapping datatable
// ── SIT Priority Classification (canonical definition — Queries 16c/16d reference this) ──
let HighPrioritySITs = dynamic([
    "50842eb7-edc8-4019-85dd-5a5c1f2bb085",  // Credit Card Number
    "a44669fe-0d48-453d-a9b1-2cc83f2cba77",  // U.S. SSN
    "0f587d92-eb28-44a9-bd1c-90f2892b47aa",  // Azure DocumentDB Auth Key
    "ce1a126d-186f-4700-8c0c-486157b953fd",  // Azure SQL Connection String
    "0b34bec3-d5d6-4974-b7b0-dcdb5c90c29d",  // Azure IoT Connection String
    "c7bc98e8-551a-4c35-a92d-d2c8cda714a7",  // Azure Storage Account Key
    "095a7e6c-efd8-46d5-af7b-5298d53a49fc"   // Azure Redis Cache Connection String
    // ── ADD credential/HR SIT GUIDs from config.json sit_mapping HERE ──
]);
let MediumPrioritySITs = dynamic([
    // ── ADD project/employee ID SIT GUIDs from config.json sit_mapping HERE ──
]);
// ── Service account regex (update with org-specific patterns) ──
let ServiceAccountPattern = @"^(securitycopilotagentuser-|svc-)";
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend IsAgent = AccountUpn matches regex ServiceAccountPattern
| summarize
    TotalEvents = count(),
    AgentEvents = countif(IsAgent),
    HumanEvents = countif(not(IsAgent) and isnotempty(AccountUpn)),
    HumanUsers = dcountif(AccountUpn, not(IsAgent) and isnotempty(AccountUpn)),
    PromptEvents = countif(ActionType has "prompt"),
    ResponseEvents = countif(ActionType has "response")
    by SITId
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| extend PriorityTier = case(
    SITId in (HighPrioritySITs), "🔴 High",
    SITId in (MediumPrioritySITs), "🟡 Medium",
    "🔵 Low")
| project SITDisplay, PriorityTier, TotalEvents, AgentEvents, HumanEvents, HumanUsers, PromptEvents, ResponseEvents, SITId
| order by TotalEvents desc

Post-processing:

Populate HighPrioritySITs and MediumPrioritySITs arrays with credential, HR, and custom SIT GUIDs from config.json sit_mapping. Any SIT not in either array defaults to Low.
If config.json has a sit_priority section (GUID → tier mapping), use it to override the default classification.
Calculate Agent % of total — if > 50%, flag prominently in report ("⚠️ N% of Copilot SIT events are from automated service accounts").
Unknown SITs ([Unknown]) should be classified as 🔴 High by default (conservative — unknown custom SITs may be high-value EDM/exact data match).

Query 16b: Top Users by High-Priority SIT in Copilot (Excluding Service Accounts)

Purpose: Risk-rank human users whose Copilot interactions triggered high-priority SIT detections. Excludes automated service accounts.

// Query 16b: Top 20 human users by high-priority SIT in Copilot interactions
// Prefix with SITMapping datatable and HighPrioritySITs + ServiceAccountPattern from Query 16a
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId in (HighPrioritySITs)
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| summarize
    Events = count(),
    DistinctHighSITs = dcount(SITId),
    SITNames = make_set(SITDisplay),
    PromptEvents = countif(ActionType has "prompt"),
    ResponseEvents = countif(ActionType has "response"),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by Events desc
| take 20

Query 16c: Daily Copilot SIT Trend by Priority Category

Purpose: Track daily Copilot SIT detection volume by priority tier to distinguish adoption changes from risk spikes.

// Query 16c: Daily Copilot SIT trend by priority category (human users only)
// ── Copy HighPrioritySITs from Query 16a (canonical list: CCN, SSN, Azure credentials + config.json custom) ──
// ── Copy ServiceAccountPattern from Query 16a ──
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "Copilot"
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| extend PriorityTier = iff(SITId in (HighPrioritySITs), "High", "Low")
| summarize Events = count() by Day = bin(Timestamp, 1d), PriorityTier
| order by Day asc, PriorityTier asc

Query 16d: Prompt-Only Analysis — Human Users Typing Sensitive Data INTO Copilot

Purpose: The primary risk signal — users who typed sensitive data into Copilot prompts (not just receiving sensitive responses). Excludes service accounts and response-only events.

// Query 16d: Prompt-only human users with high-priority SIT detections
// This is the key risk signal: sensitive data entered BY the user INTO Copilot
// ── Copy HighPrioritySITs and ServiceAccountPattern from Query 16a ──
DataSecurityEvents
| where Timestamp > ago(30d)
| where ActionType has "prompt"  // Prompts only — user-initiated risk
| where isnotempty(SensitiveInfoTypeInfo)
| where not(AccountUpn matches regex ServiceAccountPattern)
| where isnotempty(AccountUpn)
| mv-expand SIT = parse_json(tostring(SensitiveInfoTypeInfo))
| extend SITJson = parse_json(tostring(SIT))
| extend SITId = tostring(SITJson.SensitiveInfoTypeId)
| where SITId in (HighPrioritySITs)
| lookup kind=leftouter SITMapping on $left.SITId == $right.SITId
| extend SITDisplay = iff(isempty(SITName) or SITName == "END_MARKER", strcat("[Unknown] ", SITId), SITName)
| summarize
    PromptEvents = count(),
    DistinctHighSITs = dcount(SITId),
    SITNames = make_set(SITDisplay),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
    by AccountUpn
| order by PromptEvents desc
| take 20

Why prompts matter more than responses:

"Risky prompt entered in Copilot" = User typed/pasted sensitive data (SSN, credit card, credentials) into a Copilot prompt. This is a behavioral risk — the user chose to share sensitive data with AI.
"Sensitive response received in Copilot" = Copilot retrieved and surfaced sensitive data from grounding sources (SharePoint, OneDrive, email). This is an access/configuration risk — overpermissioned data exposure.
Prompt events are the stronger signal for insider risk and user coaching interventions.

Report Template

Report Rendering Rules

These rules are MANDATORY for all report output (inline chat and markdown file). Follow strictly.

Rule 1: Risk Rating Scale

When assigning risk levels to users in the file-based user ranking, use this hierarchy:

Risk Level	Evidence Required
Critical	"Files collected and exfiltrated" ActionType present — confirmed insider risk exfiltration signal OR mass exfiltration pattern: ≥1,000 distinct files to removable media + file deletions within a ≤48-hour window (volume-based escalation even without the IRM-labeled ActionType)
High	Exfiltration signals below Critical thresholds (e.g., USB copies < 1,000 files without deletion pattern) OR sustained high DLP alert volume (top 2-3 by events/files)
Medium	Broad SIT diversity (10+ SIT types) OR cross-workload activity (3+ workloads) OR external domain WITHOUT explicit exfiltration signal
Low	Single-workload, moderate volume, no exfiltration or anomaly signals

Rule 2: Executive Summary Uses File-Based Metrics Only

Context	Cite	Example
✅ File-based risk summary	Non-Copilot events, non-Copilot files	"u3087 generated 211 file-based events across 32 files"
❌ Inflated overall metrics	Total events including Copilot	~~"u1812 — 294 total events including 185 files"~~

Rule 3: Top Users Overall Section — Copilot Compression

When Copilot events exceed 80% of total volume:

Do NOT render a standalone "Top Users Overall" section dominated by Copilot service accounts
Instead, include a brief note: "Top overall users are dominated by Copilot service accounts/heavy Copilot users — see file-based user ranking below for actual data access risk."
If any users in the top-10 overall have multi-workload activity (Copilot + file operations), call them out in a single sentence rather than a full table

Rule 4: Copilot Count Reconciliation

When reporting Copilot vs file-based splits, ensure the counts reconcile across sections:

Action Type breakdown Copilot total = Workload breakdown Copilot total
If they differ (e.g., Connected AI App events counted differently), annotate the delta
Show the reconciliation in the Action Type section: "Copilot interactions: N events (Action Types: Risky prompt X + Sensitive response Y + Combined Z = N)"

Rule 5: Scope & Limitations Section (Required for Markdown Reports)

Markdown file reports MUST include a Scope & Limitations section immediately after the Executive Summary. Include:

## Scope & Limitations

| Consideration | Detail |
|--------------|--------|
| **Data Source** | DataSecurityEvents (Defender XDR Advanced Hunting) — requires Insider Risk Management opt-in to share data with Defender XDR |
| **Coverage** | SIT detections + sensitivity label events — files with neither SIT content match nor sensitivity label do NOT appear in this data |
| **Label Coverage** | X% of events have sensitivity labels — label analysis depth adjusted accordingly (see Rule 11) |
| **Retention** | 30-day Advanced Hunting retention |
| **ML Classifiers** | N trainable classifier GUIDs could not be resolved — see Unresolved SIT GUIDs section |
| **Copilot Volume** | Copilot events represent X% of total volume and are separated from file-based analysis throughout this report |

Fill in the actual values for N and X% from the query results.

Rule 6: SIT Landscape Table Integrity

Each GUID must appear exactly once in the SIT Landscape table — one row per GUID, no exceptions
NEVER group multiple GUIDs into a single row with slash-separated values (e.g., 1e883268/d2cdc387/bf6e0b84...). Even "copy" variants of the same SIT that share identical metrics MUST be separate rows
After all GUID resolution tiers complete, deduplicate by GUID — if conflicts exist, prefer the most specific resolution (Tier 3 PowerShell > Tier 2 config > Tier 1 embedded)
Group the table by category: Custom/Organization SITs, Built-in Microsoft SITs, ML Classifiers (Unresolvable)
Do NOT include a GUID under two different names
The total distinct SIT count cited in the Executive Summary must equal the number of rows in the SIT Landscape tables (sum of all category sub-tables)
Post-render verification (MANDATORY): After building all SIT Landscape sub-tables, count the total rows. If the exec summary cites a different number, update the exec summary to match. Format: "N active SIT types" where N = sum of rows across Custom, Built-in, and Unresolved sub-tables

Rule 7: Spike Detection — File-Based Primary, Overall Secondary

When rendering spike alerts:

Primary: Always run and display Query 10b (file-based-only spikes) as the main spike alert section. This surfaces actual sensitive data access spikes.
Secondary: Run Query 10 (all events) only if the user requests overall spikes or if there are interesting patterns worth noting. Include a clear note that these spikes are predominantly Copilot-driven.
If only running one spike query, always prefer Query 10b.
In the report, label sections clearly: "File-Based SIT Access Spikes" vs "Overall SIT Access Spikes (incl. Copilot)".

Rule 8: Top Files — Exclude System/Operational Files

The Top Files section must exclude Defender for Endpoint operational file reads:

C:\ProgramData\Microsoft\Windows Defender\DLPCache\* — DLP label metadata cache reads
*\EBWebView\* — Edge WebView browser cache
Any path matching \ProgramData\Microsoft\ that is clearly a system/cache path

If system files appear in results despite the query filter, separate them into a "System/Operational Files" subsection below the main "User-Accessed Files" list.

Rule 9: Risk Rating Consistency — Exec Summary Must Match User Table

After building the File-Based Top Users table (the source of truth), cross-check every user mention in the exec summary
If there is a conflict, the User Table rating wins — update the exec summary to match
Never rate a user differently in two sections of the same report

⛔ PROHIBITED: Exec summary says "High" while user table says "Critical" for the same user (or any other mismatch).

Rule 10: Temporal Pattern — Include File-Based Event Column

The Temporal Pattern (daily volume) section MUST include a File Events column alongside the total. Without this, daily spikes appear alarming when they may be entirely Copilot-driven.

Column	Required	Source
Date	✅	`bin(Timestamp, 1d)`
Daily Events	✅	Total `count()`
File Events	✅	`countif(Workload !in ("Copilot", "ConnectedAIApp"))`
Distinct Users	✅	`dcount(AccountUpn)`
Notable	✅	Annotation for spikes

When annotating spikes (🔴), clarify whether the spike is Copilot-driven or file-driven:

"🔴 Major spike — file-driven" (if File Events also spike)
"🟡 Copilot adoption spike — file activity normal" (if only total spikes but File Events are stable)

Use Query 5 (updated) which returns both columns.

Rule 11: Adaptive Label Analysis Depth

Label Coverage	Report Behavior
≥5% labeled events	Full dedicated label sections: Label Landscape table, Label-Based Top Users, Label Changes, Copilot Label Exposure, Label-Only Events. These render as peer sections alongside SIT analysis
1-5% labeled events	Condensed "Sensitivity Labels" section: Label Landscape table + Label Changes only. Other label dimensions mentioned in summary notes
<1% labeled events	Brief note in Scope & Limitations: "Sensitivity label data is sparse (<1% of events) — environment is SIT-dominant. N events had labels; see label coverage stats below." No dedicated label sections unless user asks
User explicitly asks about labels	Full label sections regardless of coverage percentage

The Label Coverage Assessment also determines the overall report framing:

SIT-dominant (<5% labels): Report title/framing stays "SIT Access Analysis" with label addendum
Dual signal (5-50% labels): Report framing becomes "Data Security Analysis (SIT + Labels)"
Label-dominant (>50% labels): Report framing becomes "Data Security Analysis" with labels as primary signal and SIT as secondary

⛔ PROHIBITED: Running all 5 label queries (11-15) when coverage is <1% and user didn't ask about labels — this wastes API calls and clutters the report.

Rule 12: Label Display Format — Always Show Parent/Child Hierarchy

When rendering sensitivity label names, ALWAYS show the parent-child hierarchy using "/" notation:

Raw Label	Correct Display	Incorrect Display
Sub-label under Highly Confidential	"Highly Confidential / All Employees"	"All Employees"
Sub-label under Confidential	"Confidential / All employees"	"All employees"
Top-level label	"General"	"General" (correct as-is)
Unknown label	"[Unknown] abc12345..."	blank or omitted

This prevents ambiguity — "All Employees" exists under BOTH Confidential and Highly Confidential as different labels with different GUIDs.

⛔ PROHIBITED: Displaying sub-label names without their parent (e.g., just "Specific people" — which parent?).

Rule 13: Service Account Separation in Copilot Analysis

Requirement	Detail
Identify service accounts	Filter: `AccountUpn matches regex ServiceAccountPattern` (defined in Query 16a: `@"^(securitycopilotagentuser-\|svc-)"`). Check Query 16a results for additional org-specific patterns (any account with >10K events and a GUID-like UPN prefix is likely automated)
Report agent volume separately	Include a summary line: "⚠️ N automated service accounts generated X events (Y% of Copilot volume) — excluded from human risk analysis below"
Never mix in user rankings	Queries 16b-16d exclude agents by default. If rendering an overall Copilot user table, agents go in a separate subsection
Power-user flagging	After agent exclusion, if any single human user accounts for >20% of remaining Copilot events, flag them: "⚠️ Power user — may indicate automated workflow via Copilot"

Rule 14: Copilot Event Row Multiplication Awareness

Each Copilot interaction generates multiple DSE rows (average ~2-3x, up to 35x for complex exchanges). Purview creates separate rows for:

The prompt (if it contains a SIT match) — "Risky prompt entered in Copilot"
The response (if grounding data contains SIT matches) — "Sensitive response received in Copilot"
Compound events — agent interactions, SharePoint access, multiple simultaneous conditions

Reporting Requirement	Format
Raw event counts	"N Copilot SIT events (raw DSE rows)"
Estimated interactions	"~N/2.5 estimated real interactions" (use the ratio from the environment if calculated)
User daily rates	If citing per-user daily rates, note whether raw or estimated: "~X interactions/day (estimated from Y raw events)"

This rule prevents stakeholders from seeing "270K Copilot events" and panicking when the actual interaction count is ~95K.

Rule 15: Copilot SIT Report Section (Phase 2.5 Output)

When Phase 2.5 is triggered (Copilot >30% of events), render a dedicated "Copilot SIT Exposure" section with:

Service Account vs Human Split — "X events from N service accounts (Y%), Z events from N human users (Y%)" with the agent exclusion note
SIT Priority Landscape — Table from Query 16a showing SITs by priority tier (High/Medium/Low) with human-only metrics
High-Priority SIT Users — Top 10 human users from Query 16b with prompt vs response breakdown
Prompt Risk Signal — Top 10 users from Query 16d who typed high-priority SITs into Copilot prompts — this is the primary actionable finding
Temporal Trend — Daily trend from Query 16c (optional — include if spikes are interesting)

Inline Chat Format

## 📊 Data Security Events Analysis
**Scope:** <Tenant-wide / SIT: <name> / User: <UPN>>
**Time Range:** <start> to <end>
**Total Events:** <N> | **Distinct Users:** <N> | **Distinct Files:** <N>

### SIT Landscape
| # | SIT Name | GUID (short) | Events | Users | Files | Classifier |
|---|----------|-------------|--------|-------|-------|------------|
| 1 | Credit Card Number | 50842eb7... | 7,255 | 46 | 346 | Content |
| 2 | All Full Names | 50b8b56b... | 128,957 | 1,475 | 119 | Content |
| ... | | | | | | |

### Action Type Breakdown
| Action Type | Events | Users | Files |
|-------------|--------|-------|-------|
| Sensitive response received in Copilot | 228,564 | ... | ... |
| Risky prompt entered in Copilot | 390,905 | ... | ... |
| ... | | | |

### 🔴 Top Users by SIT Volume (Risk-Ranked)
| # | User | Total Events | Distinct SITs | Distinct Files | Last Active |
|---|------|-------------|---------------|----------------|-------------|
| 1 | user@domain.com | 12,345 | 8 | 42 | 2026-03-16 |
| ... | | | | | |

### 🏷️ Sensitivity Label Landscape (if ≥1% coverage)
| # | Label | GUID (short) | Events | Users | Files |
|---|-------|-------------|--------|-------|-------|
| 1 | Highly Confidential / All Employees | defa4170...000a | 62 | 9 | 31 |
| 2 | General | defa4170...0002 | 55 | 12 | 32 |
| ... | | | | | |

### ⚠️ Label Changes (if any PreviousSensitivityLabelId events exist)
| Timestamp | User | File | Previous Label | Current Label | Action |
|-----------|------|------|---------------|--------------|--------|
| 2026-03-11 | user@domain.com | doc.docx | HC / Project X | Confidential / All employees | Label downgraded |
| ... | | | | | |

### 🤖 Copilot SIT Exposure (if Copilot >30% of events)
**Service Accounts:** N agents generated X events (Y%) — excluded from analysis below
**Row Multiplication:** ~2-3 DSE rows per real interaction (est. ~Z real interactions)

| # | SIT Name | Priority | Human Events | Human Users | Prompts | Responses |
|---|----------|----------|-------------|-------------|---------|----------|
| 1 | Credit Card Number | 🔴 High | 1,234 | 89 | 456 | 778 |
| 2 | All Full Names | 🔵 Low | 45,678 | 1,200 | 12,345 | 33,333 |
| ... | | | | | | |

**🔴 Prompt Risk — Users Typing High-Priority SITs Into Copilot:**
| # | User | Prompt Events | SIT Types | Last Active |
|---|------|-------------|-----------|-------------|
| 1 | user@domain.com | 142 | SSN, Credit Card | 2026-03-16 |
| ... | | | | |

### ⚠️ SIT Access Spike Alerts
| User | Baseline (daily avg) | Recent (daily avg) | Spike Ratio | Status |
|------|---------------------|-------------------|-------------|--------|
| user@domain.com | 5.2 | 47.1 | 9.06x | 🔴 Spike |
| ... | | | | |

### Unresolved SIT GUIDs
<List of GUIDs not in mapping — offer PowerShell resolution>

Markdown File Format

Same structure as inline, wrapped in proper markdown with:

Report metadata header (generated timestamp, scope, tool versions)
Scope & Limitations section immediately after Executive Summary (see Rule 5 above)
Each section as H2/H3
File-based user ranking as the primary risk section (NOT the Copilot-dominated overall ranking)
DLP Policy Match section with DefaultRule explanation for empty PolicyName entries
Sensitivity Label sections (if coverage ≥1% or user requested): Label Landscape, Label-Based Top Users, Label Changes, Copilot Label Exposure
Code fences for any raw data samples
Save to: reports/data-security/DataSecurity_Analysis_<scope>_YYYYMMDD_HHMMSS.md

Known Pitfalls

Pitfall	Detail	Mitigation
`SensitiveInfoTypeInfo` is `Collection(String)`, not dynamic	Each element is a JSON string requiring double-parse: `parse_json(tostring(SensitiveInfoTypeInfo))` to expand array, then `parse_json(tostring(element))` to access fields	Always double-parse. Direct dot-notation fails silently
Massive row counts	600k+ rows in 30 days for mid-size tenants; millions for 100k+ user orgs	ALWAYS `summarize` first. NEVER retrieve raw rows without `take` limit
`mv-expand` is expensive	Expanding SensitiveInfoTypeInfo across 600k rows without pre-filtering is extremely slow	Pre-filter with `where SensitiveInfoTypeInfo has "<GUID>"` before `mv-expand`
Copilot dominates event volume	90%+ of events can be Copilot-related. See Rules 2-3 for report handling and Phase 2.5 for dedicated Copilot analysis	Filter to specific `ActionType` values for file access. Use `Workload !in ("Copilot", "ConnectedAIApp")` for file-only analysis
Trainable classifiers (MLModel) don't resolve	GUIDs with `ClassifierType: "MLModel"` may not exist in `Get-DlpSensitiveInformationType`	Display as `[ML Classifier] <GUID>` and note in report
SIT GUID is per-SIT, not per-detection	Multiple documents can match the same SIT GUID — the GUID identifies the SIT type, not the specific match	Use `Count` and `Confidence` fields from SITJson for detection-level detail
`ObjectId` can be empty	Copilot interaction events may not have an ObjectId (no specific file)	Filter `isnotempty(ObjectId)` for file-specific analysis
IRM opt-in required	DataSecurityEvents is populated by Insider Risk Management. No opt-in = empty table	Check for 0 results and explain the prerequisite
Table schema evolves	DataSecurityEvents is in Preview — column names and availability may change	Run `getschema` if queries fail with column resolution errors
`DlpPolicyMatchInfo` is sparse	Only ~0.5% of rows have DLP policy match data (the rest are IRM-only detections)	Don't assume all SIT events have DLP data; they're independent signals
Duplicate GUID in SIT mapping	One GUID can only resolve to one SIT name. If a GUID appears in both the embedded `datatable` and a Tier 2/3 resolution with a different name, the result will have duplicate rows with conflicting names. This can happen when a built-in SIT GUID overlaps with a custom SIT copy, or when PowerShell returns a different display name than the embedded mapping	After resolving all GUIDs, deduplicate by GUID before rendering the SIT Landscape table. If a GUID maps to multiple names, prefer the Tier 3 (PowerShell) name over Tier 1 (embedded). Never show the same GUID on two rows with different names
Empty `PolicyName` = DefaultRule ("Always audit")	DLP alerts with empty/null `PolicyName` are typically generated by the built-in `DefaultRule` that fires when "Always audit file activity for devices" is enabled (ON by default). These are NOT orphaned or misconfigured policies — they are the expected result of the default audit setting	In the DLP Policy Match section, explain: "Events with empty PolicyName are generated by the built-in DefaultRule, which audits all monitored file types (Office, PDF, CSV) on onboarded devices when 'Always audit file activity for devices' is enabled (default: ON). No explicit DLP policy is required for these events."
Compound ActionType values	Some events have ActionType values that combine multiple labels (e.g., "Risky prompt entered in Copilot, Sensitive response received in Copilot" or "Sensitive info shared on Teams, DLP Rule Matched"). These are literal string values from the data, not aggregation artifacts	Display compound ActionTypes exactly as they appear in the data. Do NOT split or merge them — they represent events where multiple conditions were met simultaneously
System/operational files dominate Top Files	Files under `C:\ProgramData\Microsoft\Windows Defender\DLPCache\RMSLabels\` and `\EBWebView\` are Defender for Endpoint reading sensitivity label metadata — NOT user-initiated file access. These can dominate 90%+ of the Top Files list	Query 4 filters these paths. If they still appear, separate into a "System/Operational Files" subsection. Never present DLPCache reads as user data access risk
Localized SIT names in CloudAppEvents	CloudAppEvents `DLPRuleMatch` events include SIT names, but names appear in the user's locale (e.g., "የዱቤ ካርድ ቁጥር" instead of "Credit Card Number" for Amharic users). Same GUID can map to different name strings depending on locale	Always aggregate by SIT GUID, never by SIT name. Use the GUID-to-name mapping (Tier 1/2/3) for canonical English names. This applies when cross-referencing CloudAppEvents with DataSecurityEvents
Browsing events are not files — separate from Top Files	ActionTypes like "Generative AI websites browsed" and "Gambling websites browsed" reference URLs, not files. They have no `ObjectId` file path — only a URL domain. Including them under "Top Files" is misleading	Render browsing/URL events in a separate subsection (e.g., "External / AI Service Access") below Top Files. Never mix URL-based events into the file ranking tables. Title the Top Files section accurately ("Top Files" not "Top Files & URLs")
Temporal spike annotations must reference known users	When annotating daily spikes in the Temporal Pattern table (e.g., "🔴 Major spike — u625 SharePoint batch"), the attributed user MUST appear elsewhere in the report — in the File-Based Top Users table, a drill-down section, or at minimum a footnote. Referencing a user that exists nowhere else in the report violates the evidence-based analysis rule and creates an unverifiable claim	Before annotating a spike with a user attribution, verify the user appears in the Top Users ranking. If they don't make the top-10 but are the spike driver, either: (a) add them to the user table with a note "included due to temporal spike attribution", or (b) use a generic annotation ("🔴 Major spike — file-driven") without naming the user
`SensitivityLabelId` can contain comma-separated GUIDs	When an event involves multiple sub-entities (e.g., Copilot accessing multiple labeled resources), the `SensitivityLabelId` column contains comma-separated GUIDs like `aaaa-...,aaaa-...,bbbb-...`. This is NOT a single GUID	Always `split(SensitivityLabelId, ",")` then `mv-expand` to handle multi-GUID values. Querying with `== "<GUID>"` will miss events; use `has "<GUID>"` for filtering or `split()` + `mv-expand` for enumeration
`RiskyAIUsageSensitivityLabelsInfo` is mostly `[null]`	The column is populated on 90%+ of Copilot events, but almost all contain `[null]` (a JSON array with a literal null element). Only events where Copilot actually accessed labeled resources have real data	Filter with `tostring(RiskyAIUsageSensitivityLabelsInfo) !has "null"` to exclude the null-dominated rows. The `isnotempty()` check alone is insufficient — `[null]` passes it
Label GUIDs have no `ClassifierType` field	Unlike SIT GUIDs which have `ClassifierType: "Content"` or `"MLModel"`, label GUIDs have no built-in type indicator. Resolution requires external lookup (datatable, config, or PowerShell `Get-Label`)	Use the Label GUID Mapping Strategy (3 tiers). Unknown labels display as `[Unknown] <GUID>`
Default label GUIDs are deterministic but not officially documented	All 12 default labels (including parents) use `defa4170-0d19-0005-XXXX-bc88714345d2` with sequential suffixes `0000`–`000b` (confirmed via `Get-Label` on default-configuration tenants). However, Microsoft does not publish these GUIDs in official docs. Older/customized tenants may have: (a) renamed default labels (e.g., "Non-business" instead of "Personal"), (b) replaced parent label GUIDs with random tenant-specific ones, or (c) added custom sub-labels that break the sequential pattern	The embedded datatable includes all 12 confirmed default GUIDs. For tenants with custom parent GUIDs, add them via `config.json` `label_mapping`. If `Get-Label` returns different names for a `defa4170` GUID, prefer the `Get-Label` name for that tenant
`PreviousSensitivityLabelId` can equal `SensitivityLabelId`	On "Label removed from a file" events, both fields may contain the same GUID (the label that was removed). The current label after removal may be empty	Check `ActionType` to distinguish: "Label downgraded" = actual label change; "Label removed" = label stripped (current may be empty); "Label on file downgraded or removed, then file accessed by Copilot" = compound event with Copilot exposure
Label-only events require label-aware DLP/IRM policies	Events with labels but no SIT content (Query 15) only appear if a DLP or IRM policy explicitly targets sensitivity labels. Environments with no label-based policies will see zero label-only events even if documents are labeled	If Query 15 returns 0 results AND labels exist (Query 11 > 0), this indicates no label-based DLP/IRM policies are configured — mention this as a gap in the report
Copilot service accounts inflate event counts	Automated service accounts can generate 50-70% of all Copilot events. See Rule 13 for full requirements	Filter: `where not(AccountUpn matches regex ServiceAccountPattern)` (defined in Query 16a). Run Query 16a to quantify agent vs human split
Copilot row multiplication (~2-3x per interaction)	Each interaction generates ~2-3 DSE rows (prompt + response + compound events), up to 35 for complex exchanges. See Rule 14 for reporting requirements	Estimate real interactions as `raw_events / 2.5`. Always include multiplier context when reporting Copilot volumes
Correct field name is `SensitiveInfoTypeId` (not `SensitiveType`)	Inside the `SensitiveInfoTypeInfo` JSON, the SIT GUID field is named `SensitiveInfoTypeId`. LLMs frequently generate `SensitiveType` or `SensitiveTypeId` — both are wrong and return null/empty when accessed	Always use `tostring(SITJson.SensitiveInfoTypeId)`. Other valid fields: `Confidence` (int), `ClassifierType` (string), `Count` (int), `SubEntityName` (string — "Prompt" or "Response" in Copilot events)
`SubEntityName` distinguishes Prompt vs Response in Copilot SIT events	Inside `SensitiveInfoTypeInfo` JSON, `SubEntityName` contains "Prompt" or "Response" — indicating whether the SIT was detected in the user's prompt or Copilot's response. This is more reliable than relying on `ActionType` text matching for prompt/response classification within individual SIT matches	Use `tostring(SITJson.SubEntityName)` when doing per-SIT prompt/response breakdowns. `ActionType` classifies the event-level action; `SubEntityName` classifies the per-SIT-match context

Error Handling

Error	Cause	Resolution
`Failed to resolve table 'DataSecurityEvents'`	Table not available — IRM not opted in, or not connected to Defender XDR	Inform user: "DataSecurityEvents requires Microsoft Purview Insider Risk Management opt-in to share data with Defender XDR."
`0 results` for SensitiveInfoTypeInfo queries	No SIT detections in timeframe, or SIT detection not enabled in IRM policies	Widen time range; check if IRM policies include SIT detection
`Failed to resolve column 'ObjectName'`	Schema changed or column renamed	Use `ObjectId` instead (confirmed available). Run `getschema` to verify current schema
PowerShell `Get-DlpSensitiveInformationType` fails	Not connected to IPPS session	Run `Connect-IPPSSession -UserPrincipalName <UPN>` first
`The term 'Get-DlpSensitiveInformationType' is not recognized`	Module not installed or IPPS session in different terminal	`Install-Module ExchangeOnlineManagement` then `Connect-IPPSSession` in the same terminal session
`The term 'Get-Label' is not recognized`	Not connected to IPPS session or module not loaded	Same as above — `Connect-IPPSSession` provides both `Get-DlpSensitiveInformationType` and `Get-Label`
Label GUIDs all show as `[Unknown]`	Default label datatable doesn't match target tenant (parent GUIDs vary)	Resolve via `Get-Label` and persist to `config.json` `label_mapping`

File Access Action Types Reference

When the user specifically asks about who opened/accessed/downloaded documents, filter to these ActionTypes:

ActionType	Meaning
`Sensitive File read`	File opened on endpoint (Defender for Endpoint)
`File accessed on SPO`	File opened in SharePoint Online / OneDrive
`File downloaded from SharePoint`	File downloaded from SPO/OneDrive
`File copied to Removable media`	File copied to USB/removable storage
`File upload to cloud`	File uploaded to cloud storage
`Sensitive file created`	New file created with sensitive content
`File Archived`	File moved to archive
`Text copied to clipboard from sensitive file`	Clipboard copy from sensitive doc

Copilot-related ActionTypes (separate category — AI interaction, not direct file access):

ActionType	Meaning
`Sensitive response received in Copilot`	Copilot surfaced content matching a SIT
`Risky prompt entered in Copilot`	User prompt triggered risk detection
`Sensitive response received in Copilot;Agent generating sensitive responses`	Copilot agent generated response containing SIT matches
`Risky prompt entered in Copilot;Sensitive response received in Copilot`	Both prompt and response contained SIT matches
`Risky prompt entered in Copilot;Sensitive response received in Copilot;Exposing agent to risky prompts;Agent generating sensitive responses`	Compound agent event — prompt + response + agent risk
`Risky prompt entered in Copilot;Exposing agent to risky prompts`	User exposed a Copilot agent to a risky prompt

Label-related ActionTypes (sensitivity label events):

ActionType	Meaning
`Label downgraded on a file`	Sensitivity label lowered (e.g., HC → Confidential)
`Label removed from a file`	Sensitivity label stripped entirely
`Label on file downgraded or removed, then file accessed by Copilot`	Label reduced AND Copilot subsequently accessed the file

DLP ActionTypes:

ActionType	Meaning
`Generated High severity DLP alerts`	DLP policy triggered a high-severity alert
`DLP Rule Matched`	DLP rule matched (may be combined with other types)

SVG Dashboard Generation

After generating a Data Security Events analysis report (markdown file output), an SVG dashboard can be created using the shared SVG rendering skill.

Trigger: User asks "generate an SVG dashboard from the report" or "visualize this report"

Workflow:

Read this skill's svg-widgets.yaml (widget manifest — defines layout, colors, field mapping)
Read .github/skills/svg-dashboard/SKILL.md (rendering rules — component library, quality standards)
Extract data from the completed report using data_sources.field_mapping_notes
Render SVG → save as {report_basename}_dashboard.svg in the same directory

data-security-analysis

Mehr aus diesem Repository

Mehr aus diesem Repository

Data Security Events Analysis — Instructions

Purpose

📑 TABLE OF CONTENTS

When invoked from a parent skill (threat-pulse, incident-investigation, user-investigation):

⚠️ CRITICAL WORKFLOW RULES - READ FIRST ⚠️

⛔ PROHIBITED ACTIONS

SIT GUID Mapping Strategy

The Problem

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known SIT Mapping (instant, no auth)

Tier 2: User-Provided Custom SIT Mapping (config-driven)

Tier 3: PowerShell Resolution (optional, on-demand)

Post-Resolution Persistence (MANDATORY)

Trainable Classifiers

Label GUID Mapping Strategy

The Problem

Resolution Strategy (3 tiers, in order)

Tier 1: Embedded Well-Known Label Mapping (instant, no auth)

Tier 2: User-Provided Custom Label Mapping (config-driven)

Tier 3: PowerShell Resolution (optional, on-demand)

Output Modes

Mode 1: Inline Chat Summary (Default)

Mode 2: Markdown File Report

Quick Start (TL;DR)

Execution Workflow

Phase 1: Discovery & Mapping (always run first)

Phase 2: Breakdown Analysis

Phase 2.5: Copilot SIT Exposure Analysis (conditional on Copilot volume)

Phase 3: Deep Dive (conditional on scope)

Phase 4: Sensitivity Label Analysis (conditional on coverage)

Phase 5: Report Generation

Sample KQL Queries

Well-Known SIT GUID Mapping (datatable)

Query 1: SIT Discovery — Active SIT Landscape

Query 2: Action Type Breakdown

Query 3: Top Users by SIT Interaction Volume

Query 4: Top Files by SIT Detection Count

Query 5: Temporal Pattern — Daily SIT Event Volume

Query 6: DLP Policy Match Correlation

Query 7: Workload Breakdown

Query 8: Single-SIT Deep Dive

Query 9: Single-User SIT Access Profile

Query 10: Anomaly Detection — Users with SIT Access Spikes

Query 10b: File-Based-Only Spike Detection (Excludes Copilot)

Well-Known Label GUID Mapping (datatable)

Query 11: Label Coverage Overview — Sensitivity Label Landscape

Query 12: Top Users by Labeled Document Access (File-Based)

Query 13: Label Downgrade & Change Tracking

Query 14: Copilot Label Exposure — Labeled Resources Accessed by Copilot

Query 15: Label-Only Events — Events Triggered Purely by Label (No SIT Content Match)

Query 16a: Copilot SIT Landscape with Priority Tiers

Query 16b: Top Users by High-Priority SIT in Copilot (Excluding Service Accounts)

Query 16c: Daily Copilot SIT Trend by Priority Category

Query 16d: Prompt-Only Analysis — Human Users Typing Sensitive Data INTO Copilot

Report Template

Report Rendering Rules

Rule 1: Risk Rating Scale

Rule 2: Executive Summary Uses File-Based Metrics Only

Rule 3: Top Users Overall Section — Copilot Compression

Rule 4: Copilot Count Reconciliation

Rule 5: Scope & Limitations Section (Required for Markdown Reports)

Rule 6: SIT Landscape Table Integrity

Rule 7: Spike Detection — File-Based Primary, Overall Secondary

Rule 8: Top Files — Exclude System/Operational Files

Rule 9: Risk Rating Consistency — Exec Summary Must Match User Table

Rule 10: Temporal Pattern — Include File-Based Event Column

Rule 11: Adaptive Label Analysis Depth

Rule 12: Label Display Format — Always Show Parent/Child Hierarchy

Rule 13: Service Account Separation in Copilot Analysis

Rule 14: Copilot Event Row Multiplication Awareness

Rule 15: Copilot SIT Report Section (Phase 2.5 Output)

Inline Chat Format

Markdown File Format

Known Pitfalls

Error Handling

File Access Action Types Reference

SVG Dashboard Generation