Name: Fabric Metadata Creation
Author: drhelius

// Create reviewable metadata proposals for Microsoft Fabric lakehouse tables and semantic models. Covers schema analysis, concise technical and business descriptions, glossary terms, classifications, Purview-like sensitivity label proposals, PII and sensitive data detection, relationship and lineage hints, data quality rules, CDE candidates, data products, JSON/PDF output, and steward review questions. Use when: metadata creation, data catalog, Purview mapping, glossary inference, data classification, sensitivity labels, stewardship review, governance metadata.

stars: 2

forks: 3

updated: 6 May 2026 at 10:50

package.json

"author": "drhelius"

"repository": "drhelius/gh-copilot-fabric-agents"

name	fabric-metadata-creation
description	Create reviewable metadata proposals for Microsoft Fabric lakehouse tables and semantic models. Covers schema analysis, concise technical and business descriptions, glossary terms, classifications, Purview-like sensitivity label proposals, PII and sensitive data detection, relationship and lineage hints, data quality rules, CDE candidates, data products, JSON/PDF output, and steward review questions. Use when: metadata creation, data catalog, Purview mapping, glossary inference, data classification, sensitivity labels, stewardship review, governance metadata.
compatibility	Requires the Fabric MCP Server VS Code extension for interactive discovery when available. Uses Python helper scripts with azure-identity and requests for Fabric discovery/export and markdown plus weasyprint for styled local PDFs. Produces local proposal artifacts only.

Fabric Metadata Creation - Reference

Generates non-destructive metadata proposals for Microsoft Fabric lakehouse tables, files, views, and semantic models. The result is a steward-review package that can later be mapped into a data catalog such as Microsoft Purview, but it must not apply catalog changes automatically.

The agent's core principle is traceability: every suggestion must distinguish metadata explicitly found in the source from metadata inferred by the agent, include a confidence level, and carry assumptions or open questions when evidence is incomplete.

Global Rules

Proposal only. Never modify Fabric assets, Purview assets, glossary terms, labels, classifications, owners, stewards, retention policies, or access controls.
Human review required. Governance outcomes are suggestions, not decisions.
No invented accountability. Do not invent real owners, stewards, contacts, responsible teams, legal requirements, or compliance decisions. Use Unknown, TBD, or an open question when not explicit.
Evidence labeling. Every metadata object must include metadata_source with one of: explicit, inferred, or mixed.
Confidence required. Every proposal must include confidence: high, medium, or low.
Assumptions required. Every inferred proposal should include assumptions; use an empty list only when there are no assumptions.
Open questions required. Every asset-level and column-level proposal must include review questions when evidence is incomplete.
Conservative language. Use candidate, suggested, likely, possible, or requires human review for inferred classifications and sensitivity labels.
Structured output. Generate only the four final artifacts documented below: proposal JSON/PDF and standalone glossary Markdown/PDF.
Spanish review documents. Human-readable artifacts must be written in Spanish. Keep original workspace, lakehouse, table, column, measure, semantic model, and glossary candidate names exactly as they appear in the source; do not translate source identifiers.

Helper Script

Use scripts/fabric_metadata.py, located in the skill folder .github/skills/fabric-metadata-creation/, for Fabric discovery/export, PDF rendering, and output validation. The agent must activate .venv/ before running it.

Install dependencies:

test -d .venv || python3 -m venv .venv
source .venv/bin/activate
pip install azure-identity requests markdown weasyprint

WeasyPrint requires the system-level Pango library (releases before version 53 also required Cairo). These libraries are pre-installed on many Linux environments; on macOS, install Pango with brew install pango.

Commands:

python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py list-workspaces
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py list-items <workspace_id> [<item_type>]
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py list-lakehouse-tables <workspace_id> <lakehouse_id>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py get-lakehouse-table <workspace_id> <lakehouse_id> <table_name>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py list-semantic-models <workspace_id> [<model_name>]
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py export-semantic-model <workspace_id> <semantic_model_id> <output_dir>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py render-pdf <metadata_proposal.json> <metadata_proposal.pdf>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py render-glossary-md <metadata_proposal.json> <glossary_terms.md>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py render-glossary-pdf <metadata_proposal.json> <glossary_terms.pdf>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py render-markdown-pdf <input.md> <output.pdf>
python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py validate <output_dir>

Use MCP tools first when they are available for interactive discovery. Use the script when MCP does not expose the needed metadata, when exporting semantic model definitions, or when rendering/validating local outputs.

Output Directory

Save each run to:

./metadata_proposals/{SOURCE_NAME}/{YYYY-MM-DD_HHmmss}/

Required files:

File	Purpose
`metadata_proposal.json`	Compact machine-readable metadata proposal
`metadata_proposal.pdf`	Styled A4 proposal PDF rendered from Markdown with WeasyPrint
`glossary_terms.md`	Standalone human-readable glossary proposal
`glossary_terms.pdf`	Styled A4 glossary PDF rendered from Markdown with WeasyPrint

Do not generate metadata_proposal.md, metadata_proposal.yaml, evidence.json, or other final artifacts. Temporary discovery/export files may be used while working, but they must not remain in the final output directory.
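As a rough illustration of the layout above, the run folder and the "only four final artifacts" rule could be sketched as follows. The function names `run_directory` and `unexpected_files` are hypothetical and are not part of the helper script.

```python
from datetime import datetime
from pathlib import Path

# The four final artifacts listed in the table above.
REQUIRED_FILES = (
    "metadata_proposal.json",
    "metadata_proposal.pdf",
    "glossary_terms.md",
    "glossary_terms.pdf",
)


def run_directory(source_name: str, now: datetime,
                  root: str = "metadata_proposals") -> Path:
    """Build {root}/{SOURCE_NAME}/{YYYY-MM-DD_HHmmss} for one run."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return Path(root) / source_name / stamp


def unexpected_files(file_names: list[str]) -> list[str]:
    """Return the names that are not one of the four required files."""
    return sorted(n for n in file_names if n not in REQUIRED_FILES)
```

Passing the timestamp in, rather than calling `datetime.now()` inside the function, keeps the path deterministic and easy to check.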

Default Proposal PDF Sections

The proposal PDF must present these simplified review sections in order. The JSON is the source of truth; the PDF is the human-readable rendering.

Resumen ejecutivo
Entrada interpretada
Activos y columnas
Glosario
Sugerencias de gobierno
Relaciones, linaje y calidad de datos
Supuestos y preguntas abiertas

Both PDFs must include an index/table of contents. Sections and subsections must be numbered. Asset sections must be visually isolated with page breaks, and the asset name must appear in the asset table header. Source names that contain Markdown-looking characters, such as # Case Updates, must be rendered as literal names rather than Markdown syntax.

Metadata Categories

Asset-Level Metadata

Use these compact fields for databases, schemas, tables, views, files, semantic models, measures, dimensions, and relationship assets when applicable:

asset_name, asset_type, fully_qualified_name, platform, source_system,
database_name, schema_name, object_type, data_domain, data_product,
environment, lifecycle_status, business_description, technical_description,
data_owner, data_steward, glossary_terms, upstream_lineage,
downstream_lineage, source_of_truth, security_notes, quality_notes,
confidence

Additional required traceability fields: metadata_source, assumptions, open_questions.

Column-Level Metadata

Use these compact fields for columns, semantic model columns, calculated columns, measures, and dimensional attributes when applicable:

column_name, business_name, data_type, nullable, key_role,
measure_or_dimension, business_description, technical_description,
semantic_type, classification, sensitivity_label, pii_indicator,
critical_data_element_candidate, derivation_logic, allowed_values,
example_values, quality_rules, relationship_hints, lineage_hints,
masking_recommendation, confidence

Additional required traceability fields: metadata_source, assumptions, open_questions.

Business Glossary Term Metadata

Use these compact fields for suggested business glossary terms:

term_name, definition, synonyms, related_terms, domain, status,
owner_or_steward, associated_assets, associated_columns, confidence,
metadata_source, open_questions

Use status: Proposed unless an explicit existing glossary status is present.

The glossary must also be published as an independent human-readable output in glossary_terms.md and glossary_terms.pdf. This glossary is a steward-review artifact, not a Purview import file. It should be readable without opening the full metadata proposal.

Classification Metadata

Use these fields for classification proposals:

classification_name, classification_reason, detection_basis, confidence

Allowed classification names include, when evidence supports them: personal data, financial data, operational data, identifier, reference data, master data, transactional data, metric, dimension, audit field, system field, derived field.

Sensitivity Metadata

Use these fields for Purview-like sensitivity label proposals:

suggested_label, label_reason, protection_recommendation, confidence

Allowed suggested labels: Public, Internal, Confidential, Highly Confidential, Restricted.

Data Quality Metadata

Use these fields for quality rules and profiling recommendations:

quality_dimension, quality_rule, severity, confidence

Quality dimensions include: completeness, uniqueness, validity, accuracy, consistency, timeliness, freshness, integrity, conformity, reasonableness.

Lineage and Relationship Metadata

Use these fields for relationship and lineage hints:

upstream_source, downstream_consumer, relationship_or_join_hint, confidence

Critical Data Element Metadata

Use these fields for CDE candidates:

cde_name, business_reason, associated_columns, confidence

Data Product Metadata

Use these fields for suggested data products or domain groupings:

data_product_name, purpose, included_assets, target_consumers,
owner_or_steward, confidence

Evidence and Confidence Rules

Metadata Source

Value	Meaning
`explicit`	Directly present in Fabric metadata, TMDL, schema, annotations, descriptions, names, or user-provided context
`inferred`	Deduced from names, data types, relationships, sample values, model structure, or common enterprise patterns
`mixed`	Combines explicit evidence with inferred enrichment

Confidence Levels

Confidence	Use When
`high`	Strong explicit evidence exists, or multiple independent signals agree, such as name pattern plus data type plus sample pattern
`medium`	Reasonable inference from names, data types, or model context, but no validating sample or explicit business description
`low`	Weak evidence, ambiguous abbreviations, missing samples, unclear domain, or conflicting signals

Never raise confidence only because a term sounds plausible. Lower confidence when samples, row counts, relationships, or business context are missing.
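One possible reading of the confidence table, written as a small decision function. The inputs and the thresholds (for example, "two or more agreeing signals plus samples" for `high`, and conflicts overriding everything else) are assumptions made for this sketch; the skill text does not fix exact numbers.

```python
def confidence_level(explicit: bool, agreeing_signals: int,
                     has_samples: bool, conflicting: bool) -> str:
    """Map evidence to high / medium / low.

    explicit         - strong explicit evidence exists in the source
    agreeing_signals - count of independent signals that agree
                       (name pattern, data type, sample pattern, ...)
    has_samples      - sample values were available to validate the inference
    conflicting      - signals contradict each other
    """
    if conflicting:
        return "low"      # assumption: a conflict always forces low
    if explicit:
        return "high"
    if agreeing_signals >= 2 and has_samples:
        return "high"     # multiple independent signals, validated by samples
    if agreeing_signals >= 1:
        return "medium"   # reasonable inference, but nothing validates it
    return "low"          # weak or missing evidence
```

Note that plausibility alone never appears as an input, which mirrors the rule that a term sounding plausible must not raise confidence.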

Inference Rules

Asset Type and Domain

Use table/model names, workspace names, lakehouse names, schemas, and measure names to infer candidate domains. Examples:

Pattern	Suggested Domain or Context
`customer`, `client`, `account`, `contact`	Customer, sales, CRM, master data
`order`, `invoice`, `payment`, `booking`, `transaction`	Sales, finance, transactional process
`product`, `item`, `sku`, `catalog`	Product, inventory, reference/master data
`employee`, `staff`, `payroll`, `hr`	Workforce or HR; possible sensitive data
`patient`, `diagnosis`, `doctor`, `visit`, `hospital`	Healthcare; possible regulated data
`flight`, `passenger`, `airport`, `aircraft`	Airline operations
`incident`, `ticket`, `case`, `status`, `priority`	Operations or service management
`calendar`, `date`, `time`	Date/time dimension
`audit`, `log`, `event`, `telemetry`	Operational, audit, or observability data

If domain is ambiguous, use Unknown or a broad proposed domain and add an open question.
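A minimal sketch of this lookup, using a subset of the pattern table. The tokenisation, the plural handling, and the choice to return `Unknown` when several domains match are assumptions of the sketch, not documented behaviour of the skill. The open question is written in Spanish because review prose must be in Spanish.

```python
import re

# Subset of the pattern table above.
DOMAIN_PATTERNS = {
    "Customer, sales, CRM, master data": ("customer", "client", "account", "contact"),
    "Sales, finance, transactional process": ("order", "invoice", "payment", "booking", "transaction"),
    "Product, inventory, reference/master data": ("product", "item", "sku", "catalog"),
    "Date/time dimension": ("calendar", "date", "time"),
}


def name_tokens(name: str) -> set[str]:
    """Split snake_case, camelCase, or spaced names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def suggest_domain(asset_name: str) -> dict:
    tokens = name_tokens(asset_name)
    hits = [
        domain
        for domain, keywords in DOMAIN_PATTERNS.items()
        # Accept the keyword itself or a simple plural ("orders").
        if any(k in tokens or f"{k}s" in tokens for k in keywords)
    ]
    if len(hits) == 1:
        # Name-only evidence: never higher than medium.
        return {"data_domain": hits[0], "metadata_source": "inferred",
                "confidence": "medium", "open_questions": []}
    # No match or several competing matches: stay at Unknown and ask.
    question = f"¿A qué dominio de negocio pertenece {asset_name}?"
    return {"data_domain": "Unknown", "metadata_source": "inferred",
            "confidence": "low", "open_questions": [question]}
```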

Descriptions

Generate two descriptions for each asset and column:

Technical description: what the object stores structurally, based on schema and type evidence.
Business description: what the object likely means to a business user, based on names, model context, relationships, and measures.

Descriptions must be concise, deterministic, and free of unsupported claims. For low-confidence descriptions, explicitly say Candidate description or Likely represents.

Keys and Relationships

Infer key metadata with these signals:

Signal	Suggested Metadata
Column named `id`, `{table}_id`, `{entity}_id`, `{entity}_key`, `uuid`	Candidate primary, foreign, natural, or surrogate key depending on context
Single non-null unique ID-like column in a dimension table	Candidate primary key
Fact table column matching a dimension PK name/type	Candidate foreign key
Columns named `code`, `number`, `reference`, `dni`, `nif`, `passport`, `iban`	Candidate natural key or business identifier
Integer sequential ID column	Candidate surrogate key
Composite repeated columns such as `order_id` + `line_number`	Candidate composite key

Default the cardinality to a many-to-one candidate only when a transactional or fact-like table points to a reference or dimension-like table. Use unknown when the evidence is insufficient.
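The name-based signals and the cardinality default can be sketched as below. This covers only part of the signal table: it does not try to detect surrogate or composite keys, and stripping a `dim_` or `fact_` prefix from the table name is an assumption about naming conventions, not something the skill prescribes.

```python
import re

NATURAL_KEY_NAMES = {"code", "number", "reference", "dni", "nif", "passport", "iban"}


def key_role_candidate(column_name: str, table_name: str) -> str:
    """Suggest a candidate key role from the column and owning table names."""
    col = column_name.lower()
    # Assumption: tables may carry a dim_/fact_ prefix before the entity name.
    entity = re.sub(r"^(dim|fact)_", "", table_name.lower())
    if col in NATURAL_KEY_NAMES:
        return "candidate natural key"
    if col in {"id", "uuid", f"{entity}_id", f"{entity}_key"}:
        return "candidate primary key"
    if col.endswith(("_id", "_key")):
        # ID-like column naming another entity: likely points elsewhere.
        return "candidate foreign key"
    return "none"


def cardinality_candidate(from_is_fact_like: bool, to_is_dimension_like: bool) -> str:
    """Many-to-one only for fact-like -> dimension-like; otherwise unknown."""
    if from_is_fact_like and to_is_dimension_like:
        return "many-to-one candidate"
    return "unknown"
```

Every returned value keeps the word "candidate", in line with the conservative-language rule.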

Measure and Dimension Detection

Pattern	Suggestion
Numeric additive columns: `amount`, `total`, `quantity`, `cost`, `revenue`, `sales`, `importe`, `cantidad`	Metric/measure candidate
Descriptive strings: `name`, `description`, `category`, `status`, `type`, `city`, `country`	Dimension attribute candidate
DAX measure expression exists	Explicit measure
Calculated expression exists	Derived field candidate, include derivation logic

Classification Detection

Pattern	Candidate Classification
`id`, `_id`, `_key`, `uuid`, `guid`, `code`, `number`	Identifier
`email`, `phone`, `mobile`, `telefono`, `address`, `dni`, `nif`, `nie`, `passport`, `name`, `birth_date`	Personal data candidate
`iban`, `bank`, `card`, `salary`, `payroll`, `payment`, `invoice`, `amount`, `revenue`, `margin`, `cost`, `price`	Financial data candidate
`patient`, `diagnosis`, `medical`, `doctor`, `visit`, `treatment`	Regulated or sensitive data candidate
`status`, `category`, `type`, `priority`, `country`, `city`, `currency`	Reference data candidate
`created_at`, `updated_at`, `loaded_at`, `ingestion_time`, `batch_id`, `source_file`, `run_id`	Audit or system field
`score`, `ratio`, `total`, `average`, `margin`, `age`, `duration`, `days_since`	Derived field or metric candidate

Do not use regex-only matches as definitive evidence. Describe uncertainty in classification_reason for sensitive classifications.
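A hedged sketch of name-based classification that follows the warning above: matches are returned as candidates with an explicit uncertainty note, never as decisions. It uses a subset of the table, and the matching strategy (whole tokens for single words, substrings for compound patterns) is an assumption chosen so that `filename` does not match `name`.

```python
import re

# Subset of the pattern table above.
CLASSIFICATION_PATTERNS = {
    "identifier": ("id", "key", "uuid", "guid", "code", "number"),
    "personal data": ("email", "phone", "mobile", "telefono", "address", "dni",
                      "nif", "nie", "passport", "name", "birth_date"),
    "financial data": ("iban", "bank", "card", "salary", "payroll", "payment",
                       "invoice", "amount", "revenue", "margin", "cost", "price"),
    "audit field": ("created_at", "updated_at", "loaded_at", "ingestion_time",
                    "batch_id", "source_file", "run_id"),
}


def normalise(column_name: str) -> str:
    """Convert camelCase or spaced names to lower snake_case."""
    snake = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", column_name)
    return re.sub(r"[^A-Za-z0-9]+", "_", snake).strip("_").lower()


def classify_column(column_name: str) -> list[dict]:
    """Return candidate classifications; a column may match several."""
    name = normalise(column_name)
    tokens = name.split("_")
    proposals = []
    for classification, patterns in CLASSIFICATION_PATTERNS.items():
        for pattern in patterns:
            # Compound patterns match as substrings, single words as tokens.
            matched = pattern in name if "_" in pattern else pattern in tokens
            if matched:
                proposals.append({
                    "classification_name": classification,
                    # Spanish review prose, stating the uncertainty.
                    "classification_reason": (
                        f"Coincidencia de patrón de nombre '{pattern}'; "
                        "requiere revisión humana"
                    ),
                    "detection_basis": "name pattern",
                    "confidence": "medium",  # name-only evidence, never high
                    "metadata_source": "inferred",
                })
                break  # one proposal per classification is enough
    return proposals
```

For example, `batch_id` yields both an identifier candidate and an audit field candidate, leaving the choice to the steward.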

Sensitivity Label Suggestions

Suggested Label	Use When
`Public`	Asset appears intended for open/public use and no sensitive indicators are present. Rarely infer this without explicit context.
`Internal`	Default for ordinary enterprise metadata without sensitive indicators.
`Confidential`	Contains candidate business confidential data, identifiers, contact data, financial values, or internal operational data.
`Highly Confidential`	Contains likely PII, financial identifiers, health-related fields, security-relevant fields, or combinations of sensitive attributes.
`Restricted`	Contains secrets, credentials, tokens, private keys, strong regulated data indicators, or explicitly restricted labels in source metadata.

Treat every inferred label above Internal as requiring human review.
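The label table can be read as "pick the most restrictive label that any signal supports". The sketch below assumes the labels are ordered as listed and uses hypothetical signal names (`secret`, `pii`, `contact`, and so on); it never infers `Public`, since that label needs explicit context.

```python
# Labels from least to most restrictive, in the order the skill lists them.
LABEL_ORDER = ["Public", "Internal", "Confidential", "Highly Confidential", "Restricted"]

# Hypothetical signal names mapped to the label row they belong to.
SIGNAL_TO_LABEL = {
    "secret": "Restricted",
    "credential": "Restricted",
    "pii": "Highly Confidential",
    "health": "Highly Confidential",
    "financial_identifier": "Highly Confidential",
    "identifier": "Confidential",
    "contact": "Confidential",
    "financial_value": "Confidential",
}


def suggest_label(signals: list[str]) -> dict:
    """Choose the most restrictive label supported by the given signals."""
    label = "Internal"  # default when no sensitive indicator is present
    for signal in signals:
        candidate = SIGNAL_TO_LABEL.get(signal, "Internal")
        if LABEL_ORDER.index(candidate) > LABEL_ORDER.index(label):
            label = candidate
    return {
        "suggested_label": label,
        # Every inferred label above Internal needs human review.
        "review_required": LABEL_ORDER.index(label) > LABEL_ORDER.index("Internal"),
    }
```

The sketch covers only the label choice; `label_reason`, `protection_recommendation`, and `confidence` would still have to be filled in for a complete proposal object.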

Privacy and Sensitive Data Signals

Potential PII and sensitive fields include:

name, first_name, last_name, full_name, email, phone, mobile, address, city,
postal_code, latitude, longitude, dni, nif, nie, ssn, passport, national_id,
tax_id, birth_date, gender, patient_id, diagnosis, medical_record, iban,
bank_account, credit_card, salary, password, secret, token, api_key, private_key,
ip_address, device_id, user_agent

Mark these as candidates unless sample values or explicit metadata confirm the pattern.

Data Quality Rules

Suggest quality rules from type and classification evidence:

Evidence	Suggested Rule
Candidate primary key	Completeness and uniqueness checks, expected threshold 100%
Candidate foreign key	Referential integrity check against referenced asset
Required business column	Completeness check with agreed null threshold
Email/phone/identifier pattern	Validity and conformity checks
Date/timestamp column	Freshness, timeliness, and valid range checks
Amount/quantity/metric	Reasonableness, non-negative, range, outlier, and aggregation checks
Status/category/code	Allowed values and reference set checks
Audit field	Presence, monotonicity, ingestion recency, and run consistency checks

Severity guidance:

Severity	Use When
`high`	Key integrity, sensitive fields, CDE candidates, business-critical metrics
`medium`	Important dimensions, common reporting filters, relationship fields
`low`	Descriptive fields or optional enrichment attributes
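To make the two tables above concrete, here is a small lookup that turns one piece of evidence into rule proposals. It covers only three evidence rows, the evidence keys are hypothetical labels, and treating a candidate foreign key as `high` severity (key integrity) is an assumption. Rule texts are in Spanish because review prose must be in Spanish.

```python
# (quality_dimension, quality_rule) pairs per evidence type; subset of the table.
RULES_BY_EVIDENCE = {
    "candidate primary key": [
        ("completeness", "Sin valores nulos; umbral esperado 100%"),
        ("uniqueness", "Sin valores duplicados; umbral esperado 100%"),
    ],
    "candidate foreign key": [
        ("integrity", "Cada valor existe en el activo referenciado"),
    ],
    "status/category/code": [
        ("validity", "Los valores pertenecen al conjunto de referencia permitido"),
    ],
}

SEVERITY_BY_EVIDENCE = {
    "candidate primary key": "high",     # key integrity
    "candidate foreign key": "high",     # assumption: also key integrity
    "status/category/code": "medium",    # common reporting filter
}


def suggest_quality_rules(evidence: str, confidence: str) -> list[dict]:
    """Build rule proposals for one evidence type; unknown evidence gives none."""
    return [
        {
            "quality_dimension": dimension,
            "quality_rule": rule,
            "severity": SEVERITY_BY_EVIDENCE[evidence],
            # The rule is only as reliable as the evidence behind it.
            "confidence": confidence,
        }
        for dimension, rule in RULES_BY_EVIDENCE.get(evidence, [])
    ]
```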

Critical Data Element Candidates

Suggest a CDE candidate when one or more conditions apply:

Column is a key identifier used in relationships.
Column is a financial, operational, compliance, or customer-impacting metric.
Column appears in a semantic model measure or relationship.
Column is likely used for regulatory, privacy, security, or executive reporting.
Column is needed to join important assets or identify master/reference entities.

Always describe CDEs as candidates and avoid definitive CDE status.

Glossary Term Generation

Create glossary terms by converting physical names to business names and grouping repeated concepts.

Rules:

Convert snake_case and camelCase to title case business terms.
Expand common abbreviations only when likely: id -> Identifier, qty -> Quantity, amt -> Amount, dt -> Date, cd -> Code, desc -> Description.
Do not invent acronyms unless they are present in the source.
Create related terms from relationship context, such as Customer, Customer Identifier, and Customer Segment.
Set owner_or_steward to Unknown or TBD, and status to Proposed, unless explicit.
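The naming rules above can be sketched as follows. The abbreviation table is exactly the one listed in the rules; keeping upper-case tokens such as `SKU` untouched is an assumption made so the sketch does not rewrite acronyms found in the source. `glossary_term` is a hypothetical helper that fills only a few of the glossary fields.

```python
import re

# Abbreviations the rules above allow expanding.
ABBREVIATIONS = {
    "id": "Identifier", "qty": "Quantity", "amt": "Amount",
    "dt": "Date", "cd": "Code", "desc": "Description",
}


def business_name(physical_name: str) -> str:
    """Convert snake_case or camelCase to a title-case business term."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", physical_name)
    words = [w for w in re.split(r"[_\s]+", spaced) if w]
    all_upper = physical_name.isupper()
    result = []
    for word in words:
        if word.lower() in ABBREVIATIONS:
            result.append(ABBREVIATIONS[word.lower()])
        elif word.isupper() and not all_upper:
            result.append(word)  # keep source acronyms such as SKU as-is
        else:
            result.append(word.capitalize())
    return " ".join(result)


def glossary_term(physical_name: str, asset: str) -> dict:
    """Build a minimal proposed glossary term for one column."""
    return {
        "term_name": business_name(physical_name),
        "status": "Proposed",
        "owner_or_steward": "TBD",  # never invent a real owner
        "associated_assets": [asset],
        "associated_columns": [physical_name],  # source identifier kept verbatim
        "metadata_source": "inferred",
    }
```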

Purview-Oriented Mapping

Use this mapping for catalog ingestion planning:

Proposal Area	Purview-Oriented Target
Asset metadata	Data map asset attributes, custom attributes, contacts, collections
Business glossary terms	Glossary terms and term relationships
Classification proposals	Custom or built-in classifications for review
Sensitivity proposals	Microsoft Purview Information Protection label proposal, not applied label
Data quality rules	Data quality rule backlog or observability configuration
Relationships and lineage hints	Lineage relationships or process mappings for validation
CDE candidates	Governance critical data element candidate register
Data products	Data product/domain grouping proposal

Never state that a Purview label or classification has been applied.

Machine-Readable JSON Schema

The JSON file must follow this compact top-level structure:

{
  "metadata_proposal": {
    "generated_at": "ISO-8601 timestamp",
    "generated_by": "fabric-metadata-creation agent",
    "proposal_version": "1.0",
    "input_interpreted": {},
    "executive_summary": "",
    "assets": [],
    "columns": [],
    "glossary_terms": [],
    "classifications": [],
    "sensitivity_labels": [],
    "critical_data_elements": [],
    "data_products": [],
    "relationships_lineage": [],
    "data_quality_rules": [],
    "assumptions": [],
    "open_questions": []
  }
}

Every asset and column object must include metadata_source, confidence, assumptions, and open_questions. Other proposal objects must include metadata_source and confidence. Use review_required only when it materially helps prioritize steward review.
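A minimal structural check for this layout might look like the sketch below. It is not the helper script's `validate` command, whose internals are not shown here; `structure_problems` is a hypothetical name, and the check covers only the top-level keys and the traceability fields on assets and columns.

```python
import json

TOP_LEVEL_KEYS = {
    "generated_at", "generated_by", "proposal_version", "input_interpreted",
    "executive_summary", "assets", "columns", "glossary_terms",
    "classifications", "sensitivity_labels", "critical_data_elements",
    "data_products", "relationships_lineage", "data_quality_rules",
    "assumptions", "open_questions",
}
TRACE_FIELDS = ("metadata_source", "confidence", "assumptions", "open_questions")
ALLOWED = {
    "metadata_source": {"explicit", "inferred", "mixed"},
    "confidence": {"high", "medium", "low"},
}


def structure_problems(json_text: str) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    proposal = json.loads(json_text).get("metadata_proposal", {})
    problems = [f"missing top-level key: {key}"
                for key in sorted(TOP_LEVEL_KEYS - set(proposal))]
    for section in ("assets", "columns"):
        for index, obj in enumerate(proposal.get(section, [])):
            for field in TRACE_FIELDS:
                if field not in obj:
                    problems.append(f"{section}[{index}] missing {field}")
                elif field in ALLOWED and obj[field] not in ALLOWED[field]:
                    problems.append(f"{section}[{index}] invalid {field}: {obj[field]}")
    return problems
```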

Output Authoring Rules

Keep metadata_proposal.json compact and machine-readable; do not add legacy YAML, Markdown proposal, or evidence files.
Render metadata_proposal.pdf from metadata_proposal.json using the helper script's Markdown-to-WeasyPrint renderer.
Create glossary_terms.md as a standalone document that can be reviewed independently from the proposal PDF.
Render glossary_terms.pdf from structured glossary terms using the helper script's Markdown-to-WeasyPrint renderer; do not manually export the PDF.
Use A4 pages with 1.5cm margins, Segoe UI/Helvetica/Arial fonts, blue headings, zebra-striped tables, and VS Code-style code blocks.
Write review prose, section titles, field labels, assumptions, questions, and generated descriptions in Spanish.
Preserve original Fabric identifiers and source values exactly; do not translate table names, column names, measure names, workspace names, lakehouse names, semantic model names, or explicit glossary/source terms.
Use concise enterprise language. Prefer brief explanations over long generated prose.
Include confidence and metadata source in every proposal object.

Validation Rules

Before finishing, validate:

metadata_proposal.json parses as JSON.
metadata_proposal.pdf exists and is non-empty.
glossary_terms.md exists and includes a standalone glossary title.
glossary_terms.pdf exists and is non-empty.
No legacy artifacts exist in the output folder: metadata_proposal.md, metadata_proposal.yaml, or evidence.json.
Every asset and column proposal includes confidence, metadata_source, assumptions, and open_questions.
No real owner, steward, contact, legal obligation, retention policy, access policy, or compliance decision was invented.

Use:

python .github/skills/fabric-metadata-creation/scripts/fabric_metadata.py validate <output_dir>

Gotchas

A physical column named customer_id is not proof that the field is PII. It is an identifier candidate and may become personal data only when it can identify or link to a person.
A field named name can describe a person, product, department, status, or location. Use surrounding table context before proposing PII.
Financial measures are often confidential business data even when they are not personal data.
Health-related names such as patient, diagnosis, or treatment are strong sensitivity signals, but still require human review.
Semantic model measures often carry business meaning. Use DAX expressions and format strings as explicit evidence for metric descriptions.
Hidden key columns in semantic models are still governance-relevant and should be included in metadata proposals.
Semantic model getDefinition calls may return 202 Accepted. The helper must poll the operation and then fetch the operation result payload; the operation status itself does not contain the TMDL parts.
Do not assign Public unless public availability is explicit or strongly implied by context.
Do not assign legal categories such as GDPR, HIPAA, PCI, or SOX as definitive obligations. Use possible regulatory review area only when the source context suggests it.
If no sample values are available, avoid sample-based claims and reduce confidence.
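The 202 Accepted gotcha above can be sketched as a polling loop. The sketch assumes the Fabric long-running-operation endpoints `GET /v1/operations/{operationId}` for state and `GET /v1/operations/{operationId}/result` for the payload. `http_get` is a hypothetical caller-supplied function that performs an authenticated GET and returns the parsed JSON body; the fixed delay is also an assumption (a real client would normally honour the `Retry-After` header). This is not the helper script's actual implementation.

```python
import time
from typing import Callable

FABRIC_API = "https://api.fabric.microsoft.com/v1"


def fetch_operation_result(operation_id: str,
                           http_get: Callable[[str], dict],
                           sleep: Callable[[float], None] = time.sleep,
                           delay_seconds: float = 2.0,
                           max_polls: int = 30) -> dict:
    """Poll an operation until it succeeds, then fetch its result payload."""
    state_url = f"{FABRIC_API}/operations/{operation_id}"
    for _ in range(max_polls):
        state = http_get(state_url)
        status = state.get("status")
        if status == "Succeeded":
            # The state document has no TMDL parts; the result endpoint does.
            return http_get(f"{state_url}/result")
        if status == "Failed":
            raise RuntimeError(f"operation {operation_id} failed: {state}")
        sleep(delay_seconds)
    raise TimeoutError(f"operation {operation_id} did not finish in time")
```

Injecting `http_get` and `sleep` keeps the loop independent of any HTTP library and lets it be exercised without network access.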
