named-entity-normalization

Normalize named entities and concept variants into canonical forms using controlled vocabularies or standard taxonomies.

Install command

npx skills add https://github.com/dandye/information-architecture --skill named-entity-normalization

Copy and paste this command into Claude Code to install the skill.

Source

dandye/information-architecture

Stars: 1

Forks: 0

Updated: May 23, 2026 at 13:32

SKILL.md

name	named-entity-normalization
description	Normalize named entities and concept variants into canonical forms using controlled vocabularies or standard taxonomies.
required_roles	{"scribe":"roles/scribe.viewer"}
personas	["information-architect","content-strategist","taxonomist","knowledge-engineer"]

Named Entity Normalization (NEN) & Entity Grounding Skill

Resolve lexical variations, typos, casing differences, and industry aliases of named entities within a document set into single canonical concepts. This skill maps unstructured entity mentions to unique database identifiers, standard taxonomies, or a custom controlled vocabulary (thesaurus), which reduces vocabulary bloat and embedding fragmentation.

Inputs

PATH - The document directory or file path containing raw entities to normalize (e.g., "/threat-intel-reports").
TAXONOMY_PATH - (Optional) Path to a target taxonomy, glossary, database, or controlled vocabulary (e.g., a MITRE ATT&CK group catalog or medical UMLS schema) to ground entities against. If not provided, the agent will construct a self-consistent canonical thesaurus directly from the document set's unique concepts.
CANONICAL_MAPPING - (Optional) Boolean, whether to output a JSON mapping dictionary (thesaurus) alongside the markdown report (default: true).
HYBRID_STRATEGY - (Optional) Boolean, whether to apply LLM-driven semantic inference (HNEN) for highly heterogeneous or inconsistent parameter names alongside lexical string matching (default: true).

Workflow

Step 1: Entity Extraction & Mentions Profiling

Locate all Named Entity mentions in the document set at PATH.

Extract proper nouns, technical acronyms, brand names, and domain terms.
Preserve casing and structural delimiters (e.g., hyphens, spaces) to retain context.
Profile entity occurrences, document frequencies, and variation patterns.
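
The profiling part of this step can be sketched as follows. This is a minimal illustration, not the skill's implementation: it assumes the mentions have already been extracted (for example by an NER model) and arrive as a hypothetical dict of document name to mention list, and the function name profile_mentions and the grouping key are inventions for the example.

```python
from collections import Counter, defaultdict

def profile_mentions(docs):
    """Profile mentions from a dict of document name -> list of raw mentions.

    Casing and delimiters are preserved in the stored surface forms.
    """
    occurrences = Counter()       # total count of each surface form
    doc_freq = Counter()          # number of documents containing each form
    variants = defaultdict(set)   # crude grouping key -> surface forms seen
    for mentions in docs.values():
        occurrences.update(mentions)
        doc_freq.update(set(mentions))
        for mention in mentions:
            # Assumed grouping key: lowercase, alphanumerics only.
            key = "".join(ch for ch in mention.lower() if ch.isalnum())
            variants[key].add(mention)
    return occurrences, doc_freq, variants

docs = {
    "report-a.md": ["Scattered Spider", "SCATTERED SPIDER", "Scattered Spider"],
    "report-b.md": ["ScatteredSpider", "Scattered Spider"],
}
occurrences, doc_freq, variants = profile_mentions(docs)
```

Here the variants table exposes the variation pattern (three surface forms under one key) while the original strings stay untouched for later steps.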

Step 2: Orthographic & Lexical Normalization

Apply base linguistic cleaning rules to standardize formatting variants.

Clean punctuation, strip extra whitespace, and standardize hyphens.
Identify obvious case variants (e.g., "Scattered Spider", "SCATTERED SPIDER", "scattered spider") and orthographic variants (e.g., "ScatteredSpider").
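
One way to express these cleaning rules is a single normalization function. The sketch below is an assumption about how the rules might be ordered; the name normalize_mention, the specific hyphen code points, and the choice to treat hyphens and underscores as spaces are all illustrative.

```python
import re
import unicodedata

_HYPHENS = re.compile(r"[\u2010\u2011\u2012\u2013\u2014\u2212]")  # Unicode dash variants
_CAMEL = re.compile(r"(?<=[a-z])(?=[A-Z])")   # boundary inside "ScatteredSpider"
_PUNCT = re.compile(r"[^\w\s-]")              # anything but word chars, spaces, hyphens
_SEPARATORS = re.compile(r"[\s_-]+")          # runs of spaces, underscores, hyphens

def normalize_mention(mention: str) -> str:
    """Return a comparison key for a raw mention; the raw form is kept elsewhere."""
    text = unicodedata.normalize("NFKC", mention)
    text = _HYPHENS.sub("-", text)       # standardize hyphens
    text = _CAMEL.sub(" ", text)         # split run-together words
    text = _PUNCT.sub("", text)          # clean punctuation
    text = _SEPARATORS.sub(" ", text).strip()
    return text.casefold()               # collapse case variants

keys = {normalize_mention(m) for m in
        ["Scattered Spider", "SCATTERED SPIDER", "scattered spider",
         "ScatteredSpider", "  Scattered-Spider. "]}
```

All five inputs collapse to one key, which is what lets later steps treat them as a single candidate.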

Step 3: Canonical Candidate Generation & Grounding

Match entity mentions to target canonical entries in TAXONOMY_PATH or general domain baselines.

Dictionary / Database Grounding: Compare cleaned entity strings against the target taxonomy's names and alias fields.
Two-Step Retrieval (Semantic Re-ranking): For ambiguous entities, retrieve candidates based on string similarity and re-rank candidates by analyzing sentence context and concept definitions.
LLM-Driven Semantic Inference (HNEN): If HYBRID_STRATEGY is true and terms are highly heterogeneous (e.g., parameter naming conventions like V_in, Vin, InputVoltage), use semantic inference to map them to unified concepts based on datasheet conventions and domain schemas.
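
The first two strategies can be sketched with the standard library alone. Everything specific here is hypothetical: the taxonomy entries, the CUSTOM-... identifiers, the definitions, and the 0.6 similarity threshold. The re-ranking step uses plain word overlap between the sentence context and each definition, which is only a stand-in for real semantic re-ranking; the HNEN strategy would need an LLM call and is not shown.

```python
from difflib import SequenceMatcher

# Hypothetical taxonomy; IDs, aliases, and definitions are illustrative only.
TAXONOMY = {
    "CUSTOM-001": {
        "name": "Scattered Spider",
        "aliases": ["UNC3944", "Oktapus"],
        "definition": "threat group known for social engineering and sim swapping",
    },
    "CUSTOM-002": {
        "name": "Scatter Plot",
        "aliases": [],
        "definition": "chart that displays values for two variables as points",
    },
}

def build_index(taxonomy):
    """Map each casefolded name or alias to its taxonomy ID."""
    index = {}
    for concept_id, entry in taxonomy.items():
        for label in [entry["name"], *entry["aliases"]]:
            index[label.casefold()] = concept_id
    return index

def ground(mention, context, taxonomy, index, threshold=0.6):
    """Return the best taxonomy ID for a cleaned mention, or None if nothing fits."""
    key = mention.casefold()
    if key in index:  # dictionary grounding: exact name or alias hit
        return index[key]
    # Retrieval: keep concepts whose label is similar enough to the mention.
    candidates = {
        concept_id
        for label, concept_id in index.items()
        if SequenceMatcher(None, key, label).ratio() >= threshold
    }
    if not candidates:
        return None
    # Re-ranking: prefer the concept whose definition shares words with the context.
    context_words = set(context.casefold().split())
    def overlap(concept_id):
        return len(context_words & set(taxonomy[concept_id]["definition"].split()))
    return max(sorted(candidates), key=overlap)

index = build_index(TAXONOMY)
alias_hit = ground("unc3944", "", TAXONOMY, index)
typo_hit = ground("scattered spidr", "the threat group used social engineering", TAXONOMY, index)
```

Sorting the candidates before taking the maximum keeps the result deterministic when two concepts tie on overlap.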

Step 4: Synonym Grouping & Thesaurus Synthesis

Compile synonym sets and designate Preferred Terms.

Resolve multiple distinct names for the same entity (e.g., "UNC3944", "Oktapus", "Scattered Spider") to a single Preferred Term representing the canonical concept.
Create a comprehensive Synonym & Alias Map capturing all variations under their respective preferred canonical keys.
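
A minimal sketch of this step, assuming the synonym sets have already been decided: the map below is keyed by Preferred Term and is inverted into a flat lookup table. The "Input Voltage" Preferred Term and the function name build_thesaurus are assumptions made for the example; the conflict check is an added safeguard, not something the skill description requires.

```python
# Hypothetical Synonym & Alias Map: Preferred Term -> known variants.
SYNONYM_MAP = {
    "Scattered Spider": ["UNC3944", "Oktapus", "SCATTERED SPIDER", "ScatteredSpider"],
    "Input Voltage": ["V_in", "Vin", "InputVoltage"],
}

def build_thesaurus(synonym_map):
    """Invert the map into lowercased variant -> Preferred Term."""
    thesaurus = {}
    for preferred, variants in synonym_map.items():
        for variant in [preferred, *variants]:
            key = variant.lower()
            existing = thesaurus.setdefault(key, preferred)
            if existing != preferred:
                # The same variant must not point at two canonical concepts.
                raise ValueError(
                    f"{variant!r} maps to both {existing!r} and {preferred!r}"
                )
    return thesaurus

thesaurus = build_thesaurus(SYNONYM_MAP)
```

Including the Preferred Term among its own variants means a lookup never fails on text that is already canonical.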

Required Outputs

A NAMED_ENTITY_NORMALIZATION_REPORT in markdown format containing:

Executive Summary: Metrics on total entity mentions, unique canonical concepts generated, compression ratio (raw mentions to canonical entities), and taxonomy coverage.
Canonical Concept Glossary: A structured table containing:
- Preferred Term: The formal, capitalized canonical name.
- Entity Type: Category (e.g., Threat Group, Software, Concept, Parameter).
- Target Identifier: Grounded ID (e.g., MITRE_G1017 or custom ID) if matched against a taxonomy.
- Lexical Variants: List of casing, spelling, and abbreviation variations consolidated.
- Vendor/Industry Aliases: Industry synonyms resolved (e.g., "Oktapus", "UNC3944").
Actionable Preprocessor Implementation Plan: A Python snippet or configuration schema showing how to load the generated mapping as a pre-tokenization preprocessor layer to prevent embedding fragmentation.
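
The Executive Summary metrics listed above are not defined precisely in this document, so the following is one possible reading rather than the skill's formula. It assumes compression ratio means total raw mentions divided by unique canonical concepts, and taxonomy coverage means the share of canonical concepts that matched a taxonomy entry. The function and argument names are hypothetical.

```python
def summary_metrics(mentions, thesaurus, grounded_terms):
    """Compute report metrics.

    mentions: list of raw mentions as found in the documents.
    thesaurus: lowercased variant -> Preferred Term.
    grounded_terms: set of Preferred Terms matched against a taxonomy.
    """
    canonical = {thesaurus[m.lower()] for m in mentions if m.lower() in thesaurus}
    total = len(mentions)
    concepts = len(canonical)
    return {
        "total_mentions": total,
        "unique_canonical_concepts": concepts,
        # Assumed definition: raw mentions per canonical concept.
        "compression_ratio": total / concepts if concepts else 0.0,
        # Assumed definition: fraction of concepts with a taxonomy match.
        "taxonomy_coverage": len(canonical & grounded_terms) / concepts if concepts else 0.0,
    }

metrics = summary_metrics(
    mentions=["UNC3944", "Oktapus", "Scattered Spider", "Vin", "V_in", "Vin"],
    thesaurus={
        "unc3944": "Scattered Spider",
        "oktapus": "Scattered Spider",
        "scattered spider": "Scattered Spider",
        "vin": "Input Voltage",
        "v_in": "Input Voltage",
    },
    grounded_terms={"Scattered Spider"},
)
```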

If CANONICAL_MAPPING is true:

nen-thesaurus.json: A structured JSON mapping file where keys are raw spelling variants (all lowercased for robust lookup) and values are the corresponding Preferred Terms.
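
A preprocessor built on this file might look like the sketch below. It assumes the JSON shape described above (lowercased variant to Preferred Term) and rewrites whole-word matches before the text reaches a tokenizer or embedding model. The function name compile_preprocessor is hypothetical, and the inline JSON string stands in for reading the real file.

```python
import json
import re

def compile_preprocessor(thesaurus):
    """Build a function that rewrites known variants to their Preferred Terms."""
    if not thesaurus:
        return lambda text: text
    # Longest variants first so multi-word names win over shorter overlaps.
    variants = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )
    def preprocess(text):
        # Fall back to the matched text if lowercasing misses a key.
        return pattern.sub(
            lambda m: thesaurus.get(m.group(0).lower(), m.group(0)), text
        )
    return preprocess

# In practice the mapping would be read from the generated file, e.g.:
#   with open("nen-thesaurus.json", encoding="utf-8") as fh:
#       thesaurus = json.load(fh)
thesaurus = json.loads(
    '{"unc3944": "Scattered Spider", "oktapus": "Scattered Spider",'
    ' "scattered spider": "Scattered Spider"}'
)
preprocess = compile_preprocessor(thesaurus)
cleaned = preprocess("UNC3944 (aka Oktapus) overlaps with scattered spider.")
```

The word-boundary lookarounds keep a variant from being rewritten when it is only a fragment of a longer token, such as "UNC39445".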

Quick Reference

Purpose: Deduplicate and ground unstructured text entities to standard taxonomies or a controlled vocabulary, preventing parameter duplication and embedding fragmentation.
Outcome: A canonical entity glossary and preprocessor-ready JSON thesaurus mapping file.

References

BERN2: Sung, M., Jeong, M., Choi, Y., Kim, D., Lee, J., & Kang, J. (2022). BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac598
Species Normalization: Awan, Z., Kahlke, T., Ralph, P., & Kennedy, P. (2023). Bi-Encoders based Species Normalization -- Pairwise Sentence Learning to Rank. arXiv:2310.14366. https://doi.org/10.48550/arxiv.2310.14366
KG Canonicalization: Dash, S., Rossiello, G., Mihindukulasooriya, N., Bagchi, S., & Gliozzo, A. (2020). Open Knowledge Graphs Canonicalization using Variational Autoencoders. arXiv:2012.04780. https://doi.org/10.48550/arxiv.2012.04780
Heterogeneous NEN (HNEN): Chen, H. C., Xu, Y. P., & Zhang, Y. (2025). D2S-FLOW: Automated Parameter Extraction from Datasheets... Using Large Language Models. arXiv:2502.16540. https://doi.org/10.48550/arxiv.2502.16540

