dataset-curation

Name: Dataset Curation
Author: fcakyon

// Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

$ git log --oneline --stat

stars: 250

forks: 24

updated: March 12, 2026, 13:54

SKILL.md

name	dataset-curation
description	Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

Dataset Curation Methodology

You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.

Step 1: Distribution Analysis

Before any curation action, understand the current state:

Per-Class Distribution

Count instances per class/label/tag
Compute imbalance ratio (max_count / min_count)
Identify severely underrepresented classes (count below 5% of the largest class's count)
Visualize: bar chart of class frequencies sorted by count
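
The per-class checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not part of the original skill: the function name `class_distribution`, the `rare_fraction` parameter, and the toy labels are all hypothetical, and the text bar chart stands in for a real plot.

```python
from collections import Counter

def class_distribution(labels, rare_fraction=0.05):
    """Summarize per-class counts for a flat list of labels (hypothetical helper)."""
    counts = Counter(labels)
    max_count = max(counts.values())
    min_count = min(counts.values())
    # Classes whose count is below rare_fraction of the largest class.
    rare = sorted(c for c, n in counts.items() if n < rare_fraction * max_count)
    return {
        "counts": dict(counts.most_common()),  # sorted by count, descending
        "imbalance_ratio": max_count / min_count,
        "rare_classes": rare,
    }

# Toy data, invented for illustration only.
labels = ["cat"] * 100 + ["dog"] * 40 + ["bird"] * 4
report = class_distribution(labels)

# Crude text bar chart of class frequencies, largest first.
for name, n in report["counts"].items():
    print(f"{name:<6}{n:>5} {'#' * (n // 4)}")
```

On this toy input the imbalance ratio is 100 / 4 = 25.0, and only `bird` falls below the 5% threshold.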

Co-occurrence Analysis

Build co-occurrence matrix: which labels appear together
Identify spurious correlations (e.g., "violence" always co-occurs with "male")
Check for label leakage between splits
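
One way to build the co-occurrence counts for multi-label samples is sketched below, assuming each sample is simply a set of label strings. The helper names `cooccurrence` and `conditional_rate` are hypothetical, and the toy samples are invented. A conditional rate near 1.0 only flags a candidate spurious correlation for manual review.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(samples):
    """Count how often each unordered pair of labels shares a sample."""
    pair_counts = Counter()
    label_counts = Counter()
    for labels in samples:
        unique = sorted(set(labels))
        label_counts.update(unique)
        # Each unordered pair is stored once, as a sorted tuple.
        pair_counts.update(combinations(unique, 2))
    return pair_counts, label_counts

def conditional_rate(pair_counts, label_counts, a, b):
    """Fraction of samples carrying label `a` that also carry label `b`.

    Assumes `a` occurs at least once in the data.
    """
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / label_counts[a]

# Toy samples, invented for illustration only.
samples = [
    {"violence", "male"},
    {"violence", "male"},
    {"male", "dialogue"},
    {"female", "dialogue"},
]
pairs, singles = cooccurrence(samples)
rate = conditional_rate(pairs, singles, "violence", "male")
print(f"P(male | violence) = {rate:.2f}")
```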

Metadata Distribution

Source diversity: how many sources/movies/documents contribute
Temporal distribution: are all time periods represented?
Content diversity: genre, style, domain coverage

Step 2: Bias Assessment

For each identified imbalance or correlation:

Does it reflect the real world? Some imbalances reflect genuine phenomena
Is it harmful? Would a model trained on this data make unfair predictions?
Is it fixable? Can we collect more data, resample, or reweight?

Fairness Dimensions

Check for bias along relevant protected attributes:

Gender representation (if applicable)
Racial/ethnic representation (if applicable)
Age distribution (if applicable)
Geographic/cultural diversity (if applicable)

Bias Metrics

Demographic parity: equal positive rates across groups
Equalized odds: equal TPR and FPR across groups
Representation ratio: a group's share of the dataset divided by its share of the reference population
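
The three metrics can be computed as follows for binary labels. This is a rough sketch under stated assumptions: labels and predictions are 0/1, every group has at least one row, and "equal" is measured as the largest gap between any two groups. The function names and toy arrays are hypothetical. For a dataset with no model yet, the positive rate can be taken over the labels instead of predictions.

```python
def rates_by_group(y_true, y_pred, groups):
    """Per-group positive rate, TPR, and FPR for binary (0/1) values."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        pos = [p for t, p in rows if t == 1]  # predictions on true positives
        neg = [p for t, p in rows if t == 0]  # predictions on true negatives
        stats[g] = {
            "positive_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(pos) / len(pos) if pos else float("nan"),
            "fpr": sum(neg) / len(neg) if neg else float("nan"),
        }
    return stats

def max_gap(stats, key):
    """Largest difference in `key` between any two groups (0 means parity)."""
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)

def representation_ratio(groups, population_share):
    """Share of each group in the data divided by its population share."""
    n = len(groups)
    return {g: (groups.count(g) / n) / share for g, share in population_share.items()}

# Toy data, invented for illustration only.
groups = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

stats = rates_by_group(y_true, y_pred, groups)
parity_gap = max_gap(stats, "positive_rate")          # demographic parity
odds_gaps = (max_gap(stats, "tpr"), max_gap(stats, "fpr"))  # equalized odds
ratios = representation_ratio(groups, {"a": 0.5, "b": 0.5})
```

A representation ratio of 1.0 means the group appears in the data at its population rate. Values well below 1.0 indicate underrepresentation.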

Step 3: Stratified Sampling

When creating splits (train/val/test):

Primary stratification: by label/class distribution
Secondary stratification: by source (prevent source leakage across splits)
Validation:
- Chi-squared test for label distribution similarity across splits
- No source overlap between splits
- Rare classes have minimum representation in each split
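
The source-level split and the three validation checks might look like the sketch below. It assumes records are `(source, label)` pairs, assigns whole sources greedily, and does not itself stratify by label; it only verifies the label distribution afterwards. All function names and the toy records are hypothetical. The chi-squared function returns the statistic and degrees of freedom only; a p-value could be obtained from a statistics library such as SciPy's `scipy.stats.chi2_contingency`.

```python
import random
from collections import Counter, defaultdict

NAMES = ("train", "val", "test")

def split_by_source(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole sources to splits so no source spans two splits."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec[0]].append(rec)
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    targets = [f * len(records) for f in fractions]
    splits = {name: [] for name in NAMES}
    for src in sources:
        # Greedy: give the source to the split furthest below its target size.
        i = max(range(len(NAMES)), key=lambda k: targets[k] - len(splits[NAMES[k]]))
        splits[NAMES[i]].extend(by_source[src])
    return splits

def label_tables(splits):
    return {name: Counter(label for _, label in recs) for name, recs in splits.items()}

def chi_squared_statistic(splits):
    """Pearson chi-squared statistic and degrees of freedom (split x label)."""
    tables = label_tables(splits)
    labels = sorted(set().union(*tables.values()))
    total = sum(sum(t.values()) for t in tables.values())
    stat = 0.0
    for t in tables.values():
        row_total = sum(t.values())
        for label in labels:
            col_total = sum(tt[label] for tt in tables.values())
            expected = row_total * col_total / total
            if expected > 0:
                stat += (t[label] - expected) ** 2 / expected
    return stat, (len(tables) - 1) * (len(labels) - 1)

def source_overlap(splits):
    """Sources shared by each pair of splits (every set should be empty)."""
    seen = {name: {src for src, _ in recs} for name, recs in splits.items()}
    names = list(seen)
    return {(a, b): seen[a] & seen[b] for i, a in enumerate(names) for b in names[i + 1:]}

def missing_rare_classes(splits, min_per_split=1):
    """Labels with fewer than `min_per_split` examples, per split."""
    tables = label_tables(splits)
    labels = set().union(*tables.values())
    return {name: sorted(l for l in labels if t[l] < min_per_split) for name, t in tables.items()}

# Toy records: 20 sources with 10 records each, invented for illustration.
records = [(f"src{i % 20}", "pos" if i % 3 == 0 else "neg") for i in range(200)]
splits = split_by_source(records)
stat, dof = chi_squared_statistic(splits)
overlap = source_overlap(splits)
missing = missing_rare_classes(splits)
```

A small statistic relative to the degrees of freedom suggests the splits share a similar label distribution. When it is large, reshuffling with a different seed or adding label-aware assignment is a reasonable next step.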

Split ratios depend on dataset size:

Large (>50k): 80/10/10 or 90/5/5
Medium (5k-50k): 70/15/15 or 80/10/10
Small (<5k): k-fold cross-validation preferred
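
The size thresholds can be encoded directly, and for the small-dataset case a stratified k-fold can be written without external libraries. This is an illustrative sketch: `recommend_split` and `stratified_kfold` are hypothetical names, and treating exactly 5k and exactly 50k as "medium" is an assumption, since the text leaves those boundaries open.

```python
import random
from collections import defaultdict

def recommend_split(n_samples):
    """Map dataset size to the split options listed above."""
    if n_samples > 50_000:
        return ["80/10/10", "90/5/5"]
    if n_samples >= 5_000:  # boundary handling is an assumption
        return ["70/15/15", "80/10/10"]
    return ["k-fold cross-validation"]

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread over all folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Deal indices round-robin; the running offset keeps fold sizes even.
        for idx in idxs:
            folds[offset % k].append(idx)
            offset += 1
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

# Toy labels, invented for illustration only.
labels = ["a"] * 10 + ["b"] * 5
cv = list(stratified_kfold(labels, k=5))
for train_idx, test_idx in cv:
    print(len(train_idx), len(test_idx))
```

Note that this k-fold stratifies by label only. If the data also has sources, folds would need to be built from whole sources to avoid leakage.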

Step 4: Quality Assessment

For labeled datasets, assess annotation quality:

Inter-annotator agreement: Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha
Label noise estimation: sample and manually verify N labels
Edge cases: identify ambiguous examples that annotators might disagree on
Consistency checks: automated rules for label validity
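
For two annotators, Cohen's kappa is the observed agreement corrected for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration with invented annotations; the function name is hypothetical. Fleiss' kappa and Krippendorff's alpha, which handle more than two annotators, are not covered here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations, invented for illustration: 8 of 10 items agree.
annotator_a = ["pos"] * 5 + ["neg"] * 5
annotator_b = ["pos"] * 4 + ["neg"] + ["neg"] * 4 + ["pos"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Here the observed agreement is 0.8 and the chance agreement is 0.5, so kappa comes out at about 0.6.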

Step 5: Expansion Recommendations

If the dataset needs more data:

Priority classes: which classes benefit most from more data
Source suggestions: where to find more data for underrepresented classes
Collection strategy: active learning, targeted scraping, synthetic augmentation
Cost estimation: time and resources for each approach

Step 6: Ethical Review Checklist

Before using or publishing any dataset:

Content sensitivity: does the data contain sensitive material?
Consent: was data collected with appropriate consent?
Privacy: are individuals identifiable? Is anonymization needed?
Licensing: are data sources used within their license terms?
Potential harms: could the dataset be misused?
Documentation: is the dataset documented with a datasheet/data card?

Output Format

Produce:

Distribution report: per-class counts, imbalance ratios, co-occurrence matrix
Bias findings: identified biases with severity and actionability
Split recommendation: stratification strategy with validation results
Expansion plan: prioritized suggestions for addressing gaps
Ethics checklist: completed checklist with notes per item

related-skills.json

same repository

compare.md

from "fcakyon/phd-skills"

Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

2026-04-30 (250 stars)

debug.md

from "fcakyon/phd-skills"

Evidence-before-action diagnosis of failing ML experiments. Probes the system before guessing causes, process list, dmesg, GPU stats, log scrollback, checkpoint state, then states a hypothesis as a hypothesis and runs a smoke before claiming a root cause. Use when the user asks why a run is failing, diverging, OOMing, hanging, slow, producing weird metrics, has crashed, or asks to debug, diagnose, troubleshoot, or investigate a training issue.

2026-04-30 (250 stars)

launch.md

from "fcakyon/phd-skills"

Pre-flight checklist for long-running ML training jobs covering config diff, run naming, path verification, monitoring setup, and restart-cleanup. Use when the user asks to launch, kick off, start, restart, or kill a training run, or mentions launching a multi-hour or multi-day GPU job (python train, accelerate launch, torchrun, deepspeed, sbatch, tmux training).

2026-04-30 (250 stars)

reproduce.md

from "fcakyon/phd-skills"

End-to-end paper reproduction from arxiv URL through smoke runs to replication experiments. Handles missing or partial official code, missing training scripts, missing hyperparameters, and private datasets via similar-public-dataset substitution. Use when the user asks to reproduce, implement, replicate, or re-run a paper from scratch, or pastes an arxiv URL with reproduction intent.

2026-04-30 (250 stars)

experiment-design.md

from "fcakyon/phd-skills"

Use when the user wants to design experiments, plan ablation studies, structure baselines, or create incremental evaluation strategies. Triggers on phrases like "design ablation", "plan experiment", "what experiments should I run", "baseline comparison", or "experiment matrix".

2026-03-12 (250 stars)

latex-setup.md

from "fcakyon/phd-skills"

Use when the user wants to set up or troubleshoot a LaTeX environment, choose between biber and bibtex, install packages for a specific venue template, or configure compilation. Triggers on phrases like "setup latex", "biber vs bibtex", "latex compilation error", "install latex packages", "venue template", or "texlive setup".

2026-03-12 (250 stars)

package.json

"author": "fcakyon"

"repository": "fcakyon/phd-skills"

$ useful --forSOC

Data Scientists | Computer and Mathematical Occupations | 15-2051 | L4
