dataset-profiler

Name: Dataset Profiler
Author: oxbshw

// Use when first encountering a new dataset — produces a structured profile (schema, missingness, distributions, outliers, gotchas) before any analysis.

stars:526

forks:83

updated:May 9, 2026 at 23:58

SKILL.md

name	dataset-profiler
description	Use when first encountering a new dataset — produces a structured profile (schema, missingness, distributions, outliers, gotchas) before any analysis.
version	0.1.0
status	experimental
risk	low
tags	["data","read-only","writes-files"]

Dataset Profiler

When to use

A new dataset arrives and you need to understand it before using it
Before reproducing an analysis that referenced a dataset
When data quality is suspect ("the chart looked wrong")

When NOT to use

Streaming / online data (the profile is a point-in-time snapshot)
Sensitive PII without an explicit allow-list

Inputs

Name	Type	Required	Notes
`path`	path	yes	CSV / Parquet / JSONL
`target`	string	no	column of interest (gets extra distribution detail)
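
As a rough illustration of how the `path` input might be dispatched on its extension, here is a minimal sketch. The function name `load_rows`, the returned `meta` keys, and the use of pandas for Parquet are assumptions made for this example, not part of the skill itself.

```python
import csv
import json
from pathlib import Path


def load_rows(path):
    """Return (rows, meta); rows is a list of dicts, reader chosen by extension."""
    p = Path(path)
    ext = p.suffix.lower()
    if ext == ".csv":
        with p.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))  # every value arrives as a string
    elif ext in (".jsonl", ".ndjson"):
        with p.open(encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]
    elif ext == ".parquet":
        # Assumption: pandas plus a Parquet engine (e.g. pyarrow) is installed.
        import pandas as pd
        rows = pd.read_parquet(p).to_dict(orient="records")
    else:
        raise ValueError(f"unsupported extension: {ext!r}")
    meta = {"row_count": len(rows), "file_size_bytes": p.stat().st_size}
    return rows, meta
```

Reading everything into a list keeps the sketch short; it is only appropriate for files that fit in memory.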

Outputs

profile.md with: Source, Schema, Missingness, Distributions, Outliers, Joins / keys, Gotchas, Open questions.
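
To make the expected shape of profile.md concrete, the sketch below emits the listed sections in order and marks any that have not been filled in. The heading levels, the `_TODO_` placeholder, and the name `render_profile` are assumptions; the real layout is defined by references/profile-template.md, which is not reproduced here.

```python
SECTIONS = [
    "Source", "Schema", "Missingness", "Distributions",
    "Outliers", "Joins / keys", "Gotchas", "Open questions",
]


def render_profile(content):
    """Render a dict of {section title: markdown body} into profile.md text."""
    parts = ["# Dataset profile"]
    for title in SECTIONS:
        # Sections without content stay visible so gaps are obvious.
        body = content.get(title, "").strip() or "_TODO_"
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"
```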

Workflow

Load with the right reader (extension-detected); record row count, file size
Schema: column → dtype → nullable → example value
Missingness: % per column, top columns by missingness
Distributions: numeric (min, p50, p95, max, std), categorical (top-k, cardinality)
Outliers: flag rows beyond p99 + 3·IQR for numerics
Identify potential keys (unique columns) and join candidates
Gotchas: timezone columns, mixed encodings, suspicious all-zero rows, magic values (-1, 9999-12-31)
Open questions: ambiguous columns / values that need owner input
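
The per-column steps above (missingness, distributions, outliers, candidate keys) can be sketched with the standard library alone. This is one possible reading of the workflow, not the skill's actual implementation: the missing-value tokens, the linear-interpolation percentile, the population standard deviation, and all function names are assumptions. The outlier rule follows the text literally, so only the upper tail is flagged.

```python
import statistics
from collections import Counter

# Assumption: tokens treated as missing; adjust per dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(v):
    if v is None or (isinstance(v, float) and v != v):  # None or float NaN
        return True
    return isinstance(v, str) and v.strip().lower() in MISSING_TOKENS


def to_float(v):
    try:
        return float(v)
    except (TypeError, ValueError):
        return None


def percentile(sorted_vals, q):
    """Linear-interpolated percentile, q in [0, 100]; input sorted, non-empty."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def profile_column(values, top_k=5):
    present = [v for v in values if not is_missing(v)]
    n = len(values)
    profile = {
        "missing_pct": 100 * (n - len(present)) / n if n else 0.0,
        "cardinality": len({str(v) for v in present}),
    }
    # A column is a candidate key when every row has a distinct, present value.
    profile["candidate_key"] = n > 0 and profile["cardinality"] == n
    nums = [to_float(v) for v in present]
    if present and all(x is not None for x in nums):
        s = sorted(nums)
        iqr = percentile(s, 75) - percentile(s, 25)
        cutoff = percentile(s, 99) + 3 * iqr  # rule as stated in the workflow
        profile.update(
            kind="numeric", min=s[0], p50=percentile(s, 50),
            p95=percentile(s, 95), max=s[-1], std=statistics.pstdev(s),
            outlier_cutoff=cutoff,
            outlier_rows=[i for i, v in enumerate(values)
                          if not is_missing(v) and to_float(v) > cutoff],
        )
    else:
        counts = Counter(str(v) for v in present)
        profile.update(kind="categorical", top_k=counts.most_common(top_k))
    return profile
```

`outlier_rows` holds row indices, which makes it straightforward to pull the example rows that the Outliers section is expected to show.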

References

references/profile-template.md

Success criteria

Every column appears in Schema + Missingness
Outliers section includes example rows
Gotchas section is non-empty (real datasets always have some)
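
These criteria lend themselves to a mechanical check before the profile is handed over. The sketch below assumes the profile's contents are available as plain Python collections; the function `check_profile` and its parameters are hypothetical names introduced for illustration.

```python
def check_profile(columns, schema_cols, missingness_cols, outlier_examples, gotchas):
    """Return a list of unmet success criteria (empty list means all pass)."""
    problems = []
    for col in columns:
        if col not in schema_cols:
            problems.append(f"column {col!r} missing from Schema")
        if col not in missingness_cols:
            problems.append(f"column {col!r} missing from Missingness")
    if not outlier_examples:
        problems.append("Outliers section has no example rows")
    if not gotchas:
        problems.append("Gotchas section is empty")
    return problems
```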

Failure modes

File too large to read in memory → switch to streaming + sampled stats; flag prominently
Encoding fails → try common alternatives; if all fail, surface and stop
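
Both fallbacks can be sketched briefly. For oversized files, reservoir sampling (Algorithm R) keeps a fixed-size uniform sample while streaming rows once. For encoding failures, a short list of candidate encodings is tried in order and the error is surfaced if none works. The candidate list, the function names, and the fixed seed are assumptions chosen for this example.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Uniform sample of up to k items from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so sampled stats are reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def decode_with_fallback(raw, encodings=("utf-8-sig", "cp1252")):
    """Try each encoding in order; return (text, encoding) or raise ValueError."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode with any of: {', '.join(encodings)}")
```

Note that adding `latin-1` to the list would make decoding always succeed, since it maps every byte to a character, so the "surface and stop" path would never trigger. Whichever encoding ends up being used is worth recording under Gotchas.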

related-skills.json

same repository

adr-writer.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use when capturing an architecture decision so it survives turnover — produces an ADR-NNNN.md from context, options considered, and the chosen path.

2026-05-09 · 526 stars

api-design-reviewer.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use when reviewing a proposed REST or GraphQL API change before merge — checks contract clarity, backwards compatibility, errors, pagination, auth, and naming.

2026-05-09 · 526 stars

incident-postmortem.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use after an incident is resolved — drafts a blameless postmortem from timeline notes, alerts, and chat threads.

2026-05-09 · 526 stars

pr-summarizer.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use when opening a PR — produces a clean PR description (what / why / how to verify / risks) from a branch diff against base.

2026-05-09 · 526 stars

sprint-planner.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use when planning the next sprint — turns ticket intake + team capacity into a planned sprint with explicit non-goals.

2026-05-09 · 526 stars

agent-memory-curator.md

from "oxbshw/LLM-Agents-Ecosystem-Handbook"

Use after a session to promote useful episodic notes from logs/episodic/ into distilled, dated entries in MEMORY.md and memory/semantic/.

2026-05-09 · 526 stars

package.json

"author": "oxbshw"

"repository": "oxbshw/LLM-Agents-Ecosystem-Handbook"

$ useful --forSOC

Data Scientists · Computer and Mathematical Occupations · 15-2051 · L4
