science-find-datasets
// Discover and document candidate datasets for research or tool demos. Uses LLM knowledge + dataset repository search to find, rank, and document relevant public datasets.
| name | science-find-datasets |
| description | Discover and document candidate datasets for research or tool demos. Uses LLM knowledge + dataset repository search to find, rank, and document relevant public datasets. |
Converted from Claude command /science:find-datasets.
Before executing any research command:
Resolve project profile: Read science.yaml and identify the project's profile.
Use the canonical layout for that profile:
- research → doc/, specs/, tasks/, knowledge/, papers/, models/, data/, code/
- software → doc/, specs/, tasks/, knowledge/, plus native implementation roots such as src/ and tests/

Load role prompt: .ai/prompts/<role>.md if present, else references/role-prompts/<role>.md.
Load the science-research-methodology and science-scientific-writing Codex skills. If native skill loading is unavailable, use codex-skills/INDEX.md to map canonical Science skill names to generated skill files and source paths.
Read specs/research-question.md for project context when it exists.
Load project aspects: Read aspects from science.yaml (default: empty list).
For each declared aspect, resolve the aspect file in this order:
- aspects/<name>/<name>.md — canonical Science aspects
- .ai/aspects/<name>.md — project-local aspect override or addition

If neither path exists (the project declares an aspect that isn't shipped with
Science and has no project-local definition), do not block: log a single line
like aspect "<name>" declared in science.yaml but no definition found — proceeding without it, and continue. Suggest the user either (a) drop the
aspect from science.yaml, (b) author it under .ai/aspects/<name>.md, or
(c) align the name with one shipped under aspects/.
When executing command steps, incorporate the additional sections, guidance, and signal categories from loaded aspects. Aspect-contributed sections are whole sections inserted at the placement indicated in each aspect file.
Check for missing aspects: Scan for structural signals that suggest aspects the project could benefit from but hasn't declared:
| Signal | Suggests |
|---|---|
| Files in specs/hypotheses/ | hypothesis-testing |
| Files in models/ (.dot, .json DAG files) | causal-modeling |
| Workflow files, notebooks, or benchmark scripts in code/ | computational-analysis |
| Package manifests (pyproject.toml, package.json, Cargo.toml) at project root with project source code (not just tool dependencies) | software-development |
If a signal is detected and the corresponding aspect is not in the aspects list,
briefly note it to the user before proceeding:
"This project has [signal] but the
[aspect] aspect isn't enabled. This would add [brief description of what the aspect contributes]. Want me to add it to science.yaml?"
If the user agrees, add the aspect to science.yaml and load the aspect file
before continuing. If they decline, proceed without it.
Only check once per command invocation — do not re-prompt for the same aspect if the user has previously declined it in this session.
Resolve templates: When a command says "Read .ai/templates/<name>.md",
check the project's .ai/templates/ directory first. If not found, read from
templates/<name>.md. If neither exists, warn the
user and proceed without a template — the command's Writing section provides
sufficient structure.
Resolve science CLI invocation: When a command says to run science,
prefer the project-local install path: uv run science <command>.
This assumes the root pyproject.toml includes science as a dev
dependency installed via uv add --dev --editable "$SCIENCE_TOOL_PATH"
(the distribution is science; the entry point it installs is science).
If that fails (no root pyproject.toml or science not in dependencies),
fall back to:
uv run --with <science-plugin-root>/science science <command>
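For orientation, a minimal science.yaml covering just the fields this preamble reads might look like the sketch below; the exact schema, and any keys beyond profile, aspects, and data_sources, are assumptions here rather than something this document specifies.

```yaml
# Sketch of the science.yaml fields referenced by this preamble (assumed layout).
profile: research            # canonical profiles: research or software
aspects:                     # defaults to an empty list when omitted
  - hypothesis-testing
  - computational-analysis
data_sources: []             # this command appends new entries here after a search
```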
Find datasets for the user input.
If no argument is provided, derive candidate search terms from specs/research-question.md, active hypotheses, and inquiry variables, then ask the user to confirm the focus.
Follow the Science Codex Command Preamble before executing this skill. Use the research-assistant role prompt.
Additionally:
- skills/data/SKILL.md for data management conventions.
- skills/data/frictionless.md for Data Package guidance.
- .ai/templates/dataset.md first; if not found, read templates/dataset.md.
- specs/research-question.md
- specs/scope-boundaries.md
- specs/hypotheses/
- doc/datasets/ (to avoid duplicating known datasets)
- science inquiry list --format json
All science commands below use this pattern:
uv run science <command>
For brevity, the examples below write just science <command> — always expand to uv run science <command> when executing. See step 8 of references/command-preamble.md for the fallback.
Based on project context:
Summarize needs concisely before searching.
Using your knowledge of available datasets in the field:
Use science datasets search to find datasets across repositories:
# Broad search across all sources
science datasets search "<query>" --format json
# Targeted search on specific sources
science datasets search "<query>" --source zenodo,geo --format json
For each promising result, get full metadata:
science datasets metadata <source>:<id> --format json
And list available files:
science datasets files <source>:<id> --format json
Cross-reference LLM suggestions with search results. Note which candidates were verified and which remain unverified.
Rank by:
Label each as:
- Use now — download and integrate immediately
- Evaluate next — promising but needs closer inspection
- Track — potentially useful, defer

For each Use now or Evaluate next dataset, create a dataset note:
File: doc/datasets/data-<slug>.md using .ai/templates/dataset.md first, then templates/dataset.md
Fill in all available fields. For fields you cannot verify, mark as [UNVERIFIED].
Required frontmatter fields to populate (see template comments for enum values):
- tier — one of use-now, evaluate-next, track (mirror the ranking label from Step 4)
- access — one of public, controlled, mixed
- license — SPDX identifier or unknown
- formats — list of lower-case format slugs
- size_estimate — with unit (e.g., "12 GB", "~500 MB", "unknown")
- update_cadence — static, rolling, monthly, quarterly, annual, or versioned-releases
- ontology_terms — canonical CURIEs (UBERON:*, CL:*, MONDO:*, DOID:*, EFO:*, etc.), not free text

If a resource spans multiple accessions (primary + mirror, atlas + component studies, etc.), pick one pattern:
- datasets:
- related:

See the ## Multi-accession resources comment block in templates/dataset.md for examples.
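As a rough illustration of those fields in doc/datasets/data-<slug>.md, here is a hypothetical frontmatter sketch. All values are placeholders, and the exact nesting (for example, whether access is a single enum or a map carrying verified, last_reviewed, and related sub-fields mentioned later) should be taken from the template comments, not from this sketch.

```yaml
# Hypothetical dataset-note frontmatter; values are illustrative only.
tier: evaluate-next              # use-now | evaluate-next | track
access: public                   # public | controlled | mixed
license: CC-BY-4.0               # SPDX identifier, or "unknown"
formats: [csv, parquet]          # lower-case format slugs
size_estimate: "~500 MB"         # include a unit, or "unknown"
update_cadence: rolling          # static | rolling | monthly | quarterly | annual | versioned-releases
ontology_terms:
  - "UBERON:0002107"             # canonical CURIEs, never free-text labels
  - "MONDO:0005015"
accessions:
  - "geo:GSE00000"               # placeholder external accession ID
```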
If the project has an active inquiry, create a coverage matrix:
Include this mapping in a ## Variable Coverage section of the search output.
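The column layout is not fixed by this command; a hypothetical Variable Coverage table (the names below are placeholders) could be as simple as:

| Inquiry Variable | Candidate Dataset(s) | Coverage Notes |
|---|---|---|
| exposure measure | data-example-cohort | direct measurement |
| outcome measure | data-example-registry | proxy only; note as a gap |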
- Update the science.yaml data_sources section with new entries.
- Save the search output to doc/searches/YYYY-MM-DD-datasets-<slug>.json.
- Download Use now datasets with science datasets download <source>:<id> --dest data/raw/

Create follow-up tasks with science tasks add, covering:
- Use now datasets
- datapackage.json for downloaded data
- science-plan-pipeline to build the computational workflow
- science-discuss to evaluate dataset choices

When emitting doc/datasets/<slug>.md:
- parent_dataset / siblings for multi-accession resources.
- origin: "external".
- access.verified: false, access.last_reviewed: "", consumed_by: [].
- Set access.level, access.source_url, and access.credentials_required from discovery evidence. When uncertain, use the most restrictive known level — the verification step corrects it.
- The accessions: field carries external accession IDs (renamed from datasets:; legacy entries continue to read).
- origin: derived entities are produced by science dataset register-run after a workflow run, not by this command.

Present a concise summary table:
| Dataset | Source | Accession/DOI | Tier | Key Variables | Size |
|---|---|---|---|---|---|
Followed by any data gaps that need to be addressed.
Reflect on the template and workflow used above.
If you have feedback (friction, gaps, suggestions, or things that worked well), report each item via:
science feedback add \
--target "command:find-datasets" \
--category <friction|gap|guidance|suggestion|positive> \
--summary "<one-line summary>" \
--detail "<optional prose>"
Guidelines:
Use --target "template:<name>" instead when the feedback concerns the dataset template rather than the command.