science-find-datasets
// Discover and document candidate datasets for research or tool demos. Uses LLM knowledge + dataset repository search to find, rank, and document relevant public datasets.
| name | science-find-datasets |
| description | Discover and document candidate datasets for research or tool demos. Uses LLM knowledge + dataset repository search to find, rank, and document relevant public datasets. |
Converted from Claude command /science:find-datasets.
Before executing any research command:
Resolve project profile: Read science.yaml and identify the project's profile.
Use the canonical layout for that profile:
- research → doc/, specs/, tasks/, knowledge/, papers/, models/, data/, code/
- software → doc/, specs/, tasks/, knowledge/, plus native implementation roots such as src/ and tests/

Load role prompt: .ai/prompts/<role>.md if present, else references/role-prompts/<role>.md.
Load the science-research-methodology and science-scientific-writing Codex skills. If native skill loading is unavailable, use codex-skills/INDEX.md to map canonical Science skill names to generated skill files and source paths.
Read specs/research-question.md for project context when it exists.
Load project aspects: Read aspects from science.yaml (default: empty list).
For each declared aspect, resolve the aspect file in this order:
- aspects/<name>/<name>.md — canonical Science aspects
- .ai/aspects/<name>.md — project-local aspect override or addition

If neither path exists (the project declares an aspect that isn't shipped with
Science and has no project-local definition), do not block: log a single line
like aspect "<name>" declared in science.yaml but no definition found — proceeding without it, and continue. Suggest the user either (a) drop the
aspect from science.yaml, (b) author it under .ai/aspects/<name>.md, or
(c) align the name with one shipped under aspects/.
When executing command steps, incorporate the additional sections, guidance, and signal categories from loaded aspects. Aspect-contributed sections are whole sections inserted at the placement indicated in each aspect file.
Check for missing aspects: Scan for structural signals that suggest aspects the project could benefit from but hasn't declared:
| Signal | Suggests |
|---|---|
| Files in specs/hypotheses/ | hypothesis-testing |
| Files in models/ (.dot, .json DAG files) | causal-modeling |
| Workflow files, notebooks, or benchmark scripts in code/ | computational-analysis |
| Package manifests (pyproject.toml, package.json, Cargo.toml) at project root with project source code (not just tool dependencies) | software-development |
If a signal is detected and the corresponding aspect is not in the aspects list,
briefly note it to the user before proceeding:
"This project has [signal] but the
[aspect] aspect isn't enabled. This would add [brief description of what the aspect contributes]. Want me to add it to science.yaml?"
If the user agrees, add the aspect to science.yaml and load the aspect file
before continuing. If they decline, proceed without it.
Only check once per command invocation — do not re-prompt for the same aspect if the user has previously declined it in this session.
Resolve templates: When a command says "Read .ai/templates/<name>.md",
check the project's .ai/templates/ directory first. If not found, read from
templates/<name>.md. If neither exists, warn the
user and proceed without a template — the command's Writing section provides
sufficient structure.
Resolve science CLI invocation: When a command says to run science,
prefer the project-local install path: uv run science <command>.
This assumes the root pyproject.toml includes science as a dev
dependency installed via uv add --dev --editable "$SCIENCE_TOOL_PATH"
(the distribution is science; the entry point it installs is science).
If that fails (no root pyproject.toml or science not in dependencies),
fall back to:
uv run --with <science-plugin-root>/science science <command>
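For orientation, a minimal science.yaml covering just the fields this preamble reads might look like the sketch below; the exact schema, and any keys beyond profile, aspects, and data_sources, are assumptions here rather than something this document specifies.

```yaml
# Sketch of the science.yaml fields referenced by this preamble (assumed layout).
profile: research            # canonical profiles: research or software
aspects:                     # defaults to an empty list when omitted
  - hypothesis-testing
  - computational-analysis
data_sources: []             # this command appends new entries here after a search
```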
Find datasets for the user input.
If no argument is provided, derive candidate search terms from specs/research-question.md, active hypotheses, and inquiry variables, then ask the user to confirm the focus.
Follow the Science Codex Command Preamble before executing this skill. Use the research-assistant role prompt.
Additionally:
- skills/data/SKILL.md for data management conventions.
- skills/data/frictionless.md for Data Package guidance.
- .ai/templates/dataset.md first; if not found, read templates/dataset.md.
- specs/research-question.md
- specs/scope-boundaries.md
- specs/hypotheses/
- doc/datasets/ (to avoid duplicating known datasets)
- science inquiry list --format json
All science commands below use this pattern:
uv run science <command>
For brevity, the examples below write just science <command> — always expand to uv run science <command> when executing. See step 8 of references/command-preamble.md for the fallback.
Based on project context:
Summarize needs concisely before searching.
Using your knowledge of available datasets in the field:
Use science datasets search to find datasets across repositories:
# Broad search across all sources
science datasets search "<query>" --format json
# Targeted search on specific sources
science datasets search "<query>" --source zenodo,geo --format json
For each promising result, get full metadata:
science datasets metadata <source>:<id> --format json
And list available files:
science datasets files <source>:<id> --format json
Cross-reference LLM suggestions with search results. Note which candidates were verified and which remain unverified.
Rank by:
Label each as:
- Use now — download and integrate immediately
- Evaluate next — promising but needs closer inspection
- Track — potentially useful, defer

For each Use now or Evaluate next dataset, create a dataset note:
File: doc/datasets/data-<slug>.md using .ai/templates/dataset.md first, then templates/dataset.md
Fill in all available fields. For fields you cannot verify, mark as [UNVERIFIED].
Required frontmatter fields to populate (see template comments for enum values):
- tier — one of use-now, evaluate-next, track (mirror the ranking label from Step 4)
- access — one of public, controlled, mixed
- license — SPDX identifier or unknown
- formats — list of lower-case format slugs
- size_estimate — with unit (e.g., "12 GB", "~500 MB", "unknown")
- update_cadence — static, rolling, monthly, quarterly, annual, or versioned-releases
- ontology_terms — canonical CURIEs (UBERON:*, CL:*, MONDO:*, DOID:*, EFO:*, etc.), not free text

If a resource spans multiple accessions (primary + mirror, atlas + component studies, etc.), pick one pattern:
- datasets:
- related:

See the ## Multi-accession resources comment block in templates/dataset.md for examples.
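As a rough illustration of those fields in doc/datasets/data-<slug>.md, here is a hypothetical frontmatter sketch. All values are placeholders, and the exact nesting (for example, whether access is a single enum or a map carrying verified, last_reviewed, and related sub-fields mentioned later) should be taken from the template comments, not from this sketch.

```yaml
# Hypothetical dataset-note frontmatter; values are illustrative only.
tier: evaluate-next              # use-now | evaluate-next | track
access: public                   # public | controlled | mixed
license: CC-BY-4.0               # SPDX identifier, or "unknown"
formats: [csv, parquet]          # lower-case format slugs
size_estimate: "~500 MB"         # include a unit, or "unknown"
update_cadence: rolling          # static | rolling | monthly | quarterly | annual | versioned-releases
ontology_terms:
  - "UBERON:0002107"             # canonical CURIEs, never free-text labels
  - "MONDO:0005015"
accessions:
  - "geo:GSE00000"               # placeholder external accession ID
```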
If the project has an active inquiry, create a coverage matrix:
Include this mapping in a ## Variable Coverage section of the search output.
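The column layout is not fixed by this command; a hypothetical Variable Coverage table (the names below are placeholders) could be as simple as:

| Inquiry Variable | Candidate Dataset(s) | Coverage Notes |
|---|---|---|
| exposure measure | data-example-cohort | direct measurement |
| outcome measure | data-example-registry | proxy only; note as a gap |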
- Update the science.yaml data_sources section with new entries.
- Save the search output to doc/searches/YYYY-MM-DD-datasets-<slug>.json.
- Download Use now datasets with science datasets download <source>:<id> --dest data/raw/

Create follow-up tasks with science tasks add, covering:
- Use now datasets
- datapackage.json for downloaded data
- science-plan-pipeline to build the computational workflow
- science-discuss to evaluate dataset choices

When emitting doc/datasets/<slug>.md:
- parent_dataset / siblings for multi-accession resources.
- origin: "external".
- access.verified: false, access.last_reviewed: "", consumed_by: [].
- Set access.level, access.source_url, and access.credentials_required from discovery evidence. When uncertain, use the most restrictive known level — the verification step corrects it.
- The accessions: field carries external accession IDs (renamed from datasets:; legacy entries continue to read).
- origin: derived entities are produced by science dataset register-run after a workflow run, not by this command.

Present a concise summary table:
| Dataset | Source | Accession/DOI | Tier | Key Variables | Size |
|---|---|---|---|---|---|
Followed by any data gaps that need to be addressed.
Reflect on the template and workflow used above.
If you have feedback (friction, gaps, suggestions, or things that worked well), report each item via:
science feedback add \
--target "command:find-datasets" \
--category <friction|gap|guidance|suggestion|positive> \
--summary "<one-line summary>" \
--detail "<optional prose>"
Guidelines:
Use --target "template:<name>" instead when the feedback concerns the dataset template rather than the command.