Name: Datachain Knowledge
Author: datachain-ai


stars: 2,750
forks: 143
updated: May 27, 2026 at 09:59


related-skills.json

same repository

datachain-core.md

from "datachain-ai/datachain"

Use ONLY for abstract DataChain SDK questions — API usage, method signatures, or code patterns — when no specific dataset or bucket is referenced. If the request mentions creating, saving, listing, exploring datasets or buckets, use datachain-knowledge instead.

2026-05-27 · 2.8k

datachain-jobs.md

from "datachain-ai/datachain"

Use when asked about Studio job analytics — compute hours, user spend, failure rates, cost estimation, cluster usage. Generates and maintains dc-knowledge/jobs/index.md.

2026-04-15 · 2.8k

package.json

"author": "datachain-ai"

"repository": "datachain-ai/datachain"


name	datachain-knowledge
description	Use whenever datasets, cloud storage buckets, or data pipelines are mentioned — creating, saving, querying, listing, exploring, deleting, or processing data in S3, GCS, Azure Blob, or local storage. Also use when running any script that may create datasets as a side effect. Maintains a knowledge base at dc-knowledge/ (JSON + markdown). ALWAYS use this skill when the user creates a dataset, saves pipeline output, runs a data script, or references any storage bucket.
triggers	["what datasets exist","show me the schema","list datasets","datachain knowledge","update the knowledge base","refresh dataset docs","what's in this bucket","explore bucket","scan bucket","bucket overview","what files are in s3://","what files are in gs://","create dataset","save dataset","delete dataset","new dataset","build dataset","make dataset","generate dataset","save the results","save to dataset","s3://","gs://","az://","read_storage","from bucket","from s3","from gcs","process files","extract metadata","filter dataset","query dataset","run script","run pipeline","python scan","run scan"]

Maintain a knowledge base at dc-knowledge/. The .md files are the persistent output; the .json files are intermediate (generated in Step 3, consumed in Step 4, then deleted).

CAST.md (sibling to this file) is the canonical methodology — the four layers, naming + tagging, layer-ladder planning, calibration, dialogue template, reuse rules, methodology transmission. Mode B reads it in full as a precondition. When something methodology-related needs to change, change CAST.md, not this file.

Critical Rules

CAST.md §6 owns the CAST-doctrine rules (follow CAST, never bypass DataChain, C/A/S substrate mandatory, one script per stage, one .save() per script). The rules below are operational additions unique to this skill.

Path is dc-knowledge/ — NOT .datachain/. The .datachain/ directory is the internal database; the knowledge base lives at dc-knowledge/.
Never pass update=True to dc.read_storage() in Task or exploration code unless the user explicitly asks to refresh the listing. L1/L2/L3 build scripts are the exception (CAST.md §5).
Prefer DataChain operations over plain Python for all metadata analysis.
Bounded output — JSON and markdown files stay small regardless of data size.
Stop on auth/connection errors — bucket_scan.py runs a fast access check. If it exits with an error JSON on stderr, stop immediately and show the error to the user. Do not retry with different regions, profiles, or endpoints — ask for the missing credentials.
Follow the enrichment prompt template literally in Step 4. Downstream tooling (render_index.py, cast_layer resolution) parses the exact frontmatter the prompt prescribes.
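
The stop-on-error rule can be sketched as a thin wrapper around the scan command. This is a minimal sketch, not the skill's own code: the `ScanAccessError` class, the `run_scan` helper, and the `"error"` key in the payload are hypothetical, and a stand-in subprocess replaces `bucket_scan.py` so the example runs without cloud access. The only behavior taken from the rule is: a non-zero exit with JSON on stderr means stop, show the error, and do not retry.

```python
import json
import subprocess
import sys


class ScanAccessError(RuntimeError):
    """Hypothetical error type: the scan reported an error JSON on stderr."""

    def __init__(self, payload):
        super().__init__(json.dumps(payload))
        self.payload = payload


def run_scan(cmd):
    """Run the command exactly once. No retries with other regions or profiles."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        try:
            payload = json.loads(proc.stderr)
        except json.JSONDecodeError:
            return proc  # non-JSON failure (e.g. a timeout) is handled elsewhere
        raise ScanAccessError(payload)
    return proc


# Stand-in for bucket_scan.py; the payload shape is an assumption.
fake_scan = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('{\"error\": \"access denied\"}'); sys.exit(1)",
]

shown = None
try:
    run_scan(fake_scan)
except ScanAccessError as exc:
    shown = exc.payload  # show this to the user and ask for credentials
    print("Scan failed, stopping:", shown)
```

Raising instead of returning a status makes it hard for calling code to fall through into a retry loop by accident.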

Common gotchas in UDF scripts

parallel=N vs workers=N. parallel=N is local multiprocessing (works anywhere). workers=N is Studio-only and MUST be guarded: chain = chain.settings(parallel=N); if dc.is_studio(): chain = chain.settings(workers=N).
No from __future__ import annotations in UDF modules. It stringifies type hints and DataChain's signal-schema resolution rejects the string-vs-class mismatch.
Type the UDF return precisely. Iterator[object] / Iterator[Any] / bare dict fail schema resolution. Return a specific Iterator[T], a Pydantic BaseModel, or a primitive.
Generators aren't subscriptable. Iterators returned by file APIs do not support [:N]. Use enumerate + break, or list(...) only when the result is genuinely small.
datachain.__version__ does not exist. Use from importlib.metadata import version; version("datachain").
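
Three of the gotchas above can be reproduced with the standard library alone. The sketch below is illustrative only: `installed_version`, `first_n`, and `paths` are hypothetical helpers, and `paths` merely stands in for an iterator returned by a file API. It shows a version lookup through `importlib.metadata`, bounded consumption of a generator, and how `from __future__ import annotations` turns type hints into strings.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import islice
from typing import Iterator, List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def first_n(items: Iterator[str], n: int) -> List[str]:
    """Take at most n items using enumerate + break; the rest is never read."""
    taken = []
    for i, item in enumerate(items):
        if i >= n:
            break
        taken.append(item)
    return taken


def paths() -> Iterator[str]:
    """Hypothetical stand-in for an iterator returned by a file API."""
    for i in range(1000):
        yield f"file_{i}.parquet"


dc_version = installed_version("datachain")  # None when datachain is not installed
head = first_n(paths(), 3)
head_islice = list(islice(paths(), 3))  # itertools.islice gives the same result

# With the __future__ import, annotations are stored as strings, not classes.
ns = {}
exec("from __future__ import annotations\ndef udf(x: int) -> int:\n    return x", ns)
stringified = ns["udf"].__annotations__

print(dc_version, head, stringified)
```

The last lines show why the import is a problem for schema resolution: the hint arrives as the string `'int'` rather than the class `int`.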

Workflow Mode Detection

Mode A — Discovery/Exploration (e.g., "what datasets exist", "show schema", "explore bucket"): → If the user references a specific bucket URI, run Step 1 (Bucket Enlistment) for its root first. → Then run Steps 2–7.

Mode B — Dataset Creation/Pipeline (e.g., "create dataset X from ...", "process files and save"):

Precondition (do this FIRST — before ANY tool call):
$ cat dc-knowledge/index.md
$ cat {skill_dir}/CAST.md
If index.md exists and the task can be solved by reading an existing dataset, do not write a pipeline — read it directly with dc.read_dataset("name") and filter/merge/extend from there. This avoids recomputing expensive operations.

CAST.md drives every layer / scope / shape decision. Re-read on each new task so the layer-ladder walk and dialogue template are in working context when you plan.

Never parse files under dc-knowledge/datasets/*.json or dc-knowledge/buckets/**/*.json directly — those are pre-render intermediates that get deleted. The information you need is in index.md.

If dc-knowledge/index.md does not exist, proceed with Steps 1–7 to build it.

→ If the pipeline reads from a bucket, run Step 1 (Bucket Enlistment) for the bucket root first.
→ Run the access check (if not already done in Step 1): datachain bucket status <uri>. If not found / denied, stop and ask for credentials.
→ Read {skill_dir}/../core/SKILL.md for DataChain SDK rules.
→ Follow CAST.md §4 (planning) and §4.10 (dialogue) before writing pipeline code.
→ While the pipeline is running, enrich any Step 1 bucket JSON that does not yet have a .md (parallel work).
→ After the pipeline completes, run Steps 2–7 to update the knowledge base.
→ Report both: pipeline result AND knowledge base update status.

Mode C — Script Execution (e.g., user runs an existing .py file that touches data): → If the script references bucket URIs, run Step 1 for each bucket root first. → Scripts can create datasets as side effects. → While the script is running, enrich Step 1 bucket JSON in parallel. → After ANY data-related script finishes, run Steps 2–7 to detect and record new/changed datasets.

Mode D — Knowledge Base Maintenance (e.g., "update the knowledge base", "refresh dataset docs"): → Run Steps 2–7. Existing session context in .md files is preserved automatically during re-enrichment.
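
As a rough summary of the four modes, the step sequence each one implies can be written as a small lookup. This is a simplified sketch: `plan_steps` and its step labels are hypothetical, and it leaves out the Mode B precondition (reading index.md and CAST.md, and reusing an existing dataset instead of writing a pipeline).

```python
from typing import List

KB_STEPS = [f"step {n}" for n in range(2, 8)]  # Steps 2-7 update the knowledge base


def plan_steps(mode: str, has_bucket_uri: bool = False) -> List[str]:
    """Return the ordered steps for a mode; the labels are illustrative only."""
    mode = mode.upper()
    if mode not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown mode: {mode!r}")
    steps = []
    # Modes A, B and C enlist the bucket root first when a URI is involved.
    if mode != "D" and has_bucket_uri:
        steps.append("step 1")
    if mode == "B":
        steps.append("run pipeline")
    elif mode == "C":
        steps.append("run script")
    return steps + KB_STEPS


print(plan_steps("B", has_bucket_uri=True))
print(plan_steps("D"))
```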

Step 1 — Bucket Enlistment

When any storage URI is encountered, enlist the whole bucket first.

Extract bucket root. From any URI, derive {scheme}://{bucket}/.
Check if already enlisted. Look for dc-knowledge/buckets/{scheme}/{bucket_slug}.md or .json. If either exists, skip.
Access check. Run datachain bucket status {root_uri}. If denied / not found, stop and ask.

Scan with timeout. Default 60s; user can override:

python3 {skill_dir}/scripts/bucket_scan.py {root_uri} \
  --output dc-knowledge/buckets/{scheme}/{bucket_slug}.json --timeout 60

Handle timeout (exit code 124). Run the hierarchical fallback:

python3 {skill_dir}/scripts/bucket_overview.py {root_uri} \
  --bucket-json dc-knowledge/buckets/{scheme}/{bucket_slug}.json

Report. "Enlisted bucket {bucket} — {N} files, total size {size}, primarily {top 2-3 extensions}." Do not enrich here; Step 4 batches it.

Step 1 runs once per bucket root per session.
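
The root extraction, the already-enlisted check, and the exit-code handling above can be sketched with the standard library. This is a sketch under stated assumptions: the helper names are hypothetical, and the slug is assumed to equal the bucket name, which the skill's scripts may compute differently.

```python
from pathlib import Path
from urllib.parse import urlparse


def bucket_root(uri: str) -> str:
    """Derive {scheme}://{bucket}/ from any storage URI."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"no bucket found in URI: {uri!r}")
    return f"{parsed.scheme}://{parsed.netloc}/"


def enlistment_base(uri: str, kb_dir: str = "dc-knowledge") -> Path:
    """Path stem for the bucket's .md/.json files (slug == bucket name is ASSUMED)."""
    parsed = urlparse(bucket_root(uri))
    return Path(kb_dir) / "buckets" / parsed.scheme / parsed.netloc


def is_enlisted(uri: str, kb_dir: str = "dc-knowledge") -> bool:
    """True if either the .md or the .json for this bucket already exists."""
    base = enlistment_base(uri, kb_dir)
    # Append the extension to the name so dotted bucket names stay intact.
    return any(base.with_name(base.name + ext).exists() for ext in (".md", ".json"))


def after_scan(returncode: int) -> str:
    """Map the scan's exit code to the next action described in Step 1."""
    if returncode == 0:
        return "report"
    if returncode == 124:
        return "fallback"  # run bucket_overview.py
    return "stop"  # treat any other failure as an error to show the user


root = bucket_root("s3://my-bucket/images/2024/cat.jpg")
print(root, enlistment_base(root), after_scan(124))
```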

Step 2 — Sync

python3 {skill_dir}/scripts/plan.py [--studio] --output dc-knowledge/.plan.json

Buckets are auto-discovered from catalog listings. Do not add --studio unless requested. If "up_to_date": true, print "Knowledge base is up to date." and stop. Entries with status of "new" or "stale" need processing in Step 3.
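
Selecting the entries that need Step 3 might look like the sketch below. The plan layout is an assumption: only the `up_to_date` flag and the `"new"` / `"stale"` / `"ok"` status values come from the text, while the `datasets` and `buckets` lists and their `name`, `uri`, and `file_path` keys are guessed for illustration and may not match what plan.py writes.

```python
import json


def pending_entries(plan: dict) -> dict:
    """Return entries whose status is not "ok", grouped by kind."""
    if plan.get("up_to_date"):
        return {"datasets": [], "buckets": []}
    return {
        kind: [entry for entry in plan.get(kind, []) if entry.get("status") != "ok"]
        for kind in ("datasets", "buckets")
    }


# Hypothetical plan content; the real .plan.json schema may differ.
sample_plan = json.loads("""
{
  "up_to_date": false,
  "datasets": [
    {"name": "images", "status": "new", "file_path": "datasets/images"},
    {"name": "labels", "status": "ok", "file_path": "datasets/labels"}
  ],
  "buckets": [
    {"uri": "s3://my-bucket/", "status": "stale", "file_path": "buckets/s3/my-bucket"}
  ]
}
""")

todo = pending_entries(sample_plan)
print([d["name"] for d in todo["datasets"]], [b["uri"] for b in todo["buckets"]])
```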

Step 3 — Save Data

For each dataset where status != "ok":

python3 {skill_dir}/scripts/dataset_all.py <name> \
  --plan dc-knowledge/.plan.json --output dc-knowledge/<file_path>.json

For each bucket where status != "ok" (and not enlisted in Step 1):

python3 {skill_dir}/scripts/bucket_scan.py <uri> --output dc-knowledge/<file_path>.json

Run independent calls concurrently.
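
One way to run the independent Step 3 calls concurrently is a thread pool over subprocesses. This is a sketch, not the skill's mechanism: `run_all` is a hypothetical helper, and the commands are harmless stand-ins rather than real `dataset_all.py` or `bucket_scan.py` invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command in its own subprocess; return (command, returncode) pairs."""

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    # Threads are enough here because each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with commands.
        return list(zip(commands, pool.map(run, commands)))


# Stand-in commands; in practice each would be one Step 3 script call.
commands = [[sys.executable, "-c", f"print('entry {i}')"] for i in range(3)]
results = run_all(commands)
failed = [cmd for cmd, code in results if code != 0]
print(len(results), "finished,", len(failed), "failed")
```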

Step 4 — Enrich

Generate .md from .json for each entry processed in Step 3 (and any Step 1 bucket JSON that lacks a .md).

Datasets: read {skill_dir}/prompts/enrich.md, then write dc-knowledge/<file_path>.md per the template.
Buckets: read {skill_dir}/prompts/enrich_bucket.md, then write dc-knowledge/<file_path>.md.

The prompt template is authoritative — downstream tooling parses the exact frontmatter it prescribes. Skip this step only if the user requests raw output only.

Step 5 — Build Index

python3 {skill_dir}/scripts/render_index.py --plan dc-knowledge/.plan.json --output dc-knowledge/index.md

Step 6 — Cleanup

python3 {skill_dir}/scripts/cleanup_json.py --plan dc-knowledge/.plan.json

Keeps .plan.json for Step 7. Skip if the user asks to retain JSON for debugging.

Step 7 — Report

Knowledge base updated: <N> datasets (<M> updated, <K> unchanged), <B> buckets (<X> scanned, <Y> unchanged).

If any buckets have listing_expired: true, add:

Warning: Listing for <bucket> is expired (last scanned: <date>). Run dc.read_storage("<uri>", update=True) to refresh.
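
The report text can be assembled from counts as in the sketch below. `format_report` and its parameters are hypothetical, and it assumes the totals `<N>` and `<B>` are the sums of the updated/scanned and unchanged counts.

```python
def format_report(ds_updated, ds_unchanged, bk_scanned, bk_unchanged, expired=()):
    """Build the Step 7 message; `expired` holds (bucket, uri, last_scanned) tuples."""
    n = ds_updated + ds_unchanged  # ASSUMED: total = updated + unchanged
    b = bk_scanned + bk_unchanged  # ASSUMED: total = scanned + unchanged
    lines = [
        f"Knowledge base updated: {n} datasets "
        f"({ds_updated} updated, {ds_unchanged} unchanged), "
        f"{b} buckets ({bk_scanned} scanned, {bk_unchanged} unchanged)."
    ]
    for bucket, uri, last_scanned in expired:
        lines.append(
            f"Warning: Listing for {bucket} is expired (last scanned: {last_scanned}). "
            f'Run dc.read_storage("{uri}", update=True) to refresh.'
        )
    return "\n".join(lines)


report = format_report(2, 5, 1, 0, expired=[("my-bucket", "s3://my-bucket/", "2026-01-01")])
print(report)
```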
