Run any Skill in Manus with one click

init

Lightweight project initialization — infer the project goal and in-scope datasources, scan the file tree and database metadata (db/table/desc/sample), classify into business domains, then write an AGENTS.md inventory skeleton plus the cheap file-based stores (atomic facts to ./knowledge/*.md via lite extract-knowledge, durable preferences to memory). Stops short of the expensive vector-indexed stores (semantic_models / metrics / reference_sql). Single confirmation-free pass, low token cost.

Run Skill in Manus

Stars1,285

Forks204

UpdatedJune 16, 2026 at 05:42

Source

Datus-ai

Datus-ai/Datus-agent

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

SKILL.md

readonly

More from this repository

same repository

build-kb

Datus-ai/Datus-agent

Build the project's vector-indexed knowledge base from files plus database metadata — optionally scoped to specific files / tables / datasources / domains. Scan the in-scope material, classify it into business domains, explore each domain in parallel with explore subagents, then (after the user confirms a generation manifest) route every artifact to its store via storage-classify, generating semantic_models / metrics / reference_sql (and mining any extra knowledge), and refresh AGENTS.md's KB index. The lightweight /init handles the AGENTS.md inventory plus file-based knowledge/memory; this skill owns the heavy vector-store generation.

2026-06-161.3k

storage-classify

Datus-ai/Datus-agent

Decide where a produced artifact must be persisted before writing it, then route it the prescribed way — semantic_models / metrics / reference_sql via the matching task() subagent, knowledge via extract-knowledge (lite), memory via add_memory, skills via create-skill, and AGENTS.md edited directly. Load before persisting any business fact, validated SQL, metric/model definition, session preference, project convention, or reusable workflow.

2026-06-161.3k

extract-knowledge

Datus-ai/Datus-agent

Mine shortest atomic facts from (question + gold_sql) pairs into ./knowledge/*.md; either by simulating SQL drafting (lite) or by driving the gen_sql subagent in blind iteration (deep)

2026-06-111.3k

memory-organization

Datus-ai/Datus-agent

Audit and reorganize every persistent store — semantic_models, metrics, reference_sql, knowledge, memory, AGENTS.md, skills — verifying each item sits in the correct store per storage-classify, and surfacing duplicates, misclassifications, conflicts, and stale/erroneous entries. Produce a Remediation Plan, STOP for confirmation, then execute. Use ask_user only for genuine decisions during analysis. If nothing needs fixing, report it and stop.

2026-06-111.3k

session-summarize

Datus-ai/Datus-agent

Review the current chat session and persist its valuable takeaways — business facts/rules, validated SQL, metric/model definitions, durable preferences, project conventions, reusable workflows — by classifying each via storage-classify and routing it to the right store. Present a Summary Manifest and STOP for confirmation before any heavy generation. Use at the end of a working session or when the user asks to capture what was learned.

2026-06-111.3k

gen-metrics

Datus-ai/Datus-agent

Generate MetricFlow metrics from natural language business descriptions

2026-06-081.3k

name	init
description	Lightweight project initialization — infer the project goal and in-scope datasources, scan the file tree and database metadata (db/table/desc/sample), classify into business domains, then write an AGENTS.md inventory skeleton plus the cheap file-based stores (atomic facts to ./knowledge/*.md via lite extract-knowledge, durable preferences to memory). Stops short of the expensive vector-indexed stores (semantic_models / metrics / reference_sql). Single confirmation-free pass, low token cost.
tags	["init","workspace","project","classify"]
version	3.0.0
user_invocable	true

Lightweight Project Initialization

You are initializing a project workspace lightly. The goal is a fast, low-cost first pass that gives a downstream agent a usable map of the project without paying for the heavy vector-store generation. You scan the project (files + database metadata), classify it into business domains, then write:

an AGENTS.md inventory skeleton (the project map injected into every node's <project_context>), and
the cheap file-based stores — atomic business facts to ./knowledge/*.md (via lite extract-knowledge) and durable cross-session preferences to memory (via add_memory).

You do not build the vector-indexed stores (semantic_models, metrics, reference_sql) here — those cost tokens, fan out explore subagents, and write LanceDB, so they are out of scope for this lightweight pass. There is no confirmation gate in this skill: knowledge/memory/AGENTS.md are cheap markdown writes, so just do them.

You run in the main agent context, so you may call todo_write/todo_list/todo_read/todo_update, add_memory/edit_memory, the filesystem tools (glob, grep, read_file, write_file, edit_file), the database tools (list_databases, list_tables, describe_table, get_table_ddl, search_table, read_query), and load_skill (to run extract-knowledge lite and consult storage-classify).

Routing authority is storage-classify. This skill decides what to scan; for which content lands in knowledge vs memory vs AGENTS.md, follow storage-classify's decision tree (branches 1, 5, 6) — do not re-invent routing here.

Step 0 — Project Context (inferred, no questions)

Do NOT call ask_user in this step. Infer the project goal and the in-scope datasources from the repository and configuration.

Infer the goal. Read README.md if it exists (first 3000 chars); otherwise derive a 1-2 sentence goal from the directory name, top-level files, and the datasource/table names. State it as an explicit assumption you surface in AGENTS.md.
Infer datasource scope from configuration — default to the single active datasource. Use the currently active datasource of the session (the one pinned via --datasource / /datasource, i.e. the project's default_datasource). When multiple datasources are configured, scope to that active one only — do NOT initialize all of them. Fall back to the sole configured datasource when only one exists, or to a specific one the project files clearly reference. Broaden beyond the active datasource only if the user explicitly asks (via the free-text hints). Do not invent unconfigured services — only mention an extra service (Airflow, Superset, …) if a project file explicitly references it.
Use glob to scan the directory tree (top 3 levels). Skip hidden dirs and __pycache__ / node_modules / .venv.

Step 1 — Scan & Classify

Gather the raw material, then classify it into a multi-level taxonomy of business domains / subtopics (e.g. sales/orders, sales/refunds, infra/etl). This taxonomy feeds the AGENTS.md inventory and helps you locate atomic facts worth filing as knowledge.

File side:

Use glob / grep to collect candidate files: *.sql, *.md, *.yml / *.yaml, scripts, configs, notebooks.
Note any validated-query corpus but do NOT enumerate it here. A corpus of validated (question, SQL) pairs (a queries file, a golden/benchmark set, a saved-query catalog, a dbt/analysis SQL folder) feeds the vector-indexed reference_sql store, which is out of scope for this lightweight pass. You may mention its location under ## Data Assets as a project asset, but do not generate reference_sql entries or a dedicated index section.

Database side (for each in-scope datasource):

list_databases → list_tables to enumerate tables/views.
For representative tables: describe_table (or get_table_ddl) for desc (column names/types/comments) and search_table (its sample_data) or read_query("SELECT * FROM <t> LIMIT 5") for sample. Sampling desc/sample is enough for the inventory — you do not need exhaustive statistics here.
For large databases (>50 tables), sample representative tables per naming pattern rather than describing every table.

Classify every file and table into the domain taxonomy. A single domain may contain both files and tables.

Record with todos when the taxonomy is large (> 3 domains): call todo_write with one todo per domain — title = domain name (≤ 8 words), content = the files + tables it covers. Skip todos for small projects (≤ 3 domains) to avoid noise.

Step 2 — File Cheap Stores (knowledge + memory)

While scanning, you will encounter atomic business facts that a downstream agent cannot infer from INFORMATION_SCHEMA / column comments alone — field encodings / enum / status codes, mandatory constant filters, join traps, business-term→field mappings. These are cheap to file (plain markdown, no vector index) and high-value.

Atomic facts → knowledge. Run extract-knowledge in lite mode (do NOT trigger its deep blind-SQL flow): pass the source (a table's comments/sample, a doc, a config) and the specific fact to mine, plus the datasource it applies to. It writes ./knowledge/<domain-slug>.md. Only file facts that are genuinely non-inferable — skip anything mechanically derivable from the schema.
- Do not enumerate the validated-query corpus for knowledge here. Mining the (question, SQL) corpus pair-by-pair is out of scope for this lightweight pass. Here, only file facts you can read directly off table metadata / docs.
Durable preferences/context → memory. A user/team habit or a default the agent should remember next session goes to add_memory (≤ 2000 bytes). Skip session-specific or one-shot content.

No confirmation gate — these are cheap, reversible writes. If a write would destructively overwrite an existing ./knowledge/*.md, prefer a scoped edit_file and do not clobber unrelated entries.

Step 3 — Write the AGENTS.md Inventory Skeleton

Write or update ./AGENTS.md, following the AGENTS.md Section Ownership from storage-classify. The canonical section order is:

# <project name> · ## Architecture · ## Directory Map · ## Services · ## Data Assets · ## Recommended Tools · ## SQL Conventions · ## Knowledge

AGENTS.md is the project's KB entry point. Because the agentic runtime injects AGENTS.md's first ~200 lines into every node's <project_context>, this is the one place that reliably tells a downstream agent what exists and how to reach it.

You own and fill the inventory sections below — but write only the sections that have real content, in the canonical order:

# <project name> — one-line description (the inferred goal).
## Architecture — brief data flow / stack; ASCII diagram only if complex.
## Directory Map — main dirs only (Directory / Purpose / Key Entry Point).
## Services — configured datasources + any user-mentioned services.
## Data Assets — never enumerate every table. Summarize per database by domain + table count, naming 3-5 representative tables; note "Use list_tables / search_table to explore details at runtime."
## Recommended Tools — runtime tools per configured service type.
## SQL Conventions — if the project has a validated-SQL corpus, induce its recurring output conventions as a short bullet list (this rides in <project_context> and nudges downstream gen_sql toward the project's answer shape):
- Induce, do not hardcode. Read a representative slice of the corpus; state only patterns you actually observe, phrased schema-free (no specific table/column/code names).
- Every rule carries its trigger phrasing — write each as "when the question says/asks ⟨observable phrasing⟩ → ⟨output shape⟩". A rule whose trigger you cannot state as question wording is not a rule — drop it.
- Counter-example scan before persisting — for each candidate rule, scan for pairs whose question matches the trigger but whose SQL differs; any counter-example means narrow or drop it.
- If the corpus is absent or too small to induce any convention, omit the ## SQL Conventions section entirely — do not write a placeholder.
## Knowledge — index of the ./knowledge/*.md files you wrote in Step 2. One bullet per file: - [<Domain>](knowledge/<slug>.md) — <one-line scope>. Write the scope line to convey unguessable specifics — name the kinds of concrete values inside (exact thresholds, literal filter codes, enum spellings, term→column mappings), not just the topic. If you wrote no knowledge files this run, omit the ## Knowledge section — do not write a placeholder.

Only write sections that have real content. The vector-index sections (## Semantic Models, ## Metrics, ## Reference SQL) are out of scope for this lightweight pass, so do not write them at all — neither the section header nor a "none yet" placeholder. Likewise omit any inventory section you have nothing concrete to put in. Never emit empty placeholder lines like _No metrics yet._.

Hard constraints:

AGENTS.md is a top-level overview, not a data dictionary. Target ≤ 200 lines.
If AGENTS.md already exists, prefer a scoped edit_file over a full rewrite, and ask via ask_user before overwriting an existing file wholesale (this is the only ask_user this skill may make). Never touch ## Knowledge entries you did not write this run.

Step 4 — Wrap Up

Close by telling the user what was written: the AGENTS.md inventory, any ./knowledge/*.md files, and any memory entries. Keep it to a short summary — do not propose follow-up actions.

Important Notes

Routing lives in storage-classify, not here. Knowledge vs memory vs AGENTS.md follows its decision tree (branches 1, 5, 6).
This is the lightweight pass — stay in scope. Do NOT call gen_semantic_model / gen_metrics / gen_sql_summary, do NOT fan out explore subagents, and do NOT emit a Generation Manifest — the vector-indexed stores are out of scope for this skill.
Use placeholder comments (e.g. ) when you cannot determine something rather than inventing facts.
Do not ask the user anything except before wholesale-overwriting an existing AGENTS.md. Goal, datasource scope, knowledge, and memory are all inferred and written directly.