| name | repo-archaeologist |
| description | Use when setting up a new repo for TKS team, onboarding a legacy project, or when institutional knowledge is missing/outdated. Triggers: 'document this repo', 'new project setup', 'knowledge extraction', 'what does this codebase do', missing DOMAIN_MAP.md or TRAPS.md or DECISIONS_LOG.md in docs/institutional-memory/. |
Repo Archaeologist
Overview
Extract institutional knowledge from a codebase that only exists in people's heads or scattered across git history. Output 3 living documents that protect against 属人化 (knowledge trapped in individuals).
Core principle: Read code + git history + bug tickets → generate knowledge docs → dev reviews in 5 minutes. Dev never writes documentation from scratch.
When to Use
- New repo joining TKS portfolio
- Legacy repo with no documentation
- Quarterly refresh of existing docs
- Before a key developer rotates off project
- NOT for: routine PR documentation (use
/knowledge-capture instead)
Prerequisites
- GitHub CLI (
gh) authenticated
- ClickUp API access (optional, enriches output)
- Repo cloned locally with full git history
Process
Phase 1: Gather Raw Data
First, detect project type and set variables:
if [ -d "src/main/java" ]; then
SRC_ROOT="src/main/java"
TEST_ROOT="src/test/java"
FILE_EXT="*.java"
elif [ -d "src" ]; then
SRC_ROOT="src"
TEST_ROOT="src"
FILE_EXT="*.ts"
else
SRC_ROOT="."
TEST_ROOT="."
FILE_EXT="*.py"
fi
if [ -f "build.gradle" ] || [ -f "build.gradle.kts" ]; then
BUILD_TOOL="gradle"
elif [ -f "pom.xml" ]; then
BUILD_TOOL="maven"
elif [ -f "package.json" ]; then
BUILD_TOOL="npm"
else
BUILD_TOOL="unknown"
fi
TMPDIR="${TMPDIR:-/tmp}"
Then gather data:
gh pr list --state merged --json number,title,body,mergedAt,author,files \
--limit 500 --jq '.[] | select(.body | length > 50)' > "$TMPDIR/pr-history.json"
git log --numstat --since="12 months ago" --format="COMMIT:%H|%ae|%aI|%s" \
> "$TMPDIR/git-churn.txt"
{ git log --all --oneline --grep="fix" --since="12 months ago"; \
git log --all --oneline --grep="bug" --since="12 months ago"; \
git log --all --oneline --grep="hotfix" --since="12 months ago"; \
} | sort -u > "$TMPDIR/bug-commits.txt"
grep -rn "FIXME\|HACK\|WORKAROUND\|DO NOT\|注意\|TODO.*careful\|WARNING" \
--include="$FILE_EXT" "$SRC_ROOT" > "$TMPDIR/warning-comments.txt" 2>/dev/null || true
find . \( -name "application*.yml" -o -name "application*.properties" \
-o -name "application*.yaml" -o -name ".env*" -o -name ".env.example" \
-o -name "docker-compose*" -o -name "Dockerfile" \) \
-not -path "*/build/*" -not -path "*/node_modules/*" \
| head -20 > "$TMPDIR/config-files.txt"
Note on ClickUp: ClickUp integration is OPTIONAL. If no ClickUp API token is available, skip bug history from ClickUp and rely solely on git history (step 3 above). The skill MUST work without ClickUp — git history alone provides sufficient data for a useful output.
Phase 2: Generate Documents
Create docs/institutional-memory/ directory. Generate 3 files:
File 1: DECISIONS_LOG.md
What to extract from PRs:
- Filter PRs where body contains: "because", "instead of", "trade-off", "decided", "chosen", "なぜ", "理由", "選定"
- Classify each: Architecture Decision | Business Rule Change | Performance Optimization | Bug Root Cause
- For each decision: WHEN (date) → WHO (author) → WHAT (change) → WHY (reasoning) → IMPACT (what breaks if reversed)
Structure:
# Decision Log — [Project Name]
Auto-generated by /repo-archaeologist | Last scan: [date]
Repo: [repo-url] | PRs analyzed: [count] | Period: [date range]
## How to use this file
- Before making architectural changes: search here first
- Before reverting old code: check if there's a decision explaining WHY
- After making new decisions: /knowledge-capture will auto-update this file
## Architecture Decisions
### DEC-001: [Short title]
- **When**: [date] (PR #[number], by @[author])
- **Context**: [Why this decision was needed — business driver]
- **Choice**: [What was chosen]
- **Alternatives considered**: [What was rejected and why]
- **Constraint**: [External constraint if any — client requirement, legacy system, regulation]
- **⚠️ Reversal risk**: [What breaks if someone changes this without understanding context]
Critical rule: Every decision MUST have "Reversal risk". This is the most valuable field — it's what prevents the next developer from breaking things.
File 2: DOMAIN_MAP.md
What to extract from code:
- Package/module structure → business domain mapping
- Database migration files → domain model understanding
- Scheduled jobs/batch configs → operational context
- API endpoints → user-facing functionality
- External service integrations → dependency awareness
Structure:
# Business Domain Map — [Project Name]
Read this BEFORE reading any code.
## Client Profile
- Company: [name], Industry: [industry]
- End users: [who uses the system and why]
- Peak seasons: [when workload spikes — CRITICAL for deploy planning]
- Escalation pattern: [what makes client call/email urgently]
## Module Map
### `[module-name]/` — [Business meaning in Vietnamese]
- **Japanese term**: [日本語の業務用語]
- **What it does for the business**: [2-3 sentences, no code jargon]
- **Trigger**: [What causes this module to run — user action, schedule, event]
- **If it fails**: [Business impact + escalation level: 低/中/高]
- **Key files**: [3-5 most important files]
- **Domain expert on team**: [@person] (backup: [@person])
- **Connected to**: [Other modules that depend on or feed into this]
## Data Flow
[Describe how data moves through the system in business terms]
Example: "Client's customer uploads CSV → system validates → calculates tax →
generates 請求書 → sends to client for approval → client sends to their customer"
## External Dependencies
| System | Purpose | Owner | SLA | Risk if down |
|--------|---------|-------|-----|-------------|
Critical rule: Write in Vietnamese for team, keep Japanese business terms (日本語) as-is. A dev who reads this file should understand the BUSINESS before touching code.
File 3: TRAPS.md
What to scan for:
| Trap Category | Detection Method |
|---|
| Timezone | Search for LocalDateTime.now(), new Date(), datetime.now() without timezone |
| Rounding | Search for BigDecimal, Math.round, toFixed near money/tax logic |
| Encoding | Search for charset, encoding, Shift-JIS, Windows-31J, CSV export |
| Hardcoded config | Search for hardcoded URLs, IPs, magic numbers in business logic |
| External API quirks | Search for Thread.sleep, retry logic, rate limit comments |
| Date edge cases | Search for month-end, fiscal year, leap year logic |
| Multi-tenant leaks | Search for missing tenant filters in queries |
Also extract from:
- Bug commits that took > 1 day to resolve (git log + ClickUp cross-reference if available)
- Code comments with WARNING/FIXME/HACK/注意
- Files with > 5 hotfixes in same area (git blame analysis)
- Dependency health (based on auto-detected build tool):
- Gradle:
./gradlew dependencyUpdates (if plugin available) or read build.gradle
- Maven:
mvn versions:display-dependency-updates
- npm:
npm outdated
- If dependency check command unavailable, skip — this is optional enrichment
Structure:
# ⚠️ Trap Map — [Project Name]
READ THIS BEFORE WRITING CODE. Every trap here caused a real bug.
## 🔴 Critical Traps (Mistake = production bug)
### TRAP-001: [Short name]
- **Files**: [Exact file paths]
- **Wrong way**: [Code that looks right but is wrong]
- **Right way**: [Correct code with explanation]
- **Why it's a trap**: [Why the wrong way seems reasonable]
- **History**: [Bug ticket or incident that revealed this trap]
- **Detection**: [How to catch this in code review]
## 🟡 Caution Traps (Mistake = wasted time)
[Same structure, lower severity]
## 🟢 Known Workarounds (Ugly but necessary)
[Document workarounds that look like bugs but are intentional]
Critical rule: Every trap MUST have "Wrong way" and "Right way" as actual code. Abstract descriptions are useless — show the exact code pattern.
Phase 3: Validate
After generating all 3 files:
- Run
wc -l on each file — DOMAIN_MAP should be longest, TRAPS shortest
- Cross-check: Every module in DOMAIN_MAP should appear in at least one DECISIONS_LOG entry
- Cross-check: At least 50% of TRAPS entries should reference a real bug/PR number
- Ask dev to review: "Anything critical missing? Anything wrong?" — target 5-minute review
Phase 4: Commit
git add docs/institutional-memory/
git commit -m "docs: generate institutional memory via /repo-archaeologist
Generated:
- DECISIONS_LOG.md: [N] decisions from [M] PRs
- DOMAIN_MAP.md: [N] modules documented
- TRAPS.md: [N] traps identified
Review: [senior dev name] confirmed accuracy"
Common Mistakes
| Mistake | Fix |
|---|
| Writing code documentation instead of business knowledge | Ask "would a PM understand this?" — if no, rewrite |
| Missing "Reversal risk" on decisions | Every decision needs this — it's the whole point |
| Traps without code examples | Abstract trap = useless trap. Show wrong vs right code |
| Trying to document everything | Focus on top 20% most critical modules first |
| Domain map uses developer jargon | Write for a new developer who has ZERO context |
Refresh Schedule
- Full rescan: Quarterly, or when project scope changes significantly
- Incremental updates: Handled by
/knowledge-capture on every PR
- Trap additions: Immediately after any production bug