| name | Microbiome Intelligence |
| description | Strategic multi-database microbiome data analysis that THINKS before executing. Uses sequential reasoning to assess existing systems (Patient Similarity DB, Neo4j, Supabase, Upstash) BEFORE querying. Use when answering strategic questions about microbiome data availability, connecting dots across systems, age prediction, patient clustering, or any question requiring cross-database synthesis. ALWAYS starts with pre-flight assessment. |
Strategic Data Intelligence - Not just database queries, but intelligent assessment of what data exists where and how to optimally answer questions.
This skill follows a 3-phase strategic protocol:
Key Difference: Unlike procedural skills that immediately query databases, this skill REASONS about the problem first.
docs/architecture/MULTI_DATABASE_ARCHITECTURE_GUIDE.md
Use this skill when you need to:
Do NOT use for:
Before ANY database queries, use mcp__seq_think__sequentialthinking to assess:
**Thought 1**: What is the user REALLY asking?
- Break down the actual objective
- Identify the underlying data need
- Recognize if this might duplicate existing work
**Thought 2**: What systems/data ALREADY exist?
- Patient Similarity Database: 6,293 samples with 384-dim vectors
  - 2,026 samples WITH customer age data
  - Multi-modal embeddings (256 microbiome + 128 clinical)
  - Ready for ML training
- Neo4j Graph: 61M+ pathway relationships, species-sample links
  - Customers: 1,773 with age data
  - Samples: 6,293 total
  - Relationships: Species, Pathways, Customers
- Supabase: Master clinical data source
  - Customers with demographics
  - Health scores, gut profiles
  - Source of truth for patient metadata
- Upstash Vector Databases:
  - CONTEXT_PAPERS: 13,118 publications (3072-dim) for research
  - DISCOVERY: 1,762 functional samples (3072-dim) for high-precision research
  - PRIMARY: 6,014 functional samples (1536-dim) for fast production queries
  - PATIENT_SIMILARITY: 6,293 patient profiles (384-dim) for patient clustering
**Thought 3**: Can we answer this WITHOUT new queries?
- Check if Patient Similarity already has the data
- Review if prior session docs contain the answer
- Assess if inference system outputs exist
**Thought 4**: What's the OPTIMAL data source?
- If question involves patient similarity → Use Upstash Patient Similarity
- If question needs graph relationships → Use Neo4j
- If question needs demographic truth → Use Supabase
- If question needs functional pathways → Use Upstash DISCOVERY
- If question needs research literature → Use Upstash CONTEXT_PAPERS
**Thought 5**: What's the minimal query set?
- Design queries that leverage existing infrastructure
- Avoid redundant queries across databases
- Plan for parallel execution where independent
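The "parallel where independent" idea in Thought 5 can be sketched as a small staging step. This is a minimal illustration, not part of the skill itself: the `planStages` helper, the `dependsOn` field, and the query ids are all hypothetical names introduced here.

```javascript
// Hypothetical query plan: each entry names the queries whose results it needs first.
// planStages groups queries into stages; everything in one stage can run in parallel.
function planStages(queries) {
  const done = new Set();
  const stages = [];
  let remaining = [...queries];
  while (remaining.length > 0) {
    // A query is ready once all of its dependencies have been scheduled.
    const ready = remaining.filter(q => q.dependsOn.every(d => done.has(d)));
    if (ready.length === 0) {
      throw new Error("Circular or unresolved dependency in query plan");
    }
    stages.push(ready.map(q => q.id));
    ready.forEach(q => done.add(q.id));
    remaining = remaining.filter(q => !done.has(q.id));
  }
  return stages;
}

const stages = planStages([
  { id: "patients", db: "upstash:PATIENT_SIMILARITY", dependsOn: [] },
  { id: "graph", db: "neo4j", dependsOn: [] },
  { id: "pathways", db: "upstash:DISCOVERY", dependsOn: ["patients"] },
]);
// stages -> [["patients", "graph"], ["pathways"]]
```

Each inner array could then be passed to `Promise.all`, one stage after another.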
Create a strategic assessment:
## Strategic Assessment for [User Question]
### Existing Resources Found:
- [ ] Patient Similarity DB has relevant data
- [ ] Neo4j has required relationships
- [ ] Supabase has needed metadata
- [ ] Prior session docs contain answer
- [ ] Inference system outputs exist
### Optimal Approach:
[Describe the best way to answer using existing systems]
### Queries Required:
- Database 1: [specific query with rationale]
- Database 2: [specific query with rationale]
### Queries NOT Needed:
- [Query avoided because...]
Use this matrix to select optimal databases:
| Question Type | Primary DB | Secondary DB | Rationale |
|---|---|---|---|
| Patient Similarity | Upstash Patient Similarity | None needed | 6,293 samples pre-computed |
| Age Prediction | Upstash Patient Similarity | Neo4j (validation) | 2,026 age-labeled samples ready for ML |
| Species-Pathway Co-occurrence | Neo4j | Upstash DISCOVERY | 61M pathway rels, 3072-dim vectors |
| Demographic Queries | Supabase | Patient Similarity (enriched) | Source of truth |
| Functional Analysis | Upstash DISCOVERY | Neo4j (relationships) | Pathway embeddings |
| Multi-View Learning | Patient Similarity + DISCOVERY | Neo4j (validation) | Taxonomic + Functional |
| Research Literature | Upstash CONTEXT_PAPERS | None needed | 13,118 publications indexed |
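The matrix above can also be held as a lookup table, so that the routing decision is explicit and testable. The sketch below only transcribes the table; the question-type keys and the `selectDatabases` helper are hypothetical labels chosen for illustration.

```javascript
// Routing table transcribed from the decision matrix; keys are hypothetical labels.
const ROUTING = new Map([
  ["patient-similarity", { primary: ["Upstash PATIENT_SIMILARITY"], secondary: [] }],
  ["age-prediction", { primary: ["Upstash PATIENT_SIMILARITY"], secondary: ["Neo4j"] }],
  ["species-pathway", { primary: ["Neo4j"], secondary: ["Upstash DISCOVERY"] }],
  ["demographics", { primary: ["Supabase"], secondary: ["Upstash PATIENT_SIMILARITY"] }],
  ["functional", { primary: ["Upstash DISCOVERY"], secondary: ["Neo4j"] }],
  ["multi-view", {
    primary: ["Upstash PATIENT_SIMILARITY", "Upstash DISCOVERY"],
    secondary: ["Neo4j"],
  }],
  ["literature", { primary: ["Upstash CONTEXT_PAPERS"], secondary: [] }],
]);

// Returns the primary and secondary databases for a question type,
// and fails loudly for types the matrix does not cover.
function selectDatabases(questionType) {
  const route = ROUTING.get(questionType);
  if (!route) {
    throw new Error(`No route for question type: ${questionType}`);
  }
  return route;
}

const ageRoute = selectDatabases("age-prediction");
```

Failing on unknown types, rather than falling back to a default database, keeps the pre-flight assessment honest about questions the matrix does not cover.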
// When: Questions about patient clustering, similarity, age prediction
// Use: Patient Similarity DB (already has 384-dim vectors with age metadata)
const ageDataSamples = await patientIndex.query({
  topK: 2026, // All samples with customer data
  includeMetadata: true,
  filter: "age > 0 AND age < 120"
});
// NO NEED to query Neo4j or Supabase - data already enriched!
// When: Questions about species-pathway relationships, network analysis
// Use: Neo4j for graph traversal
MATCH (c:Customer)-[:HAS_SAMPLE]->(s:Sample)-[:HAS_SPECIES]->(sp:Species)
WHERE c.age IS NOT NULL
// Aggregate per species: WITH expressions must be aliased, and c is out of scope afterwards
WITH sp, avg(c.age) AS avg_age, count(s) AS sample_count
WHERE sample_count >= 5
RETURN sp.name AS species, avg_age, sample_count
ORDER BY sample_count DESC
// When: Questions requiring data from multiple sources
// Strategy: Query in parallel, join in application
const [patientVectors, neo4jRelationships, functionalVectors] = await Promise.all([
  // Query 1: Patient Similarity (Upstash)
  patientIndex.query({ topK: 100, filter: "age > 60" }),
  // Query 2: Graph relationships (Neo4j)
  cypherQuery(`MATCH (c:Customer)-[:HAS_SAMPLE]->(s)
    WHERE c.age > 60
    RETURN c.customer_id, s.uuid`),
  // Query 3: Functional pathways (Upstash DISCOVERY)
  // sampleIds must already be known here; if they come from Query 1 or 2, run this fetch afterwards
  discoveryIndex.fetch(sampleIds)
]);
// Merge results in application layer
const combined = mergeMultiDBResults(patientVectors, neo4jRelationships, functionalVectors);
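The `mergeMultiDBResults` helper is not defined in this document. One possible shape for it, joining on sample id in the application layer, is sketched below. The record shapes are assumptions: patient records carry `id` and `metadata`, graph rows have been normalized to `{ customer_id, sample_uuid }`, and functional records carry `id` and `vector` or are `null` when a sample is missing.

```javascript
// Sketch of an application-layer join keyed on sample id (all shapes are assumed).
function mergeMultiDBResults(patientVectors, neo4jRelationships, functionalVectors) {
  // Index the secondary sources by sample id for O(1) lookups.
  const customerBySample = new Map(
    neo4jRelationships.map(r => [r.sample_uuid, r.customer_id])
  );
  const functionalBySample = new Map(
    functionalVectors.filter(f => f !== null).map(f => [f.id, f.vector])
  );
  // Drive the join from the patient records; missing values become null
  // instead of silently shifting rows out of alignment.
  return patientVectors.map(p => ({
    sample_id: p.id,
    age: p.metadata?.age ?? null,
    customer_id: customerBySample.get(p.id) ?? null,
    functional: functionalBySample.get(p.id) ?? null,
  }));
}

const merged = mergeMultiDBResults(
  [{ id: "s1", metadata: { age: 64 } }, { id: "s2", metadata: {} }],
  [{ customer_id: "c1", sample_uuid: "s1" }],
  [{ id: "s1", vector: [0.1, 0.2] }, null]
);
```

Joining by id rather than by array position means a sample absent from one database shows up as an explicit `null`, which makes the Data Validation step easier to report on.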
Query Only What's Needed
Parallel Execution
Data Validation
Efficiency Reporting
User Question: "Find age-related microbes for age prediction"
Traditional Approach (microbiome-data-expert skill):
1. Query Neo4j for species-age correlations
2. Find 5 species (data sparsity)
3. Search literature
4. Generate report
Missed that Patient Similarity already has 2,026 age-labeled samples!
Strategic Approach (this skill):
Phase 1 - Pre-Flight Think:
- Patient Similarity DB has 2,026 samples WITH age metadata
- Already has 384-dim vectors (256 microbiome + 128 clinical)
- Multi-view learning ready (taxonomic + clinical features)
- Chen et al. 2022 methodology matches our infrastructure
Phase 2 - Strategic Plan:
- Primary: Use Patient Similarity DB (2,026 samples)
- Secondary: Query Neo4j only for validation
- Avoid: Supabase query (data already in Patient Similarity metadata)
Phase 3 - Execute:
const ageTrainingData = await patientIndex.query({
  topK: 2026,
  includeMetadata: true,
  filter: "age > 0"
});
- Result: 2,026 training samples (vs 5 species)
- 405x more data
- No redundant queries
The inference system generates predicted age/health data for samples without customer data.
Check Inference System Outputs:
// Query Neo4j for inferred profiles
const inferredData = await cypherQuery(`
  MATCH (s:Sample)-[:HAS_INFERRED_PROFILE]->(i:InferredProfile)
  WHERE i.confidence > 0.7
  RETURN s.uuid, i.inferred_age, i.inferred_conditions, i.confidence
  LIMIT 1000
`);
// Combine with real data for semi-supervised learning
const combined = [...realAgeData, ...inferredData.map(d => ({
  age: d.inferred_age,
  confidence: d.confidence,
  is_inferred: true
}))];
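When real and inferred ages are mixed, it can help to carry an explicit per-sample weight so that inferred labels count for less during training. The sketch below is one way to do that under stated assumptions: `buildTrainingSet` is a hypothetical helper, real labels get weight 1, inferred labels are weighted by their confidence, and rows are assumed to be plain objects with `age`, `inferred_age`, and `confidence` fields.

```javascript
// Combine customer-reported and inferred ages into one weighted training set.
// Weighting scheme (an assumption of this sketch): real = 1, inferred = confidence.
function buildTrainingSet(realAgeData, inferredRows, minConfidence = 0.7) {
  const real = realAgeData.map(d => ({
    age: d.age,
    weight: 1,
    is_inferred: false,
  }));
  const inferred = inferredRows
    .filter(d => d.confidence > minConfidence) // same cutoff as the Cypher filter
    .map(d => ({
      age: d.inferred_age,
      weight: d.confidence,
      is_inferred: true,
    }));
  return [...real, ...inferred];
}

const trainingSet = buildTrainingSet(
  [{ age: 42 }, { age: 67 }],
  [
    { inferred_age: 55, confidence: 0.9 },
    { inferred_age: 30, confidence: 0.5 }, // dropped: below the cutoff
  ]
);
```

Keeping `is_inferred` on every row also makes it straightforward to evaluate the model on real labels only.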
Clinical vs Research lanes have different data quality/completeness:
// Clinical Lane: High-confidence, customer-linked data
const clinicalData = await patientIndex.query({
  filter: "lane = 'clinical' AND age > 0",
  topK: 2026
});
// Research Lane: May have inferred data
const researchData = await patientIndex.query({
  filter: "lane = 'research'",
  topK: 4267
});
Combine taxonomic + functional features as recommended by Chen et al. 2022:
// Step 1: Get taxonomic features from Patient Similarity
// (assuming the @upstash/vector client, vectors and metadata must be requested explicitly)
const taxonomicFeatures = await patientIndex.fetch(sampleIds, {
  includeVectors: true,
  includeMetadata: true
});
// Returns 256-dim species embeddings + 128-dim clinical
// Step 2: Get functional features from DISCOVERY
const functionalFeatures = await discoveryIndex.fetch(sampleIds, { includeVectors: true });
// Returns 3072-dim pathway abundance vectors
// Step 3: Merge for multi-view learning
const multiViewData = sampleIds.map((id, i) => ({
  sample_id: id,
  taxonomic: taxonomicFeatures[i].vector, // 384 dims (256 species + 128 clinical)
  functional: functionalFeatures[i].vector, // 3072 dims
  age: taxonomicFeatures[i].metadata.age // Target
}));
// Train dual models + meta-learner
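The "dual models + meta-learner" step can be illustrated with a generic stacking sketch. This is not taken from Chen et al. 2022 and is not the skill's actual training code: it assumes two base models (one per view) have already produced age predictions on held-out samples, and fits a linear meta-learner `y ≈ a*p1 + b*p2` by ordinary least squares. All function names are hypothetical.

```javascript
// Fit y ≈ a*p1 + b*p2 (no intercept) by solving the 2x2 normal equations.
// p1, p2: held-out predictions from the taxonomic and functional base models.
function fitMetaLearner(p1, p2, y) {
  let s11 = 0, s12 = 0, s22 = 0, s1y = 0, s2y = 0;
  for (let i = 0; i < y.length; i++) {
    s11 += p1[i] * p1[i];
    s12 += p1[i] * p2[i];
    s22 += p2[i] * p2[i];
    s1y += p1[i] * y[i];
    s2y += p2[i] * y[i];
  }
  const det = s11 * s22 - s12 * s12;
  if (Math.abs(det) < 1e-12) {
    throw new Error("Base predictions are collinear; meta-learner is not identifiable");
  }
  return {
    a: (s1y * s22 - s12 * s2y) / det,
    b: (s11 * s2y - s12 * s1y) / det,
  };
}

// Combine the two base predictions for a new sample.
function predictStacked(weights, taxonomicPrediction, functionalPrediction) {
  return weights.a * taxonomicPrediction + weights.b * functionalPrediction;
}

// Toy held-out data where the true age is the mean of both base predictions.
const metaWeights = fitMetaLearner([30, 40, 50], [34, 44, 60], [32, 42, 55]);
const stackedAge = predictStacked(metaWeights, 60, 70);
```

In this toy example the fitted weights come out at about 0.5 each, so the stacked prediction for base outputs 60 and 70 is about 65. Fitting the meta-learner on held-out predictions, not on the data the base models were trained on, is what keeps it from simply rewarding the more overfit view.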
Question: "Build age prediction model from microbiome"
Strategic Assessment:
Execution:
// Extract training data (single query)
const trainingData = await patientIndex.query({
  topK: 2026,
  includeMetadata: true,
  includeVectors: true, // needed because d.vector is read below
  filter: "age > 0 AND age < 120"
});
// Train model (no additional queries needed!)
const model = trainRandomForest(
  trainingData.map(d => d.vector), // 384-dim features
  trainingData.map(d => d.metadata.age) // Target
);
Efficiency: 1 query vs 10+ in traditional approach
Question: "Which species co-occur with specific pathways?"
Strategic Assessment:
Execution:
MATCH (s:Sample)-[:HAS_SPECIES]->(sp:Species),
(s)-[:HAS_FUNCTION]->(p:Pathway)
WHERE sp.relative_abundance > 0.05
WITH sp.name as species, p.key as pathway, count(s) as co_occurrence
ORDER BY co_occurrence DESC
LIMIT 50
RETURN species, pathway, co_occurrence
Efficiency: Direct Neo4j query, no multi-DB needed
Question: "Find patient subgroups by microbiome similarity"
Strategic Assessment:
Execution:
// Find similar patients to a reference
const similarPatients = await patientIndex.query({
  vector: referencePatient.vector,
  topK: 20,
  includeMetadata: true,
  filter: "constipation_severity > 5"
});
// Cluster all patients
const allSamples = await patientIndex.query({
  topK: 6293,
  includeMetadata: true,
  includeVectors: true // needed because s.vector is read below
});
const clusters = kMeansClustering(
  allSamples.map(s => s.vector),
  { numClusters: 5 } // options object; JavaScript has no named arguments
);
Efficiency: Vector DB optimized for this, <1 second query
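The clustering helper used in this scenario is not defined in the document. For readers who want to see what such a helper might do, here is a minimal k-means sketch (Lloyd's algorithm). It is an illustration under simplifying assumptions: centroids are initialized naively from the first `numClusters` vectors, and the function name and options object are hypothetical. A production setup would more likely use a tested library with better initialization.

```javascript
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Minimal k-means: alternate between assigning points to the nearest centroid
// and recomputing each centroid as the mean of its members.
function kMeansClustering(vectors, { numClusters, maxIterations = 50 }) {
  if (vectors.length < numClusters) {
    throw new Error("Need at least numClusters vectors");
  }
  // Naive initialization (an assumption of this sketch): first numClusters vectors.
  let centroids = vectors.slice(0, numClusters).map(v => [...v]);
  const assignments = new Array(vectors.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;
    vectors.forEach((v, i) => {
      let best = 0;
      for (let c = 1; c < numClusters; c++) {
        if (squaredDistance(v, centroids[c]) < squaredDistance(v, centroids[best])) {
          best = c;
        }
      }
      if (assignments[i] !== best) {
        assignments[i] = best;
        changed = true;
      }
    });
    if (!changed) break; // converged: no point switched cluster

    centroids = centroids.map((old, c) => {
      const members = vectors.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return old; // keep an empty cluster's centroid
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return { assignments, centroids };
}

const demo = kMeansClustering([[0, 0], [10, 10], [0, 1], [10, 11]], { numClusters: 2 });
// demo.assignments -> [0, 1, 0, 1]
```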
Symptoms: Neo4j query returns 5 species when expecting 50+
Strategic Diagnosis:
Solution:
// Lower thresholds incrementally
WHERE has.abundance >= 0.005  // Was 0.01
  AND customer_count >= 5     // Was 10
Or Better: Check if Patient Similarity already has the data!
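If thresholds do need relaxing, one way to keep the query count low is to run the graph query once with the loosest thresholds and then pick the strictest setting that still returns enough rows, in application code. The sketch below assumes rows shaped like `{ species, abundance, customer_count }`; the `pickThresholds` helper and the field names are hypothetical.

```javascript
// Given rows fetched once with the loosest thresholds, return the strictest
// threshold step (steps are ordered strictest first) that keeps enough rows.
function pickThresholds(rows, steps, minResults) {
  for (const step of steps) {
    const kept = rows.filter(
      r => r.abundance >= step.minAbundance && r.customer_count >= step.minCustomers
    );
    if (kept.length >= minResults) {
      return { step, kept };
    }
  }
  return { step: null, kept: [] }; // even the loosest step is too sparse
}

const sparseRows = [
  { species: "A", abundance: 0.02, customer_count: 12 },
  { species: "B", abundance: 0.007, customer_count: 6 },
  { species: "C", abundance: 0.006, customer_count: 11 },
];
const choice = pickThresholds(
  sparseRows,
  [
    { minAbundance: 0.01, minCustomers: 10 },
    { minAbundance: 0.005, minCustomers: 5 },
  ],
  2
);
// choice.step -> { minAbundance: 0.005, minCustomers: 5 }, choice.kept.length -> 3
```

A `null` step is a useful signal in its own right: it suggests the sparsity is in the data, not the thresholds, which is the cue to check another source.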
Symptoms: Same sample appears in multiple databases
Strategic Understanding:
Solution: Query the OPTIMAL database for your question:
Symptoms: Age values seem inconsistent
Diagnosis:
- customer.age (from Supabase)
- sample.estimated_age (from inference system)
Solution: Always check metadata source:
if (sample.metadata.patient_id === "unknown") {
  // This is inferred data, use with caution
  // Weight by estimation_confidence
}
Before executing ANY database query, ask yourself:
Remember: The best query is the one you don't have to run because the data already exists elsewhere.
Created: 2025-10-24
Category: Strategic Intelligence
Complexity: Advanced
Prerequisites: Multi-database architecture understanding