Name: Microbiome Intelligence
Author: scan-iq

Description: Strategic multi-database microbiome data analysis that THINKS before executing. Uses sequential reasoning to assess existing systems (Patient Similarity DB, Neo4j, Supabase, Upstash) BEFORE querying. Use when answering strategic questions about microbiome data availability, connecting dots across systems, age prediction, patient clustering, or any question requiring cross-database synthesis. ALWAYS starts with pre-flight assessment.

Updated: November 23, 2025 at 08:59

Repository: scan-iq/FoxRev

Microbiome Intelligence

What This Skill Does

Strategic Data Intelligence: not just database queries, but an assessment of what data exists, where it lives, and how best to answer the question.

This skill follows a 3-phase strategic protocol:

1. Pre-Flight Think: Assess existing systems and data before executing queries
2. Strategic Planning: Determine optimal data sources and query patterns
3. Intelligent Execution: Query only what's needed with multi-database awareness

Key Difference: Unlike procedural skills that immediately query databases, this skill REASONS about the problem first.

Prerequisites

Access to microbiome project databases (Neo4j, Supabase, Upstash)
Understanding of two-lane architecture (Clinical vs Research)
Multi-Database Architecture Guide: docs/architecture/MULTI_DATABASE_ARCHITECTURE_GUIDE.md

Quick Reference: When to Use This Skill

Use this skill when you need to:

✅ Strategic Questions: "What age data do we have?" (not "query Neo4j for age data")
✅ Cross-System Analysis: Questions spanning multiple databases
✅ Resource Assessment: "Do we already have X?" before building X
✅ Pattern Discovery: Age prediction, patient similarity, clustering
✅ Data Availability: Understanding what exists before designing solutions
✅ Historical Context: Questions requiring knowledge of Patient Similarity system, inference system, etc.

Do NOT use for:

❌ Simple single-database queries (use MCP tools directly)
❌ Repetitive data extraction (use automation scripts)
❌ Questions with no strategic component

Phase 1: Pre-Flight Strategic Assessment (ALWAYS START HERE)

Step 1: Use Sequential Thinking to Map Resources

Before ANY database queries, use mcp__seq_think__sequentialthinking to assess:

**Thought 1**: What is the user REALLY asking?
- Break down the actual objective
- Identify the underlying data need
- Recognize if this might duplicate existing work

**Thought 2**: What systems/data ALREADY exist?
- Patient Similarity Database: 6,293 samples with 384-dim vectors
  - 2,026 samples WITH customer age data
  - Multi-modal embeddings (256 microbiome + 128 clinical)
  - Ready for ML training
- Neo4j Graph: 61M+ pathway relationships, species-sample links
  - Customers: 1,773 with age data
  - Samples: 6,293 total
  - Relationships: Species, Pathways, Customers
- Supabase: Master clinical data source
  - Customers with demographics
  - Health scores, gut profiles
  - Source of truth for patient metadata
- Upstash Vector Databases:
  - CONTEXT_PAPERS: 13,118 publications (3072-dim) for research
  - DISCOVERY: 1,762 functional samples (3072-dim) for high-precision research
  - PRIMARY: 6,014 functional samples (1536-dim) for fast production queries
  - PATIENT_SIMILARITY: 6,293 patient profiles (384-dim) for patient clustering

**Thought 3**: Can we answer this WITHOUT new queries?
- Check if Patient Similarity already has the data
- Review if prior session docs contain the answer
- Assess if inference system outputs exist

**Thought 4**: What's the OPTIMAL data source?
- If question involves patient similarity → Use Upstash Patient Similarity
- If question needs graph relationships → Use Neo4j
- If question needs demographic truth → Use Supabase
- If question needs functional pathways → Use Upstash DISCOVERY
- If question needs research literature → Use Upstash CONTEXT_PAPERS

**Thought 5**: What's the minimal query set?
- Design queries that leverage existing infrastructure
- Avoid redundant queries across databases
- Plan for parallel execution where independent
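
The routing rules in Thought 4 and the deduplication goal in Thought 5 can be sketched in a few lines. This is only an illustration: `DataNeed`, `DataSource`, `ROUTES`, and `planSources` are hypothetical names that do not come from the skill's codebase, and the mapping simply restates the five rules listed above.

```typescript
// Hypothetical types and names, used only to illustrate Thoughts 4 and 5.
type DataNeed =
  | "patient_similarity"
  | "graph_relationships"
  | "demographic_truth"
  | "functional_pathways"
  | "research_literature";

type DataSource =
  | "Upstash PATIENT_SIMILARITY"
  | "Neo4j"
  | "Supabase"
  | "Upstash DISCOVERY"
  | "Upstash CONTEXT_PAPERS";

// Thought 4: one preferred source per kind of data need.
const ROUTES: Record<DataNeed, DataSource> = {
  patient_similarity: "Upstash PATIENT_SIMILARITY",
  graph_relationships: "Neo4j",
  demographic_truth: "Supabase",
  functional_pathways: "Upstash DISCOVERY",
  research_literature: "Upstash CONTEXT_PAPERS",
};

// Thought 5: collapse repeated needs so each source appears at most once,
// keeping the order in which the sources were first requested.
function planSources(needs: DataNeed[]): DataSource[] {
  const sources = new Set<DataSource>();
  for (const need of needs) {
    sources.add(ROUTES[need]);
  }
  return Array.from(sources);
}

const plan = planSources([
  "patient_similarity",
  "graph_relationships",
  "patient_similarity", // duplicate need, should not add a second query
]);
console.log(plan);
```

The sources in the resulting plan are independent of each other, so they are natural candidates for parallel execution.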

Step 2: Document Strategic Findings

Create a strategic assessment:

## Strategic Assessment for [User Question]

### Existing Resources Found:
- [ ] Patient Similarity DB has relevant data
- [ ] Neo4j has required relationships
- [ ] Supabase has needed metadata
- [ ] Prior session docs contain answer
- [ ] Inference system outputs exist

### Optimal Approach:
[Describe the best way to answer using existing systems]

### Queries Required:
- Database 1: [specific query with rationale]
- Database 2: [specific query with rationale]

### Queries NOT Needed:
- ❌ [Query avoided because...]
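
If the assessment is produced programmatically, the template above could be filled from a small structured object. The sketch below is one way to do that; the `StrategicAssessment` shape and the `renderAssessment` function are assumptions made for illustration, not part of the skill.

```typescript
// Hypothetical shapes: the skill only defines the rendered template, not these types.
interface PlannedQuery { database: string; query: string; rationale: string; }
interface AvoidedQuery { query: string; reason: string; }
interface StrategicAssessment {
  question: string;
  resourcesFound: { [label: string]: boolean };
  approach: string;
  required: PlannedQuery[];
  avoided: AvoidedQuery[];
}

// Renders the assessment in the same section order as the template.
function renderAssessment(a: StrategicAssessment): string {
  const lines: string[] = [`## Strategic Assessment for ${a.question}`, ""];
  lines.push("### Existing Resources Found:");
  for (const label of Object.keys(a.resourcesFound)) {
    lines.push(`- [${a.resourcesFound[label] ? "x" : " "}] ${label}`);
  }
  lines.push("", "### Optimal Approach:", a.approach, "");
  lines.push("### Queries Required:");
  for (const q of a.required) {
    lines.push(`- ${q.database}: ${q.query} (${q.rationale})`);
  }
  lines.push("", "### Queries NOT Needed:");
  for (const q of a.avoided) {
    lines.push(`- ❌ ${q.query} (${q.reason})`);
  }
  return lines.join("\n");
}

const report = renderAssessment({
  question: "Find age-related microbes for age prediction",
  resourcesFound: {
    "Patient Similarity DB has relevant data": true,
    "Neo4j has required relationships": true,
    "Supabase has needed metadata": false,
  },
  approach: "Use the age-labeled samples already in Patient Similarity.",
  required: [
    {
      database: "Upstash Patient Similarity",
      query: "samples with age metadata",
      rationale: "2,026 age-labeled samples",
    },
  ],
  avoided: [
    { query: "Supabase demographics", reason: "age already in Patient Similarity metadata" },
  ],
});
console.log(report);
```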

Phase 2: Strategic Planning

Database Specialization Matrix

Use this matrix to select optimal databases:

| Question Type | Primary DB | Secondary DB | Rationale |
|---|---|---|---|
| Patient Similarity | Upstash Patient Similarity | None needed | 6,293 samples pre-computed |
| Age Prediction | Upstash Patient Similarity | Neo4j (validation) | 2,026 age-labeled samples ready for ML |
| Species-Pathway Co-occurrence | Neo4j | Upstash DISCOVERY | 61M pathway rels, 3072-dim vectors |
| Demographic Queries | Supabase | Patient Similarity (enriched) | Source of truth |
| Functional Analysis | Upstash DISCOVERY | Neo4j (relationships) | Pathway embeddings |
| Multi-View Learning | Patient Similarity + DISCOVERY | Neo4j (validation) | Taxonomic + Functional |
| Research Literature | Upstash CONTEXT_PAPERS | None needed | 13,118 publications indexed |
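
The matrix can also be kept as data so that a planner can look up which databases to touch. The following is a minimal sketch under that assumption: the rows are copied from the matrix, while `DbChoice`, `MATRIX`, and `databasesFor` are hypothetical names introduced here.

```typescript
// Hypothetical helper: row contents mirror the matrix, the names are illustrative.
interface DbChoice {
  primary: string[];
  secondary: string[];
  rationale: string;
}

const MATRIX: { [questionType: string]: DbChoice } = {
  "Patient Similarity": { primary: ["Upstash Patient Similarity"], secondary: [], rationale: "6,293 samples pre-computed" },
  "Age Prediction": { primary: ["Upstash Patient Similarity"], secondary: ["Neo4j"], rationale: "2,026 age-labeled samples ready for ML" },
  "Species-Pathway Co-occurrence": { primary: ["Neo4j"], secondary: ["Upstash DISCOVERY"], rationale: "61M pathway rels, 3072-dim vectors" },
  "Demographic Queries": { primary: ["Supabase"], secondary: ["Upstash Patient Similarity"], rationale: "Source of truth" },
  "Functional Analysis": { primary: ["Upstash DISCOVERY"], secondary: ["Neo4j"], rationale: "Pathway embeddings" },
  "Multi-View Learning": { primary: ["Upstash Patient Similarity", "Upstash DISCOVERY"], secondary: ["Neo4j"], rationale: "Taxonomic + Functional" },
  "Research Literature": { primary: ["Upstash CONTEXT_PAPERS"], secondary: [], rationale: "13,118 publications indexed" },
};

// Returns the databases to query, primary first. Secondary databases are
// added only when the caller asks for them (for example, for validation).
function databasesFor(questionType: string, includeSecondary = false): string[] {
  if (!Object.prototype.hasOwnProperty.call(MATRIX, questionType)) {
    return []; // unknown question type: go back to the pre-flight assessment
  }
  const row = MATRIX[questionType];
  return includeSecondary ? row.primary.concat(row.secondary) : row.primary.slice();
}

console.log(databasesFor("Age Prediction"));
console.log(databasesFor("Age Prediction", true));
```

Keeping the secondary database opt-in reflects the matrix's intent that it is consulted for validation rather than by default.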

Query Pattern Templates

Pattern 1: Patient Similarity + Demographics

// When: Questions about patient clustering, similarity, age prediction
// Use: Patient Similarity DB (already has 384-dim vectors with age metadata)

const ageDataSamples = await patientIndex.query({
  topK: 2026,  // All samples with customer data
  includeMetadata: true,
  filter: "age > 0 AND age < 120"
});

// NO NEED to query Neo4j or Supabase - data already enriched!

Pattern 2: Graph Relationships (Neo4j)

// When: Questions about species-pathway relationships, network analysis
// Use: Neo4j for graph traversal

MATCH (c:Customer)-[:HAS_SAMPLE]->(s:Sample)-[:HAS_SPECIES]->(sp:Species)
WHERE c.age IS NOT NULL
// Aggregate per species here; c is no longer in scope after WITH
WITH sp, avg(c.age) AS avg_age, count(s) AS sample_count
WHERE sample_count >= 5
RETURN sp.name AS species, avg_age, sample_count
ORDER BY sample_count DESC

Pattern 3: Multi-Database Synthesis

// When: Questions requiring data from multiple sources
// Strategy: Query in parallel, join in application

const [patientVectors, neo4jRelationships, functionalVectors] = await Promise.all([
  // Query 1: Patient Similarity (Upstash)
  patientIndex.query({ topK: 100, filter: "age > 60" }),

  // Query 2: Graph relationships (Neo4j)
  cypherQuery(`MATCH (c:Customer)-[:HAS_SAMPLE]->(s)
                WHERE c.age > 60
                RETURN c.customer_id, s.uuid`),

  // Query 3: Functional pathways (Upstash DISCOVERY)
  discoveryIndex.fetch(sampleIds)
]);

// Merge results in application layer
const combined = mergeMultiDBResults(patientVectors, neo4jRelationships, functionalVectors);
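
`mergeMultiDBResults` is not defined in this skill. A minimal sketch of what an application-layer join might look like is shown below. It assumes the three result sets have already been reduced to the simplified shapes declared here, and that the patient index ID, the Neo4j `s.uuid`, and the DISCOVERY ID all refer to the same sample identifier; both are assumptions, not documented facts.

```typescript
// Hypothetical, simplified result shapes for the three sources.
interface PatientHit { id: string; age: number; }
interface GraphRow { customer_id: string; uuid: string; }
interface FunctionalVector { id: string; vector: number[]; }

interface MergedSample {
  sampleId: string;
  age: number;
  customerId: string | null;   // null when Neo4j returned no row for the sample
  functional: number[] | null; // null when DISCOVERY has no vector for the sample
}

function mergeMultiDBResults(
  patients: PatientHit[],
  graphRows: GraphRow[],
  functional: (FunctionalVector | null)[],
): MergedSample[] {
  // Index the two secondary result sets by sample ID for O(1) lookups.
  const customerBySample = new Map<string, string>();
  for (const row of graphRows) {
    customerBySample.set(row.uuid, row.customer_id);
  }
  const vectorBySample = new Map<string, number[]>();
  for (const f of functional) {
    if (f !== null) {
      vectorBySample.set(f.id, f.vector);
    }
  }
  // Left join: every patient hit is kept, missing pieces become null.
  return patients.map((p) => ({
    sampleId: p.id,
    age: p.age,
    customerId: customerBySample.get(p.id) ?? null,
    functional: vectorBySample.get(p.id) ?? null,
  }));
}

const merged = mergeMultiDBResults(
  [{ id: "s1", age: 64 }, { id: "s2", age: 71 }],
  [{ customer_id: "c1", uuid: "s1" }],
  [{ id: "s1", vector: [0.1, 0.2] }, null],
);
console.log(merged);
```

Joining by ID rather than by array position avoids misaligned rows when one source returns fewer records than another.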

Phase 3: Intelligent Execution

Execution Principles

1. Query Only What's Needed
   - If Patient Similarity has the data, don't query Neo4j
   - If prior docs answer the question, don't query databases
   - If thresholds need adjustment, say so instead of trying all combinations
2. Parallel Execution
   - Group all independent queries in single message
   - Use MCP tools in parallel for different databases
   - Minimize round trips
3. Data Validation
   - Check for reasonable results (e.g., age 0-120, not negative)
   - Verify sample counts match expected ranges
   - Flag unexpected patterns
4. Efficiency Reporting
   - Document which databases were queried and why
   - Report which queries were avoided and why
   - Measure query execution time
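
The data validation principle can be made concrete with a small check. The sketch below assumes records have been reduced to an ID and an age, and uses the same bounds as the `age > 0 AND age < 120` filter found elsewhere in this skill; `AgeRecord`, `ValidationReport`, and `validateAgeRecords` are hypothetical names.

```typescript
// Hypothetical record and report shapes for illustration.
interface AgeRecord { id: string; age: number; }
interface ValidationReport {
  valid: AgeRecord[];
  rejected: AgeRecord[];         // implausible ages, flagged for review
  countInExpectedRange: boolean; // does the valid count match expectations?
}

function validateAgeRecords(
  records: AgeRecord[],
  expectedMin: number,
  expectedMax: number,
): ValidationReport {
  const valid: AgeRecord[] = [];
  const rejected: AgeRecord[] = [];
  for (const r of records) {
    // Reject NaN/Infinity and anything outside the plausible human range.
    const plausible = isFinite(r.age) && r.age > 0 && r.age < 120;
    if (plausible) {
      valid.push(r);
    } else {
      rejected.push(r);
    }
  }
  return {
    valid,
    rejected,
    countInExpectedRange: valid.length >= expectedMin && valid.length <= expectedMax,
  };
}

const validation = validateAgeRecords(
  [
    { id: "a", age: 34 },
    { id: "b", age: -2 },
    { id: "c", age: 130 },
    { id: "d", age: NaN },
  ],
  1,
  2,
);
console.log(validation.valid.length, validation.rejected.length);
```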

Example: Age Prediction Question

User Question: "Find age-related microbes for age prediction"

Traditional Approach (microbiome-data-expert skill):

1. Query Neo4j for species-age correlations
2. Find 5 species (data sparsity)
3. Search literature
4. Generate report
❌ Missed that Patient Similarity already has 2,026 age-labeled samples!

Strategic Approach (this skill):

Phase 1 - Pre-Flight Think:
- Patient Similarity DB has 2,026 samples WITH age metadata
- Already has 384-dim vectors (256 microbiome + 128 clinical)
- Multi-view learning ready (taxonomic + clinical features)
- Chen et al. 2022 methodology matches our infrastructure

Phase 2 - Strategic Plan:
- Primary: Use Patient Similarity DB (2,026 samples)
- Secondary: Query Neo4j only for validation
- Avoid: Supabase query (data already in Patient Similarity metadata)

Phase 3 - Execute:
const ageTrainingData = await patientIndex.query({
  topK: 2026,
  includeMetadata: true,
  filter: "age > 0"
});

✅ Result: 2,026 training samples (vs 5 species)
✅ Roughly 405x more data points (2,026 / 5), although samples and species are different units
✅ No redundant queries

Advanced Features

Feature 1: Inferred Data Integration

The inference system generates predicted age/health data for samples without customer data.

Check Inference System Outputs:

// Query Neo4j for inferred profiles
const inferredData = await cypherQuery(`
  MATCH (s:Sample)-[:HAS_INFERRED_PROFILE]->(i:InferredProfile)
  WHERE i.confidence > 0.7
  RETURN s.uuid, i.inferred_age, i.inferred_conditions, i.confidence
  LIMIT 1000
`);

// Combine with real data for semi-supervised learning
const combined = [...realAgeData, ...inferredData.map(d => ({
  age: d.inferred_age,
  confidence: d.confidence,
  is_inferred: true
}))];
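
When real and inferred ages are mixed like this, one common option is to give inferred examples a smaller training weight. The skill does not say how the combined set is consumed, so the sketch below is an assumption: it treats real ages as weight 1, uses the inference confidence as the weight for inferred ages, and reuses the 0.7 confidence cutoff from the Cypher query. `AgeExample` and `toWeightedExamples` are hypothetical names.

```typescript
// Hypothetical shapes: field names follow the combined array above.
interface AgeExample { age: number; confidence: number; is_inferred: boolean; }
interface WeightedExample extends AgeExample { weight: number; }

// Drops low-confidence inferred examples and attaches a sample weight.
function toWeightedExamples(
  examples: AgeExample[],
  minConfidence = 0.7,
): WeightedExample[] {
  return examples
    .filter((e) => !e.is_inferred || e.confidence > minConfidence)
    .map((e) => ({ ...e, weight: e.is_inferred ? e.confidence : 1 }));
}

const weighted = toWeightedExamples([
  { age: 40, confidence: 1, is_inferred: false },  // real customer age
  { age: 55, confidence: 0.9, is_inferred: true }, // kept, weight 0.9
  { age: 62, confidence: 0.5, is_inferred: true }, // dropped, below cutoff
]);
console.log(weighted);
```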

Feature 2: Two-Lane Architecture Awareness

Clinical vs Research lanes have different data quality/completeness:

// Clinical Lane: High-confidence, customer-linked data
const clinicalData = await patientIndex.query({
  filter: "lane = 'clinical' AND age > 0",
  topK: 2026
});

// Research Lane: May have inferred data
const researchData = await patientIndex.query({
  filter: "lane = 'research'",
  topK: 4267
});

Feature 3: Multi-View Learning Preparation

Combine taxonomic + functional features as recommended by Chen et al. 2022:

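// Step 1: Get taxonomic features from Patient Similarity
// fetch() returns vectors and metadata only when they are requested
const taxonomicFeatures = await patientIndex.fetch(sampleIds, {
  includeVectors: true,
  includeMetadata: true
});
// Returns 256-dim species embeddings + 128-dim clinical

// Step 2: Get functional features from DISCOVERY
const functionalFeatures = await discoveryIndex.fetch(sampleIds, { includeVectors: true });
// Returns 3072-dim pathway abundance vectors

// Step 3: Merge for multi-view learning, skipping samples missing from either index
// (DISCOVERY holds 1,762 samples, so not every sample ID will resolve)
const multiViewData = sampleIds
  .map((id, i) => ({ id, tax: taxonomicFeatures[i], fn: functionalFeatures[i] }))
  .filter(({ tax, fn }) => tax && fn)
  .map(({ id, tax, fn }) => ({
    sample_id: id,
    taxonomic: tax.vector,      // 384 dims
    functional: fn.vector,      // 3072 dims
    age: tax.metadata.age       // Target
  }));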

// Train dual models + meta-learner

Common Use Cases

Use Case 1: Age Prediction Model

Question: "Build age prediction model from microbiome"

Strategic Assessment:

✅ Patient Similarity has 2,026 age-labeled samples
✅ Already has 384-dim multi-modal features
✅ Ready for ML training
❌ No need to query Neo4j for species

Execution:

// Extract training data (single query)
const trainingData = await patientIndex.query({
  topK: 2026,
  includeMetadata: true,
  includeVectors: true,  // needed so d.vector is populated below
  filter: "age > 0 AND age < 120"
});

// Train model (no additional queries needed!)
// trainRandomForest stands in for whichever ML library is used
const model = trainRandomForest(
  trainingData.map(d => d.vector),  // 384-dim features
  trainingData.map(d => d.metadata.age)  // Target
);

Efficiency: 1 query vs 10+ in traditional approach

Use Case 2: Species-Pathway Co-occurrence

Question: "Which species co-occur with specific pathways?"

Strategic Assessment:

✅ Neo4j has 57M HAS_FUNCTION relationships
✅ Graph traversal optimal for this question
❌ Patient Similarity not needed (no similarity question)

Execution:

MATCH (s:Sample)-[:HAS_SPECIES]->(sp:Species),
      (s)-[:HAS_FUNCTION]->(p:Pathway)
WHERE sp.relative_abundance > 0.05
WITH sp.name as species, p.key as pathway, count(s) as co_occurrence
ORDER BY co_occurrence DESC
LIMIT 50
RETURN species, pathway, co_occurrence

Efficiency: Direct Neo4j query, no multi-DB needed

Use Case 3: Patient Clustering

Question: "Find patient subgroups by microbiome similarity"

Strategic Assessment:

✅ Patient Similarity DB designed for this exact use case
✅ 6,293 samples with pre-computed vectors
❌ No Neo4j query needed

Execution:

// Find similar patients to a reference
const similarPatients = await patientIndex.query({
  vector: referencePatient.vector,
  topK: 20,
  includeMetadata: true,
  filter: "constipation_severity > 5"
});

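// Cluster all patients
// Note: query() ranks by similarity to a query vector, so one is passed here
// (range() is the alternative for paging through every record)
const allSamples = await patientIndex.query({
  vector: referencePatient.vector,
  topK: 6293,
  includeMetadata: true,
  includeVectors: true  // needed so s.vector is populated below
});

const clusters = kMeansClustering(
  allSamples.map(s => s.vector),
  { numClusters: 5 }  // options object; a bare `numClusters: 5` argument is not valid JavaScript
);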

Efficiency: Vector DB optimized for this, <1 second query
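
`kMeansClustering` is not defined in this skill. For readers who want to see what such a helper might do, here is a minimal sketch of Lloyd's algorithm. The Euclidean distance, the first-k initialization, and the options-object signature are all assumptions chosen for brevity; a production pipeline would more likely use an established library and a better initialization such as k-means++.

```typescript
// Squared Euclidean distance between two vectors of equal length.
function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the centroid closest to the point (ties go to the lowest index).
function nearestCentroid(point: number[], centroids: number[][]): number {
  let best = 0;
  let bestDist = Infinity;
  for (let c = 0; c < centroids.length; c++) {
    const dist = squaredDistance(point, centroids[c]);
    if (dist < bestDist) {
      bestDist = dist;
      best = c;
    }
  }
  return best;
}

// Hypothetical signature. Returns one cluster index per input vector.
function kMeansClustering(
  vectors: number[][],
  options: { numClusters: number; maxIterations?: number },
): number[] {
  const k = options.numClusters;
  const maxIterations = options.maxIterations ?? 100;
  if (k < 1 || vectors.length < k) {
    throw new Error("need at least numClusters vectors");
  }
  // Deterministic initialization: the first k vectors become the centroids.
  let centroids = vectors.slice(0, k).map((v) => v.slice());
  let assignments: number[] = vectors.map(() => -1);
  const dims = vectors[0].length;

  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: attach each vector to its nearest centroid.
    const next = vectors.map((v) => nearestCentroid(v, centroids));
    const changed = next.some((a, i) => a !== assignments[i]);
    assignments = next;
    if (!changed) break; // converged

    // Update step: move each centroid to the mean of its members.
    const sums = centroids.map(() => new Array<number>(dims).fill(0));
    const counts = new Array<number>(k).fill(0);
    vectors.forEach((v, i) => {
      const c = assignments[i];
      counts[c] += 1;
      for (let d = 0; d < dims; d++) sums[c][d] += v[d];
    });
    // An empty cluster keeps its previous centroid.
    centroids = centroids.map((old, c) =>
      counts[c] === 0 ? old : sums[c].map((s) => s / counts[c]),
    );
  }
  return assignments;
}

const demoClusters = kMeansClustering(
  [[0, 0], [10, 10], [0, 1], [10, 11]],
  { numClusters: 2 },
);
console.log(demoClusters);
```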

Troubleshooting

Issue: Query Returns Too Few Results

Symptoms: Neo4j query returns 5 species when expecting 50+

Strategic Diagnosis:

Check filter thresholds (abundance >= 0.01 may be too strict)
Check customer count requirements (>= 10 may exclude data)
Verify age data availability in Neo4j vs Supabase

Solution:

// Lower thresholds incrementally
WHERE has.abundance >= 0.005  // Was 0.01
AND customer_count >= 5        // Was 10

Or Better: Check if Patient Similarity already has the data!

Issue: Duplicate Data Across Databases

Symptoms: Same sample appears in multiple databases

Strategic Understanding:

Patient Similarity is an ENRICHED copy (combines Neo4j + Supabase + Upstash)
Neo4j is the graph relationship store
Supabase is the source of truth for clinical data
Upstash DISCOVERY is functional pathway vectors
Upstash CONTEXT_PAPERS is research publications database
Upstash PRIMARY is for fast production taxonomic queries

Solution: Query the OPTIMAL database for your question:

Patient similarity questions → Patient Similarity DB
Graph relationship questions → Neo4j
Clinical metadata updates → Supabase

Issue: Inferred vs Real Data Confusion

Symptoms: Age values seem inconsistent

Diagnosis:

Real age: customer.age (from Supabase)
Inferred age: sample.estimated_age (from inference system)

Solution: Always check metadata source:

if (sample.metadata.patient_id === "unknown") {
  // This is inferred data, use with caution
  // Weight by estimation_confidence
}

Strategic Thinking Checklist

Before executing ANY database query, ask yourself:

Have I used sequential thinking to map existing systems?
Does Patient Similarity DB already have this data?
Have I checked prior session docs (Oct 22 Patient Similarity, etc.)?
Am I querying the OPTIMAL database for this question?
Can I answer this with fewer queries?
Am I avoiding redundant cross-database queries?
Have I planned for parallel execution?
Will I document why certain queries were avoided?

Resources

Architecture Documentation

System Documentation

Prior Analysis

Age Prediction Analysis Report

Key Principles

Think Before Execute: Use sequential thinking to assess resources
Leverage Existing Systems: Patient Similarity DB is often the answer
Minimize Queries: Query only what's truly needed
Multi-DB Awareness: Know which database specializes in what
Efficiency Reporting: Document avoided queries and rationale
Historical Context: Check prior session docs before querying

Remember: The best query is the one you don't have to run because the data already exists elsewhere.

Created: 2025-10-24
Category: Strategic Intelligence
Complexity: Advanced
Prerequisites: Multi-database architecture understanding
