Run any Skill in Manus with one click

$pwd:

tiledbvcf

Name: Tiledbvcf
Author: K-Dense-AI

// Efficient storage and retrieval of genomic variant data using TileDB. Scalable VCF/BCF ingestion, incremental sample addition, compressed storage, parallel queries, and export capabilities for population genomics.

Run Skill in Manus

$ git log --oneline --stat

stars:26,663

forks:2,760

updated:May 27, 2026 at 22:53

SKILL.md

readonly

related-skills.json

same repository

scvi-tools.md

from "K-Dense-AI/scientific-agent-skills"

Deep generative models for single-cell omics. Use when you need probabilistic batch correction (scVI), transfer learning, differential expression with uncertainty, or multi-modal integration (TOTALVI, MultiVI). Best for advanced modeling, batch effects, multimodal data. For standard analysis pipelines use scanpy.

2026-05-2826.7k

scikit-bio.md

from "K-Dense-AI/scientific-agent-skills"

Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.

2026-05-2826.7k

modal.md

from "K-Dense-AI/scientific-agent-skills"

Modal is a serverless cloud platform for running Python on demand, including on-demand GPUs. Use when deploying or serving AI/ML models, running GPU-accelerated workloads (training, fine-tuning, inference), serving web endpoints, scheduling batch jobs, or scaling Python code to cloud containers with the Modal SDK.

2026-05-2826.7k

bulk-rnaseq.md

from "K-Dense-AI/scientific-agent-skills"

End-to-end bulk RNA-seq orchestrator — takes raw FASTQ reads through QC and trimming (FastQC, fastp/Trim Galore), alignment and quantification (STAR, Salmon, featureCounts), assembles a gene-level counts matrix, then hands off to differential expression (pydeseq2), pathway/GSEA enrichment (pathway-enrichment), and publication figures (scientific-visualization). Use whenever the user has bulk RNA-seq reads or quant output and wants a complete, reproducible differential-expression workflow — e.g. "analyze my RNA-seq", "FASTQ to DESeq2", "run nf-core/rnaseq", "STAR/Salmon quantification", "build a counts matrix for DESeq2", or "go from reads to differentially expressed genes and enriched pathways". Routes between an nf-core/rnaseq (Nextflow) path and a standalone STAR/Salmon path, and covers experimental design, strandedness, and QC gates. For single-cell RNA-seq use the scanpy skill instead.

2026-05-2826.7k

pathway-enrichment.md

from "K-Dense-AI/scientific-agent-skills"

Run pathway and gene-set enrichment analysis on gene lists or ranked gene data, then interpret the results. Use whenever the user has a set of genes (differentially expressed genes from PyDESeq2/Scanpy, CRISPR-screen hits, cluster marker genes, proteomics hits) and wants to know which biological pathways, GO terms, or gene sets are over-represented or enriched. Covers over-representation analysis (ORA / Enrichr / Fisher / hypergeometric), ranked Gene Set Enrichment Analysis (GSEA / preranked), single-sample scoring (ssGSEA/GSVA), and functional profiling via gseapy, g:Profiler, Enrichr libraries, MSigDB, GO, KEGG, Reactome, and WikiPathways — plus gene-ID mapping, choosing the right background universe, multiple-testing correction, redundancy reduction, dotplots/enrichment maps, and publication-ready tables. Use this for "pathway analysis", "enrichment analysis", "GO enrichment", "KEGG/Reactome pathways", "GSEA", "over-representation", "functional annotation", or "what pathways are my genes in".

2026-05-2826.7k

nextflow.md

from "K-Dense-AI/scientific-agent-skills"

Build, run, and debug Nextflow data pipelines and nf-core workflows end to end. Use whenever the user mentions Nextflow, nf-core, .nf files, nextflow.config, DSL2, processes/channels/operators, samplesheets, or wants to run a community pipeline (e.g. nf-core/rnaseq, nf-core/sarek), write or test a module/subworkflow with nf-test, configure executors/containers (Docker, Singularity/Apptainer, Conda, Wave), scale a workflow to HPC/SLURM or cloud (AWS Batch, Google Batch, Azure, Kubernetes), or debug a failed/-resume run. Make sure to use this skill for any reproducible scientific/bioinformatics workflow work even if the user does not say the word "Nextflow", and for authoring nf-core-compliant pipelines, modules, configs, and linting.

2026-05-2826.7k

package.json

"author": "K-Dense-AI"

"repository": "K-Dense-AI/scientific-agent-skills"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name	tiledbvcf
description	Efficient storage and retrieval of genomic variant data using TileDB. Scalable VCF/BCF ingestion, incremental sample addition, compressed storage, parallel queries, and export capabilities for population genomics.
license	MIT license
metadata	{"version":"1.0","skill-author":"Jeremy Leipzig"}

TileDB-VCF

Overview

TileDB-VCF is a high-performance C++ library with Python and CLI interfaces for efficient storage and retrieval of genomic variant-call data. Built on TileDB's sparse array technology, it enables scalable ingestion of VCF/BCF files, incremental sample addition without expensive merging operations, and efficient parallel queries of variant data stored locally or in the cloud.

When to Use This Skill

This skill should be used when:

Learning TileDB-VCF concepts and workflows
Prototyping genomics analyses and pipelines
Working with small-to-medium datasets (< 1000 samples)
Need incremental addition of new samples to existing datasets
Require efficient querying of specific genomic regions across many samples
Working with cloud-stored variant data (S3, Azure, GCS)
Need to export subsets of large VCF datasets
Building variant databases for cohort studies
Educational projects and method development
Performance is critical for variant data operations

Quick Start

Installation

Preferred Method: Conda/Mamba

# Enter the following two lines if you are on a M1 Mac
CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64

# Create the conda environment
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf

# Mamba is a faster and more reliable alternative to conda
conda install -c conda-forge mamba

# Install TileDB-Py and TileDB-VCF, align with other useful libraries
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy

Alternative: Docker Images

docker pull tiledb/tiledbvcf-py     # Python interface
docker pull tiledb/tiledbvcf-cli    # Command-line interface

Basic Examples

Create and populate a dataset:

import tiledbvcf

# Create a new dataset
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w",
                      cfg=tiledbvcf.ReadConfig(memory_budget=1024))

# Ingest VCF files (must be single-sample with indexes)
# Requirements:
# - VCFs must be single-sample (not multi-sample)
# - Must have indexes: .csi (bcftools) or .tbi (tabix)
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])

Query variant data:

# Open existing dataset for reading
ds = tiledbvcf.Dataset(uri="my_dataset", mode="r")

# Query specific regions and samples
df = ds.read(
    attrs=["sample_name", "pos_start", "pos_end", "alleles", "fmt_GT"],
    regions=["chr1:1000000-2000000", "chr2:500000-1500000"],
    samples=["sample1", "sample2", "sample3"]
)
print(df.head())

Export to VCF:

import os

# Export two VCF samples
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=["HG00101", "HG00097"],
    output_format="v",
    output_dir=os.path.expanduser("~"),
)

Core Capabilities

1. Dataset Creation and Ingestion

Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.

Requirements:

Single-sample VCFs only: Multi-sample VCFs are not supported
Index files required: VCF/BCF files must have indexes (.csi or .tbi)

Common operations:

Create new datasets with optimized array schemas
Ingest single or multiple VCF/BCF files in parallel
Add new samples incrementally without re-processing existing data
Configure memory usage and compression settings
Handle various VCF formats and INFO/FORMAT fields
Resume interrupted ingestion processes
Validate data integrity during ingestion

2. Efficient Querying and Filtering

Query variant data with high performance across genomic regions, samples, and variant attributes. This is appropriate for association studies, variant discovery, and population analysis.

Common operations:

Query specific genomic regions (single or multiple)
Filter by sample names or sample groups
Extract specific variant attributes (position, alleles, genotypes, quality)
Access INFO and FORMAT fields efficiently
Combine spatial and attribute-based filtering
Stream large query results
Perform aggregations across samples or regions

3. Data Export and Interoperability

Export data in various formats for downstream analysis or integration with other genomics tools. This is appropriate for sharing datasets, creating analysis subsets, or feeding other pipelines.

Common operations:

Export to standard VCF/BCF formats
Generate TSV files with selected fields
Create sample/region-specific subsets
Maintain data provenance and metadata
Lossless data export preserving all annotations
Compressed output formats
Streaming exports for large datasets

4. Population Genomics Workflows

TileDB-VCF excels at large-scale population genomics analyses requiring efficient access to variant data across many samples and genomic regions.

Common workflows:

Genome-wide association studies (GWAS) data preparation
Rare variant burden testing
Population stratification analysis
Allele frequency calculations across populations
Quality control across large cohorts
Variant annotation and filtering
Cross-population comparative analysis

Key Concepts

Array Schema and Data Model

TileDB-VCF Data Model:

Variants stored as sparse arrays with genomic coordinates as dimensions
Samples stored as attributes allowing efficient sample-specific queries
INFO and FORMAT fields preserved with original data types
Automatic compression and chunking for optimal storage

Schema Configuration:

# Custom schema with specific tile extents
config = tiledbvcf.ReadConfig(
    memory_budget=2048,  # MB
    region_partition=(0, 3095677412),  # Full genome
    sample_partition=(0, 10000)  # Up to 10k samples
)

Coordinate Systems and Regions

Critical: TileDB-VCF uses 1-based genomic coordinates following VCF standard:

Positions are 1-based (first base is position 1)
Ranges are inclusive on both ends
Region "chr1:1000-2000" includes positions 1000-2000 (1001 bases total)

Region specification formats:

# Single region
regions = ["chr1:1000000-2000000"]

# Multiple regions
regions = ["chr1:1000000-2000000", "chr2:500000-1500000"]

# Whole chromosome
regions = ["chr1"]

# BED-style (0-based, half-open converted internally)
regions = ["chr1:999999-2000000"]  # Equivalent to 1-based chr1:1000000-2000000

Memory Management

Performance considerations:

Set appropriate memory budget based on available system memory
Use streaming queries for very large result sets
Partition large ingestions to avoid memory exhaustion
Configure tile cache for repeated region access
Use parallel ingestion for multiple files
Optimize region queries by combining nearby regions

Cloud Storage Integration

TileDB-VCF seamlessly works with cloud storage:

# S3 dataset
ds = tiledbvcf.Dataset(uri="s3://bucket/dataset", mode="r")

# Azure Blob Storage
ds = tiledbvcf.Dataset(uri="azure://container/dataset", mode="r")

# Google Cloud Storage
ds = tiledbvcf.Dataset(uri="gcs://bucket/dataset", mode="r")

Common Pitfalls

Memory exhaustion during ingestion: Use appropriate memory budget and batch processing for large VCF files
Inefficient region queries: Combine nearby regions instead of many separate queries
Missing sample names: Ensure sample names in VCF headers match query sample specifications
Coordinate system confusion: Remember TileDB-VCF uses 1-based coordinates like VCF standard
Large result sets: Use streaming or pagination for queries returning millions of variants
Cloud permissions: Ensure proper authentication for cloud storage access
Concurrent access: Multiple writers to the same dataset can cause corruption—use appropriate locking

CLI Usage

TileDB-VCF provides a command-line interface with the following subcommands:

Available Subcommands:

create - Creates an empty TileDB-VCF dataset
store - Ingests samples into a TileDB-VCF dataset
export - Exports data from a TileDB-VCF dataset
list - Lists all sample names present in a TileDB-VCF dataset
stat - Prints high-level statistics about a TileDB-VCF dataset
utils - Utils for working with a TileDB-VCF dataset
version - Print the version information and exit

# Create empty dataset
tiledbvcf create --uri my_dataset

# Ingest samples (requires single-sample VCFs with indexes)
tiledbvcf store --uri my_dataset --samples sample1.vcf.gz,sample2.vcf.gz

# Export data
tiledbvcf export --uri my_dataset \
  --regions "chr1:1000000-2000000" \
  --sample-names "sample1,sample2"

# List all samples
tiledbvcf list --uri my_dataset

# Show dataset statistics
tiledbvcf stat --uri my_dataset

Advanced Features

Allele Frequency Analysis

# Calculate allele frequencies
af_df = tiledbvcf.read_allele_frequency(
    uri="my_dataset",
    regions=["chr1:1000000-2000000"],
    samples=["sample1", "sample2", "sample3"]
)

Sample Quality Control

# Perform sample QC
qc_results = tiledbvcf.sample_qc(
    uri="my_dataset",
    samples=["sample1", "sample2"]
)

Custom Configurations

# Advanced configuration
config = tiledbvcf.ReadConfig(
    memory_budget=4096,
    tiledb_config={
        "sm.tile_cache_size": "1000000000",
        "vfs.s3.region": "us-east-1"
    }
)

Resources

Getting Help

Open Source TileDB-VCF Resources

Open Source Documentation:

TileDB Academy: https://cloud.tiledb.com/academy/
Population Genomics Guide: https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/
TileDB-VCF GitHub: https://github.com/TileDB-Inc/TileDB-VCF

TileDB-Cloud Resources

For Large-Scale/Production Genomics:

TileDB-Cloud Platform: https://cloud.tiledb.com
TileDB Academy (All Documentation): https://cloud.tiledb.com/academy/

Getting Started:

Free account signup: https://cloud.tiledb.com
Contact: sales@tiledb.com for enterprise needs

Scaling to TileDB-Cloud

When your genomics workloads outgrow single-node processing, TileDB-Cloud provides enterprise-scale capabilities for production genomics pipelines.

Note: This section covers TileDB-Cloud capabilities based on available documentation. For complete API details and current functionality, consult the official TileDB-Cloud documentation and API reference.

Setting Up TileDB-Cloud

1. Create Account and Get API Token

# Sign up at https://cloud.tiledb.com
# Generate API token in your account settings

2. Install TileDB-Cloud Python Client

# Base installation
pip install tiledb-cloud

# With genomics-specific functionality
pip install tiledb-cloud[life-sciences]

3. Configure Authentication

# Set environment variable with your API token
export TILEDB_REST_TOKEN="your_api_token"

import tiledb.cloud

# Authentication is automatic via TILEDB_REST_TOKEN
# No explicit login required in code

Migrating from Open Source to TileDB-Cloud

Large-Scale Ingestion

# TileDB-Cloud: Distributed VCF ingestion
import tiledb.cloud.vcf

# Use specialized VCF ingestion module
# Note: Exact API requires TileDB-Cloud documentation
# This represents the available functionality structure
tiledb.cloud.vcf.ingestion.ingest_vcf_dataset(
    source="s3://my-bucket/vcf-files/",
    output="tiledb://my-namespace/large-dataset",
    namespace="my-namespace",
    acn="my-s3-credentials",
    ingest_resources={"cpu": "16", "memory": "64Gi"}
)

Distributed Query Processing

# TileDB-Cloud: VCF querying across distributed storage
import tiledb.cloud.vcf
import tiledbvcf

# Define the dataset URI
dataset_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"

# Get all samples from the dataset
ds = tiledbvcf.Dataset(dataset_uri, tiledb_config=cfg)
samples = ds.samples()

# Define attributes and ranges to query on
attrs = ["sample_name", "fmt_GT", "fmt_AD", "fmt_DP"]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]

# Perform the read, which is executed in a distributed fashion
df = tiledb.cloud.vcf.read(
    dataset_uri=dataset_uri,
    regions=regions,
    samples=samples,
    attrs=attrs,
    namespace="my-namespace",  # specifies which account to charge
)
df.to_pandas()

Enterprise Features

Data Sharing and Collaboration

# TileDB-Cloud provides enterprise data sharing capabilities
# through namespace-based permissions and group management

# Access shared datasets via TileDB-Cloud URIs
dataset_uri = "tiledb://shared-namespace/population-study"

# Collaborate through shared notebooks and compute resources
# (Specific API requires TileDB-Cloud documentation)

Cost Optimization

Serverless Compute: Pay only for actual compute time
Auto-scaling: Automatically scale up/down based on workload
Spot Instances: Use cost-optimized compute for batch jobs
Data Tiering: Automatic hot/cold storage management

Security and Compliance

End-to-end Encryption: Data encrypted in transit and at rest
Access Controls: Fine-grained permissions and audit logs
HIPAA/SOC2 Compliance: Enterprise security standards
VPC Support: Deploy in private cloud environments

When to Migrate Checklist

✅ Migrate to TileDB-Cloud if you have:

Datasets > 1000 samples
Need to process > 100GB of VCF data
Require distributed computing
Multiple team members need access
Need enterprise security/compliance
Want cost-optimized serverless compute
Require 24/7 production uptime

Getting Started with TileDB-Cloud

Start Free: TileDB-Cloud offers free tier for evaluation
Migration Support: TileDB team provides migration assistance
Training: Access to genomics-specific tutorials and examples
Professional Services: Custom deployment and optimization

Next Steps:

Visit https://cloud.tiledb.com to create account
Review documentation at https://cloud.tiledb.com/academy/
Contact sales@tiledb.com for enterprise needs

tiledbvcf

More from this repository

More from this repository

TileDB-VCF

Overview

When to Use This Skill

Quick Start

Installation

Basic Examples

Core Capabilities

1. Dataset Creation and Ingestion

2. Efficient Querying and Filtering

3. Data Export and Interoperability

4. Population Genomics Workflows

Key Concepts

Array Schema and Data Model

Coordinate Systems and Regions

Memory Management

Cloud Storage Integration

Common Pitfalls

CLI Usage

Advanced Features

Allele Frequency Analysis

Sample Quality Control

Custom Configurations

Resources

Getting Help

Open Source TileDB-VCF Resources

TileDB-Cloud Resources

Scaling to TileDB-Cloud

Setting Up TileDB-Cloud

Migrating from Open Source to TileDB-Cloud

Enterprise Features

When to Migrate Checklist

Getting Started with TileDB-Cloud

TileDB-VCF

Overview

When to Use This Skill

Quick Start

Installation

Basic Examples

Core Capabilities

1. Dataset Creation and Ingestion

2. Efficient Querying and Filtering

3. Data Export and Interoperability

4. Population Genomics Workflows

Key Concepts

Array Schema and Data Model

Coordinate Systems and Regions

Memory Management

Cloud Storage Integration

Common Pitfalls

CLI Usage

Advanced Features

Allele Frequency Analysis

Sample Quality Control

Custom Configurations

Resources

Getting Help

Open Source TileDB-VCF Resources

TileDB-Cloud Resources

Scaling to TileDB-Cloud

Setting Up TileDB-Cloud

Migrating from Open Source to TileDB-Cloud

Enterprise Features

When to Migrate Checklist

Getting Started with TileDB-Cloud