con un clic
pydeseq2
Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.
Menú
Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization.
Build with and use Pi, the minimal terminal coding harness. Use for installing Pi, configuring providers/models/settings, creating Pi skills/extensions/packages/themes/prompt templates, embedding Pi through the SDK, integrating over RPC or JSON event streams, parsing sessions, or developing custom Pi providers and TUI components.
Query the CZ CELLxGENE Census programmatically for versioned public single-cell and spatial transcriptomics data. Use when you need population-scale cell metadata, gene expression slices, Census summary counts, source H5AD URIs/downloads, embeddings, spatial Census data, or reference atlas comparisons across organisms, tissues, diseases, assays, and cell types. For analyzing your own local single-cell data use scanpy, anndata, or scvi-tools.
NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles (TSS, peaks), for ChIP-seq, RNA-seq, ATAC-seq visualization.
DiffDock and DiffDock-L molecular docking. Use for protein-small-molecule pose prediction from PDB or sequence plus SMILES/SDF/MOL2, batch docking, virtual screening, and pose-confidence interpretation. Not for binding affinity prediction.
Use when working directly with the `esm` Python SDK, ESM3 or ESMC model IDs, Forge/Biohub inference clients, or ESMFold2 folding workflows.
Fast CLI/Python queries to 20+ bioinformatics databases. Use for quick lookups: gene info, BLAST/BLAT, viral sequence downloads, AlphaFold structures, enrichment analysis, OpenTargets, COSMIC, CELLxGENE, and 8cube mouse specificity/expression data. Best for interactive exploration and simple queries. For batch processing or advanced BLAST use biopython; for multi-database Python workflows use bioservices.
| name | pydeseq2 |
| description | Differential gene expression analysis for bulk RNA-seq with PyDESeq2, including formulaic designs, Wald tests, FDR correction, LFC shrinkage, and result visualization. |
| allowed-tools | Read Write Edit Bash |
| compatibility | Requires Python >=3.11 and PyDESeq2 0.5.4-compatible dependencies. Examples target PyDESeq2 0.5.x, formulaic design strings, explicit contrasts, and uv-based installs. |
| license | MIT license |
| metadata | {"version":"1.1","skill-author":"K-Dense Inc."} |
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including formulaic single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
This skill should be used when:
For users who want to perform a standard differential expression analysis:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# 3. Make the reference level explicit and fit DESeq2
metadata["condition"] = pd.Categorical(
metadata["condition"], categories=["control", "treated"]
)
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True,
inference=inference,
)
dds.deseq2()
# 4. Perform statistical testing
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
inference=inference,
)
ds.summary()
# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
Input requirements:
Common data loading patterns:
# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
Data filtering:
# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition" # Simple two-group comparison
Multi-factor designs:
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effects
Design formula guidelines:
C(variable) or a pandas categorical dtypedesign_factors, continuous_factors, or ref_level arguments in new workflowsInitialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True, # Refit after removing outliers
inference=inference,
low_memory=False,
)
# Run the complete DESeq2 pipeline
dds.deseq2()
What deseq2() does:
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"], # Test treated vs control
alpha=0.05, # Significance threshold
cooks_filter=True, # Filter outliers
independent_filter=True # Filter low-power tests
)
ds.summary()
Contrast specification:
[variable, test_level, reference_level]["condition", "treated", "control"] tests treated vs controlcontrastResult DataFrame columns:
baseMean: Mean normalized count across sampleslog2FoldChange: Log2 fold change between conditionslfcSE: Standard error of LFCstat: Wald test statisticpvalue: Raw p-valuepadj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink(coeff="condition[T.treated]") # Applies apeGLM shrinkage
When to use LFC shrinkage:
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Save results and intermediate objects:
# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")
# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
# Save a portable AnnData object for later inspection
dds.to_picklable_anndata().write_h5ad("dds_result.h5ad")
Avoid loading pickle files from untrusted sources. For exchange between agents, pipelines, or collaborators, prefer CSV results and .h5ad AnnData files.
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")
Control for technical variation:
# Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Include continuous variables like age or dosage:
# Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
This skill includes a complete command-line script for standard analyses:
# Basic usage
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~condition" \
--contrast condition treated control \
--output results/
# With additional options
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~batch + condition" \
--contrast condition treated control \
--output results/ \
--min-counts 10 \
--alpha 0.05 \
--n-cpus 4 \
--shrink-coeff "condition[T.treated]" \
--plots
Script features:
Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.
# Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]
# Filter by both significance and effect size
sig_and_large = ds.results_df[
(ds.results_df.padj < 0.05) &
(abs(ds.results_df.log2FoldChange) > 1)
]
# Separate up- and down-regulated
upregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]
print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
# Sort by adjusted p-value
top_by_padj = ds.results_df.sort_values("padj").head(20)
# Sort by absolute fold change (use shrunk values)
ds.lfc_shrink(coeff="condition[T.treated]")
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
# Sort by a combined metric
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
# Check normalization (size factors should be close to 1)
print("Size factors:", dds.obs["size_factors"])
# Examine dispersion estimates
import matplotlib.pyplot as plt
plt.hist(dds.var["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()
# Check p-value distribution (should be mostly flat with peak near 0)
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
Visualize significance vs effect size:
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
plt.figure(figsize=(10, 6))
significant = results.padj < 0.05
plt.scatter(
results.loc[~significant, "log2FoldChange"],
results.loc[~significant, "-log10(padj)"],
alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
results.loc[significant, "log2FoldChange"],
results.loc[significant, "-log10(padj)"],
alpha=0.6, s=10, c='red', label='padj < 0.05'
)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
Show fold change vs mean expression:
plt.figure(figsize=(10, 6))
plt.scatter(
np.log10(results.loc[~significant, "baseMean"] + 1),
results.loc[~significant, "log2FoldChange"],
alpha=0.3, s=10, c='gray'
)
plt.scatter(
np.log10(results.loc[significant, "baseMean"] + 1),
results.loc[significant, "log2FoldChange"],
alpha=0.6, s=10, c='red'
)
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)
Issue: "Index mismatch between counts and metadata"
Solution: Ensure sample names match exactly
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())
# Take intersection if needed
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
Issue: "All genes have zero counts"
Solution: Check if data needs transposition
print(f"Counts shape: {counts_df.shape}")
# If genes > samples, transpose is needed
if counts_df.shape[1] < counts_df.shape[0]:
counts_df = counts_df.T
Issue: "Design matrix is not full rank"
Cause: Confounded variables (e.g., all treated samples in one batch)
Solution: Remove confounded variable or add interaction term
# Check confounding
print(pd.crosstab(metadata.condition, metadata.batch))
# Either simplify design or add interaction
design = "~condition" # Remove batch
# OR
design = "~condition + batch + condition:batch" # Model interaction
Diagnostics:
# Check dispersion distribution
plt.hist(dds.var["dispersions"], bins=50)
plt.show()
# Check size factors
print(dds.obs["size_factors"])
# Look at top genes by raw p-value
print(ds.results_df.nsmallest(20, "pvalue"))
Possible causes:
For comprehensive details beyond this workflow-oriented guide:
API Reference (references/api_reference.md): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.
Workflow Guide (references/workflow_guide.md): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.
Load these references into context when users need:
Read references/api_reference.mdRead references/workflow_guide.mdRead references/workflow_guide.md (see Troubleshooting section)Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with .T if needed.
Sample filtering: Remove samples with missing metadata before analysis to avoid errors.
Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
Design formula order: Put adjustment variables before the variable of interest (e.g., "~batch + condition" not "~condition + batch").
LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
Result interpretation: Use padj < 0.05 for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.
Contrast specification: The format is [variable, test_level, reference_level] where test_level is compared against reference_level.
Save intermediate objects: Prefer dds.to_picklable_anndata().write_h5ad("dds_result.h5ad") for portable outputs. Only load pickle files that you created yourself and trust.
uv pip install pydeseq2==0.5.4
System requirements:
Optional for visualization: