omics-analysis-guide
Three-tiered approach to omics data analysis (transcriptomics, proteomics) covering validated pipelines, standard workflows, and custom methods
| name | omics-analysis-guide |
| description | Three-tiered approach to omics data analysis (transcriptomics, proteomics) covering validated pipelines, standard workflows, and custom methods |
| license | open |
Short Description: Comprehensive guide for analyzing omics data (transcriptomics, proteomics) using validated pipelines, standard workflows, or custom analysis methods.
Authors: HITS
Version: 1.0
Last Updated: December 2025
License: CC BY 4.0
Commercial Use: Allowed
This guide provides a three-tiered approach to omics data analysis, prioritizing validated pipelines and standard workflows before moving to custom analysis. Always start with Option 1 and proceed to subsequent options only if needed.
The guide covers:
Note: This guide focuses on analysis of already-quantified data. For raw data processing (alignment, quantification), refer to specialized tools and pipelines.
A validated pipeline is a specific tool with peer-reviewed benchmarking data demonstrating performance on data like yours (e.g., DESeq2 for RNA-seq counts, MaxQuant for label-free proteomics). A standard workflow is the canonical sequence of QC → normalization → statistical test → multiple-testing correction assembled from accepted community practice but tuned to your specific dataset. Custom analysis is bespoke statistical or computational modeling required when neither prior tier covers the data type or research question. The progression Option 1 → Option 2 → Option 3 trades reproducibility for flexibility — always exhaust earlier tiers first.
Missing data in omics arises from three distinct mechanisms with different correct treatments. MCAR (Missing Completely At Random) means missingness is independent of any value — safe to impute with mean, median, or KNN. MAR (Missing At Random) means missingness depends on observed variables but not the unobserved value — KNN or model-based imputation is appropriate. MNAR (Missing Not At Random) means missingness depends on the missing value itself, typical in proteomics where low-abundance proteins drop below detection — requires left-censored imputation (minprob/QRILC) below the detection limit. Choosing the wrong mechanism systematically biases downstream statistics.
Parametric tests (Student's t-test, Welch's t-test) assume approximate normality and (for Student's) equal variances; they have higher power than non-parametric tests when assumptions hold. Non-parametric tests (Mann-Whitney U, permutation) make weaker assumptions and are correct under skewed distributions or small n, at the cost of statistical power. The choice depends on sample size (n < 10 favors non-parametric), normality (Shapiro-Wilk / Anderson-Darling at the feature level), variance homogeneity (Levene's test), and outlier prevalence.
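The selection criteria above can be sketched as a small helper for a single feature. This is a minimal illustration, not a prescribed procedure: the function names `choose_test` and `run_test`, the `min_n=10` cutoff (taken from the n < 10 guidance above), and the 0.05 threshold applied to the assumption checks are all choices made for this example.

```python
import numpy as np
from scipy import stats

def choose_test(a, b, alpha=0.05, min_n=10):
    """Pick a two-group test for one feature from simple assumption checks."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    if min(a.size, b.size) < min_n:
        return "mann-whitney"              # small n: prefer the non-parametric test
    # Shapiro-Wilk on each group; a small p-value is evidence against normality
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if not normal:
        return "mann-whitney"
    # Levene's test for variance homogeneity decides Student vs. Welch
    equal_var = stats.levene(a, b).pvalue > alpha
    return "student-t" if equal_var else "welch-t"

def run_test(a, b, **kwargs):
    """Run the chosen test and return (test name, two-sided p-value)."""
    name = choose_test(a, b, **kwargs)
    if name == "mann-whitney":
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    else:
        p = stats.ttest_ind(a, b, equal_var=(name == "student-t")).pvalue
    return name, float(p)
```

In practice the checks are usually run on a representative subset of features and one test is then used for the whole dataset, since switching tests feature by feature makes results harder to compare.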
Omics analyses test thousands of features simultaneously. Without correction, testing 20,000 genes at α=0.05 yields about 1,000 expected false positives even when no gene is truly differential (20,000 × 0.05). Family-wise error rate (FWER) corrections like Bonferroni control the probability of any false positive but are conservative. False discovery rate (FDR) corrections like Benjamini-Hochberg control the expected proportion of false positives among reported significant features and are the standard for omics. Always report adjusted p-values, never raw p-values, when calling significance.
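The arithmetic behind the 1,000 figure can be checked with a short simulation. The sketch below assumes the global null (no feature truly differs), under which p-values are uniformly distributed; the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, alpha = 20_000, 0.05

# Under the global null, p-values follow Uniform(0, 1)
p_null = rng.uniform(size=n_features)

expected_fp = n_features * alpha                          # 20,000 x 0.05 = 1,000
observed_fp = int((p_null < alpha).sum())                 # raw threshold, one simulated dataset
bonferroni_fp = int((p_null < alpha / n_features).sum())  # per-test cutoff of 2.5e-6

print(expected_fp, observed_fp, bonferroni_fp)
```

The raw count lands close to the expected value in any single run, while the Bonferroni cutoff removes almost everything, which illustrates why it is considered conservative for discovery-oriented omics work.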
Use this tree to choose the right analysis tier for your data:
Have you searched for a validated pipeline matching your data type?
├── NO  → Run Method 1 (literature) AND Method 2 (consortia workflows) FIRST
└── YES → Did you find a validated pipeline with benchmarks matching your data type and biological question?
    ├── YES → OPTION 1: Use validated pipeline (e.g., DESeq2, edgeR, MaxQuant)
    └── NO  → Is your data a common type (RNA-seq counts, pre-quantified proteomics)?
        ├── YES → OPTION 2: Standard workflow (QC → norm → test → FDR)
        └── NO  → OPTION 3: Custom analysis (consult statistician; document thoroughly)
| Data type | Sample size | Has validated pipeline? | Recommended tier | Specific tool / approach |
|---|---|---|---|---|
| Bulk RNA-seq counts | n ≥ 3/group | Yes (DESeq2, edgeR) | Option 1 | DESeq2 (negative binomial; call significance at padj < 0.05, setting `alpha = 0.05` in `results()` because its default is 0.1) |
| Pre-quantified proteomics, normally distributed | n ≥ 5/group | Sometimes | Option 1 if pipeline matches; else Option 2 | limma or t-test + BH-FDR |
| Pre-quantified proteomics, MNAR-heavy | n ≥ 5/group | No (mechanism-specific) | Option 2 | minprob imputation → t-test or Mann-Whitney → BH-FDR |
| Small-cohort omics (n < 5) | n < 5 | Rarely | Option 2 with caution | Permutation test, report effect sizes; flag results as preliminary |
| Multi-omics integration | Variable | Limited | Option 3 | MOFA, DIABLO, or custom Bayesian model |
| Novel data type (e.g., spatial multi-omics) | Variable | No | Option 3 | Build from first principles; cross-validate |
| Time-series omics | n per timepoint | Sometimes (maSigPro, ImpulseDE2) | Option 1 if available; else Option 3 | maSigPro for transcriptomics; custom for proteomics |
IMPORTANT: You MUST complete BOTH Method 1 AND Method 2 before proceeding to Option 2. Do not skip Method 2 even if Method 1 finds no results.
Search for validated analysis methods using web search tools or literature databases (PubMed, Google Scholar).
Search queries to try (use multiple):
"[DATA_TYPE]" "[ANALYSIS_TYPE]" validated pipeline best practices
"[DATA_TYPE]" analysis workflow "[ORGANISM]" published
"[DATA_TYPE]" "[TOOL_NAME]" validation benchmark comparison
Example for bulk RNA-seq:
"RNA-seq" "differential expression" validated pipeline human
"DESeq2" "edgeR" comparison validation RNA-seq
Example for proteomics:
"proteomics" "differential abundance" analysis validated methods
"proteomics" normalization imputation best practices
What to search for in results:
IMPORTANT: Spend adequate time searching literature. Look through at least the first 10-15 search results and check supplementary materials of relevant papers.
Review established workflows from major consortia and publications:
If you find validated pipelines or methods:
Example result format:
Data Type: Bulk RNA-seq
Analysis Goal: Differential expression
Pipeline: DESeq2 (v1.40.0)
Reference: Love MI, et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. PMID: 25516281
Validation: Validated in multiple benchmark studies, recommended for count data
Parameters: Default parameters, FDR < 0.05, |log2FC| > 1
If no validated pipelines are found in BOTH Method 1 AND Method 2: only then proceed to Option 2: Use Standard Workflows
RNA-seq (Bulk):
Proteomics (Pre-quantified):
CRITICAL: Quality control must be performed before any statistical analysis. Poor data quality will lead to unreliable results regardless of statistical methods used.
Check for outlier samples:
Check sample correlation:
Check for batch effects:
Assess missing value patterns:
Check feature detection consistency:
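The sample-level checks listed above (outliers, sample correlation, PCA for batch inspection) can be sketched with NumPy alone. This is a simplified illustration: it flags outliers with a robust z-score on each sample's median correlation rather than with Isolation Forest, and the function name `sample_qc` and the `z_cut=3.0` cutoff are assumptions of this example.

```python
import numpy as np

def sample_qc(X, z_cut=3.0):
    """X: features x samples matrix of log-scale values without missing entries.

    Returns PCA scores (samples x components), explained-variance ratios,
    and the indices of samples flagged as low-correlation outliers.
    """
    X = np.asarray(X, float)
    centered = X - X.mean(axis=1, keepdims=True)       # center each feature
    # PCA of the samples via SVD: centered = U @ diag(s) @ Vt
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = (Vt * s[:, None]).T                        # sample coordinates on each PC
    var_ratio = s**2 / np.sum(s**2)

    # Sample-sample Pearson correlation, summarised per sample by its median
    corr = np.corrcoef(X.T)
    np.fill_diagonal(corr, np.nan)                      # ignore self-correlation
    med_corr = np.nanmedian(corr, axis=1)

    # Robust z-score (median / MAD); strongly negative means poorly correlated
    center = np.median(med_corr)
    mad = np.median(np.abs(med_corr - center))
    z = (med_corr - center) / (1.4826 * mad + 1e-12)
    outliers = np.where(z < -z_cut)[0]
    return scores, var_ratio, outliers
```

Plotting the first two columns of `scores` colored by condition and by batch is the usual next step; flagged samples should be inspected before removal rather than dropped automatically.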
CRITICAL: Choose imputation method based on missing value mechanism:
MNAR (Missing Not At Random): Use minimum probability imputation (minprob)
MCAR/MAR (Missing Completely/At Random): Use KNN imputation
Simple methods (if few missing values):
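As a rough illustration of mechanism-aware imputation, the sketch below first checks whether missingness tracks intensity and then applies a left-censored fill for the MNAR case. It is loosely modelled on the minimum-probability idea and is not the implementation of any particular package; the function names and the `q=0.01` and `scale=0.3` settings are hypothetical choices for this example. For MCAR/MAR data a KNN imputer (for instance scikit-learn's `KNNImputer`) would be used instead.

```python
import numpy as np

def missingness_vs_intensity(X):
    """Correlation between mean observed intensity and missing rate per feature.

    X: features x samples, log scale, NaN = missing. A clearly negative value
    suggests intensity-dependent (MNAR-like) missingness.
    """
    X = np.asarray(X, float)
    miss_rate = np.isnan(X).mean(axis=1)
    ok = miss_rate < 1.0                         # skip features with no observed value
    mean_int = np.nanmean(X[ok], axis=1)
    return float(np.corrcoef(mean_int, miss_rate[ok])[0, 1])

def impute_minprob(X, q=0.01, scale=0.3, seed=0):
    """Fill NaNs per sample with draws from a narrow normal near a low quantile."""
    rng = np.random.default_rng(seed)            # fixed seed for reproducibility
    X = np.array(X, float)                       # copy so the input is untouched
    for j in range(X.shape[1]):
        col = X[:, j]                            # view into the copy
        miss = np.isnan(col)
        if miss.any():
            center = np.nanquantile(col, q)      # low end of the observed values
            sd = np.nanstd(col) * scale          # narrower than the observed spread
            col[miss] = rng.normal(center, sd, miss.sum())
    return X
```

Observed values are left unchanged, so the imputed matrix can be compared against the original to confirm that only missing entries were filled.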
For RNA-seq count data: Normalization is typically handled by DESeq2/edgeR (size factors).
For proteomics/continuous data:
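One widely used option for continuous intensity data is a log2 transform followed by median centering of each sample. The sketch below assumes a features x samples matrix of raw intensities; the function name and the treatment of zeros as missing are assumptions of this example, and other normalizations may suit a given dataset better.

```python
import numpy as np

def log2_median_normalize(intensity):
    """Log2-transform raw intensities and align sample medians."""
    X = np.array(intensity, float)
    X[X <= 0] = np.nan                           # treat zeros as missing, not as signal
    logX = np.log2(X)
    sample_median = np.nanmedian(logX, axis=0)   # one median per sample (column)
    # Shift each sample so its median matches the median of all sample medians
    return logX - sample_median + np.nanmedian(sample_median)
```

Median centering assumes that most features do not change between samples; if a large fraction of the proteome shifts, that assumption should be revisited.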
CRITICAL: Always check statistical test assumptions before performing analysis. Using the wrong test can lead to incorrect conclusions.
Key checks to perform:
Normality test:
Variance homogeneity test:
Sample size check:
Outlier check:
Test selection logic:
Implementation steps:
For each feature:
Apply FDR correction:
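The two implementation steps above (test each feature, then correct across features) might look like the following. This is a sketch under stated assumptions: it uses Welch's t-test for every feature, expects complete log2-scale data, and implements Benjamini-Hochberg by hand so the procedure is visible. `multipletests(..., method='fdr_bh')` from statsmodels is the library equivalent. The function names are illustrative.

```python
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (input must not contain NaN)."""
    p = np.asarray(p, float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)          # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(ranked, 0, 1)
    return adj

def differential_test(group_a, group_b):
    """group_a, group_b: features x samples arrays on a log2 scale.

    Returns log2 fold change (b minus a), raw p-values, and BH-adjusted p-values.
    """
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    log2fc = b.mean(axis=1) - a.mean(axis=1)
    # Welch's t-test per feature (row); swap the test if assumptions fail
    p = stats.ttest_ind(b, a, axis=1, equal_var=False).pvalue
    return log2fc, p, bh_adjust(p)
```

Adjusted values are never smaller than the raw p-values, which is a quick sanity check on any FDR implementation.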
Key libraries:
scipy.stats: Statistical tests (ttest_ind, mannwhitneyu, shapiro, levene)
statsmodels.stats.multitest: FDR correction (multipletests with method='fdr_bh')
Volcano Plot:
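A volcano plot shows log2 fold change on the x-axis against -log10 of the adjusted p-value on the y-axis. The sketch below prepares the y-values and a category per feature using the |log2FC| > 1 and p_adj < 0.05 thresholds mentioned in this guide; the function name and labels are illustrative, and the plotting call itself (for example a matplotlib scatter) is left out.

```python
import numpy as np

def volcano_table(log2fc, padj, lfc_cut=1.0, alpha=0.05):
    """Return y-axis values and an 'up' / 'down' / 'ns' label per feature."""
    log2fc = np.asarray(log2fc, float)
    padj = np.asarray(padj, float)
    y = -np.log10(np.clip(padj, 1e-300, 1.0))    # clip avoids log10(0)
    label = np.full(log2fc.shape, "ns", dtype=object)
    sig = padj < alpha
    label[sig & (log2fc > lfc_cut)] = "up"       # significant and large positive change
    label[sig & (log2fc < -lfc_cut)] = "down"    # significant and large negative change
    return y, label
```

Requiring both conditions keeps features with a large but unreliable fold change, or a reliable but negligible one, in the non-significant group.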
PCA Plot (for quality control):
Once you have completed the standard workflow:
If standard workflows don't meet your needs: Proceed to Option 3: Custom Analysis
Data Quality: Ensure high-quality data before custom analysis
Statistical Rigor:
Reproducibility:
Validation:
Step 1: Quality Control
Step 2: For RNA-seq count data, use DESeq2 (typically in R)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = count_matrix, colData = sample_metadata, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition", "treatment", "control"))
Step 3: Functional Enrichment (optional)
Step 1: Quality Control
Step 2: Impute Missing Values
Step 3: Normalization
Step 4: Check for Batch Effects
Step 5: Differential Abundance Analysis
Step 6: Visualization
Exhaust validated pipelines before building anything custom. Run both literature search and consortium-workflow review before falling back to bespoke analysis. Rationale: validated pipelines have peer-reviewed benchmarking; novel methods require their own validation effort and reduce reproducibility.
Perform sample-level QC before any statistical analysis. Use PCA + Isolation Forest for outlier detection, sample correlation matrices, and PCA + silhouette score for batch effects. Rationale: a single outlier sample or unrecognized batch effect can dominate test statistics and produce uninterpretable results regardless of the test chosen.
Diagnose the missing-value mechanism (MCAR / MAR / MNAR) before imputing. Check the correlation between mean intensity and missingness rate per feature. Rationale: imputing MNAR data with KNN biases low-abundance features upward; imputing MCAR data with minprob biases everything downward. Mechanism-aware imputation prevents systematic distortion.
Always check test assumptions, then choose the test — never the reverse. Run Shapiro-Wilk / Anderson-Darling for normality and Levene's for variance homogeneity on a representative feature subset. Rationale: applying a t-test to non-normal small-n data inflates type I error; defaulting to Mann-Whitney on well-behaved data wastes power.
Always apply FDR correction (Benjamini-Hochberg) for genome-wide tests. Report p_adj (or q-value), not raw p. Rationale: with 20,000 genes tested at α=0.05, ~1,000 false positives are expected without correction — the result set is meaningless.
Document every parameter and version, save intermediate outputs, and pin random seeds. Record tool version, parameter values, normalization method, imputation method, test choice, FDR threshold, and the seed for any stochastic step. Rationale: omics pipelines have many tunable knobs; without exact provenance the analysis cannot be reproduced or audited.
Validate findings on an independent dataset or with an orthogonal method whenever possible. Examples: confirm DE genes via qPCR, replicate in a public dataset (GEO, ArrayExpress), or compare across batches. Rationale: even FDR-controlled hits can be false positives driven by batch artifacts, contamination, or normalization choices.
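The provenance items named in the documentation practice above (tool versions, parameters, seed) can be captured in a small machine-readable record saved next to the results. The field names and example values below are hypothetical; only the standard library is used.

```python
import json
import platform
import sys
from datetime import datetime, timezone

def provenance_record(params, seed, tool_versions):
    """Collect what is needed to rerun an analysis into one JSON-friendly dict."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,                       # seed used for every stochastic step
        "tool_versions": tool_versions,     # e.g. taken from each package's version string
        "params": params,
    }

record = provenance_record(
    params={"normalization": "median", "imputation": "minprob",
            "test": "welch-t", "fdr_threshold": 0.05},
    seed=42,
    tool_versions={"DESeq2": "1.40.0"},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing this record at the start of a run, rather than reconstructing it afterwards, avoids gaps when parameters change between iterations.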
Skipping QC and going directly to statistics.
Problem: Outlier samples and batch effects produce false signals that pass statistical tests, polluting the result list with artifacts.
How to avoid: Always run sample-level PCA, correlation matrices, and outlier detection before any differential test. Treat QC as mandatory, not optional.
Imputing missing values with a one-size-fits-all method.
Problem: Using mean imputation on MNAR proteomics data biases low-abundance proteins upward; using minprob on MCAR data places imputed values below the detection limit and biases feature means downward.
How to avoid: Diagnose the mechanism (correlation between intensity and missingness), then pick an appropriate imputer: minprob for MNAR, KNN for MCAR/MAR.
Using t-tests on non-normal or small-n data.
Problem: Student's t-test assumes normality and (with pooled variance) equal variances; with n < 10 and skewed data, the type I error rate can inflate above the nominal α.
How to avoid: Run normality and variance tests first; use Welch's t-test for unequal variance, Mann-Whitney for non-normal data, and permutation tests for n < 5. Keep in mind that very small groups limit the attainable p-values: an exact two-sided permutation test with 3 vs 3 samples cannot go below 0.1 (2 of 20 labelings).
Reporting raw p-values without multiple testing correction.
Problem: Across thousands of features, raw p-values produce massive false discovery rates; the resulting "significant" gene lists are dominated by noise.
How to avoid: Always apply Benjamini-Hochberg FDR (or BY for dependent tests) and report adjusted p-values. Set p_adj < 0.05 (or q < 0.05) as the significance threshold.
Confusing fold change with statistical significance.
Problem: A high log2 fold change at high p_adj is unreliable noise; a low log2 fold change at very low p_adj may be real but biologically negligible.
How to avoid: Filter on both — typical thresholds are |log2FC| > 1 AND p_adj < 0.05. Report effect sizes alongside p-values.
Failing to correct for batch effects when present.
Problem: Batch effects masquerade as biological signal, especially in proteomics and multi-cohort studies; PC1 ends up reflecting batch rather than condition.
How to avoid: Check batch separation with PCA + silhouette score; if silhouette > ~0.3, apply ComBat, limma's removeBatchEffect, or include batch as a covariate in the model.
Treating Option 3 (custom analysis) as a shortcut.
Problem: Jumping straight to custom methods without first running standard workflows skips peer-reviewed validation and makes results harder to publish and reproduce.
How to avoid: Document a clear justification for why Options 1 and 2 are inadequate before moving to Option 3, and validate any custom method on simulated or held-out data.
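For the batch-effect pitfall above, the silhouette check can be sketched without extra dependencies. The function below computes the mean silhouette width of batch labels over sample coordinates (for instance the first few PCA scores) using Euclidean distance; the name `silhouette_mean` is illustrative, at least two batches are required, and `silhouette_score` in scikit-learn is the library equivalent.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean silhouette width of `labels` over `points` (samples x dimensions)."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    groups = np.unique(labels)
    widths = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                              # exclude the sample itself
        if not same.any():
            widths.append(0.0)                       # singleton batch: width set to 0
            continue
        a = dist[i, same].mean()                     # mean distance within own batch
        b = min(dist[i, labels == g].mean()          # mean distance to nearest other batch
                for g in groups if g != labels[i])
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))
```

A value well above the ~0.3 guideline mentioned above indicates that samples cluster by batch, while values near or below zero indicate that batches are mixed.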
Remember: Always start with validated pipelines (Option 1), then move to standard workflows (Option 2), and only use custom analysis (Option 3) when necessary. Document all steps and parameters for reproducibility. Quality control is essential at every stage of analysis. Always check statistical test assumptions before performing analysis.