| name | bio-de-edger-basics |
| description | Perform differential expression analysis using edgeR in R/Bioconductor. Use for analyzing RNA-seq count data with the quasi-likelihood F-test framework, creating DGEList objects, normalization, dispersion estimation, and statistical testing. Use when performing DE analysis with edgeR. |
| tool_type | r |
| primary_tool | edgeR |
Version Compatibility
Reference examples tested with: DESeq2 1.42+, edgeR 4.0+, limma 3.58+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- R:
packageVersion('<pkg>') then ?function_name to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
edgeR Basics
Differential expression analysis using edgeR's quasi-likelihood framework for RNA-seq count data.
Required Libraries
library(edgeR)
library(limma)
Installation
if (!require('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('edgeR')
Creating DGEList Object
Goal: Construct an edgeR container from a count matrix with sample group information.
Approach: Wrap raw counts and group labels into a DGEList object for normalization and testing.
"Load my RNA-seq counts into edgeR" → Create a DGEList from a count matrix with sample group assignments and optional gene annotations.
y <- DGEList(counts = counts, group = group)
y <- DGEList(counts = counts, group = group, genes = gene_info)
y
Standard edgeR Workflow (Quasi-Likelihood)
Goal: Run the complete edgeR QL pipeline from raw counts to differentially expressed gene lists.
Approach: Filter, normalize (TMM), estimate dispersions, fit quasi-likelihood GLM, and test coefficients with the QL F-test.
"Find differentially expressed genes between my groups" → Test for significant expression differences using negative binomial models with quasi-likelihood F-tests.
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf)
Filtering Low-Expression Genes
Goal: Remove genes with insufficient expression to reduce noise and multiple testing burden.
Approach: Apply automatic or manual CPM/count thresholds requiring expression in a minimum number of samples.
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
keep <- rowSums(cpm(y) > 1) >= 2
y <- y[keep, , keep.lib.sizes = FALSE]
keep <- rowSums(y$counts >= 10) >= 3
y <- y[keep, , keep.lib.sizes = FALSE]
Normalization Methods
Goal: Correct for differences in library composition between samples.
Approach: Compute TMM (or alternative) normalization factors that adjust effective library sizes.
y <- calcNormFactors(y, method = 'TMM')
y <- calcNormFactors(y, method = 'RLE')
y <- calcNormFactors(y, method = 'upperquartile')
y <- calcNormFactors(y, method = 'none')
y$samples$norm.factors
Design Matrices
Goal: Define the linear model structure for the experimental design.
Approach: Build model matrices encoding group, batch, and interaction terms for the GLM.
design <- model.matrix(~ group)
design <- model.matrix(~ batch + group)
design <- model.matrix(~ genotype + treatment + genotype:treatment)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
Dispersion Estimation
Goal: Estimate biological variability (dispersion) to parameterize the negative binomial model.
Approach: Compute common, trended, and gene-wise dispersions using empirical Bayes moderation.
y <- estimateDisp(y, design)
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
y$common.dispersion
y$trended.dispersion
y$tagwise.dispersion
plotBCV(y)
Quasi-Likelihood Testing
Goal: Test for differential expression using the quasi-likelihood framework for robust inference.
Approach: Fit a QL GLM and test individual coefficients, contrasts, or multiple coefficients simultaneously.
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)
contrast <- makeContrasts(groupB - groupA, levels = design)
qlf <- glmQLFTest(fit, contrast = contrast)
qlf <- glmQLFTest(fit, coef = 2:3)
Making Contrasts
Goal: Define specific pairwise or complex group comparisons for testing.
Approach: Use makeContrasts with a no-intercept design to specify arbitrary between-group differences.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
contrast <- makeContrasts(
TreatedVsControl = treated - control,
DrugAVsControl = drugA - control,
DrugBVsControl = drugB - control,
DrugAVsDrugB = drugA - drugB,
levels = design
)
qlf_treated <- glmQLFTest(fit, contrast = contrast[, 'TreatedVsControl'])
qlf_drugA <- glmQLFTest(fit, contrast = contrast[, 'DrugAVsControl'])
Accessing Results
Goal: Retrieve and filter DE results from the fitted model.
Approach: Use topTags to extract ranked gene lists with FDR-corrected p-values.
topTags(qlf, n = 20)
results <- topTags(qlf, n = Inf)$table
summary(decideTests(qlf))
de_genes <- topTags(qlf, n = Inf, p.value = 0.05)$table
Result Columns
| Column | Description |
|---|
logFC | Log2 fold change |
logCPM | Average log2 counts per million |
F | Quasi-likelihood F-statistic |
PValue | Raw p-value |
FDR | False discovery rate (adjusted p-value) |
Alternative: Exact Test (Classic edgeR)
Goal: Perform a simple two-group DE test without a design matrix.
Approach: Use the classic edgeR exact test based on the negative binomial distribution.
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateDisp(y)
et <- exactTest(y)
topTags(et)
Alternative: glmLRT (Likelihood Ratio Test)
Goal: Test for DE using likelihood ratio tests as an alternative to the QL F-test.
Approach: Fit a standard GLM and compare nested models via the likelihood ratio statistic.
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)
Treat Test (Log Fold Change Threshold)
Goal: Test whether genes exceed a minimum fold change threshold, not just differ from zero.
Approach: Use glmTreat to apply a fold change threshold directly in the statistical test.
tr <- glmTreat(fit, coef = 2, lfc = log2(1.5))
topTags(tr)
Multi-Factor Designs
Goal: Test for condition effects while adjusting for batch or other covariates.
Approach: Include nuisance variables in the design matrix so the QL test controls for them.
design <- model.matrix(~ batch + condition, data = sample_info)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = ncol(design))
Getting Normalized Counts
Goal: Obtain normalized expression values for visualization and downstream analysis.
Approach: Compute CPM or log-CPM values using TMM-adjusted library sizes.
cpm_values <- cpm(y)
log_cpm <- cpm(y, log = TRUE)
rpm_values <- cpm(y)
log_cpm <- cpm(y, log = TRUE, prior.count = 2)
Exporting Results
Goal: Save DE results and normalized counts to files for sharing or downstream tools.
Approach: Extract all results via topTags and write as CSV alongside CPM values.
all_results <- topTags(qlf, n = Inf)$table
all_results$gene_id <- rownames(all_results)
write.csv(all_results, file = 'edger_results.csv', row.names = FALSE)
write.csv(cpm(y), file = 'cpm_values.csv')
Common Errors
| Error | Cause | Solution |
|---|
| "design matrix not full rank" | Confounded variables | Check sample metadata |
| "No residual df" | Too few samples | Need more replicates |
| "NA/NaN/Inf" | Zero counts in all samples | Filter more stringently |
Deprecated/Changed Functions
| Old | Status | New |
|---|
decidetestsDGE() | Removed (v4.4) | decideTests() |
glmFit() + glmLRT() | Still works | Prefer glmQLFit() + glmQLFTest() |
estimateDisp() | Optional (v4+) | glmQLFit() estimates internally |
mglmLS(), mglmSimple() | Retired | mglmLevenberg() or mglmOneWay() |
Note: calcNormFactors() and normLibSizes() are synonyms - both work.
Quick Reference: Workflow Steps
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = 2)
topTags(qlf, n = 20)
Choosing edgeR vs DESeq2
| Scenario | Recommended | Rationale |
|---|
| Standard bulk RNA-seq (n >= 3/group) | Either — expect ~90% overlap | Both perform well; concordance is high |
| Very small sample size (n = 2-3/group) | edgeR QL F-test | QL framework provides tighter FPR control with few replicates |
| Large datasets (>100 samples, many conditions) | edgeR | Faster; C++ backend in v4 |
| Salmon/Kallisto quantification | DESeq2 via tximport | DESeqDataSetFromTximport handles offset matrix natively |
| Need LFC shrinkage for ranking/GSEA | DESeq2 | apeglm/ashr shrinkage built in; edgeR has no equivalent |
| Formal fold-change threshold testing | edgeR glmTreat | More flexible than DESeq2 lfcThreshold |
| Python-only environment | DESeq2 (via PyDESeq2) | No edgeR Python equivalent |
| Omnibus test (>= 3 groups) | Either (LRT) | Both support likelihood ratio tests |
If overlap between DESeq2 and edgeR is < 60%, investigate differences in filtering, normalization, or dispersion estimation — discordance usually indicates a modeling issue, not a tool difference.
Decision Guidance
Test Selection
| Test | When to Use | Key Property |
|---|
QL F-test (glmQLFit + glmQLFTest) | Default for most experiments | Best FPR control with small n; accounts for dispersion uncertainty |
LRT (glmFit + glmLRT) | Large samples (n >= 6 per group); complex designs where QL is too conservative | More powerful but can be anti-conservative with few replicates |
Exact test (exactTest) | Simple two-group comparison only | No design matrix; cannot adjust for covariates |
TREAT (glmTreat) | Testing against a minimum fold-change threshold | Tests H0: |LFC| <= threshold, not H0: LFC = 0 |
QL F-test p-values are always >= LRT p-values. In null comparisons (replicates vs replicates), QL consistently returns ~0 false positives while LRT can return many.
edgeR v4 Changes
- Constant NB dispersion estimated from the most highly expressed genes (v3 used trended dispersions)
estimateDisp() is now optional before glmQLFit() (but still needed for BCV plots)
- Fractional count support for transcript quantification uncertainty (Gibbs sampling)
- C++ backend for model fitting (faster)
decidetestsDGE() removed; use decideTests() instead
mglmLS(), mglmSimple() retired; use mglmLevenberg() or mglmOneWay()
filterByExpr Internals
Default parameters: min.count = 10, min.total.count = 15, large.n = 10, min.prop = 0.7.
Algorithm:
- Convert
min.count to CPM cutoff: min.count / median(lib.size) * 1e6
- Determine minimum number of samples
n from design (smallest group size)
- If group size >
large.n: n = large.n + (group_size - large.n) * min.prop
- Keep gene if CPM >= cutoff in >= n samples AND total count >=
min.total.count
Example: median library 51M reads -> CPM cutoff ~0.2; 3 replicates per group -> need CPM >= 0.2 in >= 3 samples.
Normalization Caveats
TMM/RLE both assume most genes are NOT DE. This assumption breaks in:
- Prokaryotic stress responses (majority of genes may be DE)
- Extreme perturbations or cross-species comparisons
- Single-cell with many zeros
When violated, consider spike-in normalization or supplying known stable reference genes. For prokaryotic RNA-seq, also note: use non-spliced aligners (BWA-MEM, Bowtie2), rRNA depletion is essential, and KEGG organism codes are strain-specific.
Related Skills
- deseq2-basics - Alternative DE analysis with DESeq2
- de-visualization - MA plots, volcano plots, heatmaps
- de-results - Extract and export significant genes