تشغيل أي مهارة في Manus بنقرة واحدة

bio-crispr-screens-batch-correction

Batch effect correction for CRISPR screens covering ComBat empirical-Bayes, RUV, SVA, control-sgRNA normalization, and the model-based alternative of including batch as a covariate in MAGeCK MLE or Chronos. Covers screen-specific batch sources (passage cohort, library lot, infection day, sequencing run, Cas9 lot, FBS lot), PCA + variance-decomposition diagnostic to decide if correction is needed, when correction harms biology by over-correcting condition into batch, limma removeBatchEffect for visualization-only correction, and relationship to multi-condition design matrices. Use when combining screens for joint analysis, when passage cohort confounds biology, when DepMap-style panels need Chronos with batch covariates, when picking ComBat vs RUV, or when correction harms biology and should be replaced with explicit covariate modeling.

تشغيل في Manus

النجوم١

التفرعات٠

آخر تحديث٨ يونيو ٢٠٢٦ في ٠٣:٢٤

المصدر

bg-szy

bg-szy/TOP-SKILLS

فتح مستودع GitHub عرض مستودعات المنشئ

أمر التثبيت

تنزيل

تشغيل في Manus

مستكشف الملفات

2 ملفات

SKILL.md

readonly

المزيد من هذا المستودع

نفس المستودع

bio-analytical-validation

bg-szy/TOP-SKILLS

Treats a ctDNA assay as a molecule-counting experiment at the Poisson edge and builds its analytical-validation case the measurement-science way. Covers the genome-equivalent currency (~330 haploid copies/ng), the lambda = input_GE x VAF sampling ceiling (lambda>=3 for ~95% detection), the error-suppression ladder (raw NGS ~1e-3 -> single-strand UMI ~1e-4/1e-5 -> duplex <1e-7), the CLSI EP17 LoB/LoD/LoD95/LoQ framework, the per-locus-vs-panel-integrated LoD distinction that lets bespoke MRD reach ppm, contrived/SEQC2 reference standards, and honest LoD reporting conditioned on input mass + consensus depth + replicate detection rate. Use when stating or trusting a sensitivity claim, designing a dilution-series validation, deciding how many genome equivalents are needed at a target VAF, choosing a single-locus vs panel-integrated LoD, or auditing a "detects 0.1% VAF" claim.

2026-06-081

bio-cfdna-preprocessing

bg-szy/TOP-SKILLS

Decides how to preprocess plasma cfDNA sequencing data so the recoverable signal survives - library-prep-aware fragment expectations (dsDNA vs ssDNA/adaptase prep), UMI/duplex consensus with fgbio (ExtractUmisFromBam, GroupReadsByUmi --strategy paired for duplex, CallMolecularConsensusReads vs CallDuplexConsensusReads, FilterConsensusReads min-reads "total s1 s2"), the align->group->consensus->RE-align ordering, and the cfDNA dedup trap where naive coordinate dedup collapses nucleosome-coincident independent molecules. Covers when single-strand consensus suffices vs when duplex is mandatory, the singleton/sensitivity tax at low input, and reading the insert-size histogram as a pre-analytical QC instrument. Use when processing plasma cfDNA reads before fragmentomics, ctDNA mutation calling, or tumor-fraction estimation.

2026-06-081

bio-comparative-genomics-ortholog-inference

bg-szy/TOP-SKILLS

Infer orthologous genes and gene families across species using OrthoFinder3 (HOG-based phylogenetic orthology), SonicParanoid2, Broccoli, ProteinOrtho, OMA / FastOMA hierarchical orthologous groups, eggNOG-mapper, JustOrthologs, and TOGA whole-genome-alignment orthology. Use when building single-copy ortholog sets for phylogenomics, classifying co-orthologs and in/out-paralogs after gene duplication, propagating functional annotation via orthology with awareness of the ortholog conjecture, distinguishing speciation from duplication via gene-tree species-tree reconciliation, computing Quest-for-Orthologs benchmark performance, or running synteny-aware ortholog detection in WGD-affected lineages.

2026-06-081

bio-crispr-screens-batch-correction

bg-szy/TOP-SKILLS

Batch effect correction for CRISPR screens. Covers normalization across batches, technical replicate handling, and batch-aware analysis. Use when combining screens from multiple batches or correcting systematic technical variation.

2026-06-081

bio-ctdna-mutation-detection

bg-szy/TOP-SKILLS

Detects somatic mutations in circulating tumor DNA, treating low-VAF detection as a signal-versus-noise problem set by error suppression and molecules sampled, not by the choice of caller. Distinguishes de novo CALLING (scanning a panel for unknown variants, bounded by per-locus error and multiple testing) from tumor-informed DETECTION (tracking a pre-specified variant set, where panel integration reaches single-ppm). Covers VarDict and Mutect2 for de novo calling, UMI-aware callers, and a pysam-based known-variant VAF tracker, with matched-WBC subtraction as the mandatory defense against clonal hematopoiesis (the dominant false positive). Use when calling or tracking tumor mutations from plasma cfDNA, setting a VAF threshold, or deciding whether a low-VAF call is tumor versus CHIP.

2026-06-081

bio-hi-c-analysis-compartment-analysis

bg-szy/TOP-SKILLS

Detects A/B chromatin compartments from balanced Hi-C contact matrices via eigenvector decomposition of the distance-normalized, Pearson-correlated cis matrix with cooltools (eigs_cis), then orients (phases) the compartment eigenvector against a GC or gene-density track so the active (A) sign is not arbitrary. Covers the eigenvector-is-a-choice problem (per-arm view_df to remove the centromere gradient; picking the eigenvector by max correlation with activity, not by eigenvalue), GC phasing with bioframe.frac_gc, resolution choice (100kb-1Mb), saddle plots and saddle_strength for compartmentalization strength, the cohesin-loss-strengthens-compartments result, subcompartments (SNIPER/Calder/dcHiC), and cross-condition compartment switching. Use when calling A/B compartments, computing E1/eigenvectors, phasing the eigenvector, building saddle plots, choosing a compartment resolution, quantifying compartment strength, or comparing compartmentalization across conditions.

2026-06-081

name	bio-crispr-screens-batch-correction
description	Batch effect correction for CRISPR screens covering ComBat empirical-Bayes, RUV, SVA, control-sgRNA normalization, and the model-based alternative of including batch as a covariate in MAGeCK MLE or Chronos. Covers screen-specific batch sources (passage cohort, library lot, infection day, sequencing run, Cas9 lot, FBS lot), PCA + variance-decomposition diagnostic to decide if correction is needed, when correction harms biology by over-correcting condition into batch, limma removeBatchEffect for visualization-only correction, and relationship to multi-condition design matrices. Use when combining screens for joint analysis, when passage cohort confounds biology, when DepMap-style panels need Chronos with batch covariates, when picking ComBat vs RUV, or when correction harms biology and should be replaced with explicit covariate modeling.
tool_type	mixed
primary_tool	pyComBat

Version Compatibility

Reference examples tested with: pyComBat 0.3.3+ (epigenelabs/pyComBat), MAGeCK 0.5.9+, R/limma 3.58+, sva 3.50+, RUVSeq 1.36+, pandas 2.2+, numpy 1.26+, scikit-learn 1.4+, scipy 1.12+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show pycombat; from combat.pycombat import pycombat
R: packageVersion('sva'); ?ComBat; packageVersion('RUVSeq'); ?RUVg

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Batch Correction for CRISPR Screens

"Correct batch effects in my CRISPR screens" -> Diagnose the batch source, decide whether to remove via empirical-Bayes (ComBat), explicit covariate modeling (MAGeCK MLE / Chronos design matrix), control-guide-anchored normalization, or unwanted-variation decomposition (RUV, SVA), then apply only the correction that preserves biological condition signal.

Python: pyComBat.pycombat for empirical-Bayes correction
Python: explicit batch covariates in mageck mle --design-matrix
R: sva::ComBat, RUVSeq::RUVg, limma::removeBatchEffect
CLI: chronos-cn natively handles screen-batch covariates

Batch Sources in CRISPR Screens

Source	Mechanism	Detectable by
Library lot	Different aliquots or PCR amplifications	Gini shift; plasmid-pool sequencing
Cell passage cohort	Cells passaged through different periods	PCA Day-0 samples clustering by passage
Infection day	Lentivirus titer drifts; FBS lot changes	PCA Day-0 samples cluster by day
Cas9 enzyme lot	Cas9 expression heterogeneity	PR-AUC drift across screens
Sequencing run	Lane bias, flowcell variant, machine	Per-sample read-count distribution
FBS / culture lot	Fetal bovine serum lot variations confound proliferation	Day-0 vs endpoint differential not present in vehicle
Tissue-prep batch	In-vivo: animal cohort, surgical day, organ-prep tech	In-vivo screens (see [[in-vivo-screens]])

Critical: Batch effects in CRISPR screens often correlate with biology (e.g., the drug arm was processed in batch 2 because that's when the drug arrived). This confounds correction. Always check for confounding before applying ComBat.

Batch Effect Decision Tree

Diagnostic finding	Recommended correction
PCA shows samples cluster by condition, not batch	No correction needed; biology dominates
PCA PC1 separates batches, PC2 separates conditions	Apply ComBat with condition as biological_covariate
Batch fully confounded with condition (e.g. all drug in batch 2, all vehicle in batch 1)	Correction will destroy biology; instead redesign next screen with cross-batch balance OR re-analyze with batch in MAGeCK MLE design matrix
Day-0 (pre-perturbation) samples cluster by batch	Strong batch effect; ComBat needed
Endpoint samples cluster by batch but not Day-0	Selection-driven artifact (FBS lot etc); correct or include batch as covariate
Replicates within a batch are tight; across-batch much wider	Classic batch effect; ComBat
Each replicate scatters randomly across PCs	Sample-level noise; no batch correction will help
Cancer-line panel with multiple batches	Use Chronos (built-in batch modeling)

Diagnose: PCA + Variance Decomposition

Goal: Quantify what fraction of variance is batch vs condition before correcting.

Approach: Run PCA on log10(counts+1); fit ANOVA decomposing variance into batch and condition components; report variance explained.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from scipy import stats

def batch_diagnostic(counts_df, metadata_df, batch_col='batch', condition_col='condition'):
    '''Variance decomposition: report fraction of PC1/PC2 variance attributable to batch vs condition.'''
    log_counts = np.log10(counts_df + 1).T  # samples as rows
    pca = PCA(n_components=5)
    pcs = pca.fit_transform(log_counts)
    out = pd.DataFrame({
        'PC': range(1, 6),
        'var_explained': pca.explained_variance_ratio_,
    })
    pc_df = pd.DataFrame(pcs, columns=[f'PC{i+1}' for i in range(5)], index=counts_df.columns).join(metadata_df)
    for i in range(5):
        pc = pc_df[f'PC{i+1}']
        f_b, p_b = stats.f_oneway(*[pc[pc_df[batch_col] == b] for b in pc_df[batch_col].unique()])
        f_c, p_c = stats.f_oneway(*[pc[pc_df[condition_col] == c] for c in pc_df[condition_col].unique()])
        out.loc[i, 'batch_F'] = f_b
        out.loc[i, 'batch_p'] = p_b
        out.loc[i, 'cond_F'] = f_c
        out.loc[i, 'cond_p'] = p_c
    return out

Interpretation: If PC1 has batch F-stat > condition F-stat by 10x, batch is dominating and correction is warranted. If condition dominates PC1, no correction needed.

ComBat Empirical-Bayes Correction

Goal: Remove batch-specific location and scale shifts while preserving biological condition signal.

Approach: Log-transform counts, fit ComBat with explicit biological_covariate indicating condition (so the model knows which signal to preserve), back-transform.

import numpy as np
from combat.pycombat import pycombat

def combat_correct(counts_df, batch_vector, condition_vector=None):
    '''ComBat on log-counts with optional biological covariate (condition).
    Preserves condition signal while removing batch shifts.'''
    data = np.log2(counts_df.values + 1)
    if condition_vector is not None:
        mod = pd.get_dummies(condition_vector).values.astype(float)
        corrected = pycombat(data, np.array(batch_vector), mod=mod)
    else:
        corrected = pycombat(data, np.array(batch_vector))
    return pd.DataFrame(np.power(2, corrected) - 1,
                         index=counts_df.index, columns=counts_df.columns).clip(lower=0)

Critical caveat: ComBat assumes batch effects are linear shifts of mean and variance in log space. Non-linear effects (e.g., gene-specific batch sensitivity) remain. Always re-check PCA after correction to confirm batches now overlap.

RUV (Remove Unwanted Variation)

Goal: Identify hidden batch sources via control sgRNAs whose true signal is known.

Approach: Designate non-targeting controls as "negative controls" (assumed unchanged); RUV decomposes their variance into unwanted factors, then subtracts these from all data.

library(RUVSeq)
# counts_df: rows = sgRNAs, columns = samples
ntc_indices <- which(rownames(counts_df) %in% ntc_sgrna_names)
seqset <- newSeqExpressionSet(counts = as.matrix(counts_df))
ruv_corrected <- RUVg(seqset, cIdx = ntc_indices, k = 2)  # k = 2 unwanted factors
# Access corrected data
corrected_counts <- normCounts(ruv_corrected)

When to use: RUV preferred over ComBat when batches are not annotated (e.g., unknown technical confounders). Worse than ComBat when batch is known and well-annotated; ComBat is more direct.

SVA (Surrogate Variable Analysis)

Goal: Estimate unknown latent factors that may confound the screen.

Approach: SVA computes surrogate variables that capture variance not explained by known biological factors; these can then be added to the MAGeCK MLE design matrix as covariates.

library(sva)
# counts_df: rows = sgRNAs, columns = samples
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)
sv_obj <- sva(as.matrix(counts_df), mod, mod0)
n_sv <- sv_obj$n.sv  # number of surrogate variables
# Add to design matrix for MAGeCK MLE
design_mat <- cbind(mod, sv_obj$sv)

Use case: When the screen has clear biological signal (e.g. essentiality recovery passes) but small effect sizes are hidden by noise; SVA-discovered latent factors as covariates can recover them.

Batch as Explicit Covariate (Preferred for MAGeCK MLE / Chronos)

Goal: Model batch and biology in the same regression instead of pre-correcting.

Approach: Add batch indicator columns to the MLE design matrix. The fitted beta for condition is the effect after accounting for batch; no pre-correction needed.

# Design matrix for a screen with 2 batches and 2 conditions
cat > design.txt <<EOF
Samples         baseline    batch2    treatment
Veh_b1_r1       1           0         0
Veh_b1_r2       1           0         0
Drug_b1_r1      1           0         1
Drug_b1_r2      1           0         1
Veh_b2_r1       1           1         0
Veh_b2_r2       1           1         0
Drug_b2_r1      1           1         1
Drug_b2_r2      1           1         1
EOF

mageck mle \
    --count-table counts.txt \
    --design-matrix design.txt \
    --output-prefix batch_aware_mle

Why this is preferred: ComBat shifts counts before testing; the MLE-with-covariates approach correctly propagates uncertainty from the batch term into the condition beta's standard error. ComBat-then-test pretends the corrected counts are noise-free, biasing FDR.

Control-Sgrna Anchored Normalization

Goal: Use non-targeting controls as the per-sample reference so batch shifts cancel.

Approach: Scale each sample so its NTC sgRNAs have a constant median. Subsequent fold changes are relative to NTCs in each sample, automatically batch-controlling.

def ntc_anchored_normalize(counts_df, ntc_sgrna_names, target_median=1000):
    '''Scale each sample so its NTC median is target_median. Subsequent LFC is NTC-anchored.'''
    is_ntc = counts_df.index.isin(ntc_sgrna_names)
    ntc_medians = counts_df.loc[is_ntc].median(axis=0)
    scale_factors = target_median / ntc_medians.replace(0, np.nan)
    return counts_df * scale_factors, scale_factors

Critical: Requires ≥500 NTCs in the library (see [[library-design]]). With fewer, the NTC median is unstable and amplifies noise rather than removing batch.

When NOT to Correct

Situation	Why correction hurts
Batch is fully confounded with condition	Correction destroys biology along with batch; redesign or accept
Batch effect is smaller than between-replicate noise	Correction adds noise without removing meaningful variance
Replicates already correlate >0.95 within and across batches	No batch effect to correct
Single-screen analysis	No "batch" to correct; only replicate noise
Per-batch sample size <3	Cannot estimate batch shift reliably; correction is harmful

Failure Modes

ComBat eliminates biological signal

Trigger: Batch is correlated with condition (e.g., all drug-arm samples were processed week 2; all vehicle-arm samples week 1). Mechanism: ComBat without a mod covariate treats condition variance as batch variance; corrects it away. Symptom: PR-AUC against CEGv2 drops after ComBat correction. Fix: Always supply mod covariate matrix indicating condition; verify by comparing PR-AUC before and after.

RUV adds noise instead of removing it

Trigger: k (number of unwanted factors) set too high. Mechanism: RUV's least-squares decomposition over-fits; "removed" variance includes biology. Symptom: Hits decrease and replicate Pearson drops after correction. Fix: Choose k via cross-validation; default k=1 or 2 for most screens.

Batch-aware MLE collinear design matrix

Trigger: Adding a batch indicator that is fully collinear with another design column (e.g., all of batch 2 is also Day 21). Mechanism: MLE design matrix is singular; betas not estimable. Symptom: MAGeCK MLE errors out or produces NaN betas. Fix: Drop the collinear column; re-design experiment with cross-batch balance.

ComBat after RUV double-corrects

Trigger: Applying multiple corrections sequentially. Mechanism: Both methods remove variance; sequential application removes biology twice. Symptom: All signal gone; counts look uniformly noisy. Fix: Pick one method based on diagnostic; never combine.

Per-batch sample size too small

Trigger: 2 replicates per batch with 3 batches; ComBat estimates batch shift from 2 samples. Mechanism: Insufficient data to estimate batch parameters; high-variance estimates. Symptom: Correction makes some batches worse than uncorrected. Fix: Need ≥3 (preferably 4-6) samples per batch; below this, use covariate modeling instead.

Quantitative Thresholds

Threshold	Value	Source / Rationale
PC1 batch F vs condition F	F_batch > 10x F_cond -> apply correction	Standard variance-decomposition diagnostic
ComBat min samples per batch	≥3, ideally 4-6	Empirical Bayes prior estimation
RUV `k` (unwanted factors)	k=1 default; k=2 if multiple known batch sources	Risso 2014; cross-validate
NTCs needed for NTC-anchored norm	≥500 in library	Stable median
Post-correction PCA check	Batches must overlap in PC1/PC2 plot	Visual sanity check
Post-correction PR-AUC	Should be same or higher than pre	If lower, correction destroyed biology

Common Errors

Error / symptom	Cause	Solution
PR-AUC drops after ComBat	Batch confounded with condition	Add `mod` covariate; or redesign
MAGeCK MLE NaN beta after adding batch column	Collinear design matrix	Drop collinear column
Replicates still cluster by batch after RUV	k too low	Increase k; cross-validate
Replicates lose internal cohesion after correction	Over-correction	Reduce k or revert
NTC-anchored norm worse than median	Too few NTCs	Use median; add NTCs to next library
Sequencing-run-level batch survives ComBat	Non-linear sequencing effect	Pre-normalize with `mageck count --norm-method control` first

References

Johnson WE et al. 2007. Biostatistics 8:118. Original ComBat algorithm.
Leek JT et al. 2012. Bioinformatics 28:882. SVA package.
Risso D et al. 2014. Nat Biotechnol 32:896. RUVSeq.
Pacini C et al. 2021. Cell Syst 12:1132. CRISPR-screen batch effects analysis.
Hayer KE et al. 2023. Genome Biol 24:1. Benchmark of normalization methods for CRISPR screens.

Related Skills

crispr-screens/mageck-analysis - MAGeCK MLE with explicit batch covariates
crispr-screens/screen-qc - Pre-correction PCA diagnostic
crispr-screens/copy-number-correction - Chronos handles batch + CN jointly
crispr-screens/library-design - NTC composition for NTC-anchored normalization
crispr-screens/jacks-analysis - Joint analysis across batches with shared efficacy
crispr-screens/hit-calling - Post-correction hit calling
crispr-screens/in-vivo-screens - In-vivo-specific batch sources (animal cohort, tissue prep)