تشغيل أي مهارة في Manus بنقرة واحدة

bio-phasing-imputation-imputation-qc

النجوم٩٤٣

التفرعات١٦٥

آخر تحديث١٦ يونيو ٢٠٢٦ في ١٤:٢٢

Assesses and filters phasing/imputation output - the quality metrics (Beagle DR2, Minimac R2 and EmpRsq, IMPUTE/GLIMPSE INFO), MAF-stratified filtering, true accuracy by masking, the differential-imputation confound, dosage-based downstream usage, and phasing switch-error QC. Covers why every routine quality score is an ESTIMATE of r2 from the posterior spread (not validation against truth), why it is confounded with MAF so a flat INFO>=0.3 cutoff is a hidden rare-variant filter, why concordance lies for rare variants while masked dosage-r2 by MAF is the gold standard, why separate case/control imputation manufactures false GWAS hits, and that the field name tells the tool (DR2=Beagle, R2=Minimac, INFO=GLIMPSE/IMPUTE). Use when filtering imputed variants before GWAS, validating accuracy, benchmarking phasing against trios, or diagnosing inflated association. Imputation is genotype-imputation; phasing is haplotype-phasing; panel ancestry is reference-panels; the test is population-genetics/association-testing.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

GPTomics

GPTomics/bioSkills

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

المهن ذات الصلةSOC

استنادا إلى تصنيف SOC المهني

مطوّرو البرمجياتمهن الحاسوب والرياضيات·SOC 15-1252

مستكشف الملفات

3 ملفات

SKILL.md

readonly

المزيد من هذا المستودع

نفس المستودع

bio-ribo-seq-initiation-site-mapping

GPTomics/bioSkills

Map translation initiation sites, including non-AUG and alternative starts, from initiation-drug ribosome profiling (TI-seq). Use when locating start codons, detecting near-cognate or upstream initiation, or analyzing harringtonine, lactimidomycin (GTI-seq/QTI-seq), or retapamulin (Ribo-RET) data.

2026-06-20943

bio-ribo-seq-orf-detection

GPTomics/bioSkills

Detect and quantify translated ORFs from Ribo-seq using 3-nucleotide periodicity, including uORFs, internal ORFs, dORFs, and novel ORFs. Use when finding actively translated regions beyond annotated CDS, classifying ORFs by the 2022 community standard, quantifying ORF-level translation, or choosing between periodicity-based callers.

2026-06-20943

bio-ribo-seq-riboseq-preprocessing

GPTomics/bioSkills

Preprocess ribosome profiling reads with UMI handling, adapter trimming, contaminant/rRNA depletion, and footprint-aware alignment. Use when preparing Ribo-seq FASTQ for periodicity QC, ORF detection, translation efficiency, or stalling analysis, or when deciding how to deduplicate, which aligner to use, or how to size-select ribosome-protected fragments.

2026-06-20943

bio-ribo-seq-ribosome-periodicity

GPTomics/bioSkills

Validate Ribo-seq library quality by measuring 3-nucleotide periodicity and calibrating read-length-specific P-site offsets. Use when checking whether footprints capture genuine translation, determining P-site offsets for downstream ORF/TE/stalling analysis, or deciding which read lengths to keep.

2026-06-20943

bio-ribo-seq-ribosome-stalling

GPTomics/bioSkills

Detect ribosome pausing and stalling at codon resolution from Ribo-seq, using local-relative occupancy metrics and A-site assignment. Use when studying elongation dynamics, codon dwell times, pause motifs, or ribosome collisions, and when judging whether a pause is real biology or a cycloheximide artifact.

2026-06-20943

bio-ribo-seq-translation-efficiency

GPTomics/bioSkills

Quantify translation efficiency (TE) as ribosome occupancy relative to mRNA abundance and test for differential TE between conditions. Use when separating translational from transcriptional regulation, distinguishing genuine translational control from buffering, or choosing between riborex, Xtail, anota2seq, and DESeq2 interaction models.

2026-06-20943

name	bio-phasing-imputation-imputation-qc
description	Assesses and filters phasing/imputation output - the quality metrics (Beagle DR2, Minimac R2 and EmpRsq, IMPUTE/GLIMPSE INFO), MAF-stratified filtering, true accuracy by masking, the differential-imputation confound, dosage-based downstream usage, and phasing switch-error QC. Covers why every routine quality score is an ESTIMATE of r2 from the posterior spread (not validation against truth), why it is confounded with MAF so a flat INFO>=0.3 cutoff is a hidden rare-variant filter, why concordance lies for rare variants while masked dosage-r2 by MAF is the gold standard, why separate case/control imputation manufactures false GWAS hits, and that the field name tells the tool (DR2=Beagle, R2=Minimac, INFO=GLIMPSE/IMPUTE). Use when filtering imputed variants before GWAS, validating accuracy, benchmarking phasing against trios, or diagnosing inflated association. Imputation is genotype-imputation; phasing is haplotype-phasing; panel ancestry is reference-panels; the test is population-genetics/association-testing.
tool_type	mixed
primary_tool	bcftools

Version Compatibility

Reference examples tested with: bcftools 1.19+, cyvcf2 0.31+, pandas 2.2+, numpy 1.26+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show cyvcf2 pandas numpy then help(module.function) to check signatures
CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

The imputation quality field is named differently by each engine: DR2 (Beagle 5.x; AR2 is legacy 4.x), R2 (Minimac4; ER2/EmpRsq for typed sites), INFO (IMPUTE5 and GLIMPSE). bcftools +fill-tags computes AF/MAF/HWE but CANNOT produce an imputation Rsq - the quality number comes from the imputer only. Record the engine, panel, and build, because a quality number is only comparable within the same engine and panel.

Imputation QC -- Filtering on a Self-Estimated Quality, Validating With Masked Truth

"Filter my imputed genotypes by quality and check the accuracy" -> Filter on the imputer's per-variant quality field with a MAF floor, and validate true accuracy by masking - because the routine quality score is the model grading its own posterior, it is confounded with allele frequency, and a flat cutoff silently deletes the rare variants of greatest interest.

CLI: bcftools view -e 'INFO/DR2<0.3 || INFO/AF<0.01 || INFO/AF>0.99' imputed.vcf.gz (DR2 for Beagle; R2 for Minimac; INFO for GLIMPSE/IMPUTE; Minimac also emits a MAF tag, Beagle only AF)

Scope: QC of already-produced phasing/imputation output - what the numbers mean, which to trust, how to filter, and how poor QC propagates into false GWAS hits. Running imputation -> genotype-imputation. Phasing -> haplotype-phasing. Panel ancestry (which the metric cannot detect) -> reference-panels. The GWAS test on the filtered dosages -> population-genetics/association-testing. VCF field-parsing mechanics -> variant-calling/vcf-statistics. Read-backed phasing switch QC -> long-read-sequencing/haplotype-phasing.

The Single Most Important Modern Insight -- The Quality Score Is the Model Grading Its Own Posterior, Confounded With MAF, So a Flat Cutoff Is a Hidden Rare-Variant Filter

Every routine imputation quality score (IMPUTE INFO, Minimac R2/Rsq, Beagle DR2) estimates the same quantity - the squared correlation r2 between the imputed dosage and the unobserved true genotype - from the posterior spread, without ever seeing truth (Marchini & Howie 2010 Nat Rev Genet 11:499). It is a self-report of confidence, not a measured accuracy. Three facts organize all of QC:

The metric is structurally confounded with MAF. The estimator is the fraction of the HWE-expected dosage variance 2p(1-p) the imputed dosages recover; the denominator collapses toward zero as the allele gets rarer, so the estimate is noisier and systematically lower for rare variants. A single flat INFO/R2 >= 0.3 cutoff therefore deletes rare variants far more aggressively than common ones - a hidden MAF filter applied without anyone deciding to. Filter MAF-stratified, or report accuracy per MAF bin.
The only true accuracy comes from masking. Hide genotypes that are actually known, re-impute them, and compute the real squared correlation against the held-out truth, binned by MAF (the dosage-r2-by-MAF curve). That is the gold standard; the per-variant R2 is a convenient proxy never validated against truth for that variant. Minimac's EmpRsq (masked, measured) vs Rsq (self-estimated) is the only place the two meet, and a gap between them is a QC alarm (panel/strand/ancestry mismatch). Concordance is disqualified for rare variants: a do-nothing imputer that always calls the major homozygote scores ~98% concordance at MAF 1% (0.99^2) while carrying zero information.
The differential-imputation confound manufactures false hits, and every per-group QC metric passes. Because imputation quality is a function of the data and panel, imputing cases and controls (or batches, or ancestries) separately makes the imputation error differ between groups, and a case-control test cannot distinguish that artifactual allele-frequency difference from real association. The fix is structural (impute together / harmonize), not a filter; the absence of a QC red flag never clears it.

The Metrics, Precisely

All three estimate r2 (imputed dosage vs true genotype) from the posterior, never from observed truth. The variance-ratio intuition: estimated r2 = Var(imputed dosage) / [2p(1-p)]. Under perfect information the dosages equal the true 0/1/2 genotypes and recover the full HWE variance (ratio 1); under no information every dosage collapses to the mean 2p and the variance goes to 0 (ratio 0). The fraction of HWE variance recovered IS the estimate, which is also why it is noisier and lower at low MAF. That Var(dosage)/2p(1-p) form is specifically the Minimac/Beagle estimator (the spread of the point dosages across samples); IMPUTE INFO targets the same r2 but computes it differently, averaging each sample's within-individual posterior variance - the mechanistic reason the three numbers are not interchangeable across engines.

Field	Engine	Meaning
DR2	Beagle 5.x	estimated squared correlation between estimated and true allele dose (AR2 is legacy 4.x)
R2 (Rsq)	Minimac4	estimated r2 for all sites, from the dosage variance ratio
EmpRsq / ER2 (EmpR)	Minimac4	empirical r2 from leave-one-out at TYPED sites (masked, measured); a negative EmpR flags a strand/allele flip
INFO	IMPUTE5, GLIMPSE	the IMPUTE information measure (posterior-variance ratio), same spirit

The provenance matters: the field name tells the tool, the numbers are NOT comparable across engines (a Beagle DR2 of 0.8 is not a Minimac R2 of 0.8), and bcftools +fill-tags cannot produce any of them.

The INFO x MAF Interaction and Thresholds

At low MAF there is little dosage variance to predict and few panel copies of the rare haplotype to anchor the estimate, so R2 is intrinsically noisier and downward-biased there - a property of the construction, not a bug. The standard cutoffs and their status:

Threshold	Source / status	Rationale
INFO/R2/DR2 >= 0.3	the common GWAS cutoff; convention, NOT a theorem	WTCCC/early-IMPUTE-era practice, acknowledged as somewhat arbitrary; acts as a hidden MAF filter
INFO/R2 >= 0.8	stricter, high-confidence analyses	the 0.3 and 0.8 pair are the two established values
MAF-stratified INFO	recommended remedy	a flat cutoff differentially deletes rare variants; filter or report per MAF bin
GP > 0.9 (or 0.8) hardcall threshold	when forced to hardcall	below this set missing; hardcalling at all loses power vs dosage regression
Meta-analysis N_eff = N * INFO	METAL/GWAMA/FinnGen convention	down-weight poorly-imputed variants; not a named theorem
EmpRsq vs Rsq large gap	QC alarm	self-estimate not matching measured accuracy = panel/strand/ancestry mismatch

True Accuracy by Masking

The accepted accuracy curve is aggregate r2 (squared Pearson correlation between imputed dosage and the masked-then-revealed true genotype) binned by MAF; it decreases monotonically as MAF falls. Workflow: mask typed genotypes (array sites or a held-out set), re-impute from the panel, compute squared correlation vs the withheld truth, bin by MAF. Leave-one-out (one variant at a time) is the per-variant version and is what produces Minimac's EmpRsq for typed sites. Never use concordance as the rare-variant accuracy metric (Ramnarine 2015 PLoS One 10:e0137601); it is dominated by the major-homozygote class and inflates rare-variant accuracy.

MAF-stratified quality summary

Goal: Reveal the hidden-MAF-filter effect by reporting imputation quality per MAF bin instead of as one global mean, so the rare-variant tail a flat cutoff would delete is visible.

Approach: Parse the imputed VCF for the engine's quality field and allele frequency with cyvcf2, derive MAF, bin it, and report mean quality and the fraction passing a candidate cutoff per bin.

import numpy as np
import pandas as pd
from cyvcf2 import VCF

def quality_by_maf(vcf_path, qual_key='DR2', cutoff=0.3):
    rows = []
    for v in VCF(vcf_path):
        q = v.INFO.get(qual_key)
        af = v.INFO.get('AF')
        if q is None or af is None:
            continue
        af = af[0] if isinstance(af, tuple) else af
        rows.append((min(af, 1 - af), float(q)))
    df = pd.DataFrame(rows, columns=['maf', 'qual'])
    bins = [0, 0.001, 0.01, 0.05, 0.5]   # rare-to-common; the rare bins are where a flat cutoff bites
    df['maf_bin'] = pd.cut(df['maf'], bins=bins)
    summary = df.groupby('maf_bin', observed=True).agg(n=('qual', 'size'), mean_qual=('qual', 'mean'), frac_pass=('qual', lambda q: (q >= cutoff).mean()))
    return summary

quality_by_maf('imputed.vcf.gz', qual_key='DR2', cutoff=0.3)

Phasing QC -- Switch Error Rate

Switch error rate (SER) = switch errors / opportunities, an opportunity being each consecutive heterozygous-site pair, scored against trio/duo/benchmark truth. A flip error is two switches one site apart (an isolated mis-assignment, not a long-range switch); Hamming distance counts overall haplotype differences and inflates on block swaps. Trio-based SER is biased upward by genotype error, which is why SHAPEIT5's switch reports SER and a genotyping-error rate jointly. Magnitudes are always MAC- and N-stratified (sub-0.5% common-variant SER for modern tools; single-digit-percent and rising as MAC approaches 1). Tools: whatshap compare, vcftools --diff-switch-error, SHAPEIT5 switch.

Dosage-Based Downstream Usage

Hardcall thresholding discards imputation uncertainty and sets low-confidence calls missing, losing power versus regression on the expected dosage, especially at low MAF (Huang 2014 PLoS One 9:e110679). Carry dosages: PLINK2 (--vcf file dosage=DS, dosage=HDS for Minimac4 phased, .pgen), SNPTEST (-method expected/score/em), REGENIE (BGEN v1.2/PGEN), and BOLT-LMM (BGEN v1.2) all accept dosages. In meta-analysis, INFO enters as a per-study filter and as an effective-N weight.

Per-Method Failure Modes

Flat INFO cutoff as a silent rare-variant filter

Trigger: applying one INFO/R2 >= 0.3 across all frequencies. Mechanism: the metric is confounded with MAF, so the cutoff removes a far higher fraction of rare than common variants. Symptom: the rare-variant tail vanishes with no record that a frequency filter was applied. Fix: filter MAF-stratified or report accuracy per MAF bin; pair any INFO cutoff with an explicit MAF floor stated in the methods.

Concordance reported as rare-variant accuracy

Trigger: quoting genotype concordance for rare variants. Mechanism: concordance is dominated by the major-homozygote class; a do-nothing imputer scores ~98% at MAF 1%. Symptom: uniformly high concordance hiding catastrophic rare-variant failure. Fix: use masked dosage-r2 binned by MAF; reserve concordance for sanity checks on common variants.

Cases and controls imputed separately

Trigger: imputing batches/groups independently. Mechanism: batch-differential imputation quality creates an artifactual allele-frequency difference indistinguishable from association. Symptom: genome-wide-significant hits that fail to replicate; every per-group QC passes. Fix: impute all samples together, or harmonize panel/version and verify quality does not differ by batch -> reference-panels.

Filtering on the wrong quality field

Trigger: bcftools view -i 'INFO/R2>0.3' on a Beagle VCF (which has DR2, not R2). Mechanism: the field name is engine-specific. Symptom: the filter silently passes everything or errors on a missing tag. Fix: use DR2 for Beagle, R2 for Minimac, INFO for GLIMPSE/IMPUTE; confirm with bcftools view -h.

Trusting Rsq where EmpRsq diverges

Trigger: reporting a high Rsq while the masked EmpRsq is much lower. Mechanism: the self-estimate assumes the model (panel, strand, ancestry) is right; a gap means it is not. Symptom: confident Rsq on systematically wrong imputation; a negative EmpR is an outright strand flip. Fix: treat the Rsq-EmpRsq gap as an alarm; check strand/build/ancestry -> reference-panels.

Quantitative Thresholds

Threshold	Source	Rationale
INFO/R2/DR2 >= 0.3 (common), >= 0.8 (strict)	convention (WTCCC/early-IMPUTE era)	the common GWAS cutoffs; not derived, and a hidden MAF filter
Always pair the quality cutoff with a MAF floor	Magi 2012 Genet Epidemiol 36:785	rare + low-R2 is the classic false-positive generator
Accuracy = masked dosage-r2 binned by MAF	Ramnarine 2015 PLoS One 10:e0137601	concordance inflates rare-variant accuracy; r2 by MAF bin is the gold standard
Hardcall GP > 0.9 only when forced	convention	hardcalling loses power vs dosage regression
Impute cases and controls together	best-practice consensus	separate imputation manufactures batch-driven false positives
Switch-error magnitudes are MAC/N-stratified	Hofmeister 2023 Nat Genet 55:1243	no single universal SER threshold; qualify by MAC bin and validation type

Common Errors

Error / symptom	Cause	Solution
Filter on INFO/R2 passes all Beagle variants	wrong field name (Beagle uses DR2)	use DR2; confirm with `bcftools view -h`
Rare-variant signal disappears after QC	flat INFO cutoff as a hidden MAF filter	filter MAF-stratified; state the MAF floor
Uniformly high "accuracy" for rare variants	concordance metric	use masked dosage-r2 by MAF bin
Genome-wide-significant hits do not replicate	cases/controls imputed separately, or hardcalled	impute together; regress on dosages
`bcftools +fill-tags` did not add an Rsq	fill-tags computes AF/MAF/HWE, not imputation quality	the quality field comes from the imputer
Negative Minimac EmpR at a site	strand/allele flip	re-align strand to the panel -> reference-panels

References

Marchini J, Howie B. 2010. Genotype imputation for genome-wide association studies. Nat Rev Genet 11:499-511.
Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5:e1000529.
Das S, Forer L, Schonherr S, et al. 2016. Next-generation genotype imputation service and methods. Nat Genet 48:1284-1287.
Browning BL, Zhou Y, Browning SR. 2018. A one-penny imputed genome from next-generation reference panels. Am J Hum Genet 103:338-348.
Ramnarine S, Zhang J, Chen LS, et al. 2015. When does choice of accuracy measure alter imputation accuracy assessments? PLoS One 10:e0137601.
Magi R, Asimit JL, Day-Williams AG, Zeggini E, Morris AP. 2012. Genome-wide association analysis of imputed rare variants: application to seven common complex diseases. Genet Epidemiol 36:785-796.
Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. 2023. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 55:1243-1249.
Huang KC, Sun W, Wu Y, et al. 2014. Association studies with imputed variants using expectation-maximization likelihood-ratio tests. PLoS One 9:e110679.
Willer CJ, Li Y, Abecasis GR. 2010. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26:2190-2191.

Related Skills

genotype-imputation - Produces the dosages and the quality field this skill filters
haplotype-phasing - Switch-error benchmarking of the phasing that precedes imputation
reference-panels - Panel ancestry mismatch, which the quality metric cannot detect
variant-calling/vcf-statistics - Generic VCF INFO/FORMAT field parsing
population-genetics/association-testing - Consumes the filtered dosages
long-read-sequencing/haplotype-phasing - Read-backed phasing switch QC
workflows/gwas-pipeline - End-to-end QC -> phase -> impute -> associate