원클릭으로 Manus에서 모든 스킬 실행

bio-phasing-imputation-reference-panels

스타943

포크165

업데이트2026년 6월 16일 14:22

Selects and prepares the reference panel that phasing/imputation copies haplotypes from (1000 Genomes, HRC, TOPMed, HGDP+1kGP/gnomAD, CAAPA), matching panel ancestry to the target, reconciling genome build and chromosome naming, and running the strand/allele harmonization gate. Covers why ancestry-match beats panel size (imputation can only copy haplotypes the panel contains), why palindromic A/T and C/G SNPs flip strand without erroring, why liftover is a strand-flip generator in between-build inverted regions, that HRC is SNP-only and TOPMed is never downloadable (governance can override accuracy), and panel formats (msav, bref3, imp5). Use when choosing a panel for a target ancestry, preparing or converting a panel, aligning study data, or deciding between downloadable and server-only panels. Phasing is haplotype-phasing; imputation is genotype-imputation; PCA for ancestry is population-genetics/population-structure; HLA panels are clinical-databases/hla-typing.

설치

Codex 또는 Claude로 설치 이 Prompt를 복사해 Codex, Claude 또는 다른 어시스턴트에 붙여 넣으면 Skill 페이지를 검토하고 설치를 진행할 수 있습니다.

Manus에서 실행

출처

GPTomics

GPTomics/bioSkills

GitHub 저장소 열기 Creator 저장소 보기

다운로드

Manus에서 실행

Version Compatibility

Reference examples tested with: bcftools 1.19+, PLINK 1.9+.

Before using code patterns, verify installed versions match. If versions differ:

CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

A reference panel is DATA, not a tool: it has a dated release and a fixed genome build (GRCh37 or GRCh38). Record the exact panel name, version, and build with every result; "1000 Genomes" without a version is unreproducible because Phase 3 (GRCh37, low-coverage) and the high-coverage NYGC 3202 release (GRCh38, 30x) are different call sets. HRC and TOPMed are server-only (not downloadable); panel-build tools (Minimac4 --compress-reference, bref3.jar, imp5Converter) and the Will Rayner harmonization check are separate downloads.

Reference Panels -- Choosing and Preparing the Prior Imputation Copies From

"Which reference panel should I use, and how do I prepare it?" -> Match the panel's ancestry to the target population, reconcile build and strand, then convert to the engine's format - because imputation can only copy haplotypes the panel contains, so the panel IS the prior, and a mismatched ancestry or a flipped strand corrupts the result without any error.

CLI: bcftools norm -m -any -f ref.fa then the strand/allele harmonization check against the panel sites, then minimac4 --compress-reference / bref3.jar / imp5Converter to build the engine format

Scope: panel selection (ancestry-match, build, access), strand/allele harmonization, build/liftover, and format conversion. The phasing engine that consumes the panel -> haplotype-phasing. Imputation -> genotype-imputation. PCA to establish the target ancestry -> population-genetics/population-structure. Classical HLA-allele imputation needs a dedicated HLA panel -> clinical-databases/hla-typing. VCF normalization mechanics -> variant-calling/variant-normalization.

The Single Most Important Modern Insight -- Ancestry Match Beats Panel Size, Because Imputation Copies Haplotypes and a Mismatched Panel Has None Worth Copying

The reflex "TOPMed has 97k samples, HRC has 32k, so use TOPMed" is right for a European cohort and wrong for an ancestry-mismatched one, and the reason is mechanical, not statistical: imputation copies haplotype segments from panel samples that resemble the target, so if no panel sample carries the target population's haplotypes there is nothing to copy, and adding ten thousand more European haplotypes does nothing for an East African sample (Marchini & Howie 2010 Nat Rev Genet 11:499). The binding resource is not panel size but how many panel samples share the target's ancestry. Three facts follow:

Ancestry composition often dominates size. HRC (32k, European-heavy) imputes African-ancestry rare variants poorly while TOPMed lifts them dramatically - not because TOPMed is 3x bigger but because it contains African-American and Hispanic/Latino haplotypes. Admixed samples need a panel with both ancestral components and admixed individuals, because admixed haplotypes are mosaics a single-ancestry panel cannot reconstruct at the switch points. The honest answer is "large AND matched"; where both are not available, which one wins depends on whether the target is common or rare variants (size matters more for rare).
The self-graded INFO/R2 cannot see ancestry mismatch. When the panel lacks the target's haplotypes the model still finds some template and reports a confident-looking quality about a copy that is systematically wrong (full metric theory -> imputation-qc). High R2 in an under-represented ancestry is not reassurance; validate against masked truth stratified by frequency.
A panel encodes whose haplotypes are trusted to fill the gaps. HRC is European-heavy because its component cohorts were; TOPMed is diverse because it was designed to be. No panel represents everyone, and the imputation inherits exactly the panel population's representation, including its gaps. This is the central equity problem of imputation, and low-coverage WGS (unbiased ascertainment -> genotype-imputation) is the main route around it.

The Major Panels

Numbers are routinely misquoted; these are the verified figures. State the exact version and build in any method.

Panel	Samples	Build	Indels	Diversity	Access
1000G Phase 3 (Auton 2015)	2,504	GRCh37 (GRCh38 lift)	yes	26 pops, broad but shallow	public download
1000G high-cov NYGC (Byrska-Bishop 2022)	3,202 (incl. 602 trios)	GRCh38	yes	same 26 pops, 30x	public download
HRC r1.1 (McCarthy 2016)	32,470 (64,940 haps)	GRCh37 only	NO - SNP-only, MAF floor ~5e-4	European-heavy	server-only (Michigan)
TOPMed r2 (Taliun 2021)	97,256 (~308M sites)	GRCh38 only	yes	very diverse (large AA, Hispanic)	server-only, never downloadable
HGDP+1kGP / gnomAD (Koenig 2024)	4,094 (76-80 pops)	GRCh38	yes	maximally diverse per-sample	public download
CAAPA (Mathias 2016)	883 African-ancestry	GRCh37	(SNP)	African / African-American	server-supported

The "1000G" trap: Phase 3 (Auton 2015, GRCh37, low-coverage) and the high-coverage NYGC 3202 release (Byrska-Bishop 2022, GRCh38, 30x, with 602 trios) are different call sets; the NYGC release is strictly better for rare variants. State which.

Decision Tree by Scenario

Scenario	Recommended	Why
Study is GRCh37	HRC r1.1, 1000G Phase 3, or CAAPA (all GRCh37)	match the build; avoid liftover
Study is GRCh38	TOPMed, 1000G NYGC, or HGDP+1kGP (all GRCh38)	match the build; avoid liftover
Build mismatch unavoidable	lift over ONCE, strand-aware, then re-run the harmonization check	liftover flips strand in inverted regions (see failure modes)
European cohort, common-variant GWAS	HRC (server) or 1000G	large and European-rich
African / admixed / Hispanic / multi-ancestry	TOPMed (server)	diversity wins; contains the matching haplotypes
Need a downloadable, diverse, local panel	HGDP+1kGP (gnomAD)	global, jointly-called, and not server-gated
Data cannot leave the institution / country	downloadable panels only (1000G, HGDP+1kGP)	governance overrides accuracy (see failure modes)
Need indels imputed	1000G, TOPMed, or HGDP+1kGP	HRC is SNP-only
Classical HLA alleles	-> clinical-databases/hla-typing (dedicated HLA panel)	standard SNP panels cannot impute HLA alleles
Establish the target ancestry first	-> population-genetics/population-structure	PCA, not a panel operation

The governing principle: ancestry match beats panel size, and governance (can the data be uploaded to a US server?) often narrows the field before accuracy does.

The Strand / Allele Harmonization Gate

This is the step that silently corrupts results when skipped. The job: align every study variant's alleles to the panel REF/ALT, fix strand, and drop the variants that cannot be safely resolved. Normalize first (bcftools norm -m -any -f ref.fa), because the same indel represented two ways will not match.

The field-standard gate is Will Rayner's check (HRC-1000G-check-bim.pl and its bgen/VCF variants): it compares a QC'd PLINK .bim plus an allele-frequency file against the panel's sites list and EMITS a Run-plink.sh that updates positions, ref/alt, and strand, removes unresolvable SNPs, and splits by chromosome. The check diagnoses; the script fixes - both are required, and a surprising number of pipelines run the check and never execute the script.

The allele-frequency concordance plot (study AF vs panel AF) is the visual gate, not decoration: a tight diagonal is good; points on the y = 1 - x anti-diagonal are strand flips; a general smear is sample mislabeling or the wrong panel ancestry. Compare against the ancestry-MATCHED sub-panel's frequencies - an African cohort vs a European panel's AF smears even with perfect strand. Read this plot before uploading, every time.

Panel Formats and the Genetic-Map Pairing

Engine	Native format	Build command
Minimac4	`.msav` (current) or legacy `.m3vcf`	`minimac4 --compress-reference ref.vcf.gz > ref.msav` (legacy: `Minimac3 --processReference`)
Beagle 5.x	`.bref3` or plain VCF	`java -jar bref3.jar ref.vcf.gz > ref.bref3` (needs fully phased, non-missing, `
IMPUTE5	`.imp5` or VCF/BCF	`imp5Converter --h ref.vcf.gz --r chr20 --o ref.chr20.imp5`

A panel is half the input; phasing/imputation also needs a genetic (recombination) map matched to the panel build. A panel VCF comes site-only (the legend, used for the harmonization check) or full (the haplotypes, used to impute) - obtain both. The map is build-specific: an hg19 map with a GRCh38 panel silently mis-places recombination rates.

chrX and MHC

chrX: split into PAR1 / nonPAR / PAR2 using build-correct coordinates. PAR and female nonPAR are diploid; male nonPAR is HAPLOID and must be coded haploid (a het call there is an error). Mixed ploidy in one file crashes most tools; the Michigan/TOPMed servers split, impute, and re-merge automatically.
MHC (chr6 ~28-34 Mb): extreme LD and polymorphism. Standard panels impute SNPs there but unreliably and cannot impute classical HLA alleles at all - that needs a dedicated HLA panel and tool -> clinical-databases/hla-typing.

Per-Method Failure Modes

Palindromic (A/T, C/G) SNP strand flip

Trigger: keeping strand-ambiguous SNPs without a frequency-based strand check. Mechanism: A/T and C/G alleles are their own reverse complements, so opposite-strand study and panel still "match" on alleles; the variant passes every join and imputes cleanly while allele-swapped. Symptom: flipped effect direction at that locus and everything imputed in LD with it; no error. Fix: resolve strand by allele frequency; drop palindromic SNPs with MAF > 0.4 (cannot be disambiguated near 0.5); treat "I kept all palindromic SNPs" as proof strand was never checked.

Liftover across builds

Trigger: running liftOver to reach a panel in the other build and imputing without re-checking. Mechanism: ~2-5 Mb of the genome is inverted between GRCh37 and GRCh38 (BBIS regions); lifting a variant there changes its strand, and the allele-based check cannot see it on a palindrome (Sheng & Chiang 2023 HGG Adv 4:100159). Symptom: silent allele-swaps in inverted regions; the TOPMed server's own internal conversion had this bug. Fix: prefer a panel native to the study build; if forced, lift once with a strand-aware method, then re-run the harmonization check against the new build.

Chromosome-naming mismatch

Trigger: 1 vs chr1 between study and panel. Mechanism: GRCh38/TOPMed use chr prefixes, GRCh37 panels do not, so the join matches nothing. Symptom: "0 variants matched" and a wasted day; does not corrupt, just fails. Fix: bcftools annotate --rename-chrs before any check.

Expecting a SNP-only or rarity-floored panel to carry a variant

Trigger: imputing an indel against HRC, or a variant rarer than the panel floor. Mechanism: HRC is SNP-only; its MAC>=5 cutoff means nothing below MAF ~5e-4 is in the panel; un-present variants cannot be imputed at any quality. Symptom: the variant is absent or near-zero R2, misread as "imputed poorly." Fix: use a panel that contains the variant class (1000G/TOPMed/gnomAD for indels; a WGS panel for rarer variants).

Server-only panel blocked by governance

Trigger: planning to use TOPMed/HRC for data that cannot be uploaded. Mechanism: TOPMed is never downloadable and both are server-only; consent/data-residency/IRB rules may forbid uploading participant genotypes to a US server. Symptom: the best panel is legally unusable. Fix: use a downloadable panel (1000G, HGDP+1kGP) and impute locally.

Quantitative Thresholds

Threshold	Source	Rationale
Drop palindromic (A/T, C/G) SNPs with MAF > 0.4	Rayner check default	strand unresolvable from alleles and frequency too near 0.5 to disambiguate
Allele-frequency concordance flag > 0.2 (stringent 0.1)	Rayner check default	a large study-vs-panel AF gap signals a strand, build, or ancestry problem
HRC MAF floor ~5e-4 (MAC>=5 / 32,470)	McCarthy 2016 Nat Genet 48:1279	nothing rarer is in the panel and cannot be imputed
Male nonPAR chrX coded haploid	biological ploidy	a het call in male nonPAR is an error and mis-models every male
Genetic map must match the panel build	Li & Stephens 2003 Genetics 165:2213	an hg19 map on a GRCh38 panel mis-places recombination silently
Match AF comparison to the ancestry-matched sub-panel	Marchini & Howie 2010 Nat Rev Genet 11:499	true frequencies differ by ancestry; a mismatch smears the plot even with perfect strand

Common Errors

Error / symptom	Cause	Solution
"0 variants matched" the panel	chr naming (`1` vs `chr1`)	`bcftools annotate --rename-chrs`
Flipped effect direction at some loci	unresolved palindromic strand	run the harmonization check; drop A/T,C/G MAF>0.4; execute `Run-plink.sh`
Indels missing after imputation	HRC is SNP-only	use 1000G/TOPMed/gnomAD
AF concordance plot smears off-diagonal	wrong panel ancestry or sample mislabel	compare to the matched sub-panel; check sample labels
Cannot download HRC/TOPMed (403)	server-only / access-controlled	use the imputation server, or a downloadable panel
Engine errors building bref3	unphased or missing genotypes in the panel VCF	bref3 needs fully phased, non-missing, `
Imputation degraded in one region after liftover	BBIS inverted region strand flip	use a native-build panel; re-check after any liftover
Tempted to impute ancestry subgroups of one cohort separately	re-creates the differential-imputation confound (batch-differential quality)	impute all samples together against one large diverse panel (TOPMed or HGDP+1kGP), not per-stratum -> imputation-qc

References

Auton A, Brooks LD, Durbin RM, et al. (1000 Genomes Project Consortium). 2015. A global reference for human genetic variation. Nature 526:68-74.
Byrska-Bishop M, Evani US, Zhao X, et al. 2022. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185:3426-3440.
McCarthy S, Das S, Kretzschmar W, et al. 2016. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48:1279-1283.
Taliun D, Harris DN, Kessler MD, et al. 2021. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590:290-299.
Koenig Z, Yohannes MT, Nkambule LL, et al. 2024. A harmonized public resource of deeply sequenced diverse human genomes. Genome Res 34:796-809.
Mathias RA, Taub MA, Gignoux CR, et al. 2016. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat Commun 7:12522.
Marchini J, Howie B. 2010. Genotype imputation for genome-wide association studies. Nat Rev Genet 11:499-511.
Das S, Forer L, Schonherr S, et al. 2016. Next-generation genotype imputation service and methods. Nat Genet 48:1284-1287.
Sheng X, Xia L, Cahoon JL, et al. 2023. Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing. HGG Adv 4:100159.
Luo Y, Kanai M, Choi W, et al. 2021. A high-resolution HLA reference panel capturing global population diversity enables multi-ancestry fine-mapping in HIV host response. Nat Genet 53:1504-1516.

Related Skills

haplotype-phasing - The phasing engine that consumes the panel; the genetic-map pairing
genotype-imputation - Impute untyped variants once the panel is prepared
imputation-qc - INFO/R2 quality, which cannot detect ancestry mismatch
variant-calling/variant-normalization - Split multiallelics and left-align before harmonization
population-genetics/population-structure - PCA to establish target ancestry for panel choice
clinical-databases/hla-typing - Classical HLA-allele imputation with a dedicated panel
workflows/gwas-pipeline - End-to-end QC -> phase -> impute -> associate

name	bio-phasing-imputation-reference-panels
description	Selects and prepares the reference panel that phasing/imputation copies haplotypes from (1000 Genomes, HRC, TOPMed, HGDP+1kGP/gnomAD, CAAPA), matching panel ancestry to the target, reconciling genome build and chromosome naming, and running the strand/allele harmonization gate. Covers why ancestry-match beats panel size (imputation can only copy haplotypes the panel contains), why palindromic A/T and C/G SNPs flip strand without erroring, why liftover is a strand-flip generator in between-build inverted regions, that HRC is SNP-only and TOPMed is never downloadable (governance can override accuracy), and panel formats (msav, bref3, imp5). Use when choosing a panel for a target ancestry, preparing or converting a panel, aligning study data, or deciding between downloadable and server-only panels. Phasing is haplotype-phasing; imputation is genotype-imputation; PCA for ancestry is population-genetics/population-structure; HLA panels are clinical-databases/hla-typing.
tool_type	cli
primary_tool	bcftools

bio-phasing-imputation-reference-panels

이 저장소의 다른 Skills

이 저장소의 다른 Skills

Version Compatibility

Reference Panels -- Choosing and Preparing the Prior Imputation Copies From

The Single Most Important Modern Insight -- Ancestry Match Beats Panel Size, Because Imputation Copies Haplotypes and a Mismatched Panel Has None Worth Copying

The Major Panels

Decision Tree by Scenario

The Strand / Allele Harmonization Gate

Panel Formats and the Genetic-Map Pairing

chrX and MHC

Per-Method Failure Modes

Palindromic (A/T, C/G) SNP strand flip

Liftover across builds

Chromosome-naming mismatch

Expecting a SNP-only or rarity-floored panel to carry a variant

Server-only panel blocked by governance

Quantitative Thresholds

Common Errors

References

Related Skills

Version Compatibility

Reference Panels -- Choosing and Preparing the Prior Imputation Copies From

The Single Most Important Modern Insight -- Ancestry Match Beats Panel Size, Because Imputation Copies Haplotypes and a Mismatched Panel Has None Worth Copying

The Major Panels

Decision Tree by Scenario

The Strand / Allele Harmonization Gate

Panel Formats and the Genetic-Map Pairing

chrX and MHC

Per-Method Failure Modes

Palindromic (A/T, C/G) SNP strand flip

Liftover across builds

Chromosome-naming mismatch

Expecting a SNP-only or rarity-floored panel to carry a variant

Server-only panel blocked by governance

Quantitative Thresholds

Common Errors

References

Related Skills