| name | bio-population-genetics-linkage-disequilibrium |
| description | Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants. |
| tool_type | mixed |
| primary_tool | plink2 |
Version Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package> then help(module.function) to check signatures
- CLI:
<tool> --version then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
Linkage Disequilibrium
"Calculate LD between my variants" → Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.
- CLI:
plink2 --r2 for LD calculation, --indep-pairwise for pruning
- Python:
allel.rogers_huff_r() for windowed LD in scikit-allel
Calculate LD statistics, prune correlated variants, and identify haplotype blocks.
PLINK LD Calculations
Pairwise r²
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs
plink2 --bfile data --r2-phased square --out ld_matrix
Output Format
# ld_results.ld contains:
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
PLINK 1.9 Options
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld
LD Pruning
Standard Pruning
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
Pruning Parameters
| Parameter | Description | Common Values |
|---|
| Window (50) | Variants per window | 50-200 |
| Step (10) | Variants to shift | 5-50 |
| r² threshold (0.1) | Max LD allowed | 0.1-0.5 |
Use Cases
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune
scikit-allel LD
Pairwise r²
import allel
import numpy as np
callset = allel.read_vcf('data.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
gn = gt.to_n_alt()
r2 = allel.rogers_huff_r(gn[:100]) ** 2
LD Decay
Goal: Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.
Approach: Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()
r2, dist = [], []
n_variants = min(1000, gn.shape[0])
for i in range(n_variants):
for j in range(i + 1, min(i + 100, n_variants)):
r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2
d = pos[j] - pos[i]
r2.append(r)
dist.append(d)
r2 = np.array(r2)
dist = np.array(dist)
bins = np.arange(0, 100001, 1000)
bin_means = []
for i in range(len(bins) - 1):
mask = (dist >= bins[i]) & (dist < bins[i + 1])
if mask.sum() > 0:
bin_means.append(np.mean(r2[mask]))
else:
bin_means.append(np.nan)
plt.figure(figsize=(10, 6))
plt.plot(bins[:-1] / 1000, bin_means)
plt.xlabel('Distance (kb)')
plt.ylabel('Mean r²')
plt.title('LD Decay')
plt.savefig('ld_decay.png')
Haplotype Blocks
PLINK
plink --bfile data --blocks no-pheno-req --out blocks
Block Statistics
import pandas as pd
blocks = pd.read_csv('blocks.blocks.det', sep='\s+')
print(f'Number of blocks: {len(blocks)}')
print(f'Mean block size: {blocks["KB"].mean():.1f} kb')
print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')
LD Matrix Visualization
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()[:200]
r = allel.rogers_huff_r(gn)
r2_matrix = r ** 2
plt.figure(figsize=(10, 10))
plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1)
plt.colorbar(label='r²')
plt.xlabel('Variant index')
plt.ylabel('Variant index')
plt.title('LD Matrix')
plt.savefig('ld_matrix.png', dpi=150)
LD-based Clumping (GWAS)
plink --bfile data \
--clump gwas_results.txt \
--clump-p1 5e-8 \
--clump-p2 1e-5 \
--clump-r2 0.1 \
--clump-kb 250 \
--out clumped
Clump Parameters
| Parameter | Description |
|---|
| --clump-p1 | Index SNP p-value threshold |
| --clump-p2 | Clumped SNP p-value threshold |
| --clump-r2 | LD threshold for clumping |
| --clump-kb | Physical distance threshold |
vcftools LD
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld
Complete Workflow
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out pruned
plink --bfile data --blocks no-pheno-req --out blocks
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld
Related Skills
- plink-basics - File format handling
- population-structure - Use pruned data for PCA
- association-testing - LD clumping for GWAS
- selection-statistics - LD affects EHH statistics