| name | bio-workflows-clip-pipeline |
| workflow | true |
| depends_on | ["clip-seq/clip-preprocessing","clip-seq/clip-alignment","clip-seq/clip-qc","clip-seq/clip-peak-calling","clip-seq/crosslink-site-detection","clip-seq/binding-site-annotation","clip-seq/clip-motif-analysis","clip-seq/differential-clip"] |
| qc_checkpoints | ["preprocessing_retention","alignment_rate","library_complexity","frip","idr"] |
| description | End-to-end CLIP-seq pipeline from FASTQ to ENCODE-compliant binding sites, single-nucleotide crosslink maps, annotation, motifs, and (optionally) differential binding. Use when running the full Yeo lab eCLIP / iCLIP / iCLIP2 / iCLIP3 / irCLIP / PAR-CLIP analysis with SMInput control, protocol-specific UMI extraction, ENCODE STAR parameters, CLIPper or Skipper peak calling with stringent log2 FC and -log10 p thresholds, IDR rescue and self-consistency QC, and downstream motif registration with mCross or PEKA. |
| tool_type | mixed |
| primary_tool | CLIPper |
Version Compatibility
Reference examples tested with: umi_tools 1.1.5+, cutadapt 4.6+, fastp 0.23+, STAR 2.7.11b+, samtools 1.19+, bedtools 2.31+, CLIPper 2.0+, Skipper (commit 2023.05+), PureCLIP 1.3.1+, HOMER 4.11+, ChIPseeker 1.40+, preseq 3.2+, picard 3.1+, idr 2.0.4+, MultiQC 1.21+.
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
<tool> --version then <tool> --help to confirm flags
- Python:
pip show <package> then help(module.function) to check signatures
- R:
packageVersion('<pkg>') then ?function_name to verify parameters
If code throws unexpected errors, introspect the installed tool and adapt the example rather than retrying.
CLIP-seq End-to-End Pipeline
"Analyze my CLIP-seq data from raw FASTQ to ENCODE-compliant binding sites" -> Orchestrate protocol-specific UMI extraction, 3'-only adapter trimming (preserving the R2 5' truncation = crosslink site -1), ENCODE STAR alignment, UMI-based deduplication, library complexity QC, peak calling against SMInput with stringent thresholds (log2 FC >= 3 AND -log10 p >= 3), single-nucleotide crosslink-site detection, ChIPseeker annotation with CLIP-appropriate tssRegion, motif discovery with GC-matched background and CL-position registration, and optional differential binding between conditions.
Pipeline Overview
FASTQ + SMInput
-> [clip-preprocessing] UMI extract + 3' adapter trim (-q 6 -m 18) + two-pass for eCLIP
-> [clip-alignment] STAR ENCODE block (alignEndsType EndToEnd, mismatch 0.04 or 0.07 for PAR-CLIP) + UMI dedup
-> [clip-qc] preseq, FRiP, IDR rescue + self-consistency, read distribution
-> [clip-peak-calling] CLIPper + SMInput log2 norm (stringent: log2 FC >= 3, -log10 p >= 3) OR Skipper (210-320% more sites)
-> [crosslink-site-detection] PureCLIP or CTK CITS for single-nt CL positions
-> [binding-site-annotation] ChIPseeker (tssRegion=c(-100,100), level=transcript) + RBP-Maps for splicing factors
-> [clip-motif-analysis] HOMER + mCross (registered) + RBNS Kd cross-check
-> [differential-clip] DEWSeq window-level NB with type:condition interaction (optional)
CLIP Variant Selection
| Variant | When to use | UMI pattern | STAR mismatch ceiling | Detection signal |
|---|
| eCLIP (Van Nostrand 2016) | ENCODE comparability; SMInput available | 10 nt R1 | 0.04 | R2 5' truncation |
| iCLIP / iCLIP2 / iCLIP3 | Single-end; high motif specificity | NNNXXXXNN (3+4+2; demux first) | 0.04 | R1 5' truncation |
| irCLIP / FLASH | Non-radioactive; fast | Protocol-specific | 0.04 | Truncation |
| PAR-CLIP | Photoactivatable nucleoside (4SU); HEK293/K562 | 4 nt typical | 0.07 (raised for T->C) | T->C transitions |
| miCLIP / miCLIP2 | m6A modification | iCLIP-style | 0.04 | Truncation + C->T at m6A |
| STAMP / scSTAMP | Antibody-free; in vivo or single-cell | NA (no UV) | 0.04 (RNA-seq mode) | C->U editing (RBP-APOBEC1 fusion) |
| chimeric eCLIP / miR-eCLIP | Direct miRNA-target pairs | 10 nt R1 | 0.04 | Chimeric reads |
Step 1: Quality Control of Raw FASTQ
fastqc raw_R1.fq.gz raw_R2.fq.gz -o qc/raw/
zcat raw_R1.fq.gz | awk 'NR%4==2' | head -100 | cut -c1-12 | sort | uniq -c | sort -rn | head
Step 2: Preprocessing (Protocol-Specific)
Goal: Convert raw CLIP FASTQ into UMI-deduplicated, alignment-ready FASTQ while preserving the R2 5' end (= crosslink site -1) that drives single-nucleotide resolution downstream.
Approach: Use the protocol-matched UMI pattern (10 nt eCLIP, NNNXXXXNN iCLIP, 4 nt PAR-CLIP), run umi_tools extract to move random barcodes to read names, then apply cutadapt with 3'-only adapter trimming at -q 6 -m 18 (permissive 5' to protect the truncation base). eCLIP uses two-pass trimming to remove read-through inline adapters from R2 5' only; iCLIP and PAR-CLIP use single-pass.
umi_tools extract \
--bc-pattern=NNNNNNNNNN \
--stdin=raw_R1.fq.gz --read2-in=raw_R2.fq.gz \
--stdout=R1.umi.fq.gz --read2-out=R2.umi.fq.gz \
--log=qc/umi_extract.log
cutadapt \
-a AGATCGGAAGAGCACACGTCT \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
--quality-base 33 -q 6 -m 18 \
-j 8 \
-o R1.p1.fq.gz -p R2.p1.fq.gz \
R1.umi.fq.gz R2.umi.fq.gz \
> qc/cutadapt_pass1.log 2>&1
cutadapt \
-G GATCGTCGGACTGTAGAACTCTGAAC \
--quality-base 33 -q 6 -m 18 \
-j 8 \
-o R1.trim.fq.gz -p R2.trim.fq.gz \
R1.p1.fq.gz R2.p1.fq.gz \
>> qc/cutadapt_pass2.log 2>&1
For PAR-CLIP: same UMI extraction but downstream alignment raises --outFilterMismatchNoverReadLmax from 0.04 to 0.07 (the T->C signature would otherwise be filtered as sequencing error). See clip-seq/clip-preprocessing for full per-protocol guidance.
Step 3: Alignment (ENCODE STAR Block)
STAR --runMode alignReads \
--runThreadN 16 \
--genomeDir /path/to/STAR_hg38_index \
--genomeLoad NoSharedMemory \
--readFilesIn R1.trim.fq.gz R2.trim.fq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outFilterMultimapNmax 1 \
--alignEndsType EndToEnd \
--outFilterMismatchNoverReadLmax 0.04 \
--outFilterScoreMinOverLread 0.66 \
--outFilterMatchNminOverLread 0.66 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--outFileNamePrefix sample_
samtools index sample_Aligned.sortedByCoord.out.bam
samtools view -b -q 10 sample_Aligned.sortedByCoord.out.bam > sample_q10.bam
samtools index sample_q10.bam
umi_tools dedup \
--stdin=sample_q10.bam \
--stdout=sample_dedup.bam \
--method=unique \
--paired \
--log=qc/dedup.log
samtools index sample_dedup.bam
For PAR-CLIP: change --outFilterMismatchNoverReadLmax 0.04 to 0.07. For repeat-binding RBPs (MATR3, ZFP36, FUS at LINE-1, HNRNPK at SINEs): change --outFilterMultimapNmax 1 to 100 and add --outSAMmultNmax -1, then run CLAM downstream for EM-based multi-mapper assignment. See clip-seq/clip-alignment for full guidance.
Step 4: QC (Five Gates)
grep -E "passing filters|Pairs written" qc/cutadapt_pass1.log
grep "Uniquely mapped reads %" sample_Log.final.out
preseq lc_extrap -B -P sample_q10.bam -o qc/preseq.txt
multiqc qc/ -o qc/multiqc/
CLIP libraries have 40-70% PCR duplication BY DESIGN (the IP enriches a small molecule pool). Low duplication usually means failed IP, not a good library. The unique-fragment count after UMI dedup is the actual quality metric. See clip-seq/clip-qc for full five-gate diagnostic.
Step 5: Peak Calling
clipper \
-b sample_dedup.bam \
-s hg38 \
-o peaks/sample.clipper.bed \
--FDR 0.05 \
--superlocal \
--save-pickle \
--processors 8
python overlap_peakfi_with_bam_PE.py \
peaks/sample.clipper.bed \
sample_dedup.bam sminput_dedup.bam \
sample_dedup.bam.readnum.txt sminput_dedup.bam.readnum.txt \
peaks/sample.normed.bed
python compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.py \
peaks/sample.normed.bed \
peaks/sample.compressed.bed
awk 'BEGIN{FS=OFS="\t"} $5 >= 3 && $6 >= 3' peaks/sample.compressed.bed > peaks/sample.stringent.bed
For maximum sensitivity (210-320% more sites than CLIPper for mRNA-binding RBPs), use Skipper Snakemake workflow with the same SMInput control. Mandatory for FASTKD2 / mt-RBPs which CLIPper misses on chrM. See clip-seq/clip-peak-calling for the full caller taxonomy.
Step 6: Single-Nucleotide Crosslink-Site Detection
pureclip \
-i sample_dedup.bam -bai sample_dedup.bam.bai \
-g genome.fa \
-ibam sminput_dedup.bam -ibai sminput_dedup.bam.bai \
-o crosslinks/sample.sites.bed \
-or crosslinks/sample.regions.bed \
-nt 8 -dm 8 \
-iv expressed_tx.bed
Single-nt CL sites feed mCross motif registration and allele-specific binding analyses. They are NOT a replacement for the broad peak list; complementary outputs. See clip-seq/crosslink-site-detection.
Step 7: IDR Across Replicates
sort -k5,5gr peaks/rep1.compressed.bed > peaks/rep1.sorted.bed
sort -k5,5gr peaks/rep2.compressed.bed > peaks/rep2.sorted.bed
idr --samples peaks/rep1.sorted.bed peaks/rep2.sorted.bed \
--input-file-type bed --rank 5 \
--output-file qc/idr.true.out \
--idr-threshold 0.05 \
--plot --log-output-file qc/idr.log
Step 8: Binding-Site Annotation
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peaks <- readPeakFile('peaks/sample.stringent.bed')
anno <- annotatePeak(
peaks,
TxDb = txdb,
level = 'transcript',
tssRegion = c(-100, 100),
genomicAnnotationPriority = c('Promoter','5UTR','3UTR','Exon','Intron','Downstream','Intergenic')
)
plotAnnoPie(anno)
Default ChIPseeker tssRegion=c(-3000, 3000) over-extends for CLIP (would label 30-50% peaks as "Promoter"). Splicing factors additionally need RBP-Maps (Yeo lab) for the 1400 nt cassette-exon regulatory metagene. See clip-seq/binding-site-annotation.
Step 9: Motif Analysis (De Novo + CL-Registered)
bedtools getfasta -fi genome.fa -bed peaks/sample.stringent.bed -s -fo motifs/peaks.fa
bedtools shuffle -i peaks/sample.stringent.bed -g chrom.sizes \
-incl expressed_3utr.bed -seed 42 > motifs/background.bed
bedtools getfasta -fi genome.fa -bed motifs/background.bed -s -fo motifs/background.fa
findMotifs.pl motifs/peaks.fa fasta motifs/homer \
-rna -len 5,6,7,8 -p 8 -fasta motifs/background.fa
mCross -i crosslinks/sample.sites.bed -g genome.fa -k 7 -n 5 -o motifs/mcross
UV254 crosslinking has a strong U bias (~60-80% CL events at U); naive logos centered on CL positions are U-enriched even for non-U-binding RBPs. mCross corrects this by registering motif relative to the CL offset. See clip-seq/clip-motif-analysis.
Step 10: Differential Binding (Optional, Across Conditions)
library(DEWSeq)
counts <- read.table('counts/merged.tsv', sep='\t', header=TRUE, row.names=1)
colData <- data.frame(
type = c('ip','ip','ip','ip','sminput','sminput','sminput','sminput'),
condition = c('treat','treat','ctrl','ctrl','treat','treat','ctrl','ctrl')
)
dds <- DESeqDataSetFromSlidingWindows(
countData=counts, colData=colData,
annotObj='annotation_windows.bed',
design = ~ type + condition + type:condition
)
dds <- DESeq(dds)
res <- results(dds, name='typeip.conditiontreat')
See clip-seq/differential-clip for full DEWSeq workflow and the htseq-clip preprocessing required upstream.
Quality Checkpoints
| Step | Metric | ENCODE target |
|---|
| Preprocessing | Retention after adapter trim | >= 70% |
| Alignment | Unique mapping rate | >= 60% (eCLIP); >= 70% (iCLIP) |
| Complexity | preseq predicted unique at 100M reads | >= 10M (good); >= 1M (minimum acceptable) |
| Peak calling | FRiP (narrow-binding RBP) | >= 0.005 |
| Peak calling | Stringent peaks log2(IP/SMI) | >= 3 |
| Peak calling | Stringent peaks -log10 p | >= 3 |
| IDR | Rescue ratio | < 2 |
| IDR | Self-consistency ratio | < 2 |
| Annotation | Top RBP-class match expectation | Y (HuR -> 3' UTR; PTBP1 -> intron; FASTKD2 -> chrM) |
Per-Variant Adjustments
- PAR-CLIP: Raise STAR
--outFilterMismatchNoverReadLmax from 0.04 to 0.07; downstream use PARalyzer or CTK CIMS substitution T->C
- iCLIP / iCLIP2 multiplexed: Demultiplex by inline library barcode (NNNXXXXNN) BEFORE umi_tools extract
- HITS-CLIP: Use deletion-tolerant aligner (BWA-aln); downstream CTK CIMS deletion mode
- Repeat-binding RBPs: STAR
--outFilterMultimapNmax 100 --outSAMmultNmax -1 + CLAM EM rescue
- m6A profiling: Switch to clip-seq/m6a-clip (miCLIP2 + m6Aboost or GLORI)
- Antibody unavailable: Switch to clip-seq/stamp-antibody-free (STAMP or TRIBE)
- miRNA targets: Switch to clip-seq/ago-clip-mirna-targets (chimeric eCLIP / miR-eCLIP)
- Variant-effect prediction: Use clip-seq/clip-deep-learning (RBPNet or RNAProt)
Related Skills
- clip-seq/clip-preprocessing - UMI extraction and adapter trimming details
- clip-seq/clip-alignment - STAR ENCODE block + multi-mapper rescue
- clip-seq/clip-qc - Five-gate QC framework
- clip-seq/clip-peak-calling - CLIPper / Skipper / PureCLIP / CTK taxonomy
- clip-seq/crosslink-site-detection - Single-nt CL detection by chemistry
- clip-seq/binding-site-annotation - ChIPseeker + RBP-Maps
- clip-seq/clip-motif-analysis - HOMER + mCross + RBNS validation
- clip-seq/differential-clip - DEWSeq + Flipper for cross-condition
- clip-seq/m6a-clip - miCLIP2 / GLORI / DART for m6A modifications
- clip-seq/stamp-antibody-free - STAMP / TRIBE for antibody-free profiling
- clip-seq/ago-clip-mirna-targets - chimeric eCLIP for direct miRNA-target pairs
- clip-seq/clip-deep-learning - RBPNet / RNAProt for variant-effect prediction
- read-qc/quality-reports - FastQC / MultiQC upstream QC
- reporting/automated-qc-reports - MultiQC aggregates the per-tool QC into one report; gating stays in the five-gate framework, not MultiQC
- alternative-splicing/differential-splicing - Cassette exon tables for RBP-Maps