bio-workflows-clip-pipeline
End-to-end CLIP-seq analysis from FASTQ to binding sites and motif enrichment. Use when analyzing protein-RNA interactions from CLIP-based methods.
| Field | Value |
|---|---|
| name | bio-workflows-clip-pipeline |
| description | End-to-end CLIP-seq analysis from FASTQ to binding sites and motif enrichment. Use when analyzing protein-RNA interactions from CLIP-based methods. |
| tool_type | mixed |
| primary_tool | CLIPper |
FASTQ → QC → UMI extract → Trim adapters → Align → Filter → Dedup → Peak call → Annotate → Motifs
| Method | UMI | Crosslink Site | Adapter |
|---|---|---|---|
| HITS-CLIP | Optional | Deletions | 3' adapter |
| PAR-CLIP | Optional | T→C mutations | 3' adapter |
| iCLIP | Required | 5' of read | 3' adapter |
| eCLIP | Required | 5' of read | 3' adapter |
# Initial QC
fastqc reads.fastq.gz -o qc_pre/
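# Check for adapter contamination and UMI structure
# For eCLIP: expect a 10 nt UMI at the read start
# Print the first 15 bases of the first 25 reads (sequence lines only, skipping headers and qualities)
zcat reads.fastq.gz | head -n 100 | awk 'NR % 4 == 2' | cut -c1-15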
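Eyeballing a few reads only goes so far, so a tally of read prefixes can make the UMI check more systematic. The sketch below is one way to do that with `awk`; `umi_prefix_counts` is a hypothetical helper name, not part of any tool mentioned here. A truly random UMI should give a fairly flat distribution, while one dominant prefix hints at a fixed barcode or adapter at the read start.

```shell
# Tabulate the first K bases of every read in a FASTQ stream on stdin.
# Output: prefix<TAB>count, most frequent first.
# umi_prefix_counts is a hypothetical helper name used for illustration.
umi_prefix_counts() {
    k="${1:-10}"
    awk -v k="$k" '
        NR % 4 == 2 { n[substr($0, 1, k)]++ }          # sequence lines only
        END { for (p in n) print p "\t" n[p] }
    ' | sort -k2,2nr -k1,1
}
```

For example, `zcat reads.fastq.gz | head -n 400000 | umi_prefix_counts 10 | head` summarises the first 100,000 reads.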
# eCLIP (10nt UMI at 5' end)
umi_tools extract \
--stdin=reads.fastq.gz \
--bc-pattern=NNNNNNNNNN \
--stdout=extracted.fastq.gz \
--log=umi_extract.log
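# iCLIP (example layout: 5 nt UMI followed by a 5 nt experimental barcode)
# N = UMI base (moved to the read name); X = base left on the read. Match the pattern to your library design.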
umi_tools extract \
--stdin=reads.fastq.gz \
--bc-pattern=NNNNNXXXXX \
--stdout=extracted.fastq.gz
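The `--bc-pattern` strings above are easier to reason about with a toy example. The following is a conceptual sketch only, not how `umi_tools` is implemented: `apply_bc_pattern` is a hypothetical helper that splits one read according to an N/X pattern, assuming N marks UMI bases and X marks bases that stay on the read.

```shell
# Conceptual illustration of an N/X barcode pattern applied to a single read.
# $1 = pattern (N = UMI base, X = base left on the read), $2 = read sequence.
# Prints: UMI<TAB>remaining read sequence.
# apply_bc_pattern is a hypothetical helper, not a umi_tools command.
apply_bc_pattern() {
    awk -v pat="$1" -v seq="$2" 'BEGIN {
        umi = ""; kept = ""
        for (i = 1; i <= length(pat); i++) {
            c = substr(pat, i, 1)
            b = substr(seq, i, 1)
            if (c == "N") umi = umi b          # goes to the UMI
            else if (c == "X") kept = kept b   # stays on the read
        }
        rest = substr(seq, length(pat) + 1)    # bases after the pattern
        print umi "\t" kept rest
    }'
}
```

With `apply_bc_pattern NNNNNXXXXX ACGTAGGCCTTTTTT`, the UMI is `ACGTA` and the read keeps `GGCCTTTTTT`, which shows that X bases still need trimming or demultiplexing before alignment.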
# Trim 3' adapter (common eCLIP adapter)
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--minimum-length 20 \
--quality-cutoff 20 \
-o trimmed.fastq.gz \
extracted.fastq.gz
# For paired-end reads: trim the 3' adapter from both mates (-a for R1, -A for R2)
cutadapt -a AGATCGGAAGAGCACACGTCT \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
--minimum-length 20 \
-o trimmed_R1.fq.gz -p trimmed_R2.fq.gz \
extracted_R1.fq.gz extracted_R2.fq.gz
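CLIP inserts are short, so it can be worth looking at the read-length distribution after trimming rather than only the surviving read count. A minimal sketch, assuming a FASTQ stream on stdin; `fastq_length_hist` is a hypothetical helper name.

```shell
# Read-length histogram for a FASTQ stream on stdin.
# Output: length<TAB>count, sorted by length.
# fastq_length_hist is a hypothetical helper name used for illustration.
fastq_length_hist() {
    awk '
        NR % 4 == 2 { n[length($0)]++ }               # sequence lines only
        END { for (l in n) print l "\t" n[l] }
    ' | sort -k1,1n
}
```

Usage could look like `zcat trimmed.fastq.gz | fastq_length_hist`. With `--minimum-length 20` no bin below 20 should appear.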
# Build STAR index (once)
STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles genome.fa \
--sjdbGTFfile genes.gtf \
--sjdbOverhang 100
# Align with STAR (optimized for short CLIP reads)
STAR --genomeDir star_index \
--readFilesIn trimmed.fastq.gz \
--readFilesCommand zcat \
--outFilterMismatchNmax 2 \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--alignEndsType EndToEnd \
--outFileNamePrefix clip_
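STAR writes a summary to `<prefix>Log.final.out` (here `clip_Log.final.out`), and because `--outFilterMultimapNmax 1` keeps only unique alignments, the uniquely mapped percentage is the figure to watch. The sketch below pulls that number out of the log; `star_unique_pct` is a hypothetical helper name, and it assumes the log keeps the usual `label | value` layout.

```shell
# Extract the uniquely mapped read percentage from a STAR Log.final.out on stdin.
# Prints the number without the trailing percent sign.
# star_unique_pct is a hypothetical helper name used for illustration.
star_unique_pct() {
    awk -F '|' '
        /Uniquely mapped reads %/ {
            gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
            print $2
        }'
}
```

For example: `star_unique_pct < clip_Log.final.out`.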
# Remove unmapped and low-quality reads
samtools view -b -F 4 -q 10 clip_Aligned.sortedByCoord.out.bam > filtered.bam
samtools index filtered.bam
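# Optional: remove reads mapping to rRNA/tRNA
# If you use this output, index it and pass filtered_norRNA.bam (not filtered.bam) to the dedup step
bedtools intersect -v -abam filtered.bam -b rrna_trna.bed > filtered_norRNA.bam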
# UMI-aware deduplication
umi_tools dedup \
-I filtered.bam \
-S dedup.bam \
--output-stats=dedup_stats
samtools index dedup.bam
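# Check deduplication rate (fraction of reads removed as PCR duplicates)
before=$(samtools view -c filtered.bam)
after=$(samtools view -c dedup.bam)
echo "Duplication rate: $(echo "scale=4; 1 - $after / $before" | bc)"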
# CLIPper (recommended)
clipper -b dedup.bam -s hg38 -o peaks.bed --FDR 0.05 --superlocal
# Alternative: Piranha
Piranha -s dedup.bam -o piranha_peaks.bed -p 0.01
# For PAR-CLIP with T→C mutations
PARalyzer settings.ini
# Strand-specific calling
samtools view -b -F 16 dedup.bam > plus.bam
samtools view -b -f 16 dedup.bam > minus.bam
clipper -b plus.bam -s hg38 -o peaks_plus.bed
clipper -b minus.bam -s hg38 -o peaks_minus.bed
cat peaks_plus.bed peaks_minus.bed | sort -k1,1 -k2,2n > peaks_stranded.bed
# Annotate with gene features
bedtools intersect -a peaks.bed -b genes.gtf -wo > peaks_annotated.txt
# Or use HOMER
annotatePeaks.pl peaks.bed hg38 > peaks_homer_annotated.txt
# Feature distribution
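# Column 8 is HOMER's Annotation field; skip the header and drop the "(transcript, ...)" detail
# so that counts are grouped by feature class (intron, exon, 3' UTR, ...)
awk -F'\t' 'NR > 1 { sub(/ *\(.*/, "", $8); print $8 }' peaks_homer_annotated.txt | sort | uniq -c | sort -rn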
# Extract peak sequences
bedtools getfasta -fi genome.fa -bed peaks.bed -s -fo peaks.fa
# HOMER motif finding (RNA mode)
findMotifs.pl peaks.fa fasta motif_output -rna -len 5,6,7,8 -p 8
# MEME-ChIP
meme-chip -oc meme_output -dna peaks.fa -meme-mod zoops -meme-nmotifs 10
# For iCLIP/eCLIP: identify crosslink sites (read 5' ends)
bedtools genomecov -ibam dedup.bam -bg -5 -strand + > crosslinks_plus.bg
bedtools genomecov -ibam dedup.bam -bg -5 -strand - > crosslinks_minus.bg
# For PAR-CLIP: identify T→C conversion sites
# Requires specialized tools like PARpipe
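When single-nucleotide positions are needed rather than a coverage track, the read 5' ends can also be written out as 1 nt BED intervals. The sketch below assumes BED6 input on stdin (for instance from `bedtools bamtobed -i dedup.bam`) and takes the 5' end itself as the crosslink proxy; some iCLIP workflows shift one nucleotide upstream instead, which this sketch does not do. `five_prime_sites` is a hypothetical helper name.

```shell
# Convert BED6 read intervals on stdin into 1-nt intervals at each read's 5' end.
# Plus strand: the first base (start). Minus strand: the last base (end - 1).
# five_prime_sites is a hypothetical helper name used for illustration.
five_prime_sites() {
    awk 'BEGIN { FS = OFS = "\t" }
        $6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }
        $6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }'
}
```

Usage could look like `bedtools bamtobed -i dedup.bam | five_prime_sites > crosslink_sites.bed`.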
| Step | Metric | Expected |
|---|---|---|
| Raw | Read count | >10M |
| Trimmed | Reads ≥20 nt | >80% |
| Aligned | Mapping rate | >50% |
| Dedup | Unique rate | >20% |
| Peaks | Peak count | 1,000-50,000 |
| Peaks | Median width | 20-100 nt |
| FRiP | Reads in peaks | >10% |
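The peak count and median width rows in the table can be checked directly from the peak BED file. A minimal sketch, assuming a BED stream on stdin with start and end in columns 2 and 3; `peak_stats` is a hypothetical helper name.

```shell
# Report peak count and median peak width for a BED stream on stdin.
# Output: count<TAB>median_width (or "0<TAB>NA" for empty input).
# peak_stats is a hypothetical helper name used for illustration.
peak_stats() {
    awk '$0 !~ /^#/ { print $3 - $2 }' | sort -n | awk '
        { w[NR] = $1 }
        END {
            if (NR == 0) { print "0\tNA"; exit }
            if (NR % 2) m = w[(NR + 1) / 2]               # odd: middle value
            else m = (w[NR / 2] + w[NR / 2 + 1]) / 2       # even: mean of the two middle values
            print NR "\t" m
        }'
}
```

For example, `peak_stats < peaks.bed` prints both numbers on one line for comparison against the expected ranges.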
# Calculate FRiP
reads_in_peaks=$(bedtools intersect -a dedup.bam -b peaks.bed -u | samtools view -c -)
total_reads=$(samtools view -c dedup.bam)
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
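The `bc` call above prints values such as `.1500` without a leading zero and fails if the BAM is empty. An `awk`-based variant is sketched here as one alternative; `frip` is a hypothetical helper name that takes the two read counts as arguments.

```shell
# Compute FRiP from two read counts: reads in peaks and total reads.
# Prints a fixed four-decimal fraction, or "NA" when the total is zero.
# frip is a hypothetical helper name used for illustration.
frip() {
    awk -v in_peaks="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "NA"; exit }
        printf "%.4f\n", in_peaks / total
    }'
}
```

With the variables computed above, this would be called as `frip "$reads_in_peaks" "$total_reads"`.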
#!/bin/bash
set -euo pipefail
SAMPLE=$1
READS=$2
GENOME_DIR=$3
GENOME_FA=$4
mkdir -p qc trimmed aligned peaks motifs
# QC
fastqc "$READS" -o qc/
# UMI extract
umi_tools extract --stdin="$READS" --bc-pattern=NNNNNNNNNN \
--stdout="trimmed/${SAMPLE}_extracted.fq.gz"
# Trim
cutadapt -a AGATCGGAAGAGCACACGTCT --minimum-length 20 \
-o "trimmed/${SAMPLE}_trimmed.fq.gz" "trimmed/${SAMPLE}_extracted.fq.gz"
# Align
STAR --genomeDir "$GENOME_DIR" --readFilesIn "trimmed/${SAMPLE}_trimmed.fq.gz" \
--readFilesCommand zcat --outFilterMismatchNmax 2 --outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate --outFileNamePrefix "aligned/${SAMPLE}_"
# Filter and dedup
samtools view -b -F 4 -q 10 "aligned/${SAMPLE}_Aligned.sortedByCoord.out.bam" | \
samtools sort -o "aligned/${SAMPLE}_filtered.bam"
samtools index "aligned/${SAMPLE}_filtered.bam"
umi_tools dedup -I "aligned/${SAMPLE}_filtered.bam" -S "aligned/${SAMPLE}_dedup.bam"
samtools index "aligned/${SAMPLE}_dedup.bam"
# Peaks
clipper -b "aligned/${SAMPLE}_dedup.bam" -s hg38 -o "peaks/${SAMPLE}_peaks.bed"
# Motifs
bedtools getfasta -fi "$GENOME_FA" -bed "peaks/${SAMPLE}_peaks.bed" -s -fo "peaks/${SAMPLE}.fa"
findMotifs.pl "peaks/${SAMPLE}.fa" fasta "motifs/${SAMPLE}" -rna -len 5,6,7 -p 4
echo "Pipeline complete for $SAMPLE"
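The script above handles one sample per invocation. For several samples, one option is to generate the command lines from a sample sheet and review them before running. The sketch below assumes a tab-separated sheet with sample name and FASTQ path per line; the sheet layout, the script name `clip_pipeline.sh`, and the helper `emit_pipeline_cmds` are all hypothetical.

```shell
# Print one pipeline invocation per sample-sheet row (dry run; nothing is executed).
# stdin: TSV with sample<TAB>fastq; lines starting with '#' are skipped.
# $1 = pipeline script path, $2 = STAR index directory, $3 = genome FASTA.
# emit_pipeline_cmds and the sheet layout are hypothetical; paths with spaces are not handled.
emit_pipeline_cmds() {
    awk -F '\t' -v script="$1" -v idx="$2" -v fa="$3" '
        NF >= 2 && $1 !~ /^#/ {
            printf "bash %s %s %s %s %s\n", script, $1, $2, idx, fa
        }'
}
```

For example, `emit_pipeline_cmds clip_pipeline.sh star_index genome.fa < samples.tsv` lists the commands, and piping that output to `bash` runs them in order.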