bio-workflows-genome-assembly-pipeline

Updated May 29, 2026 at 23:41

Orchestrates an end-to-end de novo genome assembly project, routing each step to the right genome-assembly skill rather than restating it. Profiles the genome first (k-mer spectrum -> size, heterozygosity, ploidy), QCs reads, chooses an assembly path by data type (SPAdes for Illumina, Flye for noisy long reads, hifiasm for HiFi, metaFlye for communities), polishes only when needed, decontaminates, scaffolds with Hi-C, and finishes with three-axis QC (contiguity + completeness + correctness). Use when assembling a genome from raw reads and deciding which assembler, whether to polish, and how to prove the result is good.

Source

GPTomics

GPTomics/bioSkills

name	bio-workflows-genome-assembly-pipeline
description	Orchestrates an end-to-end de novo genome assembly project, routing each step to the right genome-assembly skill rather than restating it. Profiles the genome first (k-mer spectrum -> size, heterozygosity, ploidy), QCs reads, chooses an assembly path by data type (SPAdes for Illumina, Flye for noisy long reads, hifiasm for HiFi, metaFlye for communities), polishes only when needed, decontaminates, scaffolds with Hi-C, and finishes with three-axis QC (contiguity + completeness + correctness). Use when assembling a genome from raw reads and deciding which assembler, whether to polish, and how to prove the result is good.
tool_type	cli
primary_tool	Flye
workflow	true
depends_on	["genome-assembly/genome-profiling","read-qc/fastp-workflow","long-read-sequencing/long-read-qc","genome-assembly/short-read-assembly","genome-assembly/long-read-assembly","genome-assembly/hifi-assembly","genome-assembly/metagenome-assembly","genome-assembly/assembly-polishing","genome-assembly/contamination-detection","genome-assembly/scaffolding","genome-assembly/assembly-qc"]
qc_checkpoints	[{"after_profiling":"Genome-size and heterozygosity estimate obtained; sets NG50 denominator, purge level, assembler choice"},{"after_assembly":"Total length within ~10-20% of profiled size; contig count plausible for read type"},{"after_polishing":"Merqury QV improved or plateaued (do not over-polish HiFi); k-mers from accurate reads, not the polishing reads"},{"after_decontamination":"Single-organism: FCS-GX/BlobToolKit clean; MAG: CheckM2 >90% complete, <5% contam, GUNC pass"},{"after_scaffolding":"Contact map shows clean diagonal; off-diagonal blocks inspected/broken before calling chromosome-scale"},{"final_three_axis_qc":"Contiguity (auN/NG50 vs profiled size) + completeness (BUSCO/compleasm) + correctness (Merqury QV) all reported; never N50 alone"}]

Version Compatibility

Reference examples tested with: GenomeScope2 2.0+, meryl 1.4+, Merqury 1.3+, fastp 0.23+, SPAdes 4.0+, Flye 2.9+, hifiasm 0.25+, metaFlye 2.9+, Racon 1.5+, medaka 2.0+, minimap2 2.26+, FCS-GX 0.5+, CheckM2 1.0+, GUNC 1.0+, YaHS 1.2+, QUAST 5.2+, BUSCO 5.5+, samtools 1.19+. Each owning genome-assembly skill is the source of truth for its tool's pinned version.

Before using code patterns, verify installed versions match. If versions differ:

CLI: <tool> --version then <tool> --help to confirm flags

Tool outputs are driven by more than the binary version: medaka consensus quality depends on the basecaller MODEL string (must match the basecaller, e.g. -m r1041_e82_400bps_sup_v5.0.0); BUSCO/compleasm results depend on the lineage dataset and OrthoDB generation (record them); CheckM2/GTDB-Tk results track the reference DATABASE release; hifiasm output filenames and default purge behaviour change across versions (verify against the installed build). If a command errors, introspect the installed tool and adapt rather than retrying.
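
The version check described above can be scripted. A minimal sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available; `version_ge` is a hypothetical helper name and the version strings are illustrative, normally taken from `<tool> --version`:

```shell
# version_ge INSTALLED MINIMUM -> exit 0 if INSTALLED >= MINIMUM (hypothetical helper).
# The smaller of the two versions sorts first; if that is MINIMUM, INSTALLED is new enough.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative values: installed Flye 2.9.3 against the tested floor 2.9.
if version_ge "2.9.3" "2.9"; then
    echo "Flye is at or above the tested version"
else
    echo "Flye is older than the tested version; confirm flags with --help"
fi
```

A passing version check does not cover model strings, lineage datasets, or database releases; those still need to be recorded separately.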

Genome Assembly Pipeline

"Assemble a genome from my sequencing reads and prove it is good" -> Profile the genome, QC reads, route to the right assembler by data type, polish only if needed, decontaminate, scaffold if Hi-C exists, and finish with three-axis QC. This skill ORCHESTRATES the genome-assembly category; it routes each step to the owning skill and encodes the cross-cutting decisions, not each tool's full option set.

The Single Most Important Modern Insight -- Assembly Is Three Orthogonal Questions, and Each Step Answers One

A genome project fails when one number stands in for the whole. Profiling sets expectations (how big, how heterozygous, how many haplotypes) BEFORE assembling; the assembler then answers contiguity, polishing answers per-base accuracy, decontamination answers provenance, and scaffolding answers arrangement. Final QC must independently address all three axes: contiguity, completeness, and correctness. The orchestration job is to keep these questions separate and route each to its skill: a high N50 says nothing about whether the bases are right (Merqury QV) or whether the sequence belongs to the organism (contamination), and skipping profiling means the assembler guesses the parameters that profiling would have set.

Decision Flow (Step 0 -> 6)

Raw reads (+ optional Hi-C, trio, short reads)
    |
    v
[0. Profile the genome] --> genome-assembly/genome-profiling
    |   k-mer spectrum (GenomeScope2) -> genome size, heterozygosity, ploidy.
    |   Sets NG50 denominator, expected haplotype count, hifiasm purge level,
    |   and which assembly path is even sensible. Do this BEFORE assembling.
    v
[1. QC reads] -----------> short: read-qc/fastp-workflow
    |                       long:  long-read-sequencing/long-read-qc
    |   Garbage-in caps assembly quality; record platform + basecaller era
    |   (it is an assembly PARAMETER, see step 2), trim internal adapters.
    v
[2. Choose path BY DATA TYPE]
    |  Illumina-only small/isolate -> genome-assembly/short-read-assembly (SPAdes)
    |  noisy ONT/CLR              -> genome-assembly/long-read-assembly (Flye --nano-hq for R10)
    |  PacBio HiFi                -> genome-assembly/hifi-assembly (hifiasm, phased)
    |  community sample           -> genome-assembly/metagenome-assembly (metaFlye/metaSPAdes + binning)
    |  large/heterozygous euk     -> long-read or HiFi, NOT short reads
    v
[3. Polish IF needed] ---> genome-assembly/assembly-polishing
    |   noisy long-read assemblies: Racon -> medaka (model MUST match basecaller).
    |   Do NOT polish HiFi reflexively (often net-harmful). Measure with Merqury QV
    |   from independent accurate reads, not the reads used for polishing.
    |   Skip entirely for SPAdes/HiFi when QV is already high.
    v
[4. Decontaminate] ------> genome-assembly/contamination-detection
    |   single organism: FCS-GX (GenBank-mandatory) + BlobToolKit blob plot.
    |   MAG:              CheckM2 + GUNC (chimerism). Two disjoint problems (see below).
    v
[5. Scaffold IF Hi-C] ---> genome-assembly/scaffolding
    |   automated YaHS produces a DRAFT; manual contact-map curation is the standard.
    |   Scaffold N50 != contig N50 (gaps are Ns). Skip if no Hi-C.
    v
[6. Three-axis QC] ------> genome-assembly/assembly-qc
        contiguity (auN/NG50 vs profiled size) + completeness (BUSCO/compleasm)
        + correctness (Merqury QV). Report the triad; NEVER N50 alone.
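
Step 2 of the flow above is a lookup on data type, which can be sketched as a `case` statement. The function name `route_assembly` and the data-type labels (`illumina`, `ont-r10`, `hifi`, `community`) are hypothetical names chosen for this sketch; the skill paths and presets are the ones listed in the flow:

```shell
# route_assembly DATA_TYPE -> prints the owning skill and its assembler preset.
# Unknown data types are an error rather than a silent default.
route_assembly() {
    case "$1" in
        illumina)  echo "genome-assembly/short-read-assembly: spades.py --isolate" ;;
        ont-r10)   echo "genome-assembly/long-read-assembly: flye --nano-hq" ;;
        hifi)      echo "genome-assembly/hifi-assembly: hifiasm" ;;
        community) echo "genome-assembly/metagenome-assembly: metaFlye/metaSPAdes + binning" ;;
        *)         echo "unknown data type: $1" >&2; return 1 ;;
    esac
}

route_assembly hifi   # prints: genome-assembly/hifi-assembly: hifiasm
```

Failing on an unrecognised label mirrors the point that the platform and basecaller era are assembly parameters: a guess here would be a silent wrong preset.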

Routing Table by Scenario

| Scenario | Path | Routes to |
| --- | --- | --- |
| Bacterial isolate, ONT R10 only | profile -> QC -> Flye `--nano-hq` -> medaka -> FCS-GX -> QC | long-read-assembly, assembly-polishing, contamination-detection |
| Bacterial isolate, Illumina only | profile -> fastp -> SPAdes `--isolate` -> FCS-GX -> QC | short-read-assembly |
| Small genome, ONT, max quality | profile -> QC -> multi-assembler consensus (Trycycler/Autocycler) -> medaka -> QC | long-read-assembly |
| Diploid eukaryote, HiFi (+Hi-C/trio) | profile -> QC -> hifiasm (hap1/hap2) -> purge check -> decontam -> scaffold -> QC | hifi-assembly, scaffolding, contamination-detection |
| Large heterozygous eukaryote, ONT | profile -> QC -> Flye -> purge_dups -> medaka -> decontam -> scaffold -> QC | long-read-assembly, scaffolding |
| Community / microbiome sample | QC -> metaFlye/metaSPAdes -> binning -> CheckM2 + GUNC | metagenome-assembly, contamination-detection |
| Hi-C reads available | after contigs+polish: scaffold, curate contact map | scaffolding |
| Reads not yet QC'd | start at step 1 | read-qc/fastp-workflow, long-read-sequencing/long-read-qc |

Cross-Cutting Gotchas (surface these in every project)

Basecaller era must match the assembler flag. --nano-raw on R10/Dorado-SUP reads silently collapses real repeats while RAISING N50; --nano-hq is the R10 default. The platform + basecaller model is an assembly parameter, not metadata.
A primary assembly is not a haplotype. The hifiasm primary is a maternal/paternal mosaic that exists in no cell; for any allele-aware downstream analysis, use hap1/hap2 phased with trio or Hi-C data, and treat HiFi-only hap1/hap2 as only partially phased.
N50 is gamed. It rises when an assembly gets WORSE (misjoins, collapsed repeats, retained haplotigs). Report the triad (auN/NG50 + BUSCO + Merqury QV), never N50 alone.
A MAG is a population consensus, not a genome. The unit of success is a binned, MIMAG-gated MAG, and "% contamination" conflates foreign-organism mixing, strain mixing, and assembly artifacts.
"Contamination" is two disjoint problems. Single-organism cross-kingdom foreign sequence (FCS-GX, blob plot) is a different question from intra-domain MAG contamination/chimerism (CheckM2 + GUNC); do not apply one tool's question to the other's input.
Scaffold N50 >> contig N50 because gaps are Ns. Scaffold contiguity is glue, not sequence; every join is a hypothesis a contact map must confirm. Report contig N50 alongside scaffold N50.
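
The N50 gotchas above can be made concrete with a toy example. A minimal sketch in awk, assuming one contig length per line; `nx50` is a hypothetical helper, and the four lengths are made-up numbers, not real data. Without a genome size the denominator is the assembly total (N50); with one, it is the profiled size (NG50):

```shell
# nx50 LENGTHS_FILE [GENOME_SIZE] -> N50, or NG50 when GENOME_SIZE is given.
nx50() {
    sort -rn "$1" | awk -v g="${2:-0}" '
        { len[NR] = $1; total += $1 }
        END {
            denom = (g > 0) ? g : total
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run >= denom / 2) { print len[i]; exit }
            }
            print 0   # assembly covers less than half of the denominator
        }'
}

full=$(mktemp); collapsed=$(mktemp)
printf '%s\n' 40 30 20 10 > "$full"        # toy assembly, total 100
printf '%s\n' 40 30 > "$collapsed"         # same assembly after losing 30 units of sequence
echo "N50  full=$(nx50 "$full") collapsed=$(nx50 "$collapsed")"          # N50  full=30 collapsed=40
echo "NG50 full=$(nx50 "$full" 100) collapsed=$(nx50 "$collapsed" 100)"  # NG50 full=30 collapsed=30
rm -f "$full" "$collapsed"
```

Losing sequence shrinks the denominator, so N50 rises from 30 to 40 while NG50 against the fixed profiled size stays at 30. That is why the contiguity axis is reported against the profiled genome size.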

Step 0: Profile the Genome (do this first)

Route the full treatment to genome-assembly/genome-profiling. The minimal orchestration step:

# k-mer count from ACCURATE reads (Illumina/HiFi, NEVER noisy ONT), then GenomeScope2 for size / heterozygosity / ploidy
meryl count k=21 output reads.meryl accurate_reads.fq.gz
meryl histogram reads.meryl > reads.hist
genomescope2 -i reads.hist -o gscope_out -k 21
# read off: estimated haploid genome size, heterozygosity %, and (with -p) ploidy.
# These set the NG50 denominator, the expected number of haplotypes, and the purge decision.
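
As a crude cross-check on the GenomeScope2 estimate, genome size can be approximated directly from the histogram as total k-mers divided by the depth of the main peak. This is a sketch under stated assumptions, not a replacement for the model fit: `naive_genome_size` is a hypothetical helper, the histogram is assumed to have two columns (depth, count), the error cutoff of 5 is an arbitrary default, and in a heterozygous diploid the tallest peak may be the heterozygous one, which skews the result by roughly 2x.

```shell
# naive_genome_size HIST [MIN_DEPTH] -> sum(depth * count) / peak depth,
# ignoring k-mers below MIN_DEPTH (treated as sequencing errors; default 5 is an assumption).
naive_genome_size() {
    awk -v min="${2:-5}" '
        $1 >= min {
            total += $1 * $2
            if ($2 > best) { best = $2; peak = $1 }
        }
        END {
            if (peak > 0) { printf "%.0f\n", total / peak } else { exit 1 }
        }' "$1"
}

hist=$(mktemp)
printf '1\t1000\n10\t50\n20\t200\n30\t50\n' > "$hist"   # toy histogram, not real data
naive_genome_size "$hist"                                # prints 300 (6000 k-mers / peak depth 20)
rm -f "$hist"
```

A large disagreement between this number and the GenomeScope2 output is a prompt to inspect the spectrum plot before choosing an assembly path.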

Step 1: QC Reads

Short reads route to read-qc/fastp-workflow; long reads to long-read-sequencing/long-read-qc.

fastp -i R1.fq.gz -I R2.fq.gz -o t_R1.fq.gz -O t_R2.fq.gz \
    --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --html qc.html

Step 2: Assemble (route by data type)

Give the assembler the exact preset for the chemistry; the wrong preset is silent. Detailed options live in the owning skills.

# Illumina-only small/isolate genome -> short-read-assembly
spades.py --isolate -1 t_R1.fq.gz -2 t_R2.fq.gz -o spades_out -t 16
# NOTE: --careful is small-genome-only and is not combined with --isolate; do NOT use it on large eukaryote genomes.

# Noisy ONT (R10/Dorado-SUP) -> long-read-assembly. --nano-hq is the modern default.
flye --nano-hq ont.fq.gz --out-dir flye_out --threads 16     # --genome-size optional in recent Flye

# PacBio HiFi -> hifi-assembly (phased by default; verify output filenames per version)
hifiasm -o asm -t 16 hifi.fq.gz                              # add --h1/--h2 (Hi-C) or -1/-2 (trio) to phase

# Community sample -> metagenome-assembly
flye --meta --nano-hq ont.fq.gz --out-dir metaflye_out --threads 16    # --meta modifies a read-type preset, it does not replace it; then bin + CheckM2/GUNC
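
The after-assembly checkpoint (total length within ~10-20% of the profiled size) can be checked with a few lines of awk. A minimal sketch: `assembly_length` and `within_tolerance` are hypothetical helpers, the 20% default follows the checkpoint wording, and the toy FASTA and profiled size of 11 are made-up values.

```shell
# assembly_length FASTA -> total bases (header lines skipped).
assembly_length() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# within_tolerance OBSERVED EXPECTED [FRACTION] -> exit 0 if |obs - exp| <= FRACTION * exp.
within_tolerance() {
    awk -v o="$1" -v e="$2" -v f="${3:-0.20}" 'BEGIN {
        d = o - e; if (d < 0) d = -d
        exit !(d <= f * e)
    }'
}

fa=$(mktemp)
printf '>c1\nACGTAC\n>c2\nGTACGT\n' > "$fa"   # toy assembly of 12 bases
len=$(assembly_length "$fa")
if within_tolerance "$len" 11; then
    echo "length $len is within 20% of the profiled size"
else
    echo "length $len deviates from the profiled size: investigate before polishing"
fi
rm -f "$fa"
```

An assembly well above the profiled size points towards retained haplotigs or contamination; one well below it points towards collapsed repeats.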

Step 3: Polish IF Needed

Polishing is read-type-matched and conditional. Route to genome-assembly/assembly-polishing.

# Noisy long-read assembly: Racon (overlaps from minimap2) then medaka with the MATCHING model.
minimap2 -x map-ont -t 16 flye_out/assembly.fasta ont.fq.gz > ovl.paf    # PAF overlaps for Racon
racon -t 16 ont.fq.gz ovl.paf flye_out/assembly.fasta > racon.fasta     # reads, overlaps, target
medaka_consensus -i ont.fq.gz -d racon.fasta -o medaka_out -t 16 \
    -m r1041_e82_400bps_sup_v5.0.0   # MUST match the basecaller model used to call the reads

Do NOT reflexively polish a HiFi assembly (HiFi reads are already ~Q30+, so the consensus starts out highly accurate; over-polishing lowers QV). SPAdes output needs no separate long-read polish. The stop signal is a Merqury QV plateau, not a fixed iteration count, and the QV must be measured with k-mers from accurate reads that are independent of those used to polish.
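
The plateau rule can be written as a small gate between polishing rounds. A minimal sketch: `keep_polishing` is a hypothetical helper, the minimum gain of 0.1 QV is an assumed threshold rather than a value from the owning skill, and the QV numbers are illustrative.

```shell
# keep_polishing QV_BEFORE QV_AFTER [MIN_GAIN] -> exit 0 only if the last round
# improved Merqury QV by more than MIN_GAIN (default 0.1, an assumption).
keep_polishing() {
    awk -v b="$1" -v a="$2" -v g="${3:-0.1}" 'BEGIN { exit !((a - b) > g) }'
}

if keep_polishing 38.2 41.5; then
    echo "QV still rising: another round may be justified"
fi
if ! keep_polishing 41.5 41.5; then
    echo "QV plateaued or dropped: stop polishing"
fi
```

When the QV drops rather than plateaus, the pre-polish assembly is the one to keep.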

Step 4: Decontaminate (route by sample type)

# Single-organism assembly (GenBank-mandatory foreign screen + blob plot)
python3 ./fcs.py screen genome --fasta assembly.fa --out-dir gx_out/ --gx-db "$GXDB/gxdb" --tax-id <taxid>
# act on the EXCLUDE/TRIM/FIX calls for cross-kingdom contigs; keep host-integrated foreign sequence (see contamination-detection)

# MAG (intra-domain contamination + chimerism)
checkm2 predict --input bins/ --output-directory checkm2_out --threads 16   # add -x <ext> if bin files do not end in .fna
gunc run --input_dir bins/ --out_dir gunc_out                            # chimerism, orthogonal to CheckM2
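
The MAG checkpoint (>90% complete, <5% contamination) can be applied to the CheckM2 report with awk. A minimal sketch: `mag_gate` is a hypothetical helper, and the column names `Name`, `Completeness` and `Contamination` are assumed to match the quality report written by the installed CheckM2 version, so they are looked up from the header rather than by position. The GUNC pass is a separate check and is not covered here.

```shell
# mag_gate REPORT_TSV -> names of bins with Completeness > 90 and Contamination < 5.
mag_gate() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Name" in col) || !("Completeness" in col) || !("Contamination" in col)) {
                print "expected header columns not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(col["Completeness"]) > 90 && $(col["Contamination"]) < 5 { print $(col["Name"]) }
    ' "$1"
}

report=$(mktemp)
printf 'Name\tCompleteness\tContamination\nbin1\t95.2\t1.1\nbin2\t88.0\t0.5\nbin3\t97.0\t7.3\n' > "$report"   # toy report
mag_gate "$report"    # prints bin1
rm -f "$report"
```

A bin that passes this gate can still be chimeric, which is why GUNC runs alongside CheckM2 rather than after it.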

Step 5: Scaffold IF Hi-C Is Available

YaHS produces a draft; the contact map is the QC, not decoration. Route to genome-assembly/scaffolding.

# Map Hi-C to contigs, then YaHS; inspect the contact map (PretextMap/Juicer) and break misjoins.
samtools faidx assembly.fasta                               # YaHS reads the .fai index of the contigs
yahs assembly.fasta hic_to_contigs.bam -o yahs_out          # output scaffolds + AGP; curate before publishing
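
To report contig contiguity alongside scaffold contiguity, the scaffolds can be split back into contigs at their gaps. A minimal sketch, assuming gaps are written as runs of `N`; `contig_lengths_from_scaffolds` is a hypothetical helper and the toy FASTA is made up. It holds one scaffold in memory at a time, which may be slow for very large chromosomes.

```shell
# contig_lengths_from_scaffolds FASTA -> one contig length per line,
# obtained by splitting every scaffold at runs of N.
contig_lengths_from_scaffolds() {
    awk '
        function emit(   n, i, parts) {
            if (seq == "") return
            n = split(toupper(seq), parts, /N+/)
            for (i = 1; i <= n; i++) if (length(parts[i]) > 0) print length(parts[i])
            seq = ""
        }
        /^>/ { emit(); next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}

scaf=$(mktemp)
printf '>s1\nACGTNNNNAC\nGT\n>s2\nAAAA\n' > "$scaf"     # one gapped scaffold, one ungapped
contig_lengths_from_scaffolds "$scaf" | paste -sd, -    # prints 4,4,4
rm -f "$scaf"
```

The resulting lengths can be fed to whatever N50/NG50 calculation is used for the scaffolds, so both figures are reported side by side.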

Step 6: Three-Axis QC (contiguity + completeness + correctness)

Route the full treatment to genome-assembly/assembly-qc. Report all three axes; lead with the QV.

# Contiguity vs the PROFILED genome size (NG50/auN, not bare N50)
quast.py final.fasta -o quast_out -t 16 --est-ref-size <profiled_size>

# Completeness on the DEEPEST applicable clade (compleasm on good genomes; BUSCO otherwise)
busco -i final.fasta -l <clade>_odb10 -o busco_out -m genome -c 16

# Correctness: Merqury QV from ACCURATE reads (k from best_k.sh, not hardcoded)
K=$(sh $MERQURY/best_k.sh <genome_size_bp> | tail -n1 | awk '{print int($1+0.5)}')   # round float->int
meryl count k=$K output reads.meryl accurate_reads.fq.gz
merqury.sh reads.meryl final.fasta merqury_out          # QV + k-mer completeness + spectra-cn
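
For reading the Merqury output, it helps to see how a k-mer count becomes a QV. The sketch below follows the commonly cited Merqury definition: the per-base error rate is E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 * log10(E), where K_asm_only is the number of assembly k-mers absent from the read set. `merqury_qv` is a hypothetical helper and the counts are illustrative, not real data; the QV file from `merqury.sh` remains the value to report.

```shell
# merqury_qv ASM_ONLY_KMERS TOTAL_ASM_KMERS K -> consensus QV (Phred scale).
merqury_qv() {
    awk -v a="$1" -v t="$2" -v k="$3" 'BEGIN {
        if (a == 0) { print "inf"; exit }      # no erroneous k-mers observed
        e = 1 - (1 - a / t) ^ (1 / k)          # per-base error rate
        printf "%.2f\n", -10 * log(e) / log(10)
    }'
}

merqury_qv 1 1000 1               # sanity check: error rate 0.001 -> prints 30.00
merqury_qv 12000 1000000000 21    # illustrative counts for a k=21 database
```

Because the read k-mers define what counts as an error, a QV computed from the same reads used for polishing is biased upward.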

Troubleshooting

| Issue | Likely cause | Solution |
| --- | --- | --- |
| Assembly ~1.5-2x profiled size, high BUSCO-Duplicated | uncollapsed haplotigs (false duplication) | purge_dups; check half-coverage depth peak; do not over-purge real segmental duplications |
| Contiguous but gene models frameshift | noisy long-read assembly not polished | Racon -> medaka (matched model); measure QV |
| QV drops after polishing | over-polishing an already-accurate (HiFi) assembly | stop polishing; HiFi rarely needs short-read polish |
| medaka consensus worse than input | wrong basecaller model string | set `-m` to the model the reads were basecalled with |
| Fewer contigs than expected but repeats collapsed | `--nano-raw` used on R10 reads | re-run Flye with `--nano-hq` |
| CheckM2 says clean but bin looks mixed | chimera with disjoint markers | run GUNC; CheckM2 marker redundancy cannot see chimerism |
| Scaffold N50 huge, contig N50 small | scaffolding glue, not sequence | inspect contact map, break off-diagonal misjoins |

Related Skills

genome-assembly/genome-profiling - Step 0: k-mer spectrum for size, heterozygosity, ploidy; sets expectations before assembling
genome-assembly/short-read-assembly - SPAdes path for Illumina-only small/isolate genomes
genome-assembly/long-read-assembly - Flye/Canu path for noisy ONT/CLR reads
genome-assembly/hifi-assembly - hifiasm phased path for PacBio HiFi
genome-assembly/metagenome-assembly - metaFlye/metaSPAdes + binning for community samples
genome-assembly/assembly-polishing - Racon/medaka/Pilon, applied only when needed
genome-assembly/contamination-detection - FCS-GX/BlobToolKit (single organism) vs CheckM2/GUNC (MAG)
genome-assembly/scaffolding - YaHS Hi-C scaffolding and contact-map curation
genome-assembly/assembly-qc - Three-axis QC: auN/NG50 + BUSCO + Merqury QV
read-qc/fastp-workflow - Short-read QC before assembly
long-read-sequencing/long-read-qc - Long-read length/quality QC and basecaller-era awareness