| name | bio-workflows-genome-assembly-pipeline |
| description | Orchestrates an end-to-end de novo genome assembly project, routing each step to the right genome-assembly skill rather than restating it. Profiles the genome first (k-mer spectrum -> size, heterozygosity, ploidy), QCs reads, chooses an assembly path by data type (SPAdes for Illumina, Flye for noisy long reads, hifiasm for HiFi, metaFlye for communities), polishes only when needed, decontaminates, scaffolds with Hi-C, and finishes with three-axis QC (contiguity + completeness + correctness). Use when assembling a genome from raw reads and deciding which assembler, whether to polish, and how to prove the result is good. |
| tool_type | cli |
| primary_tool | Flye |
| workflow | true |
| depends_on | ["genome-assembly/genome-profiling","read-qc/fastp-workflow","long-read-sequencing/long-read-qc","genome-assembly/short-read-assembly","genome-assembly/long-read-assembly","genome-assembly/hifi-assembly","genome-assembly/metagenome-assembly","genome-assembly/assembly-polishing","genome-assembly/contamination-detection","genome-assembly/scaffolding","genome-assembly/assembly-qc"] |
| qc_checkpoints | [{"after_profiling":"Genome-size and heterozygosity estimate obtained; sets NG50 denominator, purge level, assembler choice"},{"after_assembly":"Total length within ~10-20% of profiled size; contig count plausible for read type"},{"after_polishing":"Merqury QV improved or plateaued (do not over-polish HiFi); k-mers from accurate reads, not the polishing reads"},{"after_decontamination":"Single-organism: FCS-GX/BlobToolKit clean; MAG: CheckM2 >90% complete, <5% contam, GUNC pass"},{"after_scaffolding":"Contact map shows clean diagonal; off-diagonal blocks inspected/broken before calling chromosome-scale"},{"final_three_axis_qc":"Contiguity (auN/NG50 vs profiled size) + completeness (BUSCO/compleasm) + correctness (Merqury QV) all reported; never N50 alone"}] |
Version Compatibility
Reference examples tested with: GenomeScope2 2.0+, meryl 1.4+, Merqury 1.3+, fastp 0.23+, SPAdes 4.0+, Flye 2.9+, hifiasm 0.25+, metaFlye 2.9+, Racon 1.5+, medaka 2.0+, minimap2 2.26+, FCS-GX 0.5+, CheckM2 1.0+, GUNC 1.0+, YaHS 1.2+, QUAST 5.2+, BUSCO 5.5+, samtools 1.19+. Each owning genome-assembly skill is the source of truth for its tool's pinned version.
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
<tool> --version then <tool> --help to confirm flags
Tool outputs are driven by more than the binary version: medaka consensus quality depends on the basecaller MODEL string (must match the basecaller, e.g. -m r1041_e82_400bps_sup_v5.0.0); BUSCO/compleasm results depend on the lineage dataset and OrthoDB generation (record them); CheckM2/GTDB-Tk results track the reference DATABASE release; hifiasm output filenames and default purge behaviour change across versions (verify against the installed build). If a command errors, introspect the installed tool and adapt rather than retrying.
Genome Assembly Pipeline
"Assemble a genome from my sequencing reads and prove it is good" -> Profile the genome, QC reads, route to the right assembler by data type, polish only if needed, decontaminate, scaffold if Hi-C exists, and finish with three-axis QC. This skill ORCHESTRATES the genome-assembly category; it routes each step to the owning skill and encodes the cross-cutting decisions, not each tool's full option set.
The Single Most Important Modern Insight -- Assembly Is Three Orthogonal Questions, and Each Step Answers One
A genome project fails when one number stands in for the whole. Profiling sets expectations (how big, how heterozygous, how many haplotypes) BEFORE assembling, the assembler answers contiguity, polishing answers per-base accuracy, decontamination answers provenance, scaffolding answers arrangement, and QC must independently address all three of contiguity, completeness, and correctness. The orchestration job is to keep these separate and route each to its skill: a high N50 says nothing about whether the bases are right (Merqury QV) or whether the sequence is the organism's (contamination), and skipping profiling means the assembler guesses the parameters that profiling would have set.
Decision Flow (Step 0 -> 6)
Raw reads (+ optional Hi-C, trio, short reads)
|
v
[0. Profile the genome] --> genome-assembly/genome-profiling
| k-mer spectrum (GenomeScope2) -> genome size, heterozygosity, ploidy.
| Sets NG50 denominator, expected haplotype count, hifiasm purge level,
| and which assembly path is even sensible. Do this BEFORE assembling.
v
[1. QC reads] -----------> short: read-qc/fastp-workflow
| long: long-read-sequencing/long-read-qc
| Garbage-in caps assembly quality; record platform + basecaller era
| (it is an assembly PARAMETER, see step 2), trim internal adapters.
v
[2. Choose path BY DATA TYPE]
| Illumina-only small/isolate -> genome-assembly/short-read-assembly (SPAdes)
| noisy ONT/CLR -> genome-assembly/long-read-assembly (Flye --nano-hq for R10)
| PacBio HiFi -> genome-assembly/hifi-assembly (hifiasm, phased)
| community sample -> genome-assembly/metagenome-assembly (metaFlye/metaSPAdes + binning)
| large/heterozygous euk -> long-read or HiFi, NOT short reads
v
[3. Polish IF needed] ---> genome-assembly/assembly-polishing
| noisy long-read assemblies: Racon -> medaka (model MUST match basecaller).
| Do NOT polish HiFi reflexively (often net-harmful). Measure with Merqury QV,
| not the reads polished with. Skip entirely for SPAdes/HiFi when QV is already high.
v
[4. Decontaminate] ------> genome-assembly/contamination-detection
| single organism: FCS-GX (GenBank-mandatory) + BlobToolKit blob plot.
| MAG: CheckM2 + GUNC (chimerism). Two disjoint problems (see below).
v
[5. Scaffold IF Hi-C] ---> genome-assembly/scaffolding
| automated YaHS produces a DRAFT; manual contact-map curation is the standard.
| Scaffold N50 != contig N50 (gaps are Ns). Skip if no Hi-C.
v
[6. Three-axis QC] ------> genome-assembly/assembly-qc
contiguity (auN/NG50 vs profiled size) + completeness (BUSCO/compleasm)
+ correctness (Merqury QV). Report the triad; NEVER N50 alone.
Routing Table by Scenario
| Scenario | Path | Routes to |
|---|
| Bacterial isolate, ONT R10 only | profile -> QC -> Flye --nano-hq -> medaka -> FCS-GX -> QC | long-read-assembly, assembly-polishing, contamination-detection |
| Bacterial isolate, Illumina only | profile -> fastp -> SPAdes --isolate -> FCS-GX -> QC | short-read-assembly |
| Small genome, ONT, max quality | profile -> QC -> multi-assembler consensus (Trycycler/Autocycler) -> medaka -> QC | long-read-assembly |
| Diploid eukaryote, HiFi (+Hi-C/trio) | profile -> QC -> hifiasm (hap1/hap2) -> purge check -> decontam -> scaffold -> QC | hifi-assembly, scaffolding, contamination-detection |
| Large heterozygous eukaryote, ONT | profile -> QC -> Flye -> purge_dups -> medaka -> decontam -> scaffold -> QC | long-read-assembly, scaffolding |
| Community / microbiome sample | QC -> metaFlye/metaSPAdes -> binning -> CheckM2 + GUNC | metagenome-assembly, contamination-detection |
| Hi-C reads available | after contigs+polish: scaffold, curate contact map | scaffolding |
| Reads not yet QC'd | start at step 1 | read-qc/fastp-workflow, long-read-sequencing/long-read-qc |
Cross-Cutting Gotchas (surface these at every project)
- Basecaller era must match the assembler flag.
--nano-raw on R10/Dorado-SUP reads silently collapses real repeats while RAISING N50; --nano-hq is the R10 default. The platform + basecaller model is an assembly parameter, not metadata.
- A primary assembly is not a haplotype. The hifiasm primary is a maternal/paternal mosaic that exists in no cell; for any allele-aware downstream use hap1/hap2 phased with trio or Hi-C, and treat HiFi-only hap1/hap2 as only partially phased.
- N50 is gamed. It rises when an assembly gets WORSE (misjoins, collapsed repeats, retained haplotigs). Report the triad (auN/NG50 + BUSCO + Merqury QV), never N50 alone.
- A MAG is a population consensus, not a genome. The unit of success is a binned, MIMAG-gated MAG, and "% contamination" conflates foreign-organism mixing, strain mixing, and assembly artifacts.
- "Contamination" is two disjoint problems. Single-organism cross-kingdom foreign sequence (FCS-GX, blob plot) is a different question from intra-domain MAG contamination/chimerism (CheckM2 + GUNC); do not apply one tool's question to the other's input.
- Scaffold N50 >> contig N50 because gaps are Ns. Scaffold contiguity is glue, not sequence; every join is a hypothesis a contact map must confirm. Report contig N50 alongside scaffold N50.
Step 0: Profile the Genome (do this first)
Route the full treatment to genome-assembly/genome-profiling. The minimal orchestration step:
meryl count k=21 output reads.meryl accurate_reads.fq.gz
meryl histogram reads.meryl > reads.hist
genomescope2 -i reads.hist -o gscope_out -k 21
Step 1: QC Reads
Short reads route to read-qc/fastp-workflow; long reads to long-read-sequencing/long-read-qc.
fastp -i R1.fq.gz -I R2.fq.gz -o t_R1.fq.gz -O t_R2.fq.gz \
--detect_adapter_for_pe --qualified_quality_phred 20 --length_required 50 --html qc.html
Step 2: Assemble (route by data type)
Give the assembler the exact preset for the chemistry; the wrong preset is silent. Detailed options live in the owning skills.
spades.py --isolate -1 t_R1.fq.gz -2 t_R2.fq.gz -o spades_out -t 16
flye --nano-hq ont.fq.gz --out-dir flye_out --threads 16
hifiasm -o asm -t 16 hifi.fq.gz
flye --meta ont.fq.gz --out-dir metaflye_out --threads 16
Step 3: Polish IF Needed
Polishing is read-type-matched and conditional. Route to genome-assembly/assembly-polishing.
minimap2 -ax map-ont flye_out/assembly.fasta ont.fq.gz | samtools sort -o aln.bam
medaka_consensus -i ont.fq.gz -d flye_out/assembly.fasta -o medaka_out -t 16 \
-m r1041_e82_400bps_sup_v5.0.0
Do NOT reflexively polish a HiFi assembly (already ~Q30+; over-polishing lowers QV). SPAdes output needs no separate long-read polish. The stop signal is a Merqury QV plateau, not a fixed iteration count, and the QV must be measured against reads independent of those used to polish.
Step 4: Decontaminate (route by sample type)
python3 ./fcs.py screen genome --fasta assembly.fa --out-dir gx_out/ --gx-db "$GXDB/gxdb" --tax-id <taxid>
checkm2 predict --input bins/ --output-directory checkm2_out --threads 16
gunc run --input_dir bins/ --out_dir gunc_out
Step 5: Scaffold IF Hi-C Is Available
YaHS produces a draft; the contact map is the QC, not decoration. Route to genome-assembly/scaffolding.
yahs assembly.fasta hic_to_contigs.bam -o yahs_out
Step 6: Three-Axis QC (contiguity + completeness + correctness)
Route the full treatment to genome-assembly/assembly-qc. Report all three axes; lead with the QV.
quast.py final.fasta -o quast_out -t 16 --est-ref-size <profiled_size>
busco -i final.fasta -l <clade>_odb10 -o busco_out -m genome -c 16
K=$(sh $MERQURY/best_k.sh <genome_size_bp> | tail -n1 | awk '{print int($1+0.5)}')
meryl count k=$K output reads.meryl accurate_reads.fq.gz
merqury.sh reads.meryl final.fasta merqury_out
Troubleshooting
| Issue | Likely cause | Solution |
|---|
| Assembly ~1.5-2x profiled size, high BUSCO-Duplicated | uncollapsed haplotigs (false duplication) | purge_dups; check half-coverage depth peak; do not over-purge real segmental duplications |
| Contiguous but gene models frameshift | noisy long-read assembly not polished | Racon -> medaka (matched model); measure QV |
| QV drops after polishing | over-polishing an already-accurate (HiFi) assembly | stop polishing; HiFi rarely needs short-read polish |
| medaka consensus worse than input | wrong basecaller model string | set -m to the model the reads were basecalled with |
| Fewer contigs than expected but repeats collapsed | --nano-raw used on R10 reads | re-run Flye with --nano-hq |
| CheckM2 says clean but bin looks mixed | chimera with disjoint markers | run GUNC; CheckM2 marker redundancy cannot see chimerism |
| Scaffold N50 huge, contig N50 small | scaffolding glue, not sequence | inspect contact map, break off-diagonal misjoins |
Related Skills
- genome-assembly/genome-profiling - Step 0: k-mer spectrum for size, heterozygosity, ploidy; sets expectations before assembling
- genome-assembly/short-read-assembly - SPAdes path for Illumina-only small/isolate genomes
- genome-assembly/long-read-assembly - Flye/Canu path for noisy ONT/CLR reads
- genome-assembly/hifi-assembly - hifiasm phased path for PacBio HiFi
- genome-assembly/metagenome-assembly - metaFlye/metaSPAdes + binning for community samples
- genome-assembly/assembly-polishing - Racon/medaka/Pilon, applied only when needed
- genome-assembly/contamination-detection - FCS-GX/BlobToolKit (single organism) vs CheckM2/GUNC (MAG)
- genome-assembly/scaffolding - YaHS Hi-C scaffolding and contact-map curation
- genome-assembly/assembly-qc - Three-axis QC: auN/NG50 + BUSCO + Merqury QV
- read-qc/fastp-workflow - Short-read QC before assembly
- long-read-sequencing/long-read-qc - Long-read length/quality QC and basecaller-era awareness