Run any Skill in Manus with one click

bio-flow-cytometry-clustering-phenotyping

Stars943

Forks165

UpdatedMay 30, 2026 at 12:23

Unsupervised clustering and cell-type identification for high-dimensional flow, spectral, and mass cytometry - FlowSOM, PhenoGraph, FlowSOM-via-CATALYST, with UMAP/tSNE for visualization. Covers the type-vs-state marker distinction (cluster on lineage, test state within clusters), over-provision-then-metacluster, the Weber-Robinson benchmark, seed dependence and metacluster stability, why embeddings are for looking not measuring, and median-heatmap annotation/merging. Use when discovering populations without predefined gates, choosing a clustering algorithm, selecting the number of metaclusters, or annotating clusters into cell types.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

GPTomics

GPTomics/bioSkills

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Biological Scientists, All OtherLife, Physical, and Social Science Occupations·SOC 19-1029

File Explorer

3 files

SKILL.md

readonly

More from this repository

same repository

bio-ribo-seq-initiation-site-mapping

GPTomics/bioSkills

Map translation initiation sites, including non-AUG and alternative starts, from initiation-drug ribosome profiling (TI-seq). Use when locating start codons, detecting near-cognate or upstream initiation, or analyzing harringtonine, lactimidomycin (GTI-seq/QTI-seq), or retapamulin (Ribo-RET) data.

2026-06-20943

bio-ribo-seq-orf-detection

GPTomics/bioSkills

Detect and quantify translated ORFs from Ribo-seq using 3-nucleotide periodicity, including uORFs, internal ORFs, dORFs, and novel ORFs. Use when finding actively translated regions beyond annotated CDS, classifying ORFs by the 2022 community standard, quantifying ORF-level translation, or choosing between periodicity-based callers.

2026-06-20943

bio-ribo-seq-riboseq-preprocessing

GPTomics/bioSkills

Preprocess ribosome profiling reads with UMI handling, adapter trimming, contaminant/rRNA depletion, and footprint-aware alignment. Use when preparing Ribo-seq FASTQ for periodicity QC, ORF detection, translation efficiency, or stalling analysis, or when deciding how to deduplicate, which aligner to use, or how to size-select ribosome-protected fragments.

2026-06-20943

bio-ribo-seq-ribosome-periodicity

GPTomics/bioSkills

Validate Ribo-seq library quality by measuring 3-nucleotide periodicity and calibrating read-length-specific P-site offsets. Use when checking whether footprints capture genuine translation, determining P-site offsets for downstream ORF/TE/stalling analysis, or deciding which read lengths to keep.

2026-06-20943

bio-ribo-seq-ribosome-stalling

GPTomics/bioSkills

Detect ribosome pausing and stalling at codon resolution from Ribo-seq, using local-relative occupancy metrics and A-site assignment. Use when studying elongation dynamics, codon dwell times, pause motifs, or ribosome collisions, and when judging whether a pause is real biology or a cycloheximide artifact.

2026-06-20943

bio-ribo-seq-translation-efficiency

GPTomics/bioSkills

Quantify translation efficiency (TE) as ribosome occupancy relative to mRNA abundance and test for differential TE between conditions. Use when separating translational from transcriptional regulation, distinguishing genuine translational control from buffering, or choosing between riborex, Xtail, anota2seq, and DESeq2 interaction models.

2026-06-20943

name	bio-flow-cytometry-clustering-phenotyping
description	Unsupervised clustering and cell-type identification for high-dimensional flow, spectral, and mass cytometry - FlowSOM, PhenoGraph, FlowSOM-via-CATALYST, with UMAP/tSNE for visualization. Covers the type-vs-state marker distinction (cluster on lineage, test state within clusters), over-provision-then-metacluster, the Weber-Robinson benchmark, seed dependence and metacluster stability, why embeddings are for looking not measuring, and median-heatmap annotation/merging. Use when discovering populations without predefined gates, choosing a clustering algorithm, selecting the number of metaclusters, or annotating clusters into cell types.
tool_type	r
primary_tool	CATALYST

Version Compatibility

Reference examples tested with: CATALYST 1.26+, FlowSOM 2.10+, flowCore 2.14+; Rphenograph (GitHub: JinmiaoChenLab/Rphenograph).

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters

Rphenograph is GitHub-only (remotes::install_github('JinmiaoChenLab/Rphenograph')) and returns a list - membership is igraph::membership(out[[2]]), not a vector. Adapt rather than retrying.

Clustering and Phenotyping

"Cluster my cytometry data to find cell types" -> Discover populations in high-dimensional data without gates, then annotate them by marker expression.

R: CATALYST::cluster() (wraps FlowSOM + ConsensusClusterPlus) - the field default
R: FlowSOM::FlowSOM() directly, or Rphenograph() for graph-based clustering

The Single Most Important Modern Insight -- Cluster on Type Markers; the Embedding Is for Looking, Not Measuring

Two rules carry most of the correctness here. First, the type-vs-state distinction: LINEAGE/type markers (CD3, CD4, CD8, CD19) DEFINE clusters; functional/STATE markers (phospho-epitopes, cytokines, Ki-67, activation markers) must be WITHHELD from clustering and tested within clusters instead (the DA/DS framework, Nowicka 2017 F1000Res 6:748). Clustering on state markers splits "activated CD4" from "resting CD4" and confounds abundance with activation - a classic, silent design error. Second, t-SNE/UMAP embeddings do NOT preserve inter-cluster distances, cluster sizes, or densities (the apparent "UMAP preserves global structure" edge over tSNE is largely an initialization artifact - Kobak & Linderman 2021 Nat Biotechnol 39:156). Define populations by clustering in the HIGH-DIMENSIONAL space and COLOR the embedding by cluster; never gate on the embedding or read biology off blob distances.

Algorithm Taxonomy

Algorithm	Citation	Mechanism	Speed	Rare-pop	Determinism
FlowSOM	Van Gassen 2015 Cytometry A 87:636	SOM grid -> MST (viz) -> consensus metaclustering	fastest	good if grid over-provisioned	stochastic; seed-controllable
PhenoGraph	Levine 2015 Cell 162:184	kNN graph (Jaccard) + Louvain	moderate	strong (no preset k)	seed-fragile (>40% reassignment reported)
X-shift	Samusik 2016 Nat Methods 13:493	weighted kNN density + auto cluster #	slow	excellent	more deterministic
flowMeans	Aghaeepour 2011 Cytometry A 79:6	k-means multi-cluster + change-point k	fast	moderate	stochastic

Benchmark: Weber & Robinson 2016 Cytometry A 89:1084 tested 18 methods - FlowSOM (with metaclustering) was a top performer AND by far fastest, hence the field default; but its accuracy depends on supplying the right number of metaclusters.

Why Over-Provision the Grid, Then Metacluster

Set the SOM grid (e.g. 10x10 = 100 nodes) MUCH larger than the number of populations expected, then metacluster down. The asymmetry: metaclustering can MERGE over-fine nodes into a real population, but can NEVER SPLIT a node that erroneously fused two cell types. Too coarse commits the unrecoverable error; too fine commits only the recoverable one. So over-cluster, then merge by hand off the median heatmap.

CATALYST Clustering Pipeline

Goal: Cluster on type markers and prepare for annotation.

Approach: prepData builds the SCE (panel marker_class flags type vs state); cluster() wraps FlowSOM+ConsensusClusterPlus. Defaults xdim=ydim=10, maxK=20 (the metacluster cap people forget); set seed on the function.

library(CATALYST)

sce <- prepData(fs, panel, md, transform = TRUE, cofactor = 5)   # cofactor 5 = CyTOF; ~150 for fluorescence
sce <- cluster(sce, features = 'type',                            # type markers only
               xdim = 10, ydim = 10, maxK = 20, seed = 42)        # maxK caps metaclusters at 20 by default
plotExprHeatmap(sce, features = 'type', by = 'cluster_id', k = 'meta20', scale = 'last')

PhenoGraph (graph-based alternative)

Goal: Cluster with a kNN graph when a data-driven cluster count is wanted.

Approach: Rphenograph on the type-marker matrix (cells x markers); extract membership from the list.

library(Rphenograph)
type_expr <- t(assay(sce, 'exprs')[rowData(sce)$marker_class == 'type', ])
out <- Rphenograph(type_expr, k = 30)                  # only knob: k (neighbors)
sce$phenograph <- factor(igraph::membership(out[[2]]))  # list -> membership, not a vector

Dimensionality Reduction (visualization only) and Annotation

Goal: Visualize structure and assign cell-type labels.

Approach: runDR subsamples per sample (cells=); color by cluster, never gate on it. Annotate from the median heatmap, then mergeClusters with a curated table.

sce <- runDR(sce, dr = 'UMAP', features = 'type', cells = 2000)   # subsampled embedding
plotDR(sce, 'UMAP', color_by = 'meta20')

merging <- data.frame(old_cluster = 1:20,
                      new_cluster = c('CD4 T','CD4 T','CD8 T', '...'))   # curated from the heatmap
sce <- mergeClusters(sce, k = 'meta20', table = merging, id = 'annotated')

Per-Method Failure Modes

Clustering on state markers

Trigger: activation/phospho markers in the clustering feature set. Mechanism: state contaminates lineage identity. Symptom: "activated" and "resting" versions of a type split as separate clusters. Fix: cluster on type only; test state markers within clusters (differential-analysis).

Seed-dependent "novel populations"

Trigger: a population that appears at one seed and vanishes at another. Mechanism: FlowSOM init / Louvain are stochastic. Symptom: non-reproducible clusters. Fix: set + report the seed; check multi-seed stability; treat unstable clusters as hypotheses.

Reading biology off the embedding

Trigger: "cluster A is closer to B than C." Mechanism: UMAP/tSNE distances are non-metric. Symptom: false developmental/relatedness claims. Fix: quantify in marker space; embedding for display only.

Clustering uncompensated/untransformed data

Trigger: raw linear input to FlowSOM. Mechanism: spillover + scale dominate Euclidean distance. Symptom: clusters track intensity, not biology. Fix: compensate + transform first.

Quantitative Thresholds

Threshold	Source	Rationale
over-provision grid (10x10) >> expected pops	Van Gassen 2015	metacluster can merge, never split
maxK = 20 default	CATALYST	metacluster cap; raise if expecting more
FlowSOM needs correct K	Weber & Robinson 2016	accuracy depends on metacluster number
use median (not mean) per cluster	Bendall 2011 Science 332:687	robust to doublet/spillover contamination

Common Errors

Error / symptom	Cause	Solution
clustering uses scatter/Time/state	`features` not restricted	`features='type'` / `colsToUse=` lineage markers
Rphenograph result unusable	it returns a list	`igraph::membership(out[[2]])`
`set.seed` doesn't make FlowSOM reproducible	internal reseeding	pass `seed=` to `cluster()`
only 20 clusters no matter what	`maxK` default	raise `maxK`

References

Van Gassen 2015 Cytometry A 87(7):636-645 — FlowSOM.
Levine 2015 Cell 162(1):184-197 — PhenoGraph.
Samusik 2016 Nat Methods 13(6):493-496 — X-shift.
Weber & Robinson 2016 Cytometry A 89(12):1084-1096 — clustering benchmark (FlowSOM top + fastest).
Nowicka 2017 F1000Research 6:748 — CyTOF workflow; type-vs-state markers.
Kobak & Linderman 2021 Nat Biotechnol 39:156-157 — embedding initialization artifact.
Bendall 2011 Science 332(6030):687-696 — arcsinh-median analysis of CyTOF data.

Related Skills

compensation-transformation - Compensate/transform before clustering
gating-analysis - Supervised alternative; needed for rare populations
differential-analysis - Test abundance/state of clusters between conditions
cytometry-qc - Cluster only QC-passed events
single-cell/clustering - Leiden/Louvain on scRNA-seq (shared graph-clustering ideas)
imaging-mass-cytometry/phenotyping - Same CATALYST/FlowSOM conventions for imaging