single-cell-preprocessing-with-omicverse
Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.
| name | single-cell-preprocessing-with-omicverse |
| title | Single-cell preprocessing with omicverse |
| description | Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks. |
Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks t_preprocess.ipynb, t_preprocess_cpu.ipynb, and t_preprocess_gpu.ipynb. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.
- Import omicverse as ov and scanpy as sc, then call ov.plot_set(font_path='Arial') (or ov.ov_plot_set() in legacy notebooks) to standardise figure styling.
- Use %load_ext autoreload and %autoreload 2 when iterating inside notebooks so code edits propagate without restarting the kernel.
- Download the PBMC3k archive (pbmc3k_filtered_gene_bc_matrices.tar.gz) and extract it under data/filtered_gene_bc_matrices/hg19/.
- Load the matrix with sc.read_10x_mtx(..., var_names='gene_symbols', cache=True) and keep a writable folder like write/ for exports.
- Run ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet') for the CPU/CPU–GPU pipelines; omit doublets_method on pure GPU, where Scrublet is not yet supported.
- Call ov.utils.store_layers(adata, layers='counts') immediately after QC so the original counts remain accessible for later recovery and comparison.
- Use ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5) to apply shift-log normalisation followed by Pearson residual HVG detection (set target_sum=None on GPU, which keeps defaults).
- Use ov.pp.recover_counts(...) to invert normalisation and store reconstructed counts in adata.layers['recover_counts'].
- For .raw and layer recovery, snapshot the normalised data to .raw with adata.raw = adata (or adata.raw = adata.copy()), and show ov.utils.retrieve_layers(adata_counts, layers='counts') to compare normalised vs. raw intensities.
- Run ov.pp.scale(adata) (layers hold scaled matrices) followed by ov.pp.pca(adata, layer='scaled', n_pcs=50).
- Build the neighbour graph with the call that matches the notebook variant:
  - sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca') for the baseline notebook.
  - ov.pp.neighbors(..., use_rep='scaled|original|X_pca') on CPU–GPU to leverage accelerated routines.
  - ov.pp.neighbors(..., method='cagra') on GPU to call RAPIDS graph primitives.
- Compute embeddings with ov.utils.mde(...), ov.pp.umap(adata), ov.pp.mde(...), ov.pp.tsne(...), or ov.pp.sude(...) depending on the notebook variant.
- Cluster with ov.pp.leiden(adata, resolution=1) or ov.single.leiden(adata, resolution=1.0) after neighbour graph construction; CPU–GPU pipelines also showcase ov.pp.score_genes_cell_cycle before clustering.
- Before plotting by cluster labels (e.g. color='leiden'), always check if the clustering has been performed first:
# Check if leiden clustering exists, if not, run it
if 'leiden' not in adata.obs:
    if 'neighbors' not in adata.uns:
        ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
    ov.single.leiden(adata, resolution=1.0)
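The same guard extends to any key passed to color=. The sketch below is a hypothetical helper (split_color_keys is an illustrative name, not part of omicverse or scanpy). It takes the names found in adata.obs.columns and adata.var_names as plain lists and separates plottable keys from missing ones, so it runs without either library installed.

```python
from typing import Iterable, List, Tuple


def split_color_keys(
    requested: Iterable[str],
    obs_columns: Iterable[str],
    var_names: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split requested colour keys into (plottable, missing).

    A key counts as plottable if it names an obs column (e.g. 'leiden')
    or a gene in var_names (e.g. a marker gene).
    """
    known = set(obs_columns) | set(var_names)
    plottable, missing = [], []
    for key in requested:
        (plottable if key in known else missing).append(key)
    return plottable, missing


# Stand-ins for list(adata.obs.columns) and list(adata.var_names).
obs_columns = ["n_counts", "leiden"]
var_names = ["CD3D", "MS4A1", "NKG7"]

plottable, missing = split_color_keys(
    ["leiden", "CD3D", "louvain"], obs_columns, var_names
)
print(plottable)  # ['leiden', 'CD3D']
print(missing)    # ['louvain']
```

Anything that ends up in missing is a signal to run the corresponding step (clustering, scoring) or drop the key before calling the plotting function.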
- Plot embeddings with ov.pl.embedding(...) or ov.utils.embedding(...), colouring by leiden clusters and marker genes. Always verify that the column specified in the color= parameter exists in adata.obs before plotting.
- Save the processed object (adata.write('write/pbmc3k_preprocessed.h5ad')) and export figures with Matplotlib's plt.savefig(...) to preserve QC summaries and embeddings.

Notebook-specific notes:
- Baseline (t_preprocess.ipynb): Focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after retrieve_layers demonstrations.
- CPU–GPU mixed (t_preprocess_cpu.ipynb): Highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (%%time) to showcase speedups and call out doublets_method='scrublet' support.
- GPU (t_preprocess_gpu.ipynb): Requires a CUDA-capable GPU, RAPIDS 24.04 stack, and rapids-singlecell. Mention the ov.pp.anndata_to_GPU/ov.pp.anndata_to_CPU transfers and method='cagra' neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.

Troubleshooting:
- If sc.read_10x_mtx fails, verify the extracted folder structure and ensure gene symbols are available via var_names='gene_symbols'.
- For GPU failures, first confirm that the device is visible to the driver (nvidia-smi).
- If ov.pp.preprocess reports dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
- If a representation key is not found (e.g. scaled|original|X_pca missing), re-run ov.pp.scale and ov.pp.pca to rebuild the cached layers.
- Check that columns exist in adata.obs before trying to colour plots by them.

IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.
CORRECT usage:
# Step 1: Check if batch column exists, create default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch
# Step 2: Handle NaN/missing values - CRITICAL!
# Cast to object first: fillna() raises on a categorical column
# when 'unknown' is not already one of its categories
adata.obs['batch'] = adata.obs['batch'].astype(object).fillna('unknown')
# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype(str).astype('category')
# Now safe to use in batch-aware operations
sc.pp.combat(adata, key='batch')  # scanpy's ComBat; or other batch correction methods
WRONG - DO NOT USE:
# WRONG! Using batch column without validation can cause NaN errors
# sc.pp.combat(adata, key='batch')  # May fail if batch has NaN values!
# WRONG! Assuming batch column exists
# adata.obs['batch'].unique()  # KeyError if column doesn't exist!
- Always apply fillna() before batch operations.

# Complete defensive batch preparation pattern:
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    # Cast to object so fillna() also works when the column is already categorical
    labels = adata.obs[batch_key].astype(object).fillna('unknown')
    adata.obs[batch_key] = labels.astype(str).astype('category')
    return adata
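One case a batch-preparation routine has to survive is a batch column that is already categorical, which is common after reading an .h5ad file: pandas refuses to fill a categorical with a label that is not among its categories. The sketch below uses a plain pandas DataFrame as a stand-in for adata.obs, and clean_batch_labels is an illustrative name rather than an omicverse function.

```python
import pandas as pd


def clean_batch_labels(labels: pd.Series, fill_value: str = "unknown") -> pd.Series:
    """Return batch labels as a NaN-free categorical Series."""
    # Drop any categorical dtype first so a new label can be introduced.
    cleaned = labels.astype(object)
    # Replace missing entries with the placeholder label.
    cleaned = cleaned.where(cleaned.notna(), fill_value)
    # Normalise everything to strings, then store compactly as categories.
    return cleaned.astype(str).astype("category")


# Stand-in for adata.obs with a categorical batch column containing a gap.
obs = pd.DataFrame({"batch": pd.Categorical(["a", None, "b", "a"])})
obs["batch"] = clean_batch_labels(obs["batch"])

print(list(obs["batch"]))          # ['a', 'unknown', 'b', 'a']
print(obs["batch"].isna().sum())   # 0
```

Casting to object before filling is the design choice that matters here; the rest mirrors the three steps described for the batch column.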
IMPORTANT: The seurat_v3 HVG flavor uses LOESS regression, which can fail on small datasets or small per-batch subsets (<500 cells per batch). The failure manifests as:
ValueError: Extrapolation not allowed with blending
CORRECT - Use try/except fallback pattern:
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',  # expects raw counts
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size; expects log-normalised data
            n_top_genes=2000
        )
    else:
        raise
Alternative - Use cell_ranger flavor for batch-aware HVG:
# cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
- Prefer seurat or cell_ranger when batch sizes vary significantly.
- Use seurat_v3 only when all batches have >500 cells.

# Safe batch-aware HVG pattern
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata, flavor='seurat_v3', n_top_genes=n_top_genes, batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches (note: no batch_key, so HVGs are selected globally)
        sc.pp.highly_variable_genes(
            adata, flavor='seurat', n_top_genes=n_top_genes
        )
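The >500-cells rule of thumb can also be checked up front rather than discovered through the exception. The following is a minimal sketch under that assumption: choose_hvg_flavor is a hypothetical helper, the threshold is the document's heuristic rather than a scanpy constant, and the plain list stands in for adata.obs['batch'].

```python
from collections import Counter
from typing import Iterable


def choose_hvg_flavor(batch_labels: Iterable[str], min_cells: int = 500) -> str:
    """Return 'seurat_v3' only if every batch has more than min_cells cells."""
    counts = Counter(batch_labels)
    if counts and all(n > min_cells for n in counts.values()):
        return "seurat_v3"
    # At least one batch is too small for a reliable LOESS fit.
    return "seurat"


# Stand-in for adata.obs['batch']: one large batch and one small batch.
labels = ["a"] * 600 + ["b"] * 120
print(choose_hvg_flavor(labels))                          # seurat
print(choose_hvg_flavor(["a"] * 600 + ["b"] * 501))       # seurat_v3
```

The returned string can be passed as flavor= to sc.pp.highly_variable_genes; keeping the try/except fallback as well still guards against datasets where LOESS fails for other reasons.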
- "… shiftlog|pearson, and compute MDE + UMAP embeddings on CPU."
- "… method='cagra' neighbours, and return embeddings to CPU for plotting."
- t_preprocess.ipynb, t_preprocess_cpu.ipynb, t_preprocess_gpu.ipynb
- reference.md