with one click
molfeat
Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML.
Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML.
Deep generative models for single-cell omics. Use when you need probabilistic batch correction (scVI), transfer learning, differential expression with uncertainty, or multi-modal integration (TOTALVI, MultiVI). Best for advanced modeling, batch effects, multimodal data. For standard analysis pipelines use scanpy.
Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.
Modal is a serverless cloud platform for running Python on demand, including on-demand GPUs. Use when deploying or serving AI/ML models, running GPU-accelerated workloads (training, fine-tuning, inference), serving web endpoints, scheduling batch jobs, or scaling Python code to cloud containers with the Modal SDK.
End-to-end bulk RNA-seq orchestrator — takes raw FASTQ reads through QC and trimming (FastQC, fastp/Trim Galore), alignment and quantification (STAR, Salmon, featureCounts), assembles a gene-level counts matrix, then hands off to differential expression (pydeseq2), pathway/GSEA enrichment (pathway-enrichment), and publication figures (scientific-visualization). Use whenever the user has bulk RNA-seq reads or quant output and wants a complete, reproducible differential-expression workflow — e.g. "analyze my RNA-seq", "FASTQ to DESeq2", "run nf-core/rnaseq", "STAR/Salmon quantification", "build a counts matrix for DESeq2", or "go from reads to differentially expressed genes and enriched pathways". Routes between an nf-core/rnaseq (Nextflow) path and a standalone STAR/Salmon path, and covers experimental design, strandedness, and QC gates. For single-cell RNA-seq use the scanpy skill instead.
Run pathway and gene-set enrichment analysis on gene lists or ranked gene data, then interpret the results. Use whenever the user has a set of genes (differentially expressed genes from PyDESeq2/Scanpy, CRISPR-screen hits, cluster marker genes, proteomics hits) and wants to know which biological pathways, GO terms, or gene sets are over-represented or enriched. Covers over-representation analysis (ORA / Enrichr / Fisher / hypergeometric), ranked Gene Set Enrichment Analysis (GSEA / preranked), single-sample scoring (ssGSEA/GSVA), and functional profiling via gseapy, g:Profiler, Enrichr libraries, MSigDB, GO, KEGG, Reactome, and WikiPathways — plus gene-ID mapping, choosing the right background universe, multiple-testing correction, redundancy reduction, dotplots/enrichment maps, and publication-ready tables. Use this for "pathway analysis", "enrichment analysis", "GO enrichment", "KEGG/Reactome pathways", "GSEA", "over-representation", "functional annotation", or "what pathways are my genes in".
Build, run, and debug Nextflow data pipelines and nf-core workflows end to end. Use whenever the user mentions Nextflow, nf-core, .nf files, nextflow.config, DSL2, processes/channels/operators, samplesheets, or wants to run a community pipeline (e.g. nf-core/rnaseq, nf-core/sarek), write or test a module/subworkflow with nf-test, configure executors/containers (Docker, Singularity/Apptainer, Conda, Wave), scale a workflow to HPC/SLURM or cloud (AWS Batch, Google Batch, Azure, Kubernetes), or debug a failed/-resume run. Make sure to use this skill for any reproducible scientific/bioinformatics workflow work even if the user does not say the word "Nextflow", and for authoring nf-core-compliant pipelines, modules, configs, and linting.
| name | molfeat |
| description | Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML. |
| license | Apache-2.0 license |
| allowed-tools | Read Write Edit Bash |
| compatibility | Requires Python 3.9–3.10 (molfeat 0.11.0 does not support 3.11+). Requires datamol, PyTorch, and optional extras for GNN/transformer models. |
| metadata | {"version":"1.0","skill-author":"K-Dense Inc."} |
Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.
Version note: Examples target molfeat 0.11.0 (PyPI stable, May 2025). Requires Python 3.9–3.10 (requires-python caps below 3.11). Depends on datamol ≥0.8.0 and PyTorch ≥1.13. Since 0.8.7, prefer datamol Mol objects over raw rdkit.Chem.Mol. Since 0.10.1, fingerprint calculators use RDKit's rdFingerprintGenerator API internally. Since 0.11.0, pretrained models load in memory and base models are set to PyTorch evaluation mode automatically.
This skill should be used when working with:
Use a Python 3.9 or 3.10 environment (molfeat does not install on 3.11+ as of 0.11.0):
uv pip install "molfeat==0.11.0"
# With all pip-installable optional dependencies
uv pip install "molfeat[all]==0.11.0"
Optional dependency extras (PyPI):
molfeat[dgl] — GNN models (GIN variants); upstream recommends dgl<=2.0 (graphbolt issues in newer DGL)molfeat[graphormer] — Graphormer modelsmolfeat[transformer] — ChemBERTa, ChemGPT, MolT5molfeat[fcd] — FCD descriptorsmolfeat[pyg] — PyTorch Geometric featurizersmolfeat[viz] — NGLView visualization widgetsExternal featurizers: MAP4 is not bundled in molfeat extras — install from reymond-group/map4 separately. Some heavy deps (DGL, dgllife, graphormer-pretrained) are easier via conda-forge; see optional dependencies.
Molfeat organizes featurization into three hierarchical classes:
molfeat.calc)Callable objects that convert individual molecules into feature vectors. Accept RDKit Chem.Mol objects or SMILES strings.
Use calculators for:
Example:
from molfeat.calc import FPCalculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO") # Returns numpy array (2048,)
molfeat.trans)Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
Use transformers for:
Example:
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list) # Parallel processing
molfeat.trans.pretrained)Specialized transformers for deep learning models with batched inference and caching.
Use pretrained transformers for:
Example:
from molfeat.trans.pretrained import PretrainedMolTransformer
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list) # Deep learning embeddings
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}") # (4, 2048)
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")
# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
calc,
n_jobs=-1,
ignore_errors=True, # Continue on failures
verbose=True # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
Start with fingerprints:
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)
# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")
# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
For interpretable models:
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()
# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
Combine multiple featurizers:
from molfeat.trans import FeatConcat
concat = FeatConcat([
FPCalculator("maccs"), # 167 dimensions
FPCalculator("ecfp") # 2048 dimensions
]) # Result: 2215-dimensional combined features
Transformer-based embeddings:
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")
# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
Graph neural networks:
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")
# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
# ECFP - General purpose, most widely used
FPCalculator("ecfp")
# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")
# MAP4 - Efficient for large databases
FPCalculator("map4")
# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
# FCFP - Functional group based
FPCalculator("fcfp")
# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")
# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)
# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")
# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
from sklearn.ensemble import RandomForestClassifier
# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)
# Screen large library
X_screen = transformer(screening_library) # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]
# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
from sklearn.metrics.pairwise import cosine_similarity
# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Create end-to-end pipeline
pipeline = Pipeline([
('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
featurizers = {
'ECFP': FPCalculator("ecfp"),
'MACCS': FPCalculator("maccs"),
'Descriptors': RDKitDescriptors2D(),
'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
}
results = {}
for name, feat in featurizers.items():
transformer = MoleculeTransformer(feat, n_jobs=-1)
X = transformer(smiles)
# Evaluate with your ML model
score = score_model(X, y)
results[name] = score
Use the ModelStore to explore all available featurizers:
from molfeat.store.modelstore import ModelStore
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
print(f"- {model.name}: {model.description}")
# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage() # Display usage examples
# Load model
transformer = store.load("ChemBERTa-77M-MLM")
class CustomTransformer(MoleculeTransformer):
def preprocess(self, mol):
"""Custom preprocessing pipeline"""
if isinstance(mol, str):
mol = dm.to_mol(mol)
mol = dm.standardize_mol(mol)
mol = dm.remove_salts(mol)
return mol
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
import numpy as np
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
"""Process large datasets in chunks to manage memory"""
all_features = []
for i in range(0, len(smiles_list), chunk_size):
chunk = smiles_list[i:i+chunk_size]
features = transformer(chunk)
all_features.append(features)
return np.vstack(all_features)
Prefer molfeat's built-in pretrained-model cache when possible. For custom embedding caches, use NumPy arrays instead of pickle (pickle can execute arbitrary code when loading untrusted files):
import numpy as np
from pathlib import Path
cache_file = Path("embeddings_cache.npz") # fixed path under your project
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
if cache_file.exists():
embeddings = np.load(cache_file)["embeddings"]
else:
embeddings = transformer(smiles_list)
np.savez(cache_file, embeddings=embeddings)
n_jobs=-1 to utilize all CPU coresdtype=np.float32 when precision allowsignore_errors=True for large datasetsQuick reference for frequently used featurizers:
| Featurizer | Type | Dimensions | Speed | Use Case |
|---|---|---|---|---|
ecfp | Fingerprint | 2048 | Fast | General purpose |
maccs | Fingerprint | 167 | Very fast | Scaffold similarity |
desc2D | Descriptors | 200+ | Fast | Interpretable models |
mordred | Descriptors | 1800+ | Medium | Comprehensive features |
map4 | Fingerprint | 1024 | Fast | Large-scale screening |
ChemBERTa-77M-MLM | Deep learning | 768 | Slow* | Transfer learning |
gin-supervised-masking | GNN | Variable | Slow* | Graph-based models |
*First run is slow; subsequent runs benefit from caching
This skill includes comprehensive reference documentation:
Complete API documentation covering:
molfeat.calc - All calculator classes and parametersmolfeat.trans - Transformer classes and methodsmolfeat.store - ModelStore usageWhen to load: Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.
Comprehensive catalog of all 100+ featurizers organized by category:
When to load: Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
Search tip: Use grep to find specific featurizer types:
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
Practical code examples for common scenarios:
When to load: Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.
Enable error handling to skip invalid SMILES:
transformer = MoleculeTransformer(
calc,
ignore_errors=True,
verbose=True
)
Process in chunks or use streaming approaches for datasets > 100K molecules.
Some models require additional packages. Install specific extras (pin version for reproducibility):
uv pip install "molfeat[transformer]==0.11.0" # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]==0.11.0" # For GIN models
uv pip install "molfeat[graphormer]==0.11.0" # For Graphormer
Save exact configurations and document versions:
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")