Run any Skill in Manus with one click

$pwd:

molfeat-molecular-featurization

Name: Molfeat Molecular Featurization
Author: jaechang-hits

// Molecular featurization hub (100+ featurizers) for ML. SMILES to fingerprints (ECFP, MACCS, MAP4), descriptors (RDKit 2D, Mordred), pretrained embeddings (ChemBERTa, GIN, Graphormer), pharmacophores. Scikit-learn compatible with parallelization/caching. For QSAR, virtual screening, similarity, and molecular DL.

Run Skill in Manus

$ git log --oneline --stat

stars:186

forks:19

updated:May 28, 2026 at 08:36

File Explorer

3 files

SKILL.md

readonly

related-skills.json

same repository

bioservices-multi-database.md

from "jaechang-hits/SciAgent-Skills"

Unified Python interface to 40+ bioinformatics web services: UniProt proteins, KEGG pathways, ChEMBL/ChEBI/PubChem, BLAST, cross-database ID mapping, GO annotations, PPI. For deep single-DB queries use dedicated tools (gget for Ensembl, pubchempy for PubChem); bioservices excels at cross-database workflows.

2026-05-28186

cbioportal-database.md

from "jaechang-hits/SciAgent-Skills"

Cancer genomics (TCGA et al.) via cBioPortal REST API. Retrieve somatic mutations, CNAs, expression, clinical data (survival/stage/treatment) across thousands of studies. Use for TMB, oncoprints, survival analysis. For population frequencies use gnomad-database; for drug-gene interactions use opentargets-database.

2026-05-28186

cellxgene-census.md

from "jaechang-hits/SciAgent-Skills"

Query CELLxGENE Census (61M+ cells). Search by cell type/tissue/disease/organism; get AnnData, stream out-of-core, train PyTorch models. For your own data use scanpy; for annotated data use anndata.

2026-05-28186

esm-protein-language-model.md

from "jaechang-hits/SciAgent-Skills"

Protein language models (ESM3, ESM C) for sequence generation, structure prediction, inverse folding, and embeddings. Design novel proteins, extract ML features, or fold sequences. Local GPU or EvolutionaryScale Forge API. Use AlphaFold for traditional folding; RDKit for small molecules.

2026-05-28186

uniprot-protein-database.md

from "jaechang-hits/SciAgent-Skills"

Query UniProt REST API: search by gene/protein name, fetch FASTA, map IDs (Ensembl, PDB, RefSeq), access Swiss-Prot annotations. Use bioservices for multi-DB access; alphafold-database-access for structures.

2026-05-28186

zarr-python.md

from "jaechang-hits/SciAgent-Skills"

Chunked N-D arrays with compression and cloud storage. NumPy-style indexing. Backends: local, S3, GCS, ZIP, memory. Dask/Xarray integration for parallel and labeled computation. For lineage use lamindb; for labeled arrays use xarray.

2026-05-28186

package.json

"author": "jaechang-hits"

"repository": "jaechang-hits/SciAgent-Skills"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

name	molfeat-molecular-featurization
description	Molecular featurization hub (100+ featurizers) for ML. SMILES to fingerprints (ECFP, MACCS, MAP4), descriptors (RDKit 2D, Mordred), pretrained embeddings (ChemBERTa, GIN, Graphormer), pharmacophores. Scikit-learn compatible with parallelization/caching. For QSAR, virtual screening, similarity, and molecular DL.
license	Apache-2.0

Molfeat — Molecular Featurization Hub

Overview

Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers under a scikit-learn compatible API. Convert SMILES strings into numerical representations (fingerprints, descriptors, deep learning embeddings) for QSAR modeling, virtual screening, similarity searching, and chemical space analysis.

When to Use

Building QSAR/QSPR models requiring molecular features as input
Virtual screening — ranking compound libraries by predicted activity
Similarity searching against molecular databases
Chemical space analysis — clustering, visualization, dimensionality reduction
Deep learning on molecules using pretrained embeddings (ChemBERTa, GIN)
Featurization pipelines integrating with scikit-learn or PyTorch
Comparing multiple molecular representations for benchmarking
For molecular manipulation and filtering use datamol instead; for substructure-based molecular operations use rdkit-cheminformatics

Prerequisites

uv pip install molfeat

# Optional extras for specific featurizer types
uv pip install "molfeat[transformer]"   # ChemBERTa, ChemGPT, MolT5
uv pip install "molfeat[dgl]"           # GIN graph neural networks
uv pip install "molfeat[graphormer]"    # Graphormer models
uv pip install "molfeat[fcd]"           # FCD descriptors
uv pip install "molfeat[map4]"          # MAP4 fingerprints
uv pip install "molfeat[all]"           # All dependencies

Quick Start

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Create fingerprint calculator + transformer
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch in parallel
features = transformer(smiles)
print(f"Shape: {features.shape}")  # (4, 2048)

# Save configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")

Key Concepts

Architecture: Calculator → Transformer → Store

Molfeat organizes featurization into three layers:

Layer	Class	Purpose	Use When
Calculator	`molfeat.calc.*`	Single molecule → feature vector	Custom loops, single molecules
Transformer	`molfeat.trans.MoleculeTransformer`	Batch processing with parallelization	Datasets, scikit-learn pipelines
Store	`molfeat.store.ModelStore`	Discovery and loading of pretrained models	Finding available featurizers

Calculators are callable: calc("CCO") returns a numpy array. Transformers wrap calculators for batch processing: transformer(smiles_list) returns a 2D array. Pretrained transformers (PretrainedMolTransformer) add batched GPU inference and caching.

Featurizer Selection Guide

Task	Recommended	Dimensions	Speed
General QSAR	`ecfp` (radius=3)	2048	Fast
Scaffold similarity	`maccs`	167	Very fast
Large-scale screening	`map4`	1024	Fast
Interpretable models	`desc2D` (RDKitDescriptors2D)	200+	Fast
Comprehensive descriptors	`mordred`	1800+	Medium
Transfer learning	`ChemBERTa-77M-MLM`	768	Slow*
Graph-based DL	`gin-supervised-masking`	Variable	Slow*
Pharmacophore	`fcfp` or `cats2D`	2048 / 21	Fast
3D shape	`usr` / `usrcat`	12 / 60	Fast

*First run slow; subsequent runs cached.

State Persistence

Save and reload exact featurizer configuration for reproducibility:

# Save
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Reload
loaded = MoleculeTransformer.from_state_yaml_file("config.yml")

Core API

1. Fingerprint Calculators

from molfeat.calc import FPCalculator

# ECFP — most popular, general-purpose
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO")
print(f"ECFP shape: {fp.shape}")  # (2048,)

# MACCS keys — 167-bit structural keys, fast scaffold similarity
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1")
print(f"MACCS shape: {fp.shape}")  # (167,)

# Count-based fingerprints (non-binary)
ecfp_count = FPCalculator("ecfp-count", radius=3, fpSize=2048)

# MAP4 — MinHashed atom-pair, efficient for large databases
map4 = FPCalculator("map4")
print(f"MAP4 shape: {map4('CCO').shape}")  # (1024,)

Available fingerprint types: ecfp, fcfp, maccs, rdkit, avalon, pattern, layered, atompair, topological, map4, secfp, erg, estate (and count variants with -count suffix).

2. Descriptor Calculators

from molfeat.calc import RDKitDescriptors2D, MordredDescriptors

# RDKit 2D — 200+ named properties (MW, logP, TPSA, etc.)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"2D descriptors: {len(descriptors)}")  # 200+
print(f"Feature names: {desc2d.columns[:5]}")

# Mordred — 1800+ comprehensive descriptors
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O")
print(f"Mordred descriptors: {len(descriptors)}")  # 1800+

3. Pharmacophore & Shape Calculators

from molfeat.calc import CATSCalculator, USRDescriptors

# CATS — pharmacophore point pair distributions
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C")
print(f"CATS shape: {descriptors.shape}")  # (21,)

# USR — ultrafast shape recognition
usr = USRDescriptors()
shape = usr("CC(=O)Oc1ccccc1C(=O)O")
print(f"USR shape: {shape.shape}")  # (12,)

4. Batch Processing with Transformers

from molfeat.trans import MoleculeTransformer, FeatConcat
from molfeat.calc import FPCalculator

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O", "CCCC"]

# Parallel batch processing
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles)
print(f"Batch shape: {features.shape}")  # (5, 2048)

# Concatenate multiple featurizers
concat = FeatConcat([
    FPCalculator("maccs"),      # 167 dims
    FPCalculator("ecfp")        # 2048 dims
])
combo_transformer = MoleculeTransformer(concat, n_jobs=-1)
combo_features = combo_transformer(smiles)
print(f"Combined shape: {combo_features.shape}")  # (5, 2215)

# Error-tolerant processing
safe_transformer = MoleculeTransformer(
    FPCalculator("ecfp"), n_jobs=-1,
    ignore_errors=True, verbose=True
)
features = safe_transformer(["CCO", "invalid", "c1ccccc1"])
# Returns None for failed molecules

5. Pretrained Model Embeddings

from molfeat.trans.pretrained import PretrainedMolTransformer

# ChemBERTa — RoBERTa trained on 77M PubChem compounds
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = chemberta(["CCO", "CC(=O)O", "c1ccccc1"])
print(f"ChemBERTa shape: {embeddings.shape}")  # (3, 768)

# GIN — graph neural network pretrained on ChEMBL
gin = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
graph_emb = gin(["CCO", "CC(=O)O"])
print(f"GIN shape: {graph_emb.shape}")

6. ModelStore — Discovering Featurizers

from molfeat.store.modelstore import ModelStore

store = ModelStore()
print(f"Total available: {len(store.available_models)}")

# Search for specific model
results = store.search(name="ChemBERTa")
for model in results:
    print(f"  {model.name}: {model.description}")

# View usage and load
card = store.search(name="ChemBERTa-77M-MLM")[0]
card.usage()
transformer = store.load("ChemBERTa-77M-MLM")

Common Workflows

Workflow 1: QSAR Model Building

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp", radius=3), n_jobs=-1)
X = transformer(smiles_train)
print(f"Features shape: {X.shape}")

# Train and evaluate
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} ± {scores.std():.3f}")

# Save for deployment
transformer.to_state_yaml_file("production_featurizer.yml")

Workflow 2: Virtual Screening Pipeline

from sklearn.ensemble import RandomForestClassifier

# Step 1: Featurize known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)

# Step 2: Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)

# Step 3: Screen library (e.g., 1M compounds)
X_screen = transformer(screening_smiles)
predictions = clf.predict_proba(X_screen)[:, 1]

# Step 4: Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_smiles[i] for i in top_indices]
print(f"Top 1000 hits selected from {len(screening_smiles)} compounds")

Workflow 3: Featurizer Benchmarking

from molfeat.calc import FPCalculator, RDKitDescriptors2D
from sklearn.metrics import roc_auc_score

featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'Descriptors': RDKitDescriptors2D(),
}

for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")

Common Recipes

Recipe: Scikit-learn Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)

Recipe: Similarity Search

from sklearn.metrics.pairwise import cosine_similarity

calc = FPCalculator("ecfp")
query_fp = calc("CC(=O)Oc1ccccc1C(=O)O").reshape(1, -1)  # Aspirin

transformer = MoleculeTransformer(calc, n_jobs=-1)
db_fps = transformer(database_smiles)

similarities = cosine_similarity(query_fp, db_fps)[0]
top_k = similarities.argsort()[-10:][::-1]
for i in top_k:
    print(f"  {database_smiles[i]}: {similarities[i]:.3f}")

Recipe: Chunk Processing for Large Datasets

import numpy as np

def featurize_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {min(i+chunk_size, len(smiles_list))}/{len(smiles_list)}")
    return np.vstack(all_features)

Key Parameters

Parameter	Module	Default	Description
`method`	FPCalculator	—	Fingerprint type: `ecfp`, `maccs`, `map4`, etc.
`radius`	FPCalculator	3	Circular fingerprint radius
`fpSize`	FPCalculator	2048	Fingerprint bit length
`counting`	FPCalculator	False	Count vector instead of binary
`n_jobs`	MoleculeTransformer	1	Parallel workers (-1 = all cores)
`ignore_errors`	MoleculeTransformer	False	Skip invalid molecules (returns None)
`verbose`	MoleculeTransformer	False	Log processing details
`dtype`	MoleculeTransformer	float64	Output type (float32 for memory)
`mode`	CATSCalculator	"2D"	Distance calculation mode
`scale`	CATSCalculator	"raw"	Scaling: `raw`, `num`, `count`

Best Practices

Use n_jobs=-1 for parallel processing on all CPU cores — significant speedup for batch featurization
Start with ECFP for initial baselines — best general-purpose fingerprint before trying deep learning
Use ignore_errors=True for large datasets — invalid SMILES won't crash the pipeline
Save configurations with to_state_yaml_file() for reproducibility — recreate exact featurizer later
Use float32 when memory matters: MoleculeTransformer(calc, dtype=np.float32)
Cache pretrained embeddings — first ChemBERTa/GIN inference is slow, subsequent runs use cache
Process in chunks for datasets >100K — prevents memory exhaustion (see Recipes)
Combine fingerprints with FeatConcat to capture complementary molecular information

Troubleshooting

Problem	Cause	Solution
`ValueError: unsupported featurizer`	Unknown method name	Check `FPCalculator` supported types or use `ModelStore.search()`
`ImportError` for pretrained model	Missing optional dependency	Install extras: `pip install "molfeat[transformer]"` or `"molfeat[dgl]"`
`None` in output array	Invalid SMILES with `ignore_errors=True`	Filter results: `[f for f in features if f is not None]`
Memory error on large dataset	Too many molecules at once	Process in chunks of 10K-50K (see Recipes)
Slow pretrained model inference	First run downloads model weights	Normal — subsequent runs use cache
Shape mismatch in pipeline	Mixed valid/invalid molecules	Ensure `ignore_errors=True` and filter None before ML model
Reproducibility issues	Different molfeat versions	Pin version and save config: `transformer.to_state_yaml_file()`

Related Skills

datamol-cheminformatics — High-level molecular manipulation (standardization, I/O, conformers)
rdkit-cheminformatics — Low-level cheminformatics (substructure, reactions, 3D)
scikit-learn — ML models consuming molfeat features

References

Official documentation: https://molfeat-docs.datamol.io/
GitHub repository: https://github.com/datamol-io/molfeat
PyPI package: https://pypi.org/project/molfeat/
Tutorial: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6

Bundled Resources

Main SKILL.md + 2 reference files. Original total: 1,273 lines (SKILL.md 510 + api_reference.md 429 + available_featurizers.md 334). Scripts: none. Examples: 724 lines (examples.md).

references/available_featurizers.md: Complete catalog of all 100+ featurizers organized by category — transformer models, GNNs, descriptors, fingerprints, pharmacophore, shape, scaffold, graph featurizers. Includes dimensions, dependencies, and selection guidance per category. Purely lookup-oriented content preserved as reference.

references/api_reference.md: Detailed API reference for molfeat.calc, molfeat.trans, and molfeat.store modules. Covers SerializableCalculator base class, all calculator subclasses with parameters, MoleculeTransformer methods, PretrainedMolTransformer, FeatConcat, ModelStore/ModelCard API, data type control, and PyTorch integration patterns.

Original file disposition:

SKILL.md (510 lines) → Core API modules 1-6, Key Concepts (architecture, selection guide), Quick Start, Workflows 1-3. "Choosing the Right Featurizer" → Key Concepts selection guide table. "Advanced Features" (custom preprocessing, batch processing, caching) → Recipes + Best Practices. "Common Featurizers Reference" table → Key Concepts selection guide. "Performance Tips" → Best Practices. Per-use-case disposition: QSAR Modeling → Workflow 1, Virtual Screening → Workflow 2, Similarity Search → Recipe, Chemical Space → When to Use bullet, scikit-learn Pipeline → Recipe, Featurizer Comparison → Workflow 3
references/api_reference.md (429 lines) → Migrated to new references/api_reference.md. Core patterns (FPCalculator, MoleculeTransformer, basic ModelStore) relocated to SKILL.md Core API modules 1-6. Detailed class methods, SerializableCalculator base class, PrecomputedMolTransformer, and PyTorch integration retained in reference
references/available_featurizers.md (334 lines) → Migrated to new references/available_featurizers.md. Top-level summary → Key Concepts selection guide table. Full categorized catalog retained in reference
references/examples.md (724 lines) → Fully consolidated inline: installation → Prerequisites; calculator examples → Core API 1-3; transformer examples → Core API 4; pretrained examples → Core API 5; ML integration → Workflows 1-3 + Recipes; advanced patterns (custom preprocessing, caching, chunk processing) → Recipes + Best Practices; troubleshooting → Troubleshooting table. No separate reference file needed — all content absorbed into SKILL.md sections

Retention: ~490 lines (SKILL.md) + ~170 lines (available_featurizers) + ~190 lines (api_reference) = ~850 / 1,273 original (excl. examples.md treated as consolidated) = ~67%. Including examples.md in denominator: ~850 / 1,997 = ~43%.

molfeat-molecular-featurization

More from this repository

More from this repository

Molfeat — Molecular Featurization Hub

Overview

When to Use

Prerequisites

Quick Start

Key Concepts

Architecture: Calculator → Transformer → Store

Featurizer Selection Guide

State Persistence

Core API

1. Fingerprint Calculators

2. Descriptor Calculators

3. Pharmacophore & Shape Calculators

4. Batch Processing with Transformers

5. Pretrained Model Embeddings

6. ModelStore — Discovering Featurizers

Common Workflows

Workflow 1: QSAR Model Building

Workflow 2: Virtual Screening Pipeline

Workflow 3: Featurizer Benchmarking

Common Recipes

Recipe: Scikit-learn Pipeline Integration

Recipe: Similarity Search

Recipe: Chunk Processing for Large Datasets

Key Parameters

Best Practices

Troubleshooting

Related Skills

References

Bundled Resources

Molfeat — Molecular Featurization Hub

Overview

When to Use

Prerequisites

Quick Start

Key Concepts

Architecture: Calculator → Transformer → Store

Featurizer Selection Guide

State Persistence

Core API

1. Fingerprint Calculators

2. Descriptor Calculators

3. Pharmacophore & Shape Calculators

4. Batch Processing with Transformers

5. Pretrained Model Embeddings

6. ModelStore — Discovering Featurizers

Common Workflows

Workflow 1: QSAR Model Building

Workflow 2: Virtual Screening Pipeline

Workflow 3: Featurizer Benchmarking

Common Recipes

Recipe: Scikit-learn Pipeline Integration

Recipe: Similarity Search

Recipe: Chunk Processing for Large Datasets

Key Parameters

Best Practices

Troubleshooting

Related Skills

References

Bundled Resources