| name | datamol |
| description | Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly. |
| license | Apache-2.0 license |
| metadata | {"skill-author":"K-Dense Inc."} |
Datamol Cheminformatics Skill
Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.
Key capabilities:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
Installation and Setup
Guide users to install datamol:
uv pip install datamol
Import convention:
import datamol as dm
Core Workflows
1. Basic Molecule Handling
Creating molecules from SMILES:
import datamol as dm
mol = dm.to_mol("CCO")
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
mol = dm.to_mol("invalid_smiles")
if mol is None:
    print("Failed to parse SMILES")
Converting molecules to SMILES:
smiles = dm.to_smiles(mol)
smiles = dm.to_smiles(mol, isomeric=True)
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
Standardization and sanitization (always recommend for user-provided molecules):
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(
mol,
disconnect_metals=True,
normalize=True,
reionize=True
)
clean_smiles = dm.standardize_smiles(smiles)
2. Reading and Writing Molecular Files
Refer to references/io_module.md for comprehensive I/O documentation.
Reading files:
df = dm.read_sdf("compounds.sdf", mol_column='mol')
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
df = dm.open_df("file.sdf")
Writing files:
dm.to_sdf(mols, "output.sdf")
dm.to_sdf(df, "output.sdf", mol_column="mol")
dm.to_smi(mols, "output.smi")
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
Remote file support (S3, GCS, HTTP):
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")
dm.to_sdf(mols, "s3://bucket/output.sdf")
3. Molecular Descriptors and Properties
Refer to references/descriptors_viz.md for detailed descriptor documentation.
Computing descriptors for a single molecule:
descriptors = dm.descriptors.compute_many_descriptors(mol)
Batch descriptor computation (recommended for datasets):
desc_df = dm.descriptors.batch_compute_many_descriptors(
mols,
n_jobs=-1,
progress=True
)
Specific descriptors:
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
n_rigid = dm.descriptors.n_rigid_bonds(mol)
Drug-likeness filtering (Lipinski's Rule of Five):
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    # Descriptor key names vary slightly across datamol versions;
    # inspect desc.keys() if these raise a KeyError.
    return (
        desc['mw'] <= 500 and
        desc['clogp'] <= 5 and
        desc['n_lipinski_hbd'] <= 5 and
        desc['n_lipinski_hba'] <= 10
    )
druglike_mols = [mol for mol in mols if is_druglike(mol)]
4. Molecular Fingerprints and Similarity
Generating fingerprints:
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
Similarity calculations:
distance_matrix = dm.pdist(mols, n_jobs=-1)
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
from scipy.spatial.distance import squareform
dist_matrix = squareform(dm.pdist(mols))
5. Clustering and Diversity Selection
Refer to references/core_api.md for clustering details.
Butina clustering:
clusters = dm.cluster_mols(
mols,
cutoff=0.2,
n_jobs=-1
)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}: {len(cluster)} molecules")
    members = [mols[idx] for idx in cluster]
Important: Butina clustering builds a full pairwise distance matrix, so it is practical for roughly 1,000 molecules but not for 10,000+.
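To see why, note that the condensed distance matrix grows quadratically with the number of molecules. A back-of-the-envelope estimate (plain Python, no datamol required):

```python
def butina_matrix_mb(n_mols: int, bytes_per_value: int = 8) -> float:
    """Estimate memory (MB) for the condensed pairwise distance matrix
    of n_mols molecules: n * (n - 1) / 2 float64 values."""
    n_pairs = n_mols * (n_mols - 1) // 2
    return n_pairs * bytes_per_value / 1e6

print(f"{butina_matrix_mb(1_000):.1f} MB")    # ~4 MB: fine
print(f"{butina_matrix_mb(100_000):.0f} MB")  # ~40,000 MB (40 GB): not fine
```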
Diversity selection:
diverse_mols = dm.pick_diverse(
mols,
npick=100
)
centroids = dm.pick_centroids(
mols,
npick=50
)
6. Scaffold Analysis
Refer to references/fragments_scaffolds.md for complete scaffold documentation.
Extracting Murcko scaffolds:
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
Scaffold-based analysis:
from collections import Counter
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
    if scaf_smi not in scaffold_groups:
        scaffold_groups[scaf_smi] = []
    scaffold_groups[scaf_smi].append(mol)
Scaffold-based train/test splitting (for ML):
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    if scaf not in scaffold_to_mols:
        scaffold_to_mols[scaf] = []
    scaffold_to_mols[scaf].append(mol)
import random
scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
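Because every molecule in a scaffold group lands on the same side of the split, train and test sets share no scaffold. The steps above can be wrapped into a small helper with a sanity check of that invariant (plain Python, independent of datamol; the toy strings stand in for Mol objects):

```python
import random

def scaffold_split(scaffold_to_mols: dict, frac_train: float = 0.8, seed: int = 0):
    """Shuffle scaffolds and assign whole groups to train or test,
    so no scaffold appears on both sides of the split."""
    scaffolds = sorted(scaffold_to_mols)  # sort first for reproducibility
    random.Random(seed).shuffle(scaffolds)
    split = int(frac_train * len(scaffolds))
    train = [m for s in scaffolds[:split] for m in scaffold_to_mols[s]]
    test = [m for s in scaffolds[split:] for m in scaffold_to_mols[s]]
    return train, test

# Toy data: strings stand in for rdkit.Chem.Mol objects
groups = {"c1ccccc1": ["m1", "m2"], "C1CCCCC1": ["m3"], "c1ccncc1": ["m4", "m5"]}
train, test = scaffold_split(groups)
assert not set(train) & set(test)  # no molecule, hence no scaffold, overlaps
```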
7. Molecular Fragmentation
Refer to references/fragments_scaffolds.md for fragmentation details.
BRICS fragmentation (16 bond types):
fragments = dm.fragment.brics(mol)
RECAP fragmentation (11 bond types):
fragments = dm.fragment.recap(mol)
Fragment analysis:
from collections import Counter
all_fragments = []
for mol in mols:
    frags = dm.fragment.brics(mol)
    all_fragments.extend(frags)
fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)
def fragment_score(mol, reference_fragments):
    # brics() returns a sequence of fragment SMILES; use sets for the overlap
    mol_frags = set(dm.fragment.brics(mol))
    overlap = mol_frags.intersection(reference_fragments)
    return len(overlap) / len(mol_frags) if mol_frags else 0
8. 3D Conformer Generation
Refer to references/conformers_module.md for detailed conformer documentation.
Generating conformers:
mol_3d = dm.conformers.generate(
mol,
n_confs=50,
rms_cutoff=0.5,
minimize_energy=True,
method='ETKDGv3'
)
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0)
positions = conf.GetPositions()
Conformer clustering:
clusters = dm.conformers.cluster(
mol_3d,
rms_cutoff=1.0,
centroids=False
)
centroids = dm.conformers.return_centroids(mol_3d, clusters)
SASA calculation:
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
9. Visualization
Refer to references/descriptors_viz.md for visualization documentation.
Basic molecule grid:
dm.viz.to_image(
mols[:20],
legends=[dm.to_smiles(m) for m in mols[:20]],
n_cols=5,
mol_size=(300, 300)
)
dm.viz.to_image(mols, outfile="molecules.png")
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
Aligned visualization (for SAR analysis):
dm.viz.to_image(
similar_mols,
align=True,
legends=activity_labels,
n_cols=4
)
Highlighting substructures:
dm.viz.to_image(
mol,
highlight_atom=[0, 1, 2, 3],
highlight_bond=[0, 1, 2]
)
Conformer visualization:
dm.viz.conformers(
mol_3d,
n_confs=10,
align_conf=True,
n_cols=3
)
10. Chemical Reactions
Refer to references/reactions_data.md for reactions documentation.
Applying reactions:
from rdkit.Chem import rdChemReactions
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
reactant = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(
rxn,
(reactant,),
sanitize=True
)
product_smiles = dm.to_smiles(product)
Batch reaction application:
products = []
for mol in reactant_mols:
    try:
        prod = dm.reactions.apply_reaction(rxn, (mol,))
        if prod is not None:
            products.append(prod)
    except Exception as e:
        print(f"Reaction failed: {e}")
Parallelization
Datamol includes built-in parallelization for many operations. Use n_jobs parameter:
- n_jobs=1: sequential (no parallelization)
- n_jobs=-1: use all available CPU cores
- n_jobs=4: use 4 cores
Functions supporting parallelization:
dm.read_sdf(..., n_jobs=-1)
dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)
dm.cluster_mols(..., n_jobs=-1)
dm.pdist(..., n_jobs=-1)
dm.conformers.sasa(..., n_jobs=-1)
Progress bars: many batch operations accept a progress=True parameter.
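For custom per-molecule work that datamol does not parallelize for you, the same n_jobs pattern can be reproduced with the standard library. A minimal sketch (datamol itself uses joblib under the hood; this thread-pool stand-in is best suited to I/O-bound steps, and parallel_map is a hypothetical helper, not a datamol function):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, n_jobs=4):
    """Apply fn to each item using a thread pool, preserving input order.
    n_jobs=1 falls back to a plain sequential loop, mirroring datamol's
    n_jobs convention."""
    if n_jobs == 1:
        return [fn(x) for x in items]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(fn, items))

# Toy example: string lengths stand in for a per-molecule computation
lengths = parallel_map(len, ["CCO", "c1ccccc1", "CC(=O)O"])
print(lengths)  # [3, 8, 7]
```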
Common Workflows and Patterns
Complete Pipeline: Data Loading → Filtering → Analysis
import datamol as dm
import pandas as pd
df = dm.read_sdf("compounds.sdf")
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m is not None else None)
df = df[df['mol'].notna()].reset_index(drop=True)
desc_df = dm.descriptors.batch_compute_many_descriptors(
df['mol'].tolist(),
n_jobs=-1,
progress=True
)
# Descriptor column names vary slightly across datamol versions;
# inspect desc_df.columns if these raise a KeyError.
druglike = (
    (desc_df['mw'] <= 500) &
    (desc_df['clogp'] <= 5) &
    (desc_df['n_lipinski_hbd'] <= 5) &
    (desc_df['n_lipinski_hba'] <= 10)
)
filtered_df = df[druglike.to_numpy()]  # positional mask: df's index may not align with desc_df's
diverse_mols = dm.pick_diverse(
filtered_df['mol'].tolist(),
npick=100
)
dm.viz.to_image(
diverse_mols,
legends=[dm.to_smiles(m) for m in diverse_mols],
outfile="diverse_compounds.png",
n_cols=10
)
Structure-Activity Relationship (SAR) Analysis
import pandas as pd
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
sar_df = pd.DataFrame({
    'mol': mols,
    'scaffold': scaffold_smiles,
    'activity': activities
})
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:
        print(f"\nScaffold: {scaffold}")
        print(f"Count: {len(group)}")
        print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
        dm.viz.to_image(
            group['mol'].tolist(),
            legends=[f"Activity: {act:.2f}" for act in group['activity']],
            align=True
        )
Virtual Screening Pipeline
import numpy as np
# dm.cdist computes pairwise Tanimoto distances directly from the molecules,
# so there is no need to build fingerprints by hand first
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances
top_indices = np.argsort(similarities)[::-1][:100]
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]
dm.viz.to_image(
top_hits[:20],
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
outfile="screening_hits.png"
)
Reference Documentation
For detailed API documentation, consult these reference files:
references/core_api.md: Core namespace functions (conversions, standardization, fingerprints, clustering)
references/io_module.md: File I/O operations (read/write SDF, CSV, Excel, remote files)
references/conformers_module.md: 3D conformer generation, clustering, SASA calculations
references/descriptors_viz.md: Molecular descriptors and visualization functions
references/fragments_scaffolds.md: Scaffold extraction, BRICS/RECAP fragmentation
references/reactions_data.md: Chemical reactions and toy datasets
Best Practices
- Always standardize molecules from external sources:
  mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
- Check for None values after molecule parsing:
  mol = dm.to_mol(smiles)
  if mol is None:
      continue  # or log and skip the invalid SMILES
- Use parallel processing for large datasets:
  result = dm.operation(..., n_jobs=-1, progress=True)
- Leverage fsspec for cloud storage:
  df = dm.read_sdf("s3://bucket/compounds.sdf")
- Use appropriate fingerprints for similarity:
  - ECFP (Morgan): general purpose, structural similarity
  - MACCS: fast, smaller feature space
  - Atom pairs: considers atom pairs and distances
- Consider scale limitations:
  - Butina clustering: ~1,000 molecules (full distance matrix)
  - For larger datasets: use diversity selection or hierarchical methods
- Scaffold splitting for ML: ensure proper train/test separation by scaffold
- Align molecules when visualizing SAR series
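The fingerprint guidance above can be made concrete: Tanimoto similarity on binary fingerprints is simply intersection over union of on-bits. A minimal NumPy sketch on toy bit vectors (real fingerprints come from dm.to_fp; the 8-bit vectors here are illustrative only):

```python
import numpy as np

def tanimoto(fp1: np.ndarray, fp2: np.ndarray) -> float:
    """Tanimoto similarity between two binary fingerprint vectors:
    |A AND B| / |A OR B|."""
    a, b = fp1.astype(bool), fp2.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

# Toy 8-bit fingerprints (real ECFP vectors are typically 2048 bits)
fp_a = np.array([1, 1, 0, 1, 0, 0, 1, 0])
fp_b = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(tanimoto(fp_a, fp_b))  # 3 shared on-bits / 5 total on-bits = 0.6
```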
Error Handling
def safe_to_mol(smiles):
    try:
        mol = dm.to_mol(smiles)
        if mol is not None:
            mol = dm.standardize_mol(mol)
        return mol
    except Exception as e:
        print(f"Failed to process {smiles}: {e}")
        return None
valid_mols = []
for smiles in smiles_list:
    mol = safe_to_mol(smiles)
    if mol is not None:
        valid_mols.append(mol)
Integration with Machine Learning
import numpy as np
X = np.array([dm.to_fp(mol) for mol in mols])
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y_target)
predictions = model.predict(X_test)
Troubleshooting
Issue: Molecule parsing fails
- Solution: Use dm.standardize_smiles() first, or try dm.fix_mol()
Issue: Memory errors with clustering
- Solution: Use dm.pick_diverse() instead of full clustering for large sets
Issue: Slow conformer generation
- Solution: Reduce n_confs or increase rms_cutoff to generate fewer conformers
Issue: Remote file access fails
- Solution: Ensure fsspec and the appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)