| name | datamol |
| description | Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly. |
| license | Apache-2.0 license |
| allowed-tools | Read Write Edit Bash |
| compatibility | Requires Python 3.8+ and datamol (uv pip install). RDKit is installed automatically as a datamol dependency (since 0.12.2). Optional s3fs/gcsfs for cloud I/O via fsspec. |
| metadata | {"version":"1.0","skill-author":"K-Dense Inc."} |
Datamol Cheminformatics Skill
Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. Simplify complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native rdkit.Chem.Mol instances, ensuring full compatibility with the RDKit ecosystem.
Version note: Examples target datamol 0.12.x (PyPI stable: 0.12.5, June 2024). Since 0.10.0, modules are lazy-loaded by default (set DATAMOL_DISABLE_LAZY_LOADING=1 to disable). Since 0.12.2, RDKit is a direct PyPI dependency of datamol. Fingerprints use RDKit's rdFingerprintGenerator API (0.12.5+).
Key capabilities:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
Installation and Setup
Guide users to install datamol:
uv pip install datamol
RDKit is installed automatically with datamol. For remote file paths (S3, GCS, HTTP), install the matching fsspec backend:
uv pip install s3fs
uv pip install gcsfs
Import convention:
import datamol as dm
Core Workflows
1. Basic Molecule Handling
Creating molecules from SMILES:
import datamol as dm
mol = dm.to_mol("CCO")
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]
mol = dm.to_mol("invalid_smiles")
if mol is None:
print("Failed to parse SMILES")
Converting molecules to SMILES:
smiles = dm.to_smiles(mol)
smiles = dm.to_smiles(mol, isomeric=True)
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
Standardization and sanitization (always recommend for user-provided molecules):
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(
mol,
disconnect_metals=True,
normalize=True,
reionize=True
)
clean_smiles = dm.standardize_smiles(smiles)
2. Reading and Writing Molecular Files
Refer to references/io_module.md for comprehensive I/O documentation.
Reading files:
df = dm.read_sdf("compounds.sdf", mol_column='mol')
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
df = dm.open_df("file.sdf")
dm.save_df(df, "output.parquet")
Writing files:
dm.to_sdf(mols, "output.sdf")
dm.to_sdf(df, "output.sdf", mol_column="mol")
dm.to_smi(mols, "output.smi")
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
Remote file support (S3, GCS, HTTP via fsspec):
Only use cloud paths when the user explicitly requests them. Confirm the destination before writing.
df = dm.read_sdf("s3://bucket/compounds.sdf")
df = dm.read_csv("https://example.com/data.csv")
dm.to_sdf(mols, "s3://bucket/output.sdf")
Cloud backends read credentials from the standard provider environment (for example AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, or GOOGLE_APPLICATION_CREDENTIALS). Datamol passes these to fsspec locally; it does not collect or transmit environment variables to third-party endpoints. Scope credential access to the named provider variables only.
3. Molecular Descriptors and Properties
Refer to references/descriptors_viz.md for detailed descriptor documentation.
Computing descriptors for a single molecule:
descriptors = dm.descriptors.compute_many_descriptors(mol)
Batch descriptor computation (recommended for datasets):
desc_df = dm.descriptors.batch_compute_many_descriptors(
mols,
n_jobs=-1,
progress=True
)
Specific descriptors:
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
n_rigid = dm.descriptors.n_rigid_bonds(mol)
Drug-likeness filtering (Lipinski's Rule of Five):
def is_druglike(mol):
desc = dm.descriptors.compute_many_descriptors(mol)
return (
desc['mw'] <= 500 and
desc['logp'] <= 5 and
desc['hbd'] <= 5 and
desc['hba'] <= 10
)
druglike_mols = [mol for mol in mols if is_druglike(mol)]
4. Molecular Fingerprints and Similarity
Generating fingerprints:
Datamol defaults to ECFP6 (radius=3, n_bits=2048). Pass radius=2 explicitly for ECFP4.
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
fp_rdkit = dm.to_fp(mol, fp_type='rdkit')
Similarity calculations:
distance_matrix = dm.pdist(mols, n_jobs=-1)
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
from scipy.spatial.distance import squareform
dist_matrix = squareform(dm.pdist(mols))
5. Clustering and Diversity Selection
Refer to references/core_api.md for clustering details.
Butina clustering:
clusters = dm.cluster_mols(
mols,
cutoff=0.2,
n_jobs=-1
)
for i, cluster in enumerate(clusters):
print(f"Cluster {i}: {len(cluster)} molecules")
cluster_mols = [mols[idx] for idx in cluster]
Important: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.
Diversity selection:
diverse_mols = dm.pick_diverse(
mols,
npick=100
)
centroids = dm.pick_centroids(
mols,
npick=50
)
6. Scaffold Analysis
Refer to references/fragments_scaffolds.md for complete scaffold documentation.
Extracting Murcko scaffolds:
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
Scaffold-based analysis:
from collections import Counter
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
if scaf_smi not in scaffold_groups:
scaffold_groups[scaf_smi] = []
scaffold_groups[scaf_smi].append(mol)
Scaffold-based train/test splitting (for ML):
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
if scaf not in scaffold_to_mols:
scaffold_to_mols[scaf] = []
scaffold_to_mols[scaf].append(mol)
import random
scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
7. Molecular Fragmentation
Refer to references/fragments_scaffolds.md for fragmentation details.
BRICS fragmentation (16 bond types):
fragments = dm.fragment.brics(mol)
RECAP fragmentation (11 bond types):
fragments = dm.fragment.recap(mol)
Fragment analysis:
from collections import Counter
all_fragments = []
for mol in mols:
frags = dm.fragment.brics(mol)
all_fragments.extend(frags)
fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)
def fragment_score(mol, reference_fragments):
mol_frags = dm.fragment.brics(mol)
overlap = mol_frags.intersection(reference_fragments)
return len(overlap) / len(mol_frags) if mol_frags else 0
8. 3D Conformer Generation
Refer to references/conformers_module.md for detailed conformer documentation.
Generating conformers:
mol_3d = dm.conformers.generate(
mol,
n_confs=50,
rms_cutoff=0.5,
minimize_energy=True,
method='ETKDGv3'
)
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0)
positions = conf.GetPositions()
Conformer clustering:
clusters = dm.conformers.cluster(
mol_3d,
rms_cutoff=1.0,
centroids=False
)
centroids = dm.conformers.return_centroids(mol_3d, clusters)
SASA calculation:
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
9. Visualization
Refer to references/descriptors_viz.md for visualization documentation.
Basic molecule grid:
dm.viz.to_image(
mols[:20],
legends=[dm.to_smiles(m) for m in mols[:20]],
n_cols=5,
mol_size=(300, 300)
)
dm.viz.to_image(mols, outfile="molecules.png")
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
Aligned visualization (for SAR analysis):
dm.viz.to_image(
similar_mols,
align=True,
legends=activity_labels,
n_cols=4
)
Highlighting substructures:
dm.viz.to_image(
mol,
highlight_atom=[0, 1, 2, 3],
highlight_bond=[0, 1, 2]
)
Conformer visualization:
dm.viz.conformers(
mol_3d,
n_confs=10,
align_conf=True,
n_cols=3
)
10. Chemical Reactions
Refer to references/reactions_data.md for reactions documentation.
Applying reactions:
from rdkit.Chem import rdChemReactions
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
reactant = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(
rxn,
(reactant,),
sanitize=True
)
product_smiles = dm.to_smiles(product)
Batch reaction application:
products = []
for mol in reactant_mols:
try:
prod = dm.reactions.apply_reaction(rxn, (mol,))
if prod is not None:
products.append(prod)
except Exception as e:
print(f"Reaction failed: {e}")
Parallelization
Datamol includes built-in parallelization for many operations. Use n_jobs parameter:
n_jobs=1: Sequential (no parallelization)
n_jobs=-1: Use all available CPU cores
n_jobs=4: Use 4 cores
Functions supporting parallelization:
dm.read_sdf(..., n_jobs=-1)
dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)
dm.cluster_mols(..., n_jobs=-1)
dm.pdist(..., n_jobs=-1)
dm.conformers.sasa(..., n_jobs=-1)
Progress bars: Many batch operations support progress=True parameter.
Common Workflows and Patterns
Complete Pipeline: Data Loading → Filtering → Analysis
import datamol as dm
import pandas as pd
df = dm.read_sdf("compounds.sdf")
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()]
desc_df = dm.descriptors.batch_compute_many_descriptors(
df['mol'].tolist(),
n_jobs=-1,
progress=True
)
druglike = (
(desc_df['mw'] <= 500) &
(desc_df['logp'] <= 5) &
(desc_df['hbd'] <= 5) &
(desc_df['hba'] <= 10)
)
filtered_df = df[druglike]
diverse_mols = dm.pick_diverse(
filtered_df['mol'].tolist(),
npick=100
)
dm.viz.to_image(
diverse_mols,
legends=[dm.to_smiles(m) for m in diverse_mols],
outfile="diverse_compounds.png",
n_cols=10
)
Structure-Activity Relationship (SAR) Analysis
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
sar_df = pd.DataFrame({
'mol': mols,
'scaffold': scaffold_smiles,
'activity': activities
})
for scaffold, group in sar_df.groupby('scaffold'):
if len(group) >= 3:
print(f"\nScaffold: {scaffold}")
print(f"Count: {len(group)}")
print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
dm.viz.to_image(
group['mol'].tolist(),
legends=[f"Activity: {act:.2f}" for act in group['activity']],
align=True
)
Virtual Screening Pipeline
import numpy as np
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances
top_indices = np.argsort(similarities)[::-1][:100]
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]
dm.viz.to_image(
top_hits[:20],
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
outfile="screening_hits.png"
)
Reference Documentation
For detailed API documentation, consult these reference files:
references/core_api.md: Core namespace functions (conversions, standardization, fingerprints, clustering)
references/io_module.md: File I/O operations (read/write SDF, CSV, Excel, remote files)
references/conformers_module.md: 3D conformer generation, clustering, SASA calculations
references/descriptors_viz.md: Molecular descriptors and visualization functions
references/fragments_scaffolds.md: Scaffold extraction, BRICS/RECAP fragmentation
references/reactions_data.md: Chemical reactions and toy datasets
Best Practices
-
Always standardize molecules from external sources:
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
-
Check for None values after molecule parsing:
mol = dm.to_mol(smiles)
if mol is None:
-
Use parallel processing for large datasets:
result = dm.operation(..., n_jobs=-1, progress=True)
-
Use cloud I/O only when requested — confirm remote write paths; install s3fs/gcsfs as needed:
df = dm.read_sdf("s3://bucket/compounds.sdf")
-
Use appropriate fingerprints for similarity:
- ECFP (Morgan): General purpose, structural similarity
- MACCS: Fast, smaller feature space
- Atom pairs: Considers atom pairs and distances
-
Consider scale limitations:
- Butina clustering: ~1,000 molecules (full distance matrix)
- For larger datasets: Use diversity selection or hierarchical methods
-
Scaffold splitting for ML: Ensure proper train/test separation by scaffold
-
Align molecules when visualizing SAR series
Error Handling
def safe_to_mol(smiles):
try:
mol = dm.to_mol(smiles)
if mol is not None:
mol = dm.standardize_mol(mol)
return mol
except Exception as e:
print(f"Failed to process {smiles}: {e}")
return None
valid_mols = []
for smiles in smiles_list:
mol = safe_to_mol(smiles)
if mol is not None:
valid_mols.append(mol)
Integration with Machine Learning
Datamol ships with scipy and scikit-learn as dependencies. Import them as normal PyPI packages — they are not scripts bundled in this skill.
import numpy as np
X = np.array([dm.to_fp(mol) for mol in mols])
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y_target)
predictions = model.predict(X_test)
Troubleshooting
Issue: Molecule parsing fails
- Solution: Use
dm.standardize_smiles() first or try dm.fix_mol()
Issue: Memory errors with clustering
- Solution: Use
dm.pick_diverse() instead of full clustering for large sets
Issue: Slow conformer generation
- Solution: Reduce
n_confs or increase rms_cutoff to generate fewer conformers
Issue: Remote file access fails
- Solution: Install the matching fsspec backend (
uv pip install s3fs or gcsfs) and verify only the provider credentials needed for that backend are set (see Remote file support above)
Additional Resources