| name | rdkit |
| description | Cheminformatics toolkit for fine-grained molecular control. SMILES/SDF parsing, descriptors (MW, LogP, TPSA), fingerprints, substructure search, 2D/3D generation, similarity, reactions. For standard workflows with simpler interface, use datamol (wrapper around RDKit). Use rdkit for advanced control, custom sanitization, specialized algorithms. |
| license | BSD-3-Clause license |
| metadata | {"skill-author":"K-Dense Inc."} |
RDKit Cheminformatics Toolkit
Overview
RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks.
Core Capabilities
1. Molecular I/O and Creation
Reading Molecules:
Read molecular structures from various formats:
from rdkit import Chem
mol = Chem.MolFromSmiles('Cc1ccccc1')
mol = Chem.MolFromMolFile('path/to/file.mol')
mol = Chem.MolFromMolBlock(mol_block_string)
mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H')
Writing Molecules:
Convert molecules to text representations:
smiles = Chem.MolToSmiles(mol)
mol_block = Chem.MolToMolBlock(mol)
inchi = Chem.MolToInchi(mol)
Batch Processing:
For processing multiple molecules, use Supplier/Writer objects:
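import gzip

# SDF files: records that cannot be parsed come back as None
suppl = Chem.SDMolSupplier('molecules.sdf')
for mol in suppl:
    if mol is not None:
        pass  # process the molecule here

# SMILES files (one molecule per line, no header row)
suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False)

# Compressed or very large files: forward-only supplier on a binary file object
with gzip.open('molecules.sdf.gz') as f:
    suppl = Chem.ForwardSDMolSupplier(f)
    for mol in suppl:
        pass  # process the molecule here (check for None first)

# Multithreaded reading (needs an RDKit build with multithreading support)
suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf')

# Writing molecules to an SDF file
writer = Chem.SDWriter('output.sdf')
for mol in molecules:
    writer.write(mol)
writer.close()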
Important Notes:
- All MolFrom* functions return None on failure and log an error message
- Always check for None before processing molecules
- Molecules are automatically sanitized on import (validates valence, perceives aromaticity)
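The None check described in these notes can be wrapped in a small helper that separates parsable inputs from rejected ones. This is a minimal sketch: the name `parse_smiles` and the sample SMILES strings are illustrative and not part of RDKit.

```python
from rdkit import Chem

def parse_smiles(smiles_list):
    """Split SMILES strings into parsed molecules and rejected inputs."""
    parsed, failed = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # None if parsing or sanitization fails
        if mol is None:
            failed.append(smi)
        else:
            parsed.append(mol)
    return parsed, failed

# 'C1CC' opens a ring bond that is never closed, so it is rejected
mols, failed = parse_smiles(['CCO', 'c1ccccc1', 'C1CC'])
print(len(mols), failed)
```

Keeping the rejected strings makes it possible to report or repair them later instead of silently dropping them.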
2. Molecular Sanitization and Validation
RDKit automatically sanitizes molecules during parsing, executing 13 steps including valence checking, aromaticity perception, and chirality assignment.
Sanitization Control:
# Parse without sanitization, then sanitize explicitly
mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
Chem.SanitizeMol(mol)

# List chemistry problems without raising an exception
problems = Chem.DetectChemistryProblems(mol)
for problem in problems:
    print(problem.GetType(), problem.Message())

# Partial sanitization: run every step except the property (valence) check
Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES)
Common Sanitization Issues:
- Atoms with explicit valence exceeding maximum allowed will raise exceptions
- Invalid aromatic rings will cause kekulization errors
- Radical electrons may not be properly assigned without explicit specification
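As an illustration of the first issue, the sketch below parses a deliberately invalid molecule without sanitization and asks RDKit to report what is wrong. The input SMILES (a neutral nitrogen with four bonds) is chosen only for demonstration.

```python
from rdkit import Chem

# Neutral nitrogen carrying four bonds: an invalid valence (illustrative input)
mol = Chem.MolFromSmiles('CN(C)(C)C', sanitize=False)

# Collect problems instead of letting sanitization raise on the first one
problems = Chem.DetectChemistryProblems(mol)
problem_types = [p.GetType() for p in problems]
for p in problems:
    print(p.GetType(), p.Message())
```

Inspecting the problem list first is useful when deciding whether a structure can be repaired or should be discarded.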
3. Molecular Analysis and Properties
Accessing Molecular Structure:
# Iterate over atoms and bonds
for atom in mol.GetAtoms():
    print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree())
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())

# Ring information
ring_info = mol.GetRingInfo()
ring_info.NumRings()
ring_info.AtomRings()  # one tuple of atom indices per ring

# Ring membership of a single atom
atom = mol.GetAtomWithIdx(0)
atom.IsInRing()
atom.IsInRingSize(6)

# Symmetrized smallest set of smallest rings (SSSR)
from rdkit.Chem import GetSymmSSSR
rings = GetSymmSSSR(mol)
Stereochemistry:
from rdkit.Chem import FindMolChiralCenters
chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True)
from rdkit.Chem import AssignStereochemistryFrom3D
AssignStereochemistryFrom3D(mol)
bond = mol.GetBondWithIdx(0)
stereo = bond.GetStereo()
Fragment Analysis:
frags = Chem.GetMolFrags(mol, asMols=True)
from rdkit.Chem import FragmentOnBonds
frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2])
from rdkit.Chem.Scaffolds import MurckoScaffold
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
4. Molecular Descriptors and Properties
Basic Descriptors:
from rdkit.Chem import Descriptors
mw = Descriptors.MolWt(mol)
exact_mw = Descriptors.ExactMolWt(mol)
logp = Descriptors.MolLogP(mol)
tpsa = Descriptors.TPSA(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
rot_bonds = Descriptors.NumRotatableBonds(mol)
aromatic_rings = Descriptors.NumAromaticRings(mol)
Batch Descriptor Calculation:
all_descriptors = Descriptors.CalcMolDescriptors(mol)
descriptor_names = [desc[0] for desc in Descriptors._descList]
Lipinski's Rule of Five:
mw_ok = Descriptors.MolWt(mol) <= 500
logp_ok = Descriptors.MolLogP(mol) <= 5
hbd_ok = Descriptors.NumHDonors(mol) <= 5
hba_ok = Descriptors.NumHAcceptors(mol) <= 10
is_drug_like = mw_ok and logp_ok and hbd_ok and hba_ok
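The check above requires all four criteria to hold. The rule of five is commonly applied with a tolerance of one violation, which can be sketched as a violation count. The helpers below are hypothetical, work on plain numbers, and assume the descriptor values were already computed with the `Descriptors` calls shown earlier.

```python
def count_lipinski_violations(mw, logp, hbd, hba):
    """Count how many of the four Rule-of-Five thresholds are exceeded."""
    checks = [mw > 500, logp > 5, hbd > 5, hba > 10]
    return sum(checks)

def passes_rule_of_five(mw, logp, hbd, hba, max_violations=1):
    """True if the number of violations does not exceed max_violations."""
    return count_lipinski_violations(mw, logp, hbd, hba) <= max_violations

# Hypothetical descriptor values, not taken from a real compound
one_violation = passes_rule_of_five(mw=520.0, logp=3.2, hbd=2, hba=8)
two_violations = passes_rule_of_five(mw=620.0, logp=6.1, hbd=2, hba=8)
print(one_violation, two_violations)
```

Setting `max_violations=0` reproduces the strict all-criteria behaviour.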
5. Fingerprints and Molecular Similarity
Fingerprint Types:
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import MACCSkeys
rdk_gen = rdFingerprintGenerator.GetRDKitFPGenerator(minPath=1, maxPath=7, fpSize=2048)
fp = rdk_gen.GetFingerprint(mol)
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = morgan_gen.GetFingerprint(mol)
fp_count = morgan_gen.GetCountFingerprint(mol)
fp = MACCSkeys.GenMACCSKeys(mol)
ap_gen = rdFingerprintGenerator.GetAtomPairGenerator()
fp = ap_gen.GetFingerprint(mol)
tt_gen = rdFingerprintGenerator.GetTopologicalTorsionGenerator()
fp = tt_gen.GetFingerprint(mol)
from rdkit.Avalon import pyAvalonTools
fp = pyAvalonTools.GetAvalonFP(mol)
Similarity Calculation:
from rdkit import DataStructs
from rdkit.Chem import rdFingerprintGenerator
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp1 = mfpgen.GetFingerprint(mol1)
fp2 = mfpgen.GetFingerprint(mol2)
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
fps = [mfpgen.GetFingerprint(m) for m in [mol2, mol3, mol4]]
similarities = DataStructs.BulkTanimotoSimilarity(fp1, fps)
dice = DataStructs.DiceSimilarity(fp1, fp2)
cosine = DataStructs.CosineSimilarity(fp1, fp2)
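For reference, the three metrics can be written out for bit vectors represented as sets of on-bit indices. This is a plain-Python sketch of the standard formulas, not RDKit's implementation; returning 0.0 for empty inputs is a choice made here, and the bit sets are made up.

```python
import math

def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of on-bit indices."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|)."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

bits1 = {1, 5, 9, 12}
bits2 = {5, 9, 12, 30, 41}
print(tanimoto(bits1, bits2), dice(bits1, bits2), cosine(bits1, bits2))
```

With three shared bits out of six distinct bits, the Tanimoto value is 0.5, while Dice and cosine weight the shared bits more heavily and come out higher.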
Clustering and Diversity:
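from rdkit import DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [mfpgen.GetFingerprint(mol) for mol in mols]

# Flattened lower-triangle distance matrix (distance = 1 - Tanimoto)
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend([1 - sim for sim in sims])

# Cluster with a distance cutoff of 0.3
clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True)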
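The distance list built by the loop above is a flattened lower triangle: row i contributes the distances to items 0..i-1. The sketch below shows how to locate the entry for a given pair. Both helpers are hypothetical, and plain numbers stand in for fingerprints.

```python
def condensed_index(i, j):
    """Position of the (i, j) distance in the flattened lower triangle."""
    if i == j:
        raise ValueError("diagonal entries are not stored")
    if i < j:
        i, j = j, i  # the list only stores pairs with j < i
    return i * (i - 1) // 2 + j

def build_condensed(n, dist):
    """Flatten dist(i, j) for all j < i, row by row."""
    return [dist(i, j) for i in range(1, n) for j in range(i)]

# Toy data: points on a line, distance = absolute difference
points = [0.0, 1.0, 3.0, 7.0]
dists = build_condensed(len(points), lambda i, j: abs(points[i] - points[j]))
print(dists)
print(dists[condensed_index(3, 1)])
```

For n items the list holds n(n-1)/2 values, which is worth keeping in mind before clustering very large sets.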
6. Substructure Searching and SMARTS
Basic Substructure Matching:
query = Chem.MolFromSmarts('[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1')
has_match = mol.HasSubstructMatch(query)
matches = mol.GetSubstructMatches(query)
match = mol.GetSubstructMatch(query)
Common SMARTS Patterns:
primary_alcohol = Chem.MolFromSmarts('[CH2][OH1]')
carboxylic_acid = Chem.MolFromSmarts('C(=O)[OH]')
amide = Chem.MolFromSmarts('C(=O)N')
aromatic_n = Chem.MolFromSmarts('[nR]')
macrocycle = Chem.MolFromSmarts('[r{12-}]')
Matching Rules:
- Unspecified properties in query match any value in target
- Hydrogens are ignored unless explicitly specified
- Charged query atom won't match uncharged target atom
- Aromatic query atom won't match aliphatic target atom (unless query is generic)
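These rules can be checked directly on small molecules. The sketch below is illustrative only: the helper `matches` and the dictionary labels are made up for this example.

```python
from rdkit import Chem

def matches(smiles, smarts):
    """True if the molecule given by `smiles` contains the SMARTS pattern."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return mol.HasSubstructMatch(query)

results = {
    # Charge is unspecified in the query, so a charged nitrogen still matches
    'unspecified charge vs charged N': matches('C[NH3+]', 'N'),
    # A charged query atom does not match a neutral nitrogen
    'charged query vs neutral N': matches('CN', '[N+]'),
    # An aromatic query atom does not match an aliphatic ring carbon
    'aromatic query vs cyclohexane': matches('C1CCCCC1', 'c'),
    # A generic carbon query matches aliphatic and aromatic carbons alike
    'generic carbon vs cyclohexane': matches('C1CCCCC1', '[#6]'),
    'generic carbon vs benzene': matches('c1ccccc1', '[#6]'),
}
for label, value in results.items():
    print(label, value)
```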
7. Chemical Reactions
Reaction SMARTS:
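from rdkit.Chem import AllChem

# Define a reaction: convert a C=O double bond to a single bond
rxn = AllChem.ReactionFromSmarts('[C:1]=[O:2]>>[C:1][O:2]')

# Reactants are passed as a tuple, one entry per reactant template
reactants = (mol1,)
products = rxn.RunReactants(reactants)

# Products come back unsanitized, one tuple of molecules per match
for product_set in products:
    for product in product_set:
        Chem.SanitizeMol(product)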
Reaction Features:
- Atom mapping preserves specific atoms between reactants and products
- Dummy atoms in products are replaced by corresponding reactant atoms
- "Any" bonds inherit bond order from reactants
- Chirality preserved unless explicitly changed
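To see atom mapping in action, the sketch below applies the carbonyl transformation from this section to acetone. The choice of acetone is an assumption for illustration. The methyl groups are not mentioned in the template, yet they are carried over because they are attached to a mapped atom.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

rxn = AllChem.ReactionFromSmarts('[C:1]=[O:2]>>[C:1][O:2]')
acetone = Chem.MolFromSmiles('CC(=O)C')

# One product tuple is returned per match of the reactant template
product_sets = rxn.RunReactants((acetone,))

products = []
for product_set in product_sets:
    for product in product_set:
        Chem.SanitizeMol(product)  # products are not sanitized automatically
        products.append(product)
        print(Chem.MolToSmiles(product))
```

Sanitizing each product recomputes implicit hydrogens, which is why the oxygen ends up as a hydroxyl without any hydrogen being named in the template.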
Reaction Similarity:
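# Difference fingerprints for two reactions (rxn1, rxn2), then their similarity
fp1 = AllChem.CreateDifferenceFingerprintForReaction(rxn1)
fp2 = AllChem.CreateDifferenceFingerprintForReaction(rxn2)
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)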
8. 2D and 3D Coordinate Generation
2D Coordinate Generation:
from rdkit.Chem import AllChem
AllChem.Compute2DCoords(mol)
template = Chem.MolFromSmiles('c1ccccc1')
AllChem.Compute2DCoords(template)
AllChem.GenerateDepictionMatching2DStructure(mol, template)
3D Coordinate Generation and Conformers:
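# Add explicit hydrogens first; embedding and force fields need them
mol = Chem.AddHs(mol)

# Single conformer, or several at once
AllChem.EmbedMolecule(mol, randomSeed=42)
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)

# Force-field optimization (UFF or MMFF)
AllChem.UFFOptimizeMolecule(mol)
AllChem.MMFFOptimizeMolecule(mol)

# Optimize each conformer individually
for conf_id in conf_ids:
    AllChem.MMFFOptimizeMolecule(mol, confId=conf_id)

# RMSD between two conformers, and alignment of one molecule onto another
rms = AllChem.GetConformerRMS(mol, conf_id1, conf_id2)
AllChem.AlignMol(probe_mol, ref_mol)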
Constrained Embedding:
AllChem.ConstrainedEmbed(mol, core_mol)
9. Molecular Visualization
Basic Drawing:
from rdkit.Chem import Draw
img = Draw.MolToImage(mol, size=(300, 300))
img.save('molecule.png')
Draw.MolToFile(mol, 'molecule.png')
mols = [mol1, mol2, mol3, mol4]
img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))
Highlighting Substructures:
query = Chem.MolFromSmarts('c1ccccc1')
match = mol.GetSubstructMatch(query)
img = Draw.MolToImage(mol, highlightAtoms=match)
highlight_colors = {atom_idx: (1, 0, 0) for atom_idx in match}
img = Draw.MolToImage(mol, highlightAtoms=match,
highlightAtomColors=highlight_colors)
Customizing Visualization:
from rdkit.Chem.Draw import rdMolDraw2D
drawer = rdMolDraw2D.MolDraw2DCairo(300, 300)
opts = drawer.drawOptions()
opts.addAtomIndices = True
opts.addStereoAnnotation = True
opts.bondLineWidth = 2
drawer.DrawMolecule(mol)
drawer.FinishDrawing()
with open('molecule.png', 'wb') as f:
    f.write(drawer.GetDrawingText())
Jupyter Notebook Integration:
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG = True
IPythonConsole.molSize = (300, 300)
mol
Visualizing Fingerprint Bits:
from rdkit.Chem import AllChem, Draw
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bit_info)
bit_id = next(iter(bit_info))  # any bit that is set for this molecule
img = Draw.DrawMorganBit(mol, bit_id, bit_info)
10. Molecular Modification
Adding/Removing Hydrogens:
mol_h = Chem.AddHs(mol)
mol = Chem.RemoveHs(mol_h)
Kekulization and Aromaticity:
Chem.Kekulize(mol)
Chem.SetAromaticity(mol)
Replacing Substructures:
query = Chem.MolFromSmarts('c1ccccc1')
replacement = Chem.MolFromSmiles('C1CCCCC1')
new_mol = Chem.ReplaceSubstructs(mol, query, replacement)[0]
Neutralizing Charges:
from rdkit.Chem.MolStandardize import rdMolStandardize
uncharger = rdMolStandardize.Uncharger()
mol_neutral = uncharger.uncharge(mol)
11. Working with Molecular Hashes and Standardization
Molecular Hashing:
from rdkit.Chem import rdMolHash
scaffold_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.MurckoScaffold)
canonical_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.CanonicalSmiles)
regio_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.Regioisomer)
Randomized SMILES:
from rdkit.Chem import MolToRandomSmilesVect
random_smiles = MolToRandomSmilesVect(mol, numSmiles=10, randomSeed=42)
12. Pharmacophore and 3D Features
Pharmacophore Features:
from rdkit.Chem import ChemicalFeatures
from rdkit import RDConfig
import os
fdef_path = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef')
factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)
features = factory.GetFeaturesForMol(mol)
for feat in features:
    print(feat.GetFamily(), feat.GetType(), feat.GetAtomIds())
Common Workflows
Drug-likeness Analysis
from rdkit import Chem
from rdkit.Chem import Descriptors

def analyze_druglikeness(smiles):
    """Return key descriptors plus a Lipinski flag, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    results = {
        'MW': Descriptors.MolWt(mol),
        'LogP': Descriptors.MolLogP(mol),
        'HBD': Descriptors.NumHDonors(mol),
        'HBA': Descriptors.NumHAcceptors(mol),
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Descriptors.NumRotatableBonds(mol)
    }
    results['Lipinski'] = (
        results['MW'] <= 500 and
        results['LogP'] <= 5 and
        results['HBD'] <= 5 and
        results['HBA'] <= 10
    )
    return results
Similarity Screening
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def similarity_screen(query_smiles, database_smiles, threshold=0.7):
    """Return (index, smiles, similarity) hits at or above threshold, best first."""
    mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
    query_mol = Chem.MolFromSmiles(query_smiles)
    if query_mol is None:
        raise ValueError(f"Could not parse query SMILES: {query_smiles}")
    query_fp = mfpgen.GetFingerprint(query_mol)
    hits = []
    for idx, smiles in enumerate(database_smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip entries that fail to parse
        fp = mfpgen.GetFingerprint(mol)
        sim = DataStructs.TanimotoSimilarity(query_fp, fp)
        if sim >= threshold:
            hits.append((idx, smiles, sim))
    return sorted(hits, key=lambda x: x[2], reverse=True)
Substructure Filtering
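from rdkit import Chem

def filter_by_substructure(smiles_list, pattern_smarts):
    """Return the SMILES whose molecules contain the SMARTS pattern."""
    query = Chem.MolFromSmarts(pattern_smarts)
    if query is None:
        raise ValueError(f"Invalid SMARTS: {pattern_smarts}")
    hits = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None and mol.HasSubstructMatch(query):
            hits.append(smiles)
    return hits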
Best Practices
Error Handling
Always check for None when parsing molecules:
for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Failed to parse: {smiles}")
        continue
    # process mol here
Performance Optimization
Use binary formats for storage:
import pickle

# Save parsed molecules in binary form
with open('molecules.pkl', 'wb') as f:
    pickle.dump(mols, f)

# Load them back without re-parsing
with open('molecules.pkl', 'rb') as f:
    mols = pickle.load(f)
Use bulk operations:
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [mfpgen.GetFingerprint(mol) for mol in mols]
similarities = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
Thread Safety
RDKit operations are generally thread-safe for:
- Molecule I/O (SMILES, mol blocks)
- Coordinate generation
- Fingerprinting and descriptors
- Substructure searching
- Reactions
- Drawing
Not thread-safe: MolSuppliers when accessed concurrently.
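One way to respect this split is to hand each worker plain strings and let it build its own molecule objects, so no supplier is shared. The sketch below uses the standard-library thread pool. The function name and inputs are illustrative, and the example shows the sharing pattern rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from rdkit import Chem

def canonicalize(smiles):
    """Parse one SMILES and write it back in canonical form (None on failure)."""
    mol = Chem.MolFromSmiles(smiles)  # each call creates its own Mol object
    return None if mol is None else Chem.MolToSmiles(mol)

smiles_list = ['OCC', 'C1=CC=CC=C1', 'C(C)(C)O']

# pool.map preserves input order, so results line up with smiles_list
with ThreadPoolExecutor(max_workers=4) as pool:
    canonical = list(pool.map(canonicalize, smiles_list))
print(canonical)
```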
Memory Management
For large datasets:
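# Stream molecules one at a time instead of loading the whole file
with open('large.sdf', 'rb') as f:  # the supplier needs a binary file object
    suppl = Chem.ForwardSDMolSupplier(f)
    for mol in suppl:
        pass  # process each molecule here

# Or read with several threads
suppl = Chem.MultithreadedSDMolSupplier('large.sdf', numWriterThreads=4)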
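When a streaming supplier feeds a step that works on groups of molecules (for example bulk similarity), fixed-size batches keep memory bounded. This is a standard-library sketch; `batched` is a hypothetical helper, and `range(7)` stands in for a supplier so the example runs without any input file.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# range(7) stands in for a forward-only molecule supplier
batches = list(batched(range(7), 3))
print(batches)
```

Only one batch is held in memory at a time when the generator is consumed in a loop rather than collected into a list.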
Common Pitfalls
- Forgetting to check for None: Always validate molecules after parsing
- Sanitization failures: Use DetectChemistryProblems() to debug
- Missing hydrogens: Use AddHs() when calculating properties that depend on hydrogen
- 2D vs 3D: Generate appropriate coordinates before visualization or 3D analysis
- SMARTS matching rules: Remember that unspecified properties match anything
- Thread safety with MolSuppliers: Don't share supplier objects across threads
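The hydrogen and coordinate pitfalls are easy to see on a small molecule. In the sketch below, ethanol is an arbitrary example. A molecule parsed from SMILES carries only heavy atoms and no conformer until hydrogens are added and coordinates are generated.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CCO')
heavy_atom_count = mol.GetNumAtoms()      # hydrogens are implicit here
conformers_before = mol.GetNumConformers()

mol_h = Chem.AddHs(mol)                   # returns a new molecule with explicit H
all_atom_count = mol_h.GetNumAtoms()

# EmbedMolecule returns the new conformer ID, or -1 if embedding fails
status = AllChem.EmbedMolecule(mol_h, randomSeed=42)
print(heavy_atom_count, all_atom_count, status, mol_h.GetNumConformers())
```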
Resources
references/
This skill includes detailed API reference documentation:
- api_reference.md - Comprehensive listing of RDKit modules, functions, and classes organized by functionality
- descriptors_reference.md - Complete list of available molecular descriptors with descriptions
- smarts_patterns.md - Common SMARTS patterns for functional groups and structural features
Load these references when needing specific API details, parameter information, or pattern examples.
scripts/
Example scripts for common RDKit workflows:
- molecular_properties.py - Calculate comprehensive molecular properties and descriptors
- similarity_search.py - Perform fingerprint-based similarity screening
- substructure_filter.py - Filter molecules by substructure patterns
These scripts can be executed directly or used as templates for custom workflows.