| name | text-based-molecule-editing |
| description | Modify molecules based on natural language descriptions using MolT5/BioT5 models.
Use this skill when:
(1) User wants to modify a molecule to improve specific properties (solubility, potency, etc.),
(2) User provides a molecule and asks to "make it more X" or "improve Y",
(3) User wants to generate molecule variants guided by text descriptions.
Triggers on phrases like "modify this molecule", "edit the molecule", "make it more soluble",
"improve drug-likeness", "change the molecule to", "optimize this compound".
|
| license | MIT |
| category | drug-discovery |
| tags | ["molecule-editing","text-guided","molecular-optimization","de-novo-design"] |
Text-Based Molecule Editing
Modify molecular structures guided by natural language property descriptions.
When to Use
- User wants to optimize a molecule for specific properties (solubility, binding, drug-likeness)
- User provides a molecule and requests property-based modifications
- User wants to explore structural variants guided by text descriptions
Workflow
Step 1: Prepare Input Molecule
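from open_biomed.data import Molecule
from open_biomed.tools.tool_registry import TOOLS
# Option A: look up the molecule by name
tool = TOOLS["molecule_name_request"]
result, _ = tool.run(accession="aspirin")
molecule = result[0]
# Option B: build the molecule directly from a SMILES string
# (this assignment replaces the result of Option A if both are run)
molecule = Molecule.from_smiles("CC(=O)Oc1ccccc1C(=O)O")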
Step 2: Calculate Baseline Properties (Optional)
qed_tool = TOOLS["molecule_qed"]
logp_tool = TOOLS["molecule_logp"]
sa_tool = TOOLS["molecule_sa"]
qed, _ = qed_tool.run(molecule=molecule)
logp, _ = logp_tool.run(molecule=molecule)
sa, _ = sa_tool.run(molecule=molecule)
Step 3: Run Text-Based Editing
from open_biomed.core.pipeline import InferencePipeline
from open_biomed.data import Text
pipeline = InferencePipeline(
    task="text_based_molecule_editing",
    model="molt5",
    model_ckpt="./checkpoints/server/text_based_molecule_editing_biot5.ckpt",
    device="cuda:0",
)
outputs = pipeline.run(
    molecule=molecule,
    text=Text.from_str("This molecule should be more soluble in water"),
)
edited_molecule = outputs[0][0]
Step 4: Compare Properties
qed_new, _ = qed_tool.run(molecule=edited_molecule)
logp_new, _ = logp_tool.run(molecule=edited_molecule)
print(f"Original SMILES: {molecule.smiles}")
print(f"Edited SMILES: {edited_molecule.smiles}")
print(f"LogP change: {logp[0]:.2f} → {logp_new[0]:.2f}")
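To report every property rather than LogP alone, the before/after values can be compared in one pass. This is a sketch under the assumption that the values have already been gathered into plain dictionaries; `summarize_changes` and `format_report` are hypothetical helper names, and the numbers are the aspirin values from the Example section of this document.

```python
from typing import Dict


def summarize_changes(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before for every property present in both dicts, rounded to 2 decimals."""
    return {name: round(after[name] - before[name], 2) for name in before if name in after}


def format_report(before: Dict[str, float], after: Dict[str, float]) -> str:
    """Build one 'name: old → new (signed delta)' line per shared property."""
    lines = []
    for name, delta in summarize_changes(before, after).items():
        lines.append(f"{name}: {before[name]:.2f} → {after[name]:.2f} ({delta:+.2f})")
    return "\n".join(lines)


if __name__ == "__main__":
    before = {"LogP": 1.31, "QED": 0.55, "SA": 1.58}
    after = {"LogP": 1.01, "QED": 0.59, "SA": 1.81}
    print(format_report(before, after))
```

A negative LogP delta points toward higher water solubility, while a positive SA delta points toward a harder synthesis, so the sign has to be read per property.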
Expected Outputs
| Step | Output | Description |
|---|---|---|
| Step 1 | Molecule object | Input molecule with SMILES |
| Step 2 | float values | QED (0-1), LogP, SA scores |
| Step 3 | Molecule object | Edited molecule with new structure |
| Step 4 | Comparison | Before/after property summary |
Interpretation Guide
LogP (Lipophilicity)
| Value | Solubility | Interpretation |
|---|---|---|
| < 0 | High water solubility | Very hydrophilic |
| 0-2 | Moderate | Good balance for oral drugs |
| 2-5 | Low water solubility | May need formulation help |
| > 5 | Very lipophilic | Poor absorption likely |
QED (Quantitative Estimate of Drug-likeness)
| Value | Quality | Interpretation |
|---|---|---|
| > 0.7 | Excellent | Highly drug-like |
| 0.5-0.7 | Good | Acceptable drug-likeness |
| 0.3-0.5 | Moderate | May need optimization |
| < 0.3 | Poor | Significant liabilities |
SA (Synthetic Accessibility)
| Value | Difficulty | Interpretation |
|---|---|---|
| 1-3 | Easy | Straightforward synthesis |
| 3-5 | Moderate | Some challenges |
| 5-7 | Difficult | Complex synthesis needed |
| > 7 | Very difficult | Likely impractical |
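The three tables above can be encoded as small lookup functions so that a report labels each score automatically. This is a minimal sketch: the function names are hypothetical, and because the tables list shared boundaries twice (for example 2 appears in both "0-2" and "2-5"), the code assumes a boundary value belongs to the band that starts at that value.

```python
def interpret_logp(value: float) -> str:
    """Map a LogP value to the Solubility column of the LogP table."""
    if value < 0:
        return "High water solubility"
    if value < 2:
        return "Moderate"
    if value <= 5:
        return "Low water solubility"
    return "Very lipophilic"


def interpret_qed(value: float) -> str:
    """Map a QED value (0-1) to the Quality column of the QED table."""
    if value > 0.7:
        return "Excellent"
    if value >= 0.5:
        return "Good"
    if value >= 0.3:
        return "Moderate"
    return "Poor"


def interpret_sa(value: float) -> str:
    """Map an SA score to the Difficulty column of the SA table."""
    if value > 7:
        return "Very difficult"
    if value >= 5:
        return "Difficult"
    if value >= 3:
        return "Moderate"
    return "Easy"


if __name__ == "__main__":
    # Aspirin values taken from the Example section of this document.
    print(interpret_logp(1.31), interpret_qed(0.55), interpret_sa(1.58))
```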
Error Handling
Model Checkpoint Not Found
Symptom: FileNotFoundError for checkpoint file
Solution: Ensure checkpoint exists at ./checkpoints/server/text_based_molecule_editing_biot5.ckpt
import os
ckpt_path = "./checkpoints/server/text_based_molecule_editing_biot5.ckpt"
# Fail early with a clear message if the checkpoint has not been downloaded
if not os.path.exists(ckpt_path):
    raise FileNotFoundError(f"Download checkpoint to: {ckpt_path}")
Invalid SMILES Output
Symptom: Model generates invalid SMILES string
Solution: The model returns None for invalid molecules. Try:
- Rephrasing the edit prompt
- Using beam search with more beams
- Running multiple times for different outputs
CUDA Out of Memory
Symptom: RuntimeError: CUDA out of memory
Solution: Run on CPU (shown here) or use a smaller batch:
pipeline = InferencePipeline(
    task="text_based_molecule_editing",
    model="molt5",
    model_ckpt="./checkpoints/server/text_based_molecule_editing_biot5.ckpt",
    device="cpu",
)
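Rather than hard-coding the device, it can be chosen at startup. This is a minimal sketch, assuming PyTorch is the backend (the `cuda:0` device string suggests it); `pick_device` is a hypothetical helper name. It only checks whether a GPU is visible and does not recover from an out-of-memory error raised during inference.

```python
def pick_device(preferred: str = "cuda:0") -> str:
    """Return the preferred device when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works where PyTorch is absent
    except ImportError:
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can then be passed as the `device` argument when constructing the pipeline.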
Example
Input: aspirin
Prompt: "This molecule should be more soluble in water"
Original SMILES: CC(=O)Oc1ccccc1C(=O)O
Edited SMILES: CC(=O)Oc1ccc(C(=O)O)cc1C(=O)O
Property Changes:
LogP: 1.31 → 1.01 (-0.30, more soluble)
QED: 0.55 → 0.59 (+0.04, better drug-likeness)
SA: 1.58 → 1.81 (+0.23, slightly harder to synthesize)
See Also
examples/basic_example.py - Full runnable example script
examples/solubility_optimization.py - Solubility-focused workflow
references/troubleshooting.md - Detailed error handling
references/advanced.md - Advanced prompt engineering tips