| name | diffdock |
| description | Diffusion-based molecular docking. Predict protein-ligand binding poses from PDB/SMILES, confidence scores, virtual screening, for structure-based drug design. Not for affinity prediction. |
DiffDock is a diffusion-based deep learning tool for molecular docking. It predicts 3D binding poses of small-molecule ligands on protein targets, a core task in structure-based drug discovery and chemical biology.
Core Capabilities:
Key Distinction: DiffDock predicts binding poses (3D structure) and confidence (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
This skill should be used when:
Before proceeding with DiffDock tasks, verify the environment setup:
# Use the provided setup checker
python scripts/setup_check.py
This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
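As a rough illustration of what such a check involves, the sketch below reports which packages cannot be found in the current environment. It is not the contents of `scripts/setup_check.py`; the import names in `REQUIRED_MODULES` and the helper `find_missing` are assumptions made for this example.

```python
import importlib.util
import sys

# Assumed import names for the dependencies listed above; the real
# setup_check.py may check a different set of modules or versions.
REQUIRED_MODULES = ["torch", "torch_geometric", "rdkit", "esm"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found")
```

A check like this only confirms that the packages are importable; it does not confirm that CUDA is usable, which still needs `torch.cuda.is_available()`.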
Option 1: Conda (Recommended)
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
Option 2: Docker
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
Important Notes:
Use Case: Dock one ligand to one protein target
Input Requirements:
Command:
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)Oc1ccccc1C(=O)O" \
--out_dir results/single_docking/
Alternative (protein sequence):
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
--ligand ligand.sdf \
--out_dir results/sequence_docking/
Output Structure:
results/single_docking/
├── rank_1.sdf              # Top-ranked pose
├── rank_2.sdf              # Second-ranked pose
├── ...
├── rank_10.sdf             # 10th pose (default: 10 samples)
└── confidence_scores.txt
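When post-processing this directory, note that a plain alphabetical sort places `rank_10.sdf` before `rank_2.sdf`. The sketch below sorts pose files numerically. It assumes the `rank_N.sdf` naming shown above (file names in a given DiffDock version may differ), and `sort_pose_names` is a hypothetical helper, not part of DiffDock.

```python
import re

# Assumes the rank_N.sdf naming shown in the output listing above.
RANK_FILE = re.compile(r"rank_(\d+)\.sdf")


def sort_pose_names(names):
    """Return the pose file names from `names`, ordered by rank number."""
    ranked = []
    for name in names:
        match = RANK_FILE.fullmatch(name)
        if match:  # skip confidence_scores.txt and any other files
            ranked.append((int(match.group(1)), name))
    return [name for _, name in sorted(ranked)]
```

For a real run, the names could come from `os.listdir("results/single_docking/")`.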
Use Case: Dock multiple ligands to proteins, virtual screening campaigns
Step 1: Prepare Batch CSV
Use the provided script to create or validate batch input:
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv
# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
CSV Format:
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
Required Columns:
- complex_name: Unique identifier
- protein_path: PDB file path (leave empty if using sequence)
- ligand_description: SMILES string or ligand file path
- protein_sequence: Amino acid sequence (leave empty if using PDB)
Step 2: Run Batch Docking
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv batch_input.csv \
--out_dir results/batch/ \
--batch_size 10
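The column rules for the batch CSV can be checked before launching a long run. The sketch below is one possible reading of those rules (unique `complex_name`, a non-empty `ligand_description`, and at least one of `protein_path` or `protein_sequence`). It is not the logic of `scripts/prepare_batch_csv.py`, and `validate_batch_csv` is a hypothetical name.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "complex_name",
    "protein_path",
    "ligand_description",
    "protein_sequence",
]


def validate_batch_csv(text):
    """Return a list of human-readable problems found in batch CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen_names = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Short rows yield None for absent cells, so normalize to "".
        cell = {col: (row.get(col) or "").strip() for col in REQUIRED_COLUMNS}
        name = cell["complex_name"]
        if not name:
            problems.append(f"line {line_no}: empty complex_name")
        elif name in seen_names:
            problems.append(f"line {line_no}: duplicate complex_name '{name}'")
        seen_names.add(name)
        if not cell["ligand_description"]:
            problems.append(f"line {line_no}: empty ligand_description")
        if not cell["protein_path"] and not cell["protein_sequence"]:
            problems.append(f"line {line_no}: no protein_path or protein_sequence")
    return problems
```

An empty list means no problems were detected under these assumptions. The check does not verify that referenced files exist or that SMILES strings parse.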
For Large Virtual Screening (>100 compounds):
Pre-compute protein embeddings for faster processing:
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
# Run with pre-computed embeddings
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/screening/
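For a screening campaign against a single target, the input CSV can be generated from a list of SMILES strings. The sketch below uses the column layout from the CSV format above. The function name `build_screening_csv` and the `cmpd_0001` naming scheme are illustrative assumptions.

```python
import csv
import io


def build_screening_csv(protein_path, smiles_list, prefix="cmpd"):
    """Return batch CSV text pairing one protein file with many ligands."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(
        ["complex_name", "protein_path", "ligand_description", "protein_sequence"]
    )
    for index, smiles in enumerate(smiles_list, start=1):
        # protein_sequence stays empty because a PDB path is supplied.
        writer.writerow([f"{prefix}_{index:04d}", protein_path, smiles, ""])
    return buffer.getvalue()


if __name__ == "__main__":
    text = build_screening_csv("protein.pdb", ["CCO", "c1ccccc1"])
    with open("screening_input.csv", "w", newline="") as handle:
        handle.write(text)
```

Zero-padded names keep the output directories in a stable order when sorted alphabetically.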
After docking completes, analyze confidence scores and rank predictions:
# Analyze all results
python scripts/analyze_results.py results/batch/
# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5
# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0
# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv
# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
The analysis script:
Understanding Scores:
| Score Range | Confidence Level | Interpretation |
|---|---|---|
| > 0 | High | Strong prediction, likely accurate |
| -1.5 to 0 | Moderate | Reasonable prediction, validate carefully |
| < -1.5 | Low | Uncertain prediction, requires validation |
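The thresholds in this table translate directly into a small helper. The sketch below is for illustration only. `confidence_level` is a hypothetical function, not part of `analyze_results.py`, and it treats the boundary values 0 and -1.5 as Moderate, following the table's ranges.

```python
def confidence_level(score):
    """Map a DiffDock confidence score to the levels in the table above."""
    if score > 0:
        return "High"
    if score >= -1.5:
        return "Moderate"
    return "Low"


if __name__ == "__main__":
    for value in (0.8, -0.4, -2.3):
        print(value, confidence_level(value))
```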
Critical Notes:
For detailed guidance: Read references/confidence_and_limitations.md using the Read tool
Create custom configuration for specific use cases:
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml
# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
--config my_config.yaml \
--protein_ligand_csv input.csv \
--out_dir results/
Sampling Density:
- samples_per_complex: 10 → Increase to 20-40 for difficult cases
Inference Steps:
- inference_steps: 20 → Increase to 25-30 for higher accuracy
Temperature Parameters (control diversity):
- temp_sampling_tor: 7.04 → Increase for flexible ligands (8-10)
- temp_sampling_tor: 7.04 → Decrease for rigid ligands (5-6)
Presets Available in Template:
For complete parameter reference: Read references/parameters_reference.md using the Read tool
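When scripting several runs with different settings, it can help to assemble the command programmatically. The sketch below builds the argument list for the inference command used throughout this document. This document only shows `--samples_per_complex` passed on the command line, so treating other parameters such as `inference_steps` as `--key value` flags is an assumption. `build_inference_command` is a hypothetical helper.

```python
import shlex


def build_inference_command(config, csv_path, out_dir, overrides=None):
    """Return the inference command as a list of arguments."""
    args = [
        "python", "-m", "inference",
        "--config", config,
        "--protein_ligand_csv", csv_path,
        "--out_dir", out_dir,
    ]
    # Assumption: each override is accepted as a "--name value" flag.
    for key, value in (overrides or {}).items():
        args += [f"--{key}", str(value)]
    return args


if __name__ == "__main__":
    command = build_inference_command(
        "default_inference_args.yaml",
        "input.csv",
        "results/",
        {"samples_per_complex": 40, "inference_steps": 30},
    )
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with SMILES characters.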
For proteins with known flexibility, dock to multiple conformations:
# Create ensemble CSV
import pandas as pd
conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"
data = {
"complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
"protein_path": conformations,
"ligand_description": [ligand] * len(conformations),
"protein_sequence": [""] * len(conformations)
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
Run docking with increased sampling:
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--samples_per_complex 20 \
--out_dir results/ensemble/
DiffDock generates poses; combine with other tools for affinity:
GNINA (Fast neural network scoring):
for pose in results/*.sdf; do
gnina -r protein.pdb -l "$pose" --score_only
done
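To rank poses by the GNINA result, the printed scores need to be collected. The sketch below extracts values from captured GNINA output. It assumes the output contains lines of the form `Affinity: <number> (kcal/mol)`, `CNNscore: <number>`, and `CNNaffinity: <number>`. Check this against the GNINA version in use, since the format is not specified in this document. `parse_gnina_scores` is a hypothetical helper.

```python
import re

# Assumed output line format, e.g. "CNNscore: 0.8312".
SCORE_LINE = re.compile(
    r"^\s*(Affinity|CNNscore|CNNaffinity):\s*(-?\d+(?:\.\d+)?)",
    re.MULTILINE,
)


def parse_gnina_scores(stdout_text):
    """Return a dict of score name to float for the recognized lines."""
    return {name: float(value) for name, value in SCORE_LINE.findall(stdout_text)}


if __name__ == "__main__":
    sample = "Affinity: -7.21 (kcal/mol)\nCNNscore: 0.8312\nCNNaffinity: 6.45\n"
    print(parse_gnina_scores(sample))
```

The text to parse could be captured with `subprocess.run(["gnina", "-r", "protein.pdb", "-l", pose, "--score_only"], capture_output=True, text=True).stdout`, using the same flags as the shell loop above.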
MM/GBSA (More accurate, slower): Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
Free Energy Calculations (Most accurate): Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
Recommended Workflow:
DiffDock IS Designed For:
DiffDock IS NOT Designed For:
For complete limitations: Read references/confidence_and_limitations.md using the Read tool
Issue: Low confidence scores across all predictions
Solution: Increase samples_per_complex (20-40), try ensemble docking, validate protein structure
Issue: Out of memory errors
Solution: Reduce the batch size with --batch_size 2 or process fewer complexes at once
Issue: Slow performance
Solution: Check GPU availability with python -c "import torch; print(torch.cuda.is_available())" and run on a GPU
Issue: Unrealistic binding poses
Issue: "Module not found" errors
Solution: Run python scripts/setup_check.py to diagnose
For Best Results:
For interactive use, launch the web interface:
python app/main.py
# Navigate to http://localhost:7860
Or use the online demo without installation:
scripts/
prepare_batch_csv.py: Create and validate batch input CSV files
analyze_results.py: Analyze confidence scores and rank predictions
setup_check.py: Verify DiffDock environment setup
references/
parameters_reference.md: Complete parameter documentation
Read this file when users need:
confidence_and_limitations.md: Confidence score interpretation and tool limitations
Read this file when users need:
workflows_examples.md: Comprehensive workflow examples
Read this file when users need:
assets/
batch_template.csv: Template for batch processing
custom_inference_config.yaml: Configuration template
- Run setup_check.py before starting large jobs
- Validate input with prepare_batch_csv.py to catch errors early
When using DiffDock, cite the appropriate papers:
DiffDock-L (current default model):
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
arXiv:2402.18396
Original DiffDock:
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776