pytdc
Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, molecular oracles, for therapeutic ML and pharmacological prediction.
| name | pytdc |
| description | Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, molecular oracles, for therapeutic ML and pharmacological prediction. |
| license | MIT license |
| metadata | {"version":"1.0","skill-author":"K-Dense Inc."} |
PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).
This skill should be used when:
Install PyTDC with pip (the commands here use uv's pip interface):
uv pip install PyTDC
To upgrade to the latest version:
uv pip install PyTDC --upgrade
Core dependencies (automatically installed):
Additional packages are installed automatically as needed for specific features.
The basic pattern for accessing any TDC dataset follows this structure:
from tdc.<problem> import <Task>
data = <Task>(name='<Dataset>')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
df = data.get_data(format='df')
Where:
- <problem>: One of single_pred, multi_pred, or generation
- <Task>: Specific task category (e.g., ADME, DTI, MolGen)
- <Dataset>: Dataset name within that task
Example - Loading ADME data:
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold')
# Returns dict with 'train', 'valid', 'test' DataFrames
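To make the idea behind a scaffold split concrete, the sketch below assigns whole scaffold groups to train, valid, and test so that no scaffold is shared between partitions. It is not PyTDC's implementation: the `scaffold_split` helper and the single-letter scaffold labels are hypothetical, and in practice the scaffolds would be computed from SMILES with a cheminformatics toolkit.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac=(0.7, 0.1, 0.2)):
    """Split item indices so each scaffold group lands in exactly one partition.

    `scaffolds` holds one scaffold label per molecule (hypothetical stand-ins
    here; real labels would be derived from the molecules themselves).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so common scaffolds end up in the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cap = round(frac[0] * n)
    valid_cap = round(frac[1] * n)

    split = {"train": [], "valid": [], "test": []}
    for group in ordered:
        if len(split["train"]) + len(group) <= train_cap:
            split["train"].extend(group)
        elif len(split["valid"]) + len(group) <= valid_cap:
            split["valid"].extend(group)
        else:
            split["test"].extend(group)
    return split


# Ten toy molecules spread over five made-up scaffolds.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "D", "E"]
split = scaffold_split(scaffolds)
print({name: len(idx) for name, idx in split.items()})
# {'train': 7, 'valid': 1, 'test': 2}
```

Because groups are never divided, the test set only contains scaffolds the model has not seen during training, which is what makes this kind of split harder than a random one.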
Single-instance prediction involves forecasting properties of individual biomedical entities (molecules, proteins, etc.).
Predict pharmacokinetic properties of drug molecules.
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang') # Intestinal permeability
# Other datasets: HIA_Hou, Bioavailability_Ma, Lipophilicity_AstraZeneca, etc.
Common ADME datasets:
Predict toxicity and adverse effects of compounds.
from tdc.single_pred import Tox
data = Tox(name='hERG') # Cardiotoxicity
# Other datasets: AMES, DILI, Carcinogens_Lagunin, etc.
Common toxicity datasets:
Bioactivity predictions from screening data.
from tdc.single_pred import HTS
data = HTS(name='SARSCoV2_Vitro_Touret')
Quantum mechanical properties of molecules.
from tdc.single_pred import QM
data = QM(name='QM7')
Single prediction datasets typically return DataFrames with columns:
- Drug_ID or Compound_ID: Unique identifier
- Drug or X: SMILES string or molecular representation
- Y: Target label (continuous or binary)
Multi-instance prediction involves forecasting properties of interactions between multiple biomedical entities.
Predict binding affinity between drugs and protein targets.
from tdc.multi_pred import DTI
data = DTI(name='BindingDB_Kd')
split = data.get_split()
Available datasets:
Data format: Drug_ID, Target_ID, Drug (SMILES), Target (sequence), Y (binding affinity)
Predict interactions between drug pairs.
from tdc.multi_pred import DDI
data = DDI(name='DrugBank')
split = data.get_split()
Multi-class classification task predicting interaction types. Dataset contains 191,808 DDI pairs with 1,706 drugs.
Predict protein-protein interactions.
from tdc.multi_pred import PPI
data = PPI(name='HuRI')
Generation tasks involve creating novel biomedical entities with desired properties.
Generate diverse, novel molecules with desirable chemical properties.
from tdc.generation import MolGen
data = MolGen(name='ChEMBL_V29')
split = data.get_split()
Use with oracles to optimize for specific properties:
from tdc import Oracle
oracle = Oracle(name='GSK3B')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O') # Evaluate SMILES
See references/oracles.md for all available oracle functions.
Predict reactants needed to synthesize a target molecule.
from tdc.generation import RetroSyn
data = RetroSyn(name='USPTO')
split = data.get_split()
Dataset contains 1,939,253 reactions from USPTO database.
Generate molecule pairs (e.g., prodrug-drug pairs).
from tdc.generation import PairMolGen
data = PairMolGen(name='Prodrug')
For detailed oracle documentation and molecular generation workflows, refer to references/oracles.md and scripts/molecular_generation.py.
Benchmark groups provide curated collections of related datasets for systematic model evaluation.
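from tdc.benchmark_group import admet_group
group = admet_group(path='data/')
# Get one benchmark dataset: a dict with 'name', 'train_val', and 'test'
benchmark = group.get('Caco2_Wang')
name = benchmark['name']
predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    # Seed-specific train/validation split of the fixed train_val set
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train a model on train/valid here (user implements `model`)
    y_pred_test = model.predict(benchmark['test'])
    predictions_list.append({name: y_pred_test})
# Evaluate across the required 5 seeds
results = group.evaluate_many(predictions_list)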
ADMET Group includes 22 datasets covering absorption, distribution, metabolism, excretion, and toxicity.
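A five-seed protocol is usually summarized as a mean and a standard deviation per dataset. The sketch below shows that aggregation step on its own, assuming one metric value per seed. The `summarize_runs` helper is hypothetical, the scores are placeholders rather than real benchmark results, and the choice of sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize_runs(per_seed_scores):
    """Return (mean, sample std) of one metric across seeds, rounded to 3 decimals."""
    scores = list(per_seed_scores.values())
    return round(mean(scores), 3), round(stdev(scores), 3)


# Placeholder MAE values keyed by seed; not real benchmark results.
per_seed_mae = {1: 0.40, 2: 0.42, 3: 0.38, 4: 0.41, 5: 0.39}
avg, spread = summarize_runs(per_seed_mae)
print(f"MAE: {avg} +/- {spread}")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two models is larger than the seed-to-seed variation.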
Available benchmark groups include collections for:
For benchmark evaluation workflows, see scripts/benchmark_evaluation.py.
TDC provides comprehensive data processing utilities organized into four categories.
Retrieve train/validation/test partitions with various strategies:
# Scaffold split (recommended for molecular property tasks; get_split itself defaults to random)
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
# Random split
split = data.get_split(method='random', seed=42, frac=[0.8, 0.1, 0.1])
# Cold split (for DTI/DDI tasks)
split = data.get_split(method='cold_drug', seed=1) # Unseen drugs in test
split = data.get_split(method='cold_target', seed=1) # Unseen targets in test
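The point of a cold split is that the held-out entities never appear in training. The sketch below illustrates that property for drugs on a toy list of drug-target pairs. The `cold_drug_split` helper and the IDs are hypothetical and only show the idea, not how PyTDC builds its splits.

```python
import random


def cold_drug_split(pairs, test_frac=0.2, seed=1):
    """Hold out a fraction of drugs so test pairs contain only unseen drugs.

    `pairs` is a list of (drug_id, target_id) tuples.
    """
    drugs = sorted({drug for drug, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)

    # Split at the drug level first, then route each pair by its drug.
    n_test = max(1, round(test_frac * len(drugs)))
    held_out = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test


pairs = [("D1", "T1"), ("D1", "T2"), ("D2", "T1"), ("D3", "T2"),
         ("D4", "T1"), ("D5", "T3"), ("D5", "T1")]
train, test = cold_drug_split(pairs)
train_drugs = {d for d, _ in train}
test_drugs = {d for d, _ in test}
print(train_drugs.isdisjoint(test_drugs))  # True: no drug appears on both sides
```

Targets may still be shared between the two sides here; holding out targets instead (or both) follows the same pattern with the other column.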
Available split strategies:
- random: Random shuffling
- scaffold: Scaffold-based (for chemical diversity)
- cold_drug, cold_target, cold_drug_target: For DTI tasks
- temporal: Time-based splits for temporal datasets
Use standardized metrics for evaluation:
from tdc import Evaluator
# For binary classification
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred)
# For regression
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)
Available metrics: ROC-AUC, PR-AUC, F1, Accuracy, RMSE, MAE, R2, Spearman, Pearson, and more.
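For reference, the two regression metrics named above reduce to short formulas. The following is a plain-Python sketch of their standard definitions, not the library's code. The `rmse` and `mae` helpers and the sample values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 5.0]
print(rmse(y_true, y_pred))  # 0.75
print(mae(y_true, y_pred))   # 0.625
```

RMSE is larger than MAE here because the two errors of size 1.0 dominate once squared.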
TDC provides 11 key processing utilities:
from tdc.chem_utils import MolConvert
# Molecule format conversion
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
Processing utilities include:
For comprehensive utilities documentation, see references/utilities.md.
TDC provides 17+ oracle functions for molecular optimization:
from tdc import Oracle
# Single oracle
oracle = Oracle(name='DRD2')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
# One oracle applied to a list of SMILES (returns one score per molecule)
oracle = Oracle(name='JNK3')
scores = oracle(['CCO', 'c1ccccc1', 'CC(=O)O'])
For complete oracle documentation, see references/oracles.md.
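Since an oracle is just a callable that maps a SMILES string to a score, it can drive a simple select-the-best step in a generation loop. The sketch below shows that step with a stand-in scoring function so it runs without any downloads. `toy_oracle` and `top_k` are hypothetical, and the toy score has no chemical meaning.

```python
def toy_oracle(smiles):
    """Hypothetical stand-in for an oracle: fraction of characters that are 'C' or 'c'."""
    return sum(ch in "Cc" for ch in smiles) / len(smiles)


def top_k(candidates, oracle, k=2):
    """Score every candidate and keep the k highest-scoring (score, smiles) pairs."""
    scored = [(oracle(s), s) for s in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


candidates = ["CCO", "c1ccccc1", "CC(=O)O", "N"]
best = top_k(candidates, toy_oracle)
for score, smiles in best:
    print(f"{smiles}: {score:.3f}")
```

Any callable with the same shape could be passed as the `oracle` argument in place of `toy_oracle`, since `top_k` only relies on calling it with one SMILES string at a time.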
from tdc.utils import retrieve_dataset_names
# Get all ADME datasets
adme_datasets = retrieve_dataset_names('ADME')
# Get all DTI datasets
dti_datasets = retrieve_dataset_names('DTI')
# Get label mapping (label index -> interaction description)
from tdc.utils import get_label_map
label_map = get_label_map(name='DrugBank', task='DDI')
# Convert labels from nM to log scale (p-units, e.g. pKd)
# Assumes `data` is a binding-affinity loader such as DTI(name='BindingDB_Kd')
data.convert_to_log(form='binding')
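Converting an affinity from nM to the "p" scale (as in pKd or pIC50) means taking the negative base-10 logarithm of the molar concentration. The sketch below works through that arithmetic. The `nm_to_p` helper is hypothetical, and a library implementation may differ in detail, for example by adding a small offset to avoid taking the log of zero.

```python
import math


def nm_to_p(value_nm):
    """Convert an affinity in nM to the p scale.

    1 nM = 1e-9 M, so -log10(value_nm * 1e-9) simplifies to 9 - log10(value_nm).
    """
    return 9.0 - math.log10(value_nm)


for kd_nm in (1.0, 10.0, 1000.0):
    print(f"{kd_nm} nM -> p = {nm_to_p(kd_nm):.1f}")
# 1.0 nM -> p = 9.0
# 10.0 nM -> p = 8.0
# 1000.0 nM -> p = 6.0
```

On this scale larger values mean tighter binding, and each unit corresponds to a tenfold change in concentration.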
from tdc.utils import cid2smiles, uniprot2seq
# Convert PubChem CID to SMILES
smiles = cid2smiles(2244)
# Convert UniProt ID to amino acid sequence
sequence = uniprot2seq('P12345')
See scripts/load_and_split_data.py for a complete example:
from tdc.single_pred import ADME
from tdc import Evaluator
# Load data
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42)
train, valid, test = split['train'], split['valid'], split['test']
# Train model (user implements)
# model.fit(train['Drug'], train['Y'])
# Evaluate
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'], predictions)
See scripts/benchmark_evaluation.py for a complete example with multiple seeds and proper evaluation protocol.
See scripts/molecular_generation.py for an example of goal-directed generation using oracle functions.
This skill includes bundled resources for common TDC workflows:
- load_and_split_data.py: Template for loading and splitting TDC datasets with various strategies
- benchmark_evaluation.py: Template for running benchmark group evaluations with proper 5-seed protocol
- molecular_generation.py: Template for molecular generation using oracle functions
- datasets.md: Comprehensive catalog of all available datasets organized by task type
- oracles.md: Complete documentation of all 17+ molecule generation oracles
- utilities.md: Detailed guide to data processing, splitting, and evaluation utilities