| name | cbioportal-database |
| description | Cancer genomics (TCGA et al.) via cBioPortal REST API. Retrieve somatic mutations, CNAs, expression, clinical data (survival/stage/treatment) across thousands of studies. Use for TMB, oncoprints, survival analysis. For population frequencies use gnomad-database; for drug-gene interactions use opentargets-database. |
| license | AGPL-3.0 |
cBioPortal Database
Overview
cBioPortal for Cancer Genomics is a public repository of cancer genomics data including TCGA, ICGC, and hundreds of curated studies spanning 100+ cancer types. It provides somatic mutation profiles, copy number alterations (CNA), gene expression, clinical data (survival, stage, treatment history), and methylation data for tens of thousands of patient samples. Data is accessible via a REST API at https://www.cbioportal.org/api/ with no authentication required.
When to Use
- Retrieving somatic mutation profiles (variant type, amino acid change) for a gene across TCGA studies
- Querying copy number alteration data (amplification, deep deletion) for candidate cancer driver genes
- Accessing clinical data — overall survival, disease-free survival, tumor stage — for survival curve analysis
- Identifying which cancer studies have molecular profiling data for a specific cancer type (e.g., breast, lung)
- Downloading gene expression (RNA-seq FPKM/RSEM) data from specific TCGA cohorts for differential expression analysis
- Correlating genomic alterations with clinical outcomes in a specific study
- Use gnomad-database instead when you need population-level variant allele frequencies in healthy individuals
- For drug-gene interaction lookups use opentargets-database; cBioPortal provides the genomic alteration data, not drug interaction annotations
Prerequisites
- Python packages: requests, pandas, matplotlib
- Data requirements: Hugo gene symbols (e.g., TP53) or their Entrez Gene IDs (e.g., 7157), cBioPortal study IDs (e.g., brca_tcga_pan_can_atlas_2018), molecular profile IDs
- Environment: internet connection; no API key required
- Rate limits: no strict rate limits; use time.sleep(0.2) between batch requests for polite access
pip install requests pandas matplotlib
Quick Start
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    """GET request to cBioPortal REST API, returns parsed JSON."""
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
cancer_types = cbio_get("cancer-types")
print(f"Total cancer types: {len(cancer_types)}")
studies = cbio_get("studies", params={"keyword": "breast"})
# TCGA breast cancer study IDs start with "brca_tcga" (e.g., brca_tcga_pan_can_atlas_2018)
brca = [s for s in studies if s["studyId"].startswith("brca_tcga")]
if brca:
    s = brca[0]
    print(f"Study: {s['name']}")
    print(f"  studyId: {s['studyId']}")
    print(f"  Samples: {s['allSampleCount']}")
Core API
Query 1: Cancer Types and Studies
List available cancer types and find studies by cancer type or keyword.
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
cancer_types = cbio_get("cancer-types")
ct_df = pd.DataFrame(cancer_types)[["cancerTypeId", "name", "dedicatedColor"]]
print(f"Cancer types: {len(ct_df)}")
print(ct_df.head(5).to_string(index=False))
lung_studies = cbio_get("studies", params={"keyword": "lung adenocarcinoma"})
print(f"\nLung adenocarcinoma studies: {len(lung_studies)}")
for s in lung_studies[:3]:
    print(f"  {s['studyId']:40s} n={s['allSampleCount']}")
study_id = "brca_tcga_pan_can_atlas_2018"
study = cbio_get(f"studies/{study_id}")
print(f"Study: {study['name']}")
print(f" Reference genome: {study.get('referenceGenome', 'n/a')}")
print(f" All sample count: {study['allSampleCount']}")
profiles = cbio_get("molecular-profiles", params={"studyId": study_id})
print(f"\nMolecular profiles ({len(profiles)} total):")
for p in profiles:
    print(f"  {p['molecularProfileId']:55s} [{p['molecularAlterationType']}]")
Query 2: Somatic Mutations
Retrieve mutation data for a gene or set of genes in a study's mutation profile.
import requests, json
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_post(endpoint, body):
    """POST request to cBioPortal REST API."""
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
print(f"Total samples: {len(sample_ids)}")
profile_id = f"{study_id}_mutations"
body = {
    "sampleIds": sample_ids[:200],
    "entrezGeneIds": [7157]  # TP53
}
mutations = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
print(f"TP53 mutations in first 200 samples: {len(mutations)}")
mut_df = pd.DataFrame(mutations)
print("\nMutation type distribution:")
print(mut_df["mutationType"].value_counts().head(8).to_string())
Query 3: Copy Number Alterations
Fetch discrete CNA data (amplification = 2, gain = 1, diploid = 0, loss = -1, deep deletion = -2).
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
cna_profile_id = f"{study_id}_gistic"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples][:300]
body = {
    "sampleIds": sample_ids,
    "entrezGeneIds": [2064, 4609]  # ERBB2, MYC
}
cna_data = cbio_post(
    f"molecular-profiles/{cna_profile_id}/molecular-data/fetch", body
)
print(f"CNA records retrieved: {len(cna_data)}")
cna_df = pd.DataFrame(cna_data)
cna_label = {2: "AMP", 1: "GAIN", 0: "DIPLOID", -1: "LOSS", -2: "HOMDEL"}
print("\nERBB2 CNA distribution:")
erbb2 = cna_df[cna_df["entrezGeneId"] == 2064]
erbb2_counts = erbb2["value"].map(lambda x: cna_label.get(int(x), str(x))).value_counts()
print(erbb2_counts.to_string())
Query 4: Clinical Data
Retrieve per-sample or per-patient clinical attributes including survival, tumor stage, and treatment.
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
attrs = cbio_get(f"studies/{study_id}/clinical-attributes")
attr_df = pd.DataFrame(attrs)[["clinicalAttributeId", "displayName", "datatype", "patientAttribute"]]
print(f"Clinical attributes: {len(attr_df)}")
survival_attrs = attr_df[attr_df["clinicalAttributeId"].str.contains("SURVIVAL|MONTHS|STATUS", na=False)]
print(survival_attrs[["clinicalAttributeId", "displayName"]].to_string(index=False))
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT",
                            "projection": "DETAILED"})
clin_df = pd.DataFrame(clinical)
# Long format (one row per patient/attribute) to wide (one row per patient)
clin_pivot = clin_df.pivot_table(
    index="patientId", columns="clinicalAttributeId",
    values="value", aggfunc="first"
)
print(f"\nPatients: {len(clin_pivot)}")
if "OS_STATUS" in clin_pivot.columns:
    print("OS status counts:")
    print(clin_pivot["OS_STATUS"].value_counts().to_string())
Query 5: Gene Expression Data
Retrieve mRNA expression values (RSEM or FPKM) from RNA-seq profiles.
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
rna_profile_id = f"{study_id}_rna_seq_v2_mrna_median_normed_log2"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples][:100]
body = {
    "sampleIds": sample_ids,
    "entrezGeneIds": [2099, 2064, 5241]  # ESR1, ERBB2, PGR
}
expr_data = cbio_post(
    f"molecular-profiles/{rna_profile_id}/molecular-data/fetch", body
)
expr_df = pd.DataFrame(expr_data)
print(f"Expression records: {len(expr_df)}")
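expr_pivot = expr_df.pivot_table(
    index="sampleId", columns="entrezGeneId", values="value"
)
# Rename by Entrez ID so labels stay correct even if a gene is absent from the response
expr_pivot = expr_pivot.rename(columns={2099: "ESR1", 2064: "ERBB2", 5241: "PGR"})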
print(f"\nExpression matrix: {expr_pivot.shape}")
print(expr_pivot.describe().round(2))
Query 6: Gene Details and Batch Lookup
Look up gene metadata (symbol, Entrez ID, type) required to construct mutation and CNA queries.
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=30)
    r.raise_for_status()
    return r.json()
gene = cbio_get("genes/TP53")
print(f"TP53: entrezGeneId={gene['entrezGeneId']}, type={gene['type']}")
gene_symbols = ["BRCA1", "BRCA2", "TP53", "PIK3CA", "PTEN", "KRAS", "EGFR"]
body = {"geneIds": gene_symbols, "geneIdType": "HUGO_GENE_SYMBOL"}
gene_list = cbio_post("genes/fetch", body)
gene_map = {g["hugoGeneSymbol"]: g["entrezGeneId"] for g in gene_list}
gene_df = pd.DataFrame(gene_list)[["hugoGeneSymbol", "entrezGeneId", "type"]]
print(f"\nResolved {len(gene_df)} genes:")
print(gene_df.to_string(index=False))
Query 7: Visualization — Mutation Frequency Barplot
Plot mutation frequency across TCGA studies for a cancer driver gene.
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()
STUDIES = {
    "brca_tcga_pan_can_atlas_2018": "BRCA",
    "luad_tcga_pan_can_atlas_2018": "LUAD",
    "coad_tcga_pan_can_atlas_2018": "COAD",
    "prad_tcga_pan_can_atlas_2018": "PRAD",
    "gbm_tcga_pan_can_atlas_2018": "GBM",
}
GENE_ENTREZ = 7157
GENE_SYMBOL = "TP53"
rows = []
for study_id, label in STUDIES.items():
    try:
        samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
        sample_ids = [s["sampleId"] for s in samples]
        n_total = len(sample_ids)
        profile_id = f"{study_id}_mutations"
        body = {"sampleIds": sample_ids, "entrezGeneIds": [GENE_ENTREZ]}
        muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
        # Count each sample once, even if it carries several mutations in the gene
        mutated_samples = len({m["sampleId"] for m in muts})
        rows.append({"study": label, "n_mutated": mutated_samples,
                     "n_total": n_total,
                     "freq": mutated_samples / n_total * 100})
        time.sleep(0.2)
    except Exception as e:
        print(f"  Skipping {study_id}: {e}")
df = pd.DataFrame(rows).sort_values("freq", ascending=True)
fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.barh(df["study"], df["freq"], color="#C0392B", edgecolor="white")
ax.bar_label(bars, labels=[f"{v:.0f}% (n={n})" for v, n in zip(df["freq"], df["n_mutated"])],
             padding=4, fontsize=9)
ax.set_xlabel(f"{GENE_SYMBOL} Mutation Frequency (%)")
ax.set_title(f"{GENE_SYMBOL} Somatic Mutation Frequency\nacross TCGA PanCancer Atlas Studies")
ax.set_xlim(0, df["freq"].max() * 1.3)
plt.tight_layout()
plt.savefig(f"{GENE_SYMBOL}_mutation_frequency.png", dpi=150, bbox_inches="tight")
print(f"Saved {GENE_SYMBOL}_mutation_frequency.png")
print(df[["study", "n_mutated", "n_total", "freq"]].to_string(index=False))
Key Concepts
cBioPortal Data Model
cBioPortal organizes data in a three-tier hierarchy: Cancer Studies → Molecular Profiles → Sample-level data. A single study (e.g., brca_tcga_pan_can_atlas_2018) contains multiple molecular profiles, each covering one data type. Before querying mutation or expression data, always retrieve the molecular profile list with GET /molecular-profiles?studyId={studyId} to confirm the correct profile ID.
Molecular Profile ID Conventions
| Data Type | Typical Profile ID Suffix | Alteration Type |
|---|---|---|
| Somatic mutations | _mutations | MUTATION_EXTENDED |
| Discrete CNA (GISTIC) | _gistic | COPY_NUMBER_ALTERATION |
| Continuous CNA (log2) | _log2CNA | COPY_NUMBER_ALTERATION |
| RNA-seq (log2 RSEM) | _rna_seq_v2_mrna_median_normed_log2 | MRNA_EXPRESSION |
| Methylation | _methylation_hm27 or _hm450 | METHYLATION |
Not all studies have all profile types. Always verify with GET /molecular-profiles?studyId={studyId}.
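Because suffixes vary between studies, one option is to select profiles by their alteration type instead of building the ID from a suffix. The sketch below is a minimal illustration: find_profile_ids is a hypothetical helper, the profile records are made-up stand-ins for a GET /molecular-profiles response, and the datatype field (e.g., DISCRETE, LOG2-VALUE) is assumed to be present in that response.

```python
from typing import Dict, List, Optional

def find_profile_ids(profiles: List[Dict], alteration_type: str,
                     datatype: Optional[str] = None) -> List[str]:
    """Return profile IDs whose alteration type (and optionally datatype) match."""
    return [
        p["molecularProfileId"]
        for p in profiles
        if p.get("molecularAlterationType") == alteration_type
        and (datatype is None or p.get("datatype") == datatype)
    ]

# Hypothetical records shaped like a GET /molecular-profiles response
profiles = [
    {"molecularProfileId": "demo_study_mutations",
     "molecularAlterationType": "MUTATION_EXTENDED", "datatype": "MAF"},
    {"molecularProfileId": "demo_study_gistic",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION", "datatype": "DISCRETE"},
    {"molecularProfileId": "demo_study_log2CNA",
     "molecularAlterationType": "COPY_NUMBER_ALTERATION", "datatype": "LOG2-VALUE"},
]

mutation_ids = find_profile_ids(profiles, "MUTATION_EXTENDED")
discrete_cna_ids = find_profile_ids(profiles, "COPY_NUMBER_ALTERATION", "DISCRETE")
methylation_ids = find_profile_ids(profiles, "METHYLATION")  # empty: not profiled
print(mutation_ids, discrete_cna_ids, methylation_ids)
```

An empty result signals that the study lacks that data type, which is cheaper to detect here than through a 404 on the fetch endpoint.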
Entrez Gene IDs
The REST API mutation and molecular data endpoints require Entrez Gene IDs (integers), not Hugo symbols. Use GET /genes/{hugoSymbol} or POST /genes/fetch to resolve symbols to IDs before batch queries.
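A minimal sketch of the symbol-to-ID step, assuming that symbols the API cannot resolve are simply absent from the POST /genes/fetch response. build_gene_map is a hypothetical helper and the records are illustrative stand-ins for a real response; checking for missing symbols up front avoids silently querying fewer genes than intended.

```python
from typing import Dict, List, Tuple

def build_gene_map(requested: List[str],
                   gene_records: List[Dict]) -> Tuple[Dict[str, int], List[str]]:
    """Map Hugo symbols to Entrez IDs and list requested symbols that were not resolved."""
    symbol_to_id = {g["hugoGeneSymbol"]: int(g["entrezGeneId"]) for g in gene_records}
    missing = [s for s in requested if s not in symbol_to_id]
    return symbol_to_id, missing

# Hypothetical records shaped like a POST /genes/fetch response
records = [
    {"entrezGeneId": 7157, "hugoGeneSymbol": "TP53", "type": "protein-coding"},
    {"entrezGeneId": 2064, "hugoGeneSymbol": "ERBB2", "type": "protein-coding"},
]

# "HER2" is an alias of ERBB2, used here to stand in for an unresolved symbol
gene_map, missing = build_gene_map(["TP53", "ERBB2", "HER2"], records)
entrez_ids = sorted(gene_map.values())  # integers for the entrezGeneIds field
print(gene_map, missing, entrez_ids)
```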
Common Workflows
Workflow 1: Somatic Mutation Landscape for a Gene Panel
Goal: Retrieve mutations for multiple cancer driver genes across an entire TCGA study and export to CSV.
import requests, time
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=120)
    r.raise_for_status()
    return r.json()
study_id = "luad_tcga_pan_can_atlas_2018"
profile_id = f"{study_id}_mutations"
gene_symbols = ["KRAS", "EGFR", "TP53", "BRAF", "STK11", "KEAP1", "RB1"]
gene_list = cbio_post("genes/fetch",
                      {"geneIds": gene_symbols, "geneIdType": "HUGO_GENE_SYMBOL"})
gene_map = {g["entrezGeneId"]: g["hugoGeneSymbol"] for g in gene_list}
entrez_ids = list(gene_map.keys())
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
print(f"Study: {study_id} — {len(sample_ids)} samples")
chunk_size = 500
all_muts = []
for i in range(0, len(sample_ids), chunk_size):
    chunk = sample_ids[i:i + chunk_size]
    body = {"sampleIds": chunk, "entrezGeneIds": entrez_ids}
    muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
    all_muts.extend(muts)
    time.sleep(0.1)
mut_df = pd.DataFrame(all_muts)
mut_df["hugoSymbol"] = mut_df["entrezGeneId"].map(gene_map)
print(f"Total mutations: {len(mut_df)}")
print("\nMutation counts per gene:")
print(mut_df.groupby("hugoSymbol")["sampleId"].nunique()
      .sort_values(ascending=False).to_string())
mut_df.to_csv(f"{study_id}_driver_mutations.csv", index=False)
print(f"\nSaved: {study_id}_driver_mutations.csv")
Workflow 2: Survival Analysis — CNA Status vs. Overall Survival
Goal: Compare overall survival between patients with ERBB2 amplification vs. diploid/loss in TCGA BRCA.
import requests
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=60)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
cna_profile_id = f"{study_id}_gistic"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
cna_data = cbio_post(
    f"molecular-profiles/{cna_profile_id}/molecular-data/fetch",
    {"sampleIds": sample_ids, "entrezGeneIds": [2064]}
)
cna_df = pd.DataFrame(cna_data)[["sampleId", "value"]].rename(columns={"value": "erbb2_cna"})
cna_df["erbb2_cna"] = cna_df["erbb2_cna"].astype(int)
cna_df["erbb2_status"] = cna_df["erbb2_cna"].map(
    {2: "Amplified", 1: "Gain", 0: "Diploid", -1: "Loss", -2: "Deep Deletion"})
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT", "projection": "DETAILED"})
clin_df = pd.DataFrame(clinical)
clin_pivot = clin_df.pivot_table(
    index="patientId", columns="clinicalAttributeId", values="value", aggfunc="first"
).reset_index()
sample_patient = cbio_get(f"studies/{study_id}/samples", params={"projection": "DETAILED"})
sp_df = pd.DataFrame(sample_patient)[["sampleId", "patientId"]]
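merged = (cna_df
          .merge(sp_df, on="sampleId")
          .merge(clin_pivot[["patientId", "OS_STATUS", "OS_MONTHS"]],
                 on="patientId", how="inner"))
# Convert first, then drop rows whose survival time is missing or non-numeric
merged["OS_MONTHS"] = pd.to_numeric(merged["OS_MONTHS"], errors="coerce")
merged = merged.dropna(subset=["OS_STATUS", "OS_MONTHS"])
merged["event"] = (merged["OS_STATUS"] == "1:DECEASED").astype(int)
def km_curve(df, time_col="OS_MONTHS", event_col="event"):
    """Kaplan-Meier estimate: survival drops only at deaths;
    censored patients just leave the risk set."""
    # At tied times, process deaths before censored observations
    d = df.sort_values([time_col, event_col], ascending=[True, False])
    surv, s, n = [], 1.0, len(d)
    for i, (t, e) in enumerate(zip(d[time_col], d[event_col])):
        if e:
            s *= 1 - 1 / (n - i)  # n - i patients are still at risk
        surv.append((t, s))
    return surv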
fig, ax = plt.subplots(figsize=(8, 5))
colors = {"Amplified": "#C0392B", "Diploid": "#2980B9"}
for status, color in colors.items():
    grp = merged[merged["erbb2_status"] == status]
    if len(grp) < 10:
        continue
    km = km_curve(grp)
    times = [0] + [x[0] for x in km]
    surv = [1.0] + [x[1] for x in km]
    ax.step(times, surv, where="post", color=color,
            label=f"ERBB2 {status} (n={len(grp)})", lw=2)
ax.set_xlabel("Overall Survival (months)")
ax.set_ylabel("Survival Probability")
ax.set_title("ERBB2 CNA Status vs. Overall Survival\nTCGA BRCA (PanCancer Atlas)")
ax.legend()
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("erbb2_survival.png", dpi=150, bbox_inches="tight")
print(f"Saved erbb2_survival.png")
print(f"ERBB2 Amplified: {(merged['erbb2_status']=='Amplified').sum()} samples")
print(f"ERBB2 Diploid: {(merged['erbb2_status']=='Diploid').sum()} samples")
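The survival comparison in this workflow depends on treating censored patients (still alive at last follow-up) differently from deaths. As a worked illustration of the Kaplan-Meier product-limit estimate, the sketch below uses four made-up patients; kaplan_meier is a hypothetical standalone helper, and the data are not from cBioPortal.

```python
from typing import List, Sequence, Tuple

def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit survival estimate.

    events[i] is 1 for a death and 0 for a censored observation.
    Survival is multiplied by (1 - 1/at_risk) only at deaths;
    censored patients only shrink the risk set.
    """
    # Sort by time; at tied times, deaths come before censored observations
    order = sorted(zip(times, events), key=lambda te: (te[0], -te[1]))
    n = len(order)
    s = 1.0
    curve = []
    for i, (t, e) in enumerate(order):
        if e:
            s *= 1 - 1 / (n - i)  # n - i patients are still at risk
        curve.append((t, s))
    return curve

# Made-up follow-up: deaths at months 1, 3, 4; one patient censored at month 2
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
# month 1: 3/4 = 0.750; month 2: unchanged (censored);
# month 3: 0.750 * 1/2 = 0.375; month 4: 0.375 * 0 = 0.000
```

For real analyses, a dedicated survival library with confidence intervals and a log-rank test is usually the better choice; the hand-rolled version is mainly useful for keeping dependencies minimal.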
Workflow 3: Multi-Study Alteration Frequency Heatmap
Goal: Build a gene × cancer-type alteration frequency matrix across TCGA studies.
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=90)
    r.raise_for_status()
    return r.json()
STUDIES = {
    "brca_tcga_pan_can_atlas_2018": "BRCA",
    "luad_tcga_pan_can_atlas_2018": "LUAD",
    "coad_tcga_pan_can_atlas_2018": "COAD",
    "gbm_tcga_pan_can_atlas_2018": "GBM",
}
GENE_SYMBOLS = ["TP53", "KRAS", "PIK3CA", "EGFR", "PTEN"]
gene_list = cbio_post("genes/fetch",
                      {"geneIds": GENE_SYMBOLS, "geneIdType": "HUGO_GENE_SYMBOL"})
gene_map = {g["entrezGeneId"]: g["hugoGeneSymbol"] for g in gene_list}
entrez_ids = list(gene_map.keys())
freq_matrix = pd.DataFrame(index=GENE_SYMBOLS, columns=list(STUDIES.values()), dtype=float)
for study_id, label in STUDIES.items():
    try:
        samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
        sample_ids = [s["sampleId"] for s in samples]
        n_total = len(sample_ids)
        profile_id = f"{study_id}_mutations"
        body = {"sampleIds": sample_ids, "entrezGeneIds": entrez_ids}
        muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch", body)
        mut_df = pd.DataFrame(muts) if muts else pd.DataFrame()
        for eid, symbol in gene_map.items():
            if mut_df.empty:
                freq_matrix.loc[symbol, label] = 0.0
            else:
                # Percentage of samples with at least one mutation in this gene
                n_mut = mut_df[mut_df["entrezGeneId"] == eid]["sampleId"].nunique()
                freq_matrix.loc[symbol, label] = n_mut / n_total * 100
        time.sleep(0.2)
    except Exception as e:
        print(f"  {label}: {e}")
freq_matrix = freq_matrix.fillna(0).astype(float)
fig, ax = plt.subplots(figsize=(7, 4))
im = ax.imshow(freq_matrix.values, cmap="YlOrRd", aspect="auto", vmin=0, vmax=80)
ax.set_xticks(range(len(freq_matrix.columns)))
ax.set_xticklabels(freq_matrix.columns, rotation=30, ha="right")
ax.set_yticks(range(len(freq_matrix.index)))
ax.set_yticklabels(freq_matrix.index)
for i in range(len(freq_matrix.index)):
    for j in range(len(freq_matrix.columns)):
        val = freq_matrix.iloc[i, j]
        ax.text(j, i, f"{val:.0f}%", ha="center", va="center", fontsize=9,
                color="white" if val > 40 else "black")
plt.colorbar(im, ax=ax, label="Mutation Frequency (%)")
ax.set_title("Somatic Mutation Frequency — TCGA PanCancer Atlas")
plt.tight_layout()
plt.savefig("mutation_frequency_heatmap.png", dpi=150, bbox_inches="tight")
print("Saved mutation_frequency_heatmap.png")
print(freq_matrix.to_string())
Key Parameters
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|---|---|---|---|---|
| studyId | All study endpoints | none | any valid cBioPortal study ID | Selects the cancer study |
| molecularProfileId | mutations/fetch, molecular-data/fetch | none | {studyId}_mutations, {studyId}_gistic, etc. | Selects the data type profile |
| entrezGeneIds | mutations/fetch, molecular-data/fetch | none | list of integer Entrez IDs | Genes to query; use POST /genes/fetch to resolve symbols |
| sampleIds | mutations/fetch, molecular-data/fetch | none | list of sample ID strings | Samples to retrieve; use GET /studies/{id}/samples for all |
| clinicalDataType | clinical-data | "SAMPLE" | "SAMPLE", "PATIENT" | Whether to return sample-level or patient-level clinical attributes |
| projection | samples, clinical-data | "SUMMARY" | "ID", "SUMMARY", "DETAILED", "META" | Response verbosity; "ID" fastest for ID-only fetches |
| keyword | studies | "" | free text | Filter studies by name/cancer type keyword |
Best Practices
- Fetch sample IDs before data queries: The fetch calls in this document pass an explicit sampleIds list (the API also accepts a sampleListId in place of individual IDs). Retrieve the IDs with GET /studies/{studyId}/samples?projection=ID before each query.
- Verify profile IDs from the API: Profile IDs are not guaranteed to follow the _mutations / _gistic pattern in every study. Always confirm with GET /molecular-profiles?studyId={studyId} rather than guessing.
- Chunk large sample sets: The API can time out on requests with thousands of sample IDs. Batch requests in chunks of 500 samples with time.sleep(0.1) between chunks.
- Use Entrez IDs, not Hugo symbols, in data fetch endpoints: The mutation and molecular data endpoints accept entrezGeneIds (integers). Resolve symbols first with POST /genes/fetch.
- Don't hard-code Entrez IDs: Bare integers are easy to mistype and hide which gene is being queried, and an ID can occasionally be retired or merged. Use POST /genes/fetch to resolve gene symbols at runtime.
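The chunking practice above can be factored into a small generator. This is a minimal sketch: chunked is a hypothetical helper name, the sample IDs are synthetic, and the 500-sample default simply mirrors the batch size suggested in this section.

```python
from typing import Iterator, List, Sequence

def chunked(ids: Sequence[str], size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])

# Synthetic sample IDs standing in for a GET /studies/{studyId}/samples result
sample_ids = [f"SAMPLE-{i:04d}" for i in range(1234)]
batches = list(chunked(sample_ids, 500))
print([len(b) for b in batches])  # [500, 500, 234]
```

Each batch would then become the sampleIds value of one fetch request, with a short time.sleep between requests.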
Common Recipes
Recipe: List All Molecular Profiles for a Study
When to use: Before running any data query — verify which profile IDs are available.
import requests
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
profiles = cbio_get("molecular-profiles", params={"studyId": study_id})
for p in profiles:
    print(f"{p['molecularProfileId']:55s} {p['molecularAlterationType']}")
Recipe: Download Full Mutation MAF for a Study
When to use: Export all somatic mutations from a study into MAF-compatible format for downstream analysis.
import requests, time
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=60)
    r.raise_for_status()
    return r.json()
def cbio_post(endpoint, body):
    r = requests.post(f"{BASE_URL}/{endpoint}", json=body,
                      headers={"Accept": "application/json",
                               "Content-Type": "application/json"},
                      timeout=120)
    r.raise_for_status()
    return r.json()
study_id = "coad_tcga_pan_can_atlas_2018"
profile_id = f"{study_id}_mutations"
samples = cbio_get(f"studies/{study_id}/samples", params={"projection": "ID"})
sample_ids = [s["sampleId"] for s in samples]
all_mutations = []
for i in range(0, len(sample_ids), 300):
    chunk = sample_ids[i:i + 300]
    muts = cbio_post(f"molecular-profiles/{profile_id}/mutations/fetch",
                     {"sampleIds": chunk, "entrezGeneIds": []})
    all_mutations.extend(muts)
    time.sleep(0.1)
mut_df = pd.DataFrame(all_mutations)
cols = ["hugoGeneSymbol", "sampleId", "chr", "startPosition", "endPosition",
        "referenceAllele", "variantAllele", "mutationType",
        "proteinChange", "variantType"]
available = [c for c in cols if c in mut_df.columns]
mut_df[available].to_csv(f"{study_id}_mutations.csv", index=False)
print(f"Saved {len(mut_df)} mutations → {study_id}_mutations.csv")
Recipe: Query Patient-Level Clinical Attribute
When to use: Extract a specific clinical variable (e.g., tumor stage, age at diagnosis) for all patients.
import requests
import pandas as pd
BASE_URL = "https://www.cbioportal.org/api"
def cbio_get(endpoint, params=None):
    r = requests.get(f"{BASE_URL}/{endpoint}", params=params,
                     headers={"Accept": "application/json"}, timeout=30)
    r.raise_for_status()
    return r.json()
study_id = "brca_tcga_pan_can_atlas_2018"
attr_id = "TUMOR_STAGE"
clinical = cbio_get(f"studies/{study_id}/clinical-data",
                    params={"clinicalDataType": "PATIENT",
                            "projection": "DETAILED"})
clin_df = pd.DataFrame(clinical)
if "clinicalAttributeId" in clin_df.columns:
    stage_df = clin_df[clin_df["clinicalAttributeId"] == attr_id][["patientId", "value"]]
    print(f"Patients with {attr_id} annotation: {len(stage_df)}")
    print(stage_df["value"].value_counts().head(10).to_string())
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| 404 Not Found on profile endpoint | Molecular profile does not exist for study | List profiles with GET /molecular-profiles?studyId={id}; confirm the profile ID |
| Empty mutations list | Gene has no mutations in the selected samples/profile | Verify study has a mutation profile; check sample IDs belong to the same study |
| requests.exceptions.Timeout | Large sample set (>1000) in a single request | Chunk requests to 300–500 samples; increase timeout to 120s |
| entrezGeneIds key error in response | Hugo symbol passed instead of Entrez ID | Use POST /genes/fetch to resolve symbols to integer Entrez IDs first |
| CNA values returned as strings | value field is string in JSON | Cast with pd.to_numeric() or int(value) before comparison |
| Expression profile not found | Study uses non-standard profile naming | Check profile list; look for MRNA_EXPRESSION alteration type in GET /molecular-profiles |
| Survival analysis has many NA values | Clinical attribute absent for some patients | Use dropna() on OS columns; check attribute availability with GET /studies/{id}/clinical-attributes |
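For the string-valued CNA row in the table above, a minimal sketch of the cast, assuming the value field may arrive as a string and may contain non-numeric placeholders such as "NA". The records are made-up stand-ins for a molecular-data/fetch response.

```python
import pandas as pd

# Hypothetical records shaped like a molecular-data/fetch response
records = [
    {"sampleId": "S1", "entrezGeneId": 2064, "value": "2"},
    {"sampleId": "S2", "entrezGeneId": 2064, "value": "0"},
    {"sampleId": "S3", "entrezGeneId": 2064, "value": "NA"},
]
cna_df = pd.DataFrame(records)

# Coerce to numbers; anything unparseable becomes NaN instead of raising
cna_df["value"] = pd.to_numeric(cna_df["value"], errors="coerce")
cna_df = cna_df.dropna(subset=["value"])
cna_df["value"] = cna_df["value"].astype(int)

# Integer comparison now behaves as expected (2 = amplification)
amplified = cna_df.loc[cna_df["value"] == 2, "sampleId"].tolist()
print(amplified)  # ['S1']
```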
Related Skills
gnomad-database — population variant allele frequencies for healthy cohorts (complement to cBioPortal somatic data)
cnvkit-copy-number — CNVkit pipeline for generating SEG/CNA files that can be loaded into cBioPortal
pydeseq2-differential-expression — differential expression analysis that can be applied to cBioPortal RNA-seq exports