| name | uniprot-protein-database |
| description | Query UniProt protein database via REST API. Search by gene/protein name, retrieve FASTA sequences, map IDs across databases (Ensembl, PDB, RefSeq), access Swiss-Prot annotations. For unified multi-database access use bioservices; for protein structure use alphafold-database. |
| license | CC-BY-4.0 |
UniProt — Protein Database
Overview
UniProt is a comprehensive protein sequence and functional annotation resource with more than 250 million entries. This skill covers programmatic access via the UniProt REST API for protein search, sequence retrieval, ID mapping, and annotation queries. Swiss-Prot entries are manually curated; TrEMBL entries are automatically annotated and not manually reviewed.
When to Use
- Searching for proteins by gene name, accession, organism, or function keywords
- Retrieving protein sequences in FASTA format for downstream analysis
- Mapping identifiers between databases (UniProt ↔ Ensembl, PDB, RefSeq, KEGG)
- Accessing protein annotations: GO terms, domains, post-translational modifications
- Batch retrieving multiple protein entries for comparative analysis
- Downloading reviewed (Swiss-Prot) protein datasets for a specific organism
- For unified access to 40+ databases, use bioservices instead
- For protein 3D structures, use alphafold-database or pdb-database
Prerequisites
pip install requests pandas
API Rate Limits: The UniProt REST API does not publish a strict rate limit, but pausing briefly between batch requests (for example time.sleep(0.5)) is a considerate default. For large queries (>10k results), use the streaming endpoint instead of paginated search. An ID mapping job accepts at most 100,000 IDs.
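The limits above can be respected with a small batching helper. The sketch below is illustrative only: `chunked` and `run_batches` are hypothetical names, not part of any UniProt client, and `submit` stands in for whatever function sends one batch to the API.

```python
import time
from typing import Callable, Iterator, List, Sequence

MAX_IDS_PER_JOB = 100_000  # ID mapping limit quoted in the paragraph above


def chunked(ids: Sequence[str], size: int = MAX_IDS_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def run_batches(ids: Sequence[str], submit: Callable[[List[str]], object],
                size: int = MAX_IDS_PER_JOB, pause: float = 0.5) -> list:
    """Call submit(batch) for every batch, sleeping `pause` seconds between calls."""
    outputs = []
    for index, batch in enumerate(chunked(ids, size)):
        if index:  # no need to wait before the first request
            time.sleep(pause)
        outputs.append(submit(batch))
    return outputs
```

Passing the request function in as `submit` keeps the throttling logic separate from the HTTP code, so the same helper can wrap a search, an ID mapping job, or a test stub.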
Quick Start
import requests
url = "https://rest.uniprot.org/uniprotkb/search"
params = {"query": "insulin AND organism_id:9606 AND reviewed:true", "format": "tsv",
          "fields": "accession,gene_names,protein_name,length"}
response = requests.get(url, params=params)
print(response.text[:500])
Core API
1. Protein Search
Search UniProt with structured queries combining Boolean operators and field-specific filters.
import requests
import time
BASE = "https://rest.uniprot.org/uniprotkb/search"
def search_uniprot(query, fields=None, format="json", size=25):
    """Search UniProt with query syntax."""
    params = {"query": query, "format": format, "size": size}
    if fields:
        params["fields"] = ",".join(fields)
    resp = requests.get(BASE, params=params)
    resp.raise_for_status()
    return resp.json() if format == "json" else resp.text

results = search_uniprot("gene:BRCA1 AND reviewed:true",
                         fields=["accession", "gene_names", "organism_name", "length"])
for entry in results["results"][:3]:
    # Entries without gene or organism data fall back to 'N/A'.
    gene = entry.get('genes', [{}])[0].get('geneName', {}).get('value', 'N/A')
    organism = entry.get('organism', {}).get('scientificName', 'N/A')
    print(f"{entry['primaryAccession']} | {gene} | {organism}")
Query syntax reference:
# Boolean operators
kinase AND organism_id:9606 # Human kinases
(diabetes OR insulin) AND reviewed:true
cancer NOT lung
# Field-specific
gene:BRCA1
accession:P12345
taxonomy_name:"Homo sapiens"
go:0005515 # GO term: protein binding
# Range queries
length:[100 TO 500]
mass:[50000 TO 100000]
# Wildcards
gene:BRCA*
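Queries like these are easy to get wrong when assembled by string concatenation, mostly because an OR group without parentheses changes what the AND filters apply to. The helpers below are a minimal sketch for composing the syntax shown above; `field`, `any_of`, and `all_of` are hypothetical names introduced here, not UniProt functions.

```python
def field(name: str, value: str) -> str:
    """Format one field filter, quoting values that contain spaces.

    Range expressions such as [100 TO 500] are left unquoted.
    """
    if " " in value and not value.startswith("["):
        value = f'"{value}"'
    return f"{name}:{value}"


def any_of(*terms: str) -> str:
    """Join terms with OR inside parentheses so the group stays intact."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Join terms with AND."""
    return " AND ".join(terms)


query = all_of(
    any_of(field("gene", "BRCA1"), field("gene", "BRCA2")),
    field("organism_id", "9606"),
    field("reviewed", "true"),
)
print(query)
```

The resulting string can be passed as the `query` parameter of the search function above.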
2. Protein Entry Retrieval
Retrieve individual protein entries by accession number.
import requests
def get_protein(accession, format="json"):
    """Retrieve a single protein entry."""
    # The format is chosen with a file-style suffix (P01308.json, P01308.fasta);
    # an Accept header of "application/fasta" is not a media type UniProt serves.
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.{format}"
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json() if format == "json" else resp.text

entry = get_protein("P01308")
print(f"Protein: {entry['proteinDescription']['recommendedName']['fullName']['value']}")
print(f"Gene: {entry['genes'][0]['geneName']['value']}")
print(f"Length: {entry['sequence']['length']} aa")
print(f"Sequence: {entry['sequence']['value'][:50]}...")

fasta = get_protein("P01308", format="fasta")
print(fasta[:200])
3. ID Mapping
Map identifiers between UniProt and other databases.
import requests
import time
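def map_ids(ids, from_db, to_db, poll_interval=1, max_polls=300):
    """Map identifiers between databases (async job)."""
    resp = requests.post("https://rest.uniprot.org/idmapping/run",
                         data={"from": from_db, "to": to_db, "ids": ",".join(ids)})
    resp.raise_for_status()
    job_id = resp.json()["jobId"]
    for _ in range(max_polls):
        status = requests.get(f"https://rest.uniprot.org/idmapping/status/{job_id}").json()
        # A finished job answers with results (or failedIds) rather than a pending jobStatus.
        if "results" in status or "failedIds" in status or status.get("jobStatus") == "FINISHED":
            break
        if status.get("jobStatus") not in ("NEW", "RUNNING"):
            raise RuntimeError(f"ID mapping job {job_id} ended with status {status.get('jobStatus')}")
        time.sleep(poll_interval)
    else:
        # Bounded polling avoids looping forever on a stuck job.
        raise TimeoutError(f"ID mapping job {job_id} did not finish after {max_polls} polls")
    results = requests.get(f"https://rest.uniprot.org/idmapping/results/{job_id}")
    results.raise_for_status()
    return results.json()

results = map_ids(["P01308", "P12345"], from_db="UniProtKB_AC-ID", to_db="PDB")
for r in results.get("results", []):
    print(f"{r['from']} → PDB: {r['to']}")

results = map_ids(["P01308"], from_db="UniProtKB_AC-ID", to_db="Ensembl")
for r in results.get("results", []):
    print(f"{r['from']} → Ensembl: {r['to']}")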
Common database codes: UniProtKB_AC-ID, Ensembl, RefSeq_Protein, PDB, Gene_Name, GeneID, KEGG
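One source identifier often maps to several targets, so the flat list of from/to pairs is usually easier to work with once grouped. The following is a minimal sketch, assuming each row's `to` value is a plain string as in the PDB and Ensembl examples above (for some target databases it may be a nested object instead). `group_mappings` is a hypothetical helper and the sample identifiers are made-up placeholders, not real API output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_mappings(payload: dict) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return ({source_id: [target_ids]}, failed_ids) from an ID mapping payload."""
    grouped = defaultdict(list)
    for row in payload.get("results", []):
        grouped[row["from"]].append(row["to"])
    return dict(grouped), list(payload.get("failedIds", []))


# Placeholder payload with invented identifiers, shaped like the results printed above.
sample = {
    "results": [
        {"from": "ACC_1", "to": "TARGET_A"},
        {"from": "ACC_1", "to": "TARGET_B"},
        {"from": "ACC_2", "to": "TARGET_C"},
    ],
    "failedIds": ["ACC_3"],
}
mapped, failed = group_mappings(sample)
print(mapped)
print(failed)
```

Returning the failed identifiers alongside the mapping makes it harder to overlook inputs that silently produced no result.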
4. Batch Retrieval and Streaming
Retrieve large datasets efficiently.
import requests
import time
def batch_retrieve(accessions, fields=None, format="tsv"):
    """Retrieve multiple proteins by accession (up to 500 per call)."""
    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    # /search returns 25 rows by default, so request the maximum page size.
    params = {"query": query, "format": format, "size": 500}
    if fields:
        params["fields"] = ",".join(fields)
    resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
    resp.raise_for_status()
    return resp.text

accessions = ["P01308", "P12345", "Q9Y6K9"]
tsv = batch_retrieve(accessions, fields=["accession", "gene_names", "protein_name", "length"])
print(tsv)

def stream_query(query, format="fasta"):
    """Stream large result sets."""
    # Passing params lets requests encode the query, so characters such as & or +
    # cannot corrupt the URL.
    resp = requests.get("https://rest.uniprot.org/uniprotkb/stream",
                        params={"query": query, "format": format}, stream=True)
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=8192, decode_unicode=True):
        yield chunk
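Streamed chunks are cut at arbitrary byte counts, so a FASTA header or sequence line can be split across two chunks. The sketch below shows one way to reassemble records from such a stream. It assumes the chunks are already decoded to `str`, and `iter_lines` and `iter_fasta` are hypothetical helper names; the sample chunks are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def iter_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete lines from text chunks that may end mid-line."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        *complete, pending = pending.split("\n")  # last piece may be partial
        yield from complete
    if pending:
        yield pending


def iter_fasta(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, one per FASTA record."""
    header, parts = None, []
    for line in iter_lines(chunks):
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


# Invented chunks that split both a sequence line and a header line.
demo_chunks = [">demo1 first\nMKT", "AA\n>dem", "o2 second\nGG\n"]
records = list(iter_fasta(demo_chunks))
print(records)
```

Because both functions are generators, records can be processed as they arrive without holding the whole download in memory.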
5. Pagination and Cursor-Based Iteration
Handle large result sets with pagination using the Link header cursor.
import requests
def paginate_search(query, fields=None, page_size=500):
    """Iterate all pages of a UniProt search using cursor pagination."""
    params = {"query": query, "format": "tsv", "size": page_size}
    if fields:
        params["fields"] = ",".join(fields)
    url = "https://rest.uniprot.org/uniprotkb/search"
    rows = []
    header = None
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        params = {}  # the "next" URL already carries the query, fields, size, and cursor
        lines = resp.text.strip().split("\n")
        if header is None:
            header = lines[0]
        rows.extend(lines[1:])  # every page repeats the header line
        # requests parses the Link header; there is no rel="next" entry on the last page.
        url = resp.links.get("next", {}).get("url")
    return header, rows

header, rows = paginate_search(
    "kinase AND organism_id:9606 AND reviewed:true",
    fields=["accession", "gene_names", "length"]
)
print(f"Retrieved {len(rows)} proteins")
print(header)
print("\n".join(rows[:3]))
6. Field Selection and Annotations
Customize which data fields to retrieve.
import requests
import pandas as pd
from io import StringIO
params = {
    "query": "gene:TP53 AND organism_id:9606 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,protein_name,go_p,go_f,go_c,cc_function,ft_domain",
}
resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
df = pd.read_csv(StringIO(resp.text), sep="\t")
print(df.columns.tolist())
print(df.iloc[0])
Common field groups:
- Sequence: accession, sequence, length, mass
- Names: gene_names, protein_name, organism_name
- GO: go_p (process), go_f (function), go_c (component)
- Features: ft_domain, ft_binding, ft_act_site, ft_mod_res
- Comments: cc_function, cc_interaction, cc_subcellular_location
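The GO fields arrive as a single text cell per protein, which usually needs splitting before analysis. The sketch below assumes each cell looks like `term name [GO:0000000]; other term [GO:0000001]`; check that assumption against real output before relying on it. `parse_go_cell` is a hypothetical helper name.

```python
import re

# One item: a term name followed by its GO identifier in square brackets.
GO_ITEM = re.compile(r"\s*(?P<name>.+?)\s*\[(?P<id>GO:\d{7})\]\s*$")


def parse_go_cell(cell):
    """Split one go_p/go_f/go_c cell into (go_id, term_name) pairs."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # pandas reads empty cells as NaN, which is a float
    pairs = []
    for item in cell.split(";"):
        match = GO_ITEM.match(item)
        if match:
            pairs.append((match.group("id"), match.group("name")))
    return pairs


cell = "DNA repair [GO:0006281]; protein binding [GO:0005515]"
print(parse_go_cell(cell))
```

With a DataFrame, the function can be applied to a GO column (for example via `Series.apply`) to get one list of pairs per protein.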
Key Parameters
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|---|---|---|---|---|
| query | /search, /stream | none | UniProt query syntax | Filter proteins by criteria |
| format | All endpoints | json | json, tsv, fasta, xml, gff | Output format |
| fields | /search | all | Comma-separated field names | Reduces response size |
| size | /search | 25 | 1–500 | Results per page |
| from / to | /idmapping/run | none | Database codes | ID mapping direction |
| reviewed:true | Query filter | none | true/false | Swiss-Prot (curated) only |
| organism_id | Query filter | none | NCBI taxonomy ID | Filter by species |
Best Practices
- Filter reviewed:true for curated data: Swiss-Prot entries are manually reviewed; TrEMBL entries are automatically annotated and unreviewed. Use Swiss-Prot for high-confidence annotations.
- Use TSV format with fields for tabular analysis: Requesting only the needed fields as TSV is faster and easier to parse than full JSON entries.
- Use streaming for large downloads: The /stream endpoint returns all results in one response, avoiding multi-page iteration.
- Add time.sleep(0.5) between batch requests: Respect API resources, especially when making many sequential requests.
- Cache frequently accessed entries locally: UniProt publishes a new release roughly every eight weeks, so cached results only need re-fetching after a release or when fresher data is required.
- Anti-pattern: querying without organism_id. Broad queries like gene:INS return thousands of entries across all species. Always filter by organism for targeted results.
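The caching advice above can be implemented with a few lines of file handling. This is a minimal sketch under stated assumptions: `cached_fetch` is a hypothetical helper, `fetch` is any function that returns JSON-serializable data for an accession, and the 30-day default for `max_age_days` is an arbitrary choice to be tuned to the release schedule.

```python
import json
import time
from pathlib import Path


def cached_fetch(accession, fetch, cache_dir, max_age_days=30):
    """Return fetch(accession), reusing a JSON file in cache_dir while it is fresh."""
    path = Path(cache_dir) / f"{accession}.json"
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days <= max_age_days:
            return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(accession)  # cache miss or stale entry: call the real fetcher
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Keeping the fetcher as a parameter means the same cache works for single-entry lookups, search results, or a stub in tests.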
Common Recipes
Recipe: Download All Human Kinases as DataFrame
import requests
import pandas as pd
from io import StringIO
url = "https://rest.uniprot.org/uniprotkb/stream"
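params = {
    # EC 2.7 covers all transferases of phosphorus-containing groups, so this query
    # returns a superset of kinases (it also includes, e.g., nucleotidyltransferases in 2.7.7).
    "query": "ec:2.7.* AND organism_id:9606 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,protein_name,length,go_f",
}
resp = requests.get(url, params=params)
resp.raise_for_status()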
df = pd.read_csv(StringIO(resp.text), sep="\t")
print(f"Human kinases (Swiss-Prot): {len(df)}")
print(df.head())
Recipe: Extract GO Annotations for a Gene Set
import requests
import pandas as pd
from io import StringIO
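gene_list = ["BRCA1", "BRCA2", "TP53", "ATM", "CHEK2"]
# Parenthesize the OR group; otherwise AND binds tighter and the organism and
# review filters apply only to the last gene.
query = "(" + " OR ".join(f"gene:{g}" for g in gene_list) + ")"
query += " AND organism_id:9606 AND reviewed:true"
params = {
    "query": query,
    "format": "tsv",
    "fields": "accession,gene_names,go_p,go_f,go_c",
}
resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
resp.raise_for_status()
df = pd.read_csv(StringIO(resp.text), sep="\t")
# In TSV output the accession field appears under the column header "Entry".
print(df[["Entry", "Gene Names", "Gene Ontology (biological process)"]].head())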
Recipe: Cross-Reference UniProt to PDB Structures
import requests
import time
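# UniProtKB_AC-ID accepts both accessions (P01308) and entry names (P53_HUMAN).
accessions = ["P53_HUMAN", "P01308", "P00533"]
resp = requests.post("https://rest.uniprot.org/idmapping/run",
                     data={"from": "UniProtKB_AC-ID", "to": "PDB", "ids": ",".join(accessions)})
resp.raise_for_status()
job_id = resp.json()["jobId"]
status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
# Poll while the job is pending; a fixed sleep can read the results too early.
while requests.get(status_url).json().get("jobStatus") in ("NEW", "RUNNING"):
    time.sleep(1)
results = requests.get(f"https://rest.uniprot.org/idmapping/results/{job_id}")
results.raise_for_status()
for r in results.json().get("results", []):
    print(f"{r['from']} → PDB: {r['to']}")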
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| 400 Bad Request | Invalid query syntax | Check Boolean operators, field names, and bracket matching against the UniProt query syntax docs |
| Too many results (slow) | No organism or review filter | Add AND organism_id:9606 AND reviewed:true to narrow results |
| ID mapping returns empty | Wrong database code | Verify from/to codes: use UniProtKB_AC-ID (not UniProtKB alone) |
| Pagination missing entries | Large result set | Use /stream endpoint instead of paginated /search |
| 429 Too Many Requests | Excessive API calls | Add time.sleep(0.5) between requests; batch accessions in single queries |
| FASTA has no gene name | TrEMBL entry with minimal annotation | Filter reviewed:true for Swiss-Prot entries with full annotations |
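For the 429 case in particular, retrying with an increasing delay is a common complement to a fixed sleep. The sketch below is one possible approach, not a UniProt recommendation: `TooManyRequests` and `with_backoff` are hypothetical names, and the caller's request function is assumed to raise `TooManyRequests` when it sees a 429 status, optionally passing along a Retry-After value in seconds.

```python
import time


class TooManyRequests(Exception):
    """Raised by the caller's request function when the server answers 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def with_backoff(call, retries=4, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying on TooManyRequests with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TooManyRequests as exc:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.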
Related Skills
- biopython-molecular-biology — parse FASTA sequences returned by UniProt; run BLAST with retrieved sequences
- alphafold-database — retrieve predicted 3D structures using UniProt accessions
- esm-protein-language-model — generate embeddings from UniProt protein sequences
- gget-genomic-databases — alternative interface for quick gene/protein lookups across databases