| name | arxiv-search |
| description | Search arXiv for preprints in physics, math, CS, quantitative biology, quantitative finance, statistics, electrical engineering, economics. Use when: (1) finding preprints by topic, (2) searching by author, (3) browsing arXiv categories, (4) getting paper metadata/abstracts. NOT for: published journal articles (use crossref-search), biomedical (use pubmed-search). |
| metadata | {"openclaw":{"emoji":"📄","requires":{"bins":["curl"]}}} |
arXiv Search
Search arXiv preprints through the public query API. Covers physics, math, CS,
q-bio, q-fin, statistics, electrical engineering, and economics.
API Endpoint
curl -s "http://export.arxiv.org/api/query?search_query=all:transformer+attention&start=0&max_results=5"
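Parameters: search_query= (search expression), id_list= (comma-separated arXiv IDs
for direct lookup; supply at least one of the two), start= (zero-based pagination offset),
max_results= (default 10; at most 2000 per request and 30000 per query in total),
sortBy=relevance|lastUpdatedDate|submittedDate, sortOrder=ascending|descending.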
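The parameters above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_query_url` and its keyword arguments are illustrative assumptions, not part of the arXiv API.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query=None, id_list=None, start=0, max_results=10,
                    sort_by=None, sort_order=None):
    """Assemble a query URL (hypothetical helper, not an arXiv-provided function)."""
    params = {}
    if search_query:
        params["search_query"] = search_query
    if id_list:
        # id_list takes comma-separated arXiv IDs
        params["id_list"] = ",".join(id_list)
    params["start"] = start
    params["max_results"] = max_results
    if sort_by:
        params["sortBy"] = sort_by
    if sort_order:
        params["sortOrder"] = sort_order
    # Keep ':' and ',' readable; spaces are encoded as '+'
    return f"{BASE_URL}?{urlencode(params, safe=':,')}"


print(build_query_url("all:transformer attention", max_results=5))
```

The printed URL matches the curl example in this section, so it can be passed to curl or to any HTTP client.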
Query Syntax
Field prefixes: ti: title, au: author, abs: abstract, co: comment,
jr: journal ref, cat: category, all: all fields.
Boolean: AND, OR, ANDNOT. Example:
curl -s "http://export.arxiv.org/api/query?search_query=au:bengio+AND+cat:cs.LG+AND+ti:attention&max_results=10"
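A small sketch of composing such queries in Python, so the prefixes and operators are not hand-typed each time. The helpers `field` and `combine` are hypothetical names introduced here for illustration.

```python
VALID_PREFIXES = {"ti", "au", "abs", "co", "jr", "cat", "all"}
VALID_OPERATORS = {"AND", "OR", "ANDNOT"}


def field(prefix, value):
    """Return one field-scoped term, e.g. 'au:bengio' (illustrative helper)."""
    if prefix not in VALID_PREFIXES:
        raise ValueError(f"unknown field prefix: {prefix}")
    return f"{prefix}:{value}"


def combine(operator, *terms):
    """Join terms with a single Boolean operator (illustrative helper)."""
    if operator not in VALID_OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f" {operator} ".join(terms)


query = combine("AND", field("au", "bengio"), field("cat", "cs.LG"),
                field("ti", "attention"))
# Spaces become '+' when the query is placed in the URL
print(query.replace(" ", "+"))  # au:bengio+AND+cat:cs.LG+AND+ti:attention
```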
Category Codes
Physics: astro-ph (.CO/.EP/.GA/.HE/.IM/.SR), cond-mat (.dis-nn/.mes-hall/.mtrl-sci/.soft/.stat-mech/.str-el/.supr-con), hep-ex, hep-lat, hep-ph, hep-th, quant-ph, gr-qc, nucl-ex, nucl-th
CS: cs.AI, cs.CL (NLP), cs.CV, cs.LG (ML), cs.CR, cs.DB, cs.DS, cs.SE, cs.RO
Math: math.AG, math.AP, math.CO, math.PR, math.ST
Other: q-bio (.BM/.CB/.GN/.MN/.NC/.PE/.QM/.SC/.TO), q-fin (.CP/.EC/.GN/.MF/.PM/.PR/.RM/.ST/.TR), stat (.AP/.CO/.ME/.ML/.OT/.TH), eess (.AS/.IV/.SP/.SY), econ (.EM/.GN/.TH)
Response Parsing
The API returns Atom XML. Parse with Python:
curl -s "http://export.arxiv.org/api/query?search_query=ti:large+language+model&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET

# Atom elements live in this namespace; map it to the short prefix a:
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for entry in root.findall('a:entry', ns):
    # Titles may wrap across lines; collapse all runs of whitespace
    title = ' '.join(entry.find('a:title', ns).text.split())
    # The entry id is the abstract URL; keep only the arXiv ID after /abs/
    aid = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    # Keep the date part (YYYY-MM-DD) of the timestamp
    pub = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    print(f'[{aid}] {pub} | {title}')
    print(f'  Authors: {authors}\n')
"
Direct Lookup and Pagination
curl -s "http://export.arxiv.org/api/query?id_list=2301.07041,2302.13971"
curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.AI&start=0&max_results=25&sortBy=submittedDate&sortOrder=descending"
curl -s "http://export.arxiv.org/api/query?search_query=cat:cs.AI&start=25&max_results=25&sortBy=submittedDate&sortOrder=descending"
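The two pagination requests above differ only in `start`. A minimal sketch of generating every page URL for a query follows; `page_urls` is a hypothetical helper, and `total` is assumed to be the match count already read from an earlier response.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def page_urls(search_query, total, page_size=25):
    """Yield one URL per page, newest submissions first (illustrative helper)."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            "max_results": page_size,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        # Keep ':' unescaped so the URL reads like the curl examples
        yield f"{BASE_URL}?{urlencode(params, safe=':')}"


for url in page_urls("cat:cs.AI", total=50):
    print(url)
```

With `total=50` and the default page size, this prints the same two URLs as the curl commands above.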
Rate Limiting
arXiv's API terms of use ask clients to make no more than 1 request every 3 seconds, over a single connection.
For large-scale harvesting, use the OAI-PMH bulk access endpoint instead.
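One way to keep bulk queries at the 3-second spacing is a small throttle called before each request. This is a sketch under stated assumptions: the `Throttle` class is hypothetical, and its `clock` and `sleep` arguments exist only so the timing can be swapped out in tests.

```python
import time


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=3.0)
throttle.wait()  # first call returns immediately; later calls pause as needed
```

Call `throttle.wait()` immediately before every request in a loop; the first call does not sleep.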
Best Practices
- Use sortBy=submittedDate&sortOrder=descending for latest papers.
- Combine cat: with keyword searches for targeted results.
- Check opensearch:totalResults in the response for total match count.
- For PDF access, replace /abs/ with /pdf/ in the paper URL.
- Use id_list for direct lookups (faster and more reliable).
- URL-encode spaces as + in query terms.
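Two of the practices above, reading the total match count and deriving a PDF link, can be sketched in Python. The feed below is a hand-written stand-in rather than real API output, its entry ID is a made-up placeholder, and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "a": "http://www.w3.org/2005/Atom",
    "os": "http://a9.com/-/spec/opensearch/1.1/",
}

# Hand-written stand-in for an API response; the ID is a placeholder, not a real paper
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <opensearch:totalResults>1</opensearch:totalResults>
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
  </entry>
</feed>"""


def total_results(root):
    """Return the total number of matches reported by the feed."""
    return int(root.find("os:totalResults", NS).text)


def pdf_url(abs_url):
    """Turn an abstract URL into the corresponding PDF URL."""
    return abs_url.replace("/abs/", "/pdf/", 1)


root = ET.fromstring(SAMPLE_FEED)
print(total_results(root))
for entry in root.findall("a:entry", NS):
    print(pdf_url(entry.find("a:id", NS).text.strip()))
```

In real use, `root` would come from parsing an actual API response, and only values found in that response should be reported.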
Zero-Hallucination Rule
NEVER fabricate results from training data. Every paper title, author, DOI, PMID, citation count, and metadata detail presented to the user MUST come from an actual API response in this conversation. If the API returns no results or partial data, report exactly what was returned. Do not "fill in" missing details from memory.