develop-web-translator
Develop a web translator that scrapes bibliographic data from a website. This is the most common translator type.
| name | develop-web-translator |
| description | Develop a web translator that scrapes bibliographic data from a website. This is the most common translator type. |
Fetch and read the Zotero translator documentation:
Also read index.d.ts in the repo root for type definitions. Give more weight to recently created translators when looking for examples.
Collect from the user:
From the URLs, derive the target regex.
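One way to sanity-check a candidate target before committing it is to run it against the example URLs. The sketch below is illustrative only: the host `journal.example.org`, the path layout, and the helper `matchesTarget` are hypothetical and not part of the repo's tooling.

```javascript
// Hypothetical example URLs; substitute the ones collected from the user.
const exampleUrls = [
	'https://journal.example.org/article/12345',
	'https://www.journal.example.org/article/67890?tab=abstract',
	'https://journal.example.org/search?q=zotero',
];

// Candidate target: optional "www.", then the article or search paths.
// Dots and the question mark are escaped so they match literally.
const target = '^https?://(www\\.)?journal\\.example\\.org/(article/|search\\?)';

// Rough stand-in for target matching: test the pattern against a page URL.
function matchesTarget(url, pattern = target) {
	return new RegExp(pattern).test(url);
}

// Any URL listed here means the target is too narrow.
const unmatched = exampleUrls.filter(url => !matchesTarget(url));
console.log(unmatched.length
	? `Unmatched: ${unmatched.join(', ')}`
	: 'All example URLs match');
```

A target that matches every example URL but nothing on unrelated hosts is usually a reasonable starting point; it can be tightened later if detection fires on pages that carry no bibliographic data.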
DO NOT fetch site pages with WebFetch, curl, or any HTTP tool. Use the tools instead:
node .bin/capture-har.mjs "<example url>"
Read the generated YAML file. It contains full API schemas. This is your source of truth.
node .bin/inspect-page.mjs "<example url>"
This gives you the page's meta tags, its accessibility tree, and a screenshot.
Check the inspect-page meta tags first:
Embedded Metadata (EM) — if the page has Highwire Press tags (citation_title, citation_author, citation_doi, etc.), Dublin Core (DC.title, etc.), or good JSON-LD with bibliographic data, use EM. This is the most common approach (~180 translators use it):
async function scrape(doc, url = doc.location.href) {
	let translator = Zotero.loadTranslator('web');
	translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // EM
	translator.setDocument(doc);
	translator.setHandler('itemDone', (_obj, item) => {
		// fix up fields EM gets wrong
		item.complete();
	});
	await translator.translate();
}
Call await translator.getTranslatorObject() only if you need to customize EM before translation (e.g. setting itemType).
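As a minimal sketch of that customization, the variant below fetches the EM translator object, sets an item type, and then runs EM through its own `doWeb()` rather than `translator.translate()`. The item type `journalArticle` is an assumption chosen for illustration; use whatever the site actually publishes.

```javascript
async function scrape(doc, url = doc.location.href) {
	let translator = Zotero.loadTranslator('web');
	translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // EM
	translator.setDocument(doc);
	translator.setHandler('itemDone', (_obj, item) => {
		item.complete();
	});

	// Get the EM translator object so it can be configured before it runs
	let em = await translator.getTranslatorObject();
	em.itemType = 'journalArticle'; // assumed item type for this sketch
	// Run EM via its own doWeb() so the setting above takes effect
	await em.doWeb(doc, url);
}
```

Setting the type up front is mainly useful when the page's tags would otherwise lead EM to guess a generic type such as `webpage`.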
DOI search — if the page doesn't have rich metadata but you can extract a DOI, use a search translator to look it up via DOI Content Negotiation:
async function scrape(doc, url = doc.location.href) {
	// Take the first DOI-looking string from a link whose href contains "/doi/"
	let doi = doc.querySelector('a[href*="/doi/"]')?.href.match(/10\.\d{4,}\/[^\s]+/)?.[0];
	if (!doi) return;
	let translate = Zotero.loadTranslator('search');
	translate.setSearch({ DOI: doi });
	// Swallow lookup errors so a failed search doesn't abort the translator
	translate.setHandler('error', () => {});
	translate.setHandler('itemDone', (_obj, item) => {
		item.complete();
	});
	await translate.translate();
}
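The DOI extraction step is the part most likely to need tuning per site, so it can help to isolate it in a small helper that is easy to check on its own. The following is a sketch under assumptions: `extractDOI` is a hypothetical helper, and it treats `?` and `#` as the end of the DOI, which suits typical link hrefs but would truncate the rare DOI that contains those characters.

```javascript
// Pull a DOI out of a link href, tolerating percent-encoding and trailing
// query strings or fragments. Returns null when nothing DOI-like is found.
function extractDOI(href) {
	if (!href) return null;
	let decoded;
	try {
		decoded = decodeURIComponent(href);
	}
	catch (e) {
		decoded = href; // malformed escape sequence: fall back to the raw string
	}
	let match = decoded.match(/10\.\d{4,}\/[^\s?#]+/);
	if (!match) return null;
	// Strip punctuation that sometimes trails a DOI
	return match[0].replace(/[.,;)]+$/, '');
}

console.log(extractDOI('https://example.org/doi/10.1234%2Fabc-1?journalCode=x'));
```

Inside `scrape()`, the result would be passed to `setSearch({ DOI: doi })` exactly as in the snippet above. Zotero's utilities also provide `ZU.cleanDOI()`, which may be preferable where it fits.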
API-based — the site has a clean JSON API visible in the YAML. Call it with requestJSON().
HTML scraping — no useful APIs or metadata. Parse the DOM directly. Last resort.
Hybrid — combine any of the above.
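For the API-based approach, most of the work is mapping the API's JSON onto Zotero fields. The sketch below assumes a made-up record shape, a made-up endpoint, and a hypothetical helper `fieldsFromRecord`; the real field names should come from the generated YAML. Keeping the mapping in a pure function makes it easy to check without a running translator.

```javascript
// Hypothetical API record; a real one should follow the schema in the YAML.
const sampleRecord = {
	title: '  An Example   Article ',
	authors: [{ given: 'Ada', family: 'Lovelace' }, { family: 'Consortium X' }],
	published: '2021-03-04T00:00:00Z',
	doi: '10.1234/example',
	journal: { name: 'Journal of Examples', volume: 12 },
};

// Convert a record into plain values keyed by Zotero field names.
function fieldsFromRecord(record) {
	return {
		itemType: 'journalArticle',
		title: (record.title || '').replace(/\s+/g, ' ').trim(),
		creators: (record.authors || []).map(a => (a.given
			? { firstName: a.given, lastName: a.family, creatorType: 'author' }
			// Single-field name (e.g. an organization)
			: { lastName: a.family, creatorType: 'author', fieldMode: 1 })),
		date: (record.published || '').slice(0, 10),
		DOI: record.doi || '',
		publicationTitle: record.journal?.name || '',
		volume: record.journal?.volume != null ? String(record.journal.volume) : '',
	};
}

// How the mapping might be used inside a translator (URL layout is assumed).
async function scrape(doc, url = doc.location.href) {
	let id = url.match(/\/article\/(\d+)/)?.[1];
	if (!id) return;
	let record = await requestJSON('https://journal.example.org/api/articles/' + id);
	let fields = fieldsFromRecord(record);
	let item = new Zotero.Item(fields.itemType);
	Object.assign(item, fields);
	item.complete();
}

console.log(fieldsFromRecord(sampleRecord));
```

The same structure works for the hybrid approach: call the API for the core fields, then fill any gaps from the DOM or from Embedded Metadata.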
node .bin/init-translator.mjs --label "<Label>" --creator "<Creator>" --target "<regex>" --type web
Implement detectWeb(doc, url), getSearchResults(doc, checkOnly), doWeb(doc, url), and scrape(doc, url).
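A skeleton for the first three of those functions might look like the following. It is a sketch, not the repo's template: the `/article/` URL test and the CSS selector are placeholders for whatever the target site uses, and `scrape()` is assumed to be whichever variant was chosen from the approaches above. Real translators commonly clean titles with `ZU.trimInternal()`; plain string methods are used here so the detection logic can run outside Zotero.

```javascript
function detectWeb(doc, url) {
	if (url.includes('/article/')) { // placeholder single-item test
		return 'journalArticle';
	}
	else if (getSearchResults(doc, true)) {
		return 'multiple';
	}
	return false;
}

// Returns a { url: title } map, or false when no results are found.
// With checkOnly set, returns true as soon as one valid row is seen.
function getSearchResults(doc, checkOnly) {
	let items = {};
	let found = false;
	let rows = doc.querySelectorAll('h2 > a[href*="/article/"]'); // placeholder selector
	for (let row of rows) {
		let href = row.href;
		let title = (row.textContent || '').replace(/\s+/g, ' ').trim();
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

async function doWeb(doc, url) {
	if (detectWeb(doc, url) == 'multiple') {
		// Let the user pick which results to save
		let items = await Zotero.selectItems(getSearchResults(doc, false));
		if (!items) return;
		for (let itemUrl of Object.keys(items)) {
			await scrape(await requestDocument(itemUrl));
		}
	}
	else {
		await scrape(doc, url);
	}
}
```

Keeping `detectWeb()` cheap and free of network requests matters because it runs on every matching page load, while `doWeb()` only runs when the user asks to save.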
node .bin/create-test.mjs "<Label>.js" --url "<example url>"
Include at least one single-item test and one multiple-item test (if supported).
Update lastUpdated every time you modify translator code. Zotero uses it to determine when to push updates to users.
node .bin/update-metadata.mjs "<Label>.js"
npm run lint -- "<Label>.js"
node .bin/run-tests.mjs "<Label>.js"
All tests must pass. Then create a branch and open a pull request.