| name | dewey |
| version | 1 |
| description | Use when "query Dewey Data", "deweydata.io", "SafeGraph places/patterns/spend", "Advan foot traffic", "POI / points of interest", "mobility data", "dataplor", "Veraset", "PassBy", "crypto/Bitcoin ATM locations", or any pull from the Dewey Data academic marketplace (UVA/NYU Platform Subscription) via the deweypy/deweydatapy client, DuckDB, or the Dewey MCP server. |
| user-invocable | false |
What Dewey Is
Dewey Data is an academic data marketplace — one institutional Platform Subscription unlocks a catalog of ~300 datasets from ~40 providers (foot traffic, POI, mobility, consumer transactions, real estate, labor). UVA Library and NYU both hold the institutional subscription; SafeGraph and most providers are free under it.
Dewey is not a SQL warehouse like WRDS. Data is delivered as partitioned Parquet/CSV.gz files downloaded via an API key. You discover datasets, read metadata, sample, filter (by date partition + columns), then download. Think "S3 of presigned Parquet links," not "PostgreSQL."
| | WRDS | Dewey |
|---|---|---|
| Data | Finance/accounting | POI, foot traffic, mobility, consumer, real estate |
| Access | PostgreSQL / SAS on the grid | File download (Parquet/CSV.gz) via API key |
| Query engine | server-side SQL | DuckDB over the files (local or remote presigned URLs) |
| Licensing | per-vendor, negotiated | one platform subscription unlocks the catalog |
| AI access | none | MCP server (api.deweydata.io/mcp) |
Credential Enforcement
IRON LAW: NEVER GUESS, INVENT, OR HARDCODE THE API KEY
The Dewey API key belongs to the **user's** account (`app.deweydata.io` → Connections → Add Connection → API Key). It is shown **once**. You do not have it and cannot derive it.
- ALWAYS ask the user for the key before any real data pull. No exceptions.
- NEVER write a placeholder like `apikey = "your_api_key"` and run it: it will 401 and waste a round trip. Read the key from the `DEWEY_API_KEY` env var or a gitignored file (`~/.config/dewey/apikey`).
- NEVER commit the key, echo it back, or paste it into a script that gets committed.
Guessing or hardcoding the key is NOT HELPFUL — every call 401s, and a committed key is a security incident the user must rotate.
Each product (dataset) has its own product path / project ID (prj_…), obtained from the dataset page: Get Data → (Skip filtering) → Connect to API / Bulk API → API URL. One API key, many product paths. If you don't have the product path, discover it via the MCP server (search_datasets) rather than guessing.
Download Enforcement
IRON LAW: NO BULK DOWNLOAD WITHOUT METADATA + SAMPLE + FILTER FIRST
Before downloading ANY Dewey dataset, you MUST:
- IDENTIFY the product path and what partitions/columns you actually need
- META: call `get_meta` (deweydatapy) / `get_download_info` (MCP) to learn partition columns, date range, file count, total size
- SAMPLE: pull 100 rows (`read_sample` / MCP `sample_dataset`) and INSPECT the schema before committing to a full pull
- FILTER: restrict by date partition (`partition_key_after/before`) AND columns; for selective pulls use DuckDB `COPY TO` over the presigned URLs, never download the whole catalog
- DOWNLOAD the filtered subset, then verify row counts / NULLs / date range on disk
This is not negotiable. Skipping the sample-and-filter step is NOT HELPFUL — Dewey datasets are routinely hundreds of GB to multiple TB; an unfiltered pull burns hours of bandwidth and disk for data you'll immediately throw away.
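The FILTER step above can be sketched as a pre-download gate. This is a minimal illustration, not the client's API: the record fields `partition_key` and `size_bytes` and both helper names are hypothetical, and the window is assumed to be inclusive at both ends. Adapt the field names to whatever the metadata call actually returns.

```python
from datetime import date


def select_files(files, after, before):
    """Keep files whose partition date falls inside [after, before].

    `files` is a list of dicts with hypothetical keys `partition_key`
    (ISO date string) and `size_bytes`. An unscoped pull is refused.
    """
    if after is None or before is None:
        raise ValueError("Refusing an unscoped pull: set both ends of the study window")
    lo, hi = date.fromisoformat(after), date.fromisoformat(before)
    return [f for f in files if lo <= date.fromisoformat(f["partition_key"]) <= hi]


def total_gb(files):
    """Estimated download size in decimal gigabytes."""
    return sum(f["size_bytes"] for f in files) / 1e9


# Made-up metadata records, for illustration only.
files = [
    {"partition_key": "2024-01-01", "size_bytes": 2_000_000_000},
    {"partition_key": "2024-01-08", "size_bytes": 3_000_000_000},
    {"partition_key": "2024-02-05", "size_bytes": 4_000_000_000},
]
subset = select_files(files, "2024-01-01", "2024-01-31")
print(len(subset), "files,", total_gb(subset), "GB")
```

Printing the file count and size before any download gives the user a chance to narrow the window further.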
Dewey Facts
- SafeGraph Patterns is multi-TB; "download everything and filter in pandas" fills the disk before the filter ever runs, which is counterproductive on its own terms. Use DuckDB `COPY TO` with a `WHERE` clause on the remote parquet to pull only the rows/columns you need.
- Column names differ by provider and release (`naics_code` vs `NAICS_CODE`; `opened_on` may not exist at all). A full pull against guessed columns is the exact incompetence the sample step exists to prevent: run `read_sample(nrows=100)` BEFORE the full pull.
- Most datasets are date-partitioned weekly; "all of it" means every weekly file ever shipped. Set `partition_key_after/before` to the study window.
- Presigned links expire in 24h (`download_files0`). For large multi-day pulls use `download_files1` (page-by-page, refreshes links); a long job on `download_files0` dies mid-pull.
- A wrong `prj_` product path 404s or returns someone else's data. Get the path from Connect to API or MCP `search_datasets`; hardcoding a guessed path is an unverified claim presented as fact.
- Use `deweypy.get_dataset_files`, not `deweydatapy.get_meta`/`get_file_list`: the latter's `external-api/v3` endpoint is dead (returns non-JSON / 500 → `JSONDecodeError`), confirmed 2026-06-10. See references/deweypy-client.md.
- The download service throws transient HTTP 500s on individual presigned URLs, and one bad file aborts a whole-batch DuckDB `COPY read_csv([...])`. For filtered pulls: chunk (~20 files), retry per chunk re-minting fresh URLs, fall back to per-file skip; restartable via per-chunk parquet. Set `SET http_timeout=120000; SET http_retries=3;`. Worked example in references/deweypy-client.md.
- Some providers gate access behind extra terms (e.g. ConsumerEdge): the web "Get Data" flow shows an "I acknowledge…additional terms" modal you must accept once before the dataset is usable / its `prj_` path mints. Don't auto-accept a provider license without the user's OK.
- MCP tools load only at session start. After `claude mcp add … dewey-prod`, the `search_datasets`/`sample_dataset`/etc. tools are NOT available in the current session; start a new session to use them.
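The chunk-and-retry advice in the list above could look roughly like the following. This is a sketch under stated assumptions, not the worked example from references/deweypy-client.md: `mint_urls` and `fetch_batch` are hypothetical callables you would supply (one re-mints presigned URLs for a list of file ids, the other loads a batch and raises on failure), and the per-chunk parquet checkpointing is left out for brevity.

```python
def chunked(items, size=20):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def pull_in_chunks(file_ids, mint_urls, fetch_batch, chunk_size=20, retries=3):
    """Fetch files chunk by chunk, re-minting URLs on every attempt.

    mint_urls(ids)    -> list of fresh URLs (hypothetical callable)
    fetch_batch(urls) -> None, raises on any failure (hypothetical callable)
    Returns (done_ids, skipped_ids).
    """
    done, skipped = [], []
    for chunk in chunked(file_ids, chunk_size):
        for _attempt in range(retries):
            try:
                fetch_batch(mint_urls(chunk))  # fresh URLs each attempt
                done.extend(chunk)
                break
            except Exception:  # narrow this to the error your client raises
                continue
        else:
            # Chunk-level retries exhausted: isolate the bad file(s).
            for fid in chunk:
                try:
                    fetch_batch(mint_urls([fid]))
                    done.append(fid)
                except Exception:
                    skipped.append(fid)
    return done, skipped
```

Returning the skipped ids rather than raising lets a long job finish and report which files need a later retry.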
Red Flags — STOP Immediately If You're About To:
- Call `download_files*` without first calling `get_meta` + `read_sample` → STOP. Meta + sample first.
- Download a dataset with no `start_date`/`end_date` / partition filter → STOP. Scope the date range.
- Load a whole remote dataset into a DataFrame → STOP. Use DuckDB `COPY TO … (FORMAT PARQUET, PARTITION_BY …)` to persist a filtered subset to disk.
- Run a pull with `apikey="your_api_key"` or any guessed key → STOP. Ask the user; read from env/file.
- Write the API key into a script you'll commit → STOP. Env var or gitignored file only.
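As an illustration of the `COPY TO` pattern named above, the helper below only builds the SQL string; it does not contact Dewey or run DuckDB. The function name is hypothetical, and the example columns and filter (`NAICS_CODE = '522320'`, partitioning on `REGION`) are borrowed from the SafeGraph Global Places notes in this document purely as an assumption about a typical pull.

```python
def build_copy_sql(urls, columns, where, out_dir, partition_by=None):
    """Build a DuckDB COPY statement that persists a filtered subset.

    `columns` and `where` are interpolated as-is, so pass trusted input
    only. Single quotes inside URLs are escaped for the SQL literal.
    """
    url_list = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    options = "FORMAT PARQUET"
    if partition_by:
        options += f", PARTITION_BY ({partition_by})"
    return (
        f"COPY (SELECT {', '.join(columns)} "
        f"FROM read_parquet([{url_list}]) WHERE {where}) "
        f"TO '{out_dir}' ({options})"
    )


sql = build_copy_sql(
    urls=["https://example.invalid/part-0.parquet"],  # placeholder URL
    columns=["PLACEKEY", "LOCATION_NAME", "REGION", "NAICS_CODE"],
    where="NAICS_CODE = '522320'",
    out_dir="btm_subset",
    partition_by="REGION",
)
print(sql)
# To run it, pass `sql` to a DuckDB connection with the httpfs extension loaded.
```

Only the selected columns and matching rows are written to disk, which is the point of filtering before the download rather than after.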
Access Method Decision Table
| Need | Method | Reference |
|---|---|---|
| Discover/search datasets, check schema, sample — from inside Claude | MCP server (api.deweydata.io/mcp) | references/mcp.md |
| Scripted Python bulk download | deweypy (recommended) or deweydatapy (legacy, product_path API) | references/deweypy-client.md |
| Selective pull — specific columns/rows from huge datasets | DuckDB over presigned URLs (read_parquet($urls) + COPY TO) | references/duckdb.md |
| R workflow | deweyr (download_dewey()) | references/deweypy-client.md |
| One-off, dataset < 2.0 GB | UI CSV download (platform → project) | references/access-options.md |
| Analyze data already on disk | DuckDB / pandas / polars over *.parquet or *.csv.gz | references/access-options.md |
Authentication
Get the key once from app.deweydata.io → Connections → Add Connection → API Key. Store it out of source control:
mkdir -p ~/.config/dewey && echo 'YOUR_KEY' > ~/.config/dewey/apikey && chmod 600 ~/.config/dewey/apikey
import os, pathlib
apikey = os.environ.get("DEWEY_API_KEY") or pathlib.Path("~/.config/dewey/apikey").expanduser().read_text().strip()
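A slightly more defensive variant of the snippet above is sketched here. The helper name and the placeholder set are illustrative assumptions; the idea is simply to fail loudly, with a message that tells you to ask the user, instead of sending a placeholder that will 401.

```python
import os
import pathlib

# Illustrative placeholder values that must never be sent as a key.
PLACEHOLDERS = {"", "your_api_key", "YOUR_KEY"}


def load_dewey_api_key(key_file="~/.config/dewey/apikey"):
    """Return the key from DEWEY_API_KEY or `key_file`; never invent one."""
    key = os.environ.get("DEWEY_API_KEY", "").strip()
    if not key:
        path = pathlib.Path(key_file).expanduser()
        if path.is_file():
            key = path.read_text().strip()
    if key in PLACEHOLDERS:
        raise RuntimeError(
            "No Dewey API key found: ask the user, then set DEWEY_API_KEY "
            f"or write the key to {key_file}"
        )
    return key
```

The environment variable takes precedence over the file, matching the order in the one-liner above.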
Institutional login (to browse the catalog / create the key) is via UVA NetBadge (use your UVA email) or NYU SSO. The Platform Subscription is what makes SafeGraph etc. free — see references/datasets.md.
Quick Reference: Featured Datasets
| Provider | Dataset(s) | What it is |
|---|---|---|
| SafeGraph | Global Places (POI), Geometry, Spend, Patterns | POI master, building footprints, card spend, foot-traffic visit patterns |
| Advan Research | Monthly/Weekly Patterns, Home Panel | Foot traffic aggregated to place & census-block |
| dataplor | POI | Global POI, strong emerging-markets coverage |
| Veraset | Movement | Device-level mobility (institutional license only) |
| PassBy | Foot Traffic | Per-POI foot-traffic analytics |
| Consumer Edge / PDI | Spend / transactions | Card & product-level purchasing |
| LinkUp | Job postings | Labor-market activity |
| ATTOM / Dwellsy / RentHub | Real estate | Property records, rentals |
Full catalog (all ~250 datasets): references/catalog.md — every dataset grouped by category with time coverage, row count, size, and download access (machine-readable: references/catalog.csv). Featured-dataset detail + discovery workflow: references/datasets.md.
SafeGraph Global Places Quick Reference
Core POI schema — columns are UPPERCASE, NAICS_CODE is a string, BRANDS is a JSON-array string (extract with json_extract_string(BRANDS,'$[0].safegraph_brand_name')). Always sample before filtering.
| Column | Meaning |
|---|---|
| PLACEKEY | Stable unique POI id (join key across SafeGraph products) |
| LOCATION_NAME | POI name |
| BRANDS | JSON array: [{"safegraph_brand_name":"…"}] (not plain text) |
| STREET_ADDRESS,CITY,REGION,POSTAL_CODE,ISO_COUNTRY_CODE | Address (REGION=US state) |
| LATITUDE,LONGITUDE | Coordinates |
| NAICS_CODE,NAICS_CODE_2022 | 6-digit NAICS (string) |
| TOP_CATEGORY,SUB_CATEGORY | Category labels |
| OPENED_ON,CLOSED_ON,TRACKING_CLOSED_SINCE | Open/close dates (exist but sparsely populated; NULL for BTMs) |
Resolved empirically: crypto/Bitcoin ATMs (BTMs) do exist as standalone POIs under NAICS_CODE='522320', and all major operators are present. But OPENED_ON/CLOSED_ON are NULL for BTMs in the current release, so the data is a cross-section, not a time series. Full details, the 7 BTM operators, and the worked example: references/safegraph-places.md and examples/btm_safegraph_pull.py.
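For rows already on disk, the `BRANDS` column can be unpacked in Python much as `json_extract_string(BRANDS,'$[0].safegraph_brand_name')` does in DuckDB. A minimal sketch follows; the helper name is hypothetical and the brand string in the example is a made-up placeholder, not a real operator.

```python
import json


def first_brand(brands):
    """Return the first safegraph_brand_name from a BRANDS string, or None.

    BRANDS is a JSON-array string, so plain-text comparison against it
    would silently miss matches.
    """
    if not brands:
        return None
    try:
        parsed = json.loads(brands)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, list) and parsed and isinstance(parsed[0], dict):
        return parsed[0].get("safegraph_brand_name")
    return None


print(first_brand('[{"safegraph_brand_name": "Example Brand"}]'))
```

Returning `None` for empty, missing, or malformed values keeps unbranded POIs distinguishable from parse failures upstream only if you log them separately; the sketch treats both the same.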
Additional Resources
Reference Files
references/access-options.md — all download methods (UI, deweypy, deweydatapy, DuckDB, MCP, R), 24h link expiry, partitioning, reading data on disk
references/deweypy-client.md — deweypy (modern CLI + auth/download) and deweydatapy (get_meta, get_file_list, read_sample, download_files0/1) function reference; deweyr for R
references/duckdb.md — selective remote-Parquet pulls, COPY TO … PARTITION_BY pattern, querying downloaded files
references/mcp.md — Dewey MCP server URL, JSON config, the 9 tools, discovery → schema → sample workflow
references/datasets.md — featured-dataset catalog, UVA NetBadge / NYU institutional access, discovery workflow
references/catalog.md + catalog.csv — full enumerated catalog (~250 datasets / 39 partners) by category, with coverage / rows / column count / size / access
references/schemas.json — full column schemas for all ~250 datasets (keyed by slug → columns[] with name/type/description; 11,264 columns). Look up a dataset's columns here before pulling, instead of a live get_dataset_schema call
references/linkage.md — cross-dataset join-key map (placekey, ticker, cusip/cik, domain, person id, lat/long, fips, zip…) — which datasets combine and on what spine
references/safegraph-places.md — Global Places schema, NAICS 522320, BTM operator brands, opened_on/closed_on, the Bitcoin-ATM worked example
Example Files
examples/btm_safegraph_pull.py — acceptance test: filter SafeGraph Global Places to the 7 BTM operator brands + NAICS 522320, verify standalone-POI / open-close coverage, export the US subset to ~/projects/batm/