| name | bigdata-skill |
| description | Pull Bigdata.com (RavenPack) financial and news data through the official `bigdata-client` SDK and its public `/v1/*` REST endpoints when the Bigdata MCP server returns only pre-synthesized tearsheets but you need the machine-readable substrate underneath. MCP search returns prose chunks (text + relevance only — no per-chunk sentiment, no entity spans); its tearsheets give only aggregate values, not computable time series or per-field JSON. This skill bundles a verified, cost-guarded toolkit over the official REST API: annotated chunk search, entity/ISIN resolution, analyst estimates, calendar/surprise/ ratings/targets, financial statements, TTM metrics & ratios, prices, dividends, revenue segments, a daily entity-sentiment series, co-mention graph, screener, and batch search. Use it whenever the user mentions Bigdata.com, RavenPack, a `bd_v2_` key, the bigdata MCP, rp_entity_id, chunk/query_unit cost, or wants structured financials, fundamentals, prices, sentiment, or annotated news. |
Bigdata.com SDK + REST Toolkit
Get the structured substrate the Bigdata.com MCP server doesn't hand over. The
MCP returns clean prose and pre-synthesized tearsheets, but its search tool
gives chunks with no per-chunk sentiment or entity spans, and its tearsheets
give aggregate values — not the fiscal-period time series, universe screener, or
per-field JSON you'd build a pipeline on. The official bigdata-client SDK plus
a thin REST passthrough over the same backend, same JWT reach the official
/v1/* endpoints that hold it. This skill bundles a toolkit that does exactly
that — already debugged, already cost-guarded — so you don't re-pay the
discovery cost.
The core problem this solves (read this first)
The Bigdata MCP server answers "what's the sentiment around NVIDIA?" with a
readable paragraph or a pre-synthesized tearsheet — genuinely useful for a chat
turn. But the moment you need the machine-readable substrate to build a
pipeline on, the MCP doesn't hand it over:
- its search tool returns chunks with text + relevance only — no per-chunk
sentiment number, no entity character spans;
- its tearsheets give aggregate values (a single sentiment score, a summary
of estimates) — not a fiscal-period time series you can compute on, a
universe screener, or per-field JSON.
The fix is a general pattern, not a Bigdata trick:
When an MCP data source returns only synthesized output but you need the
structured fields underneath, drop to the vendor SDK or REST. MCP optimizes
for a chat turn, not a pipeline.
Crucially, for Bigdata these structured fields are official, publicly
documented REST endpoints (docs.bigdata.com/api-reference/...), not a hidden
backend — and Bigdata is sunsetting the SDK (EOL 2026-12-31) in favour of this
REST API, so the REST layer here is the forward-compatible path, not a hack.
The SDK (bigdata_client.Bigdata) covers search + knowledge-graph; bd._api.http
reaches every /v1/* endpoint the SDK never wrapped. The bundled
bigdata_toolkit packages both behind one BigdataClient.
When to use this skill
Trigger on any of these, in any language:
- The user is using Bigdata.com / RavenPack and the MCP result feels thin —
"where's the sentiment score?", "I need entity-level data", "the calendar".
- They want forward / structured financials for a ticker: analyst
estimates, earnings or event calendar, earnings surprise, analyst ratings,
price targets, a company screener / universe.
- They want annotated news chunks with numeric sentiment + entity spans, or
a sentiment time series / co-mention graph.
- They mention a
bd_v2_ API key, rp_entity_id, query_unit / chunk
cost, bigdata-client, or "the bigdata MCP isn't enough".
- They're building an investment-research dataset and need a reusable,
cost-aware data-pull layer rather than one-off MCP calls.
Setup (one time)
1 — API key (never hardcode it). The client fail-fasts if it's missing:
export BIGDATA_API_KEY=bd_v2_xxxxxxxx
2 — An isolated Python env with the official SDK. The bundled toolkit
imports bigdata_client; install it once:
uv venv .venv --python 3.12
uv pip install --python .venv/bin/python bigdata-client
3 — Outbound proxy (only if your network needs one to reach
api.bigdata.com). Two equivalent options — the official SDK accepts both: an
env var, or BigdataClient(proxy=...) in code. The env var is simplest:
export HTTPS_PROXY=http://<host>:<port>
If a proxy does TLS interception (self-signed CA) and you hit SSL handshake
errors, the official fix is BigdataClient(verify_ssl="<proxy-CA>.pem") — not
blind retries.
4 — Make the bundled package importable by putting this skill's scripts/
on PYTHONPATH (or sys.path.insert(0, "<this-skill>/scripts")).
Smoke-test the whole path (entity resolve + quota are free; --with-search
adds one ~1 query_unit chunk search):
BIGDATA_API_KEY=bd_v2_xxx PYTHONPATH=scripts .venv/bin/python scripts/probe_example.py
Quickstart
import sys
sys.path.insert(0, "<this-skill>/scripts")
from bigdata_toolkit import (
BigdataClient, EntityResolver, AnnotatedSearcher,
StructuredDataREST, CostTracker, CostModel, rc,
)
c = BigdataClient()
er = EntityResolver(c)
nvda = rc(lambda: er.resolve_id("NVIDIA", country="US"))
rest = StructuredDataREST(c)
est = rc(lambda: rest.analyst_estimates(nvda, period="quarter", limit=5))
surp = rc(lambda: rest.latest_surprise(nvda))
cal = rc(lambda: rest.events_calendar(nvda, categories=["earnings-call"],
start_date="2026-06-01", end_date="2026-12-31"))
s = AnnotatedSearcher(c)
docs = rc(lambda: s.search_entity(nvda, keyword="data center", chunk_limit=10))
ct = CostTracker(c); ct.snapshot()
print(ct.delta())
Wrap every network call in rc(lambda: ...) — a first-handshake SSL: UNEXPECTED_EOF is common and the SDK's internal retry doesn't cover it.
Routing — which capability answers the question
| The user wants… | Use | Module |
|---|
Company name / ISIN / CUSIP / SEDOL → rp_entity_id | EntityResolver.resolve_id / .resolve_by_isin | kg.py (SDK) |
| Forward analyst consensus (revenue/EPS by fiscal period) | StructuredDataREST.analyst_estimates | rest_ext.py |
| Latest earnings surprise (actual vs estimate) | .latest_surprise | rest_ext.py |
| Upcoming earnings / event calendar (one name or whole market) | .events_calendar | rest_ext.py |
| Analyst ratings / price-target consensus | .analyst_ratings / .price_target | rest_ext.py |
| Full financial statements (income / balance / cash-flow, multi-year) | .income_statement / .balance_sheet / .cash_flow_statement | rest_ext.py |
| TTM valuation metrics & ratios (EV/EBITDA, ROE, P/E, margins) | .key_metrics_ttm / .company_ratios_ttm | rest_ext.py |
| Company profile (CEO, sector, employees, IPO date) | .company_profile | rest_ext.py |
| Daily OHLC prices / dividend history | .daily_prices / .dividends | rest_ext.py |
| Revenue by geography / product segment | .revenue_geographic_segments / .revenue_product_segments | rest_ext.py |
| Daily entity-sentiment time series (don't self-aggregate from chunks!) | .entity_sentiment | rest_ext.py |
| Co-mention graph (supply-chain / competitor / customer — ⚠️ chunk-billed) | .connected_entities | rest_ext.py |
| Build a universe by market-cap / sector / country | .company_screener | rest_ext.py |
| News/filing/transcript chunks with sentiment + entity spans | AnnotatedSearcher.search_entity | search.py (SDK) |
| Bulk-pull many searches 50% cheaper (portfolio backfill) | BatchSearch (create→upload→poll→download) | rest_ext.py |
| Track / forecast quota spend before a backfill | CostTracker / CostModel | cost.py |
| Hit an endpoint the toolkit hasn't wrapped yet | client.http.post("v1/<resource>/query", body) | client.py |
income/balance/cash-flow/daily-prices/dividends/revenue-segments return
{fields, values} — wrap them in fields_values_to_records() to get
[{field: value}]. The *_ttm / company_profile endpoints are already flat.
All structured endpoints above are free (0 chunks) except
connected_entities and AnnotatedSearcher (chunk-billed).
The two data faces (do NOT say "Bigdata fails for Chinese / A-shares")
This split is the most important non-obvious conclusion — state it precisely:
| Face | Path | A-share / Chinese verdict |
|---|
| Structured financial (estimates, calendar, surprise, ratings, target, screener, financials, prices, dividends, revenue segments, daily entity-sentiment) | REST (rest_ext.py) | Works — via rp_entity_id resolved from the English name or ISIN (not the Chinese name). Data is fresh. Minor holes (some A-share price-targets return the entity with no numeric target). The daily entity_sentiment series lives here and works for any resolvable entity — it is not the dead end below. |
| Unstructured Chinese NLP (Chinese-news entity detection, per-chunk Chinese sentiment) | SDK search (search.py) | Dead end — a data-source-level gap, not an SDK bug: Chinese entity detection ≈ 0, per-chunk CJK sentiment is a doc-level inherited value, and language mislabels Chinese filings as English. Pair Bigdata with a China-domestic source for Chinese-language chunk content; use Bigdata for the structured face (incl. aggregate entity_sentiment) + ISIN/KG crosswalk + English-language chunk sentiment. |
Cost discipline
1 query_unit = 10 chunks (official). Only chunk-search is billed — the
structured /v1/* endpoints (estimates, financials, prices, calendar, surprise,
ratings, the sentiment time series, screener…) are free (0 chunks,
contract-tested). connected_entities (co-mentions) and AnnotatedSearcher
are chunk-billed.
Three levers when you do pay for chunks:
ChunkLimit, never a bare int. Search.run(int) is a document limit
billed by the full chunk page; ChunkLimit(n) bills per chunk.
AnnotatedSearcher.search forces ChunkLimit for you. (We observed roughly a
52x gap once — a single measured data point, not stated in the official
docs; treat the exact multiple as indicative. The rule "use ChunkLimit"
holds regardless, because max_chunks is the official billing unit.)
- Rerank bills only the returned chunks (official) — pass a
rerank_threshold to recall broadly but pay only for the high-relevance hits.
- Batch search is 50% cheaper (
$0.0075 vs $0.015 / qu) — use
BatchSearch for a large multi-query backfill.
Use CostModel to veto an over-budget job before running it, and
CostTracker.snapshot() / delta() to measure real spend. Full accounting →
references/cost_accounting.md.
Known pitfalls (already solved — don't re-debug these)
Each cost real debugging time and is fixed or guarded in the toolkit. Full
reproductions and fixes in references/known_pitfalls.md:
- First-handshake
SSL: UNEXPECTED_EOF → wrap calls in rc(); the SDK's
urllib3 retry only covers HTTP status, not the SSL EOF.
All(entity, Keyword(kw)) raises TypeError → combine with the &
operator (entity & Keyword(kw)); All takes a single iterable. (Fixed in
AnnotatedSearcher.entity_query.)
- The 52x doc-limit billing trap → always
ChunkLimit, never a bare int.
- Closure capture in loops → bind loop vars:
rc(lambda q=q, dr=dr: ...).
analyst_estimates(period="quarter") 400s above limit≈20.
company_screener filters must nest under "filters" — flat top-level
keys don't 400, they're silently dropped → unfiltered universe.
Document.reporting_period is always None (the SDK model drops a field
present on the REST wire) → fetch_reporting_period_raw.
What this skill will not do
- Never hardcode an API key.
BigdataClient reads BIGDATA_API_KEY and
fail-fasts if absent — no plaintext fallback (that is exactly the pattern
secret scanners catch).
- Only ever reads — never writes or uploads. Every method is a read-only
query (
uploads is NotImplementedError in API-key mode anyway), so the
toolkit can't mutate your account or push data anywhere.
- Never invent an endpoint or a schema. Every signature here is runtime
L4-verified or marked L3 (doc-confirmed, not yet run); see
references/verified_api_signatures.md. For a new endpoint, confirm the path
via docs.bigdata.com/llms.txt rather than guessing.
File layout
bigdata-skill/
├── SKILL.md # this file — routing + setup + quickstart
├── scripts/
│ ├── bigdata_toolkit/ # the verified, cost-guarded package
│ │ ├── client.py # BigdataClient: SDK (.bd) + REST escape hatch (.http/.conn)
│ │ ├── kg.py # EntityResolver: name/ISIN/CUSIP/SEDOL → rp_entity_id
│ │ ├── search.py # AnnotatedSearcher: chunks + sentiment + entity spans (SDK)
│ │ ├── rest_ext.py # StructuredDataREST (estimates/financials/prices/dividends/sentiment/co-mentions/screener) + BatchSearch + fields_values_to_records — official REST
│ │ ├── cost.py # CostTracker + CostModel: chunk billing + budget veto
│ │ └── retry.py # rc(): SSL/transient-error retry passthrough
│ └── probe_example.py # runnable end-to-end smoke test
└── references/
├── escape_hatch_architecture.md # WHY the MCP is lossy; bd._api.http mechanism; adding endpoints
├── verified_api_signatures.md # L4/L3-verified signatures + the two data faces, with evidence
├── cost_accounting.md # chunk billing, the 52x trap, CostModel/CostTracker, budgeting
└── known_pitfalls.md # every pitfall above, with reproduction + fix
References
| Read when you need to… | File |
|---|
Understand why the MCP is insufficient and how the REST escape hatch works (and how to wrap a new /v1/* endpoint) | references/escape_hatch_architecture.md |
| Look up an exact verified method signature + its verification level | references/verified_api_signatures.md |
| Budget a backfill or debug a surprise quota burn | references/cost_accounting.md |
| Diagnose an error you hit while pulling data | references/known_pitfalls.md |