| name | genai-conformance |
| description | Run, interpret, and iterate on the OpenInference GenAI conformance MVP at python/openinference-instrumentation/scripts/conformance/. Use when the user mentions GenAI conformance, OTel GenAI semantic conventions, Weaver registry live-check, the dual-write conversion (`_genai_conversion.py`, `enable_genai_semconv`), `gen_ai.*` attribute coverage, or asks to add new providers / scenarios to the conformance harness. |
GenAI Conformance
The repo ships a self-contained conformance harness at python/openinference-instrumentation/scripts/conformance/ that exercises OpenInference instrumentors against deterministic mock provider APIs, exports OTLP traces to weaver registry live-check, and prints a console summary of registry attributes seen / missing / advice-level counts. It validates the dual-write logic in _genai_conversion.py that translates OpenInference's native attributes (llm.*, input.*, output.*, openinference.*) into the OTel GenAI semantic conventions (gen_ai.*).
When to Use
- User asks to run the conformance harness, "test conformance", or "run weaver".
- User asks to maximize / improve
gen_ai.* registry coverage.
- User wants to extend the dual-write conversion in
_genai_conversion.py.
- User wants to add a new provider, a new test scenario, or a new mock endpoint.
- User mentions specific
gen_ai.* attributes (response.id, system_instructions, tool.call., retrieval., etc.) and whether they're being emitted.
Layout
scripts/conformance/
├── run.py # orchestrator (PEP 723, stdlib only)
├── mock_server.py # Flask mock with all providers' endpoints
├── anthropic_conformance.py # PEP 723 + editable [tool.uv.sources]
├── openai_conformance.py # PEP 723 + editable [tool.uv.sources]
├── google_genai_conformance.py # PEP 723 + editable [tool.uv.sources]
├── README.md
└── results/ # gitignored Weaver output
Each provider script declares its deps as PEP 723 inline metadata and pins the local OpenInference packages via [tool.uv.sources.<pkg>] blocks (multi-section dotted-key form — single-line inline tables exceed ruff's 100-char limit). run.py invokes everything via uv run. Filenames avoid the bare provider name (openai.py, anthropic.py) because that would shadow the SDK package on sys.path[0].
run.py lives in PROVIDER_SCRIPTS — a tuple iterated for both prewarm and execution. To add a provider, append to PROVIDER_SCRIPTS and add the corresponding <provider>_conformance.py and any new mock endpoints.
Running
uv run python/openinference-instrumentation/scripts/conformance/run.py
First run downloads pinned weaver v0.22.1 and semantic-conventions v1.40.0 to ~/.cache/oi-conformance/; subsequent runs are fast. uv caches each provider script's env by PEP 723 metadata hash.
Interpreting the summary
- Registry attributes seen —
gen_ai.* (and a few service.* / telemetry.sdk.*) attrs the run emitted, with sample counts.
- Non-registry attributes seen — OpenInference's native vocabulary. These show up as Weaver
missing_attribute violations by design — they aren't (and shouldn't be) in the OTel registry.
- Missing registry attributes (
gen_ai.*) — registry attrs the run did not emit. Categorize each one:
- Real dual-write gap — provider API has the data, instrumentor captures it as an OI attr, but
_genai_conversion.py doesn't map it. Fixable in conversion.
- Test scenario gap — conversion handles it, but the test doesn't exercise the relevant scenario (e.g.
gen_ai.tool.call.* need a TOOL span; gen_ai.embeddings.* need an EMBEDDING span). Fixable in <provider>_conformance.py.
- Mock data gap — instrumentor would capture it if the response included it (e.g.
gen_ai.usage.cache_read.input_tokens requires cache_read_input_tokens in the mock's usage block). Fixable in mock_server.py.
- Provider doesn't support it — e.g. Anthropic has no
frequency_penalty. Document and skip.
- Application-level / not auto-emittable —
gen_ai.agent.*, gen_ai.evaluation.*, gen_ai.prompt.name, gen_ai.data_source.id. Require explicit user attribution; out of scope for SDK instrumentation.
- Metric-only —
gen_ai.token.type lives on gen_ai.client.token.usage metric, not spans.
- Advice levels —
violation counts are predominantly missing_attribute for the OI native vocab (expected); improvement counts are not_stable warnings for development-stage gen_ai.* attrs (also expected). The dual-write itself is well-formed — Weaver does not flag type/shape/value errors on the emitted gen_ai.* attrs.
Iterating to maximize coverage
For category 1 (dual-write gap):
- Inspect
results/live_check.json to see exactly what OI attributes the instrumentor emitted (look for the relevant span's attributes array).
- Decide where to extend
_genai_conversion.py (get_genai_request_attributes, get_genai_response_attributes, etc.).
- Always add a unit test in test_genai.py for the new path. The existing tests cover the major span kinds; mirror that style.
- Re-run the conformance harness. Verify the missing list shrinks and no existing
gen_ai.* attribute regressed.
For category 2 (test scenario gap):
- Use
OITracer(trace.get_tracer(__name__), TraceConfig(enable_genai_semconv=True)) to manually emit non-LLM spans (TOOL, RETRIEVER, EMBEDDING, AGENT) inside a provider script. The Anthropic script already does this for TOOL / RETRIEVER / EMBEDDING — copy the pattern.
For category 3 (mock data gap):
- Mock responses are simple dicts at the top of
mock_server.py. The Anthropic mock already returns cache_creation_input_tokens / cache_read_input_tokens; the OpenAI mock returns prompt_tokens_details.cached_tokens. Add fields the SDK will surface and the OI instrumentor will turn into LLM_TOKEN_COUNT_PROMPT_DETAILS_CACHE_*.
Bumping the semconv version
The harness pins SEMCONV_VERSION (currently v1.41.1) and WEAVER_VERSION (currently v0.23.0) in run.py. When OTel cuts a new semconv release, walk this checklist:
- Check for a newer Weaver release too — always run
gh release list --repo open-telemetry/weaver --limit 5 alongside the semconv check. Weaver and the registry version independently; the harness depends on both. Bump WEAVER_VERSION whenever a newer release exists, and skim its notes for live-check-relevant fixes.
- Bump the constants in run.py:
SEMCONV_VERSION and WEAVER_VERSION to the latest releases.
- Run the harness once (
uv run python/openinference-instrumentation/scripts/conformance/run.py) so it downloads the new registry into ~/.cache/oi-conformance/semconv/<new-version>/.
- Refresh the vendored JSON schemas at tests/fixtures/genai_schemas/ from
~/.cache/oi-conformance/semconv/<new-version>/docs/gen-ai/gen-ai-{input,output}-messages.json.
- Run unit tests (
pytest tests/test_genai.py). The _load_json_attribute validator runs the new schemas against every emitted message payload — any breaking shape change surfaces here.
- Skim the semconv changelog for these specific risks (each one usually requires a code change in
_genai_conversion.py):
- New required fields on
ChatMessage / OutputMessage parts (TextPart, ToolCallRequestPart, etc.) → builder functions need to populate them.
- New
Role enum values → _normalize_message_role may need a mapping.
- New
FinishReason enum values → _normalize_finish_reason may need a mapping.
- Added
gen_ai.* registry attrs → opportunity for new dual-write mappings; re-run the harness and look at the "Missing registry attributes" summary.
- Removed / renamed
gen_ai.* attrs → drop from _genai_attributes.py and stop emitting in _genai_conversion.py.
- Refresh inline version refs: the semconv-version mentions in test_genai.py (schema-source comment), README.md (caveats section, includes Weaver version too), and _genai_conversion.py (the encoding comment inside
get_genai_message_attributes).
- Re-run the conformance harness end-to-end; verify no
gen_ai.* attribute regressed and no genuine shape errors appear in results/live_check.json (advice with id != "missing_attribute").
Gotchas
- PEP 723 inline-table line length:
[tool.uv.sources.<pkg>] { path = "...", editable = true } on one line easily exceeds 100 chars and trips ruff E501. Use the multi-section form ([tool.uv.sources.<pkg>]\npath = "..."\neditable = true).
- The conformance dir is excluded from package-level checks:
pyproject.toml excludes scripts/* from mypy and scripts from pytest's norecursedirs. ruff still lints it. Don't add Python imports that mypy/pytest can't resolve in the lint env (e.g. provider SDKs) outside this directory.
- Weaver inactivity timeout is 90s. First-run uv installs of OpenTelemetry/OpenAI/Google SDKs can take a while —
run.py prewarms each provider env via --prewarm early-exit before starting Weaver. If you add a new provider script, give it a --prewarm early-exit too.
- Single mock server for all providers: one Flask app handles
/v1/messages (Anthropic), /v1/chat/completions + /v1/embeddings + /v1/responses (OpenAI), and /v1beta/models/<path:model> (Google). Each endpoint discriminates between text and tool variants by checking body.get("tools").
- Editable installs are mandatory: PyPI versions of
openinference-instrumentation and the per-provider packages won't have in-progress dual-write changes. The [tool.uv.sources] blocks in each <provider>_conformance.py pin the local repo paths.
- System messages stay in
gen_ai.input.messages. The dual-write does not emit gen_ai.system_instructions; system instructions are assumed to flow through as a system-role entry in LLM_INPUT_MESSAGES. The gen_ai.input.messages JSON schema explicitly admits "system" as a valid role.
tool_call (singular) finish_reason — the conversion normalizes both tool_calls (OpenAI plural) and function_call (legacy) to "tool_call". The OTel registry doesn't constrain values for gen_ai.response.finish_reasons, but OTel's conventional value is plural tool_calls. If you change the normalization target, update the asserting tests in test_genai.py too.
- Don't conflate violations with shape errors: the headline "violation: N" counts
missing_attribute advice on OI native attrs (expected) plus any genuine shape mismatches on gen_ai.* attrs (real bugs). To find genuine shape errors, parse results/live_check.json and look at advice with id != "missing_attribute".