| name | python-code-reviewer |
| description | Review Python OpenInference instrumentation code for correctness and completeness. Use this skill when reviewing a Python instrumentor package — whether it's a new instrumentor, a PR that modifies one, or when the user asks to audit/review/check an existing instrumentor's code quality. Trigger on phrases like "review the instrumentor", "check the code", "audit the package", "is this instrumentor correct", or any request to validate an OpenInference Python instrumentation package against project standards.
|
| invocable | true |
Python Code Reviewer for OpenInference Instrumentors
Review a Python OpenInference instrumentation package against the project's established
patterns and conventions. This is a checklist-driven review — go through each section,
report findings with file paths and line numbers, and surface issues organized by severity.
Workflow
Step 1: Identify the package to review
- Ask the user which instrumentor to review if not already clear from context
- The package lives under
python/instrumentation/openinference-instrumentation-<name>/
- Read the key files:
__init__.py, _wrappers.py (or equivalent), pyproject.toml,
and the full tests/ directory
Step 2: Pull the instrumented library source and use it as ground truth
OpenInference instrumentors work by monkey-patching functions in the library they
instrument. All correctness judgments — whether wrappers target the right methods, handle
the right signatures, process the right data structures, and cover the right edge cases —
must be verified against the actual library source code. Do NOT make assumptions about
how the instrumented library works.
Note: The tox env name <pkg> and the library's Python import path <library>
often differ. For example, google_genai is the tox env name but the library installs
as google/genai/ in site-packages. Check test-requirements.txt or pyproject.toml
to find the actual library package name.
-
Set up the tox environment to install the pinned library version. Look up the
tox envlist in python/tox.ini to find the correct env name (use the highest Python
version available, e.g., py314, py313):
cd python && uvx --with tox-uv tox run -e <pyVER>-ci-<pkg> -- --co -q
(-- --co -q tells pytest to collect without running, which triggers the install.)
If the .tox env already exists, skip this step.
If tox setup fails (missing Python version, dependency conflicts), fall back to
pip install <library> in a temporary venv to unblock the review.
-
Locate the installed library source at:
python/.tox/<pyVER>-ci-<pkg>/lib/python<X.Y>/site-packages/<library>/
-
Reference the library source throughout the review. Before flagging any finding,
verify it against the actual code:
- Are the monkey-patched methods/classes correct? Check they exist and have the
expected signatures.
- Are parameter types handled correctly? Read the real type annotations and defaults.
- Are edge cases real? Check whether a supposed edge case can actually occur given
the library's actual types, validation, and control flow.
- Are attribute extractions correct? Verify field names, nesting, and optional vs.
required fields against the library's actual data classes.
-
Calibrate severity based on what the library actually does:
- A bug affecting types/paths the library actually uses → High or Critical
- An edge case for a type that can't actually appear at runtime → Low
- A missing handler for a type in the library's Union that is common → higher severity
- A missing handler for a rare/internal type → lower severity
Step 3: Run all review sections below
Step 4: Present findings organized by severity:
- Critical: Will cause incorrect behavior or CI failure
- High: Missing required convention or test coverage gap
- Medium: Deviates from established patterns but functional
- Low: Style or minor improvement suggestions
Section 1: Test Setup and CI Config
The tox.ini install pattern matters because a broken pattern silently installs the wrong
version of the library, making the "pinned version" test target useless.
1.1 tox.ini install pattern
Read python/tox.ini and find the commands_pre entries for this package.
Correct pattern (google_adk style — 4 steps, substitute <pkg> with the actual
package name):
<pkg>: uv pip uninstall -r test-requirements.txt
<pkg>: uv pip install --reinstall-package openinference-instrumentation-<pkg> .
<pkg>: python -c 'import openinference.instrumentation.<pkg>'
<pkg>: uv pip install -r test-requirements.txt
Broken pattern (causes under-resolution — the pinned version test may silently test
the wrong version):
<pkg>: uv pip install --reinstall {toxinidir}/instrumentation/openinference-instrumentation-<pkg>[test]
Flag the broken pattern as Critical — it defeats the purpose of version-pinned testing.
1.2 test-requirements.txt
Check that test-requirements.txt exists in the package root and contains:
- A pinned version of the library being instrumented (e.g.,
openai==2.8.0)
opentelemetry-sdk
pytest-recording (if using VCR cassettes)
- Any other test utilities needed (pytest-asyncio, respx, responses, etc.)
If test-requirements.txt is missing entirely, flag as Critical (the correct tox
pattern depends on it).
1.3 Latest test target
Verify that the tox envlist has both pinned and -latest variants:
py3{10,14}-ci-{pkg,pkg-latest}
And that the -latest variant upgrades the library:
pkg-latest: uv pip install -U <library-name>
Section 2: Testing Patterns
2.1 conftest.py fixtures
Read tests/conftest.py and verify these fixtures exist:
Required fixtures:
in_memory_span_exporter — returns InMemorySpanExporter()
tracer_provider — creates TracerProvider with SimpleSpanProcessor wired to the exporter
instrument (autouse) — calls Instrumentor().instrument(tracer_provider=...),
clears exporter, yields, then calls .uninstrument() and clears again
Scope considerations:
- Session-scoped instrumentor fixtures are fine when the instrumentor is stateless
- Function-scoped exporter+provider is safer for test isolation but session-scoped works
if tests clear the exporter properly
VCR config fixture (if using cassettes):
@pytest.fixture(scope="session")
def vcr_config() -> dict[str, Any]:
return {
"before_record_request": _strip_request_headers,
"before_record_response": _strip_response_headers,
"decode_compressed_response": True,
"record_mode": "once",
}
With helper functions that strip sensitive headers from recorded cassettes.
2.2 pytest-recording / VCR cassettes
If the instrumentor calls external APIs (LLM providers, embedding services, etc.):
- Tests should use
@pytest.mark.vcr decorator
- Cassettes should live in
tests/cassettes/ (pytest-recording default)
- Cassette YAML files should have headers stripped (no API keys recorded)
pytest-recording should be in test-requirements.txt
If tests use mocking instead of VCR, that's acceptable but note it as a pattern difference.
2.3 Exhaustive attribute assertions (pop-style)
This is the most important testing pattern. Tests should verify ALL span attributes, not
just spot-check a few. The pattern prevents regressions where unexpected attributes appear
or expected ones disappear silently.
Correct pattern:
attributes = dict(span.attributes or {})
assert attributes.pop(OPENINFERENCE_SPAN_KIND) == OpenInferenceSpanKindValues.CHAIN.value
assert attributes.pop(INPUT_VALUE)
assert attributes.pop(INPUT_MIME_TYPE) == JSON
assert attributes.pop(OUTPUT_VALUE)
assert attributes.pop(OUTPUT_MIME_TYPE) == JSON
assert not attributes
What to flag:
- Tests that only check a few attributes without the final
assert not attributes — High
- Tests that use
span.attributes[KEY] or span.attributes.get(KEY) instead of pop — Medium
(functional but doesn't catch unexpected extras)
- Missing
assert not attributes at the end — High
2.4 Context attribute propagation tests
There should be at least one test that uses using_attributes() context manager and
verifies that context attributes appear on spans:
with using_attributes(
session_id="test-session",
user_id="test-user",
metadata={"key": "value"},
tags=["tag-1", "tag-2"],
prompt_template="template {var}",
prompt_template_version="v1.0",
prompt_template_variables={"var": "value"},
):
Then verify these attributes appear on the spans via pop assertions.
Section 3: OpenInference Semantic Conventions
Check which conventions apply based on the type of library being instrumented. Not every
instrumentor needs every attribute — match the conventions to what the library actually does.
3.1 Always required
Every span must have:
OPENINFERENCE_SPAN_KIND — set to the appropriate kind enum value
INPUT_VALUE + INPUT_MIME_TYPE — what went into the operation
OUTPUT_VALUE + OUTPUT_MIME_TYPE — what came out
MIME types should be application/json for structured data (dicts, Pydantic models) and
text/plain for strings. Flag if MIME type is missing when value is set — High.
When setting input/output attributes, the instrumentor should use:
from openinference.instrumentation import get_input_attributes, get_output_attributes
span.set_attributes(dict(get_input_attributes(val, mime_type=OpenInferenceMimeTypeValues.JSON)))
3.2 LLM libraries (OpenAI, Anthropic, Mistral, LiteLLM, etc.)
These should set:
LLM_MODEL_NAME — the model identifier
LLM_PROVIDER — the provider name (e.g., "openai", "anthropic")
LLM_INVOCATION_PARAMETERS — JSON of parameters like temperature, max_tokens
LLM_INPUT_MESSAGES — array of message objects with role and content
LLM_OUTPUT_MESSAGES — array of response message objects
LLM_TOKEN_COUNT_PROMPT / LLM_TOKEN_COUNT_COMPLETION / LLM_TOKEN_COUNT_TOTAL — token usage
Span kind should be LLM.
3.3 Embedding libraries
EMBEDDING_MODEL_NAME
EMBEDDING_EMBEDDINGS — the embedding vectors (unless masked by TraceConfig)
EMBEDDING_TEXT — input text
Span kind should be EMBEDDING.
3.4 Tool/function calling
When the library supports tool use or function calling:
TOOL_NAME — name of the tool/function
TOOL_DESCRIPTION — description (if available)
TOOL_PARAMETERS — JSON schema of parameters
Span kind should be TOOL.
3.5 Agent/orchestration frameworks (CrewAI, LangChain, DSPy, etc.)
These typically produce multiple span kinds in a hierarchy:
CHAIN for orchestration/workflow spans
AGENT for agent execution spans
TOOL for tool invocations
LLM for underlying model calls
3.6 Retrieval libraries
RETRIEVAL_DOCUMENTS — array of retrieved documents
- Document attributes:
DOCUMENT_ID, DOCUMENT_CONTENT, DOCUMENT_SCORE, DOCUMENT_METADATA
Span kind should be RETRIEVER.
Section 4: Span Hierarchy
4.1 Parent-child relationships
For instrumentors that create multiple spans, verify:
- Spans nest correctly (child span's
parent.span_id matches parent span's context.span_id)
- All spans from a single operation share the same
trace_id
- No orphaned root spans that should be children
Tests should verify hierarchy explicitly:
trace_ids = {span.context.trace_id for span in spans}
assert len(trace_ids) == 1
assert child_span.parent.span_id == parent_span.context.span_id
Flag missing hierarchy tests as High for multi-span instrumentors.
4.2 Correct span kinds in hierarchy
Common correct hierarchies:
CHAIN -> LLM (simple chain with LLM call)
CHAIN -> AGENT -> TOOL (agent framework)
CHAIN -> AGENT -> LLM (agent making LLM calls)
CHAIN -> RETRIEVER -> EMBEDDING (RAG pipeline)
CHAIN -> CHAIN -> LLM (nested chains)
4.3 Thread/async context propagation
If the instrumented library uses threads or async:
- Verify that OTel context is properly propagated across thread boundaries
(using
contextvars.copy_context() if needed)
- For async code, ensure spans created in async functions are properly parented
- Flag if the library is known to use
ThreadPoolExecutor or similar and the
instrumentor doesn't handle context propagation — Critical
4.4 Suppress tracing support
Every wrapper should check suppression at the start. Either pattern is acceptable:
if context_api.get_value(context_api._SUPPRESS_INSTRUMENTATION_KEY):
return wrapped(*args, **kwargs)
from opentelemetry.context import suppress_instrumentation
if suppress_instrumentation():
return wrapped(*args, **kwargs)
Missing suppression check — High.
4.5 TraceConfig masking support
The instrumentor should accept and respect TraceConfig:
- Pass it to
OITracer or use it to mask attributes before setting them
- At minimum,
hide_inputs and hide_outputs should work
Missing TraceConfig support — Medium (functional but incomplete).
Presenting Results
Organize findings into a table:
| Severity | Section | Finding | Location |
|---|
| Critical | 1.1 | Uses broken tox install pattern | python/tox.ini:142 |
| High | 2.3 | Tests don't use exhaustive pop assertions | tests/test_instrumentor.py:85 |
| ... | ... | ... | ... |
Then list what's working well — positive findings help the user understand what doesn't
need to change.
Finally, ask the user what they'd like to do:
- "Fix the issues" — generate patches
- "Run the tests" — execute
tox run -e test-<pkg>
- "Just reviewing" — done