| name | pandera |
| description | This skill should be used when the user asks to "validate a DataFrame with pandera", "write a pandera schema", "use pandera DataFrameModel", "add data validation to a pipeline", or needs guidance on pandera best practices for data quality. |
Pandera: DataFrame Validation
Pandera is an open-source framework for validating DataFrame-like objects at runtime. Define schemas once and reuse them across pandas, polars, Dask, Modin, PySpark, and Ibis backends.
Import Convention
Since pandera v0.24.0, use the backend-specific module. Using the top-level pandera module produces a FutureWarning and will be deprecated in v0.29.0.
import pandera.pandas as pa  # pandas backend
import pandera.polars as pa  # polars backend (choose one; both bind the name pa)
from pandera.typing.pandas import DataFrame, Series, Index
Two Schema Styles
Object-based API (DataFrameSchema)
Suitable for dynamic schema construction or when schemas need to be built programmatically.
import pandas as pd
import pandera.pandas as pa
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.gt(0)),
    "email": pa.Column(str, pa.Check.str_matches(r"^[^@]+@[^@]+\.[^@]+$")),
    "score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(1.0)]),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive", "banned"])),
})
validated = schema.validate(df)
Class-based API (DataFrameModel) — preferred
Pydantic-style syntax with type annotations. Produces cleaner, reusable schemas that integrate with @pa.check_types.
import pandera.pandas as pa
from pandera.typing.pandas import DataFrame, Series
class UserSchema(pa.DataFrameModel):
    user_id: int = pa.Field(gt=0)
    email: str = pa.Field(str_matches=r"^[^@]+@[^@]+\.[^@]+$")
    score: float = pa.Field(ge=0.0, le=1.0)
    status: str = pa.Field(isin=["active", "inactive", "banned"])

    class Config:
        strict = True
        coerce = False

UserSchema.validate(df)

@pa.check_types
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
Checks
Built-in Checks (prefer these over lambdas)
pa.Check.gt(0)
pa.Check.ge(0)
pa.Check.lt(100)
pa.Check.le(100)
pa.Check.eq("value")
pa.Check.ne("value")
pa.Check.isin(["a", "b"])
pa.Check.notin(["x"])
pa.Check.str_matches(r"^\d+$")
pa.Check.in_range(0, 100)
pa.Check.str_startswith("prefix")
pa.Check.str_endswith("suffix")
pa.Check.str_length(1, 255)
Custom Checks
pa.Check(lambda s: s.str.len() <= 255)
pa.Check(lambda x: x > 0, element_wise=True)
pa.Check(lambda s: s > 0, error="values must be positive")
DataFrame-level Checks
schema = pa.DataFrameSchema(
    columns={...},
    checks=pa.Check(lambda df: df["end_date"] >= df["start_date"]),
)
In DataFrameModel, use @pa.dataframe_check:
class Schema(pa.DataFrameModel):
    start_date: int
    end_date: int

    @pa.dataframe_check
    @classmethod
    def end_after_start(cls, df: pd.DataFrame) -> pd.Series:
        return df["end_date"] >= df["start_date"]
Nullable and Optional Columns
pa.Column(float, nullable=True)  # column must exist; null values are allowed
from typing import Optional
class Schema(pa.DataFrameModel):
    required_col: Series[int]
    optional_col: Optional[Series[float]]  # column may be absent from the DataFrame
Coercion
Enable coercion to cast data to the declared type before validation. Use deliberately — coercion can hide upstream data issues.
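pa.Column(int, coerce=True)  # column-level coercion (object-based API)
class Schema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)  # column-level coercion

    class Config:
        coerce = True  # schema-wide coercion: applies to every column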
Lazy Validation — Collect All Errors
By default pandera raises on the first error. Use lazy=True to collect all failures before raising, useful for batch reporting.
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)
Decorator Integration
Integrate validation transparently into pipelines using decorators.
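# Validate input and output from the type annotations
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=df["units"] * df["price"])

# Validate with explicit schema objects
@pa.check_input(input_schema)
@pa.check_output(output_schema)
def pipeline_step(df):
    return df

# Validate a named argument and the return value with one decorator
@pa.check_io(raw=input_schema, out=output_schema)
def pipeline_step(raw):
    return raw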
Decorators work on sync/async functions, methods, class methods, and static methods.
Schema Inheritance
Build specialized schemas from a base to avoid repetition.
class BaseEvent(pa.DataFrameModel):
    event_id: str
    timestamp: int = pa.Field(gt=0)

class ClickEvent(BaseEvent):
    url: str
    user_agent: str

    class Config:
        strict = True
Schema Persistence (YAML / Script)
Serialize and reload schemas to keep validation reproducible.
import pandera.io
pandera.io.to_yaml(schema, "./schema.yaml")
schema = pandera.io.from_yaml("./schema.yaml")
pandera.io.to_script(schema, "./schema_definition.py")
Schema Inference (Prototyping Only)
Infer a schema from existing data to bootstrap development. Always review and tighten the generated schema before using it in production.
import pandera.pandas as pa
inferred = pa.infer_schema(df)
print(inferred.to_script())
Dropping Invalid Rows
Use drop_invalid_rows=True on DataFrameSchema to filter out failing rows instead of raising an error. Dropping rows requires passing lazy=True to validate. Supported on pandas and polars.
schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.ge(0))},
    drop_invalid_rows=True,
)
cleaned = schema.validate(df_with_bad_rows, lazy=True)
Error Handling
from pandera.errors import SchemaError, SchemaErrors
try:
    schema.validate(df)
except SchemaError as exc:
    print(exc.failure_cases)

try:
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    print(exc.error_counts)
    print(exc.failure_cases)
Key Configuration Options (Config)
| Option | Type | Effect |
|---|---|---|
| strict | bool | Raise if extra columns present |
| coerce | bool | Cast columns to declared dtypes |
| ordered | bool | Require columns in declared order |
| name | str | Schema name shown in error messages |
| add_missing_columns | bool | Insert columns with default values |
Best Practices
- Use DataFrameModel over DataFrameSchema for new code: cleaner syntax, inheritance, and type-annotation integration.
- Prefer strict=True to catch unexpected extra columns early.
- Use built-in checks (Check.gt, Check.isin, etc.) over custom lambdas where possible; they produce better error messages.
- Write vectorized checks (element_wise=False, the default) for performance; only use element_wise=True when the logic is truly scalar.
- Always add error= messages to custom Check objects to improve debuggability.
- Use lazy validation in pipelines that process large batches so all failures surface in one pass.
- Never rely on inferred schemas in production; always define constraints explicitly.
- Use coerce=True deliberately: set it at the column level to limit scope, and avoid schema-wide coercion unless certain.
- Use raise_warning=True only for non-critical informational checks (e.g., normality tests), not for data integrity constraints.
Additional Resources
references/checks-and-validation.md — Built-in check catalog, groupby checks, wide checks, hypothesis testing
references/dataframe-models.md — Field spec, schema inheritance, MultiIndex, aliases, parsers, Polars usage