Validation Issues with Nullable 'Int64' Columns and pd.Int64Dtype Coercion in Pandera #2022

nabagkmit · 2025-06-03T08:57:27Z

Validation Issues with Nullable 'Int64' Columns and Coercion in Pandera
Description
I am experiencing unexpected behavior when validating a DataFrame with Pandera using a schema that includes a nullable 'Int64' column with coercion enabled.
Here is the code that reproduces the issue:

import numpy as np
import pandas as pd
import pandera as pa

# Define schema using pandas nullable dtypes
schema_nullable_int = pa.DataFrameSchema(
    {
        "id": pa.Column("Int64", checks=[pa.Check.ge(1)], coerce=True, nullable=True),
        "name": pa.Column("str", coerce=True, nullable=True),
        "date": pa.Column("datetime64", coerce=True, nullable=True),
        "decimal": pa.Column(
            pa.dtypes.Decimal, checks=[pa.Check.ge(1)], coerce=True, nullable=True
        ),
    }
)

# Test with mixed data including string numbers
df_mixed = pd.DataFrame(
    {
        "id": [1, 2, 4, 4, 7.5, "8", "9", "x", pd.NA],
        "name": [
            "Alice",
            "Bob",
            "Charlie",
            "David",
            "Eve",
            "Frank",
            "Grace",
            "Henry",
            "Ivy",
        ],
        "date": [
            "2023-01-01",
            "2023-02-01",
            "2023-03-01",
            "2023-04-01",
            "2023-05-01",
            "2023-06-01",
            "2023-07-01",
            "2023-08-01",
            "2023-09-01",
        ],
        "decimal": [2.01144, "1", "3.3", "4.4", "5.5", "6.6", "7.7", "8.8", np.nan],
    }
)

try:
    validated_df = schema_nullable_int.validate(df_mixed, lazy=True)
    print("Successfully validated with nullable dtype")
    print(validated_df)
except pa.errors.SchemaErrors as exc:
    print("Schema errors and failure cases:")
    print(exc.failure_cases)
    print("\nDataFrame object that failed validation:")
    print(exc.data)

When I run this code, I get the following output:
Schema errors and failure cases:
  schema_context column                        check  check_number                                       failure_case index
0         Column     id        coerce_dtype('Int64')           NaN                                                  x     7
1         Column     id        coerce_dtype('Int64')           NaN                                               <NA>     8
2         Column     id               dtype('Int64')           NaN                                             object  None
3         Column     id  greater_than_or_equal_to(1)           0.0  TypeError("'>=' not supported between instance...  None

DataFrame object that failed validation:
     id     name        date decimal
0     1    Alice  2023-01-01       2
1     2      Bob  2023-02-01       1
2     4  Charlie  2023-03-01       3
3     4    David  2023-04-01       4
4   7.5      Eve  2023-05-01       6
5     8    Frank  2023-06-01       7
6     9    Grace  2023-07-01       8
7     x    Henry  2023-08-01       9
8  <NA>      Ivy  2023-09-01

I am confused about why pd.NA is included in the failure cases for the 'id' column, even though nullable=True is set. I would expect pd.NA to be allowed since the column is nullable.
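To narrow down which 'id' values are actually problematic, here is a minimal pandas-only sketch. It does not reproduce Pandera's internal coercion logic; it only separates values that were missing from the start from values that cannot be read as numbers or that carry a fractional part. The variable names (`ids`, `numeric`, `originally_missing`, `unparseable`, `non_integer`) are mine and purely illustrative.

```python
import pandas as pd

# Same values as the "id" column in the reproducer above.
ids = pd.Series([1, 2, 4, 4, 7.5, "8", "9", "x", pd.NA], dtype="object")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(ids, errors="coerce")

# Values that were already missing before any conversion.
originally_missing = ids.isna()

# Values that became NaN only because parsing failed.
unparseable = numeric.isna() & ~originally_missing

# Values that parsed as numbers but are not whole numbers.
non_integer = numeric.notna() & (numeric % 1 != 0)

print("originally missing:", list(ids[originally_missing].index))
print("unparseable:", list(ids[unparseable].index))
print("non-integer:", list(ids[non_integer].index))
```

With this split, a row that was missing from the start can be told apart from rows that would break a conversion to an integer dtype, which is the distinction I expected nullable=True to make.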
I have checked the existing issues on the Pandera GitHub repository and found some related discussions, such as #796 and #664, but I'm still unsure about the specific behavior in this case.
Additionally, for the 'decimal' column, I have set the type to pa.dtypes.Decimal with checks=[pa.Check.ge(1)], coerce=True, and nullable=True. However, the DataFrame contains values such as "1", which hold a whole number rather than a decimal with a fractional part. I want only proper decimal values (with fractional parts) to pass validation, and I am not sure whether Pandera's Decimal type can enforce this.
Questions

Why is pd.NA appearing in the failure cases for the 'id' column when nullable=True is set?
How can I configure the schema to ensure that for the 'decimal' column, only values with fractional parts are accepted, excluding integers like 1 or "1"?
Are there similar issues or considerations for other data types such as string and datetime when using nullable columns with coercion?
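For the second question, the rule I have in mind can be sketched with the standard library alone. This is only an illustration of the intended rule, not something Pandera provides: `has_fractional_part` is a hypothetical helper name, and treating "1.0" as a whole number is my own assumption about the desired behavior.

```python
from decimal import Decimal, InvalidOperation


def has_fractional_part(value) -> bool:
    """Return True if value parses as a finite decimal with a non-zero fractional part."""
    try:
        # Going through str() avoids binary float artifacts such as 2.0114399999...
        parsed = Decimal(str(value))
    except InvalidOperation:
        # Covers non-numeric strings and the string forms of None and pd.NA.
        return False
    # NaN and infinities are rejected before the comparison below.
    return parsed.is_finite() and parsed != parsed.to_integral_value()


# Values taken from the "decimal" column in the reproducer above.
for raw in [2.01144, "1", "3.3", float("nan")]:
    print(repr(raw), has_fractional_part(raw))
```

I assume a function like this could be attached to the column as an element-wise check, for example pa.Check(has_fractional_part, element_wise=True), but since coerce=True may convert the values before checks run, I do not know whether the check would still see the original inputs.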

I would appreciate any guidance or clarification on these points.

Replies: 0 comments
