Validation Issues with Nullable 'Int64' Columns and Coercion in Pandera
Description
I am experiencing unexpected behavior when validating a DataFrame with Pandera using a schema that includes a nullable 'Int64' column with coercion enabled.
Here is the code that reproduces the issue:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
import numpy as np

# Define schema using pandas nullable dtypes
schema_nullable_int = pa.DataFrameSchema(
    {
        "id": pa.Column("Int64", checks=[pa.Check.ge(1)], coerce=True, nullable=True),
        "name": pa.Column("str", coerce=True, nullable=True),
        "date": pa.Column("datetime64", coerce=True, nullable=True),
        "decimal": pa.Column(
            pa.dtypes.Decimal, checks=[pa.Check.ge(1)], coerce=True, nullable=True
        ),
    }
)

# Test with mixed data including string numbers
df_mixed = pd.DataFrame(
    {
        "id": [1, 2, 4, 4, 7.5, "8", "9", "x", pd.NA],
        "name": [
            "Alice",
            "Bob",
            "Charlie",
            "David",
            "Eve",
            "Frank",
            "Grace",
            "Henry",
            "Ivy",
        ],
        "date": [
            "2023-01-01",
            "2023-02-01",
            "2023-03-01",
            "2023-04-01",
            "2023-05-01",
            "2023-06-01",
            "2023-07-01",
            "2023-08-01",
            "2023-09-01",
        ],
        "decimal": [2.01144, "1", "3.3", "4.4", "5.5", "6.6", "7.7", "8.8", np.nan],
    }
)

try:
    validated_df = schema_nullable_int.validate(df_mixed, lazy=True)
    print("Successfully validated with nullable dtype")
    print(validated_df)
except pa.errors.SchemaErrors as exc:
    print("Schema errors and failure cases:")
    print(exc.failure_cases)
    print("\nDataFrame object that failed validation:")
    print(exc.data)
When I run this code, I get the following output:
Schema errors and failure cases:
  schema_context  column  check                        check_number  failure_case                                       index
0 Column          id      coerce_dtype('Int64')        NaN           x                                                  7
1 Column          id      coerce_dtype('Int64')        NaN           <NA>                                               8
2 Column          id      dtype('Int64')               NaN           object                                             None
3 Column          id      greater_than_or_equal_to(1)  0.0           TypeError("'>=' not supported between instance...  None
DataFrame object that failed validation:
  id    name     date        decimal
0 1     Alice    2023-01-01  2
1 2     Bob      2023-02-01  1
2 4     Charlie  2023-03-01  3
3 4     David    2023-04-01  4
4 7.5   Eve      2023-05-01  6
5 8     Frank    2023-06-01  7
6 9     Grace    2023-07-01  8
7 x     Henry    2023-08-01  9
8 <NA>  Ivy      2023-09-01
I am confused about why pd.NA is included in the failure cases for the 'id' column, even though nullable=True is set. I would expect pd.NA to be allowed since the column is nullable.
I have checked the existing issues on the Pandera GitHub repository and found some related discussions, such as #796 and #664, but I'm still unsure about the specific behavior in this case.
Additionally, for the 'decimal' column, I have set the type to pa.dtypes.Decimal with checks=[pa.Check.ge(1)], coerce=True, and nullable=True. However, in the DataFrame, there are values like "1" and 1 (as integers), which are not decimals. I want to ensure that only proper decimal values (with fractional parts) pass the validation, but I'm not sure if Pandera's Decimal type handles this correctly.
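To make the 'decimal' requirement concrete, here is a minimal sketch of the rule I have in mind, written with only the standard library. The name has_fractional_part is a hypothetical helper of mine, not part of Pandera. I assume something like it could be wrapped in an element-wise custom check (for example pa.Check(has_fractional_part, element_wise=True)), but I have not verified how that interacts with coerce=True, since coercion may already have changed the values by the time the check runs.

```python
from decimal import Decimal, InvalidOperation


def has_fractional_part(value) -> bool:
    """Return True only if `value` parses as a finite decimal number
    with a non-zero fractional part (hypothetical helper, not Pandera API)."""
    try:
        # Go through str() so a float such as 2.01144 keeps its printed digits
        # instead of its full binary expansion.
        number = Decimal(str(value))
    except (InvalidOperation, ValueError):
        # Not parseable as a number at all, e.g. "x" or None.
        return False
    if not number.is_finite():
        # NaN and infinities are not accepted by this rule.
        return False
    # A value equal to its own integral rounding has no fractional part.
    return number != number.to_integral_value()


samples = [2.01144, "1", 1, "3.3", "x", float("nan")]
print([has_fractional_part(v) for v in samples])
```

Under this rule "1", 1 and also "1.0" are rejected, because they are numerically whole, while 2.01144 and "3.3" are accepted. Missing values return False here, so nulls would need to be skipped before the rule is applied.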
Questions
Why is pd.NA appearing in the failure cases for the 'id' column when nullable=True is set?
How can I configure the schema to ensure that for the 'decimal' column, only values with fractional parts are accepted, excluding integers like 1 or "1"?
Are there similar issues or considerations for other data types such as string and datetime when using nullable columns with coercion?
I would appreciate any guidance or clarification on these points.