-
Notifications
You must be signed in to change notification settings - Fork 15.4k
Resolve serialization for numpy bool 1.x and 2.x compatibility #53690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Resolve serialization for numpy bool 1.x and 2.x compatibility #53690
Conversation
if cls not in allowed_deserialize_classes: | ||
raise TypeError(f"unsupported {qualname(cls)} found for numpy deserialization") | ||
name = qualname(cls) | ||
if name not in deserializers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this check if this is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why redundant? It's good to verify it here. No harm done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is redundant. The deserializer will only be invoked when serde
finds it in deserializers. This was added in the PR of PyDantic support and should not have been there. Let's remove it.
if cls not in allowed_deserialize_classes: | ||
raise TypeError(f"unsupported {qualname(cls)} found for numpy deserialization") | ||
name = qualname(cls) | ||
if name not in deserializers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is redundant. The deserializer will only be invoked when serde
finds it in deserializers. This was added in the PR of PyDantic support and should not have been there. Let's remove it.
See the failing test case. |
@@ -250,7 +250,6 @@ def test_numpy_serializers(self): | |||
("klass", "ver", "value", "msg"), | |||
[ | |||
(np.int32, 999, 123, r"serialized version is newer"), | |||
(np.float32, 1, 123, r"unsupported numpy\.float32"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case is stale after removing the redundant type check in the deserializers. Also, I notice numpy.float32
is not in the serializers list, but there is such a check. Not sure if it is an issue.
if isinstance(o, (np.float16, np.float32, np.float64, np.complex64, np.complex128)):
return float(o), name, __version__, True
d5231d4
to
5735f6b
Compare
Just found out the rest of the code will not be executed by using pytest.xfail in this way. Will think about another way to do it. |
… individually to numpy bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case test_serializers_importable_and_str
is refactored to be parameterized on serializers. The logic that loads serializers from modules are moved into generate_serializers_importable_and_str_test_cases
. This function generate all the test cases (the input and the expected output) to test if a serializer is importable.
- It creates a single test case for each serializer defined in the module.
- It allows us to mark the
numpy.bool
as expected failure (i.e.,xfail
) in pytest outcomes when the NumPy version in breeze environment is 1.x. The failure reason is documented in thexfail
message. - If the NumPy version in breeze environemnt is 2.x, the test case for
numpy.bool
will pass.
Please let me know if this approach is reasonable. Also, I would appreciate if you could provide some feedback on the test implementation (e.g., the invocation of a function in pytest.mark.parameterize to generate test case). Thanks!
@@ -61,6 +63,49 @@ def recalculate_patterns(): | |||
_match_regexp.cache_clear() | |||
|
|||
|
|||
def generate_serializers_importable_and_str_test_cases(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not personally a fan in dynamically generating cases like this (if we get the logic wrong, there’s no easy way to detect missing cases) but it’s a personal opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I would like to elaborate more about the thoughts behind this implementation. Another approach to achieve it may be hard-coding all the values in the parameterization list. In the case of new serializers are added, we can also update the test case manually.
From another perspective, I think the intention may be to simulate the following serializer loading (from what I understand, but may be wrong).
In the original implementation, it is a bit challenging to mark the loading of np.bool
as an expected failure when the installed numpy version is 1.x, since executing pytest.xfail
will interrupt the execution of the test.
airflow/airflow-core/src/airflow/serialization/serde.py
Lines 361 to 376 in 0cd0ff0
def _register(): | |
"""Register builtin serializers and deserializers for types that don't have any themselves.""" | |
_serializers.clear() | |
_deserializers.clear() | |
_stringifiers.clear() | |
with Stats.timer("serde.load_serializers") as timer: | |
for _, name, _ in iter_namespace(airflow.serialization.serializers): | |
name = import_module(name) | |
for s in getattr(name, "serializers", ()): | |
if not isinstance(s, str): | |
s = qualname(s) | |
if s in _serializers and _serializers[s] != name: | |
raise AttributeError(f"duplicate {s} for serialization in {name} and {_serializers[s]}") | |
log.debug("registering %s for serialization", s) | |
_serializers[s] = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since executing
pytest.xfail
will interrupt the execution of the test
Can you elaborate on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that no other code is executed after the pytest.xfail() call, differently from the marker. That’s because it is implemented internally by raising a known exception.
https://docs.pytest.org/en/stable/how-to/skipping.html#xfail-mark-test-functions-as-expected-to-fail
In the above commit, I mark a test as XFAIL
from within the test. Suppose the modules are loading as follow:
np.bool_
np.bool
kubernetes.client.models.v1_pod.V1Pod
datetime.date
...
If the test is marked as XFAIL
from within the test while loading np.bool
. The test will not continue to check for the import for kubernetes.client.models.v1_pod.V1Pod
and datetime.date
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just mark the bool case instead, not the entire test? (Actually skip is probably better than xfail for this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A skip means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether.
I revisited the document and seems like skip is a better option. In this case, I would expect the np.bool
case to pass if the installed NumPy version is above 2.0.
Why not just mark the bool case instead, not the entire test?
I wanted to understand it better in terms of the implementation. Do you suggest to replace the pytest.xfail
in the commit 5735f6b with pytest.skip
? Or I should restructure it to individual tests (i.e., np.bool
has its own test case) and use pytest.mark.skipif
for that test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting the case out to a separate test is a good idea. Splitting out the deltalake cases is probably also a good idea. This would make the dynamically param generation code as simple as possible and reduce the possibility of mistakes to the minimum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will see if I can figure out something to balance the trade offs here for that test case.
Are there any similar cases we need to account for with float or int between numpy 1.x and 2.x? |
I have a test DAG #52753 (comment) which check for the integer types and the floating point types listed in the serializers. I think it is good with those types. However, as mentioned here #53690 (comment). The |
Create this PR to help review the solution for #52753
Closes: #52753
Unit Test
the testing ran in the breeze environment with numpy
1.26.4
installed. The following test case failed because it attempted to importnumpy.bool
.np.bool
was a deprecated alias for the built-in Pythonbool
type, introduced in NumPy version1.20.0
.np.bool_
is the Numpy scalar. If we keep both"numpy.bool"
and"numpy.bool_"
in the list of serializers and the list of deserializers, we might need to update this test case to run based on the installed numpy version. Or, update the code to check which numpy version we have and import and check the right bool (#52753 (comment)).then, I created a init file
files/airflow-breeze-config/init.sh
to runuv pip install --upgrade pandas
. This upgrade both pandas and numpy to the following versions. We may not only update one package, because the pandas depends on the numpy package and can result in the error mentioned #52753 (comment). The test case that failed when numpy1.26.4
is installed now pass becausenumpy.bool
is the NumPy scalar in2.3.1
.If we only upgrade numpy to 2.x and didn't update pandas, the following error is raised. (
uv pip install --upgrade numpy
)DAG Test (Run-time Behavior)
Numpy version 1.26.4

Numpy version 2.3.1

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in airflow-core/newsfragments.