
Conversation

@zhengruifeng (Contributor) commented Oct 16, 2025

What changes were proposed in this pull request?

Fix the scheduled job for numpy 2.1.3.

Why are the changes needed?

To fix https://github.com/apache/spark/actions/runs/18538043179/job/52838303733

The failure was caused by a bug in pyarrow 19.0.0; see apache/arrow#45283.

Does this PR introduce any user-facing change?

No, infra-only.

How was this patch tested?

PR builder with:

default: '{"PYSPARK_IMAGE_TO_TEST": "numpy-213", "PYTHON_TO_TEST": "python3.11"}'

see https://github.com/zhengruifeng/spark/actions/runs/18527303212/job/52801019275
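
For context, the override above is just a JSON map of workflow inputs; the minimal sketch below only illustrates its shape (how the actual GitHub Actions workflow consumes these keys is not shown here, and the comments are assumptions based on the key names):

```python
import json

# The PR-builder override used above: a JSON map of workflow inputs.
# Illustrative only; the real workflow reads these values itself.
override = '{"PYSPARK_IMAGE_TO_TEST": "numpy-213", "PYTHON_TO_TEST": "python3.11"}'

envs = json.loads(override)
print(envs["PYSPARK_IMAGE_TO_TEST"])  # numpy-213  -> test image variant to run against
print(envs["PYTHON_TO_TEST"])         # python3.11 -> Python interpreter used for the tests
```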

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @zhengruifeng. I was also worried about that failed CI, but didn't get a chance to look into it.

@dongjoon-hyun (Member) commented Oct 16, 2025

For this one, do you think we need to document this incompatibility somewhere, given that our minimum numpy is still 1.22 and pyarrow is 15.0.0?

The failure was caused by a bug in pyarrow 19.0.0; see apache/arrow#45283.

@pan3793 (Member) commented Oct 16, 2025

@zhengruifeng, I have a silly question about Python deps management: I see that many Python deps are declared without a version, or with a half-bounded version range (e.g. foo>=1.0 or bar<2.0). Silently upgrading third-party libs may introduce breaking changes or bugs (especially on major version bumps).

This means that if we do not specify a dependency version, or only specify its lower bound, PySpark may stop working once a new major version of the dependency is released. This becomes a problem if users want to create a venv for older PySpark versions (in practice, EOLed versions of Spark are widely used and upgrades are not timely).

I wonder if PySpark can pin all Python deps to fixed versions (or at least bounded ranges, e.g. foo>=1.0,<=2.3); this would clearly show which dependency versions each Spark release has been fully tested against.
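
To make the half-bounded vs. bounded distinction concrete, here is a minimal sketch using the packaging library (illustrative only; foo and the version numbers are the same placeholders as above):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Half-bounded: any future release of "foo", including major bumps, is accepted.
half_bounded = SpecifierSet(">=1.0")
# Bounded: only versions inside the (presumably tested) range are accepted.
bounded = SpecifierSet(">=1.0,<=2.3")

v = Version("3.0")  # a hypothetical future major release
print(v in half_bounded)  # True  -> pip resolution would happily pick it up
print(v in bounded)       # False -> the untested major bump is rejected
```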

@zhengruifeng (Contributor Author) commented:
@zhengruifeng, I have a silly question about Python deps management: I see that many Python deps are declared without a version, or with a half-bounded version range (e.g. foo>=1.0 or bar<2.0). Silently upgrading third-party libs may introduce breaking changes or bugs (especially on major version bumps).

This means that if we do not specify a dependency version, or only specify its lower bound, PySpark may stop working once a new major version of the dependency is released. This becomes a problem if users want to use older Spark versions (in practice, EOLed versions of Spark are widely used and upgrades are not timely).

I wonder if Spark can pin all Python deps to fixed versions (or at least bounded ranges, e.g. foo>=1.0,<=2.3); this would clearly show which dependency versions each Spark release has been fully tested against.

@pan3793 The reason we use lower bounds like foo>=1.0 in most places is to eagerly test Spark against the latest packages (this requires triggering a refresh of the cached images). When Spark gets broken by a new version, we set an upper bound (e.g. 22d2eb3) and restore it once the issue is resolved (e.g. 47574ba).

Currently, most workflows test against the latest versions, and we have two workflows that test against the minimum versions, in which the key packages (numpy/pyarrow/pandas) are pinned.

But I personally think we should maybe use a fixed version (foo==1.2.3) or a bounded range (foo>=1.0,<=2.3) in the officially released images.
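
As a rough illustration of that idea (a sketch only, not how the official release images are built today), exact pins could be derived from the environment the tests actually ran against:

```python
from importlib.metadata import distributions

# Emit an exact pin (name==version) for every package installed in the
# environment that was tested, producing a fully pinned requirements list.
# Illustrative only; the official release images are not built this way today.
for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f'{dist.metadata["Name"]}=={dist.version}')
```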

@dongjoon-hyun @HyukjinKwon

@zhengruifeng (Contributor Author) commented Oct 16, 2025

For this one, do you think we need to document this incompatibility somewhere, given that our minimum numpy is still 1.22 and pyarrow is 15.0.0?

The failure was caused by a bug in pyarrow 19.0.0; see apache/arrow#45283.

@dongjoon-hyun I am not sure, since it is a pyarrow bug that was introduced in 19.0.0 and fixed in 19.0.1. I suspect there may also be similar cases in other packages.
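
For reference, a runtime guard against that single bad release could look roughly like the following (a sketch only; this is not PySpark's actual dependency-check code, and the error message is made up):

```python
from importlib.metadata import version
from packaging.version import Version

# The regression tracked at apache/arrow#45283 was introduced in pyarrow
# 19.0.0 and fixed in 19.0.1, so only that single release needs to be rejected.
installed = Version(version("pyarrow"))
if installed == Version("19.0.0"):
    raise RuntimeError(
        "pyarrow 19.0.0 has a known regression (apache/arrow#45283); "
        "please upgrade to 19.0.1 or newer."
    )
```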

@pan3793 (Member) commented Oct 16, 2025

But I personally think we should maybe use a fixed version (foo==1.2.3) or a bounded range (foo>=1.0,<=2.3) in the officially released images.

@zhengruifeng, that makes a lot of sense!

@zhengruifeng (Contributor Author) commented:
Merged to master to restore the CI.

@zhengruifeng zhengruifeng deleted the restore_numpy_213 branch October 16, 2025 05:27
@dongjoon-hyun (Member) commented:
I am not sure, since it is a pyarrow bug that was introduced in 19.0.0 and fixed in 19.0.1.

Thank you. In that case, it looks okay to me, too; we don't need to pay extra attention to it.
