
[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on #51006


Draft
wants to merge 2 commits into
base: master

xinrong-meng
Member

@xinrong-meng xinrong-meng commented May 24, 2025

What changes were proposed in this pull request?

Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

Does this PR introduce any user-facing change?

Yes. split/rsplit no longer fail with INVALID_ARRAY_INDEX when ANSI mode is on.

>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)

FROM

>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)  
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...

TO

>>> psser.str.split("-", n=1, expand=True)
       0      1                                                                 
0  hello  world
1  short   None
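
To make the before/after behavior concrete, here is a minimal pure-Python model of the two lookup styles. It does not call Spark. `strict_get`, `tolerant_get`, and `split_expand` are hypothetical helpers introduced only for illustration: the first mimics 0-based `scol[i]` with ANSI mode on, the second mimics a 1-based lookup in the style of `F.try_element_at`, which yields NULL for a position past the end of the array. Only positive positions are modeled.

```python
from typing import List, Optional


def strict_get(parts: List[str], i: int) -> str:
    """Toy model of 0-based `scol[i]` with ANSI mode on: out of range raises."""
    if not 0 <= i < len(parts):
        raise IndexError(
            f"[INVALID_ARRAY_INDEX] The index {i} is out of bounds. "
            f"The array has {len(parts)} elements."
        )
    return parts[i]


def tolerant_get(parts: List[str], position: int) -> Optional[str]:
    """Toy model of a 1-based tolerant lookup: out of range yields None (NULL)."""
    if position < 1:
        raise ValueError("this sketch only models positive 1-based positions")
    return parts[position - 1] if position <= len(parts) else None


def split_expand(values: List[str], sep: str, n: int) -> List[List[Optional[str]]]:
    """Split each value at most n times and pad missing columns with None."""
    rows = []
    for value in values:
        parts = value.split(sep, n)
        # Always build n + 1 columns, as in `range(n + 1)` in the patch.
        rows.append([tolerant_get(parts, i + 1) for i in range(n + 1)])
    return rows


rows = split_expand(["hello-world", "short"], "-", 1)
print(rows)  # [['hello', 'world'], ['short', None]]
```

With `strict_get(["short"], 1)` the model raises, which corresponds to the FROM case; the tolerant lookup produces the padded row shown in the TO case.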

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@xinrong-meng xinrong-meng changed the title [WIP] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on May 24, 2025

if ps.get_option("compute.ansi_mode_support"):
    spark_columns = [
        F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)

@abhiips07 abhiips07 May 26, 2025


Can we compute F.lit(i + 1) outside the loop? That might avoid function calls and object creation during loop execution:

literals = [F.lit(i + 1) for i in range(n + 1)]

spark_columns = [F.try_element_at(scol, lit).alias(str(i)) for i, lit in enumerate(literals)]

Member Author


Thanks for the suggestion!
There is unlikely to be a significant perf difference between creating F.lit inside the loop and creating it beforehand: it just wraps a Python literal in a Spark expression, which isn't executed immediately (it is just a node in the DAG) and will be deduplicated by Catalyst. That said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!
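
The laziness argument can be illustrated with a toy expression tree in plain Python. This is only a sketch under stated assumptions: `Col`, `Lit`, and `TryElementAt` are hypothetical stand-ins for Spark column expressions, not PySpark classes, and the example says nothing about what Catalyst actually does. It shows that building the literal inside the comprehension or ahead of time yields structurally equal expression nodes, and that constructing them evaluates nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a column reference."""
    name: str


@dataclass(frozen=True)
class Lit:
    """Hypothetical stand-in for a literal expression node."""
    value: int


@dataclass(frozen=True)
class TryElementAt:
    """Hypothetical stand-in for an aliased tolerant array lookup."""
    array: Col
    position: Lit
    alias: str


scol = Col("split_result")
n = 2

# Variant kept in the patch: the literal is created inside the comprehension.
inline = [TryElementAt(scol, Lit(i + 1), str(i)) for i in range(n + 1)]

# Reviewer's variant: literals are created first, then reused.
literals = [Lit(i + 1) for i in range(n + 1)]
hoisted = [TryElementAt(scol, lit, str(i)) for i, lit in enumerate(literals)]

# Both variants describe the same plan; nothing has been evaluated yet.
print(inline == hoisted)  # True
```

Either way the same number of literal nodes is created, one per output column, so in this model hoisting changes only where they are constructed.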

@@ -2031,7 +2031,13 @@ def pudf(s: pd.Series) -> pd.Series:
if expand:
    psdf = psser.to_frame()
    scol = psdf._internal.data_spark_columns[0]
    spark_columns = [scol[i].alias(str(i)) for i in range(n + 1)]

    if ps.get_option("compute.ansi_mode_support"):
Contributor


It seems we can always use the new branch, and remove the configuration read here (which needs one Config RPC)

Member Author


What did you mean by "using the new branch"?

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on May 28, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review May 28, 2025 21:01
@xinrong-meng xinrong-meng changed the title [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on May 28, 2025
@xinrong-meng
Member Author

Pending config check introduced in #50972

@xinrong-meng xinrong-meng marked this pull request as draft May 28, 2025 22:07