[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on #51006


Closed
xinrong-meng wants to merge 3 commits into apache:master from xinrong-meng:arr_idx_enable

Conversation

xinrong-meng
Member

@xinrong-meng xinrong-meng commented May 24, 2025

What changes were proposed in this pull request?

Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

Does this PR introduce any user-facing change?

Yes. `split`/`rsplit` no longer fail with INVALID_ARRAY_INDEX when ANSI mode is on:

```py
>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)
```

FROM

```py
>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...
```

TO

```py
>>> psser.str.split("-", n=1, expand=True)
       0      1
0  hello  world
1  short   None
```
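
Under the hood, the PR swaps ANSI-sensitive array indexing for `F.try_element_at`, which returns NULL for an out-of-bounds index instead of raising. Below is a minimal standalone sketch of the contrast; the DataFrame and column name are illustrative, not from the PR:

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Hypothetical data: the second array has no element at index 2.
df = spark.createDataFrame([(["hello", "world"],), (["short"],)], ["parts"])

# element_at is 1-based and, under ANSI mode, raises INVALID_ARRAY_INDEX
# on out-of-bounds access:
# df.select(F.element_at("parts", 2)).show()  # would fail on the second row

# try_element_at is also 1-based but yields NULL instead of raising:
df.select(F.try_element_at("parts", F.lit(2))).show()
# rows: "world" for the first array, NULL for the second
```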

How was this patch tested?

Unit tests
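
A rough sketch of the kind of round-trip check such a unit test could make (the test function below is hypothetical; the real tests live in the pandas-on-Spark test suite):

```py
import pandas as pd
import pyspark.pandas as ps

def test_split_expand_with_ansi():
    # Mirrors the example above: one row splits cleanly, one is too short.
    psser = ps.from_pandas(pd.Series(["hello-world", "short"]))
    result = psser.str.split("-", n=1, expand=True).to_pandas()
    assert result.iloc[0].tolist() == ["hello", "world"]
    assert result.iloc[1, 0] == "short"
    # The missing piece is a null, not an INVALID_ARRAY_INDEX error.
    assert pd.isna(result.iloc[1, 1])
```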

Was this patch authored or co-authored using generative AI tooling?

No

@xinrong-meng xinrong-meng changed the title [WIP] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on May 24, 2025

```py
if ps.get_option("compute.ansi_mode_support"):
    spark_columns = [
        F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)
    ]
```
@abhiips07 abhiips07 May 26, 2025

Can we compute `F.lit(i + 1)` outside the loop? That might avoid repeated function calls and object creation during loop execution:

```py
literals = [F.lit(i + 1) for i in range(n + 1)]

spark_columns = [F.try_element_at(scol, lit).alias(str(i)) for i, lit in enumerate(literals)]
```

Member Author


Thanks for the suggestion!
There might not be a significant perf difference between creating F.lit inside the loop and hoisting it out: it just wraps a Python literal in a Spark expression, which isn't executed immediately (it's only a node in the DAG) and gets deduplicated by Catalyst. With that said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!
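
For illustration (a standalone snippet, not the PR's code): column-building functions like `F.lit` and `F.try_element_at` only construct expression nodes, so creating them inside a comprehension is cheap and nothing executes until an action runs.

```py
from pyspark.sql import functions as F

# Each call wraps a value or expression into a Column node; no Spark job
# is triggered at this point.
cols = [F.try_element_at(F.col("parts"), F.lit(i + 1)).alias(str(i)) for i in range(2)]
print(cols[1])  # prints the unresolved column expression
```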

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on May 28, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review May 28, 2025 21:01
@xinrong-meng xinrong-meng changed the title [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on May 28, 2025
@xinrong-meng xinrong-meng marked this pull request as draft May 28, 2025 22:07
@xinrong-meng xinrong-meng marked this pull request as ready for review June 2, 2025 21:59
@xinrong-meng
Member Author

@ueshin would you please review? Thanks!

@ueshin
Member

ueshin commented Jun 6, 2025

Thanks! Merging to master.

@ueshin ueshin closed this in 5633e93 Jun 6, 2025
yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
Closes apache#51006 from xinrong-meng/arr_idx_enable.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>