[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on #51006


Closed
xinrong-meng wants to merge 3 commits into apache:master from xinrong-meng:arr_idx_enable

Conversation

xinrong-meng
Member

@xinrong-meng xinrong-meng commented May 24, 2025

What changes were proposed in this pull request?

Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on

Why are the changes needed?

Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

Does this PR introduce any user-facing change?

Yes. `split`/`rsplit` no longer fail with INVALID_ARRAY_INDEX when ANSI mode is on:

```py
>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)
```

FROM

```py
>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...
```

TO

```py
>>> psser.str.split("-", n=1, expand=True)
       0      1
0  hello  world
1  short   None
```
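
Under the hood, the PR swaps ANSI-sensitive array indexing for `F.try_element_at`, which returns NULL for an out-of-bounds index instead of raising. Below is a minimal standalone sketch of the contrast; the DataFrame and column name are illustrative, not from the PR:

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Hypothetical data: the second array has no element at index 2.
df = spark.createDataFrame([(["hello", "world"],), (["short"],)], ["parts"])

# element_at is 1-based and, under ANSI mode, raises INVALID_ARRAY_INDEX
# on out-of-bounds access:
# df.select(F.element_at("parts", 2)).show()  # would fail on the second row

# try_element_at is also 1-based but yields NULL instead of raising:
df.select(F.try_element_at("parts", F.lit(2))).show()
# rows: "world" for the first array, NULL for the second
```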

How was this patch tested?

Unit tests
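
A rough sketch of the kind of round-trip check such a unit test could make (the test function below is hypothetical; the real tests live in the pandas-on-Spark test suite):

```py
import pandas as pd
import pyspark.pandas as ps

def test_split_expand_with_ansi():
    # Mirrors the example above: one row splits cleanly, one is too short.
    psser = ps.from_pandas(pd.Series(["hello-world", "short"]))
    result = psser.str.split("-", n=1, expand=True).to_pandas()
    assert result.iloc[0].tolist() == ["hello", "world"]
    assert result.iloc[1, 0] == "short"
    # The missing piece is a null, not an INVALID_ARRAY_INDEX error.
    assert pd.isna(result.iloc[1, 1])
```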

Was this patch authored or co-authored using generative AI tooling?

No

@xinrong-meng xinrong-meng changed the title [WIP] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on May 24, 2025

```py
if ps.get_option("compute.ansi_mode_support"):
    spark_columns = [
        F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)
    ]
```
@abhiips07 abhiips07 May 26, 2025

Can we compute `F.lit(i + 1)` outside the loop? That might avoid repeated function calls and object creation during loop execution:

```py
literals = [F.lit(i + 1) for i in range(n + 1)]

spark_columns = [F.try_element_at(scol, lit).alias(str(i)) for i, lit in enumerate(literals)]
```

Member Author


Thanks for the suggestion!
There might not be a significant perf difference between creating F.lit inside the loop and hoisting it out: it just wraps a Python literal in a Spark expression, which isn't executed immediately (it's only a node in the DAG) and gets deduplicated by Catalyst. With that said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!
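
For illustration (a standalone snippet, not the PR's code): column-building functions like `F.lit` and `F.try_element_at` only construct expression nodes, so creating them inside a comprehension is cheap and nothing executes until an action runs.

```py
from pyspark.sql import functions as F

# Each call wraps a value or expression into a Column node; no Spark job
# is triggered at this point.
cols = [F.try_element_at(F.col("parts"), F.lit(i + 1)).alias(str(i)) for i in range(2)]
print(cols[1])  # prints the unresolved column expression
```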

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on May 28, 2025
@xinrong-meng xinrong-meng marked this pull request as ready for review May 28, 2025 21:01
@xinrong-meng xinrong-meng changed the title [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/split when ANSI mode is on [SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on May 28, 2025
@xinrong-meng xinrong-meng marked this pull request as draft May 28, 2025 22:07
@xinrong-meng xinrong-meng marked this pull request as ready for review June 2, 2025 21:59
@xinrong-meng
Member Author

@ueshin would you please review? Thanks!

@ueshin
Member

ueshin commented Jun 6, 2025

Thanks! Merging to master.

@ueshin ueshin closed this in 5633e93 Jun 6, 2025
yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
Closes apache#51006 from xinrong-meng/arr_idx_enable.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>