[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on
### What changes were proposed in this pull request?
Avoid INVALID_ARRAY_INDEX in `split`/`rsplit` when ANSI mode is on
### Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.
### Does this PR introduce _any_ user-facing change?
Yes. `INVALID_ARRAY_INDEX` no longer fails `split`/`rsplit` when ANSI mode is on.
```py
>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)
```
FROM
```py
>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...
```
TO
```py
>>> psser.str.split("-", n=1, expand=True)
       0      1
0  hello  world
1  short   None
```
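The expected behavior above (missing columns filled with `None` rather than raising an out-of-bounds error) can be sketched in plain Python. This is a hypothetical helper for illustration only, not the actual pandas-on-Spark implementation, which uses Spark SQL column expressions:

```python
def split_expand(values, sep, n):
    """Mimic str.split(sep, n=n, expand=True): split each string at most
    n times, then pad short results with None so every row has n + 1
    columns -- analogous to SQL get() returning NULL for an
    out-of-bounds array index instead of raising INVALID_ARRAY_INDEX."""
    rows = []
    for value in values:
        parts = value.split(sep, n)
        # Pad with None so rows with fewer splits still align column-wise.
        parts += [None] * (n + 1 - len(parts))
        rows.append(parts)
    return rows

print(split_expand(["hello-world", "short"], "-", 1))
# [['hello', 'world'], ['short', None]]
```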
### How was this patch tested?
Unit tests
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #51006 from xinrong-meng/arr_idx_enable.
Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>