[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in `split`/`rsplit` when ANSI mode is on #51006
Conversation
```py
if ps.get_option("compute.ansi_mode_support"):
    spark_columns = [
        F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)
    ]
```
Can we calculate `F.lit(i + 1)` outside the loop? That might avoid function calls and object creation during loop execution:

```py
literals = [F.lit(i + 1) for i in range(n + 1)]
spark_columns = [F.try_element_at(scol, lit).alias(str(i)) for i, lit in enumerate(literals)]
```
Thanks for the suggestion!
There might not be a significant perf difference between creating `F.lit` inside the loop or beforehand; it just wraps a Python literal into a Spark expression. These expressions aren't executed immediately (they are just nodes in the DAG) and will be deduplicated by Catalyst. With that being said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!
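For context, Spark SQL's `try_element_at` uses 1-based indexing and returns NULL for an out-of-bounds index instead of raising INVALID_ARRAY_INDEX. A minimal pure-Python sketch of that semantic for positive indices (a hypothetical helper for illustration, not the Spark implementation; the PR always passes `i + 1 >= 1`):

```python
def try_element_at(arr, index):
    """Mimic Spark's try_element_at for positive, 1-based indices:
    return the element, or None when the index is out of bounds."""
    if arr is None:
        return None
    if 1 <= index <= len(arr):
        return arr[index - 1]
    return None

parts = "hello-world".split("-", 1)   # ['hello', 'world']
print(try_element_at(parts, 1))       # hello
print(try_element_at(parts, 3))       # None rather than an IndexError
```

This is why the fix produces `None` in the expanded columns for rows with fewer splits, matching plain pandas behavior.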
Force-pushed from 21da378 to 3fc1656
@ueshin would you please review, thanks!
Thanks! merging to master.
Closes apache#51006 from xinrong-meng/arr_idx_enable.
Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
What changes were proposed in this pull request?
Avoid INVALID_ARRAY_INDEX in `split`/`rsplit` when ANSI mode is on.

Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

Does this PR introduce any user-facing change?
Yes. INVALID_ARRAY_INDEX no longer fails `split`/`rsplit` when ANSI mode is on.

```py
>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)
```

FROM

```py
>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...
```

TO

```py
>>> psser.str.split("-", n=1, expand=True)
       0      1
0  hello  world
1  short   None
```

How was this patch tested?
Unit tests

Was this patch authored or co-authored using generative AI tooling?
No
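The expanded output after the fix matches what plain pandas produces for the same input. A quick parity check against pandas itself (assuming pandas is installed; no Spark session needed):

```python
import pandas as pd

# The pandas behavior that pandas-on-Spark str.split should match
# even when spark.sql.ansi.enabled is true:
pser = pd.Series(["hello-world", "short"])
df = pser.str.split("-", n=1, expand=True)
# Row 0 splits into ["hello", "world"]; row 1 has no separator,
# so its second column is a missing value rather than an error.
```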