[SPARK-52288][PS] Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on #51006
base: master
Conversation
```python
if ps.get_option("compute.ansi_mode_support"):
    spark_columns = [
        F.try_element_at(scol, F.lit(i + 1)).alias(str(i)) for i in range(n + 1)
    ]
```
Can we calculate F.lit(i + 1) outside the loop? This might avoid function calls and object creation during loop execution:

```python
literals = [F.lit(i + 1) for i in range(n + 1)]
spark_columns = [F.try_element_at(scol, lit).alias(str(i)) for i, lit in enumerate(literals)]
```
Thanks for the suggestion!
There might not be a significant performance difference between creating F.lit inside the loop and hoisting it out: F.lit just wraps a Python literal into a Spark expression, which isn't executed immediately (it's only a node in the DAG) and will be deduplicated by Catalyst. With that said, I'd like to keep the original for simplicity, but feel free to share if you have other opinions!
```diff
@@ -2031,7 +2031,13 @@ def pudf(s: pd.Series) -> pd.Series:
         if expand:
             psdf = psser.to_frame()
             scol = psdf._internal.data_spark_columns[0]
-            spark_columns = [scol[i].alias(str(i)) for i in range(n + 1)]
+
+            if ps.get_option("compute.ansi_mode_support"):
```
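The diff swaps plain indexing (`scol[i]`, i.e. element_at) for `F.try_element_at`. A minimal plain-Python sketch of the behavioral difference this relies on (the helper below is a stand-in model, not the Spark API): under ANSI mode, Spark's element_at raises INVALID_ARRAY_INDEX for an out-of-range index, while try_element_at returns NULL, which matches the padding pandas produces.

```python
def try_element_at(arr, index):
    """Model of Spark's try_element_at: 1-based lookup that returns
    None instead of raising when the index is out of range."""
    if 1 <= index <= len(arr):
        return arr[index - 1]
    return None

# "x,y" splits into only 2 parts, but expand=True wants n + 1 = 3 columns.
parts = "x,y".split(",")
columns = [try_element_at(parts, i + 1) for i in range(3)]
print(columns)  # ['x', 'y', None] -- short rows are padded with None/NULL
```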
It seems we can always use the new branch and remove the configuration read here (which needs one Config RPC).
What did you mean by "using the new branch"?
Pending config check introduced in #50972
What changes were proposed in this pull request?
Avoid INVALID_ARRAY_INDEX in split/rsplit when ANSI mode is on.
Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.
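For context on why out-of-range indexing arises at all, here is a pure-Python sketch (an illustrative helper, not the pandas-on-Spark implementation) of the padding that `split(expand=True)` performs: with at most `n` splits there are always `n + 1` output columns, so the i-th element must be fetched even when the split produced fewer parts.

```python
def split_expand(value, sep, n):
    """Mimic pandas Series.str.split(sep, n=n, expand=True) for one value:
    up to n splits yield n + 1 columns; shorter rows are padded with None."""
    parts = value.split(sep, n)
    return parts + [None] * (n + 1 - len(parts))

print(split_expand("a,b,c", ",", 2))  # ['a', 'b', 'c']
print(split_expand("x", ",", 2))      # ['x', None, None]
```

Under ANSI mode, fetching those missing trailing positions with element_at is exactly what triggered INVALID_ARRAY_INDEX.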
Does this PR introduce any user-facing change?
Yes. INVALID_ARRAY_INDEX no longer fails split/rsplit when ANSI mode is on.
FROM
TO
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No