[SPARK-53882][CONNECT][DOCS] Add documentation comparing behavioral differences between Spark Connect and Spark Classic #52585
base: master
Conversation
docs/spark-connect-gotchas.md
Outdated
limitations under the License.
---

The comparison highlights key differences between Spark Connect and Spark Classic in terms of execution and analysis behavior. While both utilize lazy execution for transformations, Spark Connect emphasizes deferred schema analysis, introducing unique considerations like temporary view handling and UDF evaluation. The guide outlines common gotchas and provides strategies for mitigation.
", Spark Connect emphasizes deferred schema analysis"
-> ", Spark Connect also defers analysis"
or ", Spark Connect analyzes lazily"
Try to avoid too much indirection.
Yes, done, and updated several other places as well.
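The eager-versus-deferred analysis contrast discussed in this thread can be mimicked in plain Python without a Spark installation. The sketch below is a toy model for illustration only: `EagerFrame` and `DeferredFrame` are hypothetical names, not Spark APIs. `EagerFrame` validates column names at call time, the way Spark Classic analyzes plans as they are built; `DeferredFrame` only records the plan and validates at execution time, the way Spark Connect defers analysis to the server.

```python
class EagerFrame:
    """Toy model of Spark Classic: analysis happens as the plan is built."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.selected = []

    def select(self, name):
        # A bad column name fails immediately, at plan-building time.
        if name not in self.columns:
            raise ValueError(f"column not found: {name}")
        self.selected.append(name)
        return self


class DeferredFrame:
    """Toy model of Spark Connect: analysis is deferred to execution."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.selected = []

    def select(self, name):
        # No validation yet: the plan is only recorded on the client side.
        self.selected.append(name)
        return self

    def collect(self):
        # Analysis happens here, when the plan is finally executed.
        for name in self.selected:
            if name not in self.columns:
                raise ValueError(f"column not found: {name}")
        return list(self.selected)


# An invalid column fails immediately in the eager model...
try:
    EagerFrame(["id", "name"]).select("agee")
    eager_failed_at = None
except ValueError:
    eager_failed_at = "select"

# ...but only at collect() time in the deferred model.
df = DeferredFrame(["id", "name"]).select("agee")  # no error yet
try:
    df.collect()
    deferred_failed_at = None
except ValueError:
    deferred_failed_at = "collect"

print(eager_failed_at, deferred_failed_at)  # select collect
```

This is exactly the migration gotcha the guide describes: code that relied on an error surfacing at `select` time in Spark Classic will see that error surface later, at action time, under Spark Connect.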
docs/spark-connect-gotchas.md
Outdated

**When does this matter?** These differences are particularly important when migrating existing code from Spark Classic to Spark Connect, or when writing code that needs to work with both modes. Understanding these distinctions helps avoid unexpected behavior and performance issues.

**Note:** The examples in this guide use Python, but the same principles apply to Scala and Java.
Please be a champ and also add Scala/Java
Good point, I've just added Scala examples.
What changes were proposed in this pull request?
Spark Connect is a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol, which is well documented in https://spark.apache.org/docs/latest/spark-connect-overview.html.
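As context, the client-server split means a session is created by pointing the client at a remote endpoint rather than starting a local JVM. The following is a minimal connection sketch, assuming PySpark 3.4+ with the Spark Connect client installed and a Connect server already running; the endpoint `sc://localhost:15002` is a placeholder for your own server:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of an in-process driver.
# "sc://localhost:15002" is a placeholder endpoint; substitute your server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(5)  # builds an unresolved logical plan on the client
df.show()            # the plan is sent to the server, analyzed, and executed there
```

Because the client only holds an unresolved plan, analysis and name resolution happen on the server, which is the root of the behavioral differences this document covers.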
However, there is a lack of guidance to help users understand the behavioral differences between Spark Classic and Spark Connect and to avoid unexpected behavior.
In this PR, a document is added that details the behavioral differences between Spark Connect and Spark Classic, in particular lazy schema analysis and name resolution, and their implications.
Why are the changes needed?
This doc helps users migrating from Spark Classic to Spark Connect understand the behavioral differences.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
N/A.
Was this patch authored or co-authored using generative AI tooling?
No.