[SPARK-51560] Support cache/persist/unpersist for DataFrame #22

Closed
wants to merge 2 commits into from


@dongjoon-hyun dongjoon-hyun commented Mar 19, 2025

What changes were proposed in this pull request?

This PR aims to support cache, persist, and unpersist for DataFrame.
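A minimal usage sketch of the three calls, for orientation only. `SparkSession.builder.getOrCreate()`, `range`, `persist()`, `persist(useDisk:)`, `count()`, and `stop()` appear in this PR's tests; the `SparkConnect` module name, the function name `cachingSketch`, and the assumption that `cache()` and `unpersist()` return a DataFrame and chain the same way as `persist()` are not confirmed by this page.

```swift
import SparkConnect  // assumed module name for the Swift client

// Usage sketch only: it needs a reachable Spark Connect server to run.
func cachingSketch() async throws {
  let spark = try await SparkSession.builder.getOrCreate()

  // Default storage level, as in the persist() test of this PR.
  let a = try await spark.range(20).persist().count()

  // Same call with disk storage disabled, also taken from the test.
  let b = try await spark.range(21).persist(useDisk: false).count()

  // Assumption: cache() and unpersist() chain like persist() does.
  let c = try await spark.range(22).cache().count()
  let d = try await spark.range(23).persist().unpersist().count()

  print(a, b, c, d)
  await spark.stop()
}
```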

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

No. This is a new addition.

How was this patch tested?

Pass the CIs.

$ swift test --filter DataFrameTests
...
􀟈  Test run started.
􀄵  Testing Library Version: 102 (arm64e-apple-macos13.0)
􀟈  Suite DataFrameTests started.
􀟈  Test orderBy() started.
􀟈  Test isEmpty() started.
􀟈  Test show() started.
􀟈  Test persist() started.
􀟈  Test showCommand() started.
􀟈  Test table() started.
􀟈  Test selectMultipleColumns() started.
􀟈  Test showNull() started.
􀟈  Test schema() started.
􀟈  Test selectNone() started.
􀟈  Test rdd() started.
􀟈  Test sort() started.
􀟈  Test unpersist() started.
􀟈  Test limit() started.
􀟈  Test count() started.
􀟈  Test cache() started.
􀟈  Test selectInvalidColumn() started.
􀟈  Test collect() started.
􀟈  Test countNull() started.
􀟈  Test select() started.
􀟈  Test persistInvalidStorageLevel() started.
􁁛  Test rdd() passed after 0.571 seconds.
􁁛  Test selectNone() passed after 1.347 seconds.
􁁛  Test select() passed after 1.354 seconds.
􁁛  Test selectMultipleColumns() passed after 1.354 seconds.
􁁛  Test selectInvalidColumn() passed after 1.395 seconds.
􁁛  Test schema() passed after 1.747 seconds.
++
||
++

++
􁁛  Test showCommand() passed after 1.885 seconds.
+-----------+-----------+-------------+
| namespace | tableName | isTemporary |
+-----------+-----------+-------------+

+-----------+-----------+-------------+
+------+-------+------+
| col1 | col2  | col3 |
+------+-------+------+
| 1    | true  | abc  |
| NULL | NULL  | NULL |
| 3    | false | def  |
+------+-------+------+
􁁛  Test showNull() passed after 1.890 seconds.
+------+-------+
| col1 | col2  |
+------+-------+
| true | false |
+------+-------+
+------+------+
| col1 | col2 |
+------+------+
| 1    | 2    |
+------+------+
+------+------+
| col1 | col2 |
+------+------+
| abc  | def  |
| ghi  | jkl  |
+------+------+
􁁛  Test show() passed after 1.975 seconds.
􁁛  Test collect() passed after 2.045 seconds.
􁁛  Test countNull() passed after 2.566 seconds.
􁁛  Test persistInvalidStorageLevel() passed after 2.578 seconds.
􁁛  Test cache() passed after 2.683 seconds.
􁁛  Test isEmpty() passed after 2.778 seconds.
􁁛  Test count() passed after 2.892 seconds.
􁁛  Test persist() passed after 2.903 seconds.
􁁛  Test unpersist() passed after 2.917 seconds.
􁁛  Test limit() passed after 3.068 seconds.
􁁛  Test orderBy() passed after 3.101 seconds.
􁁛  Test sort() passed after 3.102 seconds.
􁁛  Test table() passed after 3.720 seconds.
􁁛  Suite DataFrameTests passed after 3.720 seconds.
􁁛  Test run with 21 tests passed after 3.720 seconds.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

Could you review this PR, @viirya?

Comment on lines +204 to +210
  @Test
  func persist() async throws {
    let spark = try await SparkSession.builder.getOrCreate()
    #expect(try await spark.range(20).persist().count() == 20)
    #expect(try await spark.range(21).persist(useDisk: false).count() == 21)
    await spark.stop()
  }
Member

This only checks that the result is correct. Even if the DataFrame is not actually cached, we would still get the same result. Do we have a way to check whether it is actually cached?

Member Author

Yes, correct! This is only an API test, @viirya.

I checked the cached and uncached results manually via the Connect Server UI because we don't have the catalog part yet.
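To illustrate the point raised in this thread, here is a self-contained toy sketch showing why an equality check on `count()` cannot tell a cached DataFrame from an uncached one. `ToyFrame` and its `computations` counter are hypothetical and are not part of the Spark Connect Swift API; they only model the idea that caching changes how often work is done, not what the result is.

```swift
// Hypothetical toy type; it is NOT part of any Spark API.
final class ToyFrame {
  private let n: Int
  private var cachingEnabled = false
  private var cachedCount: Int? = nil
  // Number of times the result was actually recomputed.
  private(set) var computations = 0

  init(range n: Int) { self.n = n }

  // Marks the frame as cacheable and returns it for chaining.
  func persist() -> ToyFrame {
    cachingEnabled = true
    return self
  }

  func count() -> Int {
    if let cached = cachedCount { return cached }
    computations += 1
    let result = (0..<n).count
    if cachingEnabled { cachedCount = result }
    return result
  }
}

let plain = ToyFrame(range: 20)
let persisted = ToyFrame(range: 20).persist()

var plainResult = 0
var persistedResult = 0
for _ in 0..<3 {
  plainResult = plain.count()
  persistedResult = persisted.count()
}

print(plainResult == persistedResult)  // true: results are indistinguishable
print(plain.computations)              // 3: recomputed on every call
print(persisted.computations)          // 1: computed once, then served from cache
```

Only a side channel such as the counter here (or, in the real system, server-side state like the UI mentioned above) reveals whether caching took effect.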

  func persistInvalidStorageLevel() async throws {
    let spark = try await SparkSession.builder.getOrCreate()
    try await #require(throws: Error.self) {
      let _ = try await spark.range(9999).persist(replication: 0).count()
Member Author

BTW, this one is a valid check of the error result.

@dongjoon-hyun
Member Author

Thank you! Merged to main.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-51560 branch March 19, 2025 18:49