[SPARK-51483] Add SparkSession and DataFrame actors #10

Closed
wants to merge 3 commits into from

Conversation

@dongjoon-hyun (Member) commented Mar 12, 2025

What changes were proposed in this pull request?

This PR aims to add SparkSession and DataFrame actors.

  • SparkSession.SparkContext is defined as an empty struct that serves only as a type.
  • SparkSession.Builder is defined to follow the builder pattern (see the usage sketch below).
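As a rough illustration, a minimal usage sketch of the new actors, mirroring the test added in this PR. The `import SparkConnect` module name is an assumption, and the code must run inside an async throwing context:

import SparkConnect  // assumed module name; adjust to the actual package product

// Obtain a session via the builder pattern and run a trivial query.
let spark = try await SparkSession.builder.getOrCreate()
let count = try await spark.range(10).count()  // executed by the Spark Connect server
print(count)  // 10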

Why are the changes needed?

To allow users to start using this library. After this PR, we can run the tests against real Spark Connect servers.

Does this PR introduce any user-facing change?

No, this is not released yet.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

}

/// Add `Apache Arrow`'s `RecordBatch`s to the internal array.
/// - Parameter batches: A ``RecordBatch`` instance.
Member commented:
Shouldn't it be an array?

@dongjoon-hyun (Member Author) commented:

Oops. typo. Let me fix it.
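For reference, the corrected doc comment would presumably read:

/// Add `Apache Arrow`'s `RecordBatch`s to the internal array.
/// - Parameter batches: An array of ``RecordBatch`` instances.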

dongjoon-hyun and others added 2 commits March 11, 2025 23:18
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@Test
func range() async throws {
  let spark = try await SparkSession.builder.getOrCreate()
  #expect(try await spark.range(10).count() == 10)
}
@viirya (Member) commented Mar 12, 2025
Hmm, so the test already connects to a real server to execute a range plan?

@dongjoon-hyun (Member Author) commented:

Yes, it works with the real server now. Please see here.

Member commented:

Oh, I mean in the current GitHub Actions, do we already have a Connect Server running to run these tests?

Member commented:
> Improving CIs by enabling the Spark Connect server via Docker in GitHub Action CIs.

Ah, I see that it is in later steps. Thanks.

@dongjoon-hyun (Member Author) commented:
Thank you for helping this effort so far, @viirya. This is the last piece of the initial implementation. After this, I'm moving forward to:

  • Adding more test cases for various data types and clarifying the type support matrix.
  • Adding missing features.
  • Polishing the implementation.
  • Integrating with Swift Package Index.
  • Improving CIs by enabling the Spark Connect server via Docker in GitHub Action CIs.

@dongjoon-hyun (Member Author) commented:
Per the review comment, I didn't mention 4.0.0 RC2 in README.md yet.

For now, this is intended to support Apache Spark 4.0.0+ only.

@dongjoon-hyun (Member Author) commented:
Thank you.

For the record, the first MVP (Minimum Viable Product) focuses on the SQL area of Apache Spark 4.0.0, including the following.

@dongjoon-hyun (Member Author) commented:
Merged to main~

@dongjoon-hyun deleted the SPARK-51483 branch March 12, 2025 06:34