-
Notifications
You must be signed in to change notification settings - Fork 5
[SPARK-51508] Support collect(): [[String?]]
for DataFrame
#17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -58,7 +58,7 @@ public actor DataFrame: Sendable { | |||
|
|||
/// Add `Apache Arrow`'s `RecordBatch`s to the internal array. | |||
/// - Parameter batches: An array of ``RecordBatch``. | |||
private func addBathes(_ batches: [RecordBatch]) { | |||
private func addBatches(_ batches: [RecordBatch]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a typo fix.
return expression | ||
} | ||
sort.order = expressions | ||
sort.isGlobal = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I piggy-back this fix while improving a test coverage by checking the result via collect()
API.
#if os(iOS) || os(watchOS) || os(tvOS) | ||
let userName = processInfo.environment["SPARK_USER"] ?? "" | ||
#elseif os(macOS) || os(Linux) | ||
#if os(macOS) || os(Linux) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the implementation in a new way to cover os(visionOS)
too.
9168481
to
7597ad5
Compare
@@ -125,19 +125,25 @@ struct DataFrameTests { | |||
await spark.stop() | |||
} | |||
|
|||
#if !os(Linux) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Like
show()
,collect()
currently has a binary compatibility issue onos(Linux)
. - On MacOS, all tests pass.
Could you review this, @viirya ? I added the first implementation for |
7597ad5
to
70ef41c
Compare
collect(): [[String]]
for DataFrame
collect(): [[String?]]
for DataFrame
for column in batch.columns { | ||
let str = column.array as! AsString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For DataFrame, I suppose that the return of collect
is an array of Row
. But this collect
returns strings. Is it just for initial implementation and will be Row
later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct. Row
implementation is on the way~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Scala
client, we also support Array[Long]
like the following.
$ bin/spark-shell --remote sc://localhost:15002
scala> spark.range(1).collect()
res0: Array[java.lang.Long] = Array(0L)
Thank you for helping this moving forward, @viirya ! |
What changes were proposed in this pull request?
This PR aims to support
DataFrame.collect()
with the return type, an array of String array.Why are the changes needed?
There are two main goals.
collect()
API.Does this PR introduce any user-facing change?
No, this is not released yet.
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.