
Conversation


@jerrypeng jerrypeng commented Oct 2, 2025

What changes were proposed in this pull request?

Add a memory source implementation that supports Real-time Mode. This source will be used to test Real-time Mode. In this implementation, an RPC server is set up on the driver, and source tasks continuously poll it for new data. This differs from the existing memory source implementation, where data is sent to tasks only once, as part of the Partition/Split metadata at the beginning of a batch.
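To make the difference concrete, here is a minimal, self-contained sketch of the pull-based pattern described above. RecordServer and PollingTask are hypothetical names, and the RPC layer is elided; in the actual source, tasks reach the driver-side buffers through an RPC endpoint rather than a direct reference.

import scala.collection.mutable.ArrayBuffer

// Driver side: per-partition record buffers, answered via point lookups.
class RecordServer(numPartitions: Int) {
  private val records = Array.fill(numPartitions)(ArrayBuffer.empty[String])

  def addData(partition: Int, value: String): Unit = synchronized {
    records(partition) += value
  }

  // A task asks for the record at (partition, offset); None means "no data yet".
  def getRecord(partition: Int, offset: Int): Option[String] = synchronized {
    if (offset < records(partition).size) Some(records(partition)(offset)) else None
  }
}

// Task side: keeps polling until a deadline, instead of receiving all data
// up front in the partition/split metadata.
object PollingTask {
  def run(server: RecordServer, partition: Int, deadlineMs: Long): Seq[String] = {
    val out = ArrayBuffer.empty[String]
    var offset = 0
    while (System.currentTimeMillis() < deadlineMs) {
      server.getRecord(partition, offset) match {
        case Some(v) => out += v; offset += 1
        case None => Thread.sleep(10) // back off until new data arrives
      }
    }
    out.toSeq
  }
}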

Why are the changes needed?

To test Real-time Mode queries.

Does this PR introduce any user-facing change?

No, this source exists purely to help with testing.

How was this patch tested?

Actual unit tests for RTM will be added in the future, once the engine supports actually running queries in RTM.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot removed the CORE label Oct 13, 2025
@jerrypeng jerrypeng changed the title from "[WIP] [SPARK-53785][SS] Memory Source for RTM" to "[SPARK-53785][SS] Memory Source for RTM" Oct 14, 2025
import org.apache.spark.util.{Clock, SystemClock}

/* The singleton object to control the time in testing */
object LowLatencyClock {
@jerrypeng (Contributor Author):

Planning to add additional code to this file, i.e. the actual implementation of RealTimeStreamScanExec. Adding LowLatencyClock here since it is used by LowLatencyMemorySource. Trying to keep the PRs small :)
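A plausible sketch of what such a singleton could look like (the method names setClock/getTimeMillis are assumptions; the actual body is in the PR diff): it defaults to the system clock, and tests can install e.g. org.apache.spark.util.ManualClock to advance time deterministically.

import org.apache.spark.util.{Clock, SystemClock}

object LowLatencyClock {
  // Real time by default; tests swap in a controllable clock.
  @volatile private var clock: Clock = new SystemClock()

  def getTimeMillis(): Long = clock.getTimeMillis()

  def setClock(newClock: Clock): Unit = { clock = newClock }
}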

@viirya (Member):

Do you mean you will add the code of RealTimeStreamScanExec in another PR?

@jerrypeng (Contributor Author):

This is the follow-up PR, to provide more context:
#52620

@viirya (Member) left a comment

}
}


@viirya (Member):

unnecessary change.

 * available.
 *
 * If numPartitions is provided, the rows will be redistributed to the given number of partitions.
 */
@viirya (Member):

Hm, this doc is the same as MemoryStream's (copied from it). Do we need to update the doc so it is more specific, or reflects the difference between MemoryStream and LowLatencyMemoryStream?

@jerrypeng (Contributor Author):

Let me remove this. The difference is described in the LowLatencyMemoryStream class.

override def planInputPartitions(start: OffsetV2): Array[InputPartition] = {
  val startOffset = start.asInstanceOf[LowLatencyMemoryStreamOffset]
  synchronized {
    val endpointName = s"ContinuousRecordEndpoint-${java.util.UUID.randomUUID()}-$id"
@viirya (Member):

RealTimeRecordEndpoint? Or LowLatencyRecordEndpoint? As this is not for ContinuousStream, ContinuousRecordEndpoint looks a bit confusing.

@jerrypeng (Contributor Author):

I will rename it, but the reason for the name is that ContinuousRecordEndpoint is borrowed from ContinuousMemoryStream.

@viirya (Member):

Hmm, oh, you are reusing ContinuousRecordEndpoint. I see.

val startOffset = start.asInstanceOf[LowLatencyMemoryStreamOffset]
val endOffset = end.asInstanceOf[LowLatencyMemoryStreamOffset]
synchronized {
  val endpointName = s"ContinuousRecordEndpoint-${java.util.UUID.randomUUID()}-$id"
@viirya (Member):

ditto

with SupportsRealTimeMode {
  private implicit val formats: Formats = Serialization.formats(NoTypeHints)

  // LowLatencyReader implementation
@viirya (Member):

Is this comment supposed to be here? Seems unrelated to the code.

@jerrypeng (Contributor Author):

will remove

}

override def latestOffset(startOffset: OffsetV2, limit: ReadLimit): OffsetV2 = {
  LowLatencyMemoryStreamOffset((0 until numPartitions).map(i => (i, records(i).size)).toMap)
@viirya (Member) Oct 15, 2025:

Do we also need to add synchronized to latestOffset, as it also accesses records? It only reads, but the sizes it sees might be inconsistent.

@jerrypeng (Contributor Author):

This is an interesting point. For RTM, the offset returned from latestOffset is actually not used; the offset returned from latestOffset defines the end offset of a batch only for non-RTM streaming queries. In RTM, the end offset of a batch is calculated when the batch finishes. However, this source also supports non-RTM queries. In streaming tests we typically use the StreamTest framework, which executes test actions and batches in synchronized steps, so such a race should not happen. Still, as a best practice I will add synchronized to the method.
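For reference, the change discussed here amounts to something like the following sketch, based on the diff shown above:

override def latestOffset(startOffset: OffsetV2, limit: ReadLimit): OffsetV2 = synchronized {
  // Guard the read of records so a concurrent addData cannot yield inconsistent sizes.
  LowLatencyMemoryStreamOffset((0 until numPartitions).map(i => (i, records(i).size)).toMap)
}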


override def reset(): Unit = {
  super.reset()
  records.foreach(_.clear())
@viirya (Member):

Need synchronized too?

@jerrypeng (Contributor Author):

will add
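That is, roughly (a sketch based on the diff above):

override def reset(): Unit = synchronized {
  super.reset()
  records.foreach(_.clear())
}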

@jerrypeng (Contributor Author):

@viirya thank you for the review. I have addressed your comments. PTAL!

@jerrypeng jerrypeng requested a review from viirya October 15, 2025 21:43
@viirya viirya closed this in 8499a62 Oct 17, 2025
@viirya (Member) commented Oct 17, 2025:

Merged to master. Thanks @jerrypeng

@jerrypeng (Contributor Author):

@viirya thank you for the review and merging!
