[vpj][spark] PubSub backed Spark raw table module. #1800

eldernewborn · 2025-05-14T22:59:14Z

Problem Statement

To build KIF-repush and on-demand ETL, we need to be able to materialize a Venice pub-sub topic in the form of a Spark table.

Solution

Build a Spark table and define Spark workers that pull data from individual pub-sub partitions and fill the table with data from Venice. At this stage we only materialize raw pub-sub messages, which paves the way for later stages where we process chunked messages and determine whether metadata is present.
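The per-partition read described above can be sketched in plain Java without any Spark or Venice dependency. This is only an illustration under stated assumptions, not the code in this PR: `RawMessage`, `PartitionSplit`, `PartitionSource`, `planSplits`, and `readSplit` are hypothetical names, offsets are assumed to start at 0, and messages are fetched one offset at a time, whereas a real consumer would poll in batches.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one split per pub-sub partition, each read into raw rows.
final class RawPubSubTableSketch {
  /** One raw pub-sub message, kept undecoded (hypothetical shape). */
  public static final class RawMessage {
    public final int partition;
    public final long offset;
    public final byte[] key;
    public final byte[] value;

    public RawMessage(int partition, long offset, byte[] key, byte[] value) {
      this.partition = partition;
      this.offset = offset;
      this.key = key;
      this.value = value;
    }
  }

  /** One unit of work: a single partition and the offset range [startOffset, endOffset). */
  public static final class PartitionSplit {
    public final String topic;
    public final int partition;
    public final long startOffset;
    public final long endOffset;

    public PartitionSplit(String topic, int partition, long startOffset, long endOffset) {
      this.topic = topic;
      this.partition = partition;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /** Stand-in for a pub-sub consumer; assumption: fetches exactly one message per offset. */
  public interface PartitionSource {
    RawMessage fetch(String topic, int partition, long offset);
  }

  /** Plans one split per partition, ordered by partition id, so each worker reads one partition. */
  public static List<PartitionSplit> planSplits(String topic, Map<Integer, Long> endOffsets) {
    List<PartitionSplit> splits = new ArrayList<>();
    for (Map.Entry<Integer, Long> e : new TreeMap<>(endOffsets).entrySet()) {
      splits.add(new PartitionSplit(topic, e.getKey(), 0L, e.getValue()));
    }
    return splits;
  }

  /** Reads one split and emits one raw row per message: partition, offset, key bytes, value bytes. */
  public static List<Object[]> readSplit(PartitionSplit split, PartitionSource source) {
    List<Object[]> rows = new ArrayList<>();
    for (long offset = split.startOffset; offset < split.endOffset; offset++) {
      RawMessage m = source.fetch(split.topic, split.partition, offset);
      rows.add(new Object[] {m.partition, m.offset, m.key, m.value});
    }
    return rows;
  }

  public static void main(String[] args) {
    Map<Integer, Long> endOffsets = new HashMap<>();
    endOffsets.put(0, 2L);
    endOffsets.put(1, 3L);
    // Fake source that fabricates key/value bytes from the offset, for demonstration only.
    PartitionSource fake = (topic, partition, offset) -> new RawMessage(
        partition,
        offset,
        ("k" + offset).getBytes(StandardCharsets.UTF_8),
        ("v" + offset).getBytes(StandardCharsets.UTF_8));
    for (PartitionSplit split : planSplits("example_topic", endOffsets)) {
      System.out.println("partition " + split.partition + ": " + readSplit(split, fake).size() + " rows");
    }
  }
}
```

Keeping the rows as undecoded key and value bytes mirrors the stated goal of materializing raw messages first and leaving chunk assembly and metadata handling to later stages.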

Code changes

Added new code behind a config. If so, list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.


eldernewborn added 7 commits May 14, 2025 15:39

Added the Spark table definition for Raw Kafka input.

14d8584

RawPubSubInput components

7ae44e1

basic input partition class.

52f44ab

Cleanup and implementation of scanBuilder based on basic per partition input task.

74c6881

bulk add of basic consumer and raw partition reader.

5be1c6b

added notes.

afa09e8

started incorporating new pubsub paradigm into partition reader.

6f9a00d
