[cc][dvc][test] Add support for SpecificRecord deserialization in DVRT, and add isCaughtUp API for DVRT CDC (#1790)
Problem Statement
1. Currently, DVRT only supports deserializing keys and values into Avro GenericRecords. For complex schemas, this can lead to poor deserialization performance. Additionally, the regular CDC client supports deserializing values into SpecificRecords, but DVRT CDC doesn't.
2. In DVRT CDC, there is no way for the user to tell whether the client has caught up.
3. During the start phase of DVRT CDC, subscription to the DVC can die silently.
4. We allow users to call start multiple times in DVRT CDC.
5. The javadoc for BootstrappingVeniceChangelogConsumer hasn't been updated since the introduction of DVRT CDC.
6. DaVinciClientRecordTransformerTest::testRecordTransformer has been failing consistently in the CI.
Solution
1. Add support for SpecificRecord deserialization of keys and values in DVRT to improve performance, so that DVRT CDC can benefit from it as well. Additionally, I've moved deserialization and serialization to FastAvro, as it can perform 90% better when deserializing complex schemas. Please note that the regular CDC client doesn't support SpecificRecord for keys, but we are adding it for DVRT CDC since a user has requested it.
2. To provide context to the user on whether they're caught up in DVRT CDC, the isCaughtUp API is added to the BootstrappingVeniceChangelogConsumer interface. Since the original BootstrappingVeniceChangelogConsumer implementation is extended from VeniceAfterImageConsumerImpl, it already supports isCaughtUp.
3. To prevent the subscription to DVC in DVRT CDC from dying silently, I reorganized the futures: if the subscription to DVC fails, the future returned to the user is now completed exceptionally.
4. If a user calls start multiple times on DVRT CDC, we now throw an exception. Additionally, if a user passes an empty set to start, we subscribe to all partitions. I've also made start synchronized.
5. Updated the javadoc for BootstrappingVeniceChangelogConsumer, explaining how it behaves differently compared to the regular CDC client.
6. To make DaVinciClientRecordTransformerTest::testRecordTransformer pass consistently, we need to ensure that this test runs first.
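Solution items 3 and 4 can be illustrated with a minimal, self-contained sketch. This is not the code from this change: `StartOnceSketch`, its `subscribe` function, and `subscribedPartitions()` are hypothetical names introduced only for illustration. The sketch also simplifies when the returned future completes (here, as soon as the stand-in subscription completes). Only the three behaviors come from the description above: a second call to start throws, an empty set means all partitions, and a failed subscription completes the returned future exceptionally.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Hypothetical sketch; names and structure are assumptions, not the Venice implementation.
class StartOnceSketch {
  private final AtomicBoolean started = new AtomicBoolean(false);
  private final int partitionCount;
  // Stand-in for the subscription call to the underlying client (assumption).
  private final Function<Set<Integer>, CompletableFuture<Void>> subscribe;
  private Set<Integer> subscribedPartitions = Collections.emptySet();

  StartOnceSketch(int partitionCount, Function<Set<Integer>, CompletableFuture<Void>> subscribe) {
    this.partitionCount = partitionCount;
    this.subscribe = subscribe;
  }

  synchronized CompletableFuture<Void> start(Set<Integer> partitions) {
    // Reject any call after the first one.
    if (!started.compareAndSet(false, true)) {
      throw new IllegalStateException("start() has already been called");
    }
    // An empty set is interpreted as "subscribe to every partition".
    Set<Integer> target = new TreeSet<>(partitions);
    if (target.isEmpty()) {
      for (int p = 0; p < partitionCount; p++) {
        target.add(p);
      }
    }
    subscribedPartitions = Collections.unmodifiableSet(target);

    // Propagate a subscription failure to the caller instead of losing it.
    CompletableFuture<Void> result = new CompletableFuture<>();
    subscribe.apply(target).whenComplete((ignored, error) -> {
      if (error != null) {
        result.completeExceptionally(error);
      } else {
        result.complete(null);
      }
    });
    return result;
  }

  synchronized Set<Integer> subscribedPartitions() {
    return subscribedPartitions;
  }
}
```

With this shape, a caller that ignores the returned future still cannot start twice, and a caller that does inspect the future sees the subscription error rather than waiting on a future that never completes.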
clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/BootstrappingVeniceChangelogConsumer.java
Lines changed: 19 additions & 0 deletions
@@ -29,6 +29,13 @@ public interface BootstrappingVeniceChangelogConsumer<K, V> {
    * NOTE: This future may take some time to complete depending on how much data needs to be ingested in order to catch
    * up with the time that this client started.
    *
+   * NOTE: In the experimental client, the future will complete when there is at least one message to be polled.
+   * We don't wait for all partitions to catch up, as loading every message into a buffer will result in an
+   * Out Of Memory error. Instead, use the {@link #isCaughtUp()} method to determine once all subscribed partitions have
+   * caught up.
+   *
+   * NOTE: In the experimental client, if you pass in an empty set, it will subscribe to all partitions for the store
+   *
    * @param partitions which partition id's to catch up with
    * @return a future that completes once catch up is complete for all passed in partitions.
    */
@@ -41,9 +48,21 @@ public interface BootstrappingVeniceChangelogConsumer<K, V> {
   /**
    * polls for the next batch of change events. The first records returned following calling 'start()' will be from the bootstrap state.
    * Once this state is consumed, subsequent calls to poll will be based off of recent updates to the Venice store.
+   *
+   * In the experimental client, records will be returned in batches configured to the MAX_BUFFER_SIZE. So the initial
+   * calls to poll will be from records from the bootstrap state, until the partitions have caught up.
+   * Additionally, if the buffer hits the MAX_BUFFER_SIZE before the timeout is hit, poll will return immediately.
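The poll behavior described in the javadoc above (return as soon as the buffer reaches its maximum size, otherwise wait for the timeout) can be sketched with a lock and a condition. This is a hedged illustration only: `BufferedPollSketch`, `add`, and `bufferFull` are hypothetical names, the producer here never blocks, and the actual implementation in this change may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of "poll returns early when the buffer is full"; not the Venice implementation.
class BufferedPollSketch<T> {
  private final int maxBufferSize;
  private final List<T> buffer = new ArrayList<>();
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition bufferFull = lock.newCondition();

  BufferedPollSketch(int maxBufferSize) {
    this.maxBufferSize = maxBufferSize;
  }

  // Called by the ingestion side; wakes up a waiting poll once the buffer is full.
  void add(T record) {
    lock.lock();
    try {
      buffer.add(record);
      if (buffer.size() >= maxBufferSize) {
        bufferFull.signalAll();
      }
    } finally {
      lock.unlock();
    }
  }

  // Waits until the buffer is full or the timeout elapses, then returns at most maxBufferSize records.
  List<T> poll(long timeoutMs) {
    lock.lock();
    try {
      long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      while (buffer.size() < maxBufferSize && remainingNanos > 0) {
        try {
          remainingNanos = bufferFull.awaitNanos(remainingNanos);
        } catch (InterruptedException e) {
          // Preserve the interrupt flag and return whatever is buffered so far.
          Thread.currentThread().interrupt();
          break;
        }
      }
      int batchSize = Math.min(buffer.size(), maxBufferSize);
      List<T> batch = new ArrayList<>(buffer.subList(0, batchSize));
      buffer.subList(0, batchSize).clear();
      return batch;
    } finally {
      lock.unlock();
    }
  }
}
```

Bounding each returned batch keeps a single poll call from handing back an arbitrarily large list during bootstrap, which matches the memory concern raised in the javadoc for start.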
clients/da-vinci-client/src/main/java/com/linkedin/davinci/consumer/BootstrappingVeniceChangelogConsumerDaVinciRecordTransformerImpl.java
Lines changed: 51 additions & 34 deletions
@@ -32,6 +32,7 @@
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collection;
+import java.util.Collections;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.LinkedHashMap;
@@ -46,6 +47,7 @@
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.locks.Condition;
 import java.util.concurrent.locks.ReentrantLock;
 import org.apache.avro.Schema;
@@ -65,12 +67,12 @@ public class BootstrappingVeniceChangelogConsumerDaVinciRecordTransformerImpl<K,
   // A buffer of messages that will be returned to the user