-
Notifications
You must be signed in to change notification settings - Fork 97
[vpj] Push Job Timeout #1786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[vpj] Push Job Timeout #1786
Conversation
…being applied to the start of the push job from now on. 🐔
…meout behavior. 🐏
69bde8d
to
e02f96c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments in tests. But main code change LGTM
clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/VenicePushJobTest.java
Outdated
Show resolved
Hide resolved
clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/VenicePushJobTest.java
Outdated
Show resolved
Hide resolved
clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/VenicePushJobTest.java
Outdated
Show resolved
Hide resolved
clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/VenicePushJobTest.java
Outdated
Show resolved
Hide resolved
clients/venice-push-job/src/test/java/com/linkedin/venice/hadoop/VenicePushJobTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @KaiSernLim!
…lying on `storeResponse`. 🥸
…ad of relying on `storeResponse`. 🥸" This reverts commit eac4a66.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the dilligence @KaiSernLim!
Venice Push Job Diagram as a refresher:

Problem Statement
There is no timeout on the Compute Engine
DataWriterComputeJob
. This leads to some push jobs taking days and even sometimes weeks to complete, which significantly delays the debugging process. There is an existing timeout (bootstrapToOnlineTimeoutInHours
) on the job status polling, but that takes place after the data writer job portion (which is where we've been witnessing significant delays). With the increased frequency of KIF repushes, this issue will become even more pronounced.Solution
There should be a timeout on the entire push job as a whole. If exceeded, the push job should be cancelled. The existing configuration
bootstrapToOnlineTimeoutInHours
can be repurposed for this, and this should've be the original purpose of the configuration.Concurrency-Specific Checks
Both reviewer and PR author to verify
ConcurrentHashMap
,CopyOnWriteArrayList
).How was this PR tested?
testPushJobTimeout()
which tests the poll job status gets cancelled.testDataWriterComputeJobTimeout()
which tests the data writer job gets killed.testPushJobPollStatus()
andtestPushJobUnknownPollStatusDoesWaiting()
, which pertained to the old usage of thebootstrapToOnlineTimeoutInHours
configuration.Does this PR introduce any user-facing or breaking changes?