[ENH]: Dead letter queuing for compaction jobs #5023
Conversation
@@ -205,6 +205,7 @@ impl ChromaError for CompactionError {

 #[derive(Debug)]
 pub struct CompactionResponse {
+    #[allow(dead_code)]
Not using this in `CompactionManager` anymore as I pull it out into `CompactionTaskCompletion` so I can also associate collection ids with compaction errors. I'm thinking we still keep this around for debugging?
Add Dead Letter Queuing for Compaction Jobs with Failure Tracking

This PR introduces a dead letter queue for the compaction scheduler. Collections whose compaction jobs fail more than a configurable number of times (`max_failure_count`) are moved to a dead set and excluded from further compaction.

Key Changes
• Added failing_jobs HashMap and dead_jobs HashSet to track failure counts and dead collections in the Scheduler.

Affected Areas
• rust/worker/src/compactor/scheduler.rs

This summary was automatically generated by @propel-code-bot
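The failure-tracking flow summarized above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `DeadLetterTracker`, `fail_collection`, and `is_dead` are hypothetical names (only `succeed_collection` appears in the diff), and the exact off-by-one threshold semantics are an assumption.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the scheduler's dead-letter state; names and
// threshold semantics are assumptions, not the PR's exact implementation.
struct DeadLetterTracker {
    max_failure_count: u32,
    failing_jobs: HashMap<String, u32>, // collection id -> consecutive failure count
    dead_jobs: HashSet<String>,         // collections excluded from scheduling
}

impl DeadLetterTracker {
    fn new(max_failure_count: u32) -> Self {
        Self {
            max_failure_count,
            failing_jobs: HashMap::new(),
            dead_jobs: HashSet::new(),
        }
    }

    // Record a failed compaction; once the consecutive failure count
    // reaches the threshold, move the collection into the dead set.
    fn fail_collection(&mut self, collection_id: &str) {
        let count = self.failing_jobs.entry(collection_id.to_string()).or_insert(0);
        *count += 1;
        if *count >= self.max_failure_count {
            self.failing_jobs.remove(collection_id);
            self.dead_jobs.insert(collection_id.to_string());
        }
    }

    // A successful compaction clears any accumulated failure count.
    fn succeed_collection(&mut self, collection_id: &str) {
        self.failing_jobs.remove(collection_id);
    }

    // Dead collections are skipped when scheduling compaction jobs.
    fn is_dead(&self, collection_id: &str) -> bool {
        self.dead_jobs.contains(collection_id)
    }
}

fn main() {
    let mut tracker = DeadLetterTracker::new(3);
    tracker.fail_collection("collection-a");
    tracker.fail_collection("collection-a");
    tracker.succeed_collection("collection-a"); // success resets the count
    tracker.fail_collection("collection-a");
    assert!(!tracker.is_dead("collection-a"));
    tracker.fail_collection("collection-a");
    tracker.fail_collection("collection-a"); // third consecutive failure
    assert!(tracker.is_dead("collection-a"));
    println!("dead jobs: {}", tracker.dead_jobs.len());
}
```

Note that, as described in the PR, nothing removes an entry from the dead set short of a process restart, which is why `succeed_collection` only clears the failure counter.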
@@ -541,7 +646,7 @@ mod tests {
 let jobs = jobs.collect::<Vec<&CompactionJob>>();
 assert_eq!(jobs.len(), 1);
 assert_eq!(jobs[0].collection_id, collection_uuid_2,);
-scheduler.complete_collection(collection_uuid_2);
+scheduler.succeed_collection(collection_uuid_2);
nit - maybe `mark_collection_as_succeeded` style naming is a bit less clunky
Will you be adding the dashboards / triggers for the killed jobs count?
@@ -49,14 +49,19 @@ use tracing::Instrument;
 use tracing::Span;
 use uuid::Uuid;

-type BoxedFuture =
-    Pin<Box<dyn Future<Output = Result<CompactionResponse, Box<dyn ChromaError>>> + Send>>;
+type CompactionOutput = Result<CompactionResponse, Box<dyn ChromaError>>;
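The cleanup above splits one deeply nested alias into two: the result type gets its own name, and the boxed-future alias composes on top of it. A self-contained sketch of the pattern, using stub types in place of the worker crate's real `CompactionResponse` and `ChromaError`:

```rust
use std::future::Future;
use std::pin::Pin;

// Stand-ins for illustration; the real types live in the worker crate.
#[derive(Debug)]
struct CompactionResponse;
#[derive(Debug)]
struct ChromaErrorStub;

// Before: one nested alias.
// type BoxedFuture =
//     Pin<Box<dyn Future<Output = Result<CompactionResponse, Box<ChromaErrorStub>>> + Send>>;

// After: naming the result type makes signatures easier to read and reuse.
type CompactionOutput = Result<CompactionResponse, Box<ChromaErrorStub>>;
type BoxedFuture = Pin<Box<dyn Future<Output = CompactionOutput> + Send>>;

fn make_job() -> BoxedFuture {
    // An already-resolved future, just to show the aliases compose.
    Box::pin(std::future::ready(Ok(CompactionResponse)))
}

fn main() {
    let _job: BoxedFuture = make_job();
    let out: CompactionOutput = Ok(CompactionResponse);
    assert!(out.is_ok());
    println!("aliases compose");
}
```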
good cleanup ty
Description of changes

This change adds a dead letter queueing system to the compaction scheduler. If a compaction job on a collection fails `max_failure_count` times, the collection is moved to a dead set that disables compaction for it while it remains in the set. As of this change, the only way to clear this set is by restarting the compaction process. This change also adds a metric, `compactor_dead_jobs_count`, to track the size of the dead jobs set.

Test plan
Added a test in scheduler.rs.
Also manually tested by injecting failures in certain compaction jobs and tracking the dead set size metric locally.
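The dead set size metric mentioned above has to stay in sync with the set as collections are killed. A minimal sketch of that bookkeeping, assuming a plain atomic counter in place of the worker's real metrics stack (`DEAD_JOBS_COUNT` and `mark_dead` are hypothetical names):

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the `compactor_dead_jobs_count` gauge; the PR
// reports this through the worker's metrics stack, not a bare atomic.
static DEAD_JOBS_COUNT: AtomicU64 = AtomicU64::new(0);

// Move a collection into the dead set and refresh the gauge.
fn mark_dead(dead_jobs: &mut HashSet<String>, collection_id: &str) {
    if dead_jobs.insert(collection_id.to_string()) {
        // Only update the gauge when a genuinely new entry was added.
        DEAD_JOBS_COUNT.store(dead_jobs.len() as u64, Ordering::Relaxed);
    }
}

fn main() {
    let mut dead = HashSet::new();
    mark_dead(&mut dead, "collection-a");
    mark_dead(&mut dead, "collection-a"); // duplicate: set and gauge unchanged
    mark_dead(&mut dead, "collection-b");
    assert_eq!(dead.len(), 2);
    println!("dead jobs gauge: {}", DEAD_JOBS_COUNT.load(Ordering::Relaxed));
}
```

Injecting failures and watching this value locally, as the test plan describes, is a cheap way to confirm the gauge tracks the set.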
Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust.

Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?