
Commit dbdd723

Retry regressions in workers automatically
This is easier to implement in the workers than in the agent, because we don't need to keep track of the run count or concatenate logs within the database. It does mean slightly weaker fault isolation between runs (e.g. if one machine is having a bad day or is low on memory), but this is still an improvement.

It also means the time taken per regression increases, though an exact factor is hard to work out (we never retry baseline builds, and we only retry the first failed step...). It is somewhere under 5x. In practice we expect most runs to have very few regressions, on the order of 1,000 failed steps out of ~500,000 total steps executed, i.e. ~0.2%, so even a large increase here is likely to be acceptable.

If there is a large number of regressions, a single crater run may become really slow -- but such a run is probably not very useful anyway, so we may want some early-exit clause or confirmation step in that case. For now, not too worried about that case.
1 parent 8276aa2 commit dbdd723
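As a rough sanity check on the arithmetic above, here is a small worked example. The figures are the estimates quoted in this message (not measurements), and the snippet is only an illustrative sketch:

fn main() {
    // Figures quoted in the commit message above -- estimates, not measurements.
    let total_steps: f64 = 500_000.0;
    let failed_steps: f64 = 1_000.0;
    let max_attempts: f64 = 5.0;

    // Fraction of steps that regress: ~0.2%.
    let regression_rate = failed_steps / total_steps;

    // Worst case: every regressed step runs `max_attempts` times instead of once,
    // contributing (max_attempts - 1) extra executions each.
    let extra_steps = failed_steps * (max_attempts - 1.0);
    let overhead = extra_steps / total_steps;

    println!("regression rate:       {:.2}%", regression_rate * 100.0); // ~0.20%
    println!("worst-case extra work: {:.2}%", overhead * 100.0);        // ~0.80%
}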

File tree

1 file changed: +34 -1 lines changed

src/runner/worker.rs

Lines changed: 34 additions & 1 deletion
@@ -173,7 +173,40 @@ impl<'a, DB: WriteResults + Sync> Worker<'a, DB> {
         let mut result = Ok(());
         for task in tasks {
             if result.is_ok() {
-                result = self.run_task(&task);
+                let max_attempts = 5;
+                for run in 1..=max_attempts {
+                    result = self.run_task(&task);
+
+                    // We retry tasks failing on the second toolchain (i.e., regressions). In
+                    // the future we might expand this list further but for now this helps
+                    // prevent spurious test failures and such.
+                    //
+                    // For now we make no distinction between build failures and test failures
+                    // here, but that may change if this proves too slow.
+                    let mut should_retry = false;
+                    if result.is_err() && self.ex.toolchains.len() == 2 {
+                        let toolchain = match &task.step {
+                            TaskStep::Prepare | TaskStep::Cleanup => None,
+                            TaskStep::Skip { tc }
+                            | TaskStep::BuildAndTest { tc, .. }
+                            | TaskStep::BuildOnly { tc, .. }
+                            | TaskStep::CheckOnly { tc, .. }
+                            | TaskStep::Clippy { tc, .. }
+                            | TaskStep::Rustdoc { tc, .. }
+                            | TaskStep::UnstableFeatures { tc } => Some(tc),
+                        };
+                        if let Some(toolchain) = toolchain {
+                            if toolchain == self.ex.toolchains.last().unwrap() {
+                                should_retry = true;
+                            }
+                        }
+                    }
+                    if !should_retry {
+                        break;
+                    }
+
+                    log::info!("Retrying task {:?} [{run}/{max_attempts}]", task);
+                }
             }
             if let Err((err, test_result)) = &result {
                 if let Err(e) = task.mark_as_failed(
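For readers skimming the diff, the shape of the retry loop is roughly the following. This is a self-contained sketch rather than crater's actual Worker code: Task, run_task, and the retryable flag are stand-ins for the real task type and for the "did this fail on the second toolchain" check.

// Minimal sketch of the bounded-retry control flow used in the diff above.
struct Task {
    retryable: bool,
}

fn run_task(task: &Task, attempt: u32) -> Result<(), String> {
    // Pretend the task fails spuriously twice and succeeds on the third attempt.
    let _ = task;
    if attempt < 3 {
        Err(format!("spurious failure on attempt {attempt}"))
    } else {
        Ok(())
    }
}

fn main() {
    let task = Task { retryable: true };
    let max_attempts = 5;
    let mut result = Ok(());

    for run in 1..=max_attempts {
        result = run_task(&task, run);

        // Only retry failures we consider retryable (in crater: failures against
        // the second, non-baseline toolchain, i.e. candidate regressions).
        let should_retry = result.is_err() && task.retryable;
        if !should_retry {
            break;
        }
        println!("Retrying task [{run}/{max_attempts}]");
    }

    // `result` holds the outcome of the last attempt.
    println!("final result: {result:?}");
}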
