Merge queue #330
Conversation
I wanted this PR to be smaller, e.g. not including the merge to master logic in it, but this wasn’t possible as it would leave the merge queue broken (it would have been left running in a loop of the same PR). Same thing with check runs. Also, I did have to make two commits where I backtracked:
I ran the merge queue many times (nearly 200 automated PRs, in separate batches) and somehow hit a new race condition or bug basically every time, which is why this took so long even though most of the code was written prior. I initially used the commit statuses API, but I kept consistently running into a situation where there were fewer workflows than checks (which triggered that early exit race condition). I believe this has something to do with GitHub Apps that have check suite write permission being counted as another check suite? I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time. Switching from commit statuses to the modern checks API fixed this. TBH, the checks API is the new and improved API with more granular and cooler features.
Thank you! It's clear that you spent a lot of time on this. I haven't yet thought through the full merge process (I think there are some race conditions that we will still need to handle), but I went through the individual commits and left a bunch of drive-by comments.
Could you please create a separate PR for the commit statuses first? That would also be useful for try builds, and it should be a self-contained piece of logic that we could land before the actual merge queue.
I also don't think that we need all the labels that you created. As far as I know, we currently only use "merged-by-bors" when the PR is actually merged, and "waiting-for-bors" when the PR is approved and waiting for an auto build to (start and) finish. Then after the auto build completes, the PR is either merged or marked as a failure, and "waiting-for-bors" is removed in both cases.
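For illustration, the label lifecycle described above could be modelled like this (a standalone sketch with assumed names, not the actual bors implementation):

```rust
// Hypothetical event type; bors's real label triggers may differ.
enum PrEvent {
    Approved,
    AutoBuildSucceeded,
    AutoBuildFailed,
}

/// Maps an event to the labels to add and remove, following the
/// lifecycle described in the comment above.
fn label_transition(event: PrEvent) -> (Vec<&'static str>, Vec<&'static str>) {
    match event {
        // Approved: the PR starts waiting for an auto build.
        PrEvent::Approved => (vec!["waiting-for-bors"], vec![]),
        // Merged: mark it as merged and stop waiting.
        PrEvent::AutoBuildSucceeded => (vec!["merged-by-bors"], vec!["waiting-for-bors"]),
        // Failed: just stop waiting.
        PrEvent::AutoBuildFailed => (vec![], vec!["waiting-for-bors"]),
    }
}
```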
src/database/operations.rs

```
@@ -969,3 +969,54 @@ pub(crate) async fn delete_auto_build(
    })
    .await
}

pub(crate) async fn get_merge_queue_prs(
```
Could you please add a comment on top of this function that explains what it does?
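For illustration, such a comment might look like this (hypothetical wording; it should be adjusted to match what the query actually selects):

```rust
/// Returns the pull requests that are currently eligible for the merge
/// queue in the given repository (e.g. open and approved, or with a
/// pending auto build), ordered by the queue's priority rules.
pub(crate) async fn get_merge_queue_prs(
```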
src/utils/sort_queue.rs

```rust
// 3. Compare approval status (approved PRs should come first)
.then_with(|| a.is_approved().cmp(&b.is_approved()).reverse())
// 4. Compare priority numbers (higher priority should come first)
.then_with(|| b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0)))
```
Shouldn't this have `.reverse()`?
From what I understand:

- `a.is_approved().cmp(&b.is_approved()).reverse()`: `is_approved()` returns a boolean, and `true.cmp(&false)` returns `Greater`, so the `.reverse()` makes approved PRs come before non-approved PRs.
- `b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0))`: it compares `b` to `a` (reversed order), making higher priority numbers come first.
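For illustration, a self-contained check of both orderings (a standalone sketch, not code from this PR):

```rust
use std::cmp::Ordering;

fn main() {
    // Approval: true.cmp(&false) is Greater, so .reverse() yields Less,
    // which sorts approved (true) entries before unapproved ones.
    assert_eq!(true.cmp(&false).reverse(), Ordering::Less);

    // Priority: with a = 5 and b = 1, comparing b to a yields Less, so
    // the higher-priority entry (a) sorts first; this is equivalent to
    // comparing a to b and calling .reverse().
    let (a, b): (u32, u32) = (5, 1);
    assert_eq!(b.cmp(&a), Ordering::Less);
    assert_eq!(b.cmp(&a), a.cmp(&b).reverse());
}
```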
I see, I didn't notice that `b` and `a` are switched here. Tbh, comparing `a` and `b` and using `reverse` would be more explicit, as all the other comparison "branches" use `a`, then `b`.
src/bors/merge_queue.rs

```
@@ -46,8 +61,98 @@ pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
            }
        }
    }

    // No build exists for this PR - start a new auto build.
    match start_auto_build(&repo, pr).await {
```
Note: we have to check a bunch of things here before we decide that the PR is mergeable.
`get_merge_queue_prs(...)` retrieves only the PRs eligible for merging, so by that point we should have a PR that is considered mergeable.
```
@@ -115,7 +125,44 @@ async fn start_auto_build(
    .await?
    {
        MergeResult::Success(merge_sha) => {
-           todo!("Deal with auto build merge success");
+           // 2. Push merge commit to `AUTO_BRANCH_NAME` where CI runs
+           client
```
Remark: could we unify this code with try builds? It does essentially the same thing. Perhaps in a future PR, if it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll do that in a future PR.
src/bors/merge_queue.rs
Outdated
client.post_comment(pr.number, comment).await?; | ||
|
||
// 6. Update label | ||
handle_label_trigger(repo, pr.number, LabelTrigger::AutoBuildSucceeded).await?; |
I don't think we use these labels nowadays. We only require `merged-by-bors` after the PR has actually been merged.
src/bors/merge_queue.rs
Outdated
} | ||
}; | ||
|
||
continue; |
Note: we have to handle the rollup race condition here somehow (doesn't need to happen in this PR). As the simplest option we could probably just ask GH about the state of the PR right before trying to merge it.
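For illustration, the simplest variant could look something like this (a sketch using octocrab's pulls API; the helper name and the exact conditions checked are assumptions):

```rust
use octocrab::Octocrab;

// Hypothetical helper: ask GitHub for the live PR state right before
// merging, so a PR that was already merged out-of-band (e.g. as part of
// a rollup) is skipped instead of being merged again.
async fn is_still_mergeable(
    gh: &Octocrab,
    owner: &str,
    repo: &str,
    number: u64,
) -> anyhow::Result<bool> {
    let pr = gh.pulls(owner, repo).get(number).await?;
    // `merged_at` is set once GitHub has merged the PR by any means.
    Ok(pr.merged_at.is_none() && pr.closed_at.is_none())
}
```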
src/bors/merge_queue.rs

```rust
    current_base_sha
);

let comment = auto_build_base_moved_comment(
```
Is this something that homu handles? I never saw this occur in practice 😆 Can we perhaps examine the returned error to check what exactly has failed on the GitHub side?
I realised this was dumb and reversed it in a future commit. At that point in my commits, it was possible for this to occur, but I fixed it later on and removed this error handling. Now, if an error occurs, we assume it's some error on GH's side.
src/bors/merge_queue.rs

```
@@ -34,6 +36,10 @@ enum MergeResult {
}

pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
    // Prevent concurrent merge queue processing.
    let lock = MERGE_QUEUE_LOCK.get_or_init(|| Mutex::new(()));
```
Do we actually need the lock for anything right now? It seems like if we made this function private and exposed just a future that runs a loop that reads from a channel, it should be impossible to invoke the merge queue concurrently, which would be much nicer than using an actual lock.
Hmm, I don't think we actually need it.
src/bors/handlers/workflow.rs

```
@@ -261,15 +261,31 @@ async fn complete_auto_build(
    };
    repo.client.post_comment(pr.number, comment).await?;

-   let (status, desc) = if !has_failure {
-       (StatusState::Success, "Build succeeded")
+   let (status, conclusion, desc) = if !has_failure {
```
How do commit statuses and check runs differ? I'm confused, because I thought that we only use commit statuses to show a different "icon" on the PR list page (green/yellow/red), and for nothing else.
Check runs are the newer version of commit statuses. They have richer UI options (e.g. if the run fails, you can point to a specific line). A check run also creates the icon, but is capable of more.
A check suite consists of check runs (e.g. build, test, lint, etc.). The GH API automatically creates a check suite when we create a check run.

I see, I didn't realize that you can programmatically create check runs and attach them to commits. So, is there actually anything useful to us that check runs provide? It seems like maybe the duration is extra, but apart from that?
The reason I'm a bit worried about this is that bors currently uses check suites and check runs to determine if CI was green for a given `Build`. This is more automated and magical than what was done in homu, where the user had to specify a concrete job name ("bors build finished"), and homu waited for that specific job to be green.
Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.
I quite liked the approach of commit statuses, where I thought that they don't affect any of our CI success detection and are only human-readable labels of what's going on with the PR, nothing less, nothing more. So I'm actually quite surprised that you had issues with commit statuses, but using check runs was fine. I'd like to understand more about what's going on there.
> but I kept consistently running into a situation where there were fewer workflows than checks

> I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time.

This sounds a bit worrying. If installed GH apps automatically add some check suites to each commit, then that makes me much less confident in the current approach of how we detect that a build is green. That being said, for try builds on rust-lang/rust, it seemed to be working fine, and that repo definitely has some other apps installed (IIRC). So I wonder what's going on here :)
Let's land the commit statuses/checks in a separate PR, so that we can discuss it there properly, and mainly also test it for try builds on r-l/r, to see what happens there.
> So, is there actually anything useful to us that check runs provide?

- Commit statuses are limited to just `error`, `failure`, `pending`, or `success`; check runs provide a few more conclusions: `cancelled`, `timed_out`, `action_required` (among others)
- Re-run via the GH UI (avoiding the git push workaround)
- A Checks tab where you can add what you want (example). I'm not sure if this will be useful (if you think it's useful we could use it).
> Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.

I thought the same, but in reality:

- GHA workflow runs → `WorkflowStarted` webhook to Bors
- GHA workflow completes → `WorkflowCompleted` webhook to Bors
- GHA creates a check suite → `CheckSuiteCompleted` webhook to Bors
- Bors creates its own check run (status update) → no webhook sent back

It looks like maybe GitHub prevents this recursive issue.
> That being said, for try builds on rust-lang/rust, it seemed to be working fine

I was able to reliably recreate this issue after a few PRs were auto-built and merged on multiple runs of the merge queue. Sometimes it happened after 2 PRs, sometimes after 10 PRs, etc.
I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely, though.
> Re-run via the GH UI (avoiding the git push workaround)

What exactly would this do? 🤔 We can't restart a failed/cancelled/timed out try build using the UI only, right?
> I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely, though.

We should have a timeout system that hopefully handles this, right? 😅
src/bors/merge_queue.rs
Outdated
Conflict, | ||
} | ||
|
||
pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> { |
Since it is really critical that the merge queue is only ever processed serially (not concurrently), I would move `consume_merge_queue` into this module and expose it via a function that starts the queue and returns a channel sender, so something like this:

```rust
pub fn start_merge_queue(
    ctx: Arc<BorsContext>,
) -> (mpsc::Sender<()>, impl Future<Output = anyhow::Result<()>>) {
    // Create the channel (assuming tokio::sync::mpsc).
    let (tx, mut rx) = mpsc::channel(1024);
    let fut = async move {
        while let Some(_) = rx.recv().await {
            merge_queue_tick(&ctx).await?;
        }
        Ok(())
    };
    (tx, fut)
}
```

and then drive the returned `Future` from the main async loop in `server.rs`.
So that we ensure that the merge queue is sequential and nothing else can call this method from elsewhere in bors. Let's keep the "tick" method separated; we can still call it manually in tests if needed, although I expect to mostly use integration tests that will trigger merge queue events through webhooks.
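For illustration, the wiring could then look like this (a sketch assuming a tokio runtime and the names above):

```rust
let (merge_queue_tx, merge_queue_fut) = start_merge_queue(ctx.clone());

// Drive the queue future alongside the other server tasks.
tokio::spawn(merge_queue_fut);

// Webhook handlers can only request a tick through the sender, which
// keeps queue processing strictly sequential.
merge_queue_tx.send(()).await?;
```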
Oh, that makes a lot more sense than what I did.
To reduce the size of this PR, I'm going to open a separate PR that only adds the merge queue skeleton - that is, the merge queue process and refresh event. I'll also update the docs in that PR about how the merge queue works.
The tests fail when run together but pass individually. I plan on adding more tests but will have to figure out that issue first.
This PR adds the complete merge queue process, from auto build to merge to master.

The merge queue works like so:

- `handle_merge_queue(...)` is called (the two most important)
- `handle_merge_queue(...)` sees: `try_complete_build(...)`, where we attempt to determine the build status and post relevant comments

Homu works in exactly the same way, with the only difference being that we use the GitHub checks API rather than the commit status API.
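For illustration, the per-PR decision inside one queue pass can be summarized like this (assumed names and states; a sketch, not the actual implementation):

```rust
// Simplified view of a PR's auto-build state during one queue pass.
enum AutoBuildState {
    None,    // no build yet -> start an auto build
    Pending, // CI running -> try_complete_build(...) determines status
    Success, // CI green -> merge to master
    Failure, // CI red -> report failure and move on
}

fn next_action(state: AutoBuildState) -> &'static str {
    match state {
        AutoBuildState::None => "start auto build",
        AutoBuildState::Pending => "determine build status and post relevant comments",
        AutoBuildState::Success => "merge to master",
        AutoBuildState::Failure => "report failure and remove from the queue",
    }
}
```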