Merge queue #330


Draft: wants to merge 4 commits into main from merge-queue-process

Conversation

Sakib25800
Contributor

This PR adds the complete merge queue process, from auto build to merging into master.

The merge queue works like so:

  • Every 30s handle_merge_queue(...) is called
  • It selects a PR that can be merged (based on a set of merge queue priority rules)
  • As part of the sort, PRs with successful auto builds come first and ones with pending auto builds second, followed by a number of other rules, but those two are the most important (a sketch of one queue tick is included below).
  • If handle_merge_queue(...) sees:
    • Successful auto build, it will merge it
    • Pending auto build, it will halt the queue until the auto build is done (to prevent simultaneous auto builds)
  • From the check suite completed webhook, we call try_complete_build(...), which determines the build status and posts the relevant comments

Homu works in exactly the same way; the only difference is that we use the GitHub Checks API rather than the commit status API.
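For illustration, here is a rough sketch of one queue tick following the rules above. The types and helpers (QueuedPr, merge_into_master, start_auto_build) are simplified placeholders standing in for the real database and GitHub plumbing, not the actual implementation:

enum AutoBuild {
    Success,
    Pending,
    None,
}

struct QueuedPr {
    number: u64,
    auto_build: AutoBuild,
}

async fn merge_queue_tick(queue: Vec<QueuedPr>) -> anyhow::Result<()> {
    // `queue` is assumed to already be sorted by the priority rules:
    // successful auto builds first, pending ones second, then the other rules.
    for pr in queue {
        match pr.auto_build {
            // A green auto build: merge the PR into master and keep going.
            AutoBuild::Success => merge_into_master(pr.number).await?,
            // A pending auto build halts the queue so that only one auto build
            // runs at a time.
            AutoBuild::Pending => break,
            // No build yet: start a new auto build and stop for this tick.
            AutoBuild::None => {
                start_auto_build(pr.number).await?;
                break;
            }
        }
    }
    Ok(())
}

// Hypothetical stand-ins for the real GitHub/database operations.
async fn merge_into_master(_pr_number: u64) -> anyhow::Result<()> { Ok(()) }
async fn start_auto_build(_pr_number: u64) -> anyhow::Result<()> { Ok(()) }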

@Sakib25800
Contributor Author

Sakib25800 commented Jun 29, 2025

I wanted this PR to be smaller, e.g. not including the merge-to-master logic, but this wasn't possible, as it would have left the merge queue broken (it would have been stuck in a loop on the same PR). The same goes for check runs.

Also, I did have to make two commits where I backtracked:

  • Switching from commit status -> check run
  • Dealing with outdated base branch error (I made some changes and realised this wasn't possible anymore)

I ran the merge queue many times (roughly 200 automated PRs, in separate batches) and hit a new race condition or bug basically every time, which is why this took so long even though most of the code was written beforehand.

I initially used the commit statuses API, but I kept consistently running into a situation where there were fewer workflows than checks (which triggered that early exit race condition). I believe this has something to do with GitHub Apps that have check suite write permission being counted as another check suite? I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time. Switching from commit statuses to the modern Checks API fixed this. TBH, the Checks API is the new and improved API, with more granular and cooler features.

@Sakib25800 changed the title from "Merge queue process" to "Merge queue" on Jun 30, 2025
@Kobzol (Member) left a comment

Thank you! It's clear that you spent a lot of time on this. I haven't yet thought about the full merge process completely (I think there are some race conditions that we will still need to handle), but I went through the individual commits and left a bunch of drive-by comments.

Could you please create a separate PR for the commit statuses first? That's also useful for try builds, and it should be a self-contained piece of logic that we could land before the actual merge queue.

I also don't think that we need all the labels that you created. As far as I know, we currently only use "merged-by-bors" when the PR is actually merged, and "waiting-for-bors" when the PR is approved and waiting for an auto build to (start and) finish. Then after the auto build completes, the PR is either merged or marked as a failure, and "waiting-for-bors" is removed in both cases.

@@ -969,3 +969,54 @@ pub(crate) async fn delete_auto_build(
})
.await
}

pub(crate) async fn get_merge_queue_prs(
Member

Could you please add a comment on top of this function that explains what it does?

// 3. Compare approval status (approved PRs should come first)
.then_with(|| a.is_approved().cmp(&b.is_approved()).reverse())
// 4. Compare priority numbers (higher priority should come first)
.then_with(|| b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0)))
Member

Shouldn't this have .reverse?

Contributor Author

From what I understand,

a.is_approved().cmp(&b.is_approved()).reverse()

  • is_approved() returns a boolean, and true.cmp(&false) returns Greater so the .reverse() makes approved PRs come before non-approved PRs

b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0))

  • It compares b to a (reversed order), making higher priority numbers come first (a quick standalone check of both forms is below)
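A quick standalone check (plain Rust, not project code) that both forms are equivalent and put the higher priority first:

use std::cmp::Ordering;

fn main() {
    // Hypothetical priorities for two queued PRs; `a` has the lower one.
    let (a_priority, b_priority) = (1_i64, 5_i64);

    // The committed form: compare b to a. Greater means `a` sorts after `b`,
    // so the higher-priority PR (`b`) ends up first in ascending order.
    assert_eq!(b_priority.cmp(&a_priority), Ordering::Greater);

    // The arguably more explicit form: compare a to b, then reverse.
    assert_eq!(a_priority.cmp(&b_priority).reverse(), Ordering::Greater);
}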

Member

I see, I didn't notice that b and a are switched here. Tbh, comparing a and b and using reverse would be more explicit, as all the other comparison "branches" use a, then b.

@@ -46,8 +61,98 @@ pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
}
}
}

// No build exists for this PR - start a new auto build.
match start_auto_build(&repo, pr).await {
Member

Note: we have to check a bunch of things here before we decide that the PR is mergeable.

Contributor Author

get_merge_queue_prs(...) retrieves only the PRs eligible for merging, so by that point we should have a PR that is considered mergeable.

@@ -115,7 +125,44 @@ async fn start_auto_build(
.await?
{
MergeResult::Success(merge_sha) => {
todo!("Deal with auto build merge success");
// 2. Push merge commit to `AUTO_BRANCH_NAME` where CI runs
client
Member

Remark: could we unify this code with try builds? It does essentially the same thing. Perhaps in a future PR, if it makes sense.

Contributor Author

Yeah, I'll do that in a future PR.

client.post_comment(pr.number, comment).await?;

// 6. Update label
handle_label_trigger(repo, pr.number, LabelTrigger::AutoBuildSucceeded).await?;
Member

I don't think we use these labels nowadays. We only require merged-by-bors after the PR has actually been merged.

}
};

continue;
Member

Note: we have to handle the rollup race condition here somehow (doesn't need to happen in this PR). As the simplest option we could probably just ask GH about the state of the PR right before trying to merge it.
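For reference, such a pre-merge re-check could look roughly like this; the field names follow the public pulls API, but the function and its parameters are just a sketch, not bors code:

use serde::Deserialize;

#[derive(Deserialize)]
struct PullRequestState {
    state: String, // "open" or "closed"
    head: CommitRef,
}

#[derive(Deserialize)]
struct CommitRef {
    sha: String,
}

async fn is_still_mergeable(
    client: &reqwest::Client,
    token: &str,
    owner: &str,
    repo: &str,
    number: u64,
    expected_head_sha: &str,
) -> anyhow::Result<bool> {
    let pr: PullRequestState = client
        .get(format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}"))
        .header("Accept", "application/vnd.github+json")
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "bors-sketch")
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    // The PR must still be open (it might have already landed in a rollup)
    // and its head must not have moved since the auto build started.
    Ok(pr.state == "open" && pr.head.sha == expected_head_sha)
}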

current_base_sha
);

let comment = auto_build_base_moved_comment(
Member

Is this something that homu handles? I never saw this occur in practice 😆 Can we perhaps examine the returned error to check what exactly has failed on the GitHub side?

Contributor Author

I realised this was dumb and reverted it in a later commit.

At that point in my commits, it was possible for this to occur, but I fixed it later on and removed this error handling.

Now, if an error occurs, we assume it's some error on GitHub's side.

@@ -34,6 +36,10 @@ enum MergeResult {
}

pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
// Prevent concurrent merge queue processing.
let lock = MERGE_QUEUE_LOCK.get_or_init(|| Mutex::new(()));
Member

Do we actually need the lock for anything right now? It seems like if we made this function private and exposed just a future that runs a loop that reads from a channel, it should be impossible to invoke the merge queue concurrently, which would be much nicer than using an actual lock.

Contributor Author

Hmm, I don't think we actually need it.

@@ -261,15 +261,31 @@ async fn complete_auto_build(
};
repo.client.post_comment(pr.number, comment).await?;

let (status, desc) = if !has_failure {
(StatusState::Success, "Build succeeded")
let (status, conclusion, desc) = if !has_failure {
Member

How do commit statuses and check runs differ? I'm confused, because I thought that we only use commit statuses to show a different "icon" on the PR list page (green/yellow/red), and for nothing else.

Contributor Author

Sakib25800 commented Jun 30, 2025

Check runs are the newer version of commit statuses. They have richer UI options (e.g. if the run fails, you can point to a specific line). A check run also creates the icon, but it is capable of more.

A check suite consists of check runs (e.g. build, test, lint, etc.). The GH API automatically creates a check suite when we create a check run.
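As a rough illustration (not the bors implementation), creating a check run is a single REST call; the client setup, token, check run name, and conclusion below are placeholders:

use serde_json::json;

async fn create_check_run(
    client: &reqwest::Client,
    token: &str,
    owner: &str,
    repo: &str,
    head_sha: &str,
) -> anyhow::Result<()> {
    // Check runs accept richer conclusions than commit statuses
    // (e.g. "cancelled", "timed_out", "action_required").
    let body = json!({
        "name": "bors auto build",
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "success",
    });
    client
        .post(format!("https://api.github.com/repos/{owner}/{repo}/check-runs"))
        .header("Accept", "application/vnd.github+json")
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "bors-sketch")
        .json(&body)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}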


Member

I see, I didn't realize that you can programmatically create check runs and attach them to commits. So, is there actually anything useful to us that check runs provide? It seems like maybe the duration is extra, but apart from that?

The reason I'm a bit worried about this is that bors currently uses check suites and check runs to determine if CI was green for a given Build. This is more automated and magical than what was done in homu, where the user had to specify a concrete job name ("bors build finished"), and homu was waiting for that specific job to be green.

Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.

I quite liked the approach of commit statuses, where I thought that they don't affect any of our CI success detection, and are only human readable labels of what's going on with the PR, nothing less, nothing more. So I'm actually quite surprised that you had issues with commit statuses, but using check runs was fine. I'd like to understand more what's going on there.

but I kept consistently running into a situation where there were fewer workflows than checks

I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time.

This sounds a bit worrying. If installed GH apps automatically add some check suites to each commit, then that makes me much less confident in the current approach of how we detect that a build is green. That being said, for try builds on rust-lang/rust, it seemed to be working fine, and that repo definitely has some other apps installed (IIRC). So I wonder what's going on here :)

Member

Let's land the commit statuses/checks in a separate PR, so that we can discuss it there properly, and mainly also test it for try builds on r-l/r, to see what happens there.

Contributor Author

So, is there actually anything useful to us that check runs provide?

  • Commit statuses are limited to just error, failure, pending, or success, while check runs provide a few more conclusions: cancelled, timed_out, action_required (among others)
  • Re-run via GH UI (avoiding the git push workaround)
  • A Checks tab where you can add whatever you want (example). I'm not sure if this will be useful (if you think it is, we could use it).

Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.

I thought the same, but in reality:

  1. GHA workflow runs → WorkflowStarted webhook to Bors
  2. GHA workflow completes → WorkflowCompleted webhook to Bors
  3. GHA creates a check suite → CheckSuiteCompleted webhook to Bors
  4. Bors creates its own check run (status update) → No webhook sent back

It looks like maybe GitHub prevents this recursive issue.
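A rough sketch of that dispatch, with illustrative event names rather than the real bors webhook types:

enum WebhookEvent {
    WorkflowStarted,
    WorkflowCompleted,
    CheckSuiteCompleted,
}

async fn handle_webhook(event: WebhookEvent) -> anyhow::Result<()> {
    match event {
        // Steps 1 and 2: bookkeeping for the running auto build (elided).
        WebhookEvent::WorkflowStarted | WebhookEvent::WorkflowCompleted => {}
        // Step 3: the completed check suite is what drives try_complete_build.
        WebhookEvent::CheckSuiteCompleted => try_complete_build().await?,
    }
    // Step 4: the check run that bors itself creates produces no webhook back
    // to bors, so this cannot recurse.
    Ok(())
}

async fn try_complete_build() -> anyhow::Result<()> {
    // Determine the overall build status and post the relevant comments.
    Ok(())
}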

That being said, for try builds on rust-lang/rust, it seemed to be working fine

I was able to reliably recreate this issue after a few PRs were auto built and merged on multiple runs of the merge queue. Sometimes it happened after 2 PRs, sometimes 10 PRs, etc...

I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely though.

Member

Re-run via GH UI (avoiding the git push workaround)

What exactly would this do? 🤔 We can't restart a failed/cancelled/timed out try build using the UI only, right?

Member

I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely though.

We should have a timeout system that hopefully handles this, right? 😅
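For reference, one possible shape of such a timeout check (the field names and the limit are assumptions, not existing bors code):

use std::time::{Duration, SystemTime};

// Assumed limit; the real value would live in configuration.
const AUTO_BUILD_TIMEOUT: Duration = Duration::from_secs(4 * 60 * 60);

struct PendingBuild {
    started_at: SystemTime,
}

// A periodic refresh job could cancel builds for which this returns true and
// unblock the merge queue.
fn is_timed_out(build: &PendingBuild) -> bool {
    build
        .started_at
        .elapsed()
        .map(|age| age > AUTO_BUILD_TIMEOUT)
        .unwrap_or(false)
}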

Conflict,
}

pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
Member

Since it is really critical that the merge queue is only ever processed serially (not concurrently), I would move consume_merge_queue into this module, and expose it via a function that starts the queue and returns a channel sender, so something like this:

pub fn start_merge_queue(ctx: Arc<BorsContext>) -> (mpsc::Sender<()>, impl Future<Output = anyhow::Result<()>>) {
    // create channel (e.g. a tokio mpsc; the message type and capacity are placeholders)
    let (tx, mut rx) = mpsc::channel::<()>(64);
    let fut = async move {
        while rx.recv().await.is_some() {
            merge_queue_tick(&ctx).await?;
        }
        Ok(())
    };
    (tx, fut)
}

and then drive the returned Future from the main async loop in server.rs.

That way we ensure that the merge queue is sequential and that nothing else can call this method from elsewhere in bors. Let's keep the "tick" method separate; we can still call it manually in tests if needed, although I expect to mostly use integration tests that will trigger merge queue events through webhooks.

Contributor Author

Oh, that makes a lot more sense than what I did.

@Sakib25800
Contributor Author

To reduce the size of this PR, I'm going to open a separate PR that only adds the merge queue skeleton - that is, the merge queue process and refresh event. I'll also update the docs in that PR to describe how the merge queue works.

- Post comment
- Update check run
- Modify labels
- Post comment
- Update check run
@Sakib25800 force-pushed the merge-queue-process branch from 9d08dbd to 7e61fb1 on July 14, 2025 13:16
@Sakib25800
Contributor Author

Sakib25800 commented Jul 14, 2025

The tests fail when run together but pass individually.

I plan on adding more tests but will have to figure out that issue first.
