Merge queue #330


Draft: wants to merge 4 commits into main from merge-queue-process

Conversation

Sakib25800
Contributor

This PR adds the complete merge queue process, from auto build to merging into master.

The merge queue works like so:

  • Every 30s handle_merge_queue(...) is called
  • It selects a PR that can be merged (based on a set of merge queue priority rules)
  • As part of the sort, PRs with successful auto builds come first and ones with pending auto builds second, followed by a number of other rules, but those two are the most important (a sketch of one queue tick is included below).
  • If handle_merge_queue(...) sees:
    • Successful auto build, it will merge it
    • Pending auto build, it will halt the queue until the auto build is done (to prevent simultaneous auto builds)
  • From the check suite completed webhook, we call try_complete_build(...), which determines the build status and posts the relevant comments

Homu works in exactly the same way; the only difference is that we use the GitHub Checks API rather than the commit status API.
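For illustration, here is a rough sketch of one queue tick following the rules above. The types and helpers (QueuedPr, merge_into_master, start_auto_build) are simplified placeholders standing in for the real database and GitHub plumbing, not the actual implementation:

enum AutoBuild {
    Success,
    Pending,
    None,
}

struct QueuedPr {
    number: u64,
    auto_build: AutoBuild,
}

async fn merge_queue_tick(queue: Vec<QueuedPr>) -> anyhow::Result<()> {
    // `queue` is assumed to already be sorted by the priority rules:
    // successful auto builds first, pending ones second, then the other rules.
    for pr in queue {
        match pr.auto_build {
            // A green auto build: merge the PR into master and keep going.
            AutoBuild::Success => merge_into_master(pr.number).await?,
            // A pending auto build halts the queue so that only one auto build
            // runs at a time.
            AutoBuild::Pending => break,
            // No build yet: start a new auto build and stop for this tick.
            AutoBuild::None => {
                start_auto_build(pr.number).await?;
                break;
            }
        }
    }
    Ok(())
}

// Hypothetical stand-ins for the real GitHub/database operations.
async fn merge_into_master(_pr_number: u64) -> anyhow::Result<()> { Ok(()) }
async fn start_auto_build(_pr_number: u64) -> anyhow::Result<()> { Ok(()) }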

@Sakib25800
Contributor Author

Sakib25800 commented Jun 29, 2025

I wanted this PR to be smaller, e.g. not including the merge-to-master logic, but this wasn't possible, as it would have left the merge queue broken (it would have been stuck in a loop on the same PR). The same goes for check runs.

Also, I did have to make two commits where I backtracked:

  • Switching from commit status -> check run
  • Dealing with outdated base branch error (I made some changes and realised this wasn't possible anymore)

I ran the merge queue many times (roughly 200 automated PRs, in separate batches) and hit a new race condition or bug basically every time, which is why this took so long even though most of the code was written beforehand.

I initially used the commit statuses API, but I kept consistently running into a situation where there were fewer workflows than checks (which triggered that early exit race condition). I believe this has something to do with GitHub Apps that have check suite write permission being counted as another check suite? I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time. Switching from commit statuses to the modern Checks API fixed this. TBH, the Checks API is the new and improved API, with more granular and cooler features.

@Sakib25800 changed the title from "Merge queue process" to "Merge queue" on Jun 30, 2025
@Kobzol (Member) left a comment

Thank you! It's clear that you spent a lot of time on this. I haven't yet thought about the full merge process completely (I think there are some race conditions that we will still need to handle), but I went through the individual commits and left a bunch of drive-by comments.

Could you please create a separate PR for the commit statuses first? That's also useful for try builds, and it should be a self-contained piece of logic that we could land before the actual merge queue.

I also don't think that we need all the labels that you created. As far as I know, we currently only use "merged-by-bors" when the PR is actually merged, and "waiting-for-bors" when the PR is approved and waiting for an auto build to (start and) finish. Then after the auto build completes, the PR is either merged or marked as a failure, and "waiting-for-bors" is removed in both cases.

@@ -969,3 +969,54 @@ pub(crate) async fn delete_auto_build(
})
.await
}

pub(crate) async fn get_merge_queue_prs(
Member

Could you please add a comment on top of this function that explains what it does?

// 3. Compare approval status (approved PRs should come first)
.then_with(|| a.is_approved().cmp(&b.is_approved()).reverse())
// 4. Compare priority numbers (higher priority should come first)
.then_with(|| b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0)))
Member

Shouldn't this have .reverse?

Contributor Author

From what I understand,

a.is_approved().cmp(&b.is_approved()).reverse()

  • is_approved() returns a boolean, and true.cmp(&false) returns Greater so the .reverse() makes approved PRs come before non-approved PRs

b.priority.unwrap_or(0).cmp(&a.priority.unwrap_or(0))

  • It compares b to a (reversed order), making higher priority numbers come first (a quick standalone check of both forms is below)
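A quick standalone check (plain Rust, not project code) that both forms are equivalent and put the higher priority first:

use std::cmp::Ordering;

fn main() {
    // Hypothetical priorities for two queued PRs; `a` has the lower one.
    let (a_priority, b_priority) = (1_i64, 5_i64);

    // The committed form: compare b to a. Greater means `a` sorts after `b`,
    // so the higher-priority PR (`b`) ends up first in ascending order.
    assert_eq!(b_priority.cmp(&a_priority), Ordering::Greater);

    // The arguably more explicit form: compare a to b, then reverse.
    assert_eq!(a_priority.cmp(&b_priority).reverse(), Ordering::Greater);
}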

Member

I see, I didn't notice that b and a are switched here. Tbh, comparing a and b and using reverse would be more explicit, as all the other comparison "branches" use a, then b.

@@ -46,8 +61,98 @@ pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
}
}
}

// No build exists for this PR - start a new auto build.
match start_auto_build(&repo, pr).await {
Member

Note: we have to check a bunch of things here before we decide that the PR is mergeable.

Contributor Author

get_merge_queue_prs(...) retrieves only the PRs eligible for merging, so by that point we should have a PR that is considered mergeable.

@@ -115,7 +125,44 @@ async fn start_auto_build(
.await?
{
MergeResult::Success(merge_sha) => {
todo!("Deal with auto build merge success");
// 2. Push merge commit to `AUTO_BRANCH_NAME` where CI runs
client
Member

Remark: could we unify this code with try builds? It does essentially the same thing. Perhaps in a future PR, if it makes sense.

Contributor Author

Yeah, I'll do that in a future PR.

client.post_comment(pr.number, comment).await?;

// 6. Update label
handle_label_trigger(repo, pr.number, LabelTrigger::AutoBuildSucceeded).await?;
Member

I don't think we use these labels nowadays. We only require merged-by-bors after the PR has actually been merged.

}
};

continue;
Member

Note: we have to handle the rollup race condition here somehow (doesn't need to happen in this PR). As the simplest option we could probably just ask GH about the state of the PR right before trying to merge it.
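For reference, such a pre-merge re-check could look roughly like this; the field names follow the public pulls API, but the function and its parameters are just a sketch, not bors code:

use serde::Deserialize;

#[derive(Deserialize)]
struct PullRequestState {
    state: String, // "open" or "closed"
    head: CommitRef,
}

#[derive(Deserialize)]
struct CommitRef {
    sha: String,
}

async fn is_still_mergeable(
    client: &reqwest::Client,
    token: &str,
    owner: &str,
    repo: &str,
    number: u64,
    expected_head_sha: &str,
) -> anyhow::Result<bool> {
    let pr: PullRequestState = client
        .get(format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}"))
        .header("Accept", "application/vnd.github+json")
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "bors-sketch")
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;
    // The PR must still be open (it might have already landed in a rollup)
    // and its head must not have moved since the auto build started.
    Ok(pr.state == "open" && pr.head.sha == expected_head_sha)
}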

current_base_sha
);

let comment = auto_build_base_moved_comment(
Member

Is this something that homu handles? I never saw this occur in practice 😆 Can we perhaps examine the returned error to check what exactly has failed on the GitHub side?

Contributor Author

I realised this was dumb and reverted it in a later commit.

At that point in my commits, it was possible for this to occur, but I fixed it later on and removed this error handling.

Now, if an error occurs, we assume it's some error on GitHub's side.

@@ -34,6 +36,10 @@ enum MergeResult {
}

pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
// Prevent concurrent merge queue processing.
let lock = MERGE_QUEUE_LOCK.get_or_init(|| Mutex::new(()));
Member

Do we actually need the lock for anything right now? It seems like if we made this function private and exposed just a future that runs a loop that reads from a channel, it should be impossible to invoke the merge queue concurrently, which would be much nicer than using an actual lock.

Contributor Author

Hmm, I don't think we actually need it.

@@ -261,15 +261,31 @@ async fn complete_auto_build(
};
repo.client.post_comment(pr.number, comment).await?;

let (status, desc) = if !has_failure {
(StatusState::Success, "Build succeeded")
let (status, conclusion, desc) = if !has_failure {
Member

How do commit statuses and check runs differ? I'm confused, because I thought that we only use commit statuses to show a different "icon" on the PR list page (green/yellow/red), and for nothing else.

Contributor Author

Sakib25800 commented Jun 30, 2025

Check runs are the newer version of commit statuses. They have richer UI options (e.g. if the run fails, you can point to a specific line). A check run also creates the icon, but it is capable of more.

A check suite consists of check runs (e.g. build, test, lint, etc.). The GH API automatically creates a check suite when we create a check run.
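As a rough illustration (not the bors implementation), creating a check run is a single REST call; the client setup, token, check run name, and conclusion below are placeholders:

use serde_json::json;

async fn create_check_run(
    client: &reqwest::Client,
    token: &str,
    owner: &str,
    repo: &str,
    head_sha: &str,
) -> anyhow::Result<()> {
    // Check runs accept richer conclusions than commit statuses
    // (e.g. "cancelled", "timed_out", "action_required").
    let body = json!({
        "name": "bors auto build",
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "success",
    });
    client
        .post(format!("https://api.github.com/repos/{owner}/{repo}/check-runs"))
        .header("Accept", "application/vnd.github+json")
        .header("Authorization", format!("Bearer {token}"))
        .header("User-Agent", "bors-sketch")
        .json(&body)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}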


Member

I see, I didn't realize that you can programmatically create check runs and attach them to commits. So, is there actually anything useful to us that check runs provide? It seems like maybe the duration is extra, but apart from that?

The reason I'm a bit worried about this is that bors currently uses check suites and check runs to determine if CI was green for a given Build. This is more automated and magical than what was done in homu, where the user had to specify a concrete job name ("bors build finished"), and homu was waiting for that specific job to be green.

Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.

I quite liked the approach of commit statuses, where I thought that they don't affect any of our CI success detection, and are only human readable labels of what's going on with the PR, nothing less, nothing more. So I'm actually quite surprised that you had issues with commit statuses, but using check runs was fine. I'd like to understand more what's going on there.

but I kept consistently running into a situation where there were fewer workflows than checks

I'm not actually sure about this, but in my test repo, I had five apps installed (unintentionally), and I saw five checks and the same number of workflows every time.

This sounds a bit worrying. If installed GH apps automatically add some check suites to each commit, then that makes me much less confident in the current approach of how we detect that a build is green. That being said, for try builds on rust-lang/rust, it seemed to be working fine, and that repo definitely has some other apps installed (IIRC). So I wonder what's going on here :)

Member

Let's land the commit statuses/checks in a separate PR, so that we can discuss it there properly, and mainly also test it for try builds on r-l/r, to see what happens there.

Contributor Author

So, is there actually anything useful to us that check runs provide?

  • Commit statuses are limited to just error, failure, pending, or success, while check runs provide a few more conclusions: cancelled, timed_out, action_required (among others)
  • Re-run via GH UI (avoiding the git push workaround)
  • A Checks tab where you can add whatever you want (example). I'm not sure if this will be useful (if you think it is, we could use it).

Here we would be creating additional check runs and check suites from within bors itself, which sounds a bit recursive and in theory prone to weird edge cases and errors.

I thought the same, but in reality:

  1. GHA workflow runs → WorkflowStarted webhook to Bors
  2. GHA workflow completes → WorkflowCompleted webhook to Bors
  3. GHA creates a check suite → CheckSuiteCompleted webhook to Bors
  4. Bors creates its own check run (status update) → No webhook sent back

It looks like maybe GitHub prevents this recursive issue.
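A rough sketch of that dispatch, with illustrative event names rather than the real bors webhook types:

enum WebhookEvent {
    WorkflowStarted,
    WorkflowCompleted,
    CheckSuiteCompleted,
}

async fn handle_webhook(event: WebhookEvent) -> anyhow::Result<()> {
    match event {
        // Steps 1 and 2: bookkeeping for the running auto build (elided).
        WebhookEvent::WorkflowStarted | WebhookEvent::WorkflowCompleted => {}
        // Step 3: the completed check suite is what drives try_complete_build.
        WebhookEvent::CheckSuiteCompleted => try_complete_build().await?,
    }
    // Step 4: the check run that bors itself creates produces no webhook back
    // to bors, so this cannot recurse.
    Ok(())
}

async fn try_complete_build() -> anyhow::Result<()> {
    // Determine the overall build status and post the relevant comments.
    Ok(())
}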

That being said, for try builds on rust-lang/rust, it seemed to be working fine

I was able to reliably recreate this issue after a few PRs were auto built and merged on multiple runs of the merge queue. Sometimes it happened after 2 PRs, sometimes 10 PRs, etc...

I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely though.

Member

Re-run via GH UI (avoiding the git push workaround)

What exactly would this do? 🤔 We can't restart a failed/cancelled/timed out try build using the UI only, right?

Member

I'll try to recreate the issue and show you the logs for it. The issue causes the merge queue to be blocked permanently with a pending build stuck (which is a fairly major issue). This issue could be something else entirely though.

We should have a timeout system that hopefully handles this, right? 😅
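For reference, one possible shape of such a timeout check (the field names and the limit are assumptions, not existing bors code):

use std::time::{Duration, SystemTime};

// Assumed limit; the real value would live in configuration.
const AUTO_BUILD_TIMEOUT: Duration = Duration::from_secs(4 * 60 * 60);

struct PendingBuild {
    started_at: SystemTime,
}

// A periodic refresh job could cancel builds for which this returns true and
// unblock the merge queue.
fn is_timed_out(build: &PendingBuild) -> bool {
    build
        .started_at
        .elapsed()
        .map(|age| age > AUTO_BUILD_TIMEOUT)
        .unwrap_or(false)
}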

Conflict,
}

pub async fn handle_merge_queue(ctx: Arc<BorsContext>) -> anyhow::Result<()> {
Member

Since it is really critical that the merge queue is only ever processed serially (not concurrently), I would move consume_merge_queue into this module, and expose it via a function that starts the queue and returns a channel sender, so something like this:

pub fn start_merge_queue(ctx: Arc<BorsContext>) -> (mpsc::Sender<()>, impl Future<Output = anyhow::Result<()>>) {
    // create channel (e.g. a tokio mpsc; the message type and capacity are placeholders)
    let (tx, mut rx) = mpsc::channel::<()>(64);
    let fut = async move {
        while rx.recv().await.is_some() {
            merge_queue_tick(&ctx).await?;
        }
        Ok(())
    };
    (tx, fut)
}

and then drive the returned Future from the main async loop in server.rs.

That way we ensure that the merge queue is sequential and that nothing else can call this method from elsewhere in bors. Let's keep the "tick" method separate; we can still call it manually in tests if needed, although I expect to mostly use integration tests that will trigger merge queue events through webhooks.

Contributor Author

Oh, that makes a lot more sense than what I did.

@Sakib25800
Contributor Author

To reduce the size of this PR, I'm going to open a separate PR that only adds the merge queue skeleton - that is, the merge queue process and refresh event. I'll also update the docs in that PR to describe how the merge queue works.

- Post comment
- Update check run
- Modify labels
- Post comment
- Update check run
@Sakib25800 force-pushed the merge-queue-process branch from 9d08dbd to 7e61fb1 on July 14, 2025 13:16
@Sakib25800
Contributor Author

Sakib25800 commented Jul 14, 2025

The tests fail when run together but pass individually.

I plan on adding more tests but will have to figure out that issue first.
