Skip to content

Check that there are no changes during SR.scan #6413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gthvn1
Copy link
Contributor

@gthvn1 gthvn1 commented Apr 8, 2025

Currently, we are only checking that no VDIs have been removed during the SR scan performed by the SM plugin. However, there are situations where a VDI has been added, and if this VDI is not present in the list obtained from SR.scan, it will be forgotten. The checks only prevent this in the case where the VDI was added during the scan. There is still a TOCTOU situation if the issue happens after the scan, and there is room for that.

@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 8, 2025

For information we had the issue in integration during a live migration. The destination VDI has been created and introduced in the database after the SR.scan returns the list of VDIs and before the returned list is used by XAPI. The VDI has been forgotten and then the live migration task tried to use it.... and so it failed with invalid UUID.

@last-genius last-genius requested a review from Vincent-lau April 8, 2025 17:14
@last-genius
Copy link
Contributor

We'd really appreciate if someone at XS could run some live migration xenrt suites with this fix. It worked just fine for us but I'm wondering if there's some possible situation where it'd cause frequent retries during live migration. Thank you!

let refs2 = List.map fst db_vdi2 in

(* VDIs that are in db_vdi1 but not in db_vdi2 *)
let vdis_removed = Listext.List.set_difference refs1 refs2 in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these lists short? set_difference is O(n^2). When nothing is added and nothing is removed, why is this not just set equality, i.e., sorted lists are equal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be thousands of VDIs in a system, definitely lists should not be used here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of set equality/sorted lists are equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the code to use Set instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just check for equality between two sets now instead of two set differences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact I wanted to add (and I just pushed it now) some debug messages. Because as Pau said there can be hundreds of VDIs so it can helps to show the ones that have been added or removed. And probably I can remove the debug message that write the list before and after. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes set equality looks good now. Because if during the scan a VDI has been added and removed sets are not empty but we don't need to start a new scan because the list of VDIs is ok. So yes I'm using equality now.

@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from 316c7a3 to c207cb4 Compare April 9, 2025 09:58
@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch 3 times, most recently from 6c88eb5 to 8b7a7f1 Compare April 9, 2025 11:24
@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from 8b7a7f1 to bfcb401 Compare April 9, 2025 12:51
|> String.concat ","
)
limit ;
debug "%s detected db change while scanning, retry limit left %d"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we report the VDIs that have been added or removed in vdis_ref_equal I think it is ok to remove the dump of the two lists (before and after).

@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from bfcb401 to ef461b2 Compare April 9, 2025 13:29
@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 9, 2025

So the last version reports the Refs of VDI that has been removed or added. I agree that with the Ref we probably be able in the logs to find the corresponding UUID that makes more sense for the user.

@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from ef461b2 to 1d7a5b7 Compare April 9, 2025 13:41
@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 9, 2025

I think it's starting to look pretty good 😅 . Thanks for the feedback!!!

@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 9, 2025

@Vincent-lau if you can have a quick look again because it has been modified since your review ;)

@gthvn1 gthvn1 requested a review from Vincent-lau April 9, 2025 13:44
@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from 1d7a5b7 to 7276654 Compare April 9, 2025 13:47
@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from 7276654 to eb9b01f Compare April 9, 2025 16:51
Currently, we are only checking that no VDIs have been removed during the
SR scan performed by the SM plugin. However, there are situations where
a VDI has been added, and if this VDI is not present in the list obtained
from SR.scan, it will be forgotten. The checks only prevent this in the
case where the VDI was added during the scan. There is still a TOCTOU
situation if the issue happens after the scan, and there is room for that.

Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
@gthvn1 gthvn1 force-pushed the gtn-check-all-changes-during-sr-scan branch from eb9b01f to c927979 Compare April 9, 2025 17:30
@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 9, 2025

I added a debug message when the diff is clean and thus when the update of VDIs can start.
I think it can help to debug issues because we know that if you find in logs another task that modifies the DB between this message and the end of the SR.scan task then we are in TOCTOU situation.

@gthvn1
Copy link
Contributor Author

gthvn1 commented Apr 11, 2025

Just for your information: this patch fixes the issue we had in CI. We've already tested it, but since the version was modified, I can confirm that this updated version behaves the same.

@psafont
Copy link
Member

psafont commented Apr 11, 2025

I've kicked Toolstack Smoke and Verification tests with a build with these changes: 215889

These contain both intra and cross pool migrations to exercise this codepath to ensure there aren't any obvious regressions

@psafont
Copy link
Member

psafont commented Apr 11, 2025

Intra-pool and cross-pool migration tests passed. There was an error elsewhere, but it's a testing issue.

@psafont psafont added this pull request to the merge queue Apr 14, 2025
Merged via the queue into xapi-project:master with commit f9a5a8e Apr 14, 2025
17 checks passed
@gthvn1 gthvn1 deleted the gtn-check-all-changes-during-sr-scan branch April 29, 2025 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants