Skip to content

Refactor SXM for SMAPIv1 code to fit the new code structure #6423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 17, 2025

Conversation

Vincent-lau
Copy link
Contributor

@Vincent-lau Vincent-lau commented Apr 13, 2025

This is a continuation of #6404, which completes the refactoring of SXM code for the new architecture.

I expect there to be no functional change, although there is a significant change of how error handling is done.

The last couple of commits contain the design for this PR.

To be continued...

@Vincent-lau Vincent-lau force-pushed the private/shul2/sxm-mux3 branch from 58aae78 to 45e0df2 Compare April 13, 2025 18:35
Comment on lines +153 to +175
Currently error handling was done by building up a list of cleanup functions in
the `on_fail` list ref as the function executes. For example, if the `receive_start`
has been completed successfully, add `receive_cancel` to the list of cleanup functions.
And whenever an exception is encountered, just execute whatever has been added
to the `on_fail` list ref. This is convenient, but does entangle all the error
handling logic with the core SXM logic itself, making the code rather than hard
to understand and maintain.

The idea to fix this is to introduce explicit "stages" during the SXM and define
explicitly what error handling should be done if it fails at a certain stage. This
helps separate the error handling logic into the `with` part of a `try with` block,
which is where they are supposed to be. Since we need to accommodate the existing
SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are
introduced: preparation (v1,v3), snapshot(v1), mirror(v1, v3), copy(v1). Note that
each stage also roughly corresponds to a helper function that is called within `MIRROR.start`,
which is the wrapper function that initiates storage migration. And each helper
functions themselves would also have error handling logic within themselves as
needed (e.g. see `Storage_smapiv1_migrate.receive_start) to deal with exceptions
that happen within each helper functions.
Copy link
Member

@psafont psafont Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this helps reviewing the changes (thanks!), I think there's no need to document how the old system works after the change is done :)

Copy link
Contributor Author

@Vincent-lau Vincent-lau Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was trying to document how the new error handling is done, but then I thought it would be difficult to understand the error handling at different stages in the SMAPIv1 SXM without understanding how they are splitting into multiple stages... So I wrote it anyway

@psafont
Copy link
Member

psafont commented Apr 14, 2025

I think Rob should review this, otherwise it looks like a very positive change

@Vincent-lau Vincent-lau force-pushed the private/shul2/sxm-mux3 branch from 45e0df2 to 6596419 Compare April 14, 2025 14:56
@Vincent-lau Vincent-lau force-pushed the private/shul2/sxm-mux3 branch from 6596419 to 615de98 Compare April 17, 2025 14:28
We define different failures for SXM, for different stages of the SXM:

- preparation failure: raised when preparing the environment for SXM,
  for example, when the dest host creates VDIs for data mirroring
    (SMAPIv1 and v3);
- mirror_fd_failure: happens when passing fds to tapdisks for mirroring
  (SMAPIv1 only);
- mirror_snapshot_failure: raised when taking a snapshot as the base image
  before copying it over to the destination (SMAPIv1 only);
- mirror_copy_failure: raised when copying of the base image fails;
- mirror_failure: raised when there is any issues that causes the mirror
  to crash during SXM (SMAPIv1 and v3)

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Although copying is not exclusively used by SMAPIv1, e.g.
DATA.copy maps to Storage_migrate.copy which maps to
Storage_smapiv1_migrate.copy. And this is used when a VDI does not need
to be migrated live (attached read-only). It's a bit easier in terms of
dependency to leave them in storage_smapiv1_migrate.ml

And some related helper functions such as `progress_callback`.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Because the latter needs to use the former.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
The original Storage_migrate.start function is quite fat and does lots
of things. For better readability and maintainability, split it into
multiple functions that represent different stages of the SXM:
- prepare: makes a remote call to the dest host and asks it to prepare
  for the VDIs for receiving data
- mirror_pass_fds: passes the fd to tapdisk to establish connection
  between two tapdisk processes (sending & receiving)
- mirror_snapshot: takes a snapshot of the VDI to be mirrored, because
  the tapdisk mirroring only mirrors data written after the mirror is
  initiated
- mirror_copy: copy the snapshot from the source to destination

As these operations are all specific to SMAPIv1, move them to
`storage_smapiv1_migrate` as well.

Also restructure the way clean up is done. As there are multiple points
at which the migration might fail, previously a `on_fail` list ref is
constructed and populated as we go. Now that we have explicitly defined
the different errors during the migration, we could handle these errors
accordingly.

The way to reason about cleanups is: we have three possible clean up
actions:

- receive_cancel: which will undo the preparation work done on the dest
  host. This is added once preparation is done;
- destroy snapshot: which will destroy the snapshot. This is called
  after the snapshot is created;
- stop: which will disable mirroring, delete snapshot and also do what
  receive_cancel does, which is basically clean up everything. And this
  will be called after the mirror has been started.

Note many of these functions are best-effort, for example, the stop
function won't delete a snapshot if it does not exist, so it's generally
safe to call them.

We put these clean up functions into the appropriate places, see the
actual arrangement of these functions in the code, they should be
self-explanatory.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Now that the split is done, we can implement the `send_start` function
defined earlier for SMAPIv1, by combining all the different stages into
one function and invoke it from `Storage_migrate.start`.

At this point the refacotring for SMAPIv1 should be finished, and there
should still be no functional change.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
@Vincent-lau Vincent-lau enabled auto-merge April 17, 2025 14:29
@Vincent-lau Vincent-lau added this pull request to the merge queue Apr 17, 2025
Merged via the queue into xapi-project:master with commit 99ee18b Apr 17, 2025
17 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Apr 24, 2025
Continuing on #6423

This PR changes the way `receive_start2` is called, and it can now
multiplex based on the source SR and destination SR type.

Note that `receive_start2` is introduced for inbound SXM for SMAPIv3,
which is flagged off by default, so no need to worry about backwards
compatability.

Also introduces a couple of new types for SMAPIv3 migration, but they
are uncurrently unused.

There should still be no functional change.

More to come...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants