Skip to content

Sync master to feature/configure-ssh-phase3 #6538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

LunfanZhang
Copy link
Contributor

Merge master to feature branch

edwintorok and others added 30 commits April 17, 2025 17:15
These shouldn't be necessary anymore, since no exception is raised.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We should only log_backtrace if we are the final handler.

The exception is raised here, so the caller will have a chance to log it.
This was also inconsistent: some *_interface logged the backtrace, and others didn't.

In theory there is a chance that the caller is buggy and doesn't log the correct backtrace.
But if we simplify the places that call the Backtrace module, we'll have fewer chances of that going wrong.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
response_internal_error already calls Backtrace.is_important in the correct place,
and logs the exception.
There is no need to do that a 2nd time in the caller.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This shows how brittle the current Backtrace API is,
this was missing from a lot of places.

We have some better alternatives (`with_backtraces`,
or a `try_with` function) that'd guarantee that `important` is
always called in the right place, but that would be a more invasive change,
which will be done in a followup commit.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
xapi-project#6432)

Windows PV drivers do not store their version information into
`drivers/{xeneventchn,xenvbd,xennet}` xenstore keys since 2015, see:

* PV drivers commit `784af16810d705ba2bc02bab6ac93b24119f886c (Publish
distribution information to xenstore)`
https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=commit;h=784af16810d705ba2bc02bab6ac93b24119f886c

* Xen commit `71e64e163b2dae7d08f7d77ee942749663f484d5 (docs: Introduce
xenstore paths for PV driver information)`
https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=71e64e163b2dae7d08f7d77ee942749663f484d5

Instead it is stored like this:
```
drivers/0 = "XenServer XENBUS 9.1.9.105 "
drivers/1 = "XenServer XENVBD 9.1.8.79 "
drivers/2 = "XenServer XENVIF 9.1.12.101 "
drivers/3 = "XenServer XENIFACE 9.1.10.87 "
drivers/4 = "XenServer XENNET 9.1.7.65 "
```

Modify `xapi_guest_agent` to list such entries under `drivers/` and
present version information for each driver.
This was not set properly when it was first introduced in xapi-24.15.0

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
…mportant part of the backtrace (xapi-project#6430)

Before/after can be seen clearly in the newly added unit test:
* previously line 8 was not printed anywhere, even though that is the
actual place the exception is raised
* the whole backtrace was printed on a single line, which can be quite
annoying

After:
* line 8 is present
* the correct backtrace printer is used that prints each stack frame on
a separate line with a counter

```diff --git a/ocaml/libs/log/test/log_test.t b/ocaml/libs/log/test/log_test.t
index f25ee70..fbdeebf 100644
--- a/ocaml/libs/log/test/log_test.t
+++ b/ocaml/libs/log/test/log_test.t
@@ -1,4 +1,9 @@
   $ ./log_test.exe | sed -re 's/[0-9]+T[0-9:.]+Z//'
-  [|debug||0 |main|log_test.ml] Raised at Xapi_stdext_pervasives__Pervasiveext.finally in file \"ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml\", line 39, characters 6-15\nCalled from Dune__exe__Log_test.(fun) in file \"ocaml/libs/log/test/log_test.ml\", line 15, characters 4-55\n
+  [|error||0 |main|backtrace] Raised Invalid_argument("index out of bounds")
+  [|error||0 |main|backtrace] 1/4 log_test.exe Raised at file ocaml/libs/log/test/log_test.ml, line 8
+  [|error||0 |main|backtrace] 2/4 log_test.exe Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
+  [|error||0 |main|backtrace] 3/4 log_test.exe Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
+  [|error||0 |main|backtrace] 4/4 log_test.exe Called from file ocaml/libs/log/test/log_test.ml, line 15
+  [|error||0 |main|backtrace]
```
The `expected_votes` field in corosync represents the number of hosts
that is expected by the cluster stack. In the context of corosync, this
is the same as the number of hosts as in the corosync.conf file*. This
is a useful field to expose to the user so that they can see how many
nodes actually are expected.

We also have `Cluster_host` object, which represents xapi's view of what
nodes should be in the cluster, but that might not be identical to
corosync's view, especially when a host is disabled, but is still left
in the list of Cluster_host objects.

Although one could argue that we could infer this `expected_votes` field
from the number of enabled Cluster_hosts, it might still be useful to
get this information directly from corosync.

*: there are ways in corosync to make one host cast multiple votes, but
that feature is not used.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
The `expected_votes` field in corosync represents the number of hosts
that is expected by the cluster stack. In the context of corosync, this
is the same as the number of hosts as in the corosync.conf file*. This
is a useful field to expose to the user so that they can see how many
nodes actually are expected.

We also have `Cluster_host` object, which represents xapi's view of what
nodes should be in the cluster, but that might not be identical to
corosync's view, especially when a host is disabled, but is still left
in the list of Cluster_host objects.

Although one could argue that we could infer this `expected_votes` field
from the number of enabled Cluster_hosts, it might still be useful to
get this information directly from corosync.

*: there are ways in corosync to make one host cast multiple votes, but
that feature is not used.
This function updates the snapshot related db fields after the storage
migration. There is no need to leave this in the storage layer as
xapi-storage-script will not be able to access xapi db.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Move this to storage_utils.ml since it is used by storage_smapiv1.ml and
storage_mux.ml

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Extract common logic on finding vdi_info given vdi, and also add a
parameter to specify where to find the VDI (locally or remotely).

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Move the update_snapshot_info_dest to storage mux as this function just
does db operations.

Also rescan the SR after updaing the content_id during SXM, so that the
latest content_id can be reflected in the returned vdi_info, which gets
used later on in `update_snapshot_info`
These will be used when mirroring a VDI that is on a SMAPIv3 SR.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Adding a few more states that will be tracked by the send_state and
receive_state. These will be used later on for SMAPIv3 outbound
migration.

We do need to give them a default value in case the deserialisation
of the on-disk copy does not have these values.

These are not used yet, so no functional change.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
These functions will not be exposed in storage_migrate.ml any more.
`receive_start` is kept in the mux for backwards compatability reasons,
and will go straight to the SMAPIv1 implemenetation.

`receive_start2` calls will not be directed to the storage_mux any more,
instead it will be multiplexed by the new layer in storage_migrate.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
Previously `MIRROR.receive_start2` is called as a remote function, i.e.
`Remote.DATA.MIRROR.receive_start2` and it will be forwarded to the
destination host and multiplexes based on the destination SR type.

This is inconvenient as what `receive_start2` should do is more
dependent on what the source SR type is. For example, if the source SR
is using SMAPIv1, then `receive_start2` needs to tell the destination
host to create snapshots VDIs, while this is not necessary if the source
SR type is of SMAPIv3. Hence instead of calling `Remote.receive_start2`,
call each individual functions inside `receive_start2` remotely, such as
`Remote.VDI.create`, and these SMAPIv2 functions will still be
multiplexed on the destiantion side, based on the destination SR type.
And this is indeed what we want, imagine a case where we are migrating
v1 -> v3, what we want is still create a snapshot VDI, but on the v3 SR.

This does mean that the state tracking, such as `State.add`, which was
previously tracked by the destination host, now needs to be tracked by
the source host. And this will affect a number of other `receive_`
functions such as `receive_finalize2` and `receive_cancel2`, which are
updated accordingly.

For backwards compatability reasons, we still need to preserve
`receive_start` which might still be called from an older host trying to
do a v1 -> v1 migration. And this is done by making sure that the
SMAPIv2 done inside `receive_start` are all local, while the
`receive_start` call itself is remote.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
…ct#6434)

Continuing on xapi-project#6423

This PR changes the way `receive_start2` is called, and it can now
multiplex based on the source SR and destination SR type.

Note that `receive_start2` is introduced for inbound SXM for SMAPIv3,
which is flagged off by default, so no need to worry about backwards
compatability.

Also introduces a couple of new types for SMAPIv3 migration, but they
are uncurrently unused.

There should still be no functional change.

More to come...
Since we are now supporting qcow file `tap-ctl list` can return strings like:
  - "1564848    0    0      qcow2 /var/run/sr-mount/..."
Without this patch the type "qcow2" is unknown and xcp-rrdd-iostat generates an
error like: returned a line that could not be parsed. Ignoring
This patch fixes the issue.

Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
Since XCP-ng is now supporting qcow files `tap-ctl list` can return
string like:
  - "1564848    0    0      qcow2 /var/run/sr-mount/..."
Without this patch the type "qcow2" is unknown and xcp-rrdd-iostat
generates an
error like: `returned a line that could not be parsed. Ignoring`

This patch fixes the issue by allowing qcow2 type.
…time reduction later (xapi-project#6362)

Split the plug and unplug atomics into separate attach/activate and
deactivate/detach atomics. This is gated by two separate flags and both
are switched off by default. When switched on, this should still be a
no-op by calling them in serial where plug/unplug are currently used,
but will allow us to later use these atomics separately elsewhere for
optimisation.
Opening this as a draft PR as plug/unplug is fundamental so the logic
needs to be watertight before we consider merging.
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
On IO and timeout errors do not remove the device, but try to reconnect.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
xapi-project#6183)

Activate old xapi_vdi.update_allowed_operations optimization:
get_internal_records_where does a linear scan currently, so operating on
N VDIs is O(N^2).

Look at the VBD records directly, like before this 2013 commit which
regressed it: 5097475

(We are going to optimize get_record separately so it doesn't go through
serialization)

This still requires validation and measurements (e.g. using @contificate
's benchmark)
…stent connections (xapi-project#6444)

This was found during a previous stress test.
Without a persistent connection the kernel might disconnect from the
backing qemu/tapdisk in Dom0 when I/O takes too long, and that prevents
all future I/O to that device from working, not just the one that had
the speed hickup.
Since our NBD connection is local, mark it as persistent, so that it
tries to reconnect on timeout.
last-genius and others added 27 commits June 12, 2025 10:42
The regex removed parameters with a particular suffix instead of checking for
the whole name.

For example, after providing a uuid parameter to xe vif-move, network-uuid
would no longer be suggested:
```
$ xe vif-move <TAB><TAB>
network-uuid=  uuid=

$ xe vif-move uuid=0af7619c-0798-c5be-5a0e-20813a48c7df <TAB><TAB>
```

This is fixed now:
```
$ xe vif-move uuid=0af7619c-0798-c5be-5a0e-20813a48c7df <TAB><TAB>
$ xe vif-move uuid=0af7619c-0798-c5be-5a0e-20813a48c7df network-uuid=
```

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
There were several optional boolean parameters that were checked and used in
cli_operations but were not included in cli_frontend (therefore would not be
shown in suggestions or 'xe help command').

Add these to cli_frontend. is_unique is a ... unique case because it does not
follow the style of the CLI parameters (which use dashes, not underscores, to
separate words), so add 'is-unique' to cli_frontend but handle both in
cli_operations.

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
This code hasn't been used for 10+ years.

Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
…oject#6503)

Instrument xenops vm non-atomic functions.
Instruments:
- `VM.add`,
- `VM.stat`,
- `VM.exists`,
- `VM.list`.

Instruments `switch_rpc` according to OpenTelemetry standard on
instrumenting rpc calls.
- `server.address` is the name of the message queue.

Intruments sending the message on a queue according to OpenTelemetry
standard on instrumenting messaging.
- `destination` is the name of the message queue.

`Tracing.with_tracing` now accepts an optional argument to set the Span
Kind.

![image](https://github.com/user-attachments/assets/b58a3309-6e31-4e40-84a5-c053bc00d5c8)
Maintenance mode is entered by running Host.evacuate, followed
by promoting a new pool coordinator and shutting down XAPI.

We only export spans every 30s, so we may miss exporting the span for Host.evacuate.
Ensure that we at least trigger the export when XAPI is about to shutdown.
Do not wait for the export to finish, because this could take a long time
(e.g. when exporting to a remote Jaeger instance).

After this change I now see Host.evacuate properly in the exported trace.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
json_reformat cannot handle newline delimited json, it is easier if we have a command to reformat it ourselves.

This can be useful when debugging why a trace is missing elements. Traces are stored as newline-delimited JSON
in /var/log/dt/zipkinv2/json, however json_reformat cannot process them directly, and the lines can be very long and difficult to read otherwise.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
xapi-project#6525)

Soon after Host.evacuate XAPI could be restarted (e.g. on coordinator
promotion).
But we only export traces every 30s, so we lose the spans from the last
30s, including the toplevel Host.evacuate span (which although long
running is only emitted on completion).

After this change I'm now able to see Host.evacuate and all the migrate
calls in the exported distributed trace.
Use the numa-compat argument to be able to override the default numa placement
using xenopsd.conf option.

This allows to change the default placement when building a package

This patch reverts some of the changes in e6f94be

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds the missing dependency to stdext-threads for testing

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
unfortunately threadext uses this log package, define a with_lock in xapi-log
tests to avoid using the former.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
SR-IOV (Single Root I/O Virtualization) is a technology that
allows a single physical PCI Express (PCIe) device, such as a
network adapter, to be shared efficiently among multiple virtual
machines (VMs) or containers. It achieves this by creating
Virtual Functions (VFs) that act as lightweight PCIe functions,
each assigned to a VM, while the Physical Function (PF) remains
responsible for managing the device.

Add check in Sysfs.is_physical - check if there is "physfn" in
the device dir to filter out VF, then XAPI will not create PIF
object for VF during scan.

Signed-off-by: Changlei Li <changlei.li@cloud.com>
Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
…oject#6488)

Currently, xapi-storage-script uses the presence/absence of a smapi
observer config file to determine whether it should create traces. This
only happens on startup which means smapiv3 traces will often not be
created when they should be (e.g. if the smapi component is enabled
after startup, tracing for smapiv3 will not be)

This commit updates the Smapi Observer forwarder to use an RPC client to
send messages to xapi-storage-script, updating it on any relevant
changes to the Observer. This is done in a generic way so that any
future components could use also listen to this queue to call its own
Observer functions.

Also move the Observer RPC declarations to a common file to reduce code
duplication and make some debug logs more helpful.
)

Use the numa-compat argument to be able to override the default numa
placement using xenopsd.conf option.

This allows to change the default placement when building a package

This patch reverts some of the changes in
e6f94be
SR-IOV (Single Root I/O Virtualization) is a technology that allows a
single physical PCI Express (PCIe) device, such as a network adapter, to
be shared efficiently among multiple virtual machines (VMs) or
containers. It achieves this by creating Virtual Functions (VFs) that
act as lightweight PCIe functions, each assigned to a VM, while the
Physical Function (PF) remains responsible for managing the device.

Add check in Sysfs.is_physical - check if there is "physfn" in the
device dir to filter out VF, then XAPI will not create PIF object for VF
during scan.
…api-project#6513)

`Host.cpu_info` list no longer contains a value associated with
a "features" key, but the CLI implementation was hardcoded to expect it.

As a result, it would show an unhelpful error like:

```
The server failed to handle your request, due to an internal error.
The given message may give details useful for debugging the problem.
message: INTERNAL_ERROR: [ Not_found ]
```

Instead use the `cpu_info_features` keys from xapi-consts (after moving
these keys from xapi-globs first).

Add the pool version of the command.

Additionally document their output format:

```
# xe help host-get-cpu-features
command name            : host-get-cpu-features
        reqd params     :
        optional params : uuid
        description     : Prints a hexadecimal representation of the host's physical-CPU
           features for PV and HVM VMs. features_{hvm,pv} are "maximum"
           featuresets the host will accept during migrations, and
           features_{hvm,pv}_host will be used to start new VMs.

# xe help pool-get-cpu-features
command name            : pool-get-cpu-features
        reqd params     :
        optional params :
        description     : Prints a hexadecimal representation of the pool's physical-CPU
           features for PV and HVM VMs. These are combinations of all the
           hosts' policies and are used when starting new VMs in a pool.

# xe host-get-cpu-features uuid=7f566729-0ee7-47c4-853d-2c5f3a195ad4
features_pv          : 1fc9cbf5-f6f83203-2991cbf5-00000123-00000001-000c0b39-00000000-00000100-00001000-ac000c00-00000000-00000000-00000000-00000000-00000000-00000000-1c020004-00000000-00000000-00000000-00000000-00000000
         features_hvm: 1fcbfbff-f7fa3223-2d93fbff-00000523-00000001-001c0fbb-00000000-00000100-00101000-bc000c00-00000000-00000000-00000000-00000000-00000000-00000000-1c020004-00000000-00000000-00000000-00000000-00000000
     features_pv_host: 1fc9cbf5-f6d83203-28100800-00000121-00000001-000c0b29-00000000-00000000-00001000-ac000400-00000000-00000000-00000000-00000000-00000000-00000000-0c000000-00000000-00000000-00000000-00000000-00000000
    features_hvm_host: 1fcbfbff-f7fa3203-2c100800-00000121-00000001-001c0fab-00000000-00000000-00101000-bc000400-00000000-00000000-00000000-00000000-00000000-00000000-0c000000-00000000-00000000-00000000-00000000-00000000

# xe pool-get-cpu-features
features_pv_host     : 1fc9cbf5-f6d83203-28100800-00000121-00000001-000c0b29-00000000-00000000-00001000-ac000400-00000000-00000000-00000000-00000000-00000000-00000000-0c000000-00000000-00000000-00000000-00000000-00000000
    features_hvm_host: 1fcbfbff-f7fa3203-2c100800-00000121-00000001-001c0fab-00000000-00000000-00101000-bc000400-00000000-00000000-00000000-00000000-00000000-00000000-0c000000-00000000-00000000-00000000-00000000-00000000
```

====

Tested manually
And remove the testing dependency to threadext to avoid a circular
dependency
@psafont psafont merged commit 0ac0908 into xapi-project:feature/configure-ssh-phase3 Jun 17, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.