Merge master into feature/host-network-device-ordering #6520
Closed
Conversation
…onfigure
Add new host object fields:
- ssh_enabled
- ssh_enabled_timeout
- ssh_expiry
- console_idle_timeout
Add a new host/pool API to set a temporary SSH service enablement timeout:
- set_ssh_enabled_timeout
Add a new host/pool API to set the console timeout:
- set_console_idle_timeout
Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
This PR introduces support for Dom0 SSH control, providing the following capabilities:
- Query the SSH status.
- Configure a temporary SSH enablement timeout for a specific host or all hosts in the pool.
- Configure the console idle timeout for a specific host or all hosts in the pool.
Changes
New host object fields:
- `ssh_enabled`: Indicates whether SSH is enabled.
- `ssh_enabled_timeout`: Specifies the timeout for temporary SSH enablement.
- `ssh_expiry`: Tracks the expiration time of temporary SSH enablement.
- `console_idle_timeout`: Configures the idle timeout for the console.
New host/pool APIs (this PR only includes the data model changes; the implementation of these APIs will be included in the next PR):
- `set_ssh_enabled_timeout`: Allows setting a temporary timeout for enabling the SSH service.
- `set_console_idle_timeout`: Allows configuring the console idle timeout.
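A simplified sketch of the new per-host state as a plain OCaml record may help fix the semantics of the four fields. This is illustrative only, not the xapi datamodel DSL, and the field types are assumptions based on the description above:
```ocaml
(* Illustrative only: a plain-OCaml model of the new host fields.
   Types are assumptions; the real definitions live in xapi's datamodel. *)
type host_ssh_state = {
    ssh_enabled: bool  (* whether the SSH service is currently enabled *)
  ; ssh_enabled_timeout: int64  (* seconds of temporary enablement; 0 = permanent *)
  ; ssh_expiry: float  (* epoch time when temporary SSH access expires *)
  ; console_idle_timeout: int64  (* seconds before an idle console session is closed *)
}
```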
During pool join, create a new host object in the remote pool coordinator's DB with the same SSH settings as the pool coordinator. Also configure the SSH service locally before the xapi restart, so that the configuration persists after xapi restarts. Signed-off-by: Gang Ji <gang.ji@cloud.com>
After being ejected from a pool, a host gets a new host object created in the DB with default settings. This commit resets the SSH service on the ejected host to its default state during pool eject. Signed-off-by: Gang Ji <gang.ji@cloud.com>
Signed-off-by: Gang Ji <gang.ji@cloud.com>
Signed-off-by: Gang Ji <gang.ji@cloud.com>
Implemented XAPI APIs:
- `host.set_console_idle_timeout`
- `pool.set_console_idle_timeout`
These APIs allow XAPI to configure the timeout for idle console sessions. Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
Implemented XAPI APIs:
- `host.set_ssh_enabled_timeout`
- `pool.set_ssh_enabled_timeout`
These APIs allow XAPI to configure the timeout for the SSH service. `host.enable_ssh` now also supports enabling the SSH service with an `ssh_enabled_timeout`. Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
Updated the `records.ml` file to support `host-param-set/get/list` and `pool-param-set/get/list` for SSH-related fields. Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
Implemented XAPI APIs:
- `set_ssh_enabled_timeout`
- `set_console_idle_timeout`
These APIs allow XAPI to configure timeouts for the SSH service and idle console sessions at both host and pool level. Updated `records.ml` to support `host-param-set/get/list` and `pool-param-set/get/list` for SSH-related fields.
The error `set_console_idle_timeout_failed` was added in the feature branch but is not used anywhere. The error now used in `set_console_idle_timeout` is `invalid_value`. Signed-off-by: Gang Ji <gang.ji@cloud.com>
Signed-off-by: Zeroday BYTE <pwnosecauth@gmail.com>
- Ensure host.ssh_enabled reflects the actual SSH service state on startup, in case it was manually changed by the user.
- Reschedule the "disable SSH" job if:
  - host.ssh_enabled_timeout is set to a positive value, and
  - host.ssh_expiry is in the future.
- Disable SSH if:
  - host.ssh_enabled_timeout is set to a positive value, and
  - host.ssh_expiry is in the past.
Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
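A minimal sketch of this startup reconciliation, with the scheduling and disabling actions stubbed out (the real xapi code differs):
```ocaml
(* Sketch only: [schedule_disable_ssh_job] and [disable_ssh] stand in for
   the real xapi actions. *)
let schedule_disable_ssh_job ~at = Printf.printf "will disable SSH at %f\n" at

let disable_ssh () = print_endline "disabling SSH now"

let reconcile_ssh_on_startup ~now ~ssh_enabled_timeout ~ssh_expiry =
  if ssh_enabled_timeout > 0L then
    if ssh_expiry > now then
      (* timeout still running: re-arm the "disable SSH" job *)
      schedule_disable_ssh_job ~at:ssh_expiry
    else
      (* timeout elapsed while xapi was down: disable SSH immediately *)
      disable_ssh ()
```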
Merge master into feature branch
Viewing RRDs produced by xcp-rrdd is difficult, because the format is incompatible with rrdtool, and rrdtool has a hardcoded limit of 20 characters for RRD names for backward compatibility with its binary format. Steps:
* given a directory of gzipped XML files containing xcp-rrdd-produced RRDs,
* the tool invokes itself recursively on each file in turn using xargs -P (an easy way to parallelize on OCaml 4),
* loads all RRDs and splits them into separate files, allowing many of their names to be shortened without conflicts,
* some names are still too long; a built-in translation table shortens these,
* once split, an .rrd file is created using 'rrdtool restore', which can be further queried/inspected/transformed by rrdtool as needed,
* a .sh script is produced that can plot the RRD if desired. There are many RRDs, so plotting isn't done automatically yet. RRDs usually contain min/avg/max, so this is drawn as a strong line at the average and an area in a lighter color for min/max (especially useful for historic data that has been aggregated).
Caveats:
* we don't know the unit name; that is part of the XAPI metadata, but apparently not the XML,
* separate plots are generated for separate intervals; it'd be nice to join all of these into the same graph,
* the visualization type is not the best for all RRDs; some might benefit from a smoother line, etc.,
* for now the tool is just built, not installed (that'll require a .spec change too and can be done later).
This is just a starting point to be able to visualize this data somehow; we can improve the actual plotting later. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Add 'tyre' dependency. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
- Refine the exception raised when host.enable_ssh/host.disable_ssh fails
- Reset host.ssh_expiry to the default when host.enable_ssh is called with no timeout
Signed-off-by: Lunfan Zhang <Lunfan.Zhang@cloud.com>
When we have multiple SM plugins in XAPI for the same type (which happens only because of past problems) and want to remove the obsolete one, do this by reference. The code so far assumed there was only one plugin per type and looked up the reference by name, which was not unique and hence could end up removing the wrong SM entry. Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
When adding a feature, developers had to change both the variant and the all_features list. Now the list is autogenerated from the variant, and the compiler will complain if a feature's properties are not defined. Also reduced the complexity of the code in the rest of the module. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
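One standard way to get this effect in OCaml is ppx_deriving's `enumerate` plugin, which generates an `all_<type>` list from the variant; whether xapi uses exactly this mechanism is an assumption, but the shape of the guarantee is the same:
```ocaml
(* With [@@deriving enumerate], [all_features : feature list] is generated
   from the variant, so a new constructor cannot be forgotten in the list.
   The feature names here are just examples. *)
type feature = HA | RBAC | DMC | Checkpoint [@@deriving enumerate]

(* The compiler's exhaustiveness check then forces every feature to
   define its properties: *)
let name_of_feature = function
  | HA -> "ha"
  | RBAC -> "rbac"
  | DMC -> "dmc"
  | Checkpoint -> "checkpoint"
```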
I tried sharing more code between hard and soft affinities, but the memory management of the two cpumaps blows up the number of branches that need to be taken care of, making it more worthwhile to duplicate a bit of code instead. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
No functional change. This prepares pre_build in the domain module to be able to set the hard affinity mask, without communicating the mask to xenguest through xenstore. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
To handle deviations in CPU rates, Derive values exceeding the maximum by up to 5% are capped at the maximum; values exceeding it by more are marked as unknown (as sketched below). This logic is specific to Derive data sources because they represent rates derived from differences over time, which can occasionally exceed expected bounds due to measurement inaccuracies.
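A minimal sketch of the capping rule under those assumptions:
```ocaml
(* Derive values up to 5% over the maximum are clamped to the maximum;
   anything beyond that is treated as unknown (NaN). *)
let cap_derive ~max:m value =
  if value <= m then value
  else if value <= m *. 1.05 then m
  else Float.nan
```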
In XSI-1915, MCS shut down a VM and tried to destroy a VBD right after receiving the event from the power_state change, and failed. The reasons for the failure:
1. The update of the VM's power_state and the updates of its VBDs are not one transaction, so the client may receive the event from the power_state update and operate on the VBDs before the VBD updates land.
2. The VM is running on a supporter, so the DB operations need to send RPCs to the coordinator, which takes time.
3. Between the update of the VM's power_state and the updates of the VBDs, xapi also updates the pending_guidances field, which needs at least 8 DB operations. This also delays the VBD updates.
It's not straightforward to wrap these DB operations in transactions. The workaround, sketched below, is to move the pending_guidances update after the DB operations for VBDs, VIFs, GPUs, etc., so that the VBDs are updated immediately after the VM's power_state.
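A sketch of the reordering, passing the three update steps as functions for illustration (the real code updates the DB directly):
```ocaml
(* Workaround ordering: clients react to the power_state event, so the
   device records must follow it as quickly as possible; the expensive
   pending_guidances bookkeeping (8+ DB writes) is moved to the end. *)
let on_vm_power_state_change ~update_power_state ~update_devices
    ~update_pending_guidances =
  update_power_state () ;     (* clients receive the event here *)
  update_devices () ;         (* VBDs/VIFs/GPUs updated immediately after *)
  update_pending_guidances () (* no longer delays the device updates *)
```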
Use cram tests to check the expected output of the commands instead. This reduces the amount of text displayed when running tests, which makes locating errors in the logs easier. When the output of the tools changes deliberately, the expect files can be updated with `dune runtest --auto-promote`.
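For illustration, a cram test is a `.t` file in which commands are indented two spaces and prefixed with `$`, each followed by its expected output, which `dune runtest --auto-promote` rewrites when it changes deliberately:
```
  $ echo hello
  hello
```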
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Some rough guidelines on the contribution process for the project. Intended more as a starting point for a discussion.
I wrote this in ~2022, so I don't fully remember how it all works, but I tried to document what I know in the commit message (quoted above), in the CLI flag docs, and here:
```
scp -r root@$YOURBOX:/var/lib/xcp/blobs/rrds /tmp/rrds
dune exec ./rrdview.exe -- /tmp/rrds
bash /tmp/rrds/16db833b-7cd6-4b69-9037-144076c71033.cpu_avg.DERIVE.sh
```
One caveat beyond those in the commit message: there is some code here that starts parsing the data source definitions; eventually I wanted to plot the data using OCaml instead of rrdtool (e.g. generate Vega/Vega-Lite graphs), but I can't find which branch I put *that* code on, and what I have here is incomplete (or maybe I never wrote that part, just thought about it). We could trim the dead code if needed, but it might be useful if we continue improving the tool later, so for now I left the parsing in.
The consolidator used to be aware of which domains were paused, this was used to avoid reporting memory changes for paused domains, exclusively. Move that responsibility to the domain memory reporter instead, this makes the decision local, simplifying code. This is useful to separate the memory code from the rest of rrdd. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Update all `25.20.0-next` to `25.21.0` in `datamodel_lifecycle.ml`. Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
The /sys/fs/cgroup/systemd/cgroup.procs file is not always present, particularly on updated Linux systems with a newer cgroup layout and systemd, so fall back to the root /sys/fs/cgroup/cgroup.procs. Also handle errors and report them back to OCaml. Although systemd discourages handling cgroups without service configuration changes, the root cgroup is a bit special, as it receives processes from multiple sources, including the kernel. Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
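A hedged OCaml sketch of the path fallback described above (the real change also reports errors from native code back to OCaml):
```ocaml
(* Prefer the cgroup-v1 systemd hierarchy, fall back to the unified
   cgroup-v2 root, and surface an error instead of failing silently. *)
let cgroup_procs_path () =
  let candidates =
    [
      "/sys/fs/cgroup/systemd/cgroup.procs" (* cgroup v1 with systemd *)
    ; "/sys/fs/cgroup/cgroup.procs" (* unified cgroup v2 root *)
    ]
  in
  match List.find_opt Sys.file_exists candidates with
  | Some path -> Ok path
  | None -> Error "no cgroup.procs file found"
```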
Unfortunately mirage-crypto has accumulated breaking changes:
- Cstructs have been replaced with strings
- The digestif library has replaced the ad-hoc hash implementations
A deprecation has happened as well:
- RNG initialization has changed
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
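As a small illustration of the digestif side of the migration (not the actual xapi diff): hashes are now computed over plain strings via digestif rather than over Cstruct.t values:
```ocaml
(* Hashing with digestif works directly on strings. *)
let sha256_hex (data : string) : string =
  Digestif.SHA256.(to_hex (digest_string data))
```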
This API call and the corresponding XE implementation call a host plugin on the host where a VM is running. It thus takes care of finding the right host, unlike Host.call_plugin, where this is left to the user. Signed-off-by: Christian Lindig <christian.lindig@cloud.com>
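For contrast, with the pre-existing host-level call the user must first work out which host the VM is resident on and pass it explicitly (syntax of the existing `xe host-call-plugin` command; the new call removes this lookup step):
```
xe host-call-plugin host-uuid=<uuid of the VM's resident host> \
   plugin=<plugin> fn=<function> args:key=value
```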
Add a new function that will invoke a callback every time one of the tasks is deemed non-pending. This will allow its users to: 1) track the progress of tasks within the submitted batch 2) schedule new tasks to replace the completed ones Modify wait_for_all_inner so that it adds the tasks returned from the callback to its internal set on every new task completion. Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
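A simplified sketch of the idea (busy-polling here for brevity; the real Tasks module blocks on task events):
```ocaml
type task = {id: int; mutable pending: bool}

let is_pending t = t.pending

(* Invoke [on_complete] for each finished task; it may return replacement
   tasks, which join the pending set, keeping the pipeline full. *)
let rec wait_for_all_with_callback ~pending ~on_complete =
  match pending with
  | [] -> ()
  | _ ->
      let finished, still_pending =
        List.partition (fun t -> not (is_pending t)) pending
      in
      let replacements = List.concat_map on_complete finished in
      wait_for_all_with_callback
        ~pending:(still_pending @ replacements)
        ~on_complete
```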
With bab83d9, host evacuation was parallelized by grouping VMs into batches, and starting a new batch once the previous one has finished. This means that a single slow VM can potentially slow down the whole evacuation. Instead use Tasks.wait_for_all_with_callback to schedule a new migration as soon as any of the previous ones have finished, thus maintaining a constant flow of n migrations. Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
at least for a while longer... Mirrors the changes in the 1.249 LCM branch: #6473 Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Currently rrdd needs to know when a metric comes from a newly created domain (after a local migration, for example), because when a new domain is created its counters start from zero again. This needs special logic when aggregating metrics, since xcp-rrdd must provide continuity of a VM's metrics by UUID even if the domid changes. Previously rrdd fetched the data about domains before metrics from plugins were collected, and reused the data for self-reported metrics. While this meant that for self-reported metrics it was impossible to miss collected information, for plugin metrics it meant that for newly created and destroyed domains, the mapping between domain id and VM UUID was not available. With the current change, the domain ids and VM UUIDs are collected on every iteration of the monitor loop and kept for one more iteration, so domains destroyed in the last iteration are remembered and not missed. With this done it's now safe to move the host and memory metrics collection into its own plugin. Also use sequences more thoroughly in the code for transformations. Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
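A sketch of the one-extra-iteration bookkeeping (module and function names are illustrative, not the actual rrdd code):
```ocaml
module IntMap = Map.Make (Int)

(* Merge the domid -> VM uuid mapping seen this iteration with the one
   from the previous iteration: current entries win, and entries only in
   the previous map survive one more round, so metrics from domains
   destroyed since the last loop can still be attributed to their VM. *)
let domains_for_this_iteration ~previous ~current =
  IntMap.union
    (fun _domid current_uuid _previous_uuid -> Some current_uuid)
    current previous
```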
The only use of it was a parameter that was not used anywhere Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Andrii Sultanov <andriy.sultanov@vates.tech>
Currently rrdd needs to know when a metric comes from a new domain (after a local migration, for example). This is because when a new domain is created the counters start from zero again, which needs special logic when aggregating the metrics into RRDs. Previously rrdd collected this information before metrics were collected, which means that metrics collected by plugins could be lost if the domain was created in that small window of time, or if the domain was destroyed after a plugin collected data about it. With the current change the domains are collected every loop and added to the domains collected in the previous loop, to avoid missing any newly created or destroyed domains. The current iteration only gets fed data from the last iteration, to avoid accumulating all domains seen since the start of xcp-rrdd. With this done it's now safe to move the host and memory metrics collection into its own plugin. Also use sequences more thoroughly in the code for transformations. I've manually tested this change by repeatedly live-migrating a VM within a single host and checking that no beats are missed on the graphs.
With bab83d9, host evacuation was parallelized by grouping VMs into batches and starting a new batch once the previous one has finished. This means that a single slow VM can potentially slow down the whole evacuation. Add a new `Tasks.wait_for_all_with_callback` function that will invoke a callback every time one of the tasks is deemed non-pending. This will allow its users to: 1) track the progress of tasks within the submitted batch, 2) schedule new tasks to replace the completed ones. Use the new `Tasks.wait_for_all_with_callback` in `xapi_host` to schedule a new migration as soon as any of the previous ones has finished, thus maintaining a constant flow of `n` migrations. Additionally, expose the `evacuate-batch-size` parameter in the CLI; this was missed when the parameter was originally added, with the CLI setting it to `0` (pick the default) all the time. === Manually tested multiple times, confirmed to not break anything and to actually maintain a constant flow of migrations. This should greatly speed up host evacuations when the host has a mix of bigger and smaller VMs (in terms of memory/disk, or VMs that are slow to migrate for some other reason).
Unfortunately mirage-crypto has accumulated breaking changes:
- Cstructs have been replaced with strings
- The digestif library has replaced the ad-hoc hash implementations
A deprecation has happened as well:
- RNG initialization has changed
Because there are breaking changes, xs-opam changes need to be introduced at the same time: xapi-project/xs-opam#731 Only xapi is affected by the breaking builds, so no other toolstack repositories have incoming PRs. I've tested builds with Smoke and validation tests: SR 218740 This means that the merge will be done as follows:
- Both PRs are approved
- First, xs-opam will be merged (with failing CI)
- Then this PR will be merged via the merge train that runs tests before actually merging
- CI is rerun manually on xs-opam to make it green again
After merging both, xenserver's CI should create a successful build with both PRs included.
Also fix the Makefile so that 'make clean' also deletes the `.o.d` files. This avoids accidentally adding these files to git (normally dune invokes make inside _build; only if you invoke make manually would it create these extra files):
```
A ocaml/forkexecd/helper/close_from.o
A ocaml/forkexecd/helper/close_from.o.d
A ocaml/forkexecd/helper/syslog.o
A ocaml/forkexecd/helper/syslog.o.d
A ocaml/forkexecd/helper/vfork_helper
A ocaml/forkexecd/helper/vfork_helper.o
A ocaml/forkexecd/helper/vfork_helper.o.d
```
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
gangj approved these changes on Jun 11, 2025
I need to fix the CI error. It's strange the error didn't emerge before.