Skip to content

Update feature/perf from master #5913

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 219 commits into from
Aug 27, 2024
Merged

Update feature/perf from master #5913

merged 219 commits into from
Aug 27, 2024

Conversation

edwintorok
Copy link
Contributor

Draft.
Waiting on:
#5912
#5910

minglumlu and others added 30 commits July 9, 2024 12:06
Prior to this commit, the xapi on the coordinator host records the
'Unix.gettimeofday' as the timestamps of the heartbeat with other pool
supporter hosts. When the system clock is updated with a huge jump
forward, the timestamps would be far back into the past. This would
cause the xapi assumes that the hosts are offline as long time no
heartbeats.

In this commit, the timestamps are changed to get from a monotonic
clock. In this way, the system clock changes will not impact the
heartbeats' timestamps any more.

Additionally, Host_metrics.last_updated is only set when the object is
created. It's useless in check_host_liveness at all.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
…to-non-cdn-update

Merge master to feature/non-cdn-update
…to-non-cdn-update

Merge master to feature/non-cdn-update
Add a field "origin" in "repository" to indicate the origin of the
repository. It's an enum type with 2 values: "remote" and "bundle".

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
CP-49212: Update datamodel for non-CDN update
By default message-switch calls are serialized for backwards compatibility reasons in the Lwt and Async modes.
(We tried enabling parallel actions by default but got some hard to debug failures in the CI).

This causes very long VM start times when multiple VBDs are plugged/unplugged concurrently: the operations are seen concurrently by message-switch,
but xapi-storage-script only retrieves and dispatches them sequentially, so any opportunity for parallel execution is lost.
Even though the actions themselves only take seconds, due to serialization, a VM start may take minutes.

Enable parallel processing explicitly here (instead of for all message-switch clients).
SMAPIv3 should expect to be called concurrently (on different hosts even), so in theory this change should be safe
and improve performance, but there are some known bugs in SMAPIv3 plugins currently.

So introduce a config file flag 'concurrent' for now, that defaults to false,
but that can be turned to 'true' for testing purposes.
When all SMAPIv3 concurrency bugs are believed to be fixed we can flip the default,
and eventually remove this flag once no more bugs are reported.
The configuration value is done as a global to simplify integrating intot he Lwt port,
instead of changing a lot of functions to thread through an argument.

This doesn't change the behaviour of xapi-storage-script in its default configuration.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Add a module Tar_ext to unpack a tar file.
During the unpacking process, verify the tar file, containing total file size,
file type, file path, file integrity, etc.

Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
Signed-off-by: Bengang Yuan <bengang.yuan@cloud.com>
CP-49213: Add new tar unpacking module
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
If we want to reproduce a failure we need to know the exact commandline that was used.
Longer than 80 chars is not a problem, this is a logfile, and a truncated line is worse than a long line.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Non-running VMs' metrics are stored in the coordinator. When the coordinator is
asked about the metrics try to unarchive them instead of failing while trying
to fetch the coordinator's IP address.

This needs to force the HTTP method of the query to be POST

Also returns a Service Unavailable when the host is marked as Broken.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Forces users to use an address, instead of being implicit, this avoid the
underlying cause for the issue fixed in the previous commit: it allowed a
coordinator to call Pool_role.get_master_address, which always fails.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
This makes the selection of the action obvious, previously the two booleans
made it hazy to understand the decision, and was part of the error why the
coordinator tried to get the coordinator address from the pool_role file (and
failed badly)

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Currently a List.assoc is used, which raises an unhandled exception.

Signed-off-by: Pau Ruiz Safont <pau.ruizsafont@cloud.com>
Shell quoting changes in xen-api.git 65f152d
broke the dry-run functionality, as by quoting parameters in the way it was done
it meant the space separation was not properly handled to separate parameters
etc.

Signed-off-by: Alex Brett <alex.brett@cloud.com>
The xe-[backup,restore]-metadata scripts have cleanup logic designed to ensure
we do not leave any vbd objects etc behind.

This logic calls `vbd-unplug` with a 20s timeout, and then (regardless of the
result) allows up to 10s for any device specified in the VBD to disappear - if
it does not, it does not trigger a `vbd-destroy`.

This logic fails in the case where a VDI is attached to a VM running on the same
host, as the `device` field in the new VBD will be populated with the backend
device for the running VM. In this case, the `vbd-unplug` fails immediately (as
the vbd is not plugged because the original `vbd-plug` attempt fails as the VDI
is in use), but then we sit waiting for 10s for the device to disappear (which
is obviously does not), and then fail to trigger a `vbd-destroy`, leaving the
VBD behind.

Fix this by simply removing the wait for the device to disappear and always
attempting a `vbd-destroy`, as I am not aware of any situation where this
additional 10s wait will give any benefit given current behaviours.

Signed-off-by: Alex Brett <alex.brett@cloud.com>
This patch makes the feature use the debugfs utility, part of e2fsprogs. This
makes the system as a whole a heck of a lot better, if only because it won't be
able to parse XFS, ReiserFS or any of the many plugins of libfsimage.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Add a new option `-o` to xe-restore-metadata, which is used to distinguish
whether to allow use of legacy backup VDIs, or enforce only use of the new
format VDIs with known UUIDs.

Also modify xe-restore-metadata such that it no longer stops searching the
candidate list if only one VDI is found, but instead identifies all possible
backup VDIs. If more than one is found, and you are doing anything other than
listing the VDIs, the script will abort. This is to cover the case where a
malicious legacy format VDI is present - we will detect it and the expected
'real' backup VDI.

Modify xe-backup-metadata to always expect to use the deterministic UUID to
identify the VDI to add backups to, do not rely on the
`other-config:ctxs-pool-backup` property for identification in any way.

This is XSA-459 / CVE-2024-31144

Signed-off-by: Alex Brett <alex.brett@cloud.com>
- Quote a parameter
- Adjust how we check the returncode of some function calls to satifsy shellcheck
- Disable the warnings where we are explicitly relying on string splitting

Signed-off-by: Alex Brett <alex.brett@cloud.com>
This parameter was added in 7f1d315, but the
changes to always use the new metadata VDIs with known UUIDs means it is no
longer required, so remove it.

Signed-off-by: Alex Brett <alex.brett@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
* Remove redundant parameter wiping

Removes ineffectual parameter wiping introduced by 6e24ca4.

Signed-off-by: Colin James <colin.barr@cloud.com>
Andrii Sultanov and others added 27 commits August 23, 2024 15:16
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Other selectors help strings have spaces before them, VM doesn't, so the final
combined help string looked like this: "Suspend the selected VM(s).VMs...."

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Currently, `xe <TAB>` shows a pager after asking whether or not to show 600+
commands. This is extremely unhelpful and does not speed up working with the
CLI. Instead, determine unique prefixes ("groups", or "modules", such as 'vm',
or 'repository', or 'pool') for all commands, and show them first (there's
only 50). Once the command entered by the user unambiguously refers to only
one such group, only then show the subcommands in said group (e.g. if 'vm-' is
entered, show 'vm-copy', 'vm-migrate', etc.)

This does not change the underlying CLI workings, just how autocompletion
presents existing commands.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Group data sources. Leave groups with '-' as suffix, and unique data source
names without '-' (i.e. given input of "cpu0-foo,cpu0-bar,xapi_open_fds",
return "cpu0-,xapi_open_fds"). Once the data source group is referred to
unambiguously, show all suggestions

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
These are cleaned up from the final string before it's autocompleted into the
command line. Stylistically, the auxiliary information is right-padded to the
longest command so that it starts on the same vertical line.
The help string is cut to the first sentence since some of these can be
quite long.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Covers all kinds of UUIDs (options 'uuid=', 'sr-uuid=')
These are stripped before the suggestion is entered into the command line.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Until autocompletion can be generated from the
datamodel, this is just naive hardcoding. These cases were
obtained by looking for boolean fields:
'xapi-cli-server/records.ml | grep bool_of_string' and 'grep string_of_bool'

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
It was a duplicate switch case that would never be hit - unify them instead.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
':' as separators did not work, as colon was treated as a separator and wasn't
parsed correctly in the initial OLDSTYLE_WORDS phase. This makes the behaviour
correct and improves the code around it.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Dereferences need to be quoted, but curly brackets are only needed when the
variable name is inside a sentence where characters after it can still
potentially refer to a variable name; or when indexing into an array.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Check if the parameter has already been provided and if so, do not include it
in the list of autocompletions anymore.

This should be safe - xe has other ways of entering parameters multiple times:
with commas, or as parts of records. To verify, cli_operations.ml receives the
command line arguments as an associative list and only contains List.assoc and
the like, no List.find_all (apart from the special case in
5d1601c which detects if a parameter was
entered multiple times and errors out, warning the user - which only underlines
that entering a parameter multiple times is not the desired behaviour, can
confuse things and lead to errors)

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
…ompletion

If command has both optional and required parameters, shows a prefix before
the parameter name specifying its status:

```
$ xe vmss-create <TAB><TAB>
OPTIONAL: enabled=             REQUIRED: frequency=
OPTIONAL: name-description=    REQUIRED: name-label=
OPTIONAL: retained-snapshots=  REQUIRED: type=
OPTIONAL: schedule:
$ xe vmss-create n<TAB><TAB>
OPTIONAL: name-description=  REQUIRED: name-label=
```

But:
```
$ xe observer-create <TAB><TAB>
OPTIONAL: attributes=        OPTIONAL: endpoints=         REQUIRED: name-label=
OPTIONAL: components=        OPTIONAL: host-uuids=
OPTIONAL: enabled=           OPTIONAL: name-description=
$ xe observer-create en<TAB><TAB>
enabled=    endpoints=
```

(If suggested parameters all come from one class, this status is not shown due
to the limitation in the architecture of the autocompletion - this prefix would
be common to all commands and would be inserted into the user's command line)

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
…ters

```
$ xe vdi-create <Ctrl-r><q>

$ xe vdi-create sr-uuid= name-label= virtual-size=
```

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Factors out logic for transforming custom specifications into standard
specifications in order to permit custom argument specifications to be provided
alongside descriptions. Introduces a new list of options for this purpose.

Signed-off-by: Colin James <colin.barr@cloud.com>
In preparation for memoization, we wrap a region of
"login_with_password" (the part calling out to external authentication code)
within a nested function ("query_external_auth").

We introduce a new record type "external_auth_result" that represents the set
of downward exposed variables that that are bound within the code our nested
function has incorporated. This record becomes the result of
"query_external_auth". Consequently, the function is immediately invoked and
its result is matched upon to bring all of the fields into scope. The future
memoization code will treat a call to this function as the "slow path".

Signed-off-by: Colin James <colin.barr@cloud.com>
Implements a wrapper for Psq that provides maximum capacity behaviour.

Signed-off-by: Colin James <colin.barr@cloud.com>
Adds basic functional testing for a bounded priority search queue.

Signed-off-by: Colin James <colin.barr@cloud.com>
Provides a simple authentication cache built using a bounded priority search
queue.

The structure of the cache is effectively a (bounded) priority search queue,
mapping usernames to entries - where entries have an explicit time-to-live and
can store arbitrary data (optionally secured with a cryptographic hash).

Also provides a configurable expiry span in the form of a global in Xapi_globs,
"external_authentication_expiry". This time span is added to the current
time - upon entering data into the cache - to provide an expiration time. Its
default value is 5 minutes. The input is specified in seconds.

The underlying priority search queue treats the minimum entry as being the
entry with highest priority. Therefore, as entries are compared by their
expiration time, the entry that is soonest to expire is the one that is evicted
when attempting to add a new entry when the cache is at capacity.

Signed-off-by: Colin James <colin.barr@cloud.com>
Signed-off-by: Colin James <colin.barr@cloud.com>
- Adds mirage-crypto-rng and mirage-crypto-rng.unix libraries to xapi's dune
  file. These are used to provide random number generation used for generating
  salts.
- Introduces a cache implementation that secures authenticated sessions
  (results) using SHA-512 hashes (computed using libcrypt's crypt_r function).

Signed-off-by: Colin James <colin.barr@cloud.com>
Adds configurable options "enable-external-authentication-cache" and
"external-authentication-cache-size" to Xapi_globs, which control whether the
cache is enabled and the maximum capacity of the cache, respectively.

Also adds "external-authentication-expiry" option, which permits
configuration of the time-to-live assigned to each cache entry. The global that
it manipulates in Xapi_globs was introduced earlier.

In future, these configuration options may become a pool-wide feature.

Signed-off-by: Colin James <colin.barr@cloud.com>
This change provides the workings required to allow the previously functionalised
section of login_with_password to be bypassed by consulting an authentication
cache.

Before attempting to execute that section of login_with_password (which calls
out to external authentication plugins), a cache is consulted. If a
previously-computed result resides within the cache, that is provided. If there
is no such result present in the cache, or external authentication caching is
disabled, the previous section of code (the "slow path") is invoked.

Signed-off-by: Colin James <colin.barr@cloud.com>
tests for Unix.select and introduce select-as-epoll
@edwintorok edwintorok marked this pull request as ready for review August 27, 2024 13:34
@edwintorok
Copy link
Contributor Author

The epoll tests have been merged to master, so we need to update feature/perf to latest master to make the `feature/perf\ PRs reviewable.

@edwintorok edwintorok merged commit 0cf353c into feature/perf Aug 27, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.