Skip to content

Merge feature/perf to master #5760

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 83 commits into from
Jul 10, 2024
Merged

Merge feature/perf to master #5760

merged 83 commits into from
Jul 10, 2024

Conversation

edwintorok
Copy link
Contributor

edwintorok and others added 30 commits June 1, 2022 18:39
It is in /usr/sbin not /usr/bin

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Run xentrace in continuous tracing mode to a circular buffer
(of fixed size).
When a CPU spike is detected xentrace is sent a SIGTERM with `killall`
to get it to dump the xentrace buffer to disk.
The trace is compressed and another xentrace started.

This repeats until we run out of the space on the VDI set up for
tracing.

CPU spikes are detected by using `rrd2csv` and watching the host
`cpu_avg` which should be the average of the CPU usage of all the host's
pCPUs.

New flags:
* -c to enable circular tracing mode (default is a -T time one-off trace)
* -p <cpu-usage> e.g. -p 98 the detection point for CPU spike
* -r <n> repeat count - how many high cpu usage measurements in a row
  before triggerring the xentrace dump
* -M <n> to set the size of the circular buffer in MiB (default: 400MiB)

There is also a (hardcoded for now) sleep 60 between the traces just
so we don't continuously measure a very long spike.

To stop circular tracing mode you have to Ctrl-C on the running
xe-xentrace (e.g. under `screen`), or `killall xe-xentrace`
(but this seems to prevent automatic cleanup)

Signed-off-by: Edwin Török <edwin@etorok.net>
The C SDK build was failing with a recent GCC on Fedora39 like this:
```
src/xen_common.c: In function ‘xen_session_logout’:
src/xen_common.c:298:5: error: ‘xen_call_’ accessing 16 bytes in a region of size 0 [-Werror=stringop-overflow=]
  298 |     xen_call_(session, "session.logout", params, 0, NULL, NULL);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Use `NULL` instead of a VLA of size 0.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Requires libxml2-devel installed.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This will be useful for testing record_util.ml

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…ed case

Other enums here accept mixed case, not just lower case.

Fixes: 638f58e ("CP-38020: add HOST.set_numa_affinity_policy")

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
There are some inconsistencies between enum string values and their
declarations, e.g. `balance-slb` (which has special handling in the API
generator), or `depth-first` (which does not).

We want to automatically generate record_util.ml, but we must ensure backwards
compatibility, and the only way to do that is to exhaustively test all the old
values in a unit test.

This test won't need to be updated when new enum values are introduced: we'll
use the automatically generated ones for those already.

The `old*` files were created using the following command:
```
cp ocaml/xapi-cli-server/record_util.ml ocaml/tests/record_util/old_record_util.ml && dune build @check --profile=release && grep 'let all_' _build/o\default/ocaml/xapi-types/aPI.ml >|ocaml/tests/record_util/old_enum_all.ml && dune fmt --auto-promote
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Add record_util backwards compatibility test
XAPI was just retrying endlessly in a loop saying "Could not find block device",
but didn't raise any alerts that XenRT could detect.

Report the redo log broken the first time we fail due to the lack of a bloc k device (which would indicate something going wrong in SM).

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Nested virt is not a supported feature yet, but when moving the setting
from Xen to toolstack as part of the Xen-4.17 update I typoed the name of the setting.

It was platform/nested-virt in xenguest.patch before

Fixes: 664de76 ("Xen-4.15+: CDF_NESTED_VIRT")
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
For regular replies we look at the request and reply with a matching version.
However when we fail to parse the JsonRPC request itself then we don't know what version to use.

XenCenter uses JsonRPC v2 by default, and JsonRPC v2 has been supported in XAPI for a long time.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Fixes this pytype report:
```
pytype_reporter:          Expected: (self, uri, transport, encoding, verbose: bool = ..., ...)
pytype_reporter:   Actually passed: (self, uri, transport, encoding, verbose: int, ...)
pytype_reporter:          Expected: (self, uri, transport, encoding, verbose, allow_none: bool = ..., ...)
pytype_reporter:   Actually passed: (self, uri, transport, encoding, verbose, allow_none: int)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
pytype claims that there is no 'errno' in the socket module, but there is:
```
Python 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.errno.ETIMEDOUT
110
```

Also `session.local_logout` is a remote method that is proxied by `__getattr__`,
I don't know why `pytype` complains about that one, and not about `logout`, but suppress it.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…ication

This is already the case for HTTP(S) connections, but XenAPI overrides the xmlrpclib implementation
to provide a Unix socket transport, and it lacks this optimization.

Otherwise we've noticed that SM makes at least 4 separate connections to XAPI to query the version number for example.
On a busy system each of those connections can get delayed by 100ms:

* OCaml's tick thread switches threads every 50ms, and you might need 2 switches when XAPI is busy with CPU-bound workloads:
  * one to accept the connection
  * another to handle the request on the newly spawned thread

See:
https://github.com/python/cpython/blob/v2.7.5/Lib/xmlrpclib.py#L1361-L1371
https://github.com/python/cpython/blob/v3.6.8/Lib/xmlrpc/client.py#L1237-L1245

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Old version of XenAPI.py used to look at the API version to implement a compatibility layer,
but that code got dropped and the API version is now completely unused:

df9b539 ("CA-35286: Remove forward compatability code from the python XenAPI module in favour of the compatability layer in xapi")

Remove the unused argument, this will allow us to avoid making 4 additional API calls after every login.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The API version is not used anymore, so remove the 4 additional queries it'd make.
For backward compatibility retain the field as a cached property: should a client of this code want to access the API version,
then it'll make the 4 queries at that time, and cache the result for the duration of the API object.

Cached properties are writable, so the fallback that writes the API version to 1.1 should keep working too

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
CP-48623: reduce XenAPI.py connection rate and drop 4 useless API calls
CA-389506: fix platform:nested_virt typo
redo_log: report redo log as broken if we cannot find the block device
This forces us to fully declare the dependencies of our code,
and not rely on libraries that are brought in only as transitive dependencies
of other libraries we happen to link to.

E.g. if our module A depends on library X, which itself depends on library Y,
then currently by linking X we also get Y linked and accessible from A
directly.
If code in module A uses both module X and Y *directly* then it needs to
declare a dependency on both when implicit transitive deps are off or it gets a
link failure (typically an error about a module or type being abstract).
If the code in module A only uses module X then no change is needed (X can
still use Y and the final executable will link both, it is just a question of
what is visible and callable from A directly).

This is especially useful when writing new code to get dependencies correct
from the beginning.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
(cherry picked from commit 7203430)
…check

Bytecode builds for `http_lib` are disabled due to '(modes best)',
and that means that anything that depends on it must have it disabled too to avoid this warning.

Avoids these kinds of warnings:
```
File "_none_", line 1:
Error: Module `Buf_io' is unavailable (required by `Http_svr')
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This will be a new library that will provide a more type-safe interface to file descriptor operations.
Useful on its own, but also for testing stdext.

Minimal dependencies, only Unix (and Alcotest for testing).

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This will be a test framework providing QCheck generators and properties for
testing file descriptor operations.
It will try to generate:
* different kinds of file descriptors
* actual data written/read on the other end of pipes and socket pairs
* different speeds and delays on the other end to find buffering bugs
* file descriptors that are >1024 to find bugs with select

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We are going to use type-level constraints a lot.
Try to future proof it by using the recommended compiler flag.
`ocamlc` says this about `-principal`:
> When using labelled arguments and/or polymorphic methods, this flag is required to
> ensure future versions of the compiler will be able to infer types correctly, even if internal algorithms change

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
edwintorok and others added 7 commits June 19, 2024 09:46
We'll check statically that we are not using Unix.select, but it is good to have a runtime check too,
in case some library (C or OCaml) indirectly uses it.

Default is 0 but can be set to 1024 to test for the absence of Unix.select.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…rtup failures

If 'server_init' raised an exception previously we wouldn't be able to log the full stacktrace.
Enable backtraces earlier to ensure that we can.

We need to use Debug.with_thread_associated instead of just Backtrace.with_backtraces, because if the thread is not registered,
then xapi-backtrace won't print the backtrace even if it has one.

A startup failure now looks like this in the logs:
```
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] Xapi.watchdog failed with exception Unix.Unix_error(Unix.EMFILE, "dup", "")
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] Raised Unix.Unix_error(Unix.EMFILE, "dup", "")
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 1/7 xapi Raised at file ocaml/libs/xapi-stdext/lib/xapi-stdext-unix/unixext.ml, line 873
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 2/7 xapi Called from file ocaml/xapi/xapi.ml, line 936
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 3/7 xapi Called from file ocaml/xapi/xapi.ml, line 946
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 4/7 xapi Called from file ocaml/xapi/xapi.ml, line 1535
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 5/7 xapi Called from file ocaml/xapi/xapi.ml, line 1541
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 6/7 xapi Called from file ocaml/xapi/xapi.ml, line 1548
Jun 19 04:58:41 lcy2-dt72 xapi: [error||0 ||backtrace] 7/7 xapi Called from file ocaml/libs/log/debug.ml, line 250
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This commit got split between the 3 epoll PRs, and the Makefile part got missed:
f4baafa ("Test for absence of select: introduce open_1024 in tests")

The build still worked locally because I have ulimit > 1024 (and apparently in the CI too?),
but it failed in Koji where we had a 1024 limit.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This code snippet was unchanged since the beginning,
but pylint complained:
```
R1720: Unnecessary "else" after "raise", remove the "else" and de-indent the code inside it (no-else-raise)
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Fixes: 500a1f7 ("XenAPI: suppress pytype false positives")

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Copy link

codecov bot commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #5760     +/-   ##
========================================
+ Coverage    44.8%   51.4%   +6.5%     
========================================
  Files          16      13      -3     
  Lines        2210    1928    -282     
========================================
- Hits          992     991      -1     
+ Misses       1218     937    -281     

see 4 files with indirect coverage changes

Flag Coverage Δ
python2.7 ?
python3.11 51.4% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@coveralls
Copy link

Coverage Status

coverage: 44.887%. remained the same
when pulling ce17da2 on feature/perf
into e61e0ac on master.

@xapi-project xapi-project deleted a comment from coveralls Jul 4, 2024
@edwintorok edwintorok marked this pull request as ready for review July 4, 2024 14:23
@psafont
Copy link
Member

psafont commented Jul 8, 2024

turn off transitive deps in dune, so we are more explicit about our build dependencies (this makes it more obvious if you want a certain module to not link with something)

I'm surprised this doesn't cause problems in xs-opam, I've tried turning them off, but the xen-built xenctrl findlib libraries have a different name from the dummy xenctrl's (xenmmap and xenctrl.xenmmap in particular), making it impossible to build using an opam switch. Have you tested these changes in a developer opam switch?

@edwintorok
Copy link
Contributor Author

but the xen-built xenctrl findlib libraries have a different name from the dummy xenctrl's (xenmmap and xenctrl.xenmmap in particular),

There is an explicit dependency on xenmmap now for dbgring. And the koji/opam inconsistency has been fixed long ago by xapi-project/xenctrl#7.

I haven't tried building this in the xs-opam CI, but it builds in koji, in this CI and locally in an xs-opam based 4.14.1 opam switch.

@psafont
Copy link
Member

psafont commented Jul 10, 2024

@edwintorok the branch has a conflict, does it need another merge from master?

@edwintorok
Copy link
Contributor Author

yes, I can't push to the branch directly. I'll raise another PR to feature/perf to update it, and then we should really merge this branch or this'll keep happening (conflicts with other PRs being merged).

@edwintorok
Copy link
Contributor Author

PR here #5804

Update feature/perf from master
Copy link
Member

@robhoes robhoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume all changes in feature/perf have been reviewed in PRs previously, and subsequently tested extensively. So this is good to go now.

@edwintorok edwintorok merged commit 3c41333 into master Jul 10, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants