Skip to content

CP-52524: Generate an alert when various host kernel taints are set #6128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

last-genius
Copy link
Contributor

Issue an alert about a broken host kernel if bits 4, 5, 7, 9, or 14 are set in
/proc/sys/kernel/tainted, indicating some kind of error was encountered and the
future behaviour of the kernel might not be predictable or safe anymore (though
it generally should reasonably recover).

Only one alert per tainted bit per boot should be issued.

Distinguish between Major (4,5,7 - these are all things that might cause a
host crash, but are unlikely to corrupt whatever data has been written out) and
Warning (9, 14 - might be a concern and could be raised to Support but usually
are not severe enough to crash the host) levels of errors as
suggested by the Foundations team.

This should serve as an indicator during issue investigation to look for the
cause of the taint.

@last-genius last-genius force-pushed the private/asultanov/CP-52524 branch 3 times, most recently from 61ee7d2 to 5c6a022 Compare November 19, 2024 14:08
@last-genius
Copy link
Contributor Author

Ran a Ring 3 BST+BVT suite with a jobtest verifying that no kernel error alerts are raised in otherwise successful runs, doesn't seem to have been triggered, so this should be good to go alongside a XenRT PR for the jobtest.

@last-genius last-genius marked this pull request as ready for review November 20, 2024 08:22
Unixext.string_of_file "/proc/stat"
|> String.trim
|> String.split '\n'
|> List.find (fun s -> String.starts_with ~prefix:"btime" s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once found, the line could be parsed with sscanf as well.

@last-genius last-genius force-pushed the private/asultanov/CP-52524 branch from 5c6a022 to 3c0b352 Compare November 21, 2024 09:45
@last-genius
Copy link
Contributor Author

Implemented the suggestions - now I only read through the alerts once on startup.

Copy link
Contributor

@contificate contificate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most recent changes are a lot clearer.

The r := !r |> .. is a neat syntactic approach to clarifying what's going on there.

|> List.filter (fun (_, alert_message) ->
let alert_already_issued_for_this_boot =
Helpers.call_api_functions ~__context (fun rpc session_id ->
Client.Client.Message.get_all_records ~rpc ~session_id
Copy link
Member

@psafont psafont Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performing this call while loading the module seems dangerous, it looks like it might cause a dependency loop.

Does it block until the socket is available? Does the socket depend on this module before it's available? (message forwarding depends on this module)

Using a lazy value, and only filling it up on the first call in an independent thread seems safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much more sensible, done, thank you! (likely the tests were failing for this exact reason)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably could have avoided the pervasive use of Lazy within the logic by making it a concern of the periodic scheduler, not the logic here.

Copy link
Contributor

@edwintorok edwintorok Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazy is also not thread safe (and crashed on older OCaml versions), so we have to be very careful how to use it. Another way to avoid lazy is to add a new function to xapi's startup sequence in xapi.ml, or to the periodic scheduler init file, e.g. with a very low initial periodic, or as a oneshot action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being registered with the periodic scheduler already.

The point is lazy is being used here to prevent evaluation at module init time, so it's forced once the periodic scheduler schedules it for the first time - but then the rest of the code is laboured by manipulating a lazy value. When, instead, it can factor the laziness out, implemented in terms of non-lazy values. There just need to be an outer function that filters the list initially and returns the relevant closure, capturing the filtered list. Then, the presence of this function would be what is considered lazy, within the scheduler ((_ -> _) option ref), with its value coming from the first time the task is scheduled.

That is my understanding, anyway - which is largely based on the first commits before the introduction of lazy. You can avoid evaluating it at module init time in the usual fashion: a function that returns a function, not a function that relies on lazy data that it forces on first call.

@last-genius last-genius force-pushed the private/asultanov/CP-52524 branch from 3c0b352 to e5fd26c Compare November 22, 2024 09:22
Andrii Sultanov added 3 commits November 22, 2024 09:23
Completes the 15+ years old TODO, at the expense of removing an ultimate
example of a "not invented here" attitude.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Issue an alert about a broken host kernel if bits 4, 5, 7, 9, or 14 are set in
/proc/sys/kernel/tainted, indicating some kind of error was encountered and the
future behaviour of the kernel might not be predictable or safe anymore (though
it generally should reasonably recover).

Only one alert per tainted bit per boot can be present (more than one can be
issued, if the user dismissed the alerts and restarted the toolstack).

Distinguish between Major (4,5,7 - these are all things that might cause a
host crash, but are unlikely to corrupt whatever data has been written out) and
Warning (9, 14 - might be a concern and could be raised to Support but usually
are not severe enough to crash the host) levels of errors as
suggested by the Foundations team.

This should serve as an indicator during issue investigation to look for the
cause of the taint.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
The manual notes that `Lazy.from_fun` "should only be used if the function f
is already defined. In particular it is always less efficient to write
`from_fun (fun () -> expr)` than `lazy expr`.

So, replace `Lazy.from_fun` with `lazy` in this particular case

Compare the lambda dump for `lazy (fun () -> [| 1;2;3 |] |> Array.map (fun x -> x+1))`:
```
(seq
  (let
    (l/269 =
       (function param/319[int]
         (apply (field_imm 12 (global Stdlib__Array!))
           (function x/318[int] : int (+ x/318 1)) (makearray[int] 1 2 3))))
    (setfield_ptr(root-init) 0 (global Main!) l/269))
  0)
```

with the lambda dump of the `let x = Lazy.from_fun (fun () -> [| 1;2;3 |] |>
Array.map (fun x -> x+1))`:
```
(seq
  (let
    (x/269 =
       (apply (field_imm 5 (global Stdlib__Lazy!))
         (function param/332[int]
           (apply (field_imm 12 (global Stdlib__Array!))
             (function x/331[int] : int (+ x/331 1)) (makearray[int] 1 2 3)))))
    (setfield_ptr(root-init) 0 (global Main!) x/269))
  0)
```

See: https://patricoferris.github.io/js_of_ocamlopt/#code=bGV0IGwgPSBsYXp5IChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQoKbGV0IHggPSBMYXp5LmZyb21fZnVuIChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQ%3D%3D

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
@last-genius last-genius force-pushed the private/asultanov/CP-52524 branch from e5fd26c to aaabb6c Compare November 22, 2024 09:23
@last-genius
Copy link
Contributor Author

Moved to evaluating the issued alerts in a lazy value. Also added another commit on top improving another usage of Lazy

@contificate
Copy link
Contributor

I think the logic is more complicated now by bringing lazy into the mix.

My understanding of this is that it will force filtering of the list when it's first scheduled by the periodic scheduler - which is semantically identical to keeping the code as it was, introducing a () parameter to avoid evaluating the outside function, then lazily evaluating that (possibly with a _ option ref of the closure) inside the scheduler, and applying its result every time it's scheduled.

That said, if others are happy with the changes, that's fine.

@last-genius last-genius added this pull request to the merge queue Dec 3, 2024
Merged via the queue into xapi-project:master with commit b782202 Dec 3, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants