-
Notifications
You must be signed in to change notification settings - Fork 292
CP-52524: Generate an alert when various host kernel taints are set #6128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CP-52524: Generate an alert when various host kernel taints are set #6128
Conversation
61ee7d2
to
5c6a022
Compare
Ran a Ring 3 BST+BVT suite with a jobtest verifying that no kernel error alerts are raised in otherwise successful runs, doesn't seem to have been triggered, so this should be good to go alongside a XenRT PR for the jobtest. |
Unixext.string_of_file "/proc/stat" | ||
|> String.trim | ||
|> String.split '\n' | ||
|> List.find (fun s -> String.starts_with ~prefix:"btime" s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once found, the line could be parsed with sscanf
as well.
5c6a022
to
3c0b352
Compare
Implemented the suggestions - now I only read through the alerts once on startup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most recent changes are a lot clearer.
The r := !r |> ..
is a neat syntactic approach to clarifying what's going on there.
ocaml/xapi/xapi_host.ml
Outdated
|> List.filter (fun (_, alert_message) -> | ||
let alert_already_issued_for_this_boot = | ||
Helpers.call_api_functions ~__context (fun rpc session_id -> | ||
Client.Client.Message.get_all_records ~rpc ~session_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performing this call while loading the module seems dangerous, it looks like it might cause a dependency loop.
Does it block until the socket is available? Does the socket depend on this module before it's available? (message forwarding depends on this module)
Using a lazy value, and only filling it up on the first call in an independent thread seems safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much more sensible, done, thank you! (likely the tests were failing for this exact reason)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably could have avoided the pervasive use of Lazy
within the logic by making it a concern of the periodic scheduler, not the logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazy is also not thread safe (and crashed on older OCaml versions), so we have to be very careful how to use it. Another way to avoid lazy is to add a new function to xapi's startup sequence in xapi.ml, or to the periodic scheduler init file, e.g. with a very low initial periodic, or as a oneshot action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being registered with the periodic scheduler already.
The point is lazy is being used here to prevent evaluation at module init time, so it's forced once the periodic scheduler schedules it for the first time - but then the rest of the code is laboured by manipulating a lazy value. When, instead, it can factor the laziness out, implemented in terms of non-lazy values. There just need to be an outer function that filters the list initially and returns the relevant closure, capturing the filtered list. Then, the presence of this function would be what is considered lazy, within the scheduler ((_ -> _) option ref
), with its value coming from the first time the task is scheduled.
That is my understanding, anyway - which is largely based on the first commits before the introduction of lazy. You can avoid evaluating it at module init time in the usual fashion: a function that returns a function, not a function that relies on lazy data that it forces on first call.
3c0b352
to
e5fd26c
Compare
Completes the 15+ years old TODO, at the expense of removing an ultimate example of a "not invented here" attitude. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Issue an alert about a broken host kernel if bits 4, 5, 7, 9, or 14 are set in /proc/sys/kernel/tainted, indicating some kind of error was encountered and the future behaviour of the kernel might not be predictable or safe anymore (though it generally should reasonably recover). Only one alert per tainted bit per boot can be present (more than one can be issued, if the user dismissed the alerts and restarted the toolstack). Distinguish between Major (4,5,7 - these are all things that might cause a host crash, but are unlikely to corrupt whatever data has been written out) and Warning (9, 14 - might be a concern and could be raised to Support but usually are not severe enough to crash the host) levels of errors as suggested by the Foundations team. This should serve as an indicator during issue investigation to look for the cause of the taint. Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
The manual notes that `Lazy.from_fun` "should only be used if the function f is already defined. In particular it is always less efficient to write `from_fun (fun () -> expr)` than `lazy expr`. So, replace `Lazy.from_fun` with `lazy` in this particular case Compare the lambda dump for `lazy (fun () -> [| 1;2;3 |] |> Array.map (fun x -> x+1))`: ``` (seq (let (l/269 = (function param/319[int] (apply (field_imm 12 (global Stdlib__Array!)) (function x/318[int] : int (+ x/318 1)) (makearray[int] 1 2 3)))) (setfield_ptr(root-init) 0 (global Main!) l/269)) 0) ``` with the lambda dump of the `let x = Lazy.from_fun (fun () -> [| 1;2;3 |] |> Array.map (fun x -> x+1))`: ``` (seq (let (x/269 = (apply (field_imm 5 (global Stdlib__Lazy!)) (function param/332[int] (apply (field_imm 12 (global Stdlib__Array!)) (function x/331[int] : int (+ x/331 1)) (makearray[int] 1 2 3))))) (setfield_ptr(root-init) 0 (global Main!) x/269)) 0) ``` See: https://patricoferris.github.io/js_of_ocamlopt/#code=bGV0IGwgPSBsYXp5IChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQoKbGV0IHggPSBMYXp5LmZyb21fZnVuIChmdW4gKCkgLT4gW3wgMTsyOzMgfF0gfD4gQXJyYXkubWFwIChmdW4geCAtPiB4KzEpKQ%3D%3D Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
e5fd26c
to
aaabb6c
Compare
Moved to evaluating the issued alerts in a |
I think the logic is more complicated now by bringing My understanding of this is that it will force filtering of the list when it's first scheduled by the periodic scheduler - which is semantically identical to keeping the code as it was, introducing a That said, if others are happy with the changes, that's fine. |
Issue an alert about a broken host kernel if bits 4, 5, 7, 9, or 14 are set in
/proc/sys/kernel/tainted
, indicating some kind of error was encountered and thefuture behaviour of the kernel might not be predictable or safe anymore (though
it generally should reasonably recover).
Only one alert per tainted bit per boot should be issued.
Distinguish between Major (4,5,7 - these are all things that might cause a
host crash, but are unlikely to corrupt whatever data has been written out) and
Warning (9, 14 - might be a concern and could be raised to Support but usually
are not severe enough to crash the host) levels of errors as
suggested by the Foundations team.
This should serve as an indicator during issue investigation to look for the
cause of the taint.