-
Notifications
You must be signed in to change notification settings - Fork 13
Introduce JSON-Based Provisioning Health Reporting and Reporting Failures to Wireserver/Azure #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4961a57
to
8fa233e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/kvp.rs:257
- Relying on expect to extract the 'reason' field for a failure health report may cause a panic if the field is missing. Consider handling the missing reason case more gracefully, for example by logging an error or providing a default failure message.
reason.as_deref().expect("Reason must be provided for failure"),
8fa233e
to
00c219d
Compare
6cd0680
to
f9bcb15
Compare
+1
+1, though doesn't have to happen in this PR |
f54da5d
to
80e0fe6
Compare
bb181fe
to
ac7775a
Compare
b4d362b
to
cec5528
Compare
libazureinit/src/health.rs
Outdated
status, | ||
delay_secs | ||
); | ||
tokio::time::sleep(Duration::from_secs(delay_secs)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces an accounting issue with the total_retry_timeout_secs
. The remaining
variable is updated from the result of http::post
, but in the event that the server gives us a Retry-After
header, however long it tells us to wait is not included in the remaining
time. One option would be to use Duration::saturating_sub
to account for this, but this introduces another (less likely) issue.
In the worst case, the server is buggy and tells us to sleep for MAX_INT
or something absurd, in which case this will appear to hang forever (although we will get a helpful log message with the duration, so it would be pretty obvious what happened). What I would recommend doing instead of the while !remaining.is_zero()
is to use loop {}
and wrap it in a tokio::time::timeout
like the http::request
function does. This way, no matter what happens, no matter what the server tells you to do, you will stop after total_retry_timeout_secs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peytonr18 what is the typical Retry-After duration?
We don't use it in cloud-init because we're sensitive to excessive delays in provisioning. Given we retry every second (at most) and we know that the load is in no way unreasonable as it's just a host/node-local endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like wireserver responds with Retry-After, even when throttled:
* Connected to 168.63.129.16 (168.63.129.16) port 80
> GET /machine/?comp=goalstate HTTP/1.1
> Host: 168.63.129.16
> User-Agent: curl/8.5.0
> Accept: */*
> x-ms-version:2012-11-30
>
< HTTP/1.1 403 Forbidden
< Content-Type: text/xml; charset=utf-8
< Server: Microsoft-IIS/10.0
< Date: Fri, 02 May 2025 20:28:09 GMT
< Content-Length: 357
<
<?xml version="1.0" encoding="utf-8"?>
<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Code>RequestRateLimitExceeded</Code>
<Message>The request was denied because of too many requests by this client. Please retry the request after a few seconds.</Message>
<Details></Details>
</Error>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove Retry-After support and ensure we log response headers to give us data in the future if wireserver ever uses it and it seems legit. Sleeping one second is very reasonable in this situation.
0951159
to
8abe2a5
Compare
537070f
to
75695ef
Compare
- Remove legacy goalstate/XML reporting. - Add report_provisioning_health() that builds its own Client. - Split KVP builder into build_success_health_report() and build_failure_health_report(). - Update provisioning flow and tests to use new health reporting logic instead of goalstate.
…lth reporting - Replace panic-based extraction of the reason field in failure health reports with logging an error and providing a default. - Refactor main() to utilize timeout and retry values defined in config.rs (for both Wireserver and IMDS) rather than hardcoded literal values.
- Remove low‑level report_provisioning_health function and knobs; callers no longer supply state/substatus. - Split report_provisioning into report_ready() and report_failure() that drive a shared _report helper. - Pass full Config into health reporting rather than individual timeouts. - Rename goalstate.rs to health.rs to reflect the removal of goalstate.
… status check, and update tests according to CoPilot suggestions
…ther minor formatting changes
…he status manually
…nd enums - Removed Retry-After handling to simplify retry logic and avoid timeout accounting issues - Introduced ReportableError struct to standardize failure reporting format (parity with KVP) - Logged response headers for future observability of wireserver behavior - Replaced raw string states with ProvisioningState and ProvisioningSubStatus enums
…ehavior - Support structured 'extra' and 'optional_key' fields in tracing-based health reports (with 'extra' accepting both JSON and key=value;key2=value2 formats) - Updated rustdoc to clearly explain supported tracing field formats - Switched tracing of HTTP headers from info! to debug! - Added HTTP 500 (Internal Server Error) to the set of retryable status codes - Added test coverage for structured 'extra' values in KVP reports
…erver parity with cloud-init) - Major refactor of health reporting to Azure wireserver and KVP pool: - Centralized all report-building logic in health.rs, separating error/success/in-progress serialization. - Improved error handling and tracing for all report operations; all errors and retries are now surfaced explicitly. - Used enums and explicit structs to clarify state and error handling with ProvisioningErrorKind. - Modularized code: report formatting, KVP writing, and wireserver posting are cleanly separated. - Made all reporting API entrypoints clear (report_ready, report_failure, report_in_progress), with builder-pattern support for error reports.
… update as_encoded_report to accept vm_id, agent, and pps_type params for cleaner separation of concerns. Also addressed a few clippy errors
…to_load_config and provisioning_error constructors for standardized error reporting; fixed spacing not caught by cargo fmt
bfec860
to
788188e
Compare
…plify KVP formatting - Replace ReportableError with unified, extensible Error enum for all reportable failures. - Move all health reporting and KVP encoding logic into Error (as_encoded_report), centralizing error-to-health mapping. - Refactor health and main error-handling to wrap only known, reportable error variants; treat all others as unhandled exception.
53e22a6
to
aafc960
Compare
libazureinit/src/error.rs
Outdated
@@ -72,10 +75,181 @@ pub enum Error { | |||
Timeout, | |||
#[error("Failed to update the sshd configuration")] | |||
UpdateSshdConfig, | |||
#[error("Failed to load sshd config: {details}")] | |||
ConfigLoadFailure { details: String }, | |||
#[error("Unhandled exception")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[error("Unhandled exception")] | |
#[error("Unhandled error")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually i think we can remove this one entirely, and for reason() return "unhandled error" for the _
case
libazureinit/src/error.rs
Outdated
} | ||
|
||
/// Generates an encoded KVP report string for an unhandled exception. | ||
pub fn unhandled_error_report( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't special case this
libazureinit/src/error.rs
Outdated
/// Implement reportable formatting for `Error` to be used in health reporting. | ||
impl Error { | ||
/// Returns a human-readable summary describing the error variant. | ||
pub fn reason(&self) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm inclined to ask that reason=lower case unless ACRONYM to be consistent with existing reasons used in cloud-init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks awesome! great work
…move other UnhandledException Logic - Added UnhandledError enum variant so we can explicitly trigger fallback/catch-all error handling via the match arm. - Removed the dedicated unhandled_exception formatting method; now all unknown or unexpected errors use the general case in as_encoded_report. - Minor syntax and enum construction fixes for consistency. - Documents intent and provides a consistent way to surface unexpected failures.
/// | ||
/// Includes the result, reason, agent, supporting data, and standard fields such as | ||
/// `vm_id`, `timestamp`, and documentation URL. | ||
pub fn as_encoded_report(&self, vm_id: &str, _pps_type: &str) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why passing in _pps_type if you're going to report "None"? Or is this a place holder for future pps support?
/// Constructs a KVP entry representing a successful provisioning event. | ||
pub fn encoded_success_report( | ||
vm_id: &str, | ||
optional_key_value: Option<(&str, &str)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why pps_type not needed here if you needed a place holder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally my bad - forgot to pass down the placeholder. I suppose we can do this either way, leave the placeholder for future work or just hardcode it as I did below in the meantime. Let me know which you think is best.
let s = match self { | ||
ProvisioningState::Ready => "Ready", | ||
ProvisioningState::NotReady => "NotReady", | ||
ProvisioningState::InProgress => "InProgress", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm if InProgress is a state supported by Wireserver. I believe InProgress is NotReady with a substatus of Provisioning (while "Failed" is NotReady with a substatus of ProvisioningFailed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjp256 I couldn't find any reference to this in the documentation other than that InProgress should just be NotReady. Do you know for sure one way or the other?
This PR introduces failure reporting for provisioning and does so using a new JSON endpoint that does not rely on goalstate. The changes ensure that the provisioning status is reported to wireserver or, as a fallback, via KVP. Additionally, this utilizes a new JSON-based endpoint for reporting, rather than the old XML payload.
Key Changes
Removal of Legacy XML/Goalstate Logic:
New Public APIs:
report_ready
&report_failure
:_report
helper handles JSON payload construction, headers, and the retry loop under the hood.Integration Changes in Provisioning Flow:
main()
, we now callreport_ready()
orreport_failure()
to send the health status on success or failure.Unify Error Handling and Reporting:
Error
enum has been extended to include the necessary formatting for errors that are reportable to wireserver.Error::as_encoded_report
.Tests and Test Helpers Updated:
unittest.rs
andfunctional_tests.rs
have both been updated to reflect the new JSON endpoint.Initially based on #170, but that PR will be closed in favor of this one.
Resolves #58 and #186.