Introduce JSON-Based Provisioning Health Reporting and Reporting Failures to Wireserver/Azure #188

peytonr18 · 2025-04-09T21:10:38Z

This PR introduces failure reporting for provisioning and does so using a new JSON endpoint that does not rely on goalstate. The changes ensure that the provisioning status is reported to wireserver or, as a fallback, via KVP. Additionally, this utilizes a new JSON-based endpoint for reporting, rather than the old XML payload.

Key Changes

Removal of Legacy XML/Goalstate Logic:
- The old XML payload format and its associated goalstate logic have been completely removed.
- The new endpoint accepts JSON and no longer requires goalstate information.
New Public APIs: report_ready & report_failure:
- Both build their own reqwest::Client using time‑outs from config.wireserver.
- The shared _report helper handles JSON payload construction, headers, and the retry loop under the hood.
Integration Changes in Provisioning Flow:
- In main(), we now call report_ready() or report_failure() to send the health status on success or failure.
- Early failure paths (e.g., a configuration load error) now use the new JSON health reporting rather than an XML blob (as was the case in Report failures to Azure via Wireserver with KVP Fallback #170).
Unify Error Handling and Reporting:
- The existing Error enum has been extended to include the necessary formatting for errors that are reportable to wireserver.
- All reportable errors, including supporting data, are now mapped to standardized KVP health reports via Error::as_encoded_report.
- “Unhandled” and unexpected errors are now formatted and reported consistently for observability.
Tests and Test Helpers Updated:
- unittest.rs and functional_tests.rs have both been updated to reflect the new JSON endpoint.

Initially based on #170, but that PR will be closed in favor of this one.
Resolves #58 and #186.

Copilot

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/kvp.rs:257

Relying on expect to extract the 'reason' field for a failure health report may cause a panic if the field is missing. Consider handling the missing reason case more gracefully, for example by logging an error or providing a default failure message.

reason.as_deref().expect("Reason must be provided for failure"),

src/main.rs

cjp256 · 2025-04-09T23:13:09Z

Now that we don't use goalstate, should goalstate.rs be renamed to something like health.rs to better describe what it actually does?

+1

Additionally, should we make some kind of universal API for the agent string? It's done a couple different ways throughout the codebase and I wonder if having one single API would be helpful. See:

+1, though doesn't have to happen in this PR

libazureinit/src/config.rs

libazureinit/src/goalstate.rs

src/main.rs

libazureinit/src/health.rs

src/kvp.rs

libazureinit/src/health.rs

jeremycline · 2025-04-30T16:27:46Z

libazureinit/src/health.rs

+                        status,
+                        delay_secs
+                    );
+                    tokio::time::sleep(Duration::from_secs(delay_secs)).await;


This introduces an accounting issue with the total_retry_timeout_secs. The remaining variable is updated from the result of http::post, but in the event that the server gives us a Retry-After header, however long it tells us to wait is not included in the remaining time. One option would be to use Duration::saturating_sub to account for this, but this introduces another (less likely) issue.

In the worst case, the server is buggy and tells us to sleep for MAX_INT or something absurd, in which case this will appear to hang forever (although we will get a helpful log message with the duration, so it would be pretty obvious what happened). What I would recommend doing instead of the while !remaining.is_zero() is to use loop {} and wrap it in a tokio::time::timeout like the http::request function does. This way, no matter what happens, no matter what the server tells you to do, you will stop after total_retry_timeout_secs.

@peytonr18 what is the typical Retry-After duration?

We don't use it in cloud-init because we're sensitive to excessive delays in provisioning. Given we retry every second (at most) and we know that the load is in no way unreasonable as it's just a host/node-local endpoint.

Doesn't look like wireserver responds with Retry-After, even when throttled:

* Connected to 168.63.129.16 (168.63.129.16) port 80 > GET /machine/?comp=goalstate HTTP/1.1 > Host: 168.63.129.16 > User-Agent: curl/8.5.0 > Accept: */* > x-ms-version:2012-11-30 > < HTTP/1.1 403 Forbidden < Content-Type: text/xml; charset=utf-8 < Server: Microsoft-IIS/10.0 < Date: Fri, 02 May 2025 20:28:09 GMT < Content-Length: 357 < <?xml version="1.0" encoding="utf-8"?> <Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <Code>RequestRateLimitExceeded</Code> <Message>The request was denied because of too many requests by this client. Please retry the request after a few seconds.</Message> <Details></Details> </Error>

I would remove Retry-After support and ensure we log response headers to give us data in the future if wireserver ever uses it and it seems legit. Sleeping one second is very reasonable in this situation.

libazureinit/src/health.rs

libazureinit/src/http.rs

- Remove legacy goalstate/XML reporting. - Add report_provisioning_health() that builds its own Client. - Split KVP builder into build_success_health_report() and build_failure_health_report(). - Update provisioning flow and tests to use new health reporting logic instead of goalstate.

…lth reporting - Replace panic-based extraction of the reason field in failure health reports with logging an error and providing a default. - Refactor main() to utilize timeout and retry values defined in config.rs (for both Wireserver and IMDS) rather than hardcoded literal values.

- Remove low‑level report_provisioning_health function and knobs; callers no longer supply state/substatus. - Split report_provisioning into report_ready() and report_failure() that drive a shared _report helper. - Pass full Config into health reporting rather than individual timeouts. - Rename goalstate.rs to health.rs to reflect the removal of goalstate.

… status check, and update tests according to CoPilot suggestions

…ther minor formatting changes

…he status manually

…nd enums - Removed Retry-After handling to simplify retry logic and avoid timeout accounting issues - Introduced ReportableError struct to standardize failure reporting format (parity with KVP) - Logged response headers for future observability of wireserver behavior - Replaced raw string states with ProvisioningState and ProvisioningSubStatus enums

…ehavior - Support structured 'extra' and 'optional_key' fields in tracing-based health reports (with 'extra' accepting both JSON and key=value;key2=value2 formats) - Updated rustdoc to clearly explain supported tracing field formats - Switched tracing of HTTP headers from info! to debug! - Added HTTP 500 (Internal Server Error) to the set of retryable status codes - Added test coverage for structured 'extra' values in KVP reports

…erver parity with cloud-init) - Major refactor of health reporting to Azure wireserver and KVP pool: - Centralized all report-building logic in health.rs, separating error/success/in-progress serialization. - Improved error handling and tracing for all report operations; all errors and retries are now surfaced explicitly. - Used enums and explicit structs to clarify state and error handling with ProvisioningErrorKind. - Modularized code: report formatting, KVP writing, and wireserver posting are cleanly separated. - Made all reporting API entrypoints clear (report_ready, report_failure, report_in_progress), with builder-pattern support for error reports.

… update as_encoded_report to accept vm_id, agent, and pps_type params for cleaner separation of concerns. Also addressed a few clippy errors

…to_load_config and provisioning_error constructors for standardized error reporting; fixed spacing not caught by cargo fmt

src/main.rs

libazureinit/src/health.rs

…plify KVP formatting - Replace ReportableError with unified, extensible Error enum for all reportable failures. - Move all health reporting and KVP encoding logic into Error (as_encoded_report), centralizing error-to-health mapping. - Refactor health and main error-handling to wrap only known, reportable error variants; treat all others as unhandled exception.

cjp256 · 2025-07-24T17:04:07Z

libazureinit/src/error.rs

@@ -72,10 +75,181 @@ pub enum Error {
    Timeout,
    #[error("Failed to update the sshd configuration")]
    UpdateSshdConfig,
+    #[error("Failed to load sshd config: {details}")]
+    ConfigLoadFailure { details: String },
+    #[error("Unhandled exception")]


Suggested change

#[error("Unhandled exception")]

#[error("Unhandled error")]

actually i think we can remove this one entirely, and for reason() return "unhandled error" for the _ case

cjp256 · 2025-07-24T17:08:20Z

libazureinit/src/error.rs

+    }
+
+    /// Generates an encoded KVP report string for an unhandled exception.
+    pub fn unhandled_error_report(


don't special case this

libazureinit/src/error.rs

cjp256 · 2025-07-24T17:10:38Z

libazureinit/src/error.rs

+/// Implement reportable formatting for `Error` to be used in health reporting.
+impl Error {
+    /// Returns a human-readable summary describing the error variant.
+    pub fn reason(&self) -> String {


i'm inclined to ask that reason=lower case unless ACRONYM to be consistent with existing reasons used in cloud-init

cadejacobson

this looks awesome! great work

libazureinit/src/health.rs

…move other UnhandledException Logic - Added UnhandledError enum variant so we can explicitly trigger fallback/catch-all error handling via the match arm. - Removed the dedicated unhandled_exception formatting method; now all unknown or unexpected errors use the general case in as_encoded_report. - Minor syntax and enum construction fixes for consistency. - Documents intent and provides a consistent way to surface unexpected failures.

anhvoms · 2025-07-25T15:29:00Z

libazureinit/src/error.rs

+    ///
+    /// Includes the result, reason, agent, supporting data, and standard fields such as
+    /// `vm_id`, `timestamp`, and documentation URL.
+    pub fn as_encoded_report(&self, vm_id: &str, _pps_type: &str) -> String {


why passing in _pps_type if you're going to report "None"? Or is this a place holder for future pps support?

anhvoms · 2025-07-25T15:31:28Z

libazureinit/src/health.rs

+/// Constructs a KVP entry representing a successful provisioning event.
+pub fn encoded_success_report(
+    vm_id: &str,
+    optional_key_value: Option<(&str, &str)>,


why pps_type not needed here if you needed a place holder?

Totally my bad - forgot to pass down the placeholder. I suppose we can do this either way, leave the placeholder for future work or just hardcode it as I did below in the meantime. Let me know which you think is best.

anhvoms · 2025-07-25T15:45:09Z

libazureinit/src/health.rs

+        let s = match self {
+            ProvisioningState::Ready => "Ready",
+            ProvisioningState::NotReady => "NotReady",
+            ProvisioningState::InProgress => "InProgress",


Please confirm if InProgress is a state supported by Wireserver. I believe InProgress is NotReady with a substatus of Provisioning (while "Failed" is NotReady with a substatus of ProvisioningFailed)

@cjp256 I couldn't find any reference to this in the documentation other than that InProgress should just be NotReady. Do you know for sure one way or the other?

peytonr18 force-pushed the probertson-new-endpoint branch from 4961a57 to 8fa233e Compare April 9, 2025 21:25

peytonr18 requested a review from Copilot April 9, 2025 21:28

Copilot AI reviewed Apr 9, 2025

View reviewed changes

src/main.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch from 8fa233e to 00c219d Compare April 9, 2025 21:31

peytonr18 mentioned this pull request Apr 9, 2025

Adding runcmd and write_files functionality #187

Closed

peytonr18 force-pushed the probertson-new-endpoint branch from 6cd0680 to f9bcb15 Compare April 9, 2025 22:41

peytonr18 marked this pull request as ready for review April 9, 2025 22:48

cjp256 reviewed Apr 10, 2025

View reviewed changes

libazureinit/src/config.rs Outdated Show resolved Hide resolved

cjp256 reviewed Apr 10, 2025

View reviewed changes

libazureinit/src/goalstate.rs Outdated Show resolved Hide resolved

cjp256 reviewed Apr 10, 2025

View reviewed changes

src/main.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch from f54da5d to 80e0fe6 Compare April 21, 2025 21:39

cjp256 requested a review from Copilot April 21, 2025 22:15

This comment was marked as outdated.

Sign in to view

cjp256 reviewed Apr 21, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

cjp256 reviewed Apr 21, 2025

View reviewed changes

libazureinit/src/health.rs Show resolved Hide resolved

cjp256 reviewed Apr 21, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch 3 times, most recently from bb181fe to ac7775a Compare April 22, 2025 18:50

anhvoms requested changes Apr 22, 2025

View reviewed changes

This was referenced Apr 23, 2025

Report PPS Type in KVP / Azure Heath Reports #192

Open

[RFE] Refactor main.rs to remove exposed goalstate sections and use configurable IMDS timeout and retry settings #141

Open

cjp256 reviewed Apr 25, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch 2 times, most recently from b4d362b to cec5528 Compare April 29, 2025 17:26

anhvoms requested changes Apr 29, 2025

View reviewed changes

libazureinit/src/health.rs Show resolved Hide resolved

jeremycline requested changes Apr 30, 2025

View reviewed changes

cjp256 reviewed May 2, 2025

View reviewed changes

libazureinit/src/http.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch from 0951159 to 8abe2a5 Compare May 14, 2025 18:17

peytonr18 force-pushed the probertson-new-endpoint branch 3 times, most recently from 537070f to 75695ef Compare July 22, 2025 00:29

peytonr18 added 13 commits July 22, 2025 12:05

Merging in squashed version of probertson-report-failure

296ef1e

Add report_in_progress API, log every HTTP response and simplify HTTP…

37fbb88

… status check, and update tests according to CoPilot suggestions

Improve health report log clarity by including the report state and o…

40046bc

…ther minor formatting changes

Using is_success and RetryAfter logic from Rust instead of checking t…

b9dca08

…he status manually

Updating report_failure to match kvp and unify formatting

07ba79b

Add specialized ReportableError constructors with supporting data and…

a53b4e8

… update as_encoded_report to accept vm_id, agent, and pps_type params for cleaner separation of concerns. Also addressed a few clippy errors

Refactor report_failure to take ReportableError directly; add failed_…

788188e

…to_load_config and provisioning_error constructors for standardized error reporting; fixed spacing not caught by cargo fmt

peytonr18 force-pushed the probertson-new-endpoint branch from bfec860 to 788188e Compare July 22, 2025 19:11

peytonr18 mentioned this pull request Jul 22, 2025

Move kvp.rs into libazureinit #216

Draft

cjp256 reviewed Jul 22, 2025

View reviewed changes

src/main.rs Outdated Show resolved Hide resolved

cjp256 reviewed Jul 22, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

cjp256 reviewed Jul 22, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

peytonr18 force-pushed the probertson-new-endpoint branch from 53e22a6 to aafc960 Compare July 24, 2025 16:59

cjp256 reviewed Jul 24, 2025

View reviewed changes

libazureinit/src/error.rs Outdated Show resolved Hide resolved

cjp256 reviewed Jul 24, 2025

View reviewed changes

cadejacobson reviewed Jul 24, 2025

View reviewed changes

libazureinit/src/health.rs Outdated Show resolved Hide resolved

anhvoms requested changes Jul 25, 2025

View reviewed changes

Introduce JSON-Based Provisioning Health Reporting and Reporting Failures to Wireserver/Azure #188

Are you sure you want to change the base?

Introduce JSON-Based Provisioning Health Reporting and Reporting Failures to Wireserver/Azure #188

Uh oh!

Conversation

peytonr18 commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cjp256 commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cjp256 May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cjp256 May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cadejacobson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

peytonr18 commented Apr 9, 2025 •

edited

Loading

cjp256 commented Apr 9, 2025 •

edited

Loading

cjp256 May 1, 2025 •

edited

Loading

cjp256 May 2, 2025 •

edited

Loading