Skip to content

Introduce JSON-Based Provisioning Health Reporting and Reporting Failures to Wireserver/Azure #188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

peytonr18
Copy link
Contributor

@peytonr18 peytonr18 commented Apr 9, 2025

This PR introduces failure reporting for provisioning and does so using a new JSON endpoint that does not rely on goalstate. The changes ensure that the provisioning status is reported to wireserver or, as a fallback, via KVP. Additionally, this utilizes a new JSON-based endpoint for reporting, rather than the old XML payload.

Key Changes

  • Removal of Legacy XML/Goalstate Logic:

    • The old XML payload format and its associated goalstate logic have been completely removed.
    • The new endpoint accepts JSON and no longer requires goalstate information.
  • New Public APIs: report_ready & report_failure:

    • Both build their own reqwest::Client using time‑outs from config.wireserver.
    • The shared _report helper handles JSON payload construction, headers, and the retry loop under the hood.
  • Integration Changes in Provisioning Flow:

  • Unify Error Handling and Reporting:

    • The existing Error enum has been extended to include the necessary formatting for errors that are reportable to wireserver.
    • All reportable errors, including supporting data, are now mapped to standardized KVP health reports via Error::as_encoded_report.
    • “Unhandled” and unexpected errors are now formatted and reported consistently for observability.
  • Tests and Test Helpers Updated:

    • unittest.rs and functional_tests.rs have both been updated to reflect the new JSON endpoint.

Initially based on #170, but that PR will be closed in favor of this one.
Resolves #58 and #186.

@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from 4961a57 to 8fa233e Compare April 9, 2025 21:25
@peytonr18 peytonr18 requested a review from Copilot April 9, 2025 21:28
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/kvp.rs:257

  • Relying on expect to extract the 'reason' field for a failure health report may cause a panic if the field is missing. Consider handling the missing reason case more gracefully, for example by logging an error or providing a default failure message.
reason.as_deref().expect("Reason must be provided for failure"),

@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from 8fa233e to 00c219d Compare April 9, 2025 21:31
@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from 6cd0680 to f9bcb15 Compare April 9, 2025 22:41
@peytonr18 peytonr18 marked this pull request as ready for review April 9, 2025 22:48
@cjp256
Copy link
Contributor

cjp256 commented Apr 9, 2025

Now that we don't use goalstate, should goalstate.rs be renamed to something like health.rs to better describe what it actually does?

+1

Additionally, should we make some kind of universal API for the agent string? It's done a couple different ways throughout the codebase and I wonder if having one single API would be helpful. See:

+1, though doesn't have to happen in this PR

@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from f54da5d to 80e0fe6 Compare April 21, 2025 21:39
@cjp256 cjp256 requested a review from Copilot April 21, 2025 22:15
Copilot

This comment was marked as outdated.

@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch 3 times, most recently from bb181fe to ac7775a Compare April 22, 2025 18:50
@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch 2 times, most recently from b4d362b to cec5528 Compare April 29, 2025 17:26
status,
delay_secs
);
tokio::time::sleep(Duration::from_secs(delay_secs)).await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces an accounting issue with the total_retry_timeout_secs. The remaining variable is updated from the result of http::post, but in the event that the server gives us a Retry-After header, however long it tells us to wait is not included in the remaining time. One option would be to use Duration::saturating_sub to account for this, but this introduces another (less likely) issue.

In the worst case, the server is buggy and tells us to sleep for MAX_INT or something absurd, in which case this will appear to hang forever (although we will get a helpful log message with the duration, so it would be pretty obvious what happened). What I would recommend doing instead of the while !remaining.is_zero() is to use loop {} and wrap it in a tokio::time::timeout like the http::request function does. This way, no matter what happens, no matter what the server tells you to do, you will stop after total_retry_timeout_secs.

Copy link
Contributor

@cjp256 cjp256 May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peytonr18 what is the typical Retry-After duration?

We don't use it in cloud-init because we're sensitive to excessive delays in provisioning. Given we retry every second (at most) and we know that the load is in no way unreasonable as it's just a host/node-local endpoint.

Copy link
Contributor

@cjp256 cjp256 May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like wireserver responds with Retry-After, even when throttled:

* Connected to 168.63.129.16 (168.63.129.16) port 80
> GET /machine/?comp=goalstate HTTP/1.1
> Host: 168.63.129.16
> User-Agent: curl/8.5.0
> Accept: */*
> x-ms-version:2012-11-30
>
< HTTP/1.1 403 Forbidden
< Content-Type: text/xml; charset=utf-8
< Server: Microsoft-IIS/10.0
< Date: Fri, 02 May 2025 20:28:09 GMT
< Content-Length: 357
<
<?xml version="1.0" encoding="utf-8"?>
<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <Code>RequestRateLimitExceeded</Code>
    <Message>The request was denied because of too many requests by this client. Please retry the request after a few seconds.</Message>
    <Details></Details>
</Error>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove Retry-After support and ensure we log response headers to give us data in the future if wireserver ever uses it and it seems legit. Sleeping one second is very reasonable in this situation.

@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from 0951159 to 8abe2a5 Compare May 14, 2025 18:17
@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch 3 times, most recently from 537070f to 75695ef Compare July 22, 2025 00:29
peytonr18 added 13 commits July 22, 2025 12:05
- Remove legacy goalstate/XML reporting.
- Add report_provisioning_health() that builds its own Client.
- Split KVP builder into build_success_health_report() and build_failure_health_report().
- Update provisioning flow and tests to use new health reporting logic instead of goalstate.
…lth reporting

- Replace panic-based extraction of the reason field in failure health reports with logging an error and providing a default.
- Refactor main() to utilize timeout and retry values defined in config.rs (for both Wireserver and IMDS) rather than hardcoded literal values.
- Remove low‑level report_provisioning_health function and knobs; callers no longer supply state/substatus.
- Split report_provisioning into report_ready() and report_failure() that drive a shared _report helper.
- Pass full Config into health reporting rather than individual timeouts.
- Rename goalstate.rs to health.rs to reflect the removal of goalstate.
… status check, and update tests according to CoPilot suggestions
…nd enums

- Removed Retry-After handling to simplify retry logic and avoid timeout accounting issues
- Introduced ReportableError struct to standardize failure reporting format (parity with KVP)
- Logged response headers for future observability of wireserver behavior
- Replaced raw string states with ProvisioningState and ProvisioningSubStatus enums
…ehavior

- Support structured 'extra' and 'optional_key' fields in tracing-based health reports
  (with 'extra' accepting both JSON and key=value;key2=value2 formats)
- Updated rustdoc to clearly explain supported tracing field formats
- Switched tracing of HTTP headers from info! to debug!
- Added HTTP 500 (Internal Server Error) to the set of retryable status codes
- Added test coverage for structured 'extra' values in KVP reports
…erver parity with cloud-init)

- Major refactor of health reporting to Azure wireserver and KVP pool:
  - Centralized all report-building logic in health.rs, separating error/success/in-progress serialization.
  - Improved error handling and tracing for all report operations; all errors and retries are now surfaced explicitly.
  - Used enums and explicit structs to clarify state and error handling with ProvisioningErrorKind.
  - Modularized code: report formatting, KVP writing, and wireserver posting are cleanly separated.
  - Made all reporting API entrypoints clear (report_ready, report_failure, report_in_progress), with builder-pattern support for error reports.
… update as_encoded_report to accept vm_id, agent, and pps_type params for cleaner separation of concerns. Also addressed a few clippy errors
…to_load_config and provisioning_error constructors for standardized error reporting; fixed spacing not caught by cargo fmt
…plify KVP formatting

- Replace  ReportableError with unified, extensible Error enum for all reportable failures.
- Move all health reporting and KVP encoding logic into Error (as_encoded_report), centralizing error-to-health mapping.
- Refactor health and main error-handling to wrap only known, reportable error variants; treat all others as unhandled exception.
@peytonr18 peytonr18 force-pushed the probertson-new-endpoint branch from 53e22a6 to aafc960 Compare July 24, 2025 16:59
@@ -72,10 +75,181 @@ pub enum Error {
Timeout,
#[error("Failed to update the sshd configuration")]
UpdateSshdConfig,
#[error("Failed to load sshd config: {details}")]
ConfigLoadFailure { details: String },
#[error("Unhandled exception")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[error("Unhandled exception")]
#[error("Unhandled error")]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i think we can remove this one entirely, and for reason() return "unhandled error" for the _ case

}

/// Generates an encoded KVP report string for an unhandled exception.
pub fn unhandled_error_report(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't special case this

/// Implement reportable formatting for `Error` to be used in health reporting.
impl Error {
/// Returns a human-readable summary describing the error variant.
pub fn reason(&self) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm inclined to ask that reason=lower case unless ACRONYM to be consistent with existing reasons used in cloud-init

Copy link
Contributor

@cadejacobson cadejacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks awesome! great work

…move other UnhandledException Logic

- Added UnhandledError enum variant so we can explicitly trigger fallback/catch-all error handling via the  match arm.
- Removed the dedicated unhandled_exception formatting method; now all unknown or unexpected errors use the general case in as_encoded_report.
- Minor syntax and enum construction fixes for consistency.
- Documents intent and provides a consistent way to surface unexpected failures.
///
/// Includes the result, reason, agent, supporting data, and standard fields such as
/// `vm_id`, `timestamp`, and documentation URL.
pub fn as_encoded_report(&self, vm_id: &str, _pps_type: &str) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why passing in _pps_type if you're going to report "None"? Or is this a place holder for future pps support?

/// Constructs a KVP entry representing a successful provisioning event.
pub fn encoded_success_report(
vm_id: &str,
optional_key_value: Option<(&str, &str)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pps_type not needed here if you needed a place holder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally my bad - forgot to pass down the placeholder. I suppose we can do this either way, leave the placeholder for future work or just hardcode it as I did below in the meantime. Let me know which you think is best.

let s = match self {
ProvisioningState::Ready => "Ready",
ProvisioningState::NotReady => "NotReady",
ProvisioningState::InProgress => "InProgress",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm if InProgress is a state supported by Wireserver. I believe InProgress is NotReady with a substatus of Provisioning (while "Failed" is NotReady with a substatus of ProvisioningFailed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjp256 I couldn't find any reference to this in the documentation other than that InProgress should just be NotReady. Do you know for sure one way or the other?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFE] Report failures to Azure when there's an unrecoverable error
6 participants