OADP-5967-prometheus support #95469

Open
wants to merge 4 commits into
base: main
2 changes: 1 addition & 1 deletion modules/oadp-creating-service-monitor.adoc
@@ -64,7 +64,7 @@ servicemonitor.monitoring.coreos.com/oadp-service-monitor created

.Verification

* Confirm that the new service monitor is in an *Up* state by using the *Administrator* perspective of the {product-title} web console:
* Confirm that the new service monitor is in an *Up* state by using the *Administrator* perspective of the {product-title} web console. It takes a few minutes for the service monitor to be in the *Up* state.
Review comment from a contributor:

Best to define "it" as it's a vague referent. How about:

Wait a few minutes for the service monitor to reach the Up state.

.. Navigate to the *Observe* -> *Targets* page.
.. Ensure the *Filter* is unselected or that the *User* source is selected and type `openshift-adp` in the `Text` search field.
.. Verify that the *Status* for the service monitor is *Up*.
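
The same check can be scripted instead of clicked through. The following is a minimal sketch, assuming you already have the parsed JSON from the Prometheus HTTP API `/api/v1/targets` endpoint and that the scraped targets carry a `namespace` label (typical for service monitors, but an assumption here). The helper name `targets_up`, the sample payload, and the address in it are hypothetical and only illustrate the shape of the data.

```python
def targets_up(payload: dict, namespace: str) -> dict:
    """Map each active target's scrape URL in `namespace` to True if its health is "up"."""
    result = {}
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("namespace") != namespace:
            continue  # skip targets that belong to other namespaces
        url = target.get("scrapeUrl", "<unknown>")
        result[url] = target.get("health") == "up"
    return result


# Hypothetical, trimmed response body; a real response contains more fields.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"namespace": "openshift-adp"},
                "scrapeUrl": "http://10.0.0.1:8085/metrics",  # made-up address
                "health": "up",
            },
            {
                "labels": {"namespace": "some-other-namespace"},
                "scrapeUrl": "http://10.0.0.2:9090/metrics",  # made-up address
                "health": "down",
            },
        ]
    },
}

print(targets_up(sample, "openshift-adp"))
```

A target that was just created can report a health other than `up` for the first few minutes, so an empty or `False` result right after creating the service monitor is not necessarily an error.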
199 changes: 34 additions & 165 deletions modules/oadp-list-of-metrics.adoc
@@ -6,173 +6,42 @@
[id="list-of-metrics_{context}"]
= List of available metrics

These are the list of metrics provided by the OADP together with their https://prometheus.io/docs/concepts/metric_types/[Types].
Refer to the following table for a list of `Velero` metrics provided by {oadp-short} together with their https://prometheus.io/docs/concepts/metric_types/[Types]:

.Velero metrics
|===
|Metric name |Description |Type

|`kopia_content_cache_hit_bytes`
|Number of bytes retrieved from the cache
|Counter
| `velero_backup_tarball_size_bytes` | Size, in bytes, of a backup | Gauge
| `velero_backup_total` | Current number of existent backups | Gauge
| `velero_backup_attempt_total` | Total number of attempted backups | Counter
| `velero_backup_success_total` | Total number of successful backups | Counter
| `velero_backup_partial_failure_total` | Total number of partially failed backups | Counter
| `velero_backup_failure_total` | Total number of failed backups | Counter
| `velero_backup_validation_failure_total` | Total number of validation failed backups | Counter
| `velero_backup_duration_seconds` | Time taken to complete backup, in seconds | Histogram
| `velero_backup_duration_seconds_bucket` | Total count of observations for a bucket in the histogram for the metric `velero_backup_duration_seconds` | Counter
| `velero_backup_duration_seconds_count` | Total count of observations for the metric `velero_backup_duration_seconds` | Counter
| `velero_backup_duration_seconds_sum` | Total sum of observations for the metric `velero_backup_duration_seconds` | Counter
| `velero_backup_deletion_attempt_total` | Total number of attempted backup deletions | Counter
| `velero_backup_deletion_success_total` | Total number of successful backup deletions | Counter
| `velero_backup_deletion_failure_total` | Total number of failed backup deletions | Counter
| `velero_backup_last_successful_timestamp` | Last time a backup ran successfully, Unix timestamp in seconds | Gauge
| `velero_backup_items_total` | Total number of items backed up | Gauge
| `velero_backup_items_errors` | Total number of errors encountered during backup | Gauge
| `velero_backup_warning_total` | Total number of warned backups | Counter
| `velero_backup_last_status` | Last status of the backup. A value of 1 is success, 0 is failure | Gauge
| `velero_restore_total` | Current number of existent restores | Gauge
| `velero_restore_attempt_total` | Total number of attempted restores | Counter
| `velero_restore_validation_failed_total` | Total number of failed restores failing validations | Counter
| `velero_restore_success_total` | Total number of successful restores | Counter
| `velero_restore_partial_failure_total` | Total number of partially failed restores | Counter
| `velero_restore_failed_total` | Total number of failed restores | Counter
| `velero_volume_snapshot_attempt_total``| Total number of attempted volume snapshots | Counter
Review comment from a contributor:

The `velero_volume_snapshot_attempt_total` entry has an extra backtick.

| `velero_volume_snapshot_success_total` | Total number of successful volume snapshots | Counter
| `velero_volume_snapshot_failure_total` | Total number of failed volume snapshots | Counter
| `velero_csi_snapshot_attempt_total` | Total number of CSI attempted volume snapshots | Counter
| `velero_csi_snapshot_success_total` | Total number of CSI successful volume snapshots | Counter
| `velero_csi_snapshot_failure_total` | Total number of CSI failed volume snapshots | Counter

|`kopia_content_cache_hit_count`
|Number of times content was retrieved from the cache
|Counter

|`kopia_content_cache_malformed`
|Number of times malformed content was read from the cache
|Counter

|`kopia_content_cache_miss_count`
|Number of times content was not found in the cache and fetched
|Counter

|`kopia_content_cache_missed_bytes`
|Number of bytes retrieved from the underlying storage
|Counter

|`kopia_content_cache_miss_error_count`
|Number of times content could not be found in the underlying storage
|Counter

|`kopia_content_cache_store_error_count`
|Number of times content could not be saved in the cache
|Counter

|`kopia_content_get_bytes`
|Number of bytes retrieved using `GetContent()`
|Counter

|`kopia_content_get_count`
|Number of times `GetContent()` was called
|Counter

|`kopia_content_get_error_count`
|Number of times `GetContent()` was called and the result was an error
|Counter

|`kopia_content_get_not_found_count`
|Number of times `GetContent()` was called and the result was not found
|Counter

|`kopia_content_write_bytes`
|Number of bytes passed to `WriteContent()`
|Counter

|`kopia_content_write_count`
|Number of times `WriteContent()` was called
|Counter

|`velero_backup_attempt_total`
|Total number of attempted backups
|Counter

|`velero_backup_deletion_attempt_total`
|Total number of attempted backup deletions
|Counter

|`velero_backup_deletion_failure_total`
|Total number of failed backup deletions
|Counter

|`velero_backup_deletion_success_total`
|Total number of successful backup deletions
|Counter

|`velero_backup_duration_seconds`
|Time taken to complete backup, in seconds
|Histogram

|`velero_backup_failure_total`
|Total number of failed backups
|Counter

|`velero_backup_items_errors`
|Total number of errors encountered during backup
|Gauge

|`velero_backup_items_total`
|Total number of items backed up
|Gauge

|`velero_backup_last_status`
|Last status of the backup. A value of 1 is success, 0.
|Gauge

|`velero_backup_last_successful_timestamp`
|Last time a backup ran successfully, Unix timestamp in seconds
|Gauge

|`velero_backup_partial_failure_total`
|Total number of partially failed backups
|Counter

|`velero_backup_success_total`
|Total number of successful backups
|Counter

|`velero_backup_tarball_size_bytes`
|Size, in bytes, of a backup
|Gauge

|`velero_backup_total`
|Current number of existent backups
|Gauge

|`velero_backup_validation_failure_total`
|Total number of validation failed backups
|Counter

|`velero_backup_warning_total`
|Total number of warned backups
|Counter

|`velero_csi_snapshot_attempt_total`
|Total number of CSI attempted volume snapshots
|Counter

|`velero_csi_snapshot_failure_total`
|Total number of CSI failed volume snapshots
|Counter

|`velero_csi_snapshot_success_total`
|Total number of CSI successful volume snapshots
|Counter

|`velero_restore_attempt_total`
|Total number of attempted restores
|Counter

|`velero_restore_failed_total`
|Total number of failed restores
|Counter

|`velero_restore_partial_failure_total`
|Total number of partially failed restores
|Counter

|`velero_restore_success_total`
|Total number of successful restores
|Counter

|`velero_restore_total`
|Current number of existent restores
|Gauge

|`velero_restore_validation_failed_total`
|Total number of failed restores failing validations
|Counter

|`velero_volume_snapshot_attempt_total`
|Total number of attempted volume snapshots
|Counter

|`velero_volume_snapshot_failure_total`
|Total number of failed volume snapshots
|Counter

|`velero_volume_snapshot_success_total`
|Total number of successful volume snapshots
|Counter

|===
|===
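
The three counters derived from `velero_backup_duration_seconds` (`_bucket`, `_count`, and `_sum`) are easier to read with a worked example. The following is a minimal sketch, assuming cumulative buckets keyed by their upper bound as in a standard Prometheus histogram. The bucket boundaries and sample values are made up, the function names are hypothetical, and the quantile estimate is a simplified linear interpolation in the spirit of PromQL's `histogram_quantile`, not Prometheus's exact implementation.

```python
import math


def mean_duration(duration_sum: float, duration_count: float) -> float:
    """Average backup duration: the `_sum` counter divided by the `_count` counter."""
    if duration_count == 0:
        return math.nan
    return duration_sum / duration_count


def estimate_quantile(buckets, q: float) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    whose count equals the total number of observations.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample: 10 backups in total, with hypothetical bucket bounds in seconds.
buckets = [(15.0, 2), (30.0, 6), (60.0, 9), (math.inf, 10)]

print(mean_duration(300.0, 10))         # _sum / _count
print(estimate_quantile(buckets, 0.5))  # estimated median backup duration
```

With these sample values, half of the 10 observations corresponds to rank 5, which falls in the bucket between 15 and 30 seconds, so the estimate is interpolated inside that bucket.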
9 changes: 8 additions & 1 deletion modules/oadp-monitoring-setup.adoc
@@ -12,6 +12,13 @@ With enabled User Workload Monitoring, it is possible to configure and use any P

Monitoring metrics requires enabling monitoring for the user-defined projects and creating a `ServiceMonitor` resource to scrape those metrics from the already enabled OADP service endpoint that resides in the `openshift-adp` namespace.

[NOTE]
====
The {oadp-short} support for Prometheus metrics is offered on a best-effort basis and is not fully supported.
====

For more information about setting up the monitoring stack, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{product-version}/html/monitoring/configuring-user-workload-monitoring[Configuring user workload monitoring].

.Prerequisites

* You have access to an {product-title} cluster using an account with `cluster-admin` permissions.
@@ -31,10 +38,10 @@ $ oc edit configmap cluster-monitoring-config -n openshift-monitoring
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
data:
  config.yaml: |
    enableUserWorkload: true <1>
kind: ConfigMap
metadata:
# ...
----