Merged
Commits
36 commits
8387728
Added ports for the sidecars to allow prometheus to scrape the metrics
prajwalvathreya Sep 23, 2024
0c39d75
Fixed error in linode-csi-plugin container due to incorrect metrics port
prajwalvathreya Sep 23, 2024
68b9b98
Added documentation and example graphs for metrics in the csi-driver.
prajwalvathreya Sep 24, 2024
361c077
updated graphs and documentation.
prajwalvathreya Sep 25, 2024
0a4609e
added line break after future scope to keep the doc consistent
prajwalvathreya Sep 25, 2024
4df77e3
added additional node metrics
prajwalvathreya Sep 25, 2024
06d04b3
added clarification on the unit of measurement of time
prajwalvathreya Sep 25, 2024
fcf133b
fixed typo
prajwalvathreya Sep 25, 2024
aedc721
Moved metrics-documentation.md and example-images folder to the docs …
prajwalvathreya Sep 25, 2024
1db1a39
Merge branch 'refs/heads/main' into metrics-endpoint
prajwalvathreya Sep 26, 2024
28691b8
- Created make target for creating a grafana-dashboard
prajwalvathreya Sep 30, 2024
7f133cc
- created services to expose metrics to prometheus
prajwalvathreya Sep 30, 2024
cb43eff
- updated install script to run process in the background
prajwalvathreya Oct 1, 2024
4b09fbf
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 1, 2024
c10ee59
- fixed conflict in Makefile
prajwalvathreya Oct 1, 2024
960c574
Update hack/install-monitoring-tools.sh
prajwalvathreya Oct 1, 2024
016edfa
Updated the syntax of passing the CLUSTER_NAME variable
prajwalvathreya Oct 1, 2024
efeea6e
Update hack/install-monitoring-tools.sh, namespace creation
prajwalvathreya Oct 1, 2024
9370d03
Update hack/install-monitoring-tools.sh Grafana helm chart update
prajwalvathreya Oct 1, 2024
5d92fbb
Update hack/install-monitoring-tools.sh Prometheus helm chart update
prajwalvathreya Oct 1, 2024
c0a1c56
- added environment variables for username, password, data retention …
prajwalvathreya Oct 1, 2024
069b55d
- removed echo used for debugging
prajwalvathreya Oct 1, 2024
802245f
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 2, 2024
9bc3beb
- updated the script to 3 make targets
prajwalvathreya Oct 3, 2024
c120fa7
- updated templates to opt in to install using helm
prajwalvathreya Oct 4, 2024
3276b46
- fixed container port mapping which was causing containers to crash …
prajwalvathreya Oct 4, 2024
9e18c05
- resolving Makefile conflict
prajwalvathreya Oct 7, 2024
ceac735
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 7, 2024
2977e89
- updated to helm chart to expose drivers based on passed flag `enabl…
prajwalvathreya Oct 7, 2024
9380f5d
- updated documentation to explain how to use the helm chart to enabl…
prajwalvathreya Oct 7, 2024
6fe3e49
Merge branch 'main' into metrics-endpoint
prajwalvathreya Oct 7, 2024
9d97d6e
- made changes to install metrics services through helm chart
prajwalvathreya Oct 8, 2024
79750ca
- reverted csi-driver image to latest
prajwalvathreya Oct 8, 2024
07c63df
- updated documentation to explain modifications to make targets
prajwalvathreya Oct 8, 2024
6ac52d6
- updated comment to a more sensible one
prajwalvathreya Oct 8, 2024
62e806d
- updated documentation to be less verbose
prajwalvathreya Oct 8, 2024
28 changes: 28 additions & 0 deletions Makefile
@@ -65,6 +65,10 @@ HELM_VERSION ?= "v0.2.1"
CAPL_VERSION ?= "v0.6.4"
CONTROLPLANE_NODES ?= 1
WORKER_NODES ?= 1
GRAFANA_PORT ?= 3000
GRAFANA_USERNAME ?= admin
GRAFANA_PASSWORD ?= admin
DATA_RETENTION_PERIOD ?= 15d # Prometheus data retention period

.PHONY: build
build:
@@ -185,3 +189,27 @@ release:
	cp ./internal/driver/deploy/releases/linode-blockstorage-csi-driver-$(IMAGE_VERSION).yaml ./$(RELEASE_DIR)
	sed -e 's/appVersion: "latest"/appVersion: "$(IMAGE_VERSION)"/g' ./helm-chart/csi-driver/Chart.yaml
	tar -czvf ./$(RELEASE_DIR)/helm-chart-$(IMAGE_VERSION).tgz -C ./helm-chart/csi-driver .

#####################################################################
# Grafana Dashboard End to End Installation
#####################################################################
.PHONY: grafana-dashboard
grafana-dashboard: install-prometheus install-grafana setup-dashboard

#####################################################################
# Monitoring Tools Installation
#####################################################################
.PHONY: install-prometheus
install-prometheus:
	KUBECONFIG=test-cluster-kubeconfig.yaml DATA_RETENTION_PERIOD=$(DATA_RETENTION_PERIOD) \
	./hack/install-prometheus.sh --timeout=600s

.PHONY: install-grafana
install-grafana:
	KUBECONFIG=test-cluster-kubeconfig.yaml GRAFANA_PORT=$(GRAFANA_PORT) \
	GRAFANA_USERNAME=$(GRAFANA_USERNAME) GRAFANA_PASSWORD=$(GRAFANA_PASSWORD) \
	./hack/install-grafana.sh --timeout=600s

.PHONY: setup-dashboard
setup-dashboard:
	KUBECONFIG=test-cluster-kubeconfig.yaml ./hack/setup-dashboard.sh --namespace=monitoring --dashboard-file=observability/metrics/dashboard.json
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,8 @@
- [Creating a Development Cluster](docs/development-setup.md#️-creating-a-development-cluster)
- [Running E2E Tests](docs/testing.md)
- [Contributing](docs/contributing.md)
- [Observability](docs/observability.md)
- [Metrics](docs/metrics-documentation.md)
- [License](#license)
- [Disclaimers](#-disclaimers)
- [Community](#-join-us-on-slack)
Binary file added docs/example-images/create-volume-request.jpg
Binary file added docs/example-images/delete-volume-request.jpg
Binary file added docs/example-images/expand-volume-request.jpg
Binary file added docs/example-images/publish-volume-request.jpg
Binary file added docs/example-images/pv.jpg
Binary file added docs/example-images/pvc.jpg
Binary file added docs/example-images/runtime-error.jpg
169 changes: 169 additions & 0 deletions docs/metrics-documentation.md
@@ -0,0 +1,169 @@
## Grafana Dashboard Documentation: **CSI Driver Metrics**

### 1. **Introduction**
This Grafana dashboard provides an in-depth view of the CSI Driver operations for Linode Block Storage, with real-time data on volume creation, deletion, publication, and expansion. It also tracks persistent volume claims and potential runtime errors. The data is sourced from Prometheus, making it ideal for monitoring and diagnosing issues with CSI Driver operations.

### 2. **Dashboard Structure**
The dashboard is divided into several panels. Each panel focuses on a different aspect of CSI Driver operations, including Create, Delete, Expand, Publish, and Unpublish Volume requests, runtime operation errors, and Persistent Volume (PV) and Persistent Volume Claim (PVC) events.

---

### 3. **Key Metrics and Visualizations with Graphs**

---

##### **Key points for reading the graphs**:

- The y-axis is scaled by 1000. Multiply the displayed decimal by 1000 to get the actual number (for example, 1.5 on the axis means 1500).
- Graphs that show total time taken report it in `seconds`.
- The example graphs are plotted over a period of 48 hours, which is why the x-axis shows both date and time.
- The spikes in the example graphs occurred during e2e test runs.

---

#### **Controller Create Volume**

- **Create Volume Requests**
- **Description**: Displays the total number of volume creation requests made to the CSI Driver.
- **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/CreateVolume"}`
- **Graph**:
![Create Volume Request](example-images/create-volume-request.jpg)
- **Explanation**: This graph shows the cumulative number of volume creation requests over time. Sharp increases indicate bursts of provisioning activity.

- **Total Time Taken to Create Volume**
- **Description**: Displays the cumulative time taken to create volumes.
- **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/CreateVolume"}`
- **Graph**:
![Total Time to Create Volume](example-images/tt-create-volume-request.jpg)
- **Explanation**: Tracks the total amount of time spent creating volumes, useful for identifying delays in provisioning.
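
The two Create Volume series can be combined into a mean time per request, which is often easier to read than the raw totals. The following is a minimal sketch in Python, assuming both series are cumulative and that you have one sample of each at the start and end of a window. The function name and the sample values are made up for illustration and are not part of the dashboard.

```python
def mean_seconds_per_request(sum_start, sum_end, count_start, count_end):
    """Mean duration in seconds of the requests completed between two samples.

    Both series only ever grow, so the work done inside a window is the
    difference between its end sample and its start sample.
    """
    requests = count_end - count_start
    if requests <= 0:
        # No new requests in the window (or the counter was reset).
        return None
    return (sum_end - sum_start) / requests


# Made-up samples: 10 new requests that took 15 seconds in total.
print(mean_seconds_per_request(12.0, 27.0, 40, 50))  # 1.5
```

The same calculation applies to the Delete, Expand, Publish, and Unpublish panels, since they use the same pair of series with a different `method_name` label.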

---

#### **Controller Delete Volume**

- **Delete Volume Requests**
- **Description**: Shows the number of requests to delete volumes through the CSI Driver.
- **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/DeleteVolume"}`
- **Graph**:
![Delete Volume Request](example-images/delete-volume-request.jpg)
- **Explanation**: This graph tracks how often volumes are deleted. A consistent increase means regular cleanup of resources.

- **Total Time Taken to Delete Volume**
- **Description**: Tracks the time spent deleting volumes.
- **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/DeleteVolume"}`
- **Graph**:
![Total Time to Delete Volume](example-images/tt-delete-volume-request.jpg)
- **Explanation**: Shows the time taken to delete volumes, highlighting the efficiency of resource cleanup operations.

---

#### **Controller Expand Volume**

- **Expand Volume Requests**
- **Description**: Monitors requests to expand volumes.
- **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
- **Graph**:
![Expand Volume Request](example-images/expand-volume-request.jpg)
- **Explanation**: This graph tracks how frequently volume expansion operations occur.

- **Total Time Taken to Expand Volume**
- **Description**: Displays the cumulative time taken to expand volumes.
- **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerExpandVolume"}`
- **Graph**:
![Total Time to Expand Volume](example-images/tt-expand-volume-request.jpg)
- **Explanation**: Tracks the total time taken to expand volumes.

---

#### **Controller Publish Volume**

- **Publish Volume Requests**
- **Description**: The number of requests made to attach or publish volumes to nodes.
- **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
- **Graph**:
![Publish Volume Request](example-images/publish-volume-request.jpg)
- **Explanation**: This graph tracks how often volumes are published (attached) to nodes, indicating attach operations.

- **Total Time Taken to Publish Volume**
- **Description**: Displays the cumulative time taken to publish (attach) volumes to nodes.
- **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerPublishVolume"}`
- **Graph**:
![Total Time to Publish Volume](example-images/tt-publish-volume-request.jpg)
- **Explanation**: Tracks the total time spent publishing volumes to nodes.

---

#### **Controller Unpublish Volume**

- **Unpublish Volume Requests**
- **Description**: Tracks the number of requests to unpublish volumes.
- **Query**: `csi_sidecar_operations_seconds_count{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
- **Graph**:
![Unpublish Volume Requests](example-images/unpublish-volume-request.jpg)
- **Explanation**: This graph shows how frequently volumes are unpublished (detached) from nodes.

- **Total Time Taken to Unpublish Volume**
- **Description**: Displays the cumulative time taken to unpublish (detach) volumes from nodes.
- **Query**: `csi_sidecar_operations_seconds_sum{method_name="/csi.v1.Controller/ControllerUnpublishVolume"}`
- **Graph**:
![Total Time to Unpublish Volume](example-images/tt-unpublish-volume-request.jpg)
- **Explanation**: Tracks the total time spent unpublishing volumes from nodes.
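
All of the controller panels above follow one pattern: the `_count` or `_sum` series filtered by a `method_name` label. As a rough sketch of that pattern, the Python below regenerates the ten query strings from the method names used in this document. The helper name `panel_queries` is hypothetical and exists only for this example.

```python
# Method names exactly as they appear in the queries above.
CONTROLLER_METHODS = [
    "CreateVolume",
    "DeleteVolume",
    "ControllerExpandVolume",
    "ControllerPublishVolume",
    "ControllerUnpublishVolume",
]


def panel_queries(method):
    """Return the request-count and total-time queries for one CSI method."""
    selector = '{method_name="/csi.v1.Controller/%s"}' % method
    return {
        "count": "csi_sidecar_operations_seconds_count" + selector,
        "sum": "csi_sidecar_operations_seconds_sum" + selector,
    }


for method in CONTROLLER_METHODS:
    queries = panel_queries(method)
    print(queries["count"])
    print(queries["sum"])
```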

---

### 4. **Additional Metrics**

---

#### **Persistent Volumes (PV)**

- **Description**: Displays the total number of PV-related events that the CSI controller processed.
- **Query**: `workqueue_adds_total{name="volumes"}`
- **Graph**:
![Persistent Volumes](example-images/pv.jpg)
- **Explanation**: This graph shows how many PV requests were made, indicating the provisioning of new storage resources.

---

#### **Volume Claims (PVC)**

- **Description**: Tracks the number of PVC-related events that the controller reconciles.
- **Query**: `workqueue_adds_total{name="claims"}`
- **Graph**:
![PVC Events](example-images/pvc.jpg)
- **Explanation**: This graph tracks PVC-related events, providing insights into the frequency of new claims or bindings.

---

#### **Runtime Operation Errors**

- **Description**: Visualizes runtime operation errors reported by the kubelet on the cluster nodes, which can point to node-level problems that affect CSI Driver operations.
- **Query**: `kubelet_runtime_operations_errors_total`
- **Graph**:
![Runtime Operation Errors](example-images/runtime-error.jpg)
- **Explanation**: A rise in runtime errors indicates potential issues within the Kubernetes nodes or the CSI components.

---

#### **CSI Sidecar Operations Seconds Sum**

- **Description**: Shows the cumulative time taken for operations handled by CSI sidecars (attacher, provisioner, etc.).
- **Query**: `csi_sidecar_operations_seconds_sum`
- **Graph**:
![Sidecar Operations Seconds Sum](example-images/sidecar-operations-time-sum.jpg)
- **Explanation**: This graph tracks the total time consumed by all CSI operations, helping identify potential bottlenecks.
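
The unfiltered series returns one sample per label combination, so it can help to group the values by `method_name` when looking for the slowest operation. The sketch below assumes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and its JSON response layout. The server address and the response body are hand-written placeholders, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def seconds_by_method(response_text):
    """Add up the returned sample values per method_name label."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    totals = {}
    for series in payload["data"]["result"]:
        method = series["metric"].get("method_name", "<unknown>")
        _timestamp, value = series["value"]  # the value arrives as a string
        totals[method] = totals.get(method, 0.0) + float(value)
    return totals


# Placeholder address; adjust to wherever Prometheus is reachable.
print(instant_query_url("http://localhost:9090", "csi_sidecar_operations_seconds_sum"))

# Hand-written response with made-up values, shaped like an instant-query result.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"method_name": "/csi.v1.Controller/CreateVolume"}, "value": [0, "1.5"]},
            {"metric": {"method_name": "/csi.v1.Controller/CreateVolume"}, "value": [0, "2.5"]},
            {"metric": {"method_name": "/csi.v1.Controller/DeleteVolume"}, "value": [0, "0.25"]},
        ],
    },
})
print(seconds_by_method(sample))
```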

---

### 5. **Missing Metrics / Future Scope**

---

#### **Volume Utilization Metrics**:
- **Volume Size**: Track the size of volumes currently in use to better understand resource consumption.
- **Potential Implementation**: Metrics could be added to track how much space is being utilized by each volume, ensuring optimal usage and highlighting volumes nearing full capacity.
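
To illustrate what "nearing full capacity" could mean once such metrics exist, here is a small Python sketch. It assumes each volume reports a used and a total size in bytes. The function name, the 90% threshold, and the volume names are all hypothetical and do not describe an existing metric of the driver.

```python
def volumes_near_capacity(volumes, threshold=0.9):
    """Return names of volumes whose used/capacity ratio reaches the threshold.

    `volumes` maps a volume name to a (used_bytes, capacity_bytes) pair.
    """
    flagged = []
    for name, (used_bytes, capacity_bytes) in sorted(volumes.items()):
        if capacity_bytes > 0 and used_bytes / capacity_bytes >= threshold:
            flagged.append(name)
    return flagged


GIB = 1024 ** 3
# Made-up volumes for illustration only.
example = {
    "pvc-a": (19 * GIB, 20 * GIB),  # 95% used
    "pvc-b": (2 * GIB, 10 * GIB),   # 20% used
}
print(volumes_near_capacity(example))  # ['pvc-a']
```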

#### **Node Metrics**:
- **Node Attachments**: Track the total number of volumes attached to each node.
- **Node Publish/Unpublish**: Track how often volumes are published (mounted) and unpublished (unmounted) on nodes, giving better visibility into volume mounting and unmounting operations.
- **Node Stage/Unstage**: Monitor staging and unstaging operations to identify any potential delays or issues when preparing a volume for use on a node.