From 8c41788c509f21f5c6c311d7778ff940d960764e Mon Sep 17 00:00:00 2001 From: Michael Oviedo Date: Wed, 2 Jul 2025 15:40:57 -0700 Subject: [PATCH 1/3] update redline testing documentation Signed-off-by: Michael Oviedo --- _benchmark/reference/commands/redline-test.md | 35 +++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/_benchmark/reference/commands/redline-test.md b/_benchmark/reference/commands/redline-test.md index 2ab228b4651..91d9f49e1a7 100644 --- a/_benchmark/reference/commands/redline-test.md +++ b/_benchmark/reference/commands/redline-test.md @@ -12,7 +12,7 @@ The `--redline-test` command enables OpenSearch Benchmark to automatically deter When the `--redline-test` flag is used, OpenSearch Benchmark performs the following steps: -1. **Client initialization**: OpenSearch Benchmark initializes a large number of clients (default: 1,000). You can override this with `--redline-test=`. +1. **Client initialization**: OpenSearch Benchmark initializes a large number of clients (default: 1,000). You can override this with the optional `--redline-max-clients=` flag. 2. **Feedback mechanism**: OpenSearch Benchmark ramps up the number of active clients. A FeedbackActor monitors real-time request failures and adjusts the client count accordingly. 3. **Shared state coordination**: OpenSearch Benchmark uses Python's multiprocessing library to manage shared dictionaries and queues for inter-process communication: - **Workers** create and share client state maps with the WorkerCoordinatorActor. @@ -63,6 +63,32 @@ opensearch-benchmark execute-test \ ``` {% include copy.html %} +## Latency or CPU-based Feedback +OSB supports a `timeout` value per request, which tells the system to cancel a request if it is taking longer than the specified time. The flag `--client-options=timeout:` allows users to specify this time period. The default value is 10 seconds. 
+ +You can increase or decrease this value to customize the maximum latency tolerated by OSB during a redline test as well. +For example, if you want to find the maximum load your cluster can handle without bringing latency above 15 seconds, setting the timeout value in client options to 15 will achieve this. + +Redline testing now supports CPU-based feedback in addition to request error and latency monitoring. This helps avoid pushing your cluster past its safe utilization limits. + +### Requirements +- A metrics store must be configured for CPU-based feedback. Using an in-memory store will result in an error: +```bash +[ERROR] Cannot execute-test. Error in worker_coordinator (CPU-based feedback requires a metrics store. You are using an in-memory metrics store) +``` +The `--redline-cpu-max-usage` flag is required. This sets the maximum CPU usage percentage (per node) allowed during the test. +The `node-stats` telemetry device is automatically enabled when using this feature. + +### Behavior +The redline CPU feedback loop works as follows: +- The `FeedbackActor` queries the metrics store at regular intervals to retrieve the average CPU usage for each node +- If any node exceeds the CPU usage threshold defined by `--redline-cpu-max-usage`, the system triggers a scale-down. +- After scaling down, the actor waits before attempting to scale up again + +### Optional Tuning Flags +- `--redline-cpu-window-seconds`: Duration (in seconds) over which to average CPU usage per node (default: 30). +- `--redline-cpu-check-interval`: Interval (in seconds) between CPU usage checks (default: 30). + ## Results During a redline test, OpenSearch Benchmark provides detailed logs with scaling decisions and request failures during the test. At the end of a redline test, OpenSearch Benchmark logs the maximum number of clients that your cluster supported without request errors. 
@@ -81,4 +107,9 @@ Use the following options and behaviors to better understand and customize redli - `--redline-scale-step`: Specifies the number of clients to unpause in each scaling iteration. - `--redline-scaledown-percentage`: Specifies the percentage of clients to pause when an error occurs. - `--redline-post-scaledown-sleep`: Specifies the number of seconds the feedback actor waits before initiating a scale-up after scaling down. -- `--redline-max-clients`: Specifies the maximum number of clients allowed during redline testing. If unset, OpenSearch Benchmark defaults to the number of clients defined in the test procedure. \ No newline at end of file +- `--redline-max-clients`: Specifies the maximum number of clients allowed during redline testing. If unset, OpenSearch Benchmark defaults to the number of clients defined in the test procedure. + +### For CPU-based Feedback +- `--redline-cpu-max-usage`: (Required) Max allowed CPU load (%) per node before triggering a scale-down. +- `--redline-cpu-window-seconds`: Duration (in seconds) over which to average CPU usage per node (default: 30). +- `--redline-cpu-check-interval`: Interval (in seconds) between CPU usage checks (default: 30). \ No newline at end of file From 582229045f59eadc4a78fda1c5fa6d12c181d192 Mon Sep 17 00:00:00 2001 From: Michael Oviedo Date: Mon, 7 Jul 2025 11:09:09 -0700 Subject: [PATCH 2/3] address ian's comments Signed-off-by: Michael Oviedo --- _benchmark/reference/commands/redline-test.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/_benchmark/reference/commands/redline-test.md b/_benchmark/reference/commands/redline-test.md index 91d9f49e1a7..0cc43976a0c 100644 --- a/_benchmark/reference/commands/redline-test.md +++ b/_benchmark/reference/commands/redline-test.md @@ -76,8 +76,8 @@ Redline testing now supports CPU-based feedback in addition to request error and ```bash [ERROR] Cannot execute-test. 
Error in worker_coordinator (CPU-based feedback requires a metrics store. You are using an in-memory metrics store) ``` -The `--redline-cpu-max-usage` flag is required. This sets the maximum CPU usage percentage (per node) allowed during the test. -The `node-stats` telemetry device is automatically enabled when using this feature. +- The `--redline-cpu-max-usage` flag is required. This sets the maximum CPU usage percentage (per node) allowed during the test. +- The `node-stats` telemetry device is automatically enabled when using this feature. ### Behavior The redline CPU feedback loop works as follows: @@ -85,10 +85,6 @@ The redline CPU feedback loop works as follows: - If any node exceeds the CPU usage threshold defined by `--redline-cpu-max-usage`, the system triggers a scale-down. - After scaling down, the actor waits before attempting to scale up again -### Optional Tuning Flags -- `--redline-cpu-window-seconds`: Duration (in seconds) over which to average CPU usage per node (default: 30). -- `--redline-cpu-check-interval`: Interval (in seconds) between CPU usage checks (default: 30). - ## Results During a redline test, OpenSearch Benchmark provides detailed logs with scaling decisions and request failures during the test. At the end of a redline test, OpenSearch Benchmark logs the maximum number of clients that your cluster supported without request errors. @@ -102,7 +98,7 @@ Redline test finished. Maximum stable client number reached: 410 ## Configuration tips and test behavior -Use the following options and behaviors to better understand and customize redline test execution: +Use the following optional command flags to better understand and customize redline test execution: - `--redline-scale-step`: Specifies the number of clients to unpause in each scaling iteration. - `--redline-scaledown-percentage`: Specifies the percentage of clients to pause when an error occurs. 
From 8957cf431a8e980429b5e805b62eb0be4261fe89 Mon Sep 17 00:00:00 2001 From: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> Date: Tue, 8 Jul 2025 06:10:07 -0500 Subject: [PATCH 3/3] writer edits Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com> --- _benchmark/reference/commands/redline-test.md | 49 ++++++++++++------- 1 file changed, 30 insertions(+), 19 deletions(-) diff --git a/_benchmark/reference/commands/redline-test.md b/_benchmark/reference/commands/redline-test.md index 0cc43976a0c..4f8fe43c877 100644 --- a/_benchmark/reference/commands/redline-test.md +++ b/_benchmark/reference/commands/redline-test.md @@ -63,27 +63,35 @@ opensearch-benchmark execute-test \ ``` {% include copy.html %} -## Latency or CPU-based Feedback -OSB supports a `timeout` value per request, which tells the system to cancel a request if it is taking longer than the specified time. The flag `--client-options=timeout:` allows users to specify this time period. The default value is 10 seconds. +## Latency- or CPU-based feedback -You can increase or decrease this value to customize the maximum latency tolerated by OSB during a redline test as well. -For example, if you want to find the maximum load your cluster can handle without bringing latency above 15 seconds, setting the timeout value in client options to 15 will achieve this. +OpenSearch Benchmark (OSB) supports a `timeout` value per request, which cancels a request if it exceeds the specified duration. You can set this value using the `--client-options=timeout:` flag. The default is 10 seconds. -Redline testing now supports CPU-based feedback in addition to request error and latency monitoring. This helps avoid pushing your cluster past its safe utilization limits. +You can adjust this value to define the maximum request latency OSB should tolerate during redline testing. 
For example, to determine the highest load your cluster can handle without exceeding 15 seconds of latency, set the timeout in client options to `15`. + +Redline testing also supports CPU-based feedback in addition to latency and request error monitoring. This helps prevent exceeding safe utilization limits for your cluster. ### Requirements -- A metrics store must be configured for CPU-based feedback. Using an in-memory store will result in an error: -```bash -[ERROR] Cannot execute-test. Error in worker_coordinator (CPU-based feedback requires a metrics store. You are using an in-memory metrics store) -``` -- The `--redline-cpu-max-usage` flag is required. This sets the maximum CPU usage percentage (per node) allowed during the test. -- The `node-stats` telemetry device is automatically enabled when using this feature. + +To use CPU-based feedback during redline testing, your setup must meet the following requirements: + +- A metrics store must be configured. Using an in-memory store results in the following error: + + ```bash + [ERROR] Cannot execute-test. Error in worker_coordinator (CPU-based feedback requires a metrics store. You are using an in-memory metrics store) + ``` + +- The `--redline-cpu-max-usage` flag is required. This flag sets the maximum allowed CPU usage (as a percentage) per node during testing. +- The `node-stats` telemetry device is automatically enabled when CPU-based feedback is active. ### Behavior -The redline CPU feedback loop works as follows: -- The `FeedbackActor` queries the metrics store at regular intervals to retrieve the average CPU usage for each node -- If any node exceeds the CPU usage threshold defined by `--redline-cpu-max-usage`, the system triggers a scale-down. -- After scaling down, the actor waits before attempting to scale up again + +The redline CPU feedback loop operates with the following behaviors: + +- The `FeedbackActor` queries the metrics store at regular intervals to retrieve average CPU usage for each node. 
+- If any node exceeds the threshold set by `--redline-cpu-max-usage`, the system initiates a scale-down. +- After scaling down, the actor waits before attempting to scale up again. + ## Results @@ -105,7 +113,10 @@ Use the following optional command flags to better understand and customize redl - `--redline-post-scaledown-sleep`: Specifies the number of seconds the feedback actor waits before initiating a scale-up after scaling down. - `--redline-max-clients`: Specifies the maximum number of clients allowed during redline testing. If unset, OpenSearch Benchmark defaults to the number of clients defined in the test procedure. -### For CPU-based Feedback -- `--redline-cpu-max-usage`: (Required) Max allowed CPU load (%) per node before triggering a scale-down. -- `--redline-cpu-window-seconds`: Duration (in seconds) over which to average CPU usage per node (default: 30). -- `--redline-cpu-check-interval`: Interval (in seconds) between CPU usage checks (default: 30). \ No newline at end of file +### For CPU-based feedback + +Use the following additional flags to configure CPU-based feedback: + +- `--redline-cpu-max-usage`: (Required) Maximum allowed CPU load (as a percentage) per node before triggering a scale-down. +- `--redline-cpu-window-seconds`: Duration (in seconds) over which to average CPU usage per node. Default is 30 seconds. +- `--redline-cpu-check-interval`: Interval (in seconds) between CPU usage checks. Default is 30 seconds.