
Add script to run and automatically retry tests #542


Open
lfrancke wants to merge 1 commit into main
Conversation

lfrancke
Member

This script runs the full test suite and then automatically retries the failing tests a configurable number of times. It can optionally keep the namespaces of failed tests for debugging, and it writes all logs to files.

I created this to run tests locally on my machine, but I hope we can also make it usable for CI builds.

I am not the biggest fan of the output, as it is very noisy, but on the other hand I also don't want to miss crucial information during debugging.
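
For reviewers, the overall control flow is roughly the following. This is a minimal sketch under my assumptions, not the actual implementation: run_single_test and the command it runs are placeholders, and only the two retry phases and the log file naming mirror what the script does.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_single_test(test_id, log_path):
    # Placeholder: run one expanded test case and capture all of its output in a log file.
    with open(log_path, "w", encoding="utf8") as log:
        result = subprocess.run(
            ["./scripts/run-tests", "--test", test_id],  # hypothetical invocation
            stdout=log, stderr=subprocess.STDOUT,
        )
    return result.returncode == 0

def retry_failed(failed, parallel, attempts_parallel, attempts_serial, output_dir):
    # Phase 1: retry all failing tests in parallel, up to attempts_parallel rounds.
    for attempt in range(1, attempts_parallel + 1):
        logs = [f"{output_dir}/{t}_attempt_{attempt}_parallel.txt" for t in failed]
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            passed = list(pool.map(run_single_test, failed, logs))
        failed = [t for t, ok in zip(failed, passed) if not ok]
        if not failed:
            return []
    # Phase 2: retry whatever is still failing, one test at a time.
    for test in list(failed):
        for attempt in range(1, attempts_serial + 1):
            if run_single_test(test, f"{output_dir}/{test}_attempt_{attempt}_serial.txt"):
                failed.remove(test)
                break
    return failed  # tests that never passed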

The script generates output like this:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 1 failed tests:
    1. orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 1 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 1 tests in parallel (max 1 at once)...
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/parallel)...
  ❌ Completed in 10.2m
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false FAILED (attempt 1)

1 tests still failing after parallel retries, starting serial retries...

=== Serial retries for orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false (up to 1 attempts) ===
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/serial)... ⏰ Estimated: 10.2m
  ❌ Completed in 10.0m
    📊 Average: 10.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

📊 Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-16 15:01:18
Ended: 2025-07-16 15:31:43
Total Duration: 0:30:25.015686

SUMMARY
----------------------------------------------
Total Tests: 1
Passed: 0
Flaky (eventually passed): 0
Failed: 1
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-liberal-horse
    Average runtime: 10.1m (from 2 runs)
    Last error: failed in step 1-install-zk...

NAMESPACE MANAGEMENT
----------------------------------------------
Namespaces kept for debugging: 1
  - kuttl-test-liberal-horse

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 2
Overall average runtime: 10.1m
Overall median runtime: 10.1m
Fastest test run: 10.0m
Slowest test run: 10.2m

Slowest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

Fastest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json
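
The runtime numbers in the report (average, median, fastest and slowest run, plus the per-test averages) are presumably aggregated from the recorded per-run durations. A minimal sketch of that aggregation with the standard library, assuming a simple {test_name: [seconds, ...]} mapping rather than the script's actual state format:

import statistics

def runtime_stats(durations_by_test):
    # durations_by_test: {test_name: [duration_in_seconds, ...]} across all attempts.
    all_runs = [d for runs in durations_by_test.values() for d in runs]
    per_test_avg = {t: statistics.mean(runs) for t, runs in durations_by_test.items() if runs}
    return {
        "total_runs": len(all_runs),
        "average": statistics.mean(all_runs),
        "median": statistics.median(all_runs),
        "fastest": min(all_runs),
        "slowest": max(all_runs),
        "slowest_tests": sorted(per_test_avg, key=per_test_avg.get, reverse=True),
    }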

lfrancke self-assigned this Jul 16, 2025
lfrancke moved this to Development: Waiting for Review in Stackable Engineering Jul 16, 2025
@sbernauer
Member

I think I would prefer to have --keep-failed-namespaces enabled by default.
Deleting a namespace is quick, but you can wait an hour for a test, come back to look at it, and then have to re-run it because the namespace was already deleted.
For CI we can disable it (although keeping it also doesn't hurt, as the replicated cluster is deleted afterwards).
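
If we flip the default, argparse could keep the existing flag name and gain an explicit opt-out for CI. A sketch (option naming up for discussion; argparse.BooleanOptionalAction needs Python 3.9+):

import argparse

parser = argparse.ArgumentParser()
# Keep failed namespaces by default; CI can pass --no-keep-failed-namespaces to opt out.
parser.add_argument(
    "--keep-failed-namespaces",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Keep the namespaces of failed tests for debugging",
)
args = parser.parse_args()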

@lfrancke
Member Author

I'm fine with that. I'll wait to see whether more comments come in and then implement it.

@sbernauer
Member

I modified a smoke test (so that it fails) and ran scripts/auto-retry-tests.py.
Some remarks:

  1. The --keep-failed-namespaces default mentioned above.
  2. It said Namespace kept for debugging: kuttl-test-square-swift, but it in fact deleted the namespace.
  3. The smoke test failed with requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-opa-server-default-metricssssssssssssssssssss', port=8081): Max retries exceeded with url: /metrics (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f7839ad9af0>: Failed to resolve 'test-opa-server-default-metricssssssssssssssssssss' ([Errno -2] Name or service not known)")). I would have spotted the "bug" much faster if I could see the STDOUT of the test run. On the other hand, I guess you silenced that on purpose?
  4. My smoke_opa-1.0.1_openshift-false_attempt_1_parallel.txt only has this content:
INFO:root:Expanding test case id [smoke_opa-1.0.1_openshift-false]
INFO:root:Expanding test case id [smoke_opa-1.4.2_openshift-false]
Traceback (most recent call last):
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/bin/.beku-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/main.py", line 97, in main
    return expand(
           ^^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 332, in expand
    test_case.expand(template_dir, output_dir, namespace)
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 149, in expand
    test_source.build_destination()
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 79, in build_destination
    with open(dest, encoding="utf8", mode="w") as stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'tests/_work/tests/smoke/smoke_opa-1.4.2_openshift-false/01-install-vector-aggregator-discovery-configmap.yaml'
ERROR:root:beku failed

Is this maybe because of multiple processes running in parallel?
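
If it is, expanding the templates once up front before any tests are started, or serializing the expansion with a lock, should rule that out. A sketch using only the standard library; the lock path and the beku command line are placeholders:

import fcntl
import subprocess

def expand_test_cases(beku_cmd, lock_path="tests/.beku-expand.lock"):
    # Hypothetical guard: hold an exclusive lock while beku rewrites tests/_work,
    # so parallel test runs cannot race each other during expansion.
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            subprocess.run(beku_cmd, check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)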
