
Add script to run and automatically retry tests #542


Open
lfrancke wants to merge 1 commit into main
Conversation

lfrancke
Member

This script runs the full test suite and then automatically retries the failing tests a configurable number of times. It can optionally keep the namespaces of failed tests for debugging, and it writes all logs to files.

I created this to run tests locally on my machine, but I hope we can also make it usable for CI builds.

I am not the biggest fan of the output, as it is very noisy, but on the other hand I also don't want to miss crucial information during debugging.
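
For reviewers, the overall control flow is roughly the following. This is a minimal sketch under my assumptions, not the actual implementation: run_single_test and the command it runs are placeholders, and only the two retry phases and the log file naming mirror what the script does.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_single_test(test_id, log_path):
    # Placeholder: run one expanded test case and capture all of its output in a log file.
    with open(log_path, "w", encoding="utf8") as log:
        result = subprocess.run(
            ["./scripts/run-tests", "--test", test_id],  # hypothetical invocation
            stdout=log, stderr=subprocess.STDOUT,
        )
    return result.returncode == 0

def retry_failed(failed, parallel, attempts_parallel, attempts_serial, output_dir):
    # Phase 1: retry all failing tests in parallel, up to attempts_parallel rounds.
    for attempt in range(1, attempts_parallel + 1):
        logs = [f"{output_dir}/{t}_attempt_{attempt}_parallel.txt" for t in failed]
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            passed = list(pool.map(run_single_test, failed, logs))
        failed = [t for t, ok in zip(failed, passed) if not ok]
        if not failed:
            return []
    # Phase 2: retry whatever is still failing, one test at a time.
    for test in list(failed):
        for attempt in range(1, attempts_serial + 1):
            if run_single_test(test, f"{output_dir}/{test}_attempt_{attempt}_serial.txt"):
                failed.remove(test)
                break
    return failed  # tests that never passed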

The script generates output like this:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 1 failed tests:
    1. orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 1 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 1 tests in parallel (max 1 at once)...
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/parallel)...
  ❌ Completed in 10.2m
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false FAILED (attempt 1)

1 tests still failing after parallel retries, starting serial retries...

=== Serial retries for orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false (up to 1 attempts) ===
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/serial)... ⏰ Estimated: 10.2m
  ❌ Completed in 10.0m
    📊 Average: 10.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

📊 Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-16 15:01:18
Ended: 2025-07-16 15:31:43
Total Duration: 0:30:25.015686

SUMMARY
----------------------------------------------
Total Tests: 1
Passed: 0
Flaky (eventually passed): 0
Failed: 1
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-liberal-horse
    Average runtime: 10.1m (from 2 runs)
    Last error: failed in step 1-install-zk...

NAMESPACE MANAGEMENT
----------------------------------------------
Namespaces kept for debugging: 1
  - kuttl-test-liberal-horse

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 2
Overall average runtime: 10.1m
Overall median runtime: 10.1m
Fastest test run: 10.0m
Slowest test run: 10.2m

Slowest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

Fastest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json
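
The runtime numbers in the report (average, median, fastest and slowest run, plus the per-test averages) are presumably aggregated from the recorded per-run durations. A minimal sketch of that aggregation with the standard library, assuming a simple {test_name: [seconds, ...]} mapping rather than the script's actual state format:

import statistics

def runtime_stats(durations_by_test):
    # durations_by_test: {test_name: [duration_in_seconds, ...]} across all attempts.
    all_runs = [d for runs in durations_by_test.values() for d in runs]
    per_test_avg = {t: statistics.mean(runs) for t, runs in durations_by_test.items() if runs}
    return {
        "total_runs": len(all_runs),
        "average": statistics.mean(all_runs),
        "median": statistics.median(all_runs),
        "fastest": min(all_runs),
        "slowest": max(all_runs),
        "slowest_tests": sorted(per_test_avg, key=per_test_avg.get, reverse=True),
    }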

lfrancke self-assigned this Jul 16, 2025
lfrancke moved this to Development: Waiting for Review in Stackable Engineering Jul 16, 2025
@sbernauer
Member

I think I would prefer to have --keep-failed-namespaces enabled by default.
Deleting a namespace is quick, but you can wait an hour for a test, come back to look at it, and then have to re-run it because the namespace was already deleted.
For CI we can disable it (although keeping it also doesn't hurt, as the replicated cluster is deleted afterwards).
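
If we flip the default, argparse could keep the existing flag name and gain an explicit opt-out for CI. A sketch (option naming up for discussion; argparse.BooleanOptionalAction needs Python 3.9+):

import argparse

parser = argparse.ArgumentParser()
# Keep failed namespaces by default; CI can pass --no-keep-failed-namespaces to opt out.
parser.add_argument(
    "--keep-failed-namespaces",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Keep the namespaces of failed tests for debugging",
)
args = parser.parse_args()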

@lfrancke
Member Author

I'm fine with that. I'll wait to see whether more comments come in and then implement it.

@sbernauer
Member

I modified a smoke test (so that it fails) and ran scripts/auto-retry-tests.py.
Some remarks:

  1. The --keep-failed-namespaces default mentioned above.
  2. It said Namespace kept for debugging: kuttl-test-square-swift, but it in fact deleted the namespace.
  3. The smoke test failed with requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-opa-server-default-metricssssssssssssssssssss', port=8081): Max retries exceeded with url: /metrics (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f7839ad9af0>: Failed to resolve 'test-opa-server-default-metricssssssssssssssssssss' ([Errno -2] Name or service not known)")). I would have spotted the "bug" much faster if I could see the STDOUT of the test run. On the other hand, I guess you silenced that on purpose?
  4. My smoke_opa-1.0.1_openshift-false_attempt_1_parallel.txt only has this content:
INFO:root:Expanding test case id [smoke_opa-1.0.1_openshift-false]
INFO:root:Expanding test case id [smoke_opa-1.4.2_openshift-false]
Traceback (most recent call last):
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/bin/.beku-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/main.py", line 97, in main
    return expand(
           ^^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 332, in expand
    test_case.expand(template_dir, output_dir, namespace)
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 149, in expand
    test_source.build_destination()
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 79, in build_destination
    with open(dest, encoding="utf8", mode="w") as stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'tests/_work/tests/smoke/smoke_opa-1.4.2_openshift-false/01-install-vector-aggregator-discovery-configmap.yaml'
ERROR:root:beku failed

Is this maybe because of multiple processes running in parallel?
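
If it is, expanding the templates once up front before any tests are started, or serializing the expansion with a lock, should rule that out. A sketch using only the standard library; the lock path and the beku command line are placeholders:

import fcntl
import subprocess

def expand_test_cases(beku_cmd, lock_path="tests/.beku-expand.lock"):
    # Hypothetical guard: hold an exclusive lock while beku rewrites tests/_work,
    # so parallel test runs cannot race each other during expansion.
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            subprocess.run(beku_cmd, check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)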
