Address and reduce flakiness in integration tests #3693

@mattKorwel

Description

This task focuses on addressing and reducing the flakiness of our integration tests to ensure a stable and reliable CI pipeline.

The Problem

Our integration tests frequently produce inconsistent results, a phenomenon known as "flakiness." These failures are often due not to actual bugs in the CLI but to the non-deterministic nature of the services we interact with, particularly the Large Language Models (LLMs). Assertions that expect exact string matches from an LLM response are a primary source of these flakes.

This instability forces developers to re-run jobs, slows down the development cycle, and erodes confidence in our test suite. A failing test should be a clear signal of a real problem, not noise.

CI Workflow: Test results and logs can be viewed in our End-to-End CI Workflow.

Potential Solutions

  1. Robust Assertions: We need to refactor our tests to be more resilient to minor variations in LLM output. Instead of expecting exact matches, assertions should validate the presence of key information, structure, or intent. A great example of this approach is the recent pull request that fixed the list-directory tests by improving their assertions: PR #3418.

  2. Mocking Strategies: For tests where the interaction with the LLM is not the primary focus, we should implement robust mocking of the model's responses. This will provide predictable test runs and isolate the component being tested.

  3. Implicit Retries (Investigation): As part of this work, we should investigate the feasibility and impact of implementing an automatic retry mechanism for failed tests. For example, a test could be run up to three times and considered passing if it succeeds in a majority of those runs. This could be a pragmatic short-term solution for tests that are difficult to make fully deterministic.
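As an illustration of item 1, a resilient assertion can check that each expected piece of information appears somewhere in the response, rather than demanding an exact string. The helper below is a hypothetical sketch, not the CLI's actual test code:

```python
def assert_lists_directory(response: str, expected_entries: list[str]) -> None:
    """Pass if the response mentions every expected entry, in any phrasing."""
    missing = [e for e in expected_entries if e.lower() not in response.lower()]
    assert not missing, f"Response is missing entries: {missing}"

# Two differently worded LLM responses that carry the same facts both pass:
assert_lists_directory(
    "The directory contains: README.md, src/, and package.json.",
    ["README.md", "src", "package.json"],
)
assert_lists_directory(
    "I found three entries here: package.json, a src folder, and README.md.",
    ["README.md", "src", "package.json"],
)
```

Because the check is case-insensitive and order-independent, it survives rephrasing while still failing loudly when an entry is genuinely absent.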
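For item 2, the model client can be replaced with a stub so the test exercises only the surrounding logic and never depends on a live LLM. The `FileLister` class and `generate` method below are hypothetical stand-ins for whatever interface the CLI actually exposes:

```python
from unittest.mock import MagicMock

class FileLister:
    """Hypothetical component that asks a model client to list files."""

    def __init__(self, model_client):
        self.model_client = model_client

    def list_files(self, path: str) -> list[str]:
        reply = self.model_client.generate(f"List the files in {path}")
        # Parse a simple "- entry" bullet list from the model's reply.
        return [line.strip("- ") for line in reply.splitlines() if line.strip()]

# Stub the model so the test is fully deterministic.
mock_client = MagicMock()
mock_client.generate.return_value = "- README.md\n- src\n- package.json"

lister = FileLister(mock_client)
assert lister.list_files("/repo") == ["README.md", "src", "package.json"]
mock_client.generate.assert_called_once()
```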
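The retry heuristic in item 3 (up to three runs, passing on a majority) can be sketched as a small wrapper. Existing test frameworks offer retry plugins that we should evaluate first; this sketch only illustrates the mechanics, with early exit once the outcome is decided:

```python
def passes_majority(test_fn, runs: int = 3) -> bool:
    """Run test_fn up to `runs` times; pass once a majority of runs succeed."""
    needed = runs // 2 + 1
    successes = failures = 0
    for _ in range(runs):
        try:
            test_fn()
            successes += 1
        except AssertionError:
            failures += 1
        # Stop as soon as either side has clinched the majority.
        if successes >= needed:
            return True
        if failures >= needed:
            return False
    return False

# A flaky test that fails on its first call, then succeeds twice:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    assert calls["n"] > 1

assert passes_majority(flaky) is True
```

Note that a best-of-three policy masks genuine intermittent bugs as well as environmental noise, which is why it is framed as an investigation rather than a default.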

The goal is to create a test suite that is both reliable and provides a high degree of confidence in the quality of our releases.

Metadata

Assignees

No one assigned

    Labels

    area/contribution: Improves the contribution process via test automation and CI/CD pipeline enhancements.
    help wanted: Extra attention is needed.
    kind/flake: This issue is related to a flaky test.
    maintainer: For Roadmap Items.
    sub-area/testing: Issues related to testing.

    Projects

    Status: Ready for Work

    Relationships: None yet

    Development: No branches or pull requests
