Flaky tests
What's a flaky test?
It's a test that sometimes fails, but if you retry it enough times, it passes, eventually.
What are the potential causes for a test to be flaky?
Unclean environment
Label: flaky-test::unclean environment
Description: The environment got dirtied by a previous test. The actual cause is probably not the flaky test here.
Difficulty to reproduce: Moderate. Usually, running the same spec files until the one that's failing reproduces the problem.
Resolution: Fix the previous tests and/or the places where the environment is modified, so that it's reset to a pristine state after each test.
Examples:
- Example 1: A migration test might roll-back the database, perform its testing, and then roll-up the database in an inconsistent state, so that following tests might not know about certain columns.
- Example 2: A test modifies data that is used by a following test.
- Example 3: A test for a database query passes in a fresh database, but in a CI/CD pipeline where the database is used to process previous test sequences, the test fails. This likely means that the query itself needs to be updated to work in a non-clean database.
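As a rough illustration of the "reset to a pristine state" idea, the sketch below uses a hypothetical in-memory FeatureFlags module and a hypothetical with_clean_environment helper (neither comes from the GitLab codebase). The first block dirties shared state; the reset in the ensure clause keeps the second block from seeing it.

```ruby
# Hypothetical shared state that every test in the process can modify.
module FeatureFlags
  DEFAULTS = { new_ui: false }.freeze

  @flags = DEFAULTS.dup

  class << self
    def enable(name)
      @flags[name] = true
    end

    def enabled?(name)
      @flags.fetch(name, false)
    end

    # Restore the pristine state; intended to run after every test.
    def reset!
      @flags = DEFAULTS.dup
    end
  end
end

# Hypothetical helper: runs a block and always resets shared state afterwards,
# similar in spirit to an `after` hook in a test framework.
def with_clean_environment
  yield
ensure
  FeatureFlags.reset!
end

# The first "test" dirties the environment.
with_clean_environment { FeatureFlags.enable(:new_ui) }

# The second "test" relies on the default and would fail without the reset.
with_clean_environment do
  raise 'environment leaked from a previous test' if FeatureFlags.enabled?(:new_ui)
end
```

In a real suite the reset would normally live in a global hook rather than in each test, so that a test cannot forget it.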
Ordering assertion
Label: flaky-test::ordering assertion
Description: The test is expecting a specific order in the data under test yet the data is in a non-deterministic order.
Difficulty to reproduce: Easy. Usually, running the test locally several times would reproduce the problem.
Resolution: Depending on the problem, you might want to:
- loosen the assertion if the test shouldn't care about ordering but only about the elements
- fix the test by specifying a deterministic ordering
- fix the app code by specifying a deterministic ordering
Examples:
- Example 1: Without specifying ORDER BY, the database does not return rows in a deterministic order, or a data race happens in the tests.
- Example 2.
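The first two resolutions can be sketched in plain Ruby. The fetch_names_unordered method is a hypothetical stand-in for a query without ORDER BY; it shuffles its result to imitate non-deterministic row order. In RSpec, the order-insensitive comparison is usually written with the match_array or contain_exactly matchers instead of sorting by hand.

```ruby
# Hypothetical stand-in for a query without ORDER BY:
# the same rows come back, but in no guaranteed order.
def fetch_names_unordered
  %w[alice bob carol].shuffle
end

# Fixing the code under test: impose a deterministic order explicitly.
def fetch_names_ordered
  fetch_names_unordered.sort
end

expected = %w[alice bob carol]

# Flaky assertion: passes only when the shuffle happens to match.
#   fetch_names_unordered == expected

# Loosened assertion: compare the elements, ignoring order.
raise 'unexpected elements' unless fetch_names_unordered.sort == expected.sort

# Assertion against code that specifies an order: exact comparison is safe.
raise 'unexpected order' unless fetch_names_ordered == expected
```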
Dataset-specific
Label: flaky-test::dataset-specific
Description: The test assumes the dataset is in a particular (usually limited) state, which might not be true depending on when the test runs during the test suite.
Difficulty to reproduce: Moderate, as the amount of data needed to reproduce the issue might be difficult to achieve locally.
Resolution: Fix the test so that it doesn't assume the dataset is in a particular state, and don't hardcode IDs.
Examples:
- Example 1: The database is recreated when any table has more than 500 columns. It could pass in the merge request, but fail later in master if the order of tests changes.
- Example 2: A test asserts that trying to find a record with a nonexistent ID returns an error message. The test uses a hardcoded ID that's supposed to not exist (for example, 42). If the test runs early in the test suite, it might pass because not enough records were created before it. As soon as it runs later in the suite, there could be a record that actually has the ID 42, and the test starts to fail.
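A minimal sketch of Example 2, assuming a hypothetical in-memory RecordTable in place of a real database table. The nonexistent_id method is an illustrative helper, not an existing API: it derives an unused ID from the current data instead of hardcoding one.

```ruby
# Hypothetical in-memory table standing in for a database table.
class RecordTable
  def initialize
    @rows = {}
    @next_id = 1
  end

  def create(attrs = {})
    id = @next_id
    @next_id += 1
    @rows[id] = attrs.merge(id: id)
  end

  def find(id)
    @rows.fetch(id) { raise KeyError, "Record #{id} not found" }
  end

  # An ID that does not exist right now: one past the largest ID in use.
  def nonexistent_id
    (@rows.keys.max || 0) + 1
  end
end

table = RecordTable.new
50.times { table.create } # earlier tests created records

# Flaky: table.find(42) raises only while fewer than 42 records exist.
# Robust: ask the table for an ID that is unused at this moment.
begin
  table.find(table.nonexistent_id)
  raise 'expected the lookup to fail'
rescue KeyError
  # Expected: the record does not exist, regardless of test order.
end
```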
Random input
Label: flaky-test::random input
Description: The test uses random values that sometimes match the expectations and sometimes don't.
Difficulty to reproduce: Easy, as the test can be modified locally to use the "random value" used at the time the test failed.
Resolution: Once the problem is reproduced, it should be easy to debug and fix either the test or the app.
Examples:
- Example 1: The test isn't robust enough to handle specific data that appears only sporadically, because the input data is random.
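One way to make a random failure reproducible is to draw all random values from a seeded generator, so that the value from the failing run can be regenerated locally. The sketch below uses Ruby's standard Random class; the random_name method and the seed value are illustrative assumptions.

```ruby
# Hypothetical data generator. The random number generator is passed in,
# so the caller controls the seed.
def random_name(rng)
  length = rng.rand(1..8)
  Array.new(length) { rng.rand(97..122).chr }.join # lowercase ASCII letters
end

# Assumed to be the seed printed by the failing run.
seed = 1234

# Two generators built from the same seed yield the same "random" value,
# so the failing input can be recreated on demand.
first  = random_name(Random.new(seed))
second = random_name(Random.new(seed))
raise 'seeded values should match' unless first == second
```

Once the failing value is reproduced, it can be turned into a fixed test case so that the problem stays covered without relying on chance.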
Unreliable DOM Selector
Label: flaky-test::unreliable dom selector
Description: The DOM selector used in the test is unreliable.
Difficulty to reproduce: Moderate to difficult, depending on whether the DOM selector is duplicated, appears after a delay, and so on. Adding a delay in the API or controller could help reproduce the issue.
Resolution: It depends on the problem. Possible fixes include waiting for requests to finish, or scrolling down the page.
Examples:
- Example 1: A non-unique CSS selector matching more than one element, or a non-waiting selector method that does not allow rendering time before throwing an element not found error.
- Example 2: A CSS selector only appears after a GraphQL request has finished, and the UI has updated.
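The difference between a non-waiting lookup and a waiting one can be sketched without a browser. In the sketch below, wait_until is a hypothetical polling helper and find_element is a stand-in for a DOM lookup whose element only "renders" after 0.1 seconds; neither is part of Capybara or any other library.

```ruby
# Returns the current monotonic time in seconds.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Hypothetical helper: polls the block until it returns a truthy value,
# or raises once the timeout has elapsed.
def wait_until(timeout: 2.0, interval: 0.05)
  deadline = monotonic_now + timeout
  loop do
    result = yield
    return result if result
    raise 'timed out waiting for condition' if monotonic_now >= deadline

    sleep interval
  end
end

# Stand-in for a DOM lookup: the element appears 0.1 seconds from now.
appears_at = monotonic_now + 0.1
find_element = -> { monotonic_now >= appears_at ? :button : nil }

immediate = find_element.call              # non-waiting lookup: nothing yet
waited    = wait_until { find_element.call } # waiting lookup: finds the element

raise 'element should not be rendered yet' unless immediate.nil?
raise 'element should be found after waiting' unless waited == :button
```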
Datetime-sensitive
Label: flaky-test::datetime-sensitive
Description: The test is assuming a specific date or time.
Difficulty to reproduce: Easy to moderate, depending on whether the test consistently fails after a certain date, or only fails at a given time or date.
Resolution: Freezing the time is usually a good solution.
Examples:
- Example 1: A test that breaks after some time has passed.
- Example 2: A test that breaks on the last day of the month.
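A minimal sketch of Example 2, assuming a hypothetical due_date method whose clock is injected through a today: keyword. Passing a fixed date plays the role of freezing time; in a Rails test suite, time helpers such as travel_to or freeze_time serve the same purpose.

```ruby
require 'date'

# Hypothetical code under test: the due date is one month after "today".
# The current date is injected so that tests can pin it.
def due_date(today: Date.today)
  today >> 1 # Date#>> clamps to the last day of a shorter month
end

# Flaky: this holds on most days, but not when today is, say, January 31.
#   due_date.day == Date.today.day

# Deterministic: pin the date and assert on known values.
regular_day = Date.new(2024, 1, 15)
month_end   = Date.new(2024, 1, 31)

raise 'unexpected due date' unless due_date(today: regular_day) == Date.new(2024, 2, 15)
raise 'unexpected due date' unless due_date(today: month_end) == Date.new(2024, 2, 29)
```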
Unstable infrastructure
Label: flaky-test::unstable infrastructure
Description: The test fails from time to time due to infrastructure issues.
Difficulty to reproduce: Hard. It's really hard to reproduce CI infrastructure issues. It might be possible by using containers locally.
Resolution: Starting a conversation with the Infrastructure department in a dedicated issue is usually a good idea.
Examples:
- Example 1: The runner is under heavy load at this time.
- Example 2: The runner is having networking issues, making a job fail early.
Quarantined tests
When we have a flaky test in master, quarantine the test after the first failure and create a ~"failure::flaky-test" issue.
If the test cannot be fixed in a timely fashion, there is an impact on the productivity of all the developers, so it should be quarantined. There are two ways to quarantine tests, depending on the test framework being used: RSpec and Jest.
RSpec
Add the corresponding category and group labels to the issue using feature_category metadata.
For RSpec tests, you can use the :quarantine metadata with the issue URL.
it 'succeeds', quarantine: 'https://gitlab.com/gitlab-org/gitlab/-/issues/12345' do
  expect(response).to have_gitlab_http_status(:ok)
end
This means it is skipped unless run with --tag quarantine:
bin/rspec --tag quarantine
Jest
For Jest specs, you can use the .skip method along with the eslint-disable-next-line comment to disable the jest/no-disabled-tests ESLint rule and include the issue URL. Here's an example:
// https://gitlab.com/gitlab-org/gitlab/-/issues/56789
// eslint-disable-next-line jest/no-disabled-tests
it.skip('should throw an error', () => {
  expect(response).toThrowError(expected_error)
});
This means it is skipped unless the test suite is run with the --runInBand Jest command line option:
jest --runInBand
For both test frameworks, make sure to add the ~"quarantined test" label to the issue.
Once a test is in quarantine, there are 3 choices:
- Fix the test (that is, get rid of its flakiness).
- Move the test to a lower level of testing.
- Remove the test entirely (for example, because there's already a lower-level test, or it's duplicating another same-level test, or it's testing too much etc.).
Automatic retries and flaky tests detection
On our CI, we use RSpec::Retry to automatically retry a failing example a few times (see spec/spec_helper.rb for the precise retry count).
We also use a custom RspecFlaky::Listener. This listener runs in the update-tests-metadata job in maintenance scheduled pipelines on the master branch, and saves flaky examples to rspec/flaky/report-suite.json. The report file is then retrieved by the retrieve-tests-metadata job in all pipelines.
This was originally implemented in: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/13021.
If you want to enable retries locally, you can use the RETRIES environment variable. For instance, RETRIES=1 bin/rspec ... would retry the failing examples once.
To generate the reports locally, use the FLAKY_RSPEC_GENERATE_REPORT environment variable. For example: FLAKY_RSPEC_GENERATE_REPORT=1 bin/rspec ...
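The general idea behind retrying and flaky detection can be sketched in a few lines of plain Ruby. This is an illustration only, not how RSpec::Retry or RspecFlaky::Listener are implemented: run_with_retries is a hypothetical helper that reports an example as flaky when it fails at first and then passes on a retry.

```ruby
# Hypothetical helper: runs the block up to `retries + 1` times.
# Returns :passed if the first attempt succeeds, :flaky if a later attempt
# succeeds, and :failed if every attempt raises.
def run_with_retries(retries:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
    attempts == 1 ? :passed : :flaky
  rescue StandardError
    retry if attempts <= retries
    :failed
  end
end

# Passes on the first attempt.
stable = run_with_retries(retries: 1) { |_attempt| true }

# Fails on the first attempt only, so one retry is enough to pass.
flaky = run_with_retries(retries: 1) { |attempt| raise 'boom' if attempt == 1 }

# Fails on every attempt.
broken = run_with_retries(retries: 1) { |_attempt| raise 'boom' }

raise 'unexpected result' unless [stable, flaky, broken] == %i[passed flaky failed]
```

Collecting the examples that end up in the :flaky bucket is, conceptually, what a flaky-test report contains.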
rspec/flaky/report-suite.json report
The rspec/flaky/report-suite.json report is:
- Used for automatically skipping known flaky tests.
- Imported into Snowflake once per day, for monitoring with the internal dashboard.
Problems we had in the past at GitLab
- rspec-retry is biting us when some API specs fail: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/9825
- Sporadic RSpec failures due to PG::UniqueViolation: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/9846
- ffaker generates funky data that tests are not ready to handle (and tests should be predictable so that's bad!):
  - Make spec/mailers/notify_spec.rb more robust: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10015
  - Transient failure in spec/requests/api/commits_spec.rb: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/9944
  - Replace ffaker factory data with sequences: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10184
  - Transient failure in spec/finders/issues_finder_spec.rb: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10404
Order-dependent flaky tests
These flaky tests can fail depending on the order they run with other tests. For example:
To identify the tests that lead to such a failure, we can use scripts/rspec_bisect_flaky, which gives us the minimal test combination to reproduce the failure:
- First obtain the list of specs that ran before the flaky test. You can search for the list under Knapsack node specs: in the CI job output log.
- Save the list of specs as a file, and run:
cat knapsack_specs.txt | xargs scripts/rspec_bisect_flaky
If there is an order-dependency issue, the script above will print the minimal reproduction.
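The bisection idea itself can be illustrated with a short sketch. This is not the implementation of scripts/rspec_bisect_flaky; find_culprit is a hypothetical helper that assumes exactly one preceding spec causes the failure, and the block stands in for "run the flaky test after this subset of specs and report whether it fails".

```ruby
# Hypothetical helper: narrows a list of candidate specs down to the single
# spec that makes the flaky test fail, halving the list on every step.
# Assumes exactly one culprit is present in `candidates`.
def find_culprit(candidates)
  while candidates.size > 1
    half = candidates.first(candidates.size / 2)
    # The block returns true when the flaky test fails after running `half`.
    candidates = yield(half) ? half : candidates - half
  end
  candidates.first
end

# Illustrative spec names; 'd_spec.rb' is assumed to pollute the environment.
specs = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb e_spec.rb]
culprit = find_culprit(specs) { |subset| subset.include?('d_spec.rb') }

raise 'unexpected culprit' unless culprit == 'd_spec.rb'
```

Halving the candidate list keeps the number of suite runs roughly logarithmic in the number of preceding specs, which matters when each run is slow.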
Time-sensitive flaky tests
- https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10046
- https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10306
Array order expectation
Feature tests
- Be sure to create all the data the test needs before starting the exercise: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/12059
- Bis: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/12604
- Bis: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/12664
- Assert against the underlying database state instead of against a page's content: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10934
- In JS tests, shifting elements can cause Capybara to mis-click when the element moves at the exact time Capybara sends the click
- Triggering JS events before the event handlers are set up
- Wait for the image to be lazy-loaded when asserting on a Markdown image's src attribute
- Avoid asserting against flash notice banners
Capybara viewport size related issues
- Transient failure of spec/features/issues/filtered_search/filter_issues_spec.rb: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10411
Capybara JS driver related issues
- Don't wait for AJAX when no AJAX request is fired: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/10454
- Bis: https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/12626
Capybara expectation times out
Hanging specs
If a spec hangs, it might be caused by a bug in Rails:
- https://gitlab.com/gitlab-org/gitlab/-/merge_requests/81112
- https://gitlab.com/gitlab-org/gitlab/-/issues/337039
Suggestions
Split the test file
It could help to split large RSpec files into multiple files in order to narrow down the context and identify the problematic tests.
Resources
- Flaky Tests: Are You Sure You Want to Rerun Them?
- How to Deal With and Eliminate Flaky Tests
- Tips on Treating Flakiness in your Rails Test Suite
- 'Flaky' tests: a short story
- Using Insights to Discover Flaky, Slow, and Failed Tests