Triage Test Instability Tickets

Unstable tests fail non-deterministically, so they can slip into the main codebase when they happen to pass on the initial PR. Once merged, their failures make it hard to tell whether a new contribution introduced a problem or whether the failure is unrelated, which increases the burden of the review process. To maintain a healthy test suite, the Flink community tracks tests that fail in CI as JIRA Bug tickets with the test-stability label, "Critical" severity, and the affected versions.

Finding Test Instability Tickets

Go to the Apache JIRA site and click "View all issues and filters".

In the "Advanced" search mode, enter a query like:

project = FLINK AND resolution = unresolved AND labels in (test-stability) ORDER BY createdDate DESC

This lists all unresolved test instability issues, newest first. To narrow the results to specific components of the project, add a component clause:

project = FLINK AND component in ("Runtime / Coordination", "Runtime / REST", "Runtime / Queryable State", "Runtime / Metrics", "Deployment / Mesos", "Deployment / Kubernetes", "Deployment / YARN", "Build System", "Release System", "BuildSystem / Shaded", flink-docker) AND resolution = unresolved AND labels in (test-stability) ORDER BY createdDate DESC
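If you switch between component sets often, the two queries above can be assembled from a list of component names. The following is a minimal sketch; the function name build_test_stability_query is hypothetical, and the output is meant to be pasted into the "Advanced" search box rather than sent to any API.

```python
def build_test_stability_query(components=()):
    """Return a JQL string for unresolved FLINK test-stability tickets.

    `components` is an optional sequence of JIRA component names; each one
    is wrapped in double quotes so that names with spaces or slashes work.
    """
    clauses = ["project = FLINK"]
    if components:
        quoted = ", ".join('"{}"'.format(name) for name in components)
        clauses.append("component in ({})".format(quoted))
    clauses.append("resolution = unresolved")
    clauses.append("labels in (test-stability)")
    return " AND ".join(clauses) + " ORDER BY createdDate DESC"


# Example call (component names taken from the query above):
# build_test_stability_query(["Runtime / Coordination", "Deployment / YARN"])
```

Called without arguments, the function yields the first, unfiltered query.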

Triaging Techniques

Finding and Downloading Logs in CI

CI runs the tests with a logging configuration that does not write test output to standard out, so it is not visible in the web interface. The full log statements are written to disk instead. The steps below describe how to retrieve the logs from the testing machines:

Navigate to the Azure Pipelines build of the failure, which should be linked in the ticket
1. If it is not, please comment on the ticket asking the reporter to provide those details
Find the failing job and follow its "x artifact produced" link, where x is the number of artifacts
1. NOTE: remember the name of the failing job before clicking the link
In the link, there should be an artifact similar to logs-ci-JOB_NAME_COMPONENT-12345678
1. Example: for the job name test_ci kafka_gelly, there should be a log artifact named logs-ci-kafkagelly-12345678
Download the log artifact via the inline three-dot menu (on the right)
Unarchive the log artifact
1. The artifact should be a zip containing a tarball
In the archive, look for the mvn-x.log files
1. There may be more than one, depending on how many threads Maven is running tests in
Open the log files in your editor; IntelliJ is fairly good at displaying them
In each log file, search for the test name and look for details pointing to the failure cause. If no log line for the test appears, make sure that the test extends TestLogger, a utility class that prints a log statement at the beginning and end of each test. We are happy to accept "hotfix" pull requests that add it where it is missing.
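The unpacking and searching steps above can be scripted once the artifact is downloaded. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the only assumptions about the artifact are the ones stated above (a zip containing a tarball, with log files named mvn-x.log somewhere inside).

```python
import tarfile
import zipfile
from pathlib import Path


def extract_log_artifact(artifact_zip, dest_dir):
    """Unpack the zip, then every tarball found inside it, into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(artifact_zip)) as archive:
        archive.extractall(str(dest))
    # Materialise the file list first so newly extracted files are not revisited.
    extracted = sorted(p for p in dest.rglob("*") if p.is_file())
    for path in extracted:
        if tarfile.is_tarfile(str(path)):
            with tarfile.open(str(path)) as tarball:
                # Only use this on artifacts from a CI run you trust.
                tarball.extractall(str(path.parent))
    return dest


def find_test_mentions(log_dir, test_name):
    """Return (file, line number, line) for each mvn-*.log line containing test_name."""
    hits = []
    for log_file in sorted(Path(log_dir).rglob("mvn-*.log")):
        with open(str(log_file), errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if test_name in line:
                    hits.append((log_file, number, line.rstrip("\n")))
    return hits
```

Searching for the test class name usually works best as a first pass, since the start and end markers printed by TestLogger include it; from a hit you can then jump to the surrounding lines in your editor.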

Reproducing Locally

The easiest way to see how a test is failing is to reproduce the failure in your IDE. This is not always possible, but it is a good first step if the cause is not obvious from the initial failure logs.

Set up your Flink development environment in IntelliJ, if you have not already, by following the Setting up a Flink development environment guide
Navigate to the failing test or test suite and run it via the IntelliJ JUnit integration
1. The integration offers ways to configure the test runner to run the test repeatedly, either a set number of times or until failure, which is helpful for flaky tests
2. https://www.jetbrains.com/help/idea/run-debug-configuration-junit.html#configTab
If this does not yield a failure in a reasonable amount of time, the next step is to repeat this tactic using CI with debug logs
(optional) If you can reproduce the issue locally, remember to set the logging level in "test/resources/log4j-test.properties" from "OFF" to "INFO" or "DEBUG" to get more information.
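If you prefer the command line to the IDE's repeat option, the same "run until failure" idea can be sketched with a small wrapper. This is an illustration, not part of the Flink tooling: the function name run_until_failure is hypothetical, and the command to repeat is whatever runs your single test and exits non-zero on failure.

```python
import subprocess


def run_until_failure(command, max_runs=100):
    """Run `command` repeatedly.

    Returns the 1-based number of the first run that exits non-zero,
    or None if all `max_runs` runs succeed.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(command)
        if result.returncode != 0:
            return attempt
    return None


# Hypothetical usage; SomeFlakyITCase is a placeholder test name:
# run_until_failure(["mvn", "test", "-Dtest=SomeFlakyITCase"], max_runs=50)
```

Capping the number of runs keeps the loop from running forever when the failure does not reproduce locally, which is the signal to move on to CI with debug logs.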

The flink-yarn-tests need special handling: these tests spin up a MiniYarnCluster, which lives outside of the JVM. You have to specify JAVA_HOME in your IDE's run configuration for some of these tests to succeed. See flink-yarn-tests/README.md for further details.

Enabling CI Debug Logging and Test Repetition for bash-based e2e tests

There is currently no way to configure this from the Azure Pipelines UI, so you have to patch the test-running script so that it is configured for just the failing test, and commit that change.

Patch the test configuration
1. Set the tools/ci/log4j.properties rootLogger level to DEBUG
2. Edit the tools/ci/test_controller.sh script to run a test repeatedly
3. An example commit can be seen here: https://github.com/XComp/flink/commit/4360ed402859fd5d3359b323d5d92e5ed5b1ea31
Ensure you have an Azure DevOps account, which you can sign up for using GitHub
1. You can then set up your Flink fork to run in your personal Azure pipelines
Once a pipeline has run and reproduced the error, find the logs in the same way as above, and cross your fingers that something indicative of the failure is in them this time
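The logging part of the patch is a one-line change, which can be sketched as a text rewrite. The sketch below assumes the file sets the level with a Log4j 2 properties line of the form "rootLogger.level = INFO"; that key name is an assumption, so check the actual contents of tools/ci/log4j.properties before relying on it. The function name is hypothetical.

```python
import re

# Assumed key: "rootLogger.level" (Log4j 2 properties syntax).
_ROOT_LEVEL = re.compile(r"^(\s*rootLogger\.level\s*=\s*)\S+", re.MULTILINE)


def set_root_logger_level(properties_text, level="DEBUG"):
    """Return properties_text with the root logger level replaced by `level`."""
    new_text, count = _ROOT_LEVEL.subn(
        lambda match: match.group(1) + level, properties_text
    )
    if count == 0:
        raise ValueError("no 'rootLogger.level' entry found")
    return new_text
```

Raising an error when the key is missing avoids committing a patch that silently leaves the log level unchanged.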
