May 30, 2026
Flaky Test Triage Checklist for CI/CD Pipelines
A practical flaky test triage checklist for CI/CD pipelines, with steps to isolate app bugs, test issues, environment drift, and pipeline problems.
Flaky tests are expensive because they hide in plain sight. A red build might mean the product regressed, the test is brittle, the environment drifted, or the pipeline itself introduced noise. The hard part is not seeing the failure; it is deciding what to do next without wasting hours rerunning jobs and arguing about ownership.
This checklist is designed to help QA leads, DevOps engineers, engineering managers, and SDETs triage failures in a CI/CD pipeline quickly and consistently. It focuses on one question: is the failure caused by the app, the test, the environment, or the pipeline? If you can answer that reliably, you can protect CI/CD quality gates without turning every red build into a fire drill.
A good triage process does not try to make every test pass. It tries to make every failure explainable.
What counts as flaky in CI/CD
A flaky test is one that sometimes passes and sometimes fails without a meaningful product change. In practice, the term gets used more broadly, so it helps to separate a few different failure classes:
- App defect: the product behavior changed, and the test correctly caught it.
- Test defect: the assertion, locator, timeout, or setup is brittle.
- Environment defect: browser version, container image, network, clock skew, data state, or external dependency changed.
- Pipeline defect: retries, parallelization, caching, artifacts, or runner configuration caused the failure.
This distinction matters because the response is different for each. An app defect goes to engineering. A test defect goes to the automation owner. An environment defect goes to the platform or infra team. A pipeline defect often needs CI owners, because the test may be fine when run locally.
For baseline definitions, the general concepts of continuous integration and CI/CD are useful background, but this article is about the operational side, not the theory.
The fast triage flow
Use this as the first pass whenever a test fails unexpectedly.
1. Confirm the failure is reproducible
Before you inspect stack traces, ask whether the same test fails again under the same conditions.
Check:
- Did the test fail only once, or multiple times in the same branch or commit?
- Does a rerun on the same runner, image, and commit fail the same way?
- Does it fail in a clean environment, or only after a prior suite runs?
- Does it pass locally, fail in CI, or fail only in one pipeline stage?
If a rerun passes once and fails later with no code changes, you likely have a flaky condition or a pipeline timing issue. If it fails consistently on the same commit, the probability shifts toward a real app issue or a deterministic test bug.
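The rerun reasoning above can be sketched as a small classifier. This is a minimal illustration, not a standard tool: the `RunOutcome` type and the `classifyReruns` name are hypothetical, and the sketch assumes every outcome comes from the same test on the same commit, runner, and image.

```typescript
// Hypothetical types: one entry per run of the same test on the same commit.
type RunOutcome = "pass" | "fail";
type Reproducibility =
  | "insufficient-data"
  | "no-failure"
  | "consistent-failure"
  | "intermittent";

function classifyReruns(outcomes: RunOutcome[]): Reproducibility {
  // A single run cannot distinguish a flake from a regression.
  if (outcomes.length < 2) return "insufficient-data";
  const failures = outcomes.filter((o) => o === "fail").length;
  if (failures === 0) return "no-failure";
  // Every run failed: leans toward a real app issue or a deterministic test bug.
  if (failures === outcomes.length) return "consistent-failure";
  // Mixed results with no code change: leans toward flakiness or pipeline timing.
  return "intermittent";
}
```

The "insufficient-data" result is a deliberate choice: it keeps a single red run from being labeled either way before anyone has rerun it.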
2. Identify the failure surface
Classify the failure by where it appeared:
- Unit test, integration test, or end-to-end test
- API test or UI test
- Specific browser, operating system, or runner image
- One stage, multiple stages, or all branches
A UI-only failure that appears in Chrome but not Firefox often points to selector, rendering, or timing differences. An API test that fails only in CI may point to missing secrets, stub mismatch, or rate limiting.
3. Compare the failure to the last known good run
Look for the first bad run, not just the latest red one.
Compare:
- Test duration
- Screenshot or video changes
- Log ordering
- Network calls
- DOM changes
- Fixture or seed data changes
- Dependency versions
A change in test duration is often a clue. If a step that normally completes in 2 seconds now takes 9 seconds, the failure may be a timing mismatch rather than a business logic regression.
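One way to make the duration comparison concrete is to flag steps that slowed down by some ratio against the last known good run. The sketch below assumes you can export per-step timings from your test reporter; `StepTiming`, `findSlowSteps`, and the default threshold of 2x are illustrative choices, not values from any particular tool.

```typescript
// Hypothetical shape for per-step timings exported from a test report.
interface StepTiming {
  step: string;
  seconds: number;
}

// Returns the names of steps whose duration grew by at least `ratioThreshold`
// compared with the last known good run. Steps missing from the baseline are skipped.
function findSlowSteps(
  baseline: StepTiming[],
  current: StepTiming[],
  ratioThreshold = 2,
): string[] {
  const baselineByStep = new Map<string, number>();
  for (const t of baseline) baselineByStep.set(t.step, t.seconds);

  return current
    .filter((t) => {
      const before = baselineByStep.get(t.step);
      return before !== undefined && before > 0 && t.seconds / before >= ratioThreshold;
    })
    .map((t) => t.step);
}
```

With the example from the text, a step that went from 2 seconds to 9 seconds has a ratio of 4.5 and would be flagged, while a step that went from 1 second to 1.2 seconds would not.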
4. Decide whether the failure is deterministic or timing-sensitive
If the test fails at the same line every time, focus on the app or the test. If it fails at different lines, with different symptoms, focus on timing, environment, or shared state.
Common timing-sensitive signals include:
- element not found
- stale element reference
- timeout exceeded
- intercepted click
- expected condition not met
- intermittent 5xx or network errors
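The "same line versus different lines" heuristic and the signals above can be combined into a rough first-pass check. This is only a sketch: `FailureSample`, `TIMING_SIGNALS`, and `classifyFailurePattern` are hypothetical names, and the patterns are illustrative and would need tuning to the wording your framework actually emits.

```typescript
// Hypothetical record of one failed run; `location` might be "checkout.spec.ts:42".
interface FailureSample {
  location: string;
  message: string;
}

// Illustrative patterns mirroring the signals listed above (assumed wording).
const TIMING_SIGNALS: RegExp[] = [
  /element not found/i,
  /stale element/i,
  /timeout/i,
  /intercepted/i,
  /condition not met/i,
  /network error/i,
  /\bHTTP 5\d\d\b/i,
];

function classifyFailurePattern(
  samples: FailureSample[],
): "deterministic" | "timing-sensitive" | "no-data" {
  if (samples.length === 0) return "no-data";
  // Failures at different lines suggest timing, environment, or shared state.
  const distinctLocations = new Set(samples.map((s) => s.location)).size;
  const hasTimingSignal = samples.some((s) =>
    TIMING_SIGNALS.some((pattern) => pattern.test(s.message)),
  );
  return distinctLocations > 1 || hasTimingSignal ? "timing-sensitive" : "deterministic";
}
```

Treat the result as a hint for where to look first. A wrong locator can also produce "element not found" at the same line on every run, so a "timing-sensitive" label does not rule out a deterministic test bug.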
Triage checklist by failure source
A. Is the app likely broken?
Treat the app as the likely cause when the failure is repeatable on the same commit and supported by product evidence.
Check:
- Did a relevant code change land near the failing test?
- Does the failure happen in multiple test layers, not just E2E?
- Does the UI or API response clearly violate the assertion?
- Is the failure visible in local dev, staging, and CI?
- Do logs, traces, or screenshots show an actual product bug?
Signs of a real regression:
- Response payload changed unexpectedly
- Validation error now appears for valid input
- A backend endpoint returns a consistent 4xx or 5xx
- UI renders the wrong state regardless of timing
- Database records are missing, duplicated, or malformed
A useful discipline is to ask whether a human user would consider the behavior wrong even if the test did not exist. If the answer is yes, treat it as a product issue first.
B. Is the test itself brittle?
Test problems are common in flaky test triage, especially in UI automation.
Look for these patterns:
- Locators rely on dynamic IDs or CSS classes that change often
- Assertions depend on exact text formatting, timestamps, or ordering
- The test assumes a page is ready before it actually is
- Previous state leaks into the test from shared fixtures
- The test depends on UI details instead of stable user-facing signals
- Waits are hard-coded instead of condition-based
Common brittle patterns and better alternatives:
```typescript
// brittle: tied to a generated class
await page.locator('.btn.primary.x9k2').click();

// better: use role or text when stable
await page.getByRole('button', { name: 'Save' }).click();
```

```typescript
// brittle: fixed sleep
await page.waitForTimeout(5000);

// better: wait for the actual state
await expect(page.getByText('Saved successfully')).toBeVisible();
```
Brittleness is often introduced by one of these tradeoffs:
- Faster test authoring versus better selectors
- Wider coverage versus stronger isolation
- UI-only validation versus checking data at the API or database layer
If a test only fails when the UI layout changes but the behavior remains correct, the test is probably validating the wrong layer.
C. Is the environment the problem?
Environment issues are easy to miss because they often look like app or test failures.
Check for drift in:
- Browser version
- Node.js, Python, Java, or .NET runtime version
- Container image digest
- OS patches and fonts
- Time zone and locale
- Parallel test isolation
- Network reliability and DNS resolution
- Secrets, tokens, or certificate rotation
- External services, feature flags, and test doubles
A good signal that the environment is to blame is when the same test passes on one runner but fails on another with no code change.
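When one runner passes and another fails, a quick diff of what each runner reports about itself can narrow the search. The sketch below assumes each job records a flat key-value "fingerprint" (browser version, runtime version, image digest, time zone, and so on) as an artifact; the `EnvFingerprint` type and `diffFingerprints` function are hypothetical.

```typescript
// Hypothetical flat fingerprint a job could write out, e.g. { node: "20.11.1", tz: "UTC" }.
type EnvFingerprint = Record<string, string>;

// Returns the sorted keys whose values differ, including keys present on only one side.
function diffFingerprints(passing: EnvFingerprint, failing: EnvFingerprint): string[] {
  const keys = new Set<string>([...Object.keys(passing), ...Object.keys(failing)]);
  return Array.from(keys)
    .filter((key) => passing[key] !== failing[key])
    .sort();
}
```

An empty result does not prove the environments are identical. It only means that nothing you chose to record differs, which is a reason to record more, such as locale or font packages.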
Questions to ask:
- Did the runner image recently change?
- Did a dependency update happen in a base image or shared library?
- Are tests running with the same locale, time zone, and date format as local machines?
- Is the test data seeded the same way in every environment?
- Did a SaaS dependency change rate limits or response timing?
If your suite depends on real network calls, treat external instability as an environmental risk unless you explicitly control it with mocks, recorded responses, or service virtualization.
D. Is the pipeline itself introducing noise?
Sometimes the application and test are both fine, but the pipeline is not.
Inspect:
- Retry logic that hides the real failure pattern
- Parallel jobs competing for the same shared resource
- Cache invalidation problems
- Artifact or workspace contamination between jobs
- Build steps that mutate the working tree
- Different test ordering in CI versus local execution
- Resource starvation on the runner
Pipeline-specific failures often appear when a suite is stable in isolation but fails when run in the full matrix.
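To see how much retry logic hides, it can help to compare the failure rate the pipeline reports with the rate on first attempts. The sketch below assumes your CI results can be exported as an ordered list of attempts per test; `TestAttempts`, `RetryReport`, and `summarizeRetries` are hypothetical names.

```typescript
// Hypothetical export: attempts are in execution order, the last one is the reported result.
interface TestAttempts {
  test: string;
  attempts: ("pass" | "fail")[];
}

interface RetryReport {
  reportedFailureRate: number; // share of tests whose final attempt failed
  firstAttemptFailureRate: number; // share of tests whose first attempt failed
  passedOnlyAfterRetry: string[]; // tests that a retry turned green
}

function summarizeRetries(results: TestAttempts[]): RetryReport {
  // Ignore tests that never ran so they do not skew the rates.
  const ran = results.filter((r) => r.attempts.length > 0);
  if (ran.length === 0) {
    return { reportedFailureRate: 0, firstAttemptFailureRate: 0, passedOnlyAfterRetry: [] };
  }
  const failedFirst = ran.filter((r) => r.attempts[0] === "fail");
  const failedFinal = ran.filter((r) => r.attempts[r.attempts.length - 1] === "fail");
  const recovered = failedFirst.filter((r) => r.attempts[r.attempts.length - 1] === "pass");
  return {
    reportedFailureRate: failedFinal.length / ran.length,
    firstAttemptFailureRate: failedFirst.length / ran.length,
    passedOnlyAfterRetry: recovered.map((r) => r.test),
  };
}
```

A large gap between the two rates suggests the retry policy is absorbing flakiness that nobody is triaging.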
A classic example is shared test data. Two workers create or update the same user account, then one assertion sees the wrong state. Another example is a cleanup step that deletes artifacts before the final reporting step reads them.
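A common mitigation for the shared-account collision is to derive test data from the job and worker identity so that parallel workers never touch the same record. The helper below is a sketch under assumptions: `scopedTestEmail` is a hypothetical name, and how you obtain a job ID and worker index depends on your CI system and test runner.

```typescript
// Builds an email address unique to one CI job and one parallel worker.
// `jobId` and `workerIndex` are assumed to come from your CI and test runner.
function scopedTestEmail(base: string, jobId: string, workerIndex: number): string {
  // Replace anything outside letters and digits so the ID is safe in an address.
  const safeJobId = jobId.replace(/[^a-zA-Z0-9]/g, "-");
  // The .test top-level domain is reserved, so these addresses never reach a real mailbox.
  return `${base}+${safeJobId}-w${workerIndex}@example.test`;
}
```

The same idea applies to any shared record, such as order numbers, tenant names, or file paths: include the job and worker identity in the key.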
A practical decision tree for first responders
Use the following questions in order.
- Did the test fail once or repeatedly?
  - Once, it may be noise.
  - Repeatedly, continue.
- Did the same commit produce a clean rerun?
  - Yes, suspect flakiness or pipeline nondeterminism.
  - No, continue.
- Is the failure local to one test or one suite?
  - One test, suspect test brittleness or a product edge case.
  - Many tests, suspect environment, shared state, or pipeline changes.
- Does the failure map to a visible product bug?
  - Yes, route to the application team.
  - No, continue.
- Does it reproduce in a controlled environment?
  - Yes, likely app or test.
  - No, likely environment or pipeline.
- Does changing the browser, runner, or execution order change the outcome?
  - Yes, likely environment, timing, or test isolation.
  - No, likely app or a deterministic test issue.
If you cannot reproduce the issue outside the pipeline, do not spend too long treating it as a product regression.
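The decision tree can also be encoded so that first responders record their answers and get the same notes every time. This is a sketch of one possible encoding: the `TriageAnswers` fields and the `firstResponderTriage` function are hypothetical, and the returned strings simply restate the branches of the tree.

```typescript
// Hypothetical answers a first responder fills in, one per question in the tree.
interface TriageAnswers {
  failedRepeatedly: boolean;
  cleanRerunOnSameCommit: boolean;
  affectsManyTests: boolean;
  visibleProductBug: boolean;
  reproducesInControlledEnv: boolean;
  outcomeChangesWithBrowserRunnerOrOrder: boolean;
}

// Walks the questions in order and collects the suspects each answer points to.
function firstResponderTriage(a: TriageAnswers): string[] {
  if (!a.failedRepeatedly) return ["single failure: may be noise"];
  if (a.cleanRerunOnSameCommit) return ["suspect flakiness or pipeline nondeterminism"];

  const notes: string[] = [];
  notes.push(
    a.affectsManyTests
      ? "many tests: suspect environment, shared state, or pipeline changes"
      : "one test: suspect test brittleness or a product edge case",
  );
  if (a.visibleProductBug) {
    notes.push("route to the application team");
    return notes;
  }
  notes.push(a.reproducesInControlledEnv ? "likely app or test" : "likely environment or pipeline");
  notes.push(
    a.outcomeChangesWithBrowserRunnerOrOrder
      ? "likely environment, timing, or test isolation"
      : "likely app or a deterministic test issue",
  );
  return notes;
}
```

Returning a list rather than a single verdict keeps the later answers visible when they disagree with the earlier ones, which is itself useful triage information.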
What to collect before rerunning anything
A flaky test triage checklist is more effective when you capture the right evidence before the next rerun destroys it.
Collect:
- Commit SHA and branch name
- Pipeline job ID and stage
- Test name and suite name
- Runner image or container tag
- Browser and version
- Full stack trace
- Console logs
- Network logs or HAR file, if available
- Screenshots or video for UI tests
- Seed data or fixture version
- Recent dependency changes
If your CI system supports artifacts, make sure failed runs preserve enough detail to debug later. The goal is to answer the question, “What changed between the last passing run and the first failing run?”
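A lightweight way to enforce evidence capture is to check a failure record for gaps before anyone triggers a rerun. The sketch below is illustrative: `FailureEvidence`, `REQUIRED_EVIDENCE`, and `missingEvidence` are hypothetical names, and which fields count as required is an assumption you would adjust to your own runbook.

```typescript
// Hypothetical record assembled from CI metadata and uploaded artifacts.
interface FailureEvidence {
  commitSha?: string;
  branch?: string;
  jobId?: string;
  stage?: string;
  testName?: string;
  suiteName?: string;
  runnerImage?: string;
  browserVersion?: string;
  stackTrace?: string;
  artifactPaths?: string[]; // logs, screenshots, videos, HAR files
}

// Assumed minimum set; extend it to match your runbook.
const REQUIRED_EVIDENCE: (keyof FailureEvidence)[] = [
  "commitSha",
  "jobId",
  "testName",
  "runnerImage",
  "stackTrace",
  "artifactPaths",
];

// Returns the required fields that are absent or empty.
function missingEvidence(evidence: FailureEvidence): string[] {
  return REQUIRED_EVIDENCE.filter((field) => {
    const value = evidence[field];
    return value === undefined || value.length === 0;
  });
}
```

Running a check like this in the failure handler of a job, before artifacts expire, makes it harder for a rerun to erase the only copy of the evidence.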
How to debug pipeline failures without wasting time
Compare isolated execution with suite execution
Run the test alone, then as part of the full suite. If it passes alone but fails in sequence, look for setup leakage or resource collisions.
Useful checks:
- Order-dependent assertions
- Shared state in the database
- Reused browser contexts
- Leftover files in workspace or temp directories
- Global mocks not reset between tests
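When a test passes alone but fails after other tests, one possible way to locate the test that pollutes it is to bisect the preceding tests. The sketch below is a generic binary search, not a feature of any specific runner. It assumes a single polluter and deterministic behavior, and `victimFailsAfter` is a hypothetical callback that runs the given tests followed by the failing test and reports whether it failed.

```typescript
// Finds the one preceding test that makes the victim fail, assuming exactly one
// polluter and deterministic results. Returns undefined when the victim does not
// fail after the full list, or when the list is empty.
function findPolluter(
  priorTests: string[],
  victimFailsAfter: (subset: string[]) => boolean,
): string | undefined {
  if (!victimFailsAfter(priorTests)) return undefined;
  let candidates = priorTests;
  while (candidates.length > 1) {
    const mid = Math.floor(candidates.length / 2);
    const firstHalf = candidates.slice(0, mid);
    // If the victim still fails after the first half, the polluter is in it;
    // otherwise, under the single-polluter assumption, it is in the second half.
    candidates = victimFailsAfter(firstHalf) ? firstHalf : candidates.slice(mid);
  }
  return candidates[0];
}
```

In practice the callback would shell out to your test runner with an explicit file or test order. If the failure needs two tests in combination, or is itself intermittent, this search can point at the wrong test, so confirm the result by running the suspect and the victim together.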
Inspect timing, not just failure text
Many flaky failures are timing failures in disguise. A click that used to work can fail if the element is present but not interactable yet.
In Playwright, prefer explicit state checks over sleeps:
```typescript
await page.goto('/checkout');
await expect(page.getByRole('heading', { name: 'Checkout' })).toBeVisible();
await page.getByRole('button', { name: 'Place order' }).click();
```
If the test still fails, inspect whether the button is disabled, covered by a loader, or replaced by a rerender.
Look for hidden dependency changes
A test can break because of dependencies you do not control directly:
- Browser auto-updated in the runner
- Shared library changed via a floating version
- API contract changed in another service
- CSS or DOM structure changed due to a component refactor
- Time-sensitive assertions started crossing midnight or timezone boundaries
If you do not pin versions in CI, you may be debugging an invisible moving target.
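For npm-style manifests, a quick check for floating ranges can make that moving target visible. This is a minimal sketch: `findFloatingVersions` is a hypothetical helper, and it treats anything that is not an exact `major.minor.patch` version as floating, which is stricter than some teams need. A lockfile installed with `npm ci` already pins resolved versions, so this matters most for manifests or images built without one.

```typescript
// Returns the sorted names of dependencies whose declared version is not an
// exact semver version (for example "^4.17.21", "~1.2.3", "*", or "latest").
function findFloatingVersions(dependencies: Record<string, string>): string[] {
  // Exact version with optional prerelease and build metadata, e.g. "1.44.0-beta.1".
  const exactVersion = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;
  return Object.keys(dependencies)
    .filter((pkg) => !exactVersion.test((dependencies[pkg] ?? "").trim()))
    .sort();
}
```

The same idea extends beyond packages: container tags such as `latest` and unpinned browser channels are floating versions too, even though this sketch does not inspect them.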
Reproduce with the same runner whenever possible
A local reproduction that uses a different browser, OS, or Node version can mislead you. Use the same Docker image, same browser channel, same env vars, and same secret configuration if possible.
If you are using GitHub Actions, a simple matrix can help isolate whether the issue is runner-specific:
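```yaml
name: smoke
on: [push]
jobs:
  e2e:
    strategy:
      fail-fast: false   # let every runner finish so results can be compared
      matrix:
        os: [ubuntu-22.04, ubuntu-24.04]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test -- --grep "checkout"
```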
How to decide whether to quarantine or fix
Not every flaky test should block the release train forever, but unbounded flakiness is just deferred pain.
Use these rules of thumb:
- Quarantine immediately if the test blocks unrelated merges and you do not yet know the cause.
- Fix immediately if the failure is deterministic and points to a real regression.
- Refactor soon if the test is fragile but still covers valuable behavior.
- Delete or replace if the test is low-value, redundant, or impossible to stabilize.
A useful policy is to quarantine only with an owner and an expiration date. Otherwise, the suite accumulates ignored reds that erode trust in CI/CD quality gates.
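The owner-and-expiration policy is easy to check automatically. The sketch below assumes quarantined tests are tracked in a list you control, such as a JSON file in the repository; `QuarantineEntry` and `quarantineViolations` are hypothetical names, and dates are assumed to be ISO `YYYY-MM-DD` strings.

```typescript
// Hypothetical entry in a quarantine list kept alongside the test suite.
interface QuarantineEntry {
  test: string;
  owner?: string;
  expiresOn?: string; // ISO date, e.g. "2026-06-15"
}

// Returns tests that are quarantined without an owner, without an expiration
// date, or past their expiration date. `today` is passed in as an ISO date.
function quarantineViolations(entries: QuarantineEntry[], today: string): string[] {
  return entries
    .filter((entry) => {
      if (!entry.owner || !entry.expiresOn) return true;
      // ISO YYYY-MM-DD strings sort chronologically, so string comparison is enough.
      return entry.expiresOn < today;
    })
    .map((entry) => entry.test);
}
```

Failing a scheduled job when this list is not empty is one way to keep quarantines from becoming permanent.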
Checklist you can paste into a triage runbook
Failure classification
- Record commit SHA, job ID, suite name, and runner image
- Confirm whether the failure is single-run or repeatable
- Compare against the last known good run
- Classify the failure as app, test, environment, or pipeline
- Preserve logs, screenshots, videos, and network traces
App checks
- Does the failure reproduce outside CI?
- Is there a relevant product change in the commit range?
- Do logs or traces show a real defect?
- Does the failure appear across multiple browsers or layers?
Test checks
- Are selectors stable and user-facing?
- Are waits condition-based instead of time-based?
- Is the assertion validating behavior, not layout noise?
- Does the test depend on shared or mutable state?
- Does it pass in isolation?
Environment checks
- Is the browser/runtime version pinned?
- Did the runner image or base container change?
- Are locale, timezone, and seed data consistent?
- Are external dependencies mocked or controlled?
- Does the issue occur on all runners or only one?
Pipeline checks
- Does the failure depend on test order or parallelism?
- Are artifacts preserved until the final report step?
- Are caches invalidated correctly?
- Do retry policies hide the actual failure rate?
- Are shared resources isolated per job?
Reducing triage time over the long term
Even the best flaky test triage checklist is a stopgap; the better outcome is needing it less often.
Invest in a few structural improvements:
- Use stable selectors, preferably roles, labels, or data attributes that reflect user intent
- Isolate test data per run or per job
- Pin browser and runtime versions in CI
- Treat external APIs as dependencies that need control, not assumptions
- Capture artifacts on failure automatically
- Separate fast smoke checks from deeper end-to-end coverage
- Review flakes as a product of the system, not just the test author
For browser-heavy suites, tools that reduce locator brittleness can help. Endtest is one relevant option, especially when teams want an agentic AI test automation platform with low-code and no-code workflows that can keep runs readable while reducing locator maintenance. Its self-healing approach is useful when UI changes would otherwise turn the same test red for reasons that are not user-visible. Endtest also logs healed locators transparently, which can make triage faster because reviewers can see what changed instead of guessing.
If you want to understand how that works under the hood, the self-healing documentation explains the recovery behavior in more detail.
That said, self-healing is not a substitute for good test design. It helps when locators drift, but it does not excuse poor isolation, unstable data, or environment entropy. A healthy strategy combines resilient automation, strong observability, and disciplined CI/CD ownership.
When to escalate
Escalate the issue to the appropriate owner when:
- The failure is repeatable and looks like a product regression
- The same flaky pattern affects multiple repositories or teams
- The environment cannot be stabilized by the owning team
- The pipeline is producing inconsistent results across identical runs
- The test is blocking release decisions and needs a short-term mitigation
Escalation is more effective when you bring evidence, not just a red badge. Include the exact failure mode, the first bad run, and the reproduction matrix you already tried.
Final takeaway
Flaky tests are not just annoying; they distort release decisions. A red pipeline should tell you something specific. If it does not, the problem is usually one of four things: the app changed, the test is brittle, the environment drifted, or the pipeline added noise.
A good flaky test triage checklist makes that distinction fast enough to protect delivery without normalizing reruns as a debugging strategy. Start with reproducibility, preserve evidence, compare the failure surface, and make an explicit call about ownership. Over time, that discipline improves release reliability and keeps CI/CD quality gates meaningful.