fix(dora): count failed deployments, not incidents, in Change Failure Rate#8890

Open
spiffaz wants to merge 1 commit into apache:main from spiffaz:fix/dora-cfr-count-failed-deployments

spiffaz commented May 28, 2026

What does this PR do?

The Change Failure Rate dashboard (grafana/dashboards/DORADetails-ChangeFailureRate.json) computes the rate as sum(has_incident) / count(deployment_id), where has_incident is count(distinct i.id) per deployment, i.e. the number of incidents a deployment caused.

When one deployment causes more than one incident, it is counted once per incident in the numerator, so the reported rate is higher than the DORA definition of Change Failure Rate allows: the share of deployments that caused a failure, not the number of incidents per deployment.

This makes has_incident a 0/1 flag (CASE WHEN count(distinct i.id) > 0 THEN 1 ELSE 0 END) so each failed deployment counts once, and relabels the companion stat panel field from "incident count" to "failed deployment count" so the displayed counts stay consistent with the rate.
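
The difference between the two numerators can be sketched with an in-memory SQLite database. The table names (`deployments`, `incidents`), their columns, and the sample rows are hypothetical and do not reproduce the dashboard's actual schema or query; only the `CASE WHEN count(distinct i.id) > 0 THEN 1 ELSE 0 END` expression comes from this PR.

```python
import sqlite3

# Hypothetical schema and rows for illustration only; the real dashboard
# query and its tables are not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deployments (deployment_id TEXT PRIMARY KEY);
    CREATE TABLE incidents (id TEXT PRIMARY KEY, deployment_id TEXT);
    INSERT INTO deployments VALUES ('d1'), ('d2'), ('d3'), ('d4');
    -- d1 caused two incidents, d2 caused one, d3 and d4 caused none.
    INSERT INTO incidents VALUES ('i1', 'd1'), ('i2', 'd1'), ('i3', 'd2');
""")


def change_failure_rate(has_incident_expr: str) -> float:
    """Return sum(has_incident) / count(deployment_id) for the given expression."""
    query = f"""
        SELECT 1.0 * SUM(has_incident) / COUNT(deployment_id)
        FROM (
            SELECT d.deployment_id, {has_incident_expr} AS has_incident
            FROM deployments d
            LEFT JOIN incidents i ON i.deployment_id = d.deployment_id
            GROUP BY d.deployment_id
        )
    """
    return conn.execute(query).fetchone()[0]


# Before: has_incident is the number of incidents per deployment.
cfr_before = change_failure_rate("COUNT(DISTINCT i.id)")

# After: has_incident is a 0/1 flag, so each failed deployment counts once.
cfr_after = change_failure_rate(
    "CASE WHEN COUNT(DISTINCT i.id) > 0 THEN 1 ELSE 0 END"
)

print(f"before: {cfr_before:.2f}, after: {cfr_after:.2f}")
```

With these sample rows, the old expression sums three incidents over four deployments, while the flag counts two failed deployments over the same four.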

Validation

Against a production dataset, a 30-day window had 6 incidents spread across 4 distinct deployments out of 23 total:

  • Before: 6 / 23 = 26.1%
  • After: 4 / 23 = 17.4%
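
The two percentages can be reproduced from the reported counts alone. This is a minimal arithmetic check; the variable names are illustrative, and the counts are the ones quoted in the validation above.

```python
# Counts reported for the 30-day validation window.
total_deployments = 23
incidents = 6            # old numerator: every incident counted
failed_deployments = 4   # new numerator: each failed deployment counted once

cfr_before = incidents / total_deployments
cfr_after = failed_deployments / total_deployments

print(f"Before: {cfr_before:.1%}")  # Before: 26.1%
print(f"After: {cfr_after:.1%}")    # After: 17.4%
```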

Only the SQL in the two CFR panels changed; no datasource, layout, or variable changes.

… Rate

The Change Failure Rate panels summed `has_incident`, which was the count of
distinct incidents per deployment. A deployment that caused multiple incidents
was counted multiple times, inflating CFR above the DORA definition
(deployments that caused a failure / total deployments).

Make `has_incident` a 0/1 flag so each deployment counts once regardless of how
many incidents it caused, and relabel the companion stat from "incident count"
to "failed deployment count" so the displayed counts match the rate.

Validated against a production dataset: a 30-day window with 6 incidents across
4 deployments out of 23 total dropped from 26.1% to the correct 17.4%.

Signed-off-by: Spiff Azeta <spiffazeta@yahoo.com>
dosubot (bot) added the size:XS, component/ext, and pr-type/bug-fix labels May 28, 2026
spiffaz closed this May 28, 2026
spiffaz deleted the fix/dora-cfr-count-failed-deployments branch May 28, 2026 17:51
spiffaz restored the fix/dora-cfr-count-failed-deployments branch May 28, 2026 17:52
spiffaz reopened this May 28, 2026
klesh requested a review from Startrekzky May 30, 2026 03:42

klesh commented May 30, 2026

@Startrekzky Please take a look when you find time, thx.
