Most PR quality gates fail the same way. Someone turns them on with strict thresholds across every domain on day one. The next ten PRs all fail. The team adds a [quality-bypass] commit prefix. The gates now exist as theatre.
A useful gate is calibrated to your codebase today, not to an ideal codebase. It tightens over time.
What a PR quality gate is, exactly
A PR quality gate is a check that runs on every pull request, scores the branch against configured thresholds, and fails the PR if any enabled domain scores below its threshold.
Three pieces:
- What to score. A domain or set of domains (security, testing, architecture and others).
- Where to set the line. A threshold between 0 and 100 per domain.
- What failure means. An advisory comment, a failing GitHub check or a merge block.
That is the whole mechanism. The interesting part is not the mechanism; it is the calibration.
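To make the three pieces concrete, here is a minimal sketch in Python. It is not any particular tool's API: the names `DomainGate`, `FailureMode`, `failed_domains` and `merge_allowed` are hypothetical, and it assumes scores arrive as a plain mapping from domain name to a number between 0 and 100.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    """What a failure means (hypothetical names for the three options)."""
    ADVISORY = "advisory"  # leave a comment only
    CHECK = "check"        # report a failing status check
    BLOCK = "block"        # prevent the merge


@dataclass(frozen=True)
class DomainGate:
    domain: str     # what to score, e.g. "security"
    threshold: int  # where to set the line, 0-100; 0 means disabled


def failed_domains(scores: dict[str, float], gates: list[DomainGate]) -> list[str]:
    """Return the enabled domains whose score is below the threshold."""
    failed = []
    for gate in gates:
        if gate.threshold <= 0:
            continue  # disabled domains are never evaluated
        # Assumption: every enabled domain has a score; a missing one raises KeyError.
        if scores[gate.domain] < gate.threshold:
            failed.append(gate.domain)
    return failed


def merge_allowed(failed: list[str], mode: FailureMode) -> bool:
    """Only BLOCK mode turns a failure into a refused merge."""
    return not failed or mode is not FailureMode.BLOCK
```

For example, with security gated at 70 and testing disabled, a branch scoring 65 on security fails only the security domain, and whether that stops the merge depends entirely on the chosen mode.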
The first mistake: strict thresholds on day one
Teams turn on gates because they want better code. They set thresholds at "good" levels. The existing codebase is not at those levels, which means every PR looks like a regression, even one that improves the file it touches.
Symptoms: every PR fails, complaints pile up in Slack, a bypass pattern emerges, and the gate gets switched off two weeks later.
Better approach: set the threshold at or just below the current baseline. The gate then blocks regressions, not the code you already have. When the baseline improves, raise the threshold.
The second mistake: too many domains at once
Enabling all seven or eight domains on day one overwhelms the feedback. A developer sees a PR failing on four domains with eight findings each and has no idea where to start. Most developers fix the easiest finding, submit again, and watch the same four domains fail.
Better approach: enable two or three domains at the start. Security and architecture are usually the right pick. Add domains one at a time as the team gets comfortable with the feedback.
Calibration that actually works
A pattern that survives contact with real teams:
- Measure current state. Run the tool on your main branch. Note the score in each domain.
- Enable gates in advisory mode first. For two weeks, gates run but do not block merging. The team sees feedback without consequences.
- Tune thresholds to baseline. Pick thresholds 5-10 points below your current score so small fluctuations do not trip the gate. If you are at 72 on security, set the gate at 65.
- Switch to blocking mode. Require the check in branch protection. A failing gate now prevents merge.
- Ratchet upward. Once the team is comfortable and baseline has improved, raise the thresholds. Repeat.
Two weeks advisory, then blocking, then ratchet. That is the rhythm.
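The tuning and ratchet steps are simple arithmetic, which a short Python sketch can show. The function names `calibrate` and `ratchet` are hypothetical, and the default margin of 7 is just one value inside the 5-10 point range, chosen so the security example (72 becomes 65) works out.

```python
import math


def calibrate(baseline: dict[str, float], margin: int = 7) -> dict[str, int]:
    """Set each threshold `margin` points below the measured baseline score."""
    return {
        domain: max(0, math.floor(score) - margin)
        for domain, score in baseline.items()
    }


def ratchet(
    thresholds: dict[str, int],
    baseline: dict[str, float],
    margin: int = 7,
) -> dict[str, int]:
    """Raise thresholds where the baseline has improved; never lower one."""
    proposed = calibrate(baseline, margin)
    return {
        domain: max(current, proposed.get(domain, current))
        for domain, current in thresholds.items()
    }


if __name__ == "__main__":
    thresholds = calibrate({"security": 72})
    print(thresholds)                               # {'security': 65}
    print(ratchet(thresholds, {"security": 80}))    # {'security': 73}
    print(ratchet(thresholds, {"security": 68}))    # {'security': 65}
```

Taking the maximum in `ratchet` is the design choice that matters: a temporary dip in the baseline leaves the threshold where it was, so the gate only ever tightens.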
Where to set the line per domain
Opinions vary. A sensible starting set for most codebases:
- Security at 70. The one domain worth being strict on from day one.
- Architecture at 60. Captures circular dependencies and coupling issues.
- Maintainability at 60. File size and complexity signals.
- Testing at 0 (disabled). Most projects start with low coverage; gating on testing from day one creates immediate frustration. Turn it on once coverage is above 40%.
- Performance, dependencies, accessibility, documentation: 0 initially. Enable whichever one matters most to your product.
These are starting lines, not targets. Your mileage will vary.
What to do when a gate fails
The gate should tell you which domain failed, by how much, and which specific findings contributed. A gate that fails without actionable output is itself a failure mode.
The developer reads the output, then either fixes the finding or justifies it. The justification mechanism matters. If there is no way to say "this is intentional, approve the exception", every gate becomes a friction point. A useful system lets senior reviewers override a gate with a comment explaining why.
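A sketch of what actionable output and an override rule could look like, in Python. Everything here is hypothetical: the `Finding` fields, the report layout, and the rule that an override needs a senior reviewer plus a non-empty comment are one reading of the paragraph above, not a description of an existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    domain: str
    path: str
    line: int
    message: str


def format_failure(
    domain: str, score: int, threshold: int, findings: list[Finding]
) -> str:
    """Say which domain failed, by how much, and which findings contributed."""
    report = [f"{domain}: {score} is {threshold - score} below threshold {threshold}"]
    for finding in findings:
        if finding.domain == domain:
            report.append(f"  {finding.path}:{finding.line} {finding.message}")
    return "\n".join(report)


def override_accepted(reviewer_is_senior: bool, comment: str) -> bool:
    """An override counts only if a senior reviewer explains why."""
    return reviewer_is_senior and bool(comment.strip())


if __name__ == "__main__":
    findings = [Finding("security", "app/db.py", 14, "raw SQL string")]
    print(format_failure("security", 62, 70, findings))
```

Requiring the comment keeps the exception on record, so an override is a documented decision rather than a silent bypass.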
Common failure modes
A few patterns that waste teams' time:
- No per-file context. The gate says "security dropped by 3 points" with no indication of which file or line. The developer cannot act. The gate feels arbitrary.
- Gates on generated code. A generated file that includes 500 lines of "dangerous API usage" fires the gate on every change. Solution: exclude generated paths from scoring.
- Gates that measure unrelated work. A developer fixes a typo in a comment and the gate fails because an unrelated domain has dropped since the baseline was taken. Solution: measure the delta the PR introduces, not the absolute state of files it did not change.
- Gates that never block. Advisory mode forever means the gate is decorative. Commit to blocking once calibrated, or remove the gate entirely.
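Two of these fixes, excluding generated paths and measuring the delta, are easy to sketch in Python. The glob patterns and the function names `scoreable` and `regressed` are hypothetical. The sketch assumes you can score both the PR head and the commit it branched from.

```python
from fnmatch import fnmatch

# Hypothetical patterns for generated code; adjust to your repository layout.
GENERATED_PATTERNS = ["generated/*", "*_pb2.py", "*.min.js"]


def scoreable(paths: list[str], patterns: list[str] = GENERATED_PATTERNS) -> list[str]:
    """Drop generated files so they never contribute findings."""
    return [
        path for path in paths
        if not any(fnmatch(path, pattern) for pattern in patterns)
    ]


def regressed(
    base_scores: dict[str, float],
    head_scores: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Domains where the PR head scores lower than its base by more than `tolerance`."""
    return sorted(
        domain
        for domain, head in head_scores.items()
        if domain in base_scores and base_scores[domain] - head > tolerance
    )
```

Comparing head against base means a comment typo fix produces a delta of zero in every domain, so it cannot fail the gate regardless of how the rest of the codebase scores.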
The goal
The goal is not perfection on every PR. It is steady, measurable improvement over time with low friction. The gate blocks genuine regressions. It does not block routine work. Thresholds tighten as the codebase improves. Every failed gate teaches the team something specific.
That loop is what makes PR gates valuable. Without calibration, they just make merging annoying.