Run the same scanner over the same commit at 9am. Run it again at 5pm. If the findings differ, the analysis is not deterministic. If they match, line for line, file for file, severity for severity, it is.
Sounds trivial. It is not. Most teams that adopt AI code review discover by accident that the same PR gets a different score on a re-run, and only then start asking what the word "deterministic" actually means.
This piece is the explanation I wish I had read before we built Implera's analysis pipeline.
## The definition
A code analysis is deterministic when, given identical inputs, it always produces identical outputs. Inputs include the source files, the configuration and the rule set. Outputs include the findings, the locations, the severities and any score derived from them.
That is the whole definition. Same in, same out. No clock, no random seed, no model temperature.
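In code, the contract is a pure function over an explicit input. A minimal sketch; the type names are illustrative, not taken from any particular tool:

```ts
// Everything that can influence the result is named in the input.
// No clock, no network, no random seed, no hidden state.
interface AnalysisInput {
  files: Map<string, string>; // path -> source contents
  config: Record<string, unknown>;
  rulesetVersion: string; // pinned: the ruleset is part of the input
}

interface Finding {
  file: string;
  line: number;
  rule: string;
  severity: "low" | "medium" | "high";
}

// Deterministic: identical AnalysisInput always yields identical Finding[].
type DeterministicAnalysis = (input: AnalysisInput) => Finding[];
```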
The opposite is non-deterministic analysis: same input, possibly different output. Probabilistic systems sit here. So do scanners that depend on hidden state (network calls to an evolving database, a cache that expires, a remote rule pack that updated overnight).
Both have a place. They are not interchangeable.
## What makes a tool deterministic
Three properties have to hold:
- Pure functions over the input. The tool reads the files and produces results from them alone. No environment leakage. No "we noticed the user has feature X enabled" behaviour.
- Pinned rule set. The rules that fire today fire tomorrow. If the ruleset updates, the version is part of the input, not a hidden variable.
- Stable ordering. Findings are emitted in the same order every run. Sort by file, then by line, then by rule. No "first to finish in the worker pool wins".
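The third property is the one parallel pipelines break most often. A comparator sketch, reusing the `Finding` shape from above:

```ts
// Plain code-unit comparison, not localeCompare, which can vary by
// environment -- exactly the hidden state we are trying to exclude.
function byCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Canonical ordering: file, then line, then rule. Worker scheduling can
// no longer leak into the output.
function sortFindings(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) =>
      byCodeUnits(a.file, b.file) || a.line - b.line || byCodeUnits(a.rule, b.rule)
  );
}
```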
Tools that satisfy these include ESLint, Semgrep, Biome, Ruff, and most AST walkers. The classic static program analysis family is deterministic by design.
Tools that do not satisfy them include LLM-based reviewers, anything that calls a remote rule database without pinning, and any pipeline that runs rules in parallel and stops at the first error.
## Why it matters for gating
A quality gate decides whether a pull request can merge. It needs to be a function of the code, not a roll of the dice.
Consider a PR that changes 12 files. The deterministic pipeline scores it 74. The team's threshold is 70. It merges.
Now imagine the analysis was non-deterministic. The score on Tuesday is 74. The same code re-run on Wednesday is 67. Same diff, same dependencies, same rules. The author has done nothing. The PR is now blocked.
That is not a quality gate. That is a coin flip with extra steps.
Determinism is the property that makes the gate fair. It also makes failures debuggable: if the score dropped, you can point at the specific finding that caused it and the specific change that introduced it. Non-deterministic systems leave you arguing with a number that moved on its own.
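A gate over deterministic findings is itself a pure function. A sketch, again reusing `Finding`; the penalty weights and the threshold are invented for illustration:

```ts
// Hypothetical scoring: the score depends only on the findings, so the
// same code always produces the same merge decision.
const PENALTY = { low: 1, medium: 3, high: 10 } as const;

function scorePr(findings: Finding[], base = 100): number {
  return findings.reduce((s, f) => s - PENALTY[f.severity], base);
}

function canMerge(findings: Finding[], threshold = 70): boolean {
  return scorePr(findings) >= threshold;
}
```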
## What deterministic analysis is good at
It excels at pattern-shaped problems.
- Secret detection. A hardcoded AWS key, a GitHub token, a private RSA key. The pattern is fixed; either the file matches or it does not.
- Dangerous APIs. `eval`, `child_process.exec` with concatenated input, raw SQL strings with template literals.
- Known-bad imports. Deprecated libraries, packages with known CVEs.
- Structural rules. Cyclomatic complexity above a threshold, files above a size limit, circular import cycles.
- Lockfile and dependency audits. Tools like OSV-Scanner take a lockfile and a CVE database and emit a stable list of vulnerabilities.
The common thread: each finding has a clear yes or no answer. Either the rule fires or it does not.
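A toy secret-detection rule shows the shape. The regex is the well-known AWS access key ID prefix pattern, a simplified stand-in for what real rule packs ship:

```ts
// Fixed pattern, binary outcome: a line either matches or it does not.
const AWS_ACCESS_KEY = /AKIA[0-9A-Z]{16}/;

function findSecrets(file: string, source: string): Finding[] {
  return source.split("\n").flatMap((text, i) =>
    AWS_ACCESS_KEY.test(text)
      ? [{ file, line: i + 1, rule: "hardcoded-secret", severity: "high" as const }]
      : []
  );
}
```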
## What it is not good at
Anything that requires reading the code with intent.
- Is this test actually testing something? A test file might satisfy a "tests exist" rule and still assert nothing meaningful. A pattern matcher cannot tell.
- Is this abstraction worth it? A class with three methods is fine in some contexts and over-engineered in others. Rules cannot weigh context.
- Is this README accurate? The file exists. The headings look right. Whether the prose matches what the code does is beyond a regex.
- Does this naming make sense? `processData` is technically valid. Whether it is a good name depends on what it does.
These are the cases where deterministic analysis hits its ceiling and AI-assisted review starts to earn its place. We covered the trade-offs in Static Analysis vs AI Code Review.
## A concrete example
Here is what deterministic and non-deterministic look like side by side.
```ts
// app/auth.ts
const ADMIN_TOKEN = "tok_live_8f3a9c2b1d4e5f6a7b8c";
function isAdmin(token) {
  return token === ADMIN_TOKEN;
}
```
A deterministic scanner will report:
- File: `app/auth.ts`, line 2, rule `hardcoded-secret`, severity `high`.
- File: `app/auth.ts`, line 4, rule `weak-comparison-secret`, severity `medium` (depending on rule pack).
Run it 1,000 times. You get the same two findings, same lines, same severities.
An LLM reviewer asked "is there anything wrong with this file?" might mention the hardcoded token, or the parameter type, or the function name, or all three, or none. It might rate the issue as high one run and medium the next. Each run is plausible. None is reproducible.
Both kinds of output have value. Only one can drive a gate.
## Determinism in continuous integration
CI is where determinism pays off most. A few practical implications:
- Reproducible failures. When a build fails, an engineer can re-run it locally and see the same failure. Non-deterministic checks waste hours on "works on my machine" investigations.
- Cacheable results. If the inputs have not changed, the result has not changed. You can skip re-running the scanner (a cache-key sketch follows this list). This is how Biome and ESLint stay fast on monorepos.
- Auditable history. Auditors can re-run last quarter's analysis on last quarter's commit and verify the findings match the report. Non-deterministic systems cannot offer this.
- Stable trend lines. A 90-day chart of your codebase score only means something if the same code produces the same score. Otherwise the line is noise on top of signal.
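The caching point falls straight out of the definition: if the result is a pure function of the inputs, a hash of the inputs is a sound cache key. A sketch using Node's built-in crypto module:

```ts
import { createHash } from "node:crypto";

// Same inputs, same hash, same (cached) result. Paths are sorted so the
// key does not depend on filesystem traversal order.
function cacheKey(
  files: Map<string, string>,
  configJson: string,
  rulesetVersion: string
): string {
  const h = createHash("sha256");
  const sorted = [...files.entries()].sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
  for (const [path, contents] of sorted) {
    h.update(path).update("\0").update(contents).update("\0");
  }
  return h.update(configJson).update("\0").update(rulesetVersion).digest("hex");
}
```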
The team that learns this the hard way usually does so after a PR quality gate starts blocking merges for reasons nobody can reproduce.
## How to tell if your tool is deterministic
A few quick checks:
- Run the scanner twice on the same commit. Diff the output. If anything moves, you have non-determinism somewhere (this check is scripted after the list).
- Check the rule pack source. Is the version pinned in your config or pulled live from a server?
- Check for network dependencies. Does the scanner call out to a service mid-run? If yes, the service is part of your input and should be versioned too.
- Check parallelism behaviour. Does the order of findings change run to run? That is non-determinism in the output even if the set of findings matches. Sort everything before comparing.
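The first check is worth automating in CI. A sketch; `my-scanner` and its flags are placeholders for whatever tool you actually run:

```ts
import { execFileSync } from "node:child_process";

// Run the scanner twice on the same tree and compare byte for byte.
function sameOutputTwice(cmd: string, args: string[]): boolean {
  const first = execFileSync(cmd, args, { encoding: "utf8" });
  const second = execFileSync(cmd, args, { encoding: "utf8" });
  return first === second;
}

if (!sameOutputTwice("my-scanner", ["--format=json", "."])) {
  console.error("Output differs between runs: non-determinism somewhere.");
  process.exit(1);
}
```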
If a tool fails these checks, that does not mean it is bad. It means it is not the right tool for a gating decision. Use it for advisory feedback, not for blocking merges.
## The deterministic-first principle
The pattern that has worked for us:
- Deterministic analysis runs first. It produces the score, the gates, the audit trail. Fast, free, repeatable.
- AI review runs second. It refines findings, adds context and explanation, catches the things rules cannot. Advisory, not gating.
- Humans review third. For intent, architecture and judgement.
The order matters. Determinism is the foundation; everything probabilistic sits on top of it. Reverse the order and the foundation is sand.
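Wired together, the layering might look like this. The sketch reuses the helpers from earlier; the three `declare`d hooks are placeholders for your own tooling, not Implera's API:

```ts
declare function runDeterministicAnalysis(input: AnalysisInput): Finding[];
declare function requestAiReview(input: AnalysisInput, findings: Finding[]): Promise<void>;
declare function requestHumanReview(input: AnalysisInput): Promise<void>;

async function reviewPullRequest(input: AnalysisInput) {
  // Layer 1: deterministic. Produces the score, the gate, the audit trail.
  const findings = sortFindings(runDeterministicAnalysis(input));
  const gate = canMerge(findings);

  // Layers 2 and 3: advisory. They enrich the review; they never flip the gate.
  await requestAiReview(input, findings);
  await requestHumanReview(input);

  return { gate, findings };
}
```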
For a longer treatment of how this layering works in practice, see what makes a good quality gate.
## The bottom line
Deterministic code analysis is the boring, reliable workhorse of code quality. It is the part that gives you a number you can trust, a finding you can reproduce, and a gate that fails for the same reason on Tuesday as it did on Monday.
If you cannot reproduce the result, you do not have analysis. You have an opinion.