A team I spoke to last month had a dashboard with a single green number on it: Maintainability Index 84. Everyone trusted it. Nobody could explain how it was calculated, what would move it, or why a file they all hated to touch still scored 79.
That is the Maintainability Index in one anecdote. It feels authoritative, it fits in a widget, and it tells you almost nothing you can act on.
Where the number comes from
The Maintainability Index is a formula from the early 1990s. It combines three older metrics into a single value: Halstead Volume (a measure derived from operator and operand counts), cyclomatic complexity, and lines of code. The original output was an open-ended number; most modern tools rescale it to 0 to 100, where higher is better.
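For readers who have never seen the formula written down, here is a minimal sketch of the commonly cited three-input variant, rescaled to 0 to 100. The function and parameter names are ours, the sample inputs are made up, and individual tools vary the details, so treat it as an illustration and not as any one tool's implementation.

```python
import math


def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: float,
                          lines_of_code: int) -> float:
    """Commonly cited three-input formula, rescaled to the 0-100 range."""
    if halstead_volume <= 0 or lines_of_code <= 0:
        raise ValueError("volume and lines of code must be positive")
    raw = (171
           - 5.2 * math.log(halstead_volume)     # natural log of Halstead Volume
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))     # natural log of line count
    # Clamp at zero and rescale so that the maximum raw value (171) maps to 100.
    return max(0.0, raw * 100 / 171)


# Hypothetical inputs for a small function.
print(f"{maintainability_index(1000.0, 5, 40):.1f}")
```

Three inputs go in, two of them through a logarithm, and one number comes out. Nothing in the output tells you which input moved it.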
Microsoft's implementation is the one most developers meet, because it ships inside Visual Studio. Their code metrics reference documents the formula and the colour bands, which apply to the rescaled 0 to 100 value: green from 20 to 100, yellow from 10 to 19, red from 0 to 9. In Python land, Radon computes the same family of metrics and is a common choice for running them in CI.
So far, so reasonable. The problem is not the maths. The problem is what the maths leaves out and what it rewards.
Why one number misleads
It blends signals that should stay separate. A file can score well because it is short, even though every function in it is a tangle. Another can score badly because it is long, even though it is a clean list of well-named handlers. The index averages these into a single value, and the average hides the cases you care about.
Line count dominates more than it should. LOC is one of the three inputs and the index is computed per file or per function, so splitting a 600-line file into three 200-line files raises the reported index without changing a single line of logic. You have moved the problem, not solved it, and the number says you improved.
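The arithmetic is easy to check. The sketch below uses the commonly cited three-input formula, rescaled to 0 to 100, with made-up inputs. It also assumes that each of the three new files ends up with roughly a third of the original volume and complexity, which is a simplification: Halstead Volume does not divide that cleanly in practice.

```python
import math


def mi(volume: float, complexity: float, loc: int) -> float:
    """Commonly cited three-input formula, rescaled to 0-100."""
    raw = 171 - 5.2 * math.log(volume) - 0.23 * complexity - 16.2 * math.log(loc)
    return max(0.0, raw * 100 / 171)


# Hypothetical 600-line file.
whole = mi(volume=12000, complexity=40, loc=600)

# The same logic cut into three files. Assumption: each piece carries
# about a third of the volume, complexity and line count.
part = mi(volume=4000, complexity=40 / 3, loc=200)

print(f"one 600-line file:   {whole:.1f}")
print(f"each 200-line file:  {part:.1f}")
```

Every file in the split version scores higher than the original, although nobody touched the logic.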
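Halstead Volume measures the wrong thing. It is built from counts of operators and operands, so it tracks how many tokens a unit contains, not how hard they are to follow. A dense expression that takes real effort to unpick scores as simple because it is short; a long, plain block of obvious assignments scores as complex because it is long. That is backwards for the person reading it.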
It is trivial to game. Once a team is measured on the index, they optimise for the index. Rename to shorten. Split files mechanically. Inline a helper to drop an operand count. None of this makes the codebase easier to change, which was the entire point.
The deeper issue: maintainability is not one axis
Maintainability means the cost of making a change stays stable over time. That cost is driven by several independent things: how complex the control flow is, how coupled the modules are, how clearly things are named, whether tests give you the confidence to change anything, and whether the next engineer can find their way around.
These are different axes. A codebase can be excellent on complexity and terrible on coupling. Collapsing them into one scalar throws away exactly the information that would tell you what to fix. We made this case in full in how to measure codebase maintainability: the signals are useful individually and misleading once averaged.
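Coupling is a good example of an axis the index cannot see. One way to observe it is change coupling: how often two files are modified in the same commit. Here is a minimal sketch, assuming the history has already been reduced to one list of file paths per commit (for instance from `git log --name-only`). The function name, the `min_shared` cutoff and the sample paths are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def change_coupling(commits, min_shared=2):
    """Return (file_a, file_b, shared_commits, degree) for frequently co-changed pairs.

    degree = commits touching both files / commits touching either file.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = sorted(set(files))          # ignore duplicates, fix pair order
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rows = []
    for (a, b), shared in pair_counts.items():
        if shared >= min_shared:             # hypothetical noise cutoff
            either = file_counts[a] + file_counts[b] - shared
            rows.append((a, b, shared, shared / either))
    # Most frequently co-changed pairs first.
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))


# Made-up history: each inner list is the set of files in one commit.
COMMITS = [
    ["billing/invoice.py", "billing/tax.py"],
    ["billing/invoice.py", "billing/tax.py", "README.md"],
    ["billing/invoice.py", "api/routes.py"],
    ["api/routes.py"],
    ["billing/tax.py", "billing/invoice.py"],
]

for a, b, shared, degree in change_coupling(COMMITS):
    print(f"{a} <-> {b}: {shared} shared commits, degree {degree:.2f}")
```

Two short, simple files that always change together would look healthy on any per-file formula. A pair list like this is where the dependency becomes visible.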
What to track instead
Keep the underlying signals, drop the blend. Each of these has a clear reaction when it moves.
| Signal | What it tells you | Reaction when it worsens |
|---|---|---|
| Cyclomatic complexity (p95, count over threshold) | Where control flow is hard to test and reason about | Refactor the specific functions above the threshold |
| File size distribution | Files doing too many jobs | Split the genuine outliers, not everything |
| Change coupling | Hidden dependencies between modules | Investigate the top coupled pairs; extract or separate |
| Circular dependencies | Structural tangles that worsen with growth | Fail CI on new cycles |
| Dead code | Cognitive overhead with no payoff | Remove on sight, gate on new dead exports |
| Documentation density | Whether a newcomer can navigate | Document public interfaces below the threshold |
This is the same set we recommend in code quality metrics every team should track. The difference from the Maintainability Index is that every row points at something you can do today, to a specific file or function. The index points at nothing.
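To make the first row concrete, here is a rough sketch of "p95 and count over threshold" for Python source, using only the standard library `ast` module. The complexity count is an approximation (one plus the number of branch points), not a full McCabe implementation. The threshold and the sample code are assumptions for illustration.

```python
import ast
import math

# Node types counted as one extra decision point each (an approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def function_complexities(source: str) -> dict[str, int]:
    """Approximate cyclomatic complexity per function, keyed by 'name:line'."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1
            # Note: nested functions are also counted towards their parent.
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    score += len(child.values) - 1   # each and/or adds a path
            scores[f"{node.name}:{node.lineno}"] = score
    return scores


def summarize(scores: dict[str, int], threshold: int = 10):
    """Return (p95 by nearest rank, sorted names of functions over the threshold)."""
    ordered = sorted(scores.values())
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1] if ordered else 0
    over = sorted(name for name, s in scores.items() if s > threshold)
    return p95, over


SAMPLE = '''
def simple(x):
    return x + 1

def tangled(a, b, items):
    if a and b:
        for item in items:
            if item > a or item < b:
                return item
    return None
'''

scores = function_complexities(SAMPLE)
p95, over = summarize(scores, threshold=5)   # hypothetical threshold
print(scores, p95, over)
```

The output is a list of named functions, which is the point: the reaction in the table ("refactor the specific functions above the threshold") has somewhere to land.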
When a single number is still fine
There is a legitimate case for one headline number: communicating direction to people who are not in the code every day. A product lead does not want six trend lines. They want to know if things are getting better or worse.
The fix is not to abandon the headline. It is to make sure the headline is built from risk-weighted, explainable signals rather than an averaged formula, and that anyone can click through from the number to the reason behind it. That is the design we argue for in what a codebase health score is: a single figure backed by a per-domain breakdown, where the number is a summary of evidence and not a replacement for it.
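As a sketch of what "a summary of evidence" could mean in code: the headline below is derived from named signals, and the function returns the per-signal breakdown alongside it. Everything here is hypothetical, including the signal names, the budgets, the weights and the scoring rule. It illustrates the shape of the idea and is not a description of how any particular product computes its score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    value: float        # measured count, e.g. functions over the complexity threshold
    budget: float       # hypothetical level at which the signal is fully "spent"; must be > 0
    risk_weight: float  # hypothetical weight reflecting how much risk the signal carries

    def penalty(self) -> float:
        """Fraction of the budget used, capped at 1.0."""
        return min(self.value / self.budget, 1.0)


def health_score(signals):
    """Return (score out of 100, points lost per signal)."""
    total_weight = sum(s.risk_weight for s in signals)
    breakdown = {
        s.name: 100 * s.risk_weight * s.penalty() / total_weight
        for s in signals
    }
    return 100 - sum(breakdown.values()), breakdown


# Made-up measurements for one repository.
SIGNALS = [
    Signal("complexity_over_threshold", value=6, budget=20, risk_weight=3),
    Signal("circular_dependencies", value=2, budget=4, risk_weight=4),
    Signal("dead_exports", value=10, budget=40, risk_weight=1),
]

score, breakdown = health_score(SIGNALS)
print(f"health: {score:.1f}")
for name, lost in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"  {name}: -{lost:.1f}")
```

The product lead reads the first line. The engineer reads the rest and knows which signal to open first.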
A score you cannot explain is a score you cannot act on. The Maintainability Index fails that test. A health score that decomposes into the signals above passes it.
The takeaway
The Maintainability Index is not wrong so much as it is inert. It compresses real signals into a value that is easy to display, easy to game, and hard to act on. Track the components directly, weight them by the risk they carry, and keep the path from the number back to the cause open.
If you want a score built that way, with every figure traceable to the detection behind it, that is what Implera does on every commit. Connect a repository and watch the signals, not the average.