
Do code reviews actually find bugs?

Othman Shareef · July 4, 2026 · 6 min read

Every few months the argument reruns: a “code review is theatre” essay, a “reviews saved us” rebuttal, and a Hacker News thread titled something like “Code reviews do find bugs” with a hundred-plus comments relitigating it. Both camps wave anecdotes. The research actually has an answer, and it’s more interesting than either camp’s version.

What the studies actually say

Microsoft (Bacchelli & Bird): drawing on interviews, surveys, and hundreds of manually classified review comments, the study found that developers expect defect-finding to be the top outcome, but the comments actually observed skew toward improvements, style, and understanding. The biggest obstacle to finding deep defects is reviewers not understanding the change.
Google (Sadowski et al.): their case study describes review converging on small changes, reviewed quickly and usually by a single reviewer, with education and codebase consistency as explicit goals. It is a system tuned for understanding-at-scale, not maximal defect yield per review.
The older inspection data (the Cisco study popularized by SmartBear) shows review can catch a substantial share of defects, but only under the conditions teams routinely violate: small scope, limited pace, fresh attention.

Both camps are right about the failure mode

The skeptics aren’t hallucinating: a rushed glance at a 1,500-line diff finds nothing, and calling it quality assurance is theatre. The defenders aren’t either: a real read of a scoped change catches the wrong-assumption bug no test suite was written to catch. Tests verify what someone thought to verify; a reviewer questions the thinking. The variable isn’t whether review “works.” It’s whether the conditions for real review (size, time, and a surface that doesn’t fight you) were present.

The agent-era update

Two things have changed since that research. Volume: authoring has accelerated and review hasn’t, so the conditions for real review are violated more often by default. And provenance: when an agent wrote the change, review is no longer a second pair of human eyes. It’s often the first, which raises the stakes of doing it honestly. The defect-finding share of review’s value is probably rising again, because generated code fails in ways that tests written by the same model miss.

So: worth it?

Worth it: for the bugs it does catch, for the worse designs it quietly prevents (knowing a human will read your code changes what you write), and above all for keeping the team’s understanding of its own system alive. That last one has no substitute, no bot replacement, and no shortcut. The actionable conclusion isn’t “review more” or “review less.” It’s “create the conditions where review is real”: small changes, honest re-review rounds, nits automated away, and reviewer attention treated as the scarce resource it is.
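One way to make those conditions concrete is a pre-review check that warns when a change violates them. What follows is a minimal sketch, not a recommended policy: the ReviewRequest fields, the function name, and both thresholds are assumptions made for illustration, since nothing above gives numbers.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; the post gives no numbers,
# so tune these against your own team's data.
MAX_CHANGED_LINES = 400
MAX_LINES_PER_HOUR = 500


@dataclass
class ReviewRequest:
    changed_lines: int    # lines added plus lines removed
    review_minutes: int   # time the reviewer has actually set aside
    lintable_nits: int    # comments a formatter or linter could have made


def review_condition_warnings(req: ReviewRequest) -> list[str]:
    """Return one warning per violated condition; an empty list means none."""
    warnings = []
    if req.changed_lines > MAX_CHANGED_LINES:
        warnings.append("scope: split the change into smaller pieces")
    if req.review_minutes <= 0:
        warnings.append("pace: no review time has been set aside")
    else:
        lines_per_hour = req.changed_lines * 60 / req.review_minutes
        if lines_per_hour > MAX_LINES_PER_HOUR:
            warnings.append("pace: too many lines for the time available")
    if req.lintable_nits > 0:
        warnings.append("nits: move style checks into automation")
    return warnings


rushed = ReviewRequest(changed_lines=1500, review_minutes=10, lintable_nits=0)
scoped = ReviewRequest(changed_lines=120, review_minutes=30, lintable_nits=0)
print(review_condition_warnings(rushed))  # scope and pace warnings
print(review_condition_warnings(scoped))  # no warnings
```

The check deliberately warns instead of blocking: it is meant to protect reviewer attention, not to stand in for the reviewer’s judgment.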

Frequently asked questions

Should we drop review for low-risk changes?

Calibrating rigor to risk is reasonable: many teams fast-track docs, config, and mechanical changes. But “low-risk” is exactly where untested assumptions hide, and review’s knowledge-transfer value applies to boring changes too. Lighten the process before you remove it.
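Calibrating rigor to risk could look something like the sketch below, which routes a change to a lighter review tier only when every file it touches matches a fast-track pattern. The patterns, the tier names, and the review_tier function are hypothetical; the point is that the light tier is still a review, not an exemption.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for changes a team might fast-track.
FAST_TRACK_PATTERNS = ("docs/*", "*.md", "*.yaml", "*.yml")


def review_tier(changed_paths: list[str]) -> str:
    """Return "light" only when every changed path is fast-trackable."""
    if not changed_paths:
        return "full"  # nothing to classify, so stay conservative
    for path in changed_paths:
        if not any(fnmatchcase(path, p) for p in FAST_TRACK_PATTERNS):
            return "full"
    return "light"


print(review_tier(["docs/intro.md", "README.md"]))       # light
print(review_tier(["docs/intro.md", "src/billing.py"]))  # full
```

A single file outside the patterns sends the whole change to full review, which keeps a “docs-only” label from hiding a behavioural change.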

Are AI reviewers a replacement for human review?

They replace a slice of it: the mechanical defect hunt. They don’t replace the parts research says humans uniquely deliver: judging intent, transferring context across the team, and a human taking ownership of the change. Use bots to clear noise so humans can do that part.
