Do code reviews actually find bugs?
Othman Shareef · July 4, 2026 · 6 min read
Every few months the argument reruns: a “code review is theatre” essay, a “reviews saved us” rebuttal, and a Hacker News thread titled something like “Code reviews do find bugs” with a hundred-plus comments relitigating it. Both camps wave anecdotes. The research actually has an answer, and it’s more interesting than either camp’s version.
What the studies actually say
- Microsoft (Bacchelli & Bird): across thousands of reviews, the study found developers expect defect-finding to be the top outcome, but observed comments skew toward improvements, style, and understanding. The biggest obstacle to finding deep defects is reviewers not understanding the change.
- Google (Sadowski et al.): their case study describes review converging on small changes reviewed quickly by usually one reviewer, with education and codebase consistency as explicit goals: a system tuned for understanding-at-scale, not maximal defect yield per review.
- The older inspection data (the Cisco study popularized by SmartBear) shows review can catch a substantial share of defects, but only under the conditions teams routinely violate: small scope, limited pace, fresh attention.
Both camps are right about the failure mode
The skeptics aren’t hallucinating: a rushed glance at a 1,500-line diff finds nothing, and calling it quality assurance is theatre. The defenders aren’t either: a real read of a scoped change catches the wrong-assumption bug no test suite was written to catch. Tests verify what someone thought to verify; a reviewer questions the thinking. The variable isn’t whether review “works.” It’s whether the conditions for real review (size, time, and a surface that doesn’t fight you) were present.
The agent-era update
Two things changed since that research. Volume: authoring accelerated and review didn’t, so the conditions for real review are violated more often by default. And provenance: when an agent wrote the change, review is no longer a second pair of human eyes. It’s often the first, which raises the stakes of doing it honestly. The defect-finding share of review’s value is probably rising again, because generated code fails in ways tests written by the same model miss.
So: worth it?
Worth it: for the bugs it does catch, for the worse designs it quietly prevents (knowing a human will read your code changes what you write), and above all for keeping the team’s understanding of its own system alive. That last one has no substitute, no bot replacement, and no shortcut. The actionable conclusion isn’t “review more” or “review less.” It’s to create the conditions where review is real: small changes, honest re-review rounds, nits automated away, and reviewer attention treated as the scarce resource it is.
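If “small changes” is going to be a condition rather than a wish, it helps to measure it. Here is a minimal sketch of a size check that reads the output of `git diff --numstat` and flags a change that is probably too big to review honestly. The function names and the 400-line limit are my own placeholders, not numbers from any of the studies above; pick whatever threshold your team can actually hold.

```python
# Hypothetical limit: an assumption for illustration, not a researched figure.
MAX_CHANGED_LINES = 400


def count_changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.

    Each line looks like "<added>\t<deleted>\t<path>". Binary files are
    reported as "-\t-\t<path>" and are skipped here.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary file, no line counts available
        total += int(added) + int(deleted)
    return total


def review_size_verdict(numstat: str, limit: int = MAX_CHANGED_LINES) -> str:
    """Return a short human-readable verdict for a change of this size."""
    changed = count_changed_lines(numstat)
    if changed <= limit:
        return f"ok: {changed} changed lines"
    return f"too large: {changed} changed lines (limit {limit}); consider splitting"


if __name__ == "__main__":
    import subprocess

    # Compare the current branch against main; the branch name is an assumption.
    result = subprocess.run(
        ["git", "diff", "--numstat", "main...HEAD"],
        capture_output=True, text=True, check=True,
    )
    print(review_size_verdict(result.stdout))
```

A check like this doesn’t make review real by itself. It only makes the violated condition visible before a reviewer is asked to skim 1,500 lines and call it done.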