Every few days, it seems, we see a series of tweets objecting to automated testing and code coverage measurements, suggesting that they tell us nothing about the “real” situation. Today, it was this:

Now this particular thread objected to coverage goals, and I completely agree. We can cover all the code most rapidly by writing no assertions (and perhaps by fielding all the exceptions), while learning nothing. (A sketch of such a do-nothing “test” appears just after the tweet below.) So yes, let’s agree that coverage goals are bad. But what about coverage itself? So I tweeted:

Two projects, one has 95% code coverage with tests, one has 45%. You’re going to be paid per bug found. Which one do you want to work on?
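
As an aside, here’s roughly what “covering the code while learning nothing” can look like. This is a minimal, made-up sketch in Python: ship_order is a toy function invented purely for illustration, and the pytest-style test below drives execution through every line of it, so a coverage tool will happily report 100%, yet it asserts nothing and swallows the one exception the code can raise.

    # Toy production code, invented purely for illustration.
    def ship_order(items, rush=False):
        if not items:
            raise ValueError("nothing to ship")
        return "overnight" if rush else "ground"

    # A deliberately useless pytest-style test: it executes every line and
    # branch of ship_order, so coverage reads 100%, but it makes no
    # assertions and "fields all the exceptions", so it can never fail and
    # teaches us nothing about whether the code is right.
    def test_ship_order_full_coverage_zero_knowledge():
        for items, rush in ((["widget"], True), ([], False)):
            try:
                ship_order(items, rush)
            except Exception:
                pass  # swallow everything; nothing can possibly go wrong

That’s the sense in which a coverage goal, by itself, guarantees nothing.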

In retrospect, I wish the tweet had had room for “all other things being equal”. People kept thinking, “But yeah, the one might have been coded by malicious evil demons who got high coverage, and the other by really careful angels who thought really hard”. But I tried a recasting with “Ceteris paribus” and people wanted to play games with that as well, e.g. “OK both systems have the same number of bugs, I’ll take #1.” Yes, well played, but not the point, is it?

I’ll not copy the responses here. Most respondents seemed to wriggle on the hook, saying that we don’t know the domains, the size of the code, the this or the that. True enough. My main reaction was “Yes, sure, all true. This is all the information you get, which do you choose?”

“High coverage means low understanding”?

One respondent seemed to conclude that the team with 95% coverage clearly didn’t understand TDD, and that they’d therefore look for defects in that program. Others wanted more information, which I declined to provide, because I was trying to get people to think, and trying to find out how people think. So I was singularly unhelpful in my responses to requests for more information.

“Insufficient information for a meaningful response”?

Many respondents wanted to assert that we have insufficient information to know anything. Certainly we have very little information: the point is to get at our intuition and the reasons behind it, for thinking about the probable quality of systems with high test coverage versus systems with low coverage. It’s quite true that we don’t know if the one team was really competent and the other incompetent. We don’t know if the one program was rocket science and the other was trivial. We don’t know if the one was large or the other small. Those are all things we’d like to know.

What I’m after, though, is to examine our intuition about coverage. What effect does increasing test coverage tend to have on the defect density in a program? None? Does increasing test coverage tend to reduce defects? Does it tend to increase them?

To me, the answer is obvious: If I have to bet, I’ll bet that more testing leads to fewer defects. This isn’t guaranteed, but it’s how I’d bet. If you’d really bet that more tests lead in most cases to more defects left in the code, I’d like to read your article supporting that idea.

If your thought is that on the average, increasing testing has no effect on bug density, I’d like to read your support for that theory as well. And I assume you are aware that that theory implies that zero testing is the right amount of testing. This seems unlikely to me. Anyway, here’s my best argument for “we can know nothing”:

You know nothing, Ron Jeffries …

In the space of all possible random arrangements of code, it might be that the defect density is the same, on the average, no matter the test coverage, which is also random. Therefore, test coverage tells us nothing.

Nothing? Really?

Random arrangements of code are not what we find in the universe. In the universe, Bill Tozier notwithstanding, most code is written with the intention of making it work, and with some non-zero level of inspection and testing to ascertain whether it does. In that light, I’d expect that code that has had lots of testing done probably has fewer defects than code with little testing. I’d suggest that code with 95% automated coverage probably has had more testing done than code with only 45%.

Therefore, I’d expect that the 95% code has fewer defects per line, i.e. lower defect density, than the 45% code. I’d expect that if we threw darts at the code listings, our darts would hit defective functions more often in the code with less coverage.
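
Just to make the arithmetic of that bet concrete, here’s a toy dart-throwing simulation. The defect densities are pure invention for illustration (say 8 defective functions per hundred in the 45% system and 2 per hundred in the 95% system); the only point is that whatever the real numbers are, darts hit defective code in proportion to its density.

    import random

    # Toy model of the dart experiment. A "dart" picks a function at random;
    # it "hits" a defect with probability equal to the defect density.
    # The densities used below are invented for illustration, not data.
    def defective_hits(defect_density, darts=10_000):
        return sum(random.random() < defect_density for _ in range(darts))

    random.seed(1)  # repeatable run
    print("45% coverage, assumed density 0.08:", defective_hits(0.08), "hits")
    print("95% coverage, assumed density 0.02:", defective_hits(0.02), "hits")

With a four-to-one ratio in assumed density, the low-coverage listing takes roughly four times as many hits. The simulation proves nothing; it just makes the proportionality visible.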

I know of no data to support this thinking: if someone does know of some, I’d love to read your article about what is known, and how it applies to this Gedankenexperiment. But if my assumption is correct that tests tend to reduce defects, I think it’s a pretty solid bet that the 95% code is more nearly defect-free than the 45% code.

And again: I solidly oppose setting a goal for any given level of coverage. But even if you do, the tests probably won’t make the code more buggy.

If your view is different (or the same), I look forward to reading your article on the subject, and will add a link to it here if you make sure I know about it.

Thanks for playing!