What TDD is -- and isn't -- like.
Some folks who don’t know TDD, and some folks who should know better, have been describing TDD in ways that don’t match what I experience. Herewith, some thoughts.
See also this companion article demonstrating TDD.
Test-Driven Development is defined quite simply:
- write a small failing test;
- make the test run with the least code possible;
- refactor to improve the design you now have;
- repeat
Like so many aspects of software development, be it Agile or not, TDD is simple, but it definitely isn’t easy.
I find TDD very useful, and I know many other developers who do as well. I communicate with them often enough – and have had the pleasure of introducing some of them to the practice – so that I’m really rather sure that we’re talking about the same basic thing when we talk about TDD.
The people who recommend TDD to others, at least the ones I’d trust, are the ones who use the technique themselves. Some do recommend it from hearsay, and since I recommend TDD, I’d have to say that I agree with those folks, but someone considering the practice would do well to listen to those of us who actually do it. Possibly, our descriptions would be more helpful.
But the facts seem to be that a lot of people who try TDD do not have good results, and come away with a bad taste in their mouth. I’ve written elsewhere about the main ways that might legitimately happen: different design preferences, different kinds of applications, and so on. One possibility, unfortunately, is that they’re just not doing it right.
That’s a horrid thing to say, and people often feel offended if one says it, but let’s get real. When we use any tool, our results depend on how well we use it. Even pounding a nail can be done well or poorly, and once you’ve watched a professional do it, you can see that. TDD is harder than pounding nails, and it can be done well or poorly. Here, I intend to talk about what I think I’m doing when I do TDD, and what it feels like to me, in the hope that interested people can assess whether they’re doing TDD or some other thing.
If you’re interested, trying it, and wondering, you can always write me a note or tweet me. I answer almost every communication I get.
Anyway, here goes.
Evolution is slow
Someone in a recent thread objected to TDD because it evolves the code, and evolution is a slow process. I’m sorry that I can’t seem to find the exact quote.
Those of us who use TDD use it because it is the fastest way we know to produce running software that works. I mean, get serious, if it didn’t produce working software faster for us, we sure wouldn’t do it. With TDD, yes, we do evolve the design and code, and no, it is not a slow process. It is a steady, evenly-paced process – and it’s the fastest way we know.
Most of us developers rely on testing in one way or another to build confidence that our code works. Perhaps we use TDD. Perhaps we use a test-first mechanism where we write acceptance tests first, or have them provided, and use those to see whether our program works. Perhaps we write the code, then run it and determine by looking whether it works. And perhaps we write it and someone else tests it and tells us about the defects. Sometimes, unfortunately, it’s the users who have that job.
Now, it happens that there is some evidence that code inspection finds defects better than testing, but nonetheless, almost no organization uses code inspection in lieu of tests. So I feel pretty safe suggesting that most of us here rely on testing in one way or another.
It seems reasonable to suppose that we insert defects with a more or less constant frequency, so that one hundred lines of our code is about ten times more likely to contain a defect than ten lines. I’d also suggest that it’s at least ten times harder to find a defect in 100 lines than in 10.
Furthermore, there’s good reason to suspect that defect insertion increases with chunk size: writing correct code involves a lot of juggling of ideas, names, and other details, and larger chunks may well be disproportionately likely to include defects.
All this would suggest that – if it were fast enough – testing each few lines might be better than writing lots of code and then testing it. Even if we make mistakes at the same rate, we’ll see them sooner and in a smaller batch of mostly working code. And if, as I suspect, defect insertion actually increases with size, we would likely produce fewer defects overall.
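The arithmetic here is worth a quick sketch. Assuming, purely for illustration, a constant per-line defect probability, expected defects scale linearly with chunk size:

```python
# Illustrative only (not from the article): assume each line we write
# independently contains a defect with probability p.
p = 0.01  # hypothetical per-line defect rate

# Expected defects scale linearly with chunk size...
expected_10 = 10 * p    # 0.1 defects expected in a 10-line chunk
expected_100 = 100 * p  # 1.0 defect expected in a 100-line chunk

# ...so a 100-line chunk carries ten times the expected defects of a
# 10-line chunk, before we even account for the harder search.
print(expected_100 / expected_10)  # 10.0

# Probability that a chunk contains at least one defect:
at_least_one_10 = 1 - (1 - p) ** 10
at_least_one_100 = 1 - (1 - p) ** 100
print(round(at_least_one_10, 3), round(at_least_one_100, 3))  # 0.096 0.634
```

So under this toy model, the big chunk is not only likelier to hide a defect; the search space for finding it is ten times larger as well.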
That’s what TDD does. It causes us – enables us – to do our testing in very small batches, so small that finding defects is nearly trivial. And we seem to make fewer mistakes, or at least leave fewer in the code. Fewer defects, faster finding, simpler debugging: these are the things that make TDD faster.
Let’s look again at what TDD is and then consider a story about how it goes.
- write one small failing test;
- make the test run with the least code possible;
- refactor to improve the design you now have;
- repeat
The steady state we work toward with TDD is that when we start, we are looking at a well-designed program with no known defects – because all our tests run, and we’ve continuously improved the design as we go.
Now we’re looking at a new feature, and we think of a small step toward that new feature. We advise choosing the smallest real step you can think of, but as we gain skill, we choose, not the smallest step, but a step of reasonable size, one where we are sure we can get it to work in one go.
We write a test showing that the new bit we’re about to build doesn’t already work. If we’re adding sales tax to our purchasing software, maybe we set up an empty cart and ask it what the sales tax is, asserting that it will be zero dollars. We know, of course, that the cart can’t even answer the question yet, so we are sure this test will fail.
We run the test and it fails. We need to build the simplest code that passes the test, so we add a salesTax method that returns zero dollars. Now the test runs. We check the design and it’s still good: adding this method was needed.
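Rendered in, say, Python, that first red/green step might look like the following sketch. Everything here – the class names, the Money representation, the method names – is hypothetical, just to make the shape concrete:

```python
# Hypothetical sketch of the first red/green step; Cart and Money are
# stand-ins, not the article's actual classes.

class Money:
    def __init__(self, cents):
        self.cents = cents

    @staticmethod
    def zero():
        return Money(0)

    def __eq__(self, other):
        return self.cents == other.cents


class Cart:
    def __init__(self):
        self.items = []

    # The least code that can pass the first test: just return zero.
    def sales_tax(self):
        return Money.zero()


# The first small test: an empty cart owes no sales tax.
def test_empty_cart_has_zero_sales_tax():
    cart = Cart()
    assert cart.sales_tax() == Money.zero()


test_empty_cart_has_zero_sales_tax()
print("green")
```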
Now … why didn’t we instead populate the cart with one item or a few items, and ask for a sales tax of $1.98 or something? We could – and on a given day we might – but we might also discover that when we take a bite that big, it actually slows us down. We’ve got sales tax on our mind, we know we’re going to do a salesTax method, we know what type it returns, but we don’t have any of that mechanism yet. So why waste time building up a cart and such?
Everyone’s mileage varies, but many if not most TDDers would go for the trivial zero-dollar case. The reason is that this test is sufficient to “drive” us to build the method and return a value, without driving us to iterate over the cart items adding in their sales tax.
Most of us prefer that each test “drives out” just one code improvement, one small concept on the way to the larger goal. By choosing the zero dollar tax first, we get to defer the question of whether we’ll ask each item for its amount and calculate sales tax on that amount, or whether we’ll ask each item for its sales tax. (The latter is almost certainly the better design, and we’ve had the thought flit through our mind, but we know we can deal with it later.)
So we do the simple test, ask the cart, get the zero, driving out the base cart method and its type.
Then it’s time for a new test. This time, I’d consider adding just one item to the cart, but maybe I’d skip the single case and add two or more. It’s a judgment call, and one that I’m always prepared to make – and I’m prepared to discover that I’ve made the wrong choice.
And here’s why evolution isn’t slow. I’m not typing random tests and random code. I’m thinking about the design and making choices as I go. I choose whether to do one, two, or three items. Then I set up the cart, calculate the sales tax by hand, and ask the cart to give me back that value.
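Continuing the hypothetical sketch, the next test primes the cart with two items and asserts a hand-calculated tax. Run against the zero-returning cart, it goes red, exactly as intended (the amounts and the 6% rate are illustrative):

```python
# Hypothetical continuation: the cart still hard-codes zero tax, so this
# new test should fail -- that's the "red" we want before we code.

class Money:
    def __init__(self, cents):
        self.cents = cents
    def __eq__(self, other):
        return self.cents == other.cents

class Item:
    def __init__(self, price_cents):
        self.price_cents = price_cents

class Cart:
    def __init__(self, items=None):
        self.items = items or []
    def sales_tax(self):
        return Money(0)  # the "intermediate solution" from the last step

# Two items at $10.00 and $23.00; at an illustrative 6% rate the tax,
# worked out by hand, should be $1.98.
cart = Cart([Item(1000), Item(2300)])
try:
    assert cart.sales_tax() == Money(198)
    print("green")
except AssertionError:
    print("red")  # expected: the cart still answers zero
```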
The test fails, because the cart still says tax is zero dollars. Great, now we get to make it work.
Here we come to another TDD myth we can bust. People say TDD is inefficient because you write all these intermediate solutions and have to rewrite them. Our zero is such an intermediate solution: it looks like this:
Cart>>Money salesTax() {
return Money.Zero();
}
What code do we have to rewrite? That return statement! Wow, terrible rework, we’re going to have to rewrite over a dozen whole characters of code it took us zero thought time to create!
So we remove that return, and decide to do something like iterate over the cart.items, summing item.salesTax() and returning it. In a decent language that’s two or three lines of code.
We run the test again. It fails, now because items don’t understand salesTax. We’re not surprised. So maybe:
Item>>Money salesTax() {
return this.price*Percent(6);
}
This works, and if the answer isn’t what we expected, most likely we did the hand arithmetic wrong. All our tests are working again, and we check our design and it’s still good. (Quite often, we’ll find that.) Had we made the other choice of summing item.price*Percent(6), we might then realize that putting sales tax computation into Item is better. Or maybe we wouldn’t see it until our next test. Since we’re paying attention to the design quite often, we’re likely to see improvement opportunities while the changes are still quite small.
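Pulling the pieces together, here is a hypothetical Python rendering of where the code now stands, with the cart summing per-item taxes (names, cents representation, and the 6% rate are all illustrative):

```python
# Hypothetical sketch of the state after this green step: the cart sums
# per-item taxes, and each item computes its own.

class Money:
    def __init__(self, cents):
        self.cents = cents
    def __eq__(self, other):
        return self.cents == other.cents
    def __add__(self, other):
        return Money(self.cents + other.cents)

class Item:
    def __init__(self, price_cents):
        self.price_cents = price_cents
    def sales_tax(self):
        # illustrative flat 6% rate, rounded to whole cents
        return Money(round(self.price_cents * 0.06))

class Cart:
    def __init__(self, items=None):
        self.items = items or []
    def sales_tax(self):
        # the two or three lines that replaced "return zero"
        total = Money(0)
        for item in self.items:
            total = total + item.sales_tax()
        return total

# Both tests stay green: the empty cart, and the hand-calculated $1.98 case.
assert Cart().sales_tax() == Money(0)
assert Cart([Item(1000), Item(2300)]).sales_tax() == Money(198)
print("green")
```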
Perhaps our next concern is that some items are not taxable. We create a cart, maybe with one item taxable and one not, the test of course fails, and we turn to code.
Had we gone with the tax computation in Cart, we’d have to write something like
if ( item.taxable ) then item.price*Percent(6) else Money.Zero()
Hopefully we’d see that that was bad. But we’d be on a red bar, so we either have to back out the test (or ignore it) and fix the design, or wait until we get that test working. Either way is bad.
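With the tax knowledge in Item, by contrast, the change is small and local. A hypothetical Python sketch (class names, cents representation, and the 6% rate are illustrative, not the article’s actual code):

```python
# Hypothetical sketch: with Item owning its tax, "some items are not
# taxable" touches only Item, and the Cart's summing loop is untouched.

class Money:
    def __init__(self, cents):
        self.cents = cents
    def __eq__(self, other):
        return self.cents == other.cents
    def __add__(self, other):
        return Money(self.cents + other.cents)

class Item:
    def __init__(self, price_cents, taxable=True):
        self.price_cents = price_cents
        self.taxable = taxable
    def sales_tax(self):
        if not self.taxable:
            return Money(0)
        return Money(round(self.price_cents * 0.06))  # illustrative 6% rate

class Cart:
    def __init__(self, items=None):
        self.items = items or []
    def sales_tax(self):
        total = Money(0)
        for item in self.items:
            total = total + item.sales_tax()
        return total

# One taxable item ($10.00 -> $0.60 tax), one non-taxable item.
assert Cart([Item(1000), Item(500, taxable=False)]).sales_tax() == Money(60)
print("green")
```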
So here we see something about evolution. If we have a good sense of design, we’ll go right to asking the Item about salesTax. If not, we may have to wait until we discover the ugly code that sends two messages to item and conditions behavior based on one of them. And when we discover it, either we ignore the test formally, so as to refactor on a green, or we ignore it in our head and refactor on red, or we build an even worse design to get to green and then refactor.
One of these choices is better than the others in my view. I’d comment out or otherwise formally ignore the new test, refactor, then re-enable the test. Not that I couldn’t ignore it in my head, and refactor on red, but I prefer to play by the rules, because the urge to take the shortcut bites me often enough that I’ve learned better.
Notice what is happening here!! The discipline of TDD is teaching me, by simple feedback, to do things that tend to work for me, and not to do things that don’t. So every time I get a red bar that I don’t expect, or find code that makes me wonder who the hell wrote it, I reflect back on my development ritual, and see whether I’ve taken a shortcut that I know better than to take, and I refresh my dedication to working in the range of behaviors that works best for me.
Bottom line, the code evolves, but it isn’t evolving randomly by survival of the fittest. Instead an expert designer (us) is evolving it directly, making the best choices we can, correcting the mistakes as we see them, and evolving the design as effectively as we can.
And we’re doing it with increasing certainty that the code is doing exactly what we intended it to do.
TDD does evolve the code. It does evolve in small seemingly slow steps. But it’s directed evolution, taking the largest safe step according to my best judgment, and my judgment gets corrected quickly when I judge wrongly, because something goes wrong right before my eyes.
Evolving the code wisely is the fastest way I know to program. It’s not slow at all.
Objections regarding coverage and unit testing.
The brilliant James Coplien has written that most unit testing is waste, and seems to use that argument to argue that therefore TDD is also waste. Cope has written this follow-up article as well. On twitter, he referred also to Vitaliy Pisarev’s article arguing against unit tests. Finally, David Heinemeier Hansson (the famous DHH, creator of Rails) has an article about Test-induced design damage that’s worth a look.
These articles certainly raise some interesting concerns. Brittle tests are a problem, and whether you’re just “unit testing” or “TDDing”, they are evidence that you’re “doing it wrong”. Brittle tests are not very useful and should be avoided. Neither unit testing nor TDD require you to write brittle tests. Both practices permit it. There’s nothing man-made that can’t be done poorly.
DHH, for his part, is concerned that an over-focus on testability can result in an inferior design. Certainly this must be possible … in the hands of someone who doesn’t know a good design when they see one … but if you know good design, you’d be a fool to create a bad one for testing purposes. There is another matter, though, which is that people’s design sense is to some degree a matter of taste. Based on observing a bit of his work, for example, I think that DHH tolerates and enjoys a bit less detailed factoring than I do. That will certainly change the responsiveness of our designs to testing concerns.
Cope also leans hard – rather too hard in my view – on the argument that test coverage means very little, because your program has zillions of paths through it, zillions of possible states, and no reasonable number of tests can test them all.
There are also zillions of possible ways to die in your car, including being killed by the second duck in a series striking your windshield. (The first duck just cracks the glass and bounces off.) The possibility of death by duck strike is not an argument against seat belts and air bags, and the possibility that you’ve missed an important test is not an argument against testing.
Do read the articles. They are interesting and raise real issues about quality in testing and about ways to gain confidence in your code.
Do those papers have much to say about TDD? Well, they don’t have much to say about what I do when I say I’m doing TDD.
Cope’s article objects to unit tests. For me, TDD isn’t even about unit testing. I don’t hesitate for a moment to write a test that’s going to use multiple objects. Often, at the time I write the test, I don’t even know how many objects will be involved. After all, we’re going to refactor to a better design, so, often, by the time we’re done, we’ll have quite a few more objects than we started with. Here’s an example.
As a demonstration of TDD, my colleague Chet Hendrickson and I often do a little bowling scoring exercise, a program that calculates the total final score of a legal game of bowling. This problem is scaled very nicely for us to solve in an hour or ninety minutes, which is just about right for the occasion.1
We always begin that example with a test that creates a BowlingGame instance, primes it with 20 gutter balls, and requires that the game score is zero. We implement the score, of course, by just returning zero.
When the full suite of tests has come into being, and they’re all running, that method may be doing any number of things, depending on the design choices we make going forward. It might sum an array inside the game. It might forward to a collection of Frame objects. We don’t know what it’s going to do, because we do the example a bit differently every time.
That test is not brittle. It never needs to be revised. The implementation of BowlingGame>>score needs to be revised, of course. But the test? Never.
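Sketched in hypothetical Python (the internals shown here are just one of the many shapes the method might eventually take):

```python
# Minimal hypothetical sketch of the bowling example's first step.

class BowlingGame:
    def __init__(self):
        self.rolls = []

    def roll(self, pins):
        self.rolls.append(pins)

    def score(self):
        # Whatever this evolves into, the test below never changes.
        # Summing the rolls is already enough for a gutter game.
        return sum(self.rolls)

# The first test: twenty gutter balls score zero.
game = BowlingGame()
for _ in range(20):
    game.roll(0)
assert game.score() == 0
print("green")
```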
In TDD, when our tests are all green and the code is good, we ask ourselves what doesn’t work that needs to work by the time we’re done. And we pick the next simplest thing to test.
In the case of bowling, having tested all zero frames, we usually next test all open frames (frames with score less than ten), because that’s easier, in our view, than a frame with a strike or spare. We create a BowlingGame, prime it with open frames, and require the game score to be whatever the frames add up to.
And here’s something that James and Vitaliy seem to miss. They seem to be assuming that the test writer is ignorant and remains that way. TDD is about learning all the time. We learn what to test by deciding what to test. In this case, we’re solving an old problem a new way, but even as we solve new problems, sometimes we bite off more than we should. We quickly learn, because making the test run green takes longer than a couple of minutes, and if we’re wise, we delete that test and write a simpler one, and if we’re not quite so wise, we just make a mental note to take smaller steps.
Speaking of bowling, here’s an example of something we learned. If you want a series of open frames to test, a good example might be frames all containing 5,4, because that’s 9 and ten times nine is ninety. We’ve done that … and had things go wrong.
Clearly, as we build up our real implementation, we need to iterate over all the frames and all the rolls. One time, doing our demo, we somehow forgot to increment the frame index, but since the frames were all the same, we got the same 90 that we expected. Only later, when we put in some spares or strikes, did we discover the bug.
These days, when we do this exercise, we don’t prime the frames all with the same values. Usually we alternate, maybe 5,4 and 4,3. That’s enough for us to be sure that the program will have to increment through the frames. And the same lesson applies to anything involving a collection: we now know that priming with all one constant isn’t quite as robust as we might hope. It does make a harder problem adding up the frames by hand, and we have to figure out what 16 times 5 is, but we usually manage that with help from the audience.
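The lesson is easy to demonstrate. In this hypothetical sketch, a scorer with a frame-index bug passes the all-identical-frames test and is caught only by the alternating frames:

```python
# Hypothetical demonstration: a scorer that forgets to advance its frame
# index still "passes" when every frame is identical, and only the
# alternating frames expose the bug.

def score_buggy(frames):
    # Bug: always reads frames[0] instead of frames[i].
    total = 0
    for i in range(len(frames)):
        first, second = frames[0]
        total += first + second
    return total

def score_correct(frames):
    return sum(first + second for first, second in frames)

same = [(5, 4)] * 10                # ten identical open frames, 9 each
alternating = [(5, 4), (4, 3)] * 5  # 9 then 7; each pair sums to 16

print(score_buggy(same), score_correct(same))                # 90 90 -- bug hidden
print(score_buggy(alternating), score_correct(alternating))  # 90 80 -- bug exposed
```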
Now, if we followed the TDD dictates exactly and always, we’d never write a line of code that wasn’t required by some test, and therefore we’d have 100% line and branch coverage without even trying. Cope and others are correct that this isn’t much, but the fact that every line of code is exercised by at least one test is still of value.
When you look at a program that has been developed using TDD, and you look at any given line of code, with an eye to putting an error into it that won’t get detected, it turns out to be pretty difficult. Almost anything you change will break a test. If it doesn’t, and you ask yourself whether there should have been a test for that, the answer is usually yes.
Now, it’s easy enough to say, yes, well, sure, if you tested perfectly your program would be perfect, but we’re never perfect. That’s quite true, but it’s also true that as we practice TDD, and review those places where it doesn’t save us, we learn to practice it better.
TDD builds up a scaffolding of tests, surrounding our code, such that every line of code is there for a tested reason, and as we learn to do it better, our testing scaffolds improve and our programs become more reliable. As they become more reliable, we do less and less debugging, which means that we go faster and faster.
Theory vs Practice
I’ve tossed a few gibes at computer science theories here (and elsewhere), so I would like to comment rather seriously for a moment. I really have studied those theories, while attaining not entirely dishonorable degrees in math and computer science, and the theories are, formally, correct.
There really are problems such that it’s not possible to write a computer program that is guaranteed to get the right answer, even assuming that the hardware works. There really are problems where the time to get the right answer increases without bound as the problem becomes more complex.
However, that doesn’t mean that we cannot have great confidence that our particular programs do in fact work. The undecidable problems are things like determining, for any given program, whether that program will ever halt. Fine. No one cares but a computer scientist.
Our job as software developers is to produce programs that do work, that do get the right answers. We can do that, by coloring well within the lines. The navigation program in your car or phone almost always finds a decent route to where you’re going. Rarely, it may lack road information and ask you to drive “to the highlighted route”. That’s not a theoretical failure, it’s insufficient information, and if you do find your way to a road it knows, it’ll recover and start telling you what to do. By and large, almost all the time, you’ll get there.
In theory, we can’t test a program enough to be certain that it works. In practice, by using our brains, we can gain enough confidence, using a combination of testing, careful design, inspection, and all the tricks we know, to produce useful programs that do the job they were asked to do.
TDD is one tool in the kit we can use to do that. For me, it’s a very valuable tool.
Summing Up
I’m really not (quite) here to sell you TDD, but I am here to try to describe what it is like, in response to articles you may find that will tell you mostly what it’s not like, while also telling you that it doesn’t work, despite describing something that isn’t quite TDD.
If TDD seems interesting, I’d suggest that you try it, and I’d suggest that you go into it realizing that it is quite different from unit testing, quite different from top-down or bottom-up design, quite different from most of the things you’ve already learned so well. I say that because it was quite different to me, even though I had an advanced degree in computer science and forty years of serious work and study in software before I learned it.
And don’t forget to check out the companion article demonstrating TDD in Lua.
1. I’ve written up an example in Codea Lua. Check it out: Lua is easy to understand.