Problems with Acceptance Testing
Initial Remarks
Jim Shore has written a short item on the above subject. I’ll begin by paraphrasing him, following his outline closely.
In a nutshell, he's saying that customer acceptance testing using Fit (and FitNesse, I imagine) isn't paying off for him. His objection seems primarily to be that the "natural language" tools aren't worth their cost.
Two things drive this. First, the people in the customer role don't want to take the time to write the examples. However, they don't trust examples written by others. The result is that the responsibility gets handed off to testers, which defeats the purpose.
Second, the tests that Jim's teams get with Fit seem to be end-to-end, and therefore slow and brittle.
Therefore, Jim is no longer recommending this kind of testing. Instead, he relies on close communication with the on-site customers (product owners), only sometimes creating the examples that correspond to conventional Fit tests, and then turning those examples into TDD tests.
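To make that last idea concrete, here's a minimal sketch of what turning a customer example into a TDD test might look like. The shipping-charge example, the ShippingCalculator class, and the test names are invented for illustration, not taken from Jim's teams: imagine the customer's example as a Fit-style table of order totals, destinations, and expected charges, with a programmer writing each row straight into a plain JUnit test.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical: the customer's example, originally rows of a Fit-style table
// (order total, destination, expected charge), rewritten as programmer tests.
public class ShippingExampleTest {

    // Tiny stand-in domain code, included only so the sketch is self-contained.
    enum Destination { DOMESTIC, OVERSEAS }

    static class ShippingCalculator {
        double chargeFor(double orderTotal, Destination destination) {
            if (destination == Destination.DOMESTIC) {
                return orderTotal >= 100.00 ? 0.00 : 4.95;
            }
            return 14.95;
        }
    }

    private final ShippingCalculator calculator = new ShippingCalculator();

    @Test
    public void smallDomesticOrderPaysStandardRate() {
        assertEquals(4.95, calculator.chargeFor(25.00, Destination.DOMESTIC), 0.001);
    }

    @Test
    public void overseasOrderPaysOverseasRate() {
        assertEquals(14.95, calculator.chargeFor(25.00, Destination.OVERSEAS), 0.001);
    }

    @Test
    public void largeDomesticOrderShipsFree() {
        assertEquals(0.00, calculator.chargeFor(100.00, Destination.DOMESTIC), 0.001);
    }
}
```

The example still comes out of the customer conversation; what's lost, compared to a Fit table, is that the customer probably won't read the JUnit version.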
In a couple of tweets, Jim also tells me that his teams do create regression tests, and of course they do exploratory testing as any wise team would. When exploration finds things, or when defects are reported, then the team realizes they are too confident, improves their practice a bit, bears down a bit, and gets back on track.
On the one hand, I find Fit and FitNesse to be an intense pain to use, and in my gut I suspect that if I were made to use it, I’d refuse. Or, I hope, come up with a better idea.
On the other hand, I believe that examples are the best way to communicate about any complex requirement, and I’m concerned that Jim’s teams may not be working on the complex end of the requirements scale. (This could be good for a number of reasons, but it might imply that his advice only applies at the simpler end of the scale.) Still, his teams are using examples where they think it counts. In many applications that are basically CRUD, examples may well be a waste of time.
On the gripping hand, I’m concerned about the potential loss of understanding and agreement between the on-site customer / product owner and the team. Examples are the best way to build that shared understanding, and while they are more work for the customer, they are more likely to be understood, and to be correct.
Bottom line, I’m concerned about this issue because I like the clarity that results from having concrete tests that are agreed to be “the definition of done”. At the same time, Jim is a smart and experienced person, and we need to pay attention to what he’s finding out there.
This related post just in from Gojko Adzic.
This related post just in from George Dinwiddie.
Jim Explains ...
For all our benefit, Jim Shore goes on to describe the things his teams do that help make it safe to operate without Acceptance Tests per se. Good stuff. Read it. I’ll wait.
OK, well, yes. Jim lists essentially the entire contents of the XP lexicon. All those practices are good, and done in concert, they’ll keep you pretty safe. Let’s look at a few of these for comment.
Jim points out that his teams use unit tests, focused integration tests, and end-to-end integration tests, written in TDD style. The latter two are exactly the kind of tests you’d get if you could get what I call “Customer Tests”, except that they might not be in a form that the Customer herself could write or read. These should ensure reliability, though they might fall short a bit in communication value compared to Customer-written tests. With good communication around them, this can surely work.
(Mind you, I'm sure it does work. Jim is doing it. I'm just saying that we can see how the laws of software physics support this approach working. It's readily credible.)
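For readers who haven't met the term, a "focused" integration test, as I understand it, exercises a single real boundary rather than the whole system. Here's a hedged sketch: the hypothetical GreetingFileReader stands in for whatever boundary code a real team would have, and only the file system is real.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// A focused integration test: it touches one real boundary (the file system)
// and nothing else. GreetingFileReader is hypothetical.
public class GreetingFileReaderTest {

    static class GreetingFileReader {
        String firstLineOf(Path path) throws IOException {
            return Files.readAllLines(path).get(0);
        }
    }

    @Test
    public void readsTheFirstLineFromARealFile() throws IOException {
        Path file = Files.createTempFile("greeting", ".txt");
        Files.write(file, Arrays.asList("hello", "world"));

        assertEquals("hello", new GreetingFileReader().firstLineOf(file));
    }
}
```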
Jim then lists a few million standard XP practices which do support quality. Excellent. Then, he says:
And finally, when my teams find a bug, we fix it right away -- or decide that it isn't ever worth fixing. ... After writing a test to reproduce the bug and fixing it, we refactor the code so similar bugs are less likely in the future.
Well. This is interesting. The point is that if we fix things right away, and reflect, and improve our code (and our process I imagine), we’ll improve so as to reduce defect injection. Again, it is easy to see that this will certainly work.
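Concretely, that "write a test to reproduce the bug" step might look something like the sketch below. The invoice-splitting bug, the Invoice class, and the numbers are invented for illustration; the point is only that the defect gets pinned down by an automated test before the fix, so it can't quietly come back.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical regression test written to reproduce a reported bug:
// splitting a 100-cent invoice three ways used to lose a cent (33 + 33 + 33).
public class InvoiceSplittingBugTest {

    // Stand-in production code, shown after the fix so the sketch is self-contained.
    static class Invoice {
        private final long cents;

        Invoice(long cents) { this.cents = cents; }

        long[] splitThreeWays() {
            long base = cents / 3;
            long remainder = cents % 3;
            long[] shares = { base, base, base };
            for (int i = 0; i < remainder; i++) {
                shares[i] += 1; // distribute the leftover cents instead of dropping them
            }
            return shares;
        }
    }

    @Test
    public void splittingAnInvoiceNeverLosesMoney() {
        long[] shares = new Invoice(100).splitThreeWays();
        assertEquals(100, shares[0] + shares[1] + shares[2]);
    }
}
```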
We still have open a small concern about customer-team communication, since all this is mostly on the technical side.
Now Jim goes on to talk about collaboration. He has fully cross-functional, collocated teams, including customers and testers on site. Forgive me for suggesting that if you’re not doing these things, you are not on safe ground for full communication, and the need for Acceptance Tests, however painful, may be increased.
But wait, don’t answer yet! There’s more! Jim’s teams use “Customer Examples” for anything that is difficult to explain. And the developers use these examples for tests … which may not be able to be reviewed by the customers.
This is quite OK. If customers provided Acceptance Tests, the developers would be fools not to run them until they work. It really wouldn’t matter, other than for confidence building, if the customers never saw them run. That has always been true: on C3, our test running person would report the results of the tests every day, and the customer almost never ran any of them. She was satisfied that they were being run and looked at.
Jim goes on to say that he’s OK if the tests are not automated and if they are not customer-understandable. I’m OK that they are not customer-understandable – though I would prefer that they were if it were close to free. I am less comfortable with the notion that they are not automated. My concern would be that if they are not automated, doors are opened to regressions.
It would be interesting to know when these tests are automated, and when they are not, and what other tests are commonly put in place when they are not. Certainly it is not necessary to run every example to be sure that the code works. Probably it is necessary to run some.
Jim then digresses to point out that the real way to build trust is to ship stuff that works. Yes. Our question here is about the extent to which Customer Acceptance Tests contribute to things working. I agree that with a sufficiently weird customer, the tests wouldn’t generate trust just by printing “Yay, we work!”
Jim goes on to list every other known XP and related practice as things that his teams do. If your team is doing all those things, then you, too, may not need to automate acceptance tests.
And remember: I hate automating acceptance tests with Fit and suspect that I couldn't sustain doing them. I also hope and vow that I would do everything else I knew about, to make up for it.
In sum, take a hard look at all the things Jim’s teams do, including every known good practice, especially including root cause analysis on every defect. Notice that his teams do not treat defects as business as usual.
My bottom line on all this is that if I were that good, I could skip acceptance tests as well. On my best days, I’m that good. On my normal days, not so much.
I suggest that the right notion might be this one:
Acceptance Tests belong at the shu level and the lower half of the ha level. When your team is at or near the ri level, you’ll have enough other rigor in place to safely drop them. Even then, keep a weather eye out for increases in defects. We’re not as good as we think we are.
Summary Remarks
Jim and I, and a few others, exchanged some Q&A on Twitter. My main concern is this one:
Acceptance Tests are what Joshua Kerievsky calls "Story Tests". They test whether stories work. It appears that Jim's teams do these sometimes, but not always. They always do lots of low-level TDD, and they do what he calls "integration" and "end-to-end" tests. However, Jim says that they do not always save the end-to-end tests, sometimes just doing them manually during development.
It seems to me that if any given patch of code is not covered by an automated test, in principle we do not know whether it still works or not. It could be broken by a change to the code itself ... or by a change to any object on which the code relies. Yes, the TDD tests for the objects involved will help prevent defects in the code that uses them, but they are not as solid a guarantee as we'd have with automated story tests for all stories.
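To show the kind of test I mean, here's a minimal sketch of an automated story test. The checkout story and every class in it are hypothetical; the shape is what matters: one automated test drives a whole story through the objects that collaborate to deliver it, so a breaking change to any of those objects breaks the test.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

// A hedged sketch of an automated story test for a hypothetical story:
// "a shopper checks out a cart and gets a confirmed order".
public class CheckoutStoryTest {

    // Stand-in domain code so the sketch is self-contained.
    static class Catalog {
        double priceOf(String sku) { return "BOOK-1".equals(sku) ? 12.50 : 0.00; }
    }

    static class Cart {
        private final List<String> skus = new ArrayList<>();
        void add(String sku) { skus.add(sku); }
        List<String> skus() { return skus; }
    }

    static class Order {
        final double total;
        final boolean confirmed;
        Order(double total, boolean confirmed) { this.total = total; this.confirmed = confirmed; }
    }

    static class CheckoutService {
        private final Catalog catalog;
        CheckoutService(Catalog catalog) { this.catalog = catalog; }
        Order checkOut(Cart cart) {
            double total = cart.skus().stream().mapToDouble(catalog::priceOf).sum();
            return new Order(total, true);
        }
    }

    @Test
    public void shopperChecksOutACartAndGetsAConfirmedOrder() {
        Cart cart = new Cart();
        cart.add("BOOK-1");
        cart.add("BOOK-1");

        Order order = new CheckoutService(new Catalog()).checkOut(cart);

        assertTrue(order.confirmed);
        assertEquals(25.00, order.total, 0.001);
    }
}
```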
Jim’s observations seem to me to touch on these related but separate topics.
First, customer-owned examples are hard to drag out of the customer. Jim’s teams work to get the examples in the cases where the stories are complex enough to warrant them.
Second, building visible story tests for customers is costly, given today’s tools, and often the team and its customers do not get enough value to make it worth doing customer-visible story tests.
Third, Jim’s teams have found that intensive TDD, including unit tests, integration and some story tests, gives them enough confidence (and enough real quality).
To the first, I’ve seen plenty of teams where the customer (Product Owner) was too busy to create examples. I’ve also seen it lead to slow-downs, and to mistakes, especially when things were complex. Examples are a good way to go in those cases.
To the second, I, too, find the existing tools to be way too much trouble to use. On the C3 project, we built our own tool, which made tests practical. We had examples, mostly from the Customer’s helpers, and it turns out that we reported the results to the Customer: she rarely looked directly at the tests. In addition, the C3 project had a very complex domain, which made the tests more valuable.
To the third, I remain a bit uncomfortable. Certainly, if the team is producing very few defects, then what they are doing is working. I would be interested to know some statistics, notably: How often, when a defect arises, could the defect have been prevented by a story test that was not written?
My bet would be that missing story tests will be associated with the majority of story defects that are discovered.
Does that make Customer-owned Acceptance Tests written with Fit or equivalent worthwhile? No, not if resistance is high enough and value low enough.
Does that make story tests automatically worthwhile? No, not if errors are infrequent enough.
My conclusion is that certainly what Jim’s teams are doing is working, and they are doing all the XP practices quite well. If other teams do the practices that well, they’ll probably have similar results.
And I think that automated story tests are the simplest and most certain way to prevent defects cropping up in stories later on.