Here are a few experiments that might be effective in evaluating XP practices. I would invite my academic colleagues to design and run experiments like these to provide experimental data on the effectiveness of the practices. I would be delighted to work with interested experimenters in the design of the experiments, training of the developers, and so on.
Purpose: Compare the performance of groups of developers working on the same requirements: with full XP Unit Tests, with Functional Tests run as they go, with Functional Tests only at the end, and with both. (Functional Tests are provided and are used to drive the applications to a common level of quality.)

Draft Design: Four groups are compared, each consisting of a number of developers working alone on the same application. Group 1 uses only Functional Tests, running them only when they think development is done. Group 2 uses only Functional Tests, running them regularly during the project. Group 3 writes Unit Tests for each class as they go, in the XP style, and runs the Functional Tests only when they think development is done. Group 4 (full XP testing) writes Unit Tests as they go and runs the Functional Tests regularly during the project. All groups are driven to the same level of quality, i.e. 100% on the Functional Tests.

Measure: Time to delivery of the desired quality level; code metrics; code quality from expert and peer review after the fact; etc.
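To make concrete what "Unit Tests written for each class as they go, in the XP style" would mean for the Group 3 and Group 4 developers, here is a minimal sketch in Python. The `Account` class and its behavior are hypothetical, invented purely for illustration; the point is that the tests are written alongside the class and the whole suite is run after every small change:

```python
import unittest

class Account:
    """Hypothetical class a developer in Group 3 or 4 might be building."""
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

class AccountTest(unittest.TestCase):
    # XP style: these tests grow with the class and are run constantly,
    # not saved up until the developer believes the work is done.
    def test_new_account_starts_empty(self):
        self.assertEqual(Account().balance, 0)

    def test_deposit_increases_balance(self):
        a = Account()
        a.deposit(10)
        self.assertEqual(a.balance, 10)

    def test_rejects_nonpositive_deposit(self):
        with self.assertRaises(ValueError):
            Account().deposit(0)
```

Such a suite would be run with `python -m unittest` after each change; the Functional Tests in the experiment remain a separate, externally provided suite.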
Purpose: Compare the effectiveness of a refactoring orientation in development with rapid development without refactoring.

Draft Design Notes: Provide User Stories incrementally. Compare four groups, evaluating the impact of up-front design and of refactoring. Divide the experimental groups between those predisposed by instruction to refactor frequently and those predisposed not to refactor, and between groups doing design for generality up front and those doing simple solutions.

Measure: Time to delivery of the desired quality level; code metrics; code quality from expert and peer review after the fact; etc. Evaluate design-only versus refactoring-only, and compare both to the mixed-strategy groups.
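To pin down what the "predisposed to refactor frequently" instruction means at the level of a single change, here is a small before/after illustration in Python. The pricing function is hypothetical; the essential property is that the refactoring preserves behavior exactly while removing duplication:

```python
# Before: duplicated logic, as a rapid-development group might leave it.
def total_price_before(items):
    total = 0.0
    for item in items:
        if item["taxable"]:
            total += item["price"] * 1.08
        else:
            total += item["price"]
    return total

# After: the frequently-refactoring group extracts the duplication as
# soon as it appears. Behavior is identical; structure is simpler.
TAX_RATE = 0.08

def line_price(item):
    rate = TAX_RATE if item["taxable"] else 0.0
    return item["price"] * (1 + rate)

def total_price_after(items):
    return sum(line_price(item) for item in items)
```

An experiment could verify behavior preservation mechanically: both versions must agree on the Functional Tests, so the observed differences between groups are in time and code quality, not correctness.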
Purpose: Evaluate the utility of design for generality.

Draft Design: Choose a moderately rich domain and make all experimental groups familiar with it. Secretly choose a planned release of a product, with a series of requirements stories that call for increasing generality in the code: for example, an early requirement might specify just one instance of some object, while a subsequent requirement might require extension to a collection.

Measure: Time to delivery of the desired quality level; code metrics; code quality from expert and peer review after the fact; etc.
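The one-instance-then-collection progression mentioned above can be sketched in Python. The printer-tracking domain and class names here are hypothetical, chosen only to show the shape of the two stories; the experiment would measure how costly the step from the first version to the second is for each group:

```python
# Story 1 (early requirement): the system tracks exactly one printer.
# A simple-solution group might write just this.
class SystemV1:
    def __init__(self, printer_name):
        self.printer = printer_name

    def printer_names(self):
        return [self.printer]

# Story 2 (later requirement): the system must track a collection of
# printers. The single field becomes a list; the experiment compares
# how long this extension takes for groups that did or did not design
# for generality up front.
class SystemV2:
    def __init__(self, printer_names=()):
        self.printers = list(printer_names)

    def add_printer(self, name):
        self.printers.append(name)

    def printer_names(self):
        return list(self.printers)
```

Keeping the later stories secret is essential to the design: a group that knows a collection is coming has, in effect, already been handed the generality decision.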