Troubling Testing Trends

Seeking the low-cost, high-value sweet spot.

For what seemed like too long, many developers resisted writing tests. A mix of time pressures and lack of familiarity with "best practice" programming principles conducive to effective testing might have been to blame.

The proliferation of SOLID principles, TDD (Test-Driven Development) as a (sometimes) viable problem-solving approach, significant improvements in automated test frameworks and IDE integration, and guilt-inducing test coverage analytics have all helped us evolve software delivery into a more confident industry.

Anecdotally, most developers seem to regard automated testing as "part of the job" now. Reassuring. However, some troubling patterns have emerged in recent years which undermine the key value proposition, transforming tests into more costly and frustrating overheads.

In this article we'll look at some of the unfavourable places our good intentions have landed us, and how to avoid such an ill fate.

Purpose

First, let's try to agree on the value and purpose of automated tests.

Primarily, automated tests are a solution to the problem of confidence (specifically, the absence thereof). Like some sort of semantic compiler imbued with the knowledge of our problem domain, a healthy suite of tests provides (some) certainty that our software meets the requirements while handling edge cases and failures gracefully. They also provide some protection against regression, presenting a reassuring ✅ or an insightful ❌ after large refactoring exercises, the introduction of new features, the updating of dependencies, etc. This confidence can be achieved with manual testing, albeit more slowly and less deterministically - hence the automation.

Secondarily, tests (especially well-named, focused, expressive ones) serve as a form of documentation for the features that make up the solution. They may also document bugs that have been discovered and squashed.

Another function of tests, particularly as part of a TDD flow, is to help a developer maintain focus and devise a minimal, testable API for their solutions. TDD works on the premise of "writing code as the consumer of an API that doesn't exist yet", imagining some utopian syntax and then filling in the gaps as first the compiler and then the test itself fails. At the end of the exercise, the developer has both a testable, hopefully intelligible solution and a test to validate it (now and forever). They then move on to the next test, refactoring as necessary to build a cohesive solution, with the support of prior tests. This is the "red, green, refactor" cycle and, like much of the well-intentioned academia around testing, it is "A Good Thing". Untestable code is typically untestable because it violates some best practice principles of software development. These violations lend themselves to illegible, unmaintainable, buggy code - issues which compound over time. So, with TDD, tests serve as a starting point for arguably "better code".
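
To make the flow concrete, here's a minimal Swift/XCTest sketch (DiscountCalculator and its rules are purely illustrative, not from any real codebase): the tests are written first, against the API the developer wishes existed, and the implementation follows.

import XCTest

// Test-first: DiscountCalculator doesn't exist yet when these are written.
// The compiler complains, then the test fails, then the minimal
// implementation turns it green - "red, green, refactor".
final class DiscountCalculatorTests: XCTestCase {

    func testTenPercentDiscountAppliesToOrdersOverTheThreshold() {
        let calculator = DiscountCalculator(thresholdInCents: 100_00, discountPercent: 10)
        XCTAssertEqual(calculator.discountedTotal(forCents: 150_00), 135_00)
    }

    func testNoDiscountAppliesAtOrBelowTheThreshold() {
        let calculator = DiscountCalculator(thresholdInCents: 100_00, discountPercent: 10)
        XCTAssertEqual(calculator.discountedTotal(forCents: 100_00), 100_00)
    }
}

// The simplest implementation that satisfies the tests above; refactoring
// comes later, with the tests standing guard.
struct DiscountCalculator {
    let thresholdInCents: Int
    let discountPercent: Int

    func discountedTotal(forCents amount: Int) -> Int {
        guard amount > thresholdInCents else { return amount }
        return amount - (amount * discountPercent / 100)
    }
}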

It's possible to "overdo it" on any of these fronts, so without further ado let's segue to the "when testing goes awry" part.

Pointless Code Coverage Mandates

Many people ask, "What is the best code coverage target?". The question assumes there is an optimal range between 0% and 100% to aim for. Why set a floor? Or a ceiling? Testing is not a matter of achieving a coverage goal and putting the keyboard away. It's about adding confidence where it is otherwise lacking. The only valid guidance would be to avoid zero coverage.

100% code coverage is usually:

  • impossible, or
  • not a good use of time.

100% code coverage is also not a valid indicator of correctness, and can theoretically be achieved simply by executing all of the code in the solution without making any meaningful assertions about the effects. In fact, one of the markers of incorrect software is code being executed when it shouldn't be. So, incorrect code can actually contribute positively to code coverage.

Ironic.
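
To make that concrete, here's a contrived Swift/XCTest sketch (OrderProcessor, its bug, and the test are all hypothetical): the "test" below executes every branch of the code, so coverage tooling reports it in full, yet it asserts nothing and would never catch the defect lurking inside.

import XCTest

// A contrived example: this "test" drives every branch of OrderProcessor,
// so coverage reports it as fully covered, yet it asserts nothing.
final class OrderProcessorCoverageTests: XCTestCase {

    func testProcessOrderRuns() {
        let processor = OrderProcessor()

        // Exercises both branches below. No assertions, no confidence.
        _ = processor.process(Order(totalInCents: 50_00))
        _ = processor.process(Order(totalInCents: 150_00))
    }
}

struct Order {
    let totalInCents: Int
}

struct OrderProcessor {
    func process(_ order: Order) -> Int {
        // Bug: the threshold check is inverted, so small orders are
        // discounted and large ones aren't - the "test" never notices.
        if order.totalInCents < 100_00 {
            return order.totalInCents - order.totalInCents / 10
        }
        return order.totalInCents
    }
}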

Given that "percentage covered" does not correlate to "amount of correct software", and incorrect code can contribute to misleading stats, any percentage value is arbitrary. Regardless, some teams incorporate minimum code coverage requirements into their code review processes or "Definition of Done". These pointless, arbitrary requirements not only add needless overhead but are also easily gamed and incentivise some of the problematic behaviours we'll cover in this article.

If the code coverage reports make you feel guilty, disable them. Focus on increasing confidence, not metrics. The numbers mean little, and chasing them is likely to end in a manifestation of Goodhart's Law: once coverage becomes the target, it ceases to be a useful measure.

Quantity over Quality

A side-effect of test coverage requirements, though not exclusively so, is a "quantity over quality" mindset. Developers will write many, often low-value, repetitive tests in an effort to up the stats (or perhaps just to seem more productive).

More tests surely equals more confidence, right?

You're gravely mistaken.

It should go without saying that this is not necessarily the case, but there are too many instances of tests like these:

test("When setting the text, the text is set.") {
    let vm = ViewModel()
    vm.setText("TEST")
    vm.text.shouldEqual("TEST")
}

test("When setting the text to null, it fails.") {
    shouldFail {
        ViewModel().setText(null)
    }
}

test("View model should contain links.") {
    ViewModel()
        .links
        .shouldNotBeEmpty()
}

test("View model should contain specifically 3 links.") {
    ViewModel()
        .links
        .count
        .shouldEqual(3)
}

test("View model should contain a /Profile link.") {
    ViewModel()
        .links
        .shouldContain {
            $0.url == "/Profile"
        }
}

...

These tests are extremely low-value examples of:

  • Code that, in all likely practical scenarios, won't ever fail.
  • Constraints that a compiler could / should enforce at compile-time.
  • Vague assertions that provide no confidence about correctness.
  • Redundant assertions that are reasonably covered by other tests.
  • Testing what is, effectively, configuration.

Though certainly succinct enough that anyone could understand them, basically none of them provide any value. Any time spent writing these tests was likely a waste, or at the very least could have been better spent. They should probably just be deleted, but instead they now constitute an incumbent suite of tests that other developers will likely feel obligated to maintain...

Refactoring Overhead

Every test increases the surface area of change. Refactorings cascade, reverberating through test code, leaving debris that needs to be cleaned up before the compiler and the test-runner will allow passage. A well-written feature with robust tests typically presents minimal hazard here. Bad code, however, will create a landslide and the clean-up will probably take longer than anticipated.

The only solution is a mitigative one: maintain the same discipline and rigour around the quality of our test code as we would for the rest of the codebase. Too often we see the bad behaviours that wouldn't pass code review making their way into test code:

  • Copy-pasted code borrowed from other tests.
  • Repetitious setup code, duplicated with minor variations (see the sketch after this list).
  • Magic strings and numbers that become wrong and fail the tests needlessly, or - worse - don't fail the tests when they should.
  • Verbose, illegible, inexpressive and/or incoherent code lacking any kind of abstraction or encapsulation.
  • Tests that don't represent valid use-cases; much like features that a developer simply imagines, they are needless scope-creep and time-sinks.
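
One way to tame that repetition, sketched below in Swift with a hypothetical Order model and ReceiptPolicy, is to treat test setup as production code and extract a small builder, so each test spells out only the variation it actually cares about.

import XCTest

// A hypothetical sketch: a small builder keeps setup in one place, so each
// test states only the detail it actually cares about.
struct Order {
    let totalInCents: Int
    let isGift: Bool
}

struct OrderBuilder {
    var totalInCents = 50_00
    var isGift = false

    func withTotal(cents: Int) -> OrderBuilder {
        var copy = self
        copy.totalInCents = cents
        return copy
    }

    func asGift() -> OrderBuilder {
        var copy = self
        copy.isGift = true
        return copy
    }

    func build() -> Order {
        Order(totalInCents: totalInCents, isGift: isGift)
    }
}

enum ReceiptPolicy {
    static func includesPricedReceipt(for order: Order) -> Bool {
        !order.isGift
    }
}

final class ReceiptPolicyTests: XCTestCase {

    func testGiftOrdersShipWithoutAPricedReceipt() {
        // Defaults cover everything except the one relevant detail.
        let order = OrderBuilder().asGift().build()
        XCTAssertFalse(ReceiptPolicy.includesPricedReceipt(for: order))
    }

    func testNonGiftOrdersStillGetAPricedReceipt() {
        let order = OrderBuilder().withTotal(cents: 250_00).build()
        XCTAssertTrue(ReceiptPolicy.includesPricedReceipt(for: order))
    }
}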

Maintain your standards, and - perhaps equally importantly - be very selective about what tests even make it into source control. If they don't add confidence where confidence is lacking, they're begging for the delete treatment.

UI Test Code

Automated UI tests typically come in two styles:

  1. Hand-rolled code that runs an application, facilitating state transitions via some combination of:
    • programmatic state manipulation, and
    • coded UI interactions / events.
  2. Recorded execution paths re-enacted by some test runner / tool.

Assertions are typically made in a couple of ways:

  1. Selective inspection and evaluation of discrete UI components.
  2. Aggregate visual comparisons of the running application against approved baselines (i.e. Screenshot tests, or Approval tests).

It's possible to mix the two styles and the two assertion methods, provided the tooling supports it, but I'm going to make the claim that coded UI tests should largely be avoided. Instead, embrace a philosophy around UI tests as discardable, replaceable mechanisms to defend against regression and nothing more. Recorded tests which automate screenshot comparisons are the most efficient way to achieve this. Anyone can record a test - not just a developer - and it can be done for every device orientation and form-factor with predictable effort.

In contrast, it's typically the case that hand-rolling UI test code and programming assertions against discrete UI components takes a lot of time. This isn't a problem if the value provided by those tests outweighs the cost, but here's something important to consider:

User interfaces change a lot.

Any software delivery team observing a truly agile methodology will be iterating on their software in fairly tight feedback loops. Some of the best teams will have a sprintly release cadence (once a week or fortnight), with next-level teams releasing multiple times per week... or per day. This is the utopia - a culture of delivering high value at high frequency with little to no bureaucratic overhead. Introduced a bug? Fix it, ship it. Users hate the UX? Iterate, ship it. Had an epiphany? Build it, ship it. Even with robust quality control and testing measures this is entirely possible, but UI tests that are not happily discarded and/or efficiently replaced will kill the dream.

This is a real example of a coded UI test seen in the wild.

test("Component should have blue text.") {
    
    // ... complex setup code

    component.setText("Lorem ipsum")

    container
        .findElementById(component.id)
        .findElementByType("text")
        .first!
        .colour
        .rgbRepresentation
        .shouldMatch([10, 10, 150])
}

Though simple, this test is of extremely low, arguably negative value. When the users start to protest the outrageous blue text decision, the designers will concede the change, and the developer will not only need to update the UI code but the test code too. What will they do? Will they delete the test? Or will they update the .shouldMatch(...) part to reflect the new design decision? And how long will that last?

UI tests can get much more complicated than this, with assertions against complex component hierarchies that are equally subject to change and present a significant maintenance burden. In the web world, I've seen UI tests that make grand assumptions about the DOM structure, repeating the most heinous CSS selectors imaginable ad nauseam, only to validate that some text is as it should be or a hidden element is display: none; or (as above) the colour of some validation text is red. No one wants to touch those tests. Their mere presence adds an overwhelming sense of dread at the prospect of changing anything in the UI, and rightly so as developers find themselves trapped in an endless cycle of "It's pretty much done, I just need to fix the tests...".

Don't fix them. Just delete them.

Half the solution is avoiding the temptation to test the mundanity. The other half is choosing the right kind of testing style to achieve the team's goals. In most cases, it seems most sensible to avoid programming (with code) any UI tests. Instead, observe proper "separation of concerns". Use view models and test those. Use a declarative (not imperative) UI framework like React for web, SwiftUI for the Apple ecosystem, and Jetpack Compose for Android, so that the UI code is as expressive and pure as possible. Opt for the much more efficient screenshot testing approach and - even then - be selective about what test scenarios deserve automated attention.
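
As a rough sketch of that separation in Swift (SignInViewModel and its validation rule are hypothetical, not the component from the earlier example): the behaviour worth defending lives in the view model, the tests never reach into the rendered UI, and how the message is presented - red, blue, or otherwise - is left to the declarative view and, if warranted, a screenshot test.

import XCTest

// A hypothetical sketch: validation logic lives in the view model, so tests
// assert on state rather than on rendered colours or component hierarchies.
final class SignInViewModel {
    private(set) var email = ""
    private(set) var validationMessage: String?

    var canSubmit: Bool { validationMessage == nil && !email.isEmpty }

    func setEmail(_ value: String) {
        email = value
        validationMessage = value.contains("@") ? nil : "Please enter a valid email address."
    }
}

final class SignInViewModelTests: XCTestCase {

    func testInvalidEmailProducesAValidationMessageAndBlocksSubmission() {
        let viewModel = SignInViewModel()
        viewModel.setEmail("not-an-email")

        XCTAssertEqual(viewModel.validationMessage, "Please enter a valid email address.")
        XCTAssertFalse(viewModel.canSubmit)
    }

    func testValidEmailClearsTheValidationMessage() {
        let viewModel = SignInViewModel()
        viewModel.setEmail("someone@example.com")

        XCTAssertNil(viewModel.validationMessage)
        XCTAssertTrue(viewModel.canSubmit)
    }
}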

Closing

Automated testing is important. Few developers dispute that in 2022 - a welcome change over the last decade or so. What we need now, though, is to move the pendulum back (just a little) in the opposite direction; to invest wisely, to resist the dogmatic urge to "processify" everything with % coverage quotas, to test sustainably without losing sight of the goal.

We are not here to write tests. Very few users will thank us for doing so.

We are here to ship software that solves problems.