Ruthlessly Helpful

Stephen Ritchie's offerings of ruthlessly helpful software engineering practices.

Category Archives: Value

Problem Prevention; Early Detection

On a recent project there was one single line of code that a developer added intended to fix a reported issue. It was checked in with a nice, clear comment. It was an innocuous change. If you saw the change yourself then you’d probably say it seemed like a reasonable and well-considered change. It certainly wasn’t a careless change, made in haste. But it was a ticking time bomb; it was a devilish lurking bug (DLB).

Before we started our code-quality initiative the DLB nightmare would have gone something like this:

  • The change would have been checked in, built and then deployed to the QA/testing team’s system-test environment. There were no automated tests, and so, the DLB would have had cover-of-night.
  • QA/testing would have experienced this one code change commingled with a lot of other new features and fixes. The DLB would have had camouflage.
  • When the DLB was encountered the IIS service-host would have crashed. This would have crashed any other QA/testing going on at the same time. The DLB would have had plenty of patsies.
  • Ironically, the DLB would have only been hit after a seldom-used feature had succeeded. This would have created a lot of confusion. There would have been many questions and conversations about how to reproduce the tricky DLB, for there would have been many red herrings.
  • Since the DLB would seem to involve completely unrelated functionality, all the developers, testers and the PM would never be clear on a root cause. No one  would be sure of its potential impact or how to reproduce it; many heated debates would ensue. The DLB would have lots of political cover.
  • Also likely, the person who added the one line of code would not be involved in fixing the problem because time had since passed and misdirection led to someone else working on the fix. The DLB would never become a lesson learned.

With our continuous integration (CI) and automated testing in place, here is what actually happened:

  • First thing that morning the developer checked in the change and pushed it into the Mercurial integration repository.
  • The TeamCity Build configuration detected the code push, got the latest, and started the MSBuild script that rebuilt the software.
  • Once that rebuild was successful, the Automated Testing configuration was triggered and ran the automated test suite with the NUnit runner. Soon there were automated tests failing, which caused the build to fail and TeamCity notified the developer.
  • A few minutes later that developer was investigating the build failure.
  • He couldn’t see how his one line of code was causing the build to fail. I was brought in to consult with him on the problem. I couldn’t see any problem. We were perplexed.
  • Together the developer and I used the test code to guide our debugging. We easily reproduced the issue. Of course we did, we were given a bulls-eye target on the back of Mr. DLB. We  quickly identified a fix for the DLB.
  • Less than an hour and a half after checking in the change that created the DLB, the same developer who had added that one line made the proper fix to resolve the originally reported issue without adding the DLB.
  • Shortly thereafter, our TeamCity build server re-made the build and all automated tests passed successfully.
  • The new build, with the proper fix, was deployed to the QA/testers and they began testing; having completely avoided an encounter with the DLB.

Mr. DLB involved a method calling itself recursively, a death-spiral, a very pernicious bug. Within just that one line of added code a seed was planted that sent the system into a stack overflow that then caused the service-host to crash.

Just think about it, having our CI server in place and running automated tests is of immense value to the project:

  • Greatly reduces the time to find and fix issues
  • Detection of issues is very proximate in time to when the precipitating  change is made
  • The developer who causes the issue [note: in this case it wasn’t carelessness] is usually responsible for fixing the issue
  • QA/testing doesn’t experience the time wasted; being diverted and distracted trying to isolate, describe, reproduce, track or otherwise deal with the issue
  • Of the highest value, the issue doesn’t cause contention, confrontation or communication-breakdown between the developers and testers; QA never sees the issue

CI and automated testing are always running and vigilantly. They’re working to ensure that the system deployed to the system-test environment actually works the way the developers intend it to work.

If there’s a difference between the way the system works and the way the developer wants it to work then the team’s going to burn and waste a lot of time, energy and goodwill coping with a totally unintended and unavoidable conflict. If you want problem prevention then you have to focus on early detection.