Friday, September 17, 2004

An interesting bug

One down, god knows how many to go. Fortunately, this was a bug that I found myself, just before the new version was mailed out to the users. Finding it was pure luck: if I hadn't tested one particular datafile, the software would have been released, and would have broken a few databases.

There is a subroutine in the database to automate the business of installing new software versions: it reads through the user's data and makes changes as needed to suit the new structure. It ran perfectly on my development machine, but that's not saying much: its datafile has only a few dozen records in each table.

Knowing this, I've collected sample datafiles from customers wherever I could, and test new versions against these. The first four updated perfectly. Only because it happened to be a sunny day and I happened to be in a good mood, I chose to try a fifth, and happened to pick this datafile. Crash.

I've spent the last two days (half-time) on this bug, culminating this morning in a marathon line-by-line debugger walkthrough of the whole damn programme, which took over an hour of press-the-return-key-press-the-return-key-press-the-return-key, but showed me the problem.

It turns out to have been a stack overflow caused by incautious programming. The routine that updates the list of current locations spawns a subprocess, if needed, to search for the previous location and mark it as "closed". These subprocesses start, run and quit almost instantly, because a current location cannot have more than one previous location, and many have none. However, the devil is in the details: note the word almost.

In the smaller sample datafiles, this was OK: the updater ran through the datafile quickly enough, and there were so few previous locations to process that the server handled the subprocesses without strain. However, in an older file with many previous locations to be updated, the system broke: the updater was spawning subprocesses faster than the server could run them. A backlog piled up until the server quite simply ran out of memory and stopped.
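
To make the arithmetic concrete, here is a rough model of that backlog in Python (not the database's own language). The function name and the per-tick rates are invented for illustration; the real figures were never measured. It only shows that when spawning outpaces completion, the number of waiting subprocesses grows with the size of the datafile instead of staying flat.

```python
def peak_backlog(records_with_previous, spawned_per_tick, completed_per_tick):
    """Return the largest number of subprocesses waiting at any one time.

    Hypothetical model: in each tick the updater spawns up to
    `spawned_per_tick` subprocesses, then the server finishes up to
    `completed_per_tick` of the ones that are waiting.
    """
    pending = 0
    peak = 0
    remaining = records_with_previous
    while remaining > 0:
        spawned = min(spawned_per_tick, remaining)
        remaining -= spawned
        pending += spawned
        peak = max(peak, pending)
        # The server works off part of the backlog before the next tick.
        pending -= min(pending, completed_per_tick)
    return peak


if __name__ == "__main__":
    # A small datafile never builds up much of a queue...
    print(peak_backlog(12, spawned_per_tick=10, completed_per_tick=8))
    # ...but a large one keeps growing for as long as the updater runs.
    print(peak_backlog(100_000, spawned_per_tick=10, completed_per_tick=8))
```

With the made-up rates above, the backlog gains two subprocesses per tick for the whole length of the run, which is harmless over a few dozen records and fatal over many thousands.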

As with most bugs, the thing is just blindingly obvious once it's been identified and described in writing. Any second-semester computer-studies student reading this description should spot that the old system was risky, but nobody who read this part of the source code found the problem before it occurred.

I've rewritten the system to avoid the problem: the previous locations are now handled in a batch job that runs after the updater is finished. It's safe, clean and still reasonably fast.
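
The shape of the batch approach can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual code: the record layout (an "item" key, a "closed" flag) and both function names are hypothetical. The point is that the updater only makes a note of what needs closing, and nothing else runs until it has finished.

```python
def collect_pending(current_locations, previous_locations):
    """Updater pass: note which items have a previous location to close.

    No subprocess is spawned here; the work is only recorded.
    """
    pending = []
    for location in current_locations:
        # (The rest of the per-record update work would happen here.)
        if location["item"] in previous_locations:
            pending.append(location["item"])
    return pending


def close_previous(pending, previous_locations):
    """Batch job: runs once, after the updater pass has finished."""
    for item in pending:
        previous_locations[item]["closed"] = True
    return len(pending)


if __name__ == "__main__":
    current = [{"item": "a"}, {"item": "b"}, {"item": "c"}]
    previous = {"a": {"closed": False}, "c": {"closed": False}}
    todo = collect_pending(current, previous)
    print(close_previous(todo, previous), "previous locations closed")
```

The list of pending items costs a little memory per record, but it grows far more slowly than a queue of live subprocesses, and the closing step can no longer outrun the server.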

Several lessons to be learned here: On scalability, and the difficulty of predicting same. On testing, and the necessity of collecting many datafiles from different types of customers. On overconfidence.
