Monday, December 25, 2017

Unnecessary Code

We were asked, "How can I tell if my code does extra unnecessary work?"
To answer this question well, I’d need to know what you mean by “unnecessary.” Not knowing your meaning, I’ll just mention one kind of code I would consider unnecessary: code that makes your program run slower than necessary but can be replaced with faster code.

To rid your program of such unnecessary code, start by timing the program’s operations. If it’s fast enough, then you’re done. You have no unnecessary code of this type.

If it’s not fast enough, then you’ll want to run a profiler that shows where the time is being spent. Then you search those areas (there can be only one that consumes more than half the time) and work it over, looking first at the design.

There’s one situation I’ve encountered where this approach can bring you trouble. Code that’s fast enough with some sets of data may be unreasonably slow with other sets. The most frequent case of this kind is when the algorithm’s time grows non-linearly with the size of the data. To prevent this kind of unnecessary code, you must do your performance testing with (possibly artificially) large data sets.

Paradoxically, though, some algorithms are faster with large data sets than small ones.

Here’s a striking example: My wife, Dani, wanted to generate tests in her large Anthropology class. She wanted to give all students the same test, but she wanted the questions for each student to be given in a random order, to prevent cheating by peeking. She gave 20 questions to a programmer who said he already had a program that would do that job. The program, however, seemed to fall into an unending loop. Closer examination eventually showed that it wasn't an infinite loop, but would have finally ended about the same time the Sun ran out of hydrogen to burn.

Here’s what happened: The program was originally built to select random test questions from a large (500+ questions) data base. The algorithm would construct a test candidate by choosing, say, twenty questions at random, then checking the twenty to see if there were any duplicates among those chosen. If there were duplicates, the program would discard that test candidate and construct another.

With a 500 question data base, there was very little chance that twenty questions chosen at random would contain a duplicate. It could happen, but throwing out a few test candidates didn’t materially affect performance. But, when the data base had only twenty questions, and all Dani wanted was to randomize the order of the questions, the situation was quite different.

Choosing twenty from twenty at random (with replacement) was VERY likely to produce duplicates, so virtually every candidate was discarded, but the program just ground away, trying to find that rare set of twenty without duplicates.

As an exercise, you might want to figure out the probability of a non-duplicate set of twenty. Indeed, that’s an outstanding way to eliminate unnecessary code: by analyzing your algorithm before coding it.

Over the years, I’ve seen many other things you might consider unnecessary, but which do no harm except to the reputation of the programmer. For example:
* Setting a value that’s already set.
* Sorting a table that’s already sorted.
* Testing a variable that can have only one value.

These redundancies are found by reading the program, and may be important for another reason besides performance. Such idiotic pieces of code may be indications that the code was written carelessly, or perhaps modified by someone without full understanding. In such cases, there’s quite likely to be an error nearby, so don’t just ignore them.


No comments: