I have been struggling lately with this idea that once code works in Prod, it is infallible. I have worked on a few defects recently where we identified bugs in code that is over 4 years old. The recommendation always included fixing the 4 year old defects, but management seems to dislike that. In their minds, if the code worked fine for 4 years, then we must have changed something that caused the issue. While true to a certain degree, the problem lies in the kind of defect that occurs. I have been seeing two types of "old" defects that manifested themselves recently.
The first type of defect is the good ol' race condition. In my college multithreading class, I was taught that any possible path of execution is just that...possible. Therefore any possible path that is wrong must be eliminated. We had to prove that every possible execution path was correct. Just because you work to reduce the likelihood of a particular execution path doesn't mean it will never occur. The reason is the decision-making process that thread schedulers go through. Changes in your operating system, your hardware, your JVM version, or even the code inside the threads being scheduled could change the order in which lines of code get executed. This means you must eliminate race conditions entirely, because they could cause future problems! That is not how management thinks, though.
The second type of defect is one where the calling code changed in a way that triggers the defect. This is the most common scenario I have seen. Imagine a Set&lt; String &gt; in Java. A Set contains a collection of String objects, and the collection is supposed to be unique: if you try to add the same string twice, the second Set.add() is ignored. I saw an implementation of a Set-like container that claimed the same uniqueness property. The problem was that the add() method didn't honor that contract. You could add the same object as many times as you wanted. In this case it wasn't a String, though. The developer required the API user to enforce uniqueness by checking the "key" property of all the objects already in the set. The object was very large and non-changing (which is why it was being cached). An error was introduced in the call stack 10 frames up. That error caused the code 1 frame up to check for a null key, even though it still had the full, correct object it was adding. This caused the code to call add() on the set 60k times. Obviously we ran out of memory.
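A container like that should enforce its own contract instead of trusting every caller to check keys first. Below is a minimal sketch of what that could look like; `CachedRecord` and `RecordCache` are hypothetical names standing in for the real classes, which I can't share. Keying an internal map on the "key" property makes duplicates impossible, and rejecting a null key makes the broken caller fail fast instead of silently inserting 60k copies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the large, non-changing cached object.
class CachedRecord {
    final String key;
    final byte[] payload;   // stands in for the large cached data
    CachedRecord(String key, byte[] payload) {
        this.key = key;
        this.payload = payload;
    }
}

// A Set-like cache that enforces its own uniqueness contract rather
// than requiring API users to pre-check the key property themselves.
class RecordCache {
    private final Map<String, CachedRecord> byKey = new HashMap<>();

    // Returns true only if the record was actually added; a second add
    // with the same key is ignored, matching the Set.add() contract.
    boolean add(CachedRecord record) {
        if (record == null || record.key == null) {
            // Fail fast: a null key means the caller is broken, and
            // caching under it would just hide the upstream defect.
            throw new IllegalArgumentException("record and key must be non-null");
        }
        return byKey.putIfAbsent(record.key, record) == null;
    }

    int size() { return byKey.size(); }
}
```

With this design, the bug 10 frames up would have surfaced immediately as an exception on the first add(), instead of as an out-of-memory error after 60k duplicates.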
In both of these scenarios, fix your code. I don't care how old it is. A defect is a defect.