JS Ext

Wednesday, June 12, 2013

On the Importance of (Proper) Logging

I was recently in a situation where I had to defend my work.  First, some background.  There are 3 JVMs.  Let's call them A, B and C.  I wrote/maintain C, but I also wrote code that gets included in A.  C, then, is a service that a bunch of JVMs contact, including A and B.  Since so many JVMs connect to my service, I ended up adding some decent logging to the system.  The main thing that gets logged is the start/stop time for every service call.  The log line for the stop time also includes how long the call took.  On top of that, I wrote code that monitors all the web container threads: if a servlet takes longer than 3.5 seconds, the stack trace of that servlet's thread gets dumped to the log.  This code was written as a library, so JVM A also has the servlet stack trace functionality in it.  JVM B is currently getting some large code changes, A is getting minor changes, and C is not getting any changes.
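I'm not going to paste the monitoring library here, but a minimal sketch of how a filter like that can be wired up looks something like the following.  This is not the actual code; the class name, the half-second polling interval, and the other details are invented, and it assumes the javax.servlet API plus Log4j.  The idea is that every request registers its thread and start time, and a watchdog task dumps the stack of any thread that has been serving a request for longer than the threshold.

    // Illustrative sketch only -- not the real library.  Every request registers its
    // thread and start time; a watchdog task dumps the stack of any thread that has
    // been serving a request for longer than the threshold.
    public class SlowRequestFilter implements javax.servlet.Filter {

        private static final org.apache.log4j.Logger LOG =
                org.apache.log4j.Logger.getLogger(SlowRequestFilter.class);

        private static final long THRESHOLD_MS = 3500;

        // Threads currently serving a request, mapped to the request start time.
        private final java.util.concurrent.ConcurrentHashMap<Thread, Long> inFlight =
                new java.util.concurrent.ConcurrentHashMap<Thread, Long>();

        private java.util.concurrent.ScheduledExecutorService watchdog;

        public void init(javax.servlet.FilterConfig config) {
            watchdog = java.util.concurrent.Executors.newSingleThreadScheduledExecutor();
            watchdog.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    long now = System.currentTimeMillis();
                    for (java.util.Map.Entry<Thread, Long> e : inFlight.entrySet()) {
                        if (now - e.getValue() > THRESHOLD_MS) {
                            LOG.warn("Slow request on " + e.getKey().getName() + "\n"
                                    + format(e.getKey().getStackTrace()));
                            inFlight.remove(e.getKey());  // dump each slow request only once
                        }
                    }
                }
            }, 500, 500, java.util.concurrent.TimeUnit.MILLISECONDS);
        }

        public void doFilter(javax.servlet.ServletRequest req,
                             javax.servlet.ServletResponse res,
                             javax.servlet.FilterChain chain)
                throws java.io.IOException, javax.servlet.ServletException {
            long start = System.currentTimeMillis();
            inFlight.put(Thread.currentThread(), start);
            LOG.info("start " + ((javax.servlet.http.HttpServletRequest) req).getRequestURI());
            try {
                chain.doFilter(req, res);
            } finally {
                inFlight.remove(Thread.currentThread());
                LOG.info("stop, took " + (System.currentTimeMillis() - start) + "ms");
            }
        }

        public void destroy() {
            watchdog.shutdownNow();
        }

        private static String format(StackTraceElement[] frames) {
            StringBuilder sb = new StringBuilder(frames.length * 64);
            for (StackTraceElement frame : frames) {
                sb.append("    at ").append(frame).append('\n');
            }
            return sb.toString();
        }
    }

A filter like this would normally be mapped to /* in web.xml so that every servlet request passes through it.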

The situation started when someone was running a test case on the QA instance of A.  They went through a flow, then all of a sudden, the page took about 2 minutes to load.  After some investigation, the team making the changes was able to reliably reproduce the problem.  That is the good news.  The bad news is that JVM B cannot be run on someone's desktop.

This is when the head developer of A and B started to tell people that C was causing the performance issue.  He didn't talk to me; no, he told other people.  He wanted to know what had changed in C.  This is where logs come in handy.  I logged into the QA server and looked at the logs.  In the previous 7 days, only one request took over 1000ms.  That means no call ever took long enough to even trigger a stack trace dump!  In my mind, it was proven: C did not cause the problem.  There weren't even enough calls to add up to 2 minutes of processing time (on the QA instance of C, not Prod)!

After a few days, I started hearing more over the cube wall.  Once again, the developer for A and B was saying C was causing the issue.  I started taking a look at A's logs.  Luckily, A has the servlet filter I wrote, which dumps the stack trace of a servlet at the 3.5s mark.  At that point in the call, A was calling B!  More proof that C wasn't causing the issue.

This kept going back and forth until I was called in by the junior (junior based on the management pecking order, not actual seniority) members of the A/B dev team to figure out why B had a performance issue.  The problem persisted because the head A/B developer considered it a C problem, so he delegated it to someone else to "solve".  The first thing I asked was whether anyone had put the performance monitoring code I wrote (the thread dump servlet filter) into B.  They said no.  I then asked which person had B running on their desktop so that I could run JProfiler or JProbe against the JVM.  Apparently, B won't run on a desktop.  You have to check in code, build it, and test it on the DEV server.

At this point, I decided to do the kill -3 trick.  On Unix systems, sending a kill -3 to a JVM causes it to print a stack trace for every thread to standard out.  I wrote a quick while loop in Korn shell to perform a kill -3 on the DEV JVM once every second.  I had the developer run the test.  After about 10 seconds, I killed the while loop and started looking at the logs.  The worker thread was at the same point in every stack trace.
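A rough in-process equivalent of that kill -3 loop (just a sketch of the same sampling idea in Java, not the script I actually ran) is to poll Thread.getAllStackTraces() once a second from inside the target JVM, for example from a debug servlet:

    // Sketch only: a poor man's sampler that does in-process what the kill -3 loop
    // did from the shell.  Start it inside the target JVM, run the slow test, and
    // the hot spot is the frame that shows up in every sample.
    public class PoorMansSampler implements Runnable {

        public static void start(int samples) {
            Thread t = new Thread(new PoorMansSampler(samples), "poor-mans-sampler");
            t.setDaemon(true);
            t.start();
        }

        private final int samples;

        private PoorMansSampler(int samples) {
            this.samples = samples;
        }

        public void run() {
            try {
                for (int i = 0; i < samples; i++) {
                    System.out.println("=== sample " + i + " ===");
                    for (java.util.Map.Entry<Thread, StackTraceElement[]> e
                            : Thread.getAllStackTraces().entrySet()) {
                        System.out.println(e.getKey().getName());
                        for (StackTraceElement frame : e.getValue()) {
                            System.out.println("    at " + frame);
                        }
                    }
                    Thread.sleep(1000);  // one sample per second, like the shell loop
                }
            } catch (InterruptedException ignored) {
                // stop sampling when interrupted
            }
        }
    }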

The method was a simple (but poorly written) pad method.  It was padding a string to be 50 characters wide.  It did not pre-allocate a buffer.  It just kept prepending the pad string over and over again until the string had a length of 50.  As we went up the stack, things appeared just as bad.  The calling method was a toString() call.  This call iterated over a map, appending the pad() of the map entry value to a StringBuffer that was not pre-allocated.  This toString() method was called because the caller was calling log.debug( "The value is " + value );  This log line was inside a tight loop.  The method that had the loop was being called about 300 times for the users that had a performance problem.  It was called only once for the people without the problem.  I assumed someone had recently added the logging line, and that is what caused the performance problem.
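I can't reproduce the actual code here, but the pattern looked roughly like this (the class and method names are invented):

    // Roughly the shape of the offending code -- illustrative only, real names withheld.
    public class Holder {

        private final java.util.Map<String, String> values =
                new java.util.HashMap<String, String>();

        // Pads by prepending one character at a time, building a brand new String on
        // every pass instead of pre-allocating a buffer.
        private static String pad(String s) {
            while (s.length() < 50) {
                s = " " + s;
            }
            return s;
        }

        // Called implicitly whenever the object is concatenated into a String, e.g.
        // by log.debug( "The value is " + value ).
        public String toString() {
            StringBuffer sb = new StringBuffer();  // default capacity, so it grows repeatedly
            for (java.util.Map.Entry<String, String> e : values.entrySet()) {
                sb.append(e.getKey()).append('=').append(pad(e.getValue())).append('\n');
            }
            return sb.toString();
        }
    }

Multiply that per-character padding by every entry in the map, by every pass through the tight loop, and by roughly 300 calls, and the 2 minute page load starts to make sense.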

What is interesting is that it took a few tries to explain to the developers why the toString() was being called.  The way they understood it (at a high level) was that since Log4j wasn't configured to output debug messages, the line shouldn't have done anything.  I had to explain that the argument has to be built (in this case, toString() gets called because of the string concatenation) before log.debug() is even invoked.  The check that determines whether the message should be printed happens inside that method.  Nobody fessed up to adding the log line.  We have revision control to figure out that kind of thing, but I try to concentrate on the "why did it happen" and the "how do we fix it" rather than the "who caused it".
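The usual fix is cheap and boring: guard the concatenation so the argument is only built when DEBUG is actually on.  Something along these lines (illustrative names only):

    // Illustrative only.  The expensive part is building the String argument, which
    // happens before log.debug() ever checks the level.
    import org.apache.log4j.Logger;

    public class LoggingFix {

        private static final Logger log = Logger.getLogger(LoggingFix.class);

        static void processAll(Iterable<Object> values) {
            for (Object value : values) {
                // BEFORE: the concatenation calls value.toString() (and its pad() loop)
                // on every iteration, even though DEBUG is turned off.
                //     log.debug("The value is " + value);

                // AFTER: the concatenation -- and therefore toString() -- only runs
                // when DEBUG is actually enabled.
                if (log.isDebugEnabled()) {
                    log.debug("The value is " + value);
                }
            }
        }
    }

SLF4J-style parameterized logging ({} placeholders) achieves the same thing without the explicit guard, but with plain Log4j the isDebugEnabled() check is the usual idiom.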

This story really tells us two things about logging.  When done right, it proved that JVM C wasn't causing a performance issue.  When done wrong, it caused the performance issue in JVM B.
