We get the fix pack, and we did a quick regression test. Everything seemed like it still worked. I told my project manager that it would take me some time to integrate the server flush fix. You see, the method we call to flush to the server already worked in one way....in the foreground. Any professional library developer would know that you need to maintain backwards compatibility. Therefore, I assumed that they would provide a new method for the background version. I was wrong. They changed the existing method. Our issue (the freeze after taking a screenshot) was gone. I checked everything in, let Jenkins build the APK and told our tester that it was ready to test.
We are near the end of our release cycle so it took a few days for the tester to actually test the screenshot functionality. I was in for a nasty surprise though. The tester reported that the screenshot functionality wasn't working! I fired up Logcat to figure out what was going wrong. The "fix" pack did a lot more than just fix our one issue. While it also fixed a bunch of other issues that other clients complained about, it also reduced the logging to Logcat. After 2 days of looking into it, I finally disassmbled their Jar file. I used the debugger to step line by line into their code. That is when I discovered the first problem.
The very first thing the library does is send the killswitch request to see if the library should be enabled. That is done in the foreground, delaying the start of the application. One of this company's clients complained, so they put the killswitch request in the background. The problem is the library initialization takes two steps: start() and enable(). start() fires off the killswitch request and enable() uses the killswitch response to determine if the library should start up. Since start() now runs in the backround (and returns immediately), enable() has a high likelihood of failing. If the server is running fast that day, then enable() will work. That is why my quick regression worked when I was first given the fix pack. While looking at the source code for enable(), I noticed that it checks an internal boolean to see if enable() already ran successfully. This means I could call enable() right before I take the screenshot. It is a hack, but at least I'm not calling undocumented API calls, like I did in my previous hack.
The next issue was that not all the screenshots were being sent to the server. This one drove me nuts. It took me a while before I decided to look at the source code for takeScreenShot(). I was appalled. What used to be a foreground action was now in the background! You see, another client complained about something being slow. This time is was the screenshot code. The library developers did what they always do: they shove it in the background. The problem was I call the flush code right after I take a screenshot. Yay race condition! There was no guarantee that the screenshot was done being taken before we invoke the manual flush. Once again, this was something that would work part of the time, which is why it passed our quick regression.
It is obvious that the developers that own this library don't know anything about multithreading. You can't just throw everything in the background as soon as someone complains about code running in the foreground. It requires thought and planning. There are ramifications.
The next thing that irritates me is the fact that some features are barely supported. If a company provides half ass support for a feature, then say it up front. Don't let us get invested in a feature that is just going to cause us more problems. In this case, it was the manual flush. My company wants to guarantee that the server gets the screenshot. They don't want to gamble with the possibility that the user kept our app open for another 60 seconds. If your app/library implements a feature in a shitty way, then you really don't support that feature. Let us know that up front and move on.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.