Jeff's Technology Blog: March 2014

Tuesday, March 25, 2014

The Joys of Bulk Organize Imports

I discovered something by accident while making changes to a file in Eclipse. I habitually hit the Ctrl+Shift+O shortcut to organize the imports at the top of a Java file. This time, I didn't realize I had a folder selected in the Project Explorer window instead of the editor window. When I hit the shortcut, every single Java file in that folder had its imports organized! I quickly sent out an email to the developers instituting a new policy: I am going to organize the imports for all of our Java files once a week.

Wednesday, March 19, 2014

APK Export Fail

My company was sending an update to the Amazon app store when we got a rejection notice. The notice claimed that whenever you started our app, it would immediately crash. It didn't give a stacktrace, just the list of devices they tested it on. This was weird to us since the code was extensively tested. I grabbed the APK, side-loaded it onto my phone, and it crashed! Unfortunately, I couldn't get a stacktrace since this was a release build.

It took us a while to debug the problem. At first, I decompiled the APK, but the decompiler didn't seem to work. Most of our Java files were missing. I just thought it was a problem with the decompiler. Eventually we decided to change the AndroidManifest.xml file in the APK to make it debuggable. The first surprise was that the AndroidManifest.xml was binary, not plain-text XML. The easy solution was to change the source AndroidManifest.xml, re-export, then take the binary AndroidManifest.xml from the newly exported APK and drop it into the broken APK. We re-signed it, but Android refused to install the APK. This was a learning process for us. We learned that you can't just re-sign an APK. You have to delete everything in the META-INF/ folder first. That "un-signs" the APK file. We signed it again and were able to install the APK. We crashed and got a stacktrace: NoClassDefFoundError. It turns out that the "problem" with the decompiler was actually a problem with the export process. The export had silently failed.

For those of you who are not familiar with the Amazon app store submission process, you have to upload an unsigned APK. Amazon adds a jar file and modifies every single Activity, adding hooks into their DRM system. They do this regardless of whether you use their DRM or not. After they add their DRM, you download the new unsigned APK, sign it with your key, then upload the new signed APK back to the Amazon store. This weird process exists because Amazon wants to inject their DRM and they can't do that to a signed APK, since that would defeat the purpose of signing.

Let's examine that process for a second. Once you export an unsigned APK, you cannot test it. Unsigned APKs can't be installed onto phones. Once you sign it, in theory, you can install it, but the Amazon DRM kicks in, preventing us from testing it. This means we sent Amazon an APK where we had tested the code, but not the export process.

After some Googling, I discovered that the export process sometimes fails like this. It is usually associated with a corrupt bin/ folder in your Eclipse workspace. Cleaning your bin/ folder and rebuilding fixes the issue. This was our first attempt to package the APK from Windows. As it turns out, the bin/ folder corruption thing happens more frequently on Windows.

To prevent future problems, I recommended that we export a signed APK as the first step. This lets us test the app on a device. We can unsign the APK before uploading it the first time to the Amazon store. This gives us an opportunity to test the export process before uploading.

While it was a horrible experience, we learned a lot. Although I had signed jar files before, I never knew what actually changed in the jar file. Now we know there are two CERT files in the META-INF folder as well as a bunch of SHA-1 hashes in the MANIFEST.MF. Deleting those (well, the cert files and the hashes from the manifest) will effectively unsign the jar. The coworker I was working with didn't realize that APK files were just zip files, so that was a good lesson for her. We figured out how to enable debugging on a released APK file. It was a really good learning experience.
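The un-signing step could be scripted instead of done by hand. Here is a minimal sketch, assuming it is acceptable to drop every entry under META-INF/ (the blunt approach from earlier in this post) and that re-compressing the remaining entries does no harm. ApkUnsigner and unsign are names I made up for illustration. The output still has to be signed with your usual tooling before it will install.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: an APK is just a zip, so "un-signing" is a filtered copy.
final class ApkUnsigner {

    /**
     * Copies the zip at 'signed' to 'unsigned', skipping everything under
     * META-INF/ (assumption: nothing else in that folder is needed).
     * Returns how many entries were dropped.
     */
    static int unsign(Path signed, Path unsigned) throws IOException {
        int dropped = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(signed));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(unsigned))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().startsWith("META-INF/")) {
                    dropped++;          // signature files and manifest hashes live here
                    continue;
                }
                // Fresh entry with the same name; data is re-compressed on write.
                out.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
        }
        return dropped;
    }
}
```

A return value of zero is a hint that the input was never signed in the first place, which is worth knowing before you upload anything.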

Monday, March 17, 2014

Amazon Update Caused Our App to Crash (But It Was Our Fault)

My company recently decided that it wanted to support Kindles in the Amazon app store. The business didn't give us a lot of time. They set a pretty aggressive deadline. We proposed to wait for our next release, uploading to iTunes, Google Play and Amazon all at the same time. They rejected the idea, instead telling us to modify our previous release (that was already in the iTunes and Google Play stores) and upload it to Amazon a month before our next release was supposed to launch. We did a decent amount of testing on various 2nd and 3rd Gen Kindles and found a few issues. Most of them were less about Kindle and more about the fact that we never really developed for Android tablets. We fixed the low-hanging fruit. Then we saw a rather large commit by another team that concerned us.

Although my team owns our company app, there are two teams that develop code for it. This separation is similar to the JeffBank example I have given in previous posts. One group owns one part of the business (let's say checking) while the other group owns another part of the business (let's say mortgages). The developers for the other business unit changed all of their WebViews to Amazon WebViews. This concerned us for multiple reasons. First, we wanted to maintain one codebase. We don't want to have an Amazon branch and a Google Play branch. Also, we historically had lots of issues with WebViews. They require a lot of testing. Changing all of your WebViews this late in the game was dangerous, to say the least. The developer assured us that the Amazon WebView is an abstraction layer; it delegates back to the Android WebView if the real Amazon WebView implementation doesn't exist. They didn't have the same fear around WebViews that we did.
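The kind of delegation the other developer described can be pictured as a runtime check for the preferred implementation. The sketch below is a generic plain-Java illustration of that pattern. It is not how Amazon's library is implemented, and ImplementationPicker and pick are hypothetical names.

```java
// Hypothetical illustration of "use the preferred implementation if it is
// present on this device, otherwise fall back to the stock one".
final class ImplementationPicker {

    /**
     * Returns 'preferred' if a class with that name can be loaded,
     * otherwise returns 'fallback'. Only the class lookup is shown;
     * creating and wiring up the actual object is out of scope here.
     */
    static String pick(String preferred, String fallback) {
        try {
            Class.forName(preferred);
            return preferred;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not available (or not loadable) in this environment.
            return fallback;
        }
    }
}
```

The catch with any layer like this is that the preferred path only gets exercised on devices that actually ship the preferred implementation, so testing on other devices tells you nothing about it.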

We launched on the Amazon store and all was fine.

Fast forward two weeks. One of my friends (we used to work on the same team) called me up. He was a pilot tester for Kindle. He was a developer, but he wasn't involved with the Kindle rollout. When I had him test, he found lots of minor issues and brought them to our attention. He called because the app had stopped working on his Kindle. I found it weird that it was working and then all of a sudden it stopped. He was very busy, so we set up some time for a week and a half later to diagnose the problem. That is when my team started getting phone calls.

We were getting negative feedback in the Amazon store. Finally, Amazon sent us an email claiming our app was crashing. They provided a stacktrace. The stacktrace pointed to the WebView code that the other team wrote. My team inspected the code, comparing it to the documentation. According to the documentation, the code should NEVER work. We should always be crashing. We brought this up to the other team and they confirmed that the code was wrong, that it was fixed already in the next release (due to unrelated refactoring) and meh.

However, we retested on all of our Kindles and couldn't reproduce the problem. Management started dropping the hammer. Unhappy emails were being sent. I called up my buddy and he offered to sacrifice a lunch so we could work things out. My coworker and I went to my buddy's building and installed the debuggable version of the app that was sent to the Amazon store. His crash was the same as the one reported by Amazon. We "fixed" the WebView code and it worked. We tested a build of our next release and it worked. We verified the problem, but we didn't know why it happened all of a sudden.

Since our next release would fix it anyways and we were two weeks away from launching that release, our business decided that we shouldn't upload a fixed version to the Amazon store. Doing so would have bumped all of our version numbers, which would mess with our version checks. That would have required making changes to our next release which was in super-lockdown.

My buddy let us borrow his Kindle. We looked at our logs and identified the last time he successfully logged in. We Googled his Amazon OS version number and discovered that the version was released a few days after my buddy's last successful logon. It turns out he got a software update and that broke our app. This is why the failures seemed to snowball. As people upgraded to the latest Amazon OS, our app started to crash for them.

Now, the root cause was technically our fault. We initialized the Amazon WebView library. The problem was still frustrating for multiple reasons. First, Amazon never gave us a heads up about an upgrade. They didn't give us an opportunity to test. Google didn't even turn up an official announcement about the upgrade. We found out about it because of some shady website that offered the Amazon OS upgrade as a sideloading download. The second thing that was irritating was the fact that we were using the Amazon WebView. I Googled around, and I couldn't find anyone who was using this library. I have no idea why the other team decided to use it. There is no information on how to support it or the quirks related to cookies. Finally, this was the other team's code. We spent a lot of time diagnosing the problem. It is a shame when other teams don't take ownership.

Thursday, March 13, 2014

Decompiling Failure

My company decided to roll out a product that allows us to take screenshots of key pages on our mobile app. This in theory gives us a better tool to help us diagnose problems that our clients have. If a client calls up after having an issue, we can see what they saw. It gives us a better understanding of what happened without having to go through server logs. Developers tend to be fine looking at logs, but our Tier-1 support, the people who take the phone calls, might not want to look at logs. We were provided with a jar file, some example code and some documentation. This seemed like an easy project. Boy, were we wrong.

The first thing most people think would be a problem is bandwidth, but that wasn't an issue. Keep in mind this is only for a small set of key screens, so it shouldn't take a lot of bandwidth. The first issue was actually related to scrollable pages. Now, you'd think a library that takes screenshots for you would... work. You would be wrong. The problem lies in the fact that the tool only takes a screenshot of the visible part of the page. We put in a support ticket and got a response: we were doing it wrong! You see, we should have read the documentation. What did we do wrong? We have no clue. The response just said read the documentation! We responded with a politically friendly "WTF" and then they dropped this helpful nugget: if you want to take a screenshot of the entire page, you have to attach a scroll listener and wait for the user to scroll there. Now you are taking lots of screenshots with no guarantee that you will see the entire page. We quickly rejected that. Therefore, they responded with "read the documentation". WTF!

One of our developers researched taking our own screenshots of the scrollable view and she figured it out. We suggested that they add an API call to allow us to send our own screenshot to their server. We liked this idea since this is actually how the iOS library that the company supports works. They rejected the request, citing the fact that the Android library should already do what we need it to do. After 7 back-and-forths, they wrote a response saying that it is the responsibility of our developer to take the screenshot at the right place. They just provide the screenshot tool. Our product owner (the person in our company who talks to the vendor on a regular basis) forwarded that email to a lot of managers, saying it was my fault (they didn't drop my name or CC me in the email, but it was obvious they meant me). That is when I broke out the decompiler. I decompiled the jar file into Java source code. I then followed the call stack we were using to take the screenshot. That is when I found the issue. The API has us pass in the View that we want a screenshot of. The screenshot code actually takes a screenshot of View.getRootView(), though. The support team for this vendor implied I was the idiot, so I had to decompile their code to prove they had no clue what they were talking about.
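To make the mismatch concrete, here is a plain-Java model of it. FakeView is a hypothetical stand-in I wrote for this post, not the Android View class and not the vendor's code; it only mimics the idea that getRootView() walks up to the topmost ancestor.

```java
// Hypothetical model of the bug: the caller passes in one view, but the
// capture code silently swaps it for the root of the hierarchy.
final class ScreenshotBugModel {

    /** Stand-in for a view: a name and an optional parent. */
    static final class FakeView {
        final String name;
        final FakeView parent;

        FakeView(String name, FakeView parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Mimics the idea of View.getRootView(): climb to the top. */
        FakeView getRootView() {
            FakeView v = this;
            while (v.parent != null) {
                v = v.parent;
            }
            return v;
        }
    }

    /** What the API contract suggests: capture the view that was passed in. */
    static String captureAsDocumented(FakeView target) {
        return target.name;
    }

    /** What the decompiled code effectively did: capture the root instead. */
    static String captureAsImplemented(FakeView target) {
        return target.getRootView().name;
    }
}
```

In this model it doesn't matter which view you hand over. Anything inside the same hierarchy resolves to the same root, so the scrollable content you asked for never gets captured on its own.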

Our next major issue was with the "kill switch". The kill switch allows us (in theory) to remotely disable the screenshot library in the event of an unforeseen issue. If you ever had an issue with an app in the store, then you realize the importance of disabling functionality without resubmitting to the store. The documentation states that you create a URL that returns a single character, a 1 or a 0. If the webpage returns a 1, then the library is enabled. If the webpage returns a 0, then the library is disabled. We enabled the kill switch configuration and the library stopped working. It didn't matter if it was a 1 or a 0; it always disabled itself on both Android and iOS. Then we got a call from our product owner saying we were doing it wrong. They claimed the documentation said nothing about a 1 or a 0. They claimed we needed to send either an HTTP status code of 200 or 500 to enable or disable. We started poring over the documentation. What we determined was that the Android documentation said to use 1 or 0. The iOS documentation, on the other hand, tells us to use the HTTP status code in one place and the 1 or 0 in another place. The iOS documentation literally contradicted itself.
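The two readings of the kill switch can be written down side by side. This is only a sketch of the contracts as we understood them from the documentation, not the vendor's implementation. KillSwitchContract and its methods are hypothetical names, and treating 200 as "enabled" and everything else as "disabled" is my assumption about the status-code variant.

```java
// Hypothetical sketch of the two documented kill switch contracts.
final class KillSwitchContract {

    /** Body variant: a response body of "1" enables, anything else disables. */
    static boolean enabledByBody(String body) {
        return body != null && "1".equals(body.trim());
    }

    /** Status variant (assumed reading): 200 enables, anything else disables. */
    static boolean enabledByStatus(int httpStatus) {
        return httpStatus == 200;
    }

    /** True when both readings reach the same verdict for a given response. */
    static boolean contractsAgree(int httpStatus, String body) {
        return enabledByBody(body) == enabledByStatus(httpStatus);
    }
}
```

A perfectly ordinary response, HTTP 200 with a body of 0, is "disabled" under one reading and "enabled" under the other, which is why a contradiction in the documentation matters so much here.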

We ended up ripping the entire product out of the current release, completely wasting my time. We decided to revisit it for our next release.
