A lot of people with the Nexus 5 have been complaining about Lollipop. Some have complained about battery life. One large group of users complained about a memory leak, which Google has checked in a fix for. I have not experienced any of these issues, though. I am currently running 5.0.1 on my Nexus 5. Although I do reboot it from time to time (mostly to play around with Ubuntu Touch), my current uptime is just under 10 days.

I multitask heavily on the phone. I have what seems to be an infinite scroll in Task Manager, mostly due to Chrome tabs showing up in there. When my car turns on, RocketPlayer automatically starts playing over Bluetooth. Pandora is always running since I sometimes pause RocketPlayer to listen to Pandora in the car. I have a service that uses Text-to-Speech to read my incoming SMSes to me while I am driving. I have Google Now configured to always listen so that I can respond to those SMSes with Speech-to-Text. I have a service that automatically turns on Bluetooth tethering so that my Nexus 7 will automatically get internet when I am away from home. I just opened up Settings and the System was only using 471MB of RAM.
I do consider myself a bit of a power user, so I would expect battery issues and memory leaks to happen to me first, but they haven't. I bring all of this up because of a few comments I read about the memory leak. Some have pointed out that they are not experiencing the memory leak. They are not seeing apps force closing due to running out of RAM. Whenever I see one of those comments, I always see a response saying they aren't using the phone enough to see the issue. Somehow, the unaffected people are not power users and never actually needed the Nexus 5 to begin with. This irritates me to no end.
As a software engineer, I can appreciate that some users are having a problem while others are not. I have seen major bugs affect only small portions of users. Just because most people don't have an issue doesn't mean the bug doesn't exist. Just because some people don't see the bug doesn't mean those people aren't techy enough to see it. Attitudes like this make it harder to get effective feedback and harder to prioritize the work. They also make it harder to find patterns among the users that are experiencing the issue. Those patterns could be helpful to the engineers trying to find the bug.
That being said, there are a few reasons why I might not be seeing the issues:
1) I don't play games on my Nexus 5. That is what my Nexus 7 is for.
2) I use (and LOVE) Multirom. In fact, KitKat is still my "primary" ROM, but Lollipop is the default ROM.
3) I have USB debugging and developer options on because I write Android apps.
What this post comes down to is your mileage may vary. This is true for all devices/things. Don't let a few loud users convince you that Lollipop doesn't work well on the Nexus 5, just because it didn't work well for them.
JS Ext
Wednesday, December 31, 2014
Monday, December 29, 2014
Sony's 300 billion dollar "mistake"
As of 12/28, Sony's The Interview has been streamed 2 million times. That is unfortunate, since Sony might have stolen music for the movie. Various sources are reporting that a 30s clip of Yoon Mi-rae's "Pay Day" was put into the movie without a copyright license. It appears that Sony was in negotiations for the license, but those negotiations fell through. Sony tried to get the license, failed, then put the music into the movie anyway. To me, that fits the very definition of "willful infringement". Because the infringement was willful, the statutory damages are $150,000 per infringement. That is $150K x 2M, or $300B. That is a lot.
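The math behind that headline number is straightforward. This uses the post's own assumption that the $150,000 statutory maximum applies to each individual stream:

```shell
# Back-of-the-envelope statutory damages, per the post's assumption
per_stream=150000       # dollars per willful infringement
streams=2000000         # times The Interview had been streamed as of 12/28
total=$((per_stream * streams))
echo "$total"           # 300000000000, i.e. $300B
```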
Should Yoon Mi-rae have a $300B payday? In my opinion, no. The problem is Sony, via the RIAA and the MPAA, has consistently argued yes. Did Sony steal the music? Again, no, but Sony has argued in the past that copyright infringement is stealing. Does it make sense that a movie that grossed $15M should cost a company $300B in fines? No. That doesn't make sense. Once again, though, it makes sense to Sony. If a single person illegally downloads one music track instead of paying the $2 for the track, Sony will demand $150K. That is the statutory damage for "willful infringement". The point of increasing the damages to something so astronomical is to deter infringement. Well, statutory damages haven't deterred Sony from stealing music.
The bigger problem is this isn't going to amount to a whole lot. Large corporations don't have to follow the same set of rules that people do. Sony will see no problem requesting $150K when it is their intellectual property being "stolen" but will cry foul when they steal someone else's intellectual property. And they will win.
Tuesday, September 30, 2014
Amazon: An example of how NOT to run an app store
After a recent release to the app stores, my company discovered a crash in the public area of our Android app. The crash was caused by a race condition in making HTTP requests. It was crashing enough that we decided to deploy a fix to the Google Play and Amazon app stores. Google Play has some issues of its own (it gives a cryptic error when you deploy two versions in "rapid" succession), but we launched in the store. Amazon ended up rejecting us.
When you elevate to a store, you have to provide a demo account so that Apple, Google and Amazon can test out the functionality. During the main release, Apple, Google and Amazon (in that order) were able to log in. Then, during the fix release, Google was able to log in just fine. Somehow, in the hours between Google successfully logging in and Amazon trying to log in, the demo account got locked out. This shouldn't be that big of a deal. We called up our support group to unlock the account. We contacted Amazon to try again. That is when we found out the bad news. They can't just "try again". We have to resubmit our app. That frustrated us for two reasons. First, in large companies, there tends to be some bureaucracy when it comes to changes to a production environment. This is usually done to allow different groups a chance to reschedule or join a change. Second, the Amazon upload process is really painful. For those of you not familiar with Amazon's DRM process, here it is:
1) Upload your APK
2) Amazon adds their DRM jar into your APK, then modifies every Activity to hook into their DRM
3) You download the modified APK
4) You sign the APK with your digital signature
5) You upload the modified APK
In a small shop, this is not a big deal. In a large enterprise, all production changes tend to be executed by a different team. That team tends to have no clue what "signing" is. You end up with multiple people on a conference call since different people need to do different parts of the change.
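For readers who have never done step 4, re-signing is done with standard Android tooling. A hypothetical sketch (the file name, keystore, and key alias are placeholders, not our project's real values):

```shell
DRM_APK=MyApp-amazon-drm.apk   # the modified APK downloaded from Amazon
KEYSTORE=release.keystore      # your release keystore
ALIAS=release                  # your release key alias

# Sign with the standard JDK tool (the DRM-wrapped APK comes back unsigned)
jarsigner -keystore "$KEYSTORE" "$DRM_APK" "$ALIAS"

# Optionally verify the signature before uploading the APK back to Amazon
jarsigner -verify "$DRM_APK"
```

This is exactly the kind of step that a separate production-operations team, which has never touched a keystore, struggles with.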
A few days later (due to a weekend and a holiday) we resubmitted. After the overnight wait (Amazon's test engineers seem to work in a timezone on the opposite end of the world), we got rejected again! This time:
Bug Description: "The App" was found to be incompatible because of issues with the app’s interaction with Kindle Fire's hibernation feature. "The App" force closes, crashes, or loses user state after the device hibernates and then resumes.
Steps to Reproduce:
1. Install and launch the app.
2. Login using the credentials given.
3. Hibernate the device.
4. Unlock the device, app restarts and reverts back to the login page.
What Amazon calls a bug is actually a security feature that my company's (paranoid) security team said we must have. We detect if you background then foreground the app. If you are on a secure page (a page that requires you to be logged in), then we log you out and send you to the home page to log in again. While our clients find this extremely annoying, we do provide the ability to turn the security feature off. Clients have to opt into turning it off, though.
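The detection itself is the common visible-activity counting technique (on Android this would be wired up through `Application.ActivityLifecycleCallbacks`). This is a pure-Java sketch of the logic, with made-up names, not our actual implementation:

```java
// Counts visible activities; when the count hits zero the app is in the
// background, and the next onStart() signals that a forced logout is needed.
class SessionGuard {
    private int visibleActivities = 0;
    private boolean wentToBackground = false;

    // Call from each Activity's onStart(). Returns true when the app is
    // returning from the background and the user should be logged out.
    boolean onStart() {
        boolean resumedFromBackground = visibleActivities == 0 && wentToBackground;
        visibleActivities++;
        wentToBackground = false;
        return resumedFromBackground;
    }

    // Call from each Activity's onStop().
    void onStop() {
        visibleActivities--;
        if (visibleActivities == 0) {
            wentToBackground = true; // no activity is visible anymore
        }
    }
}
```

Because Android starts the new activity before stopping the old one during in-app navigation, the counter never reaches zero on a normal screen transition, so only a real trip through the background triggers the logout.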
This is not a new feature. Amazon has accepted our app before with this feature. It seems to me that Amazon's test team is inconsistent with their testing procedures. At this point, I'm wondering why we are even in the Amazon store.
Wednesday, August 20, 2014
My issue list for Ubuntu Touch 14.10-devel r179 on Nexus 5
This is my issue list from using Ubuntu Touch as my primary mobile operating system for a few days. While this seems like a large list, don't let it scare you away. This list shouldn't tell you that Ubuntu Touch will be a failure. There are a few things to keep in mind. First, Ubuntu Touch isn't even in Beta yet. Second, most of these issues are polish-type problems. It doesn't stop the user from doing something, and it doesn't happen very frequently, but it is noticeable. Polishing is something you tend to do when you get the major issues out of the way. Finally, I am using an unsupported device. The only phone that is officially supported at this point is the Nexus 4. These are all issues I experienced on a Nexus 5.
On to the list:
When I first installed Ubuntu Touch, after the wizard, the system wasn't usable until I rebooted. The landing team knows about this issue.
Sometimes, when I boot, the lock screen never comes up. After a few minutes, I tried hitting the power button thinking the system locked up on boot, and the lock screen came up. It turns out the system booted, but the screen had gone blank due to an idle timeout before the lock screen was available. This only happened twice to me.
When hitting the power button to turn off the screen, the backlight stays on. This drains the battery very fast. I have noticed that it will eventually turn off. This is one of the two problems that I consider a true blocker for me, since the phone dies way too fast. Other users have noticed this on the Nexus 5 and Nexus 7, but I haven't seen a forum post about it happening on the Nexus 4.
In Settings->Storage, in the list of apps, each app is listed twice.
When upgrading apps, the "Transfers" area in the notification bar shows an entry for each app, but the entry only shows a red X to cancel the transfer. It doesn't tell you anything about the transfer, like which app it is for or the download status.
Sometimes, the action bar of the app slides up by 50% of the height of the action bar. This means the top 1/2 of the action bar is cut off, hidden under the notification bar. Sometimes the app fixes itself, but I mostly open up task manager, swipe to kill, then relaunch. I have seen a bug report of this in Launchpad already.
Sometimes, clicking on an app in the home screen that is already open causes weird things to happen. It does work from time to time. I have seen it just hang there for a few seconds, though. When it hangs, I usually cannot switch to that app anymore. I have to kill it and restart it. I have seen it kill the app and restart it as well. It will sometimes just do nothing. I get frustrated thinking the app won't launch until I realize that it is already open and I just need to switch to it.
In System Settings, the "Updates available" section doesn't go away after you perform all the updates until you relaunch the System Settings app.
Clicking on System Settings->About this phone->Developer mode crashes the System Settings app. I have no idea what is supposed to be in there, but I was curious.
Open up Messenger, click in the "Write a message..." text box to bring up the keyboard, hit the power button to turn off the screen, hit the power button to bring up the lock screen, swipe to unlock. The bottom row of the app (that has the text box and send button) is still in the higher position (as if the software keyboard were still there), but the software keyboard isn't actually showing. Clicking on the text box does not bring up the keyboard because the text box already has focus. You can get back to normal just by scrolling in the app.
Bluetooth does not work. This is a blocker for me, since I use the Bluetooth on my phone to play music while I drive. This is a known issue. I don't know which devices have Bluetooth working.
Sometimes, launching an app will leave you at the home screen, frozen and unusable, for a long time before actually starting the app.
The Facebook app is mostly just a webview. My news feed doesn't seem to auto-update. I have to open the menu and click on News Feed, but that doesn't always refresh my news feed. This isn't technically an Ubuntu Touch issue.
When rotating the phone, sometimes the phone detects the first rotation really fast, but takes some jiggling to detect the rotation back.
I sometimes have problems connecting to my work WiFi network. That network is a little... special. Other devices have issues, but Ubuntu on my Nexus 5 has more issues than Android on my Nexus 5. The main problem is that the WiFi network requires you to enter your Active Directory credentials on a page before you can do anything. On Android, the notification bar will usually pop up with something that you can click that takes you to the login page. Ubuntu doesn't do anything to let you know that you need to go to the webpage. As a backup, the WiFi router is supposed to send an HTTP redirect to the login page if you try to access any webpage. This is the part that is the real issue. Ubuntu doesn't seem to handle or understand the redirect. If I can't authenticate, then I can't access the internet. I have to drop back to 4G and try again later. I found the easiest way around the issue was to authenticate to the WiFi in Android, then boot into Ubuntu.
Tuesday, August 19, 2014
The Importance of Backups
I have a decent amount of data. All my data is stored in a ZFS RAID-Z array. RAID is not a replacement for backup, however. I had a data corruption issue that luckily didn't have an effect on me, since I had an appropriate backup.
First, some background. One of the hard disks in my ZFS array is in an external chassis. I didn't realize this chassis was plugged into the wrong outlet on my UPS. It was plugged into the "Surge Only" outlet. I had a power flicker and my ZFS went down to a degraded state. Normally, I would have gone down there to take care of the problem immediately, but children change your response time.
When it rains, it pours, however. When I restarted the enclosure (after plugging it into the right port), ZFS ran a scan. The scan discovered 5 files that had bitrot. Luckily, I had a backup of all 5 files, and I was able to restore them.
Here are some tips on effective backups:
1) Categorize your data by importance. My really important files (tax/legal documents and family pictures/video) are stored in the cloud (my personal Owncloud instance). I keep the important files offsite in the event that my house is completely destroyed. Less important files I store on an external USB3 hard disk. This disk is not normally connected to my server. I have to plug it in to perform a backup or restore. I don't want a lightning bolt to destroy my data and my backup, but I am less concerned with losing this data if my entire house is destroyed.
2) I highly recommend ZFS. Although it is not impervious to bitrot, it is far better than other solutions. It also identifies which files have been impacted. "zpool status -v" gives you a list of files that have issues. It is much easier to restore 5 files than to restore everything.
3) Schedule a ZFS scrub. A scrub will go through your data and verify that all the data matches its checksums. While ZFS can let you know something bad happened, it is good to pre-emptively verify your data. I perform a scrub once a week, which is how I know the data corruption issue I had occurred within a one-week window.
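Scheduling the scrub is a one-line cron job. A sketch, assuming a pool named "tank" (substitute your own pool name):

```shell
# root crontab entry: scrub the pool every Sunday at 3:00 AM
0 3 * * 0  /sbin/zpool scrub tank
```

After (or during) a scrub, "zpool status -v tank" will list any files with detected errors.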
Friday, August 15, 2014
I installed Ubuntu Touch on my Nexus 5
I have been waiting very anxiously for the ability to install Ubuntu Touch on my phone. I have been very excited about this phone operating system. I was looking at some forums and I noticed that someone said they had Ubuntu Touch running reasonably well on their Nexus 5. I previously was only looking for support on a Galaxy S4 (a backup phone) since there are a few apps that are a must for me. I did a quick search to see how far Ubuntu Touch has come on the Nexus 5 and noticed this forum post talking about Multirom support for Ubuntu Touch. Multirom is a tool that allows you to dual-boot various Android ROMs. The tool has been greatly expanded to support dual-booting alternative operating systems, like Ubuntu Touch.
I rooted my phone (I didn't have a need to previously) then followed the instructions. It took about an hour to install Ubuntu Touch. I installed 14.10 devel, build r179.
I was surprised how usable the system was. There are a few major issues and a lot of minor issues. I wouldn't consider it usable for most people. It is almost usable for me, though. I imported my Gmail contacts and calendar. I text regularly with my wife. I use the Gmail and Facebook webview apps. The Pandora app works. The unofficial Dropbox app doesn't work very well. I can list my content, but downloading is a little weird. I tried the camera and gallery apps. I have launched the terminal a few times. There wasn't a need to launch the terminal, I just launched it to see if it would work.
WiFi and 4G work, but Bluetooth does not. The built-in browser works OK. I tried the YouTube app, but that is just a webview. I installed a Google Maps app, but I haven't actually tried turn-by-turn directions. It did use the GPS to correctly identify my location. The file manager app is OK. I haven't tried a music player yet, but I'm assuming it will work fine since Pandora worked. The weather app works OK. The basic weather information is native, but if you click on anything to get more detail, you get the www.weather.com mobile website.
I haven't figured out how to copy files between the Android and Ubuntu sections of the SD Card. My Qi charger still charges the phone. The swipe gestures from the top, right and left edge work perfectly fine. The swipe gesture from the left edge doesn't always work when I have my phone cover on. If I take the cover off, then it works fine. I am compiling an issues list that I will post in a few days.
Overall, I am very impressed with how far the Ubuntu Touch development team has come. I am excited at the progress and the prospect of using it as my primary phone operating system.
Monday, August 11, 2014
My First QML App - KidsDraw
I wrote my first QML application. It is a simple app that allows you to draw with a touchscreen. There are multiple colors to select from. The draw surface supports multi-touch so you can use two fingers to draw two lines at once. It was designed for my son to use. He doesn't like it when I use my laptop without him. He really likes my laptop since he can touch the screen and things happen. The only problem so far is he uses more than two fingers. In Ubuntu, the system recognizes 3-finger and 4-finger gestures, so my son can easily get out of the program by mashing his hand on the screen....which he does regularly.
The app is very bare bones. It does not contain any automated test cases. It assumes more of a laptop or tablet display, since it puts a row of color buttons that probably won't fit on a phone.
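The core of the app can be sketched in a few dozen lines of QML. This is a minimal illustrative sketch of the multi-touch drawing idea, not the actual KidsDraw source (ids and property names are made up):

```qml
import QtQuick 2.0

Rectangle {
    width: 800; height: 600

    Canvas {
        id: canvas
        anchors.fill: parent
        property var pending: []   // line segments waiting to be painted

        onPaint: {
            var ctx = getContext("2d")
            while (pending.length > 0) {
                var s = pending.shift()
                ctx.beginPath()
                ctx.moveTo(s.x1, s.y1)
                ctx.lineTo(s.x2, s.y2)
                ctx.stroke()
            }
        }

        MultiPointTouchArea {
            anchors.fill: parent
            maximumTouchPoints: 2   // two fingers, two lines at once
            onUpdated: {
                for (var i = 0; i < touchPoints.length; i++) {
                    var p = touchPoints[i]
                    canvas.pending.push({ x1: p.previousX, y1: p.previousY,
                                          x2: p.x, y2: p.y })
                }
                canvas.requestPaint()
            }
        }
    }
}
```

Each touch point draws its own segment from its previous position to its current one, which is what makes two-finger drawing fall out for free.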
You can take a look at the source code here.
Thursday, August 7, 2014
Blindly Following Documentation
I have been working on a project that has multiple development teams working together. One of the teams has a habit of blindly following documentation without understanding what is actually going on. This has caused us problems before. This project was to build a proof of concept Google Glass app for my company. On this project, we needed to store some data that would be globally available to every activity. I had learned my lesson already and chose to store the data inside a custom Application class. Every activity that needed the data was able to access it without the fear of the data being nulled out during a GC cycle. The other team tore me apart.
They said I should use a static class with a public static variable to store the data. They cited the official Google documentation for the Application class. In the documentation, you will see a paragraph that says to do the exact opposite of what I implemented:

There is normally no need to subclass Application. In most situations, static singletons can provide the same functionality in a more modular way. If your singleton needs a global context (for example to register broadcast receivers), the function to retrieve it can be given a Context which internally uses Context.getApplicationContext() when first constructing the singleton.

I tried to explain the GC issue with static classes, but they didn't really care. They complained about needing a reference to the Context to be able to access that information. I explained that everything should already have access to the Context. That is when things got ugly. I was accused of promoting the worst possible style of Android development. Apparently, I was advocating a Worst Practice. Passing the Context around is evil. Android developers should do as much as possible to avoid having access to the Context (some blog post was quoted where people were keeping static references to Activities, keeping entire views in memory). In their minds, when the documentation says "in most situations," it means NEVER.

Through more discussions, it turned out that the other team didn't really understand the GC issue. After I fully explained it, the lead on that team suggested that we store the data in SharedPreferences instead. Now, every time an activity starts up, we read from internal storage on the UI thread. That is way better! On top of that, SharedPreferences requires access to a Context, so we are still passing the (evil) Context around.
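To make the disagreement concrete, here is a minimal sketch of the Application-subclass approach described above, using an empty stub in place of android.app.Application so it compiles outside the SDK. The names MyApplication and sessionData are illustrative; in a real app, the subclass would be declared in AndroidManifest.xml and each Activity would reach it via getApplication().

```java
// Stub standing in for android.app.Application so this runs outside the SDK.
class Application {}

class MyApplication extends Application {
    // State held here lives as long as the process itself, so it cannot be
    // nulled out the way a static field on a reloaded class can.
    private String sessionData;

    public void setSessionData(String data) { this.sessionData = data; }
    public String getSessionData() { return sessionData; }
}

public class ApplicationStateDemo {
    public static void main(String[] args) {
        // In Android, an Activity would obtain this via
        // (MyApplication) getApplication() -- no static state required.
        MyApplication app = new MyApplication();
        app.setSessionData("glass-session-42");
        System.out.println(app.getSessionData());
    }
}
```

Note that accessing the data this way does require a Context (to call getApplication()), which is exactly what the other team objected to.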
Monday, August 4, 2014
Pros and Cons of Removing Boilerplate and Using Frameworks
My team has started looking into using Java annotations to get rid of some of the boilerplate that comes with Android development. We created annotations for injecting view references in Activities and Fragments. We created annotations for handling the registration of broadcast listeners. We created annotations for getting and maintaining references to certain objects, like the host Activity from a Fragment. The pilot code that uses the annotations looks a lot cleaner (less casting, fewer lines of code). It looks easier (to me) to maintain. It had fewer bugs. Are those good reasons for removing boilerplate, though?
Saying the code has fewer bugs seems like a slam dunk. Let's analyze the bugs that occur, though. All the bugs that are prevented by the annotations are related to the lifecycles of Activities and Fragments. Our developers were grabbing references to other objects incorrectly. Sometimes they grab them in the wrong method (onCreateView vs. onActivityCreated). Sometimes they don't handle saving state to the Bundle correctly. These are the types of mistakes that junior developers make more often than senior developers. Using annotations to hide these details prevents junior developers from learning the details of the Android lifecycle. It holds them back, making it harder for them to reach the senior level. It also makes it harder to on-board mid-level developers. Mid-level developers will be used to seeing the boilerplate, and it will take time for them to get used to the annotations.
One of the other developers on my team pointed out that debugging issues could be harder. The annotations make assumptions about the code they are in. If you inject a view with the id R.id.button1, the assumption is that a layout containing a view with id R.id.button1 has already been inflated. What happens if that assumption is wrong? Currently, you get a NullPointerException the first time you try to use the reference to the view. A mid-level developer might spend hours diving into the framework, accusing it of being crap. They will curse the seniors for writing it, insisting that what they are doing is fine. The problem won't go away until a senior developer points out that they inflated their layout incorrectly, or that they were in landscape mode and forgot to add R.id.button1 to the landscape layout. They might make this mistake because they aren't getting exposed to the kinds of mistakes that force them to learn more about how Android works.
The annotations remind me a lot of the other frameworks that I have either used or written. They accomplish two things very well. First, junior developers can write code with fewer bugs. The frameworks do this by hiding the advanced details of what they are doing. This reduces bugs at the cost of never allowing them to advance as developers. The second thing these frameworks do is allow senior developers to write code faster. They can write code faster without sacrificing code quality.
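A toy, reflection-based injector shows the mechanics this kind of annotation hides. The @InjectView name and the string ids here are hypothetical stand-ins for our internal annotation and R.id constants, and a plain Map stands in for an inflated layout. It also reproduces the failure mode described above: if the id isn't present in the "layout," the field silently stays null until someone dereferences it.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical annotation: tags a field with the id of the view to inject.
@Retention(RetentionPolicy.RUNTIME)
@interface InjectView {
    String value(); // stand-in for an R.id constant
}

class Screen {
    @InjectView("button1")
    String button; // a real injector would assign a View here

    // Walks the fields and fills in anything tagged with @InjectView.
    static void inject(Object target, Map<String, String> layout) throws Exception {
        for (Field f : target.getClass().getDeclaredFields()) {
            InjectView ann = f.getAnnotation(InjectView.class);
            if (ann != null) {
                f.setAccessible(true);
                // If the id is missing from the layout, this assigns null --
                // the NullPointerException shows up later, far from the cause.
                f.set(target, layout.get(ann.value()));
            }
        }
    }
}

public class InjectDemo {
    public static void main(String[] args) throws Exception {
        Screen s = new Screen();
        Screen.inject(s, Map.of("button1", "OK button"));
        System.out.println(s.button);
    }
}
```

The one-line @InjectView replaces a findViewById call plus a cast, which is exactly the boilerplate savings, at the cost of hiding where the reference actually comes from.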
Thursday, July 31, 2014
Did I Just Find a Bug in the Google IO 2014 app?
The Google IO app tends to serve as an example of Android best practices for the technology that was just released. The source code for the 2014 IO conference app was just released, so I decided to take a look. Within two minutes, I noticed something weird: what appears to be a bug! This can't be.
The problem is on line 99 of the PartnersFragment. You can see the source code here: PartnersFragment.java. On that line, you will notice a call to getActivity(). The problem is that you shouldn't be calling getActivity() from onCreate(). That lifecycle method is too early! There is no guarantee that the activity will be available. Even if the activity is available, there is no guarantee it is a valid activity. In the worst case, this results in passing a null to the ImageLoader constructor, which might throw a NullPointerException.
You might be wondering: if this is a true bug, why hasn't it been reported already? The bug does not occur all the time. In theory, it should only appear sporadically. Most of the time, the activity gets created first, then the fragment. This means that on the happy path, the activity is always available to the fragment. If the activity and fragment need to be recreated from their Bundles, then there is no guarantee that the activity will be available to the fragment until the Fragment.onActivityCreated() method gets called. I have seen this both experimentally and in my company's Android app. Here is a Stack Overflow question about the same topic. It even includes the diagram showing the guaranteed ordering of activity and fragment callbacks.
For all I know, Google did something else that mitigates the bug so that it never actually happens. It could be safe to pass in a null into the ImageLoader constructor. I haven't reviewed much other code yet. That doesn't mean it isn't a bug, though. If something does mitigate the crash, that is logic that could break, which would cause this bug to appear. Also, if this is supposed to serve as example code, then other developers will use Fragment.getActivity() inside of Fragment.onCreate(), not realizing that they could be crashing.
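The lifecycle ordering can be sketched with tiny stub classes. The names mirror the real Android API, but the bodies are simplified assumptions made so this runs outside the SDK. The point is that on the restore path, onCreate() can run before the fragment is attached, so getActivity() returns null there, while onActivityCreated() is safe:

```java
// Toy stubs standing in for the Android framework classes.
class Activity {}
class Bundle {}
class ImageLoader {
    ImageLoader(Activity a) {
        if (a == null) throw new NullPointerException("null activity");
    }
}

class Fragment {
    private Activity activity;              // null until attached
    void attach(Activity a) { activity = a; }
    Activity getActivity() { return activity; }
    void onCreate(Bundle b) {}
    void onActivityCreated(Bundle b) {}
}

class PartnersFragment extends Fragment {
    ImageLoader loader;

    @Override
    void onActivityCreated(Bundle b) {
        // Safe: the framework guarantees the activity exists by this callback.
        loader = new ImageLoader(getActivity());
    }
}

public class LifecycleDemo {
    public static void main(String[] args) {
        PartnersFragment f = new PartnersFragment();
        // Restore-from-Bundle path: onCreate runs before attachment, so
        // passing getActivity() to ImageLoader here would hand it a null.
        f.onCreate(new Bundle());
        System.out.println("activity in onCreate: " + f.getActivity());

        f.attach(new Activity());
        f.onActivityCreated(new Bundle());
        System.out.println("loader created: " + (f.loader != null));
    }
}
```

Moving the getActivity() call from onCreate() to onActivityCreated() is the whole fix.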
Wednesday, May 28, 2014
Why Don't People Use Java Generics
My company tends to be behind in technology. Java 5 was released in 2004, but my company didn't start using it until 2008. That means we have been using Java 5 for six years now. You would think the developers would be using generics. You would be surprised.
I feel like some developers don't want to learn new technology. They learn one thing and they want to stick with it. That is the only reason why after six years, people are still writing Java code without using generics. It is either that or they LOVE casting. We don't disable Java warnings, so Eclipse will tell them right away that they are doing something wrong. We use various static code analysis tools that give them reports about problems. You can lead a developer to water, but you can't make them use Java generics.
I find this frustrating because I hated Java prior to Java 5. I couldn't stand the endless casting that needed to happen. Code that is written like that feels dirty. I feel like you need to constantly handle ClassCastExceptions. The weird thing is I feel like Java code with generics is faster to write. This is a scenario where spending a little time learning how to become a better developer is a good return on investment.
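A side-by-side sketch of the two styles shows why the generic version is both safer and faster to write:

```java
import java.util.ArrayList;
import java.util.List;

public class GenericsDemo {
    public static void main(String[] args) {
        // Pre-Java-5 style: raw types force a cast at every read, and a
        // wrong element type only surfaces at runtime as a ClassCastException.
        List raw = new ArrayList();
        raw.add("hello");
        String s = (String) raw.get(0);

        // With generics, the compiler enforces the element type: no casts,
        // and the same mistake becomes a compile error instead.
        List<String> typed = new ArrayList<>();
        typed.add("hello");
        String t = typed.get(0);

        System.out.println(s.equals(t));
    }
}
```

The raw-type version even produces the exact unchecked warnings that Eclipse flags, which is the feedback being ignored.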
Monday, May 26, 2014
Rant About Online Forums: Help, Don't Criticize
Online forums are very helpful. They can also be very frustrating. Technology is filled with people that have strong opinions about technology and enjoy trolling. I find it frustrating when I look for help on a topic, but all I get is "my way is better, so I'm not going to help you."
I recently wrote about problems with ZFS when upgrading Ubuntu to 14.04. Technically, the problem was around the 3.12 kernel upgrade. When searching for answers, I came across lots of people that were anti-ZFS. If someone is anti-ZFS, then why would they answer a ZFS related question? Because they are trolls.
The most frequent answer I saw was something about how you shouldn't use ZFS. They would start a discussion about the cons of ZFS. These types of answers don't help. They don't provide anything constructive; they just frustrate. If I need help with a technology, I need people who are familiar with that technology. I don't need to justify WHY I used a technology before I get that help.
A similar thing happened when I was researching DTLS. DTLS is an encryption standard for UDP packets. If you need to send data to someone over UDP, then DTLS is the way that you would encrypt it....in theory. The problem is DTLS isn't very well supported. I wanted to write an Android app that uses DTLS, but the OpenSSL version that comes with Android doesn't have DTLS enabled. Some people were helpful and pointed out that I could compile OpenSSL myself. I was trying to avoid that, but at least they were trying to be helpful.
Others were not so helpful. They argued that DTLS is useless because UDP is useless. They argued that I should just use TCP. They argued that there is no reason to use UDP, even though they knew nothing about what I was trying to do. Once again, you end up having to defend yourself rather than getting help. In this particular case, I was trying to avoid the TCP-over-TCP meltdown problem.
I remember running into this many times in the past. Something people need to keep in mind is that the harder the problem, the more help someone is going to need. You shouldn't claim to be an expert and then argue about the "why" instead of helping.
Wednesday, May 14, 2014
Another hard disk dead
It's that time of year again. The seasons are changing. I have to mow my lawn again. The temperature is getting warmer. Hard disks are dying! I had expected that moving my server into my basement would reduce hard disk deaths. This death might be a little unique, though. The hard disk is brand new; I bought it three months ago. It is a Hitachi 4TB hard disk. I had never owned a Hitachi before, but it was 4TB and on sale for the same price as Seagate and Western Digital 3TB disks, so I decided to take a chance. At three months, I consider this hard disk a dud.
My server has seven hard disks attached to it, although one of them is an SSD (the operating system). Five are inside the case while two are in an external eSATA enclosure. The Hitachi is in the external enclosure. I know the enclosure is fine, since the other hard disk in it is working fine. This hard disk was my first chance to really expand my capacity. Most of my hard disks are between 1.5TB and 3TB. In my RAIDZ, I had previously combined two 1.5TB drives into a single 3TB device and two 2TB drives into a 4TB device. The new 4TB drive replaced the two 1.5TB drives, matching the capacity of the two 2TB drives. With the 4TB dead, I'm running in a DEGRADED state until a replacement can be shipped to me. I don't need the capacity yet. Right now, the biggest thing driving my data growth is backups of my Windows PC, but I barely use that machine anymore, so I was able to reduce the backup frequency.
Monday, May 12, 2014
iPhone 5 and Galaxy S5 - There is something about 5
I have seen lots of reports about the design chief for the Galaxy S5 stepping down. He isn't stepping down because the S5 was a bad phone. He is stepping down because the S5 was only an incremental upgrade over the S4. The S5 had some great upgrades, but none were earth-shattering. My buddy upgraded from an S4 to the S5, mostly because of the fingerprint scanner. He was greatly disappointed, though. From the outside, they are so similar he accidentally brought his S4 home from work, thinking it was his S5. My wife is going to upgrade from an S2 to the S5. She considered the S4 which would have been free, but we decided that the IP67 rating of the S5 and the Qi charging (semi-false advertising, but that is a rant for another day) was worth paying a little bit for the S5.
This whole situation reminded me of the iPhone 5 rollout last year. The S5 is an incremental upgrade similar to how the iPhone 5 was an incremental upgrade. I wrote about the iPhone 5 upgrade last year. I will dive deeper into the S5 next time, but for now, I want to compare the "upgrades".
Charging
The iPhone 5 released a new charging connector that was not compatible with older iPhones. While this new connector was better, it also meant you had to buy all new connectors. All the old wires were now useless.
The S5 continues to use standard Micro-USB. All your old cables still work. There is talk about the S5 supporting Qi wireless charging (another standard), but it turns out you have to buy an after-market replacement cover from Samsung. Unofficial Qi receivers are available, but they remove the IP67 water resistance. If you buy the cover (which we are going to do), then you have a more convenient way to charge your phone that follows a standard.
64bit (iPhone 5S)
The iPhone 5S is 64-bit. The S5 is still 32-bit. Right now that isn't very important, but it will become a big deal in the future; we just don't know when. And while 64-bit can improve performance, so can doubling your cores (the S5 has four).
Size
The iPhone 5 was about 20% lighter and thinner. That is nice, but it isn't what most people want. Many people like having bigger screens. Rumor has it Apple has finally gotten the memo and the iPhone 6 might have a larger screen. The S5, on the other hand... didn't change. It looks identical to the S4. At least the iPhone 5 changed something. My buddy wouldn't have made the same mistake with an iPhone 5.
Post-Launch Marketing
As a general rule of thumb, ignore all pre-launch marketing. That is why we are jumping right to post-launch. After the iPhone 5 launched, Apple continued to market it as if it was a huge improvement over the iPhone 4S. They wouldn't acknowledge that it was an incremental upgrade. Apple portrayed the iPhone 5 as the next big thing (hehe). Samsung on the other hand realized that the S5 was an incremental upgrade after launch. That is why the person in charge stepped down.
This is where I think both companies got it wrong. There shouldn't be anything wrong with an incremental upgrade. You shouldn't market it as something revolutionary, or fire someone over mediocrity. Most people upgrade every two years, so they tend to skip a model. When I asked people why they were upgrading to the iPhone 5, the most common response was that it was time to upgrade... since they were on either the iPhone 3 or the iPhone 4. Apple does have a lot of fanboys who will pay for the yearly upgrade, however, and Samsung has to compete with Apple, so it sticks with the yearly rollout as well. These companies should acknowledge this. Besides, what else can you really put into the phones? Apple has some catching up to do (screen size, wireless charging, IP67), but not by much.
The next big thing should come out next year.
Wednesday, May 7, 2014
Make a Backup Before Upgrading Ubuntu
I am a little obsessive about backing up my data. I use ZFS with RAID-Z to prevent data loss due to silent data corruption. Since ZFS can be problematic due to its licensing, I also have a USB hard disk that contains a copy of all my data. In the event that my house gets destroyed, all my data is backed up to an EC2 micro instance. My EC2 instance runs Owncloud with the data stored on a ZFS volume. Since the EC2 instance is outside of my home, I try to keep it up to date. During the last update, serious problems occurred.
After upgrading my EC2 instance to Ubuntu 14.04, my ZFS volume disappeared. Since there was a kernel upgrade, I assumed the zfs module just needed to be recompiled. Finding instructions on how to do this was pretty painful, but that is a rant for another day. The recompile of the zfs module failed, however. The zfs module wouldn't compile against the 3.12 version of the Linux kernel; I would have to patch my kernel to get it to work. I decided to try to roll back the kernel, but that just made things worse.
This is where I wished I had taken a backup. Not a backup of my data, but a backup of the Ubuntu OS itself. Now that 14.04 is out, I am having a hard time finding working Ubuntu AMIs for older versions. I fired up another EC2 instance using Gentoo. Gentoo is the distro I use at home, and I know ZFS works well with it. I used this instance to import the ZFS pool and copy the data to an ext4 volume. I will create a new Ubuntu 14.04 instance that uses the ext4 volume for Owncloud.
This experience showed that I did a few things right. First, my data was on a separate EBS volume. I did back up the MySQL database, though not frequently enough. The software I installed was pretty minimal and only minimally customized, which makes it a lot easier to create a new instance and get running.
I definitely have a few things that I know I need to do in the future. First, I need to take snapshots of the OS. Second, I need to keep the configuration backed up as well. That way, if I need to create a new instance, I can recover more easily. Finally, I should back up the MySQL data to the data volume.
Monday, May 5, 2014
Static methods for all....and (another) reason to not use them
When I switched to my current team, I was surprised to find an endless supply of static methods and variables. It seemed like the team really liked them. I can understand why: most of the team was younger and wasn't really used to design patterns and object orientation. For them, the only reason to extend a class is to make it convenient to call common methods. Being new to the team, I didn't try to force a paradigm shift. I brought it up a handful of times, but everyone seemed to like their static methods.
Recently, however, I was tasked with fixing a recurring issue. Inside our Android app, some of our static variables were being reset to null. It seemed to happen when the app crashed, so we put in code to reinitialize the static variables in the Activity that gets called when the app crashes. We were still getting sporadic null values.
Then a major problem occurred. We started using a 3rd party Android library that was returning null values. I disassembled the code, and the developer had used static variables to store some values. Sound familiar? At first it happened when the app crashed, but then we noticed it happening when we ran out of memory. The screenshot library has a memory leak....ugh! We plugged the memory leak and the nulls in the 3rd party library went away. The incident got me thinking.
I started doing some research on the topic. I found lots of places talking about Android memory management. There seemed to be a debate over where to put long-lived objects. Some (including Google) recommended using a static variable to create a singleton. Others suggested keeping a reference to it from the Application class. I then found more details on what we think is going on.
First, a few things that you might not realize. Java classes themselves are loaded into memory. If nobody is using a Java class, then that class can be unloaded from memory. If someone decides to use that class again, it can be reloaded into memory. This sort of thing can happen when you run out of memory.
Apparently, one of the differences between the JVM and Dalvik is the definition of a "class being used", i.e. when it is OK to unload a class. It looks like the JVM looks for references to the class itself, while Dalvik looks for references to an instance of the class. It turns out that is a very important distinction. In the above examples, the entire class was static: all the methods and variables were static, and there was no instance of the class. There wasn't a need for one, since everything was static. This works in the JVM world, but it doesn't work in Dalvik. When you are low on heap space, Dalvik will unload the static class, since by its definition nobody is using the class. Then someone calls a static method, which reloads the class and reinitializes the static variables, usually to null.
I made the decision that we shouldn't use completely static classes any more. While it is generally bad practice to use static-only classes, I now had a concrete technical reason to stop using them. I rewrote the code to use instances of the classes that are referenced from the Application class. Anyone who wants to call them needs a Context. Luckily, almost all of our code already had a reference to a Context. We haven't had an issue since I did the refactoring.
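The refactor can be sketched roughly like this. The class and field names here are illustrative stand-ins, not our actual code, and the Android-specific pieces are shown as comments since they need the SDK to compile:

```java
// BEFORE (the pattern described above): everything static, never instantiated.
// Under Dalvik's "no instance means unused" rule, this class can be unloaded
// under memory pressure, and authToken silently comes back as null.
class StaticSessionHolder {
    private static String authToken;

    static void setAuthToken(String token) { authToken = token; }
    static String getAuthToken() { return authToken; }
}

// AFTER: a plain instance whose lifetime is tied to the Application object,
// so it lives exactly as long as the process does.
public class SessionHolder {
    private String authToken;

    public void setAuthToken(String token) { this.authToken = token; }
    public String getAuthToken() { return authToken; }

    public static void main(String[] args) {
        SessionHolder holder = new SessionHolder();
        holder.setAuthToken("token-123");
        System.out.println("instance-held token: " + holder.getAuthToken());
    }
}

// In the app itself (an Application subclass registered in AndroidManifest.xml):
//
// public class MyApp extends android.app.Application {
//     private final SessionHolder session = new SessionHolder();
//     public SessionHolder getSession() { return session; }
// }
//
// Any caller holding a Context can then reach it:
//     ((MyApp) context.getApplicationContext()).getSession().getAuthToken();
```

Because the Application object holds a live reference to the instance, Dalvik's "is anyone using this class?" check now succeeds, and the fields survive low-memory situations that previously wiped the static versions.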
Sunday, April 20, 2014
XP is close to dead (but why is it still alive?)
Now that Windows XP is officially out of support, various websites are reporting the current Windows XP share. I have decided to share my stats. Currently, 5.7% of the Windows visitors to my blog run Windows XP. That is not a tiny amount; it is more than my Windows Vista readership (3.16%). This raises the question: why is it so high?
1) Timing
Windows XP was released in 2001. This was around the time that non-technical people were getting computers. Services like @Home were providing broadband internet for the masses. This means the operating system that those masses of people were running was Windows XP. Just about every computer you purchased at the time was running Windows XP. For many of the current internet users, Windows XP was their first operating system.
2) Backwards compatibility
This is something that affects business users more than consumers. In the workplace, your computers often have specialty programs that MUST work. This means upgrading has a cost that is much larger than the licensing cost for the new version of Windows. There is a massive testing effort: every native program must be tested by QA engineers before a rollout. To make things worse, many programs need to be heavily modified to work in newer versions of Windows. That is not always cheap, or even possible.
3) People dislike change
This is probably the biggest reason. Because XP was the first version that most people used, people know how to use it. They know the ins and outs of the OS. They know how to update drivers, open Task Manager, browse the network and change the desktop wallpaper. Users have spent years getting used to these administrative interfaces. Microsoft has a habit of completely rewriting all of those interfaces for every major Windows upgrade. Users don't want to relearn how to do all of those tasks.
Tuesday, April 15, 2014
The Screenshot Saga: Episode 2 - Background All the Things!
We got the fix pack, and we did a quick regression test. Everything seemed like it still worked. I told my project manager that it would take me some time to integrate the server flush fix. You see, the method we call to flush to the server already worked in one way....in the foreground. Any professional library developer would know that you need to maintain backwards compatibility. Therefore, I assumed that they would provide a new method for the background version. I was wrong. They changed the existing method. Our issue (the freeze after taking a screenshot) was gone. I checked everything in, let Jenkins build the APK and told our tester that it was ready to test.
We are near the end of our release cycle, so it took a few days for the tester to actually test the screenshot functionality. I was in for a nasty surprise, though. The tester reported that the screenshot functionality wasn't working! I fired up Logcat to figure out what was going wrong. The "fix" pack did a lot more than just fix our one issue. While it fixed a bunch of other issues that other clients had complained about, it also reduced the logging to Logcat. After 2 days of looking into it, I finally disassembled their Jar file. I used the debugger to step line by line through their code. That is when I discovered the first problem.
The very first thing the library does is send the killswitch request to see if the library should be enabled. That was done in the foreground, delaying the start of the application. One of this company's clients complained, so they put the killswitch request in the background. The problem is the library initialization takes two steps: start() and enable(). start() fires off the killswitch request and enable() uses the killswitch response to determine if the library should start up. Since start() now runs in the background (and returns immediately), enable() has a high likelihood of failing. If the server is running fast that day, then enable() will work. That is why my quick regression worked when I was first given the fix pack. While looking at the source code for enable(), I noticed that it checks an internal boolean to see if enable() already ran successfully. This means I could call enable() right before I take the screenshot. It is a hack, but at least I'm not calling undocumented API calls, like I did in my previous hack.
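The workaround can be sketched like this. start(), enable() and takeScreenShot() are the library's actual calls, but ScreenshotLib here is a toy stand-in I wrote to mimic the behavior we observed, not the vendor's code:

```java
// Toy model of the library's two-step init and the re-enable hack.
class ScreenshotLib {
    private volatile boolean killswitchResponseReceived = false;
    private boolean enabled = false;

    // start() fires the killswitch request in the background and returns
    // immediately; the response arrives later via this callback.
    void start() { /* kicks off the async HTTP GET */ }
    void onKillswitchResponse() { killswitchResponseReceived = true; }

    // enable() fails if the response isn't back yet, but it is a no-op once
    // it has succeeded -- which is what makes re-calling it safe.
    boolean enable() {
        if (enabled) return true;
        if (!killswitchResponseReceived) return false;  // lost the race
        enabled = true;
        return true;
    }

    boolean takeScreenShot() { return enabled; }
}

public class EnableHack {
    // The hack: re-attempt enable() right before every screenshot, so a
    // killswitch response that arrived late still gets picked up.
    static boolean screenshot(ScreenshotLib lib) {
        lib.enable();
        return lib.takeScreenShot();
    }

    public static void main(String[] args) {
        ScreenshotLib lib = new ScreenshotLib();
        lib.start();
        System.out.println("before response: " + screenshot(lib));
        lib.onKillswitchResponse();
        System.out.println("after response:  " + screenshot(lib));
    }
}
```

The hack only works because of that internal "already enabled" boolean; without it, re-calling enable() would re-run the whole startup handshake on every screenshot.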
The next issue was that not all the screenshots were being sent to the server. This one drove me nuts. It took me a while before I decided to look at the source code for takeScreenShot(). I was appalled. What used to be a foreground action was now in the background! You see, another client complained about something being slow. This time it was the screenshot code. The library developers did what they always do: they shoved it into the background. The problem was that I call the flush code right after taking a screenshot. Yay, race condition! There was no guarantee that the screenshot had finished being taken before we invoked the manual flush. Once again, this was something that would work part of the time, which is why it passed our quick regression.
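A plain-Java toy of the race (my reconstruction of the behavior, not the vendor's code): takeScreenShot() just queues the capture on a background thread, so a manual flush issued immediately afterwards can run before the screenshot exists.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FlushRace {
    // Returns {screenshots flushed immediately, screenshots flushed after the
    // background capture finished}. Their sum is always 1, but the first
    // number is almost always 0 -- the flush races the capture and loses.
    static int[] demo() {
        ExecutorService background = Executors.newSingleThreadExecutor();
        AtomicInteger pending = new AtomicInteger();

        // takeScreenShot(): the capture now happens later, on another thread.
        background.submit(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            pending.incrementAndGet();
        });

        int flushedImmediately = pending.getAndSet(0);  // manual flush right after the call

        background.shutdown();
        try { background.awaitTermination(1, TimeUnit.SECONDS); }
        catch (InterruptedException ignored) {}
        int flushedLater = pending.getAndSet(0);        // flush once the capture is done

        return new int[] { flushedImmediately, flushedLater };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("flushed immediately: " + r[0] + ", flushed later: " + r[1]);
    }
}
```

This is exactly why the bug only showed up part of the time: when the capture happened to finish quickly, the flush found the screenshot waiting and everything looked fine.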
It is obvious that the developers that own this library don't know anything about multithreading. You can't just throw everything in the background as soon as someone complains about code running in the foreground. It requires thought and planning. There are ramifications.
The next thing that irritates me is the fact that some features are barely supported. If a company provides half-assed support for a feature, then say so up front. Don't let us get invested in a feature that is just going to cause us more problems. In this case, it was the manual flush. My company wants to guarantee that the server gets the screenshot. They don't want to gamble on the possibility that the user kept our app open for another 60 seconds. If your app/library implements a feature in a shitty way, then you really don't support that feature. Let us know that up front and move on.
Sunday, April 13, 2014
The Screenshot Saga: Episode 1 - Customer Support Attacks
I recently had the (dis)pleasure of adding a 3rd party library to my company's Android app. While this library had a few features, we only needed two: the ability to take screenshots (of our app) and the ability to turn that functionality off in the event of a catastrophic failure. The idea was to take pictures of certain transactions in case one of our customers wanted to dispute a transaction (I said $100, not $1000). The developers had exactly no say in picking the library. We were not even allowed to implement it ourselves. We had to use this library.
The company that wrote this library owes me a nice bottle of single malt scotch. Customer service was a nightmare. For starters, the killswitch didn't work. It eventually came out that this was my company's fault, but it took some....extreme....measures to figure that out. The library fires off an HTTP GET and looks for a response. The developer who wrote the JSP used the JSP Designer view, not the text editor view. Therefore, the JSP didn't return the needed 0 or 1. It returned <html><body>1</body></html>. A little bit of logging would have helped on that one.
The next issue was the fact that we couldn't take a screenshot of the entire scrollable area of the page. It didn't matter what View you passed in. The library would get the root View of the page and take a screenshot of that. The company kept giving us some blanket statement about how only our "developers know when the entire view is visible". I was not happy about their support staff's veiled insult towards me. I eventually disassembled their Jar file. I tracked down the one line where they inadvertently took the screenshot of the root view, not the passed in view. I got so frustrated with their support staff, I actually pulled up their source code on a WebEx!! They still didn't believe me. That issue was fixed in a new release of the library.
The next issue was sending the screenshot to the server. Taking the screenshot doesn't automatically send it to the server. You have to either wait for the flush interval or use a manual flush. My company opted to use the manual flush while most other clients used the interval. It turns out that so few people actually use the manual flush that nobody noticed it runs in the "foreground". There was a noticeable freeze in the app. I disassembled their code, and it turns out they call new FlushTask().execute().get(); For those of you not familiar with Android development, that means the flush is run in the background, but we wait for the flush to finish....essentially negating the fact that the flush is running in the background. My first instinct was to run it in my own AsyncTask, but for some reason FlushTask actually used a thread-local variable, causing it not to work when invoked from a background thread. Why would they do this!!! Luckily, FlushTask was public, so I could invoke it directly. Technical support was no help. First, they told me to just use my own AsyncTask. Can you believe we pay them money for this? Finally, they told us that another company had complained about the same thing and that their next fix pack would fix the issue.
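In plain-Java terms (AsyncTask is Android-only, so a Future stands in for it here), execute().get() hands the work to another thread and then immediately blocks the caller until it finishes, which is why the UI froze exactly as if the flush had run in the foreground:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingFlushDemo {
    // Stand-in for the library's flush work; the real FlushTask is theirs.
    static String flush() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) {}
        return "flushed";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();

        // The equivalent of new FlushTask().execute().get(): submit to a
        // worker thread, then block the calling thread waiting on the result.
        String result = pool.submit(BlockingFlushDemo::flush).get();

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The calling thread sat idle for the full duration of the flush.
        System.out.println(result + ", caller blocked for ~" + elapsedMs + "ms");
        pool.shutdown();
    }
}
```

On Android, the "calling thread" is the UI thread, so that blocked interval is a visible freeze. The only thing the background thread buys you here is overhead.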
Our saga continues. We got the next fix pack and it was a mess. Tune in next time!
Tuesday, March 25, 2014
The Joys of Bulk Organize Imports
I discovered something accidentally while making changes to a file in Eclipse. I habitually hit the Ctrl+Shift+O shortcut to organize the imports at the top of a Java file. I didn't realize I had a folder in the Project Explorer window selected instead of the editor window. When I performed the shortcut, every single Java file had its imports organized! I quickly sent out an email to the developers instituting a new policy: I am going to organize all the imports in all of our Java files once a week.
Wednesday, March 19, 2014
APK Export Fail
My company was sending an update to the Amazon app store when we got a rejection notice. The notice claimed that whenever you started our app, it would immediately crash. It didn't give a stacktrace; just the list of devices they tested it on. This was weird to us since the code was extensively tested. I grabbed the APK, side-loaded it onto my phone and it crashed! Unfortunately I couldn't get a stacktrace since this was a release build.
It took us a while to debug the problem. At first, I decompiled the APK, but the decompiler didn't work; most of our Java files were missing. I just thought it was a problem with the decompiler. Eventually we decided to change the AndroidManifest.xml file in the APK to make it debuggable. The first surprise was that the AndroidManifest.xml was binary, not XML. The easy solution was to change the source AndroidManifest.xml, re-export, then take the binary AndroidManifest.xml from the newly exported APK and drop it into the broken APK. We re-signed, but Android refused to install the APK. This was a learning process for us. We learned that you can't just re-sign an APK. You have to delete everything in the META-INF/ folder. That "un-signs" the APK file. We signed it and were able to install the APK. It crashed, and we got a stacktrace: NoClassDefFoundError. It turns out that the "problem" with the decompiler was actually a problem with the export process. The export had silently failed.
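The un-sign step is easy to reproduce, since an APK is just a zip file. Here is a sketch in plain Java, with a throwaway zip standing in for the real APK; the jarsigner re-sign step is shown as a comment since it needs your release keystore (the file names and key alias are made up):

```java
// Sketch: "un-signing" an APK by stripping META-INF/* (certs + manifest hashes).
// Afterwards you would re-sign it, e.g.:
//   jarsigner -keystore release.keystore app-unsigned.apk release_key_alias
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnsignApk {
    static void writeEntry(ZipOutputStream zos, String name, String body) throws IOException {
        zos.putNextEntry(new ZipEntry(name));
        zos.write(body.getBytes("UTF-8"));
        zos.closeEntry();
    }

    // Copy every entry except the signature files under META-INF/.
    static void unsign(Path in, Path out) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(in));
             ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(out))) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                if (e.getName().startsWith("META-INF/")) continue; // drop certs + hashes
                zos.putNextEntry(new ZipEntry(e.getName()));
                zis.transferTo(zos);
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path signed = Paths.get("app-signed.apk"), unsigned = Paths.get("app-unsigned.apk");
        // Build a dummy "signed APK": signature files plus some real content.
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(signed))) {
            writeEntry(zos, "META-INF/MANIFEST.MF", "per-file SHA-1 hashes live here");
            writeEntry(zos, "META-INF/CERT.SF", "signature file");
            writeEntry(zos, "META-INF/CERT.RSA", "certificate");
            writeEntry(zos, "classes.dex", "real app content");
        }
        unsign(signed, unsigned);
        try (ZipFile zf = new ZipFile(unsigned.toFile())) {
            zf.stream().forEach(e -> System.out.println(e.getName()));
        }
    }
}
```

The same trick works on any jar, which is why re-signing a modified jar also requires stripping META-INF first.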
For those of you who are not familiar with the Amazon app store submission process, you have to upload an unsigned APK. Amazon adds a jar file and modifies every single Activity, adding hooks into their DRM system. They do this regardless of whether you use their DRM or not. After they add their DRM, you download the new unsigned APK, sign it with your key, then upload the new signed APK back to the Amazon store. This weird process exists because Amazon wants to inject their DRM and they can't do that to a signed APK, since that would defeat the purpose of signing.
Let's examine that process for a second. Once you export an unsigned APK, you cannot test it; unsigned APKs can't be installed onto phones. Once you sign it, in theory you can install it, but the Amazon DRM kicks in, preventing us from testing it. This means we sent Amazon an APK where we had tested the code, but not the export process.
After some Googling, I discovered that the export process sometimes fails like this. It is usually associated with a corrupt bin/ folder in your Eclipse workspace. Cleaning your bin/ folder and rebuilding fixes the issue. This was the first attempt to package the APK from Windows. As it turns out, the bin/ folder corruption thing happens more frequently in Windows.
To prevent future problems, I recommended that we export a signed APK as the first step. This lets us test the app on a device. We can unsign the APK before uploading it the first time to the Amazon store. This gives us an opportunity to test the export process before uploading.
While it was a horrible experience, we learned a lot. Although I had signed jar files before, I never actually knew what changed in the jar file. Now we know there are two CERT files in the META-INF folder, as well as a bunch of SHA-1 hashes in MANIFEST.MF. Deleting those files (well, the certs, and the hashes from the manifest) effectively unsigns the jar. The coworker I was working with didn't realize that APK files are just zip files, so that was a good lesson for her. We also figured out how to enable debugging on a released APK file. It was a really good learning experience.
Monday, March 17, 2014
Amazon Update Caused Our App to Crash (But It Was Our Fault)
My company recently decided that it wanted to support Kindles in the Amazon app store. The business didn't give us a lot of time; they set a pretty aggressive deadline. We proposed waiting for our next release and uploading to iTunes, Google Play and Amazon all at the same time. They rejected the idea, instead telling us to modify our previous release (which was already in the iTunes and Google Play stores) and upload it to Amazon a month before our next release was supposed to launch. We did a decent amount of testing on various 2nd and 3rd Gen Kindles and found a few issues. Most of them were less about Kindle and more about the fact that we never really developed for Android tablets. We fixed the low-hanging fruit. Then we saw a rather large commit by another team that concerned us.
Although my team owns our company app, there are two teams that develop code for it. This separation is similar to the JeffBank example I have given in previous posts. One group owns one part of the business (let's say checking) while the other group owns another part of the business (let's say mortgages). The developers for the other business unit changed all of their WebViews to Amazon WebViews. This concerned us for multiple reasons. First, we wanted to maintain one codebase; we don't want to have an Amazon branch and a Google Play branch. Also, we have historically had lots of issues with WebViews. They require a lot of testing. Changing all of your WebViews this late in the game was dangerous, to say the least. The developers assured us that the Amazon WebView is an abstraction layer; it delegates back to the Android WebView if the real Amazon WebView implementation doesn't exist. They didn't have the same fear around WebViews that we did.
We launched on the Amazon store and all was fine.
Fast forward two weeks. One of my friends (we used to work on the same team) called me up. He was a pilot tester for the Kindle rollout; he is a developer, but he wasn't involved with the rollout itself. When I had him test earlier, he found lots of minor issues and brought them to our attention. He called because the app had stopped working on his Kindle. I found it weird that it was working and then all of a sudden it stopped. He was very busy, so we set up some time a week and a half later to diagnose the problem. That is when my team started getting phone calls.
We were getting negative feedback in the Amazon store. Finally, Amazon sent us an email claiming our app was crashing, and they provided a stacktrace. The stacktrace pointed to the WebView code that the other team wrote. My team inspected the code, comparing it to the documentation. According to the documentation, the code should NEVER work. We should always have been crashing. We brought this up to the other team and they confirmed that the code was wrong, that it was already fixed in the next release (due to unrelated refactoring), and... meh.
However, we retested on all of our Kindles and couldn't reproduce the problem. Management started dropping the hammer. Unhappy emails were being sent. I called up my buddy and he offered to sacrifice a lunch so we could work things out. My coworker and I went to my buddy's building and installed a debuggable version of the app that was sent to the Amazon store. His crash was the same as the one reported by Amazon. We "fixed" the WebView code and it worked. We tested a build of our next release and it worked. We had verified the problem, but we didn't know why it happened all of a sudden.
Since our next release would fix it anyway and we were two weeks away from launching that release, our business decided that we shouldn't upload a fixed version to the Amazon store. Doing so would have bumped all of our version numbers, which would mess with our version checks. That would have required making changes to our next release, which was in super-lockdown.
My buddy let us borrow his Kindle. We looked at our logs and identified the last time he successfully logged in. We Googled his Amazon OS version number and discovered that the version was released a few days after his last successful logon. It turns out he got a software update, and that update broke our app. This is why the failures seemed to snowball: as people upgraded to the latest Amazon OS, our app started to crash for them.
Now, the root cause was technically our fault: we initialized the Amazon WebView library incorrectly. The problem was still frustrating for multiple reasons. First, Amazon never gave us a heads up about the upgrade, so they never gave us an opportunity to test. Googling didn't even turn up an official announcement about the upgrade; we found out about it from a shady website that offered the Amazon OS upgrade as a sideloading download. The second irritating thing was the fact that we were using the Amazon WebView at all. I Googled around and couldn't find anyone else using this library. I have no idea why the other team decided to use it. There is no information on how to support it or on its quirks related to cookies. Finally, this was the other team's code. We spent a lot of time diagnosing the problem. It is a shame when other teams don't take ownership.
Thursday, March 13, 2014
Decompiling Failure
My company decided to roll out a product that lets us take screenshots of key pages in our mobile app. In theory, this gives us a better tool to diagnose problems our clients have. If a client calls up after having an issue, we can see what they saw. It gives us a better understanding of what happened without having to go through server logs. Developers tend to be fine looking at logs, but our Tier-1 support, the people who take the phone calls, might not want to. We were provided with a jar file, some example code and some documentation. This seemed like an easy project. Boy, were we wrong.
The first thing most people would expect to be a problem is bandwidth, but that wasn't an issue. Keep in mind this is only for a small set of key screens, so it shouldn't take a lot of bandwidth. The first issue was actually related to scrollable pages. Now, you'd think a library that takes screenshots for you would... work. You would be wrong. The problem lies in the fact that the tool only takes a screenshot of the visible part of the page. We put in a support ticket and got a response: we were doing it wrong! You see, we should have read the documentation. What did we do wrong? We had no clue. The response just said to read the documentation! We responded back with a politically friendly "WTF" and then they dropped this helpful nugget: if you want to take a screenshot of the entire page, you have to attach a scroll listener and wait for the user to scroll there. Now you are taking lots of screenshots with no guarantee that you will ever see the entire page. We quickly rejected that, so they responded with "read the documentation". WTF!
One of our developers researched taking our own screenshots of the scrollable view, and she figured it out. We suggested that they add an API call to let us send our own screenshot to their server. We liked this idea since it is actually how the iOS library from the same company works. They rejected the request, citing the fact that the Android library should already do what we need it to do. After 7 back-and-forths, they wrote a response saying that it is the responsibility of our developer to take the screenshot at the right place; they just provide the screenshot tool. Our product owner (the person in our company who talks to the vendor on a regular basis) forwarded that email to a lot of managers, saying it was my fault (they didn't drop my name or CC me in the email, but it was obvious they meant me). That is when I broke out the decompiler. I decompiled the jar file into Java source code, then followed the call stack we were using to take the screenshot. That is when I found the issue. The API has us pass in the View that we want a screenshot of. The screenshot code, though, actually takes a screenshot of View.getRootView(). The support team for this vendor implied I was the idiot, so I had to decompile their code to prove they had no clue what they were talking about.
Our next major issue was with the "kill switch". The kill switch allows us (in theory) to remotely disable the screenshot library in the event of an unforeseen issue. If you have ever had an issue with an app in the store, then you realize the importance of being able to disable functionality without resubmitting to the store. The documentation states that you create a URL that returns a single character, a 1 or a 0. If the webpage returns a 1, the library is enabled; if it returns a 0, the library is disabled. We enabled the kill switch configuration and the library stopped working. It didn't matter whether it was a 1 or a 0; the library always disabled itself, on both Android and iOS. Then we got a call from our product owner saying we were doing it wrong. They claimed the documentation said nothing about a 1 or a 0; we supposedly needed to send an HTTP status code of 200 or 500 to enable or disable. We started poring over the documentation. What we determined was that the Android documentation said to use 1 or 0, while the iOS documentation said to use the HTTP status code in one place and the 1 or 0 in another. The iOS documentation literally contradicted itself.
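To show how the two contradictory conventions could even be reconciled, here is a hypothetical client-side check (ours, not the vendor's; the method name is made up) that reads both signals and fails safe. For a kill switch you want the disabled state to win any ambiguity, so this sketch only enables the library when both conventions agree:

```java
public class KillSwitch {
    // Convention 1 (Android docs): page body is a single character,
    // "1" = enabled, "0" = disabled.
    // Convention 2 (one part of the iOS docs): HTTP 200 = enabled,
    // HTTP 500 = disabled.
    // Fail safe: enable only when BOTH conventions say enabled.
    static boolean isEnabled(int httpStatus, String body) {
        boolean statusSaysEnabled = (httpStatus == 200);
        boolean bodySaysEnabled = body != null && "1".equals(body.trim());
        return statusSaysEnabled && bodySaysEnabled;
    }
}
```

With a check like this, a server that follows either documented convention can still disable the library, and malformed responses default to disabled rather than crashing in production.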
We ended up ripping the entire product out of the current release, completely wasting my time. We decided to revisit it for our next release.
Friday, February 21, 2014
How bad is the HP Slate 7? Real bad.
Seven months ago, I wrote a post about buying an HP Slate 7 and returning it right away. I didn't really use the tablet heavily before returning it because it wouldn't charge on my standard USB charger. That was enough to declare the tablet "crap". It seems that other people have reached the same conclusion. First, a little more background.
My company logs the device model for everyone that logs on. This gives us the ability to generate statistics on what devices are being used. We wanted to gather the data based on our clients, not based on the number of registered phones or whatever metric some other company uses to determine market share. When we gather the device models, we know it pertains to our user base. This data comes in handy for multiple reasons but the one reason that came up today was test hardware purchasing. We buy phones and tablets to test our software on and we want to buy the hardware that is most common among our user base.
When generating this report, I noticed something weird. There was a device model called "cm_tenderloin". It was something I had seen before, because one of our clients with an issue had this device. "cm_tenderloin" is the Cyanogenmod device name for the HP Touchpad. The HP Touchpad doesn't run Android... it runs WebOS. On its own, this is not weird. People install aftermarket firmware on lots of devices, so it wouldn't be weird to see "some" hits with this device name. The weird thing was that "cm_tenderloin" showed almost 5 times more logons (about 4.8x over a month-and-a-half window) than the HP Slate 7!
This tells a rather interesting story. It seems that people would rather buy an old HP Touchpad and install Cyanogenmod on it than buy an HP Slate 7. The Cyanogenmod install process must be easier than the HP Slate 7 setup process (which was difficult for me, given the charger issues). This is even more incredible because Amazon (as of 2/21/2014) lists the HP Slate 7 at $200 vs $270 for the HP Touchpad (16GB models for both). I know a big difference is size: the HP Slate 7 has a 7in screen while the HP Touchpad has a 10in screen. Our clients still feel the Touchpad is the better buy.
Monday, February 10, 2014
Stealing Syntax Trees
I had to write a parser for a language, so I decided to break out my Programming Language Concepts textbook from my 3rd year of college. In that class we learned not only how to compare and contrast programming languages, but also how to write parsers for them. Since this was a direct application of the course material, I brought my textbook to work and gave myself a refresher on the theory behind parsers.
With this refreshed knowledge, I started by writing down the Backus-Naur Form of the language. I created a Lexical Analyzer for the language, based on a Finite State Machine, that returns Token classes I created. I created classes for an Abstract Syntax Tree that represent the expressions in the BNF. Finally, I created a parser that converts the token stream from the Lexical Analyzer into the Abstract Syntax Tree.
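As a sketch of those pieces working together, here is a toy recursive-descent parser for arithmetic expressions. The grammar and names are illustrative, not the actual work project; the only detail carried over is the AST naming convention for the tree classes:

```java
public class Parser {
    // BNF: expr ::= term (('+'|'-') term)* ; term ::= factor (('*'|'/') factor)* ;
    //      factor ::= NUMBER | '(' expr ')'
    private final String src;
    private int pos;

    Parser(String src) { this.src = src.replaceAll("\\s", ""); }

    // AST node: every expression in the tree can evaluate itself.
    interface ASTExpression { int eval(); }

    ASTExpression parse() {
        ASTExpression e = expr();
        if (pos < src.length()) throw new IllegalArgumentException("trailing input at " + pos);
        return e;
    }

    private ASTExpression expr() {
        ASTExpression left = term();
        while (peek('+') || peek('-')) {
            char op = src.charAt(pos++);
            ASTExpression l = left, r = term();
            left = () -> op == '+' ? l.eval() + r.eval() : l.eval() - r.eval();
        }
        return left;
    }

    private ASTExpression term() {
        ASTExpression left = factor();
        while (peek('*') || peek('/')) {
            char op = src.charAt(pos++);
            ASTExpression l = left, r = factor();
            left = () -> op == '*' ? l.eval() * r.eval() : l.eval() / r.eval();
        }
        return left;
    }

    private ASTExpression factor() {
        if (peek('(')) {
            pos++;                       // consume '('
            ASTExpression inner = expr();
            if (!peek(')')) throw new IllegalArgumentException("missing ) at " + pos);
            pos++;                       // consume ')'
            return inner;
        }
        int start = pos;                 // lex a NUMBER token
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        int value = Integer.parseInt(src.substring(start, pos));
        return () -> value;
    }

    private boolean peek(char c) { return pos < src.length() && src.charAt(pos) == c; }
}
```

In the real project the lexer was a separate Finite State Machine producing Token objects; here, lexing is folded into factor() to keep the sketch short.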
So far, everything I did to write this parser was quite literally textbook. I created a textbook parser. Imagine my surprise when someone accused me of stealing the code!
For starters, the developer couldn't believe I was able to write a full-featured parser for the language so fast. It was just not imaginable. Second, he didn't know what an Abstract Syntax Tree was. You see, I used a naming convention for my tree classes: I prefixed all of them with AST, for Abstract Syntax Tree. Since this developer had never heard of an AST before, he did a quick Google search for ASTExpression (one of the classes I wrote). He saw some open source projects that used the same class name and immediately assumed I must have stolen the open source code. Never mind the fact that the contents of the classes were different. It would be impossible for two developers to use the same class name! Therefore, I am a thief!
For me, my code was an example of software engineering done right. I used documented practices that many software engineers use, which means anyone familiar with those practices can support my code. Unfortunately, the legacy of this code is an accusation of theft.
Friday, February 7, 2014
Time to replace the hard disks
One of my hard disks is starting to have issues. I know this because I use ZFS: I run a weekly scrub and email myself the results. Every once in a while I see read failures coming from the USB hard disk. Recently, however, I started seeing checksum errors. If I were running a standard RAID 5, I would be suffering from "bit rot".
Jim Salter over at Ars Technica has a great article about bit rot and ZFS. At a high level, bit rot is when you attempt to write a 1 to a hard disk but the hard disk actually stores a 0. It could be a single occurrence, or it could be a sign of a dying hard disk. In my case, it might be a bad cable or enclosure. I have needed a new hard disk enclosure for a while, so I decided to buy a new enclosure and a new hard disk. I am going with a 4TB hard disk since I need more disk space.
What is important is that I didn't lose any data, for two reasons: ZFS (RAID-Z) checksums the data to identify where the bit rot took place, and ZFS tells me about faulty hardware before it becomes a catastrophe.
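The detection half of that is just a checksum comparison. This sketch is not ZFS's actual implementation (ZFS stores a Fletcher or SHA-256 checksum in the parent block pointer rather than next to the data), but it shows why a single flipped bit cannot slip through unnoticed:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class BitRot {
    // Checksum a data block at write time; the filesystem stores this
    // alongside (or, in ZFS, above) the block.
    static byte[] checksum(byte[] block) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(block);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-256 is always available on the JVM
        }
    }

    // At read time, recompute and compare. A mismatch means the disk
    // returned something other than what was written: bit rot.
    static boolean isCorrupt(byte[] block, byte[] storedChecksum) {
        return !Arrays.equals(checksum(block), storedChecksum);
    }
}
```

With redundancy (RAID-Z), the filesystem can then rebuild the corrupt block from the good copies instead of silently serving bad data.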
Monday, February 3, 2014
Examples of Progress
Mainframes started using 64bit processors in 1961. SGI Irix graphics workstations started using 64bit processors in 1991. The Nintendo 64 launched in 1996 with a 64bit processor. AMD released a 64bit processor for servers and desktops in 2003. So when Apple launched the iPhone 5S with a 64bit processor in 2013, that was progress, not innovation. There was no leap of the imagination required to see that phones would eventually get 64bit processors; the only question was when. With most technologies, you can't release too early or too late. You have to time the release correctly. Eventually, all phones will run 64bit processors. Likewise, having the first 64bit watch or microwave is not innovation.
Once mini-computer technology took off with the launch of various ARM devices, it is not innovation to stick an ARM processor in anything you can conceive of. Now that tiny and cheap SoC systems exist, people want "smart" everything. Someone is going to make a smart watch. Someone else is going to make a smart picture frame. Someone else is going to make a smart camera. Someone is going to make smart thermostats and smart smoke detectors. SoCs will be everywhere. It is not innovation every time someone puts an SoC into something that didn't have one before. That is progress. The innovation was the design of the first SoCs that were used in smartphones like the BlackBerry (did you think I was going to say the iPhone?).
A pattern should emerge in these examples. Having an idea, like SoC, is an innovation. Different arrangements and designs of the SoCs are innovation. Slapping an SoC in every inanimate object you can find is not an innovation; it is progress. Designing a unique user interface for the SoC is innovation. Innovation is the seed of progress. Once someone comes out with a great innovation, that spurs off progress to make that initial innovation better. Both the innovation and the progress are important. Someone needs to make progress. It just doesn't take as much effort and research to make progress. Progress should not be patentable and you shouldn't be sued for making progress. That is one of the major problems with the US Patent system.
Thursday, January 30, 2014
Competing with Innovation
Microsoft hasn't really innovated anything in years. Apple spent a good chunk of the previous decade innovating itself to the top, but their innovating days are behind them. Google doesn't innovate as much as they used to, but they still innovate. There seems to be a high correlation between how much a company innovates in a market, and their market position. Google's Android is at the top of the mobile operating system landscape. Microsoft is near the bottom.
When competitors emerge, companies often have two ways of competing. The first is to try to hold back your competitor. This usually involves patent wars and the Eastern District of Texas. The other is to actually compete by being better than your competition. Microsoft and Apple have chosen the former. The goal of holding back your competitors is to prevent them from reaching your level of technology so that you don't actually have to do anything. It is much easier to do nothing than it is to compete. If people are already buying your technology, then there is no point in investing in creating new technology. You can just sit around collecting money. What you care about is your income, not the people providing it.
The thing about competition is that it creates better products for everyone. Apple creates a multitouch smartphone (not the first, but one that looks appealing to the masses). Google creates a rival operating system and tacks on more features. Apple should add those new features to its phone, but instead decides to lawyer up and file patent suit after patent suit. This does not help Apple's customers. This doesn't help Google's customers. This doesn't help any other company that wants to create a 3rd major mobile operating system. This is the very definition of anti-competitive.
Let's imagine a world where Apple decided to actually compete instead of whining like a little baby. When Google innovates new features, Apple could copy those features, and Apple customers would be even happier! Heaven forbid Apple actually innovate something new (and not claim logical next steps as innovation).
Tuesday, January 28, 2014
Requirements Annotations
While developing some HTML pages for one of my projects, I started to think about requirements. I was doing TDD with the HTML engine (thanks to my XSLT-based solution) and got to thinking about regression testing and requirements. I have never liked requirements docs, but I also see the need for them. Low level requirements docs tend to be tedious. They provide lots of detail, but in a way that makes them unusable. Developers tend to gloss over them. Although low level requirements are not written by technical people, they tend to dictate parts of the implementation, which can be problematic. Agile tries to tackle some of these issues to make the initial development better, but requirements tend to fall by the wayside...mostly because requirements docs are horrible.
I was also thinking about all the unit tests that I wrote. New developers on a project may not understand why a particular unit test exists. You can try to name the unit test better and comment your unit tests, but the big picture is often lost. You can add a comment that lists the name of the requirement being implemented, but requirements are often written in MS Word using a number-based outline that causes requirement numbers to change when you add a new requirement. I started to think of a better way of handling this.
First off, imagine having a requirements xml file that is coupled with the item being developed and tested. For example, you have a webpage that is being generated. In my system, that starts with MyPage.xsl. You have beans and logic classes that are coupled with it: MyPageBean.java and MyPageBeanFactory.java. You have a unit test for the page that mocks out the midtier and generates your HTML: TestMyPage.java. Now, add another file to the mix: MyPage.rqdx (requirements document xml). Inside of MyPage.rqdx, you specify all the requirements for MyPage.html. This should include the requirements for what should show up in the generated MyPage.html file, and the client-side interactions on that page. Each requirement in that xml file gets a unique name. You can even put metadata about the requirement, like when it was requested, when it was implemented, who requested it, maybe even cost-to-implement information. A requirement can link to another requirement. Sometimes requirement B is only needed because of requirement A. This happens when B is the low level requirement (MyPage should have this text as a header) for a high level requirement A (You should have a page called MyPage).
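A minimal sketch of what MyPage.rqdx could look like. Every element, attribute, and requirement name here is hypothetical; no schema exists for this idea yet:

```xml
<!-- Hypothetical MyPage.rqdx: all names are invented for illustration. -->
<requirements page="MyPage">
  <requirement name="page-exists" requestedBy="Client A" requested="2014-01-02">
    You should have a page called MyPage.
  </requirement>
  <!-- Low level requirement linked to its high level parent. -->
  <requirement name="header-text" parent="page-exists" implemented="2014-01-15">
    MyPage should show "My Page" as its header.
  </requirement>
</requirements>
```

The parent link captures the "B is only needed because of A" relationship, so removing a high level requirement flags its low level children for review as well.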
Now, imagine having Java annotations that you can use inside of MyPageBeanFactory.java and TestMyPage.java. You have an annotation @ReqImpl(reqs="req1,req2") that you use to mark methods/blocks that implement a particular requirement. You have an annotation @ReqTests(reqs="req1,req2") that you use to mark unit tests that test a particular requirement. Code, tests, and requirements are now linked together.
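A sketch of the @ReqImpl side of this in Java. The retention policy, the requirement ids, and the example factory method are all assumptions, not a real API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation linking a method to requirements in the .rqdx file.
@Retention(RetentionPolicy.RUNTIME) // runtime retention so tooling can read it via reflection
@Target(ElementType.METHOD)
@interface ReqImpl {
    String reqs(); // comma-separated requirement names from MyPage.rqdx
}

// Hypothetical factory from the example above, with one annotated method.
class MyPageBeanFactory {
    @ReqImpl(reqs = "page-exists,header-text")
    String buildHeader() {
        return "<h1>My Page</h1>";
    }
}
```

Because the annotation is retained at runtime, a tool can walk the codebase with reflection and map every method back to the requirements it implements.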
When you put this all together, a few things now become possible. First, it is much easier for new developers to find out why code was implemented in the first place. They can follow the annotations back to the requirements xml. Requirements are now part of your revision control system. You can see your requirements change over time, just like your code has. You can now translate a failed unit test to the requirement that is impacted. Tools can be written that calculate "requirement coverage" as opposed to code coverage. This is the percentage of requirements that have been tested. A big advantage is code retirement. Since you can search for all code that is related to a specific requirement, you can now search for all code that can be removed due to the removal of a specific requirement. With requirements in revision control, you can create a report of new requirements added per release of your software, just by diffing all the xml files for the current release with the previous release.
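A sketch of what a "requirement coverage" tool could look like. It assumes a runtime-retained @ReqTests annotation like the one described above; the class and requirement names are invented for illustration:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical annotation linking a unit test to requirements in the .rqdx file.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ReqTests {
    String reqs(); // comma-separated requirement names
}

// Example test class: covers "header-text" but not "page-exists".
class TestMyPage {
    @ReqTests(reqs = "header-text")
    void testHeaderText() {
        // ...would assert that the generated HTML contains the header text...
    }
}

class RequirementCoverage {
    /** Fraction of declared requirements exercised by at least one @ReqTests test. */
    static double coverage(Set<String> declaredReqs, Class<?>... testClasses) {
        Set<String> tested = new HashSet<>();
        for (Class<?> c : testClasses) {
            for (Method m : c.getDeclaredMethods()) {
                ReqTests ann = m.getAnnotation(ReqTests.class);
                if (ann != null) {
                    tested.addAll(Arrays.asList(ann.reqs().split(",")));
                }
            }
        }
        tested.retainAll(declaredReqs); // ignore tests tagged with unknown requirement names
        return declaredReqs.isEmpty() ? 1.0 : (double) tested.size() / declaredReqs.size();
    }
}
```

With two declared requirements and only one of them tested, the tool would report 50% requirement coverage, the analogue of code coverage that the paragraph above describes.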
Another change that could be interesting for Agile teams is adding un-implemented requirements. When a new requirement is identified, you can add that requirement to the xml file right away. It will sit there as an unimplemented requirement. You can write a tool that calculates the number of unimplemented requirements, just like the count of untested requirements. As developers implement requirements, you can create a "burn down" chart of implemented requirements. At the end of every sprint, you have the number of requirements added, removed, implemented, and tested.
Overall, this idea could help make a piece of software more maintainable. Tooling could allow the management of requirements to be much more useful. Developers would be able to read and implement requirements more easily, since they only see the requirements that are relevant. Project managers can see the progress of implementing requirements. Release managers could easily identify what new features are available in a release. Tying code to requirements makes the maintenance of that code much better.
As a whole, the core of this solution doesn't seem like it would be too hard to implement. The tooling would take the most time, but I think different organizations would want different tooling. I really wish I had more time to run with this.
Thursday, January 9, 2014
Android RatingBar not scalable
I am writing an Android app that uses the RatingBar widget. The problem is that the RatingBar is not scalable. What I mean by this is that the images used to render the stars are of a fixed size. To make the widget smaller, you use smaller images. To make the widget larger, you use larger images. I haven't found a good way to use one image and have Android scale that image to the size of the widget.
In my app, I have a ListView that contains a RatingBar in each list item. With limited real estate, I wanted the RatingBar to look smaller. I also had a blue background, so I made the stars yellow. While I felt the rating was important, it wasn't the most important piece of information in the list item. When you click on the list item, another activity comes up. Since that activity is dedicated to the item, I have a lot more screen real estate. In that screen, I used the bundled Android images. My wife complained. The RatingBar in the details screen didn't look as good as the one in the list view. She wanted them to look the same, just scaled differently. I fired up Gimp and created another set of star images for that size. Luckily, I only have two sizes to display. It would be nice if I could just have one set of images, though.
Monday, January 6, 2014
Late Requirements
In software engineering, there always seems to be a trend of late requirements. My wife is doing consulting work for a company that keeps adding requirements but won't push back the deadline. During my most recent company project, I was getting new requirements after the code lockdown date. Then, management started complaining about the large number of defects created after code lockdown. In the Agile world, late requirements aren't supposed to be a big deal, but when a requirement is added, another is supposed to be delayed. I don't see this happening often.
The problem isn't in the development. It isn't an "IT" problem. It is a client problem. Clients are the ones demanding these requirements. They are the clients and they are the ones signing the paychecks, but clients need to understand that time is finite. Software engineers can't stop time to work on requirements. As a profession, we need to put our foot down and put ownership of the problem back on the clients. If they want to add late requirements, then they need to agree to extend deadlines.
Friday, January 3, 2014
ZFS Performance Issues on Low Memory Computer (Amazon EC2)
I recently set up ownCloud on an Amazon EC2 instance. I chose to put the data onto a separate EBS volume. Although EBS supports taking snapshots of a volume, you really shouldn't snapshot a live file system that isn't aware that snapshots can be taken of it. For that reason, I decided to use a file system with snapshot capabilities built into it. I am a huge fan of ZFS, so I got ZFS installed onto my EC2 instance and was good to go.
After about a month of usage, I noticed a performance problem. I SSHed to the VM and noticed that a process called spl_kmem_cache was taking all of my CPU. After some googling, I discovered that the process was related to ZFS's RAM cache of the disk. The ZFS L1 cache is stored in RAM and uses an ARC variant to evict pages.
The problem is that the ZFS L1 cache does not work well in a low memory environment. ZFS was designed for servers, and servers usually have lots of RAM. EC2 micro instances barely have more than 580MB of RAM, though. After consuming most of the RAM, the ZFS L1 cache started thrashing constantly, causing the spl_kmem_cache process to use up all my CPU. No RAM and no CPU makes Homer something something. Go crazy? Don't mind if I do!
I read about various L1 cache tweaks that you can set in the /proc file system. None of those helped the situation. I almost gave up hope until I decided to look up disabling the L1 cache. By running the command zfs set primarycache=metadata <poolname>, I disabled L1 caching for data (metadata is still cached). After making the change, my VM came back to life.
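For reference, the property change and a check that it took effect look like this. The pool name tank is a placeholder; substitute your own:

```shell
# Stop caching file data in the ARC; metadata is still cached.
# "tank" is a placeholder pool name.
zfs set primarycache=metadata tank

# Verify the property took effect.
zfs get primarycache tank
```

The trade-off is that every data read now goes to the EBS volume, but on an instance this small that is still cheaper than constant cache thrashing.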