In the last couple of months, we looked at various memory errors that lead to application crashes and/or decreased performance, and discussed a number of tools that help detect these errors. For enterprise applications, 100 per cent availability remains the goal, and any application crash that causes downtime or decreased performance needs to be prevented at all costs.
Let us assume that we have attained the ultimate wisdom regarding programming, and that we have developed a software application that can be certified as having zero bugs. While we are selling this software to Company X, the firm wants an assurance that deploying our certifiably bug-free software in place of its current ancient, bug-ridden application will not cause any issues. Would you make that guarantee to Company X?
Well, the answer is that you cannot make that guarantee. All that you know is that your application is bug-free. But remember the important fact that your application is getting deployed in an existing software environment, where it has to interact with multiple other applications. It needs to be configured correctly, and adequate resources should be made available for its correct operation.
None of these factors is in your control as a software developer. Therefore, even if we can certify that your software is 100 per cent bug-free, it cannot be assumed that deploying it in an enterprise environment will be trouble-free.
In fact, a software engineering study found that the major root cause of service disruptions during software application upgrades in enterprise environments was bugs in the upgrade process itself. These include broken dependencies between software components, failures due to configuration file errors, etc. Software upgrade failures are therefore a major contributor to application crashes, and need to be studied and understood well in advance in order to avoid them.
In this month’s column, we will focus our attention on software upgrade failures, discussing the types of failures that occur during the upgrade process, and what kind of upgrade process can help reduce such occurrences. We will then briefly cover online software upgrades that ensure zero downtime for the end customer. In fact, it is not just user applications that can be upgraded online — even operating-system kernels can be patched and fixed for critical defects or security issues, without having to reboot the system in order to apply the upgrade.
One such mechanism is Ksplice, which applies online software updates to the Linux kernel; we will look at it in next month’s column.
Software upgrade failures
In the summer of 1996, America Online (AOL) was one of the major Internet service providers that advertised its high reliability, with claims to being immune to network outages. However, at 4 a.m. EDT (Eastern Daylight Time) on 7 Aug 1996, AOL systems went down for 19 hours. The outage affected 16 million customers, cost the company revenue, and made a huge dent in AOL’s much-publicised claims of offering the highest reliability. When the dust finally settled, it was found that the outage was not due to a hardware or software fault. It was a software upgrade failure that caused the front-page headlines, “America Offline”.
Even after rolling back the offending software upgrade, AOL could not restore its services, because the routing tables had been corrupted during the upgrade process and could not be restored immediately, resulting in a much longer outage. This is a typical example of a software upgrade scenario where the software maintainers (AOL, in this case) had not made adequate provision for an immediate rollback in case any failures occurred during the upgrade process itself.
Another well-publicised software upgrade failure, with much greater damage to business, is the case of AT&T Wireless, well documented in the article titled “AT&T Wireless Self-destructs”. This chronicles the case of a Customer Relationship Management (CRM) software upgrade that caused the company’s CRM system to fail for more than two months, resulting in the loss of thousands of customers.
The root cause for the outage was the breakage of dependencies between legacy backend components, due to the upgraded software. AT&T could not roll back to the previous version, since not enough of the old version was preserved. All these issues led to the company losing $100 million in revenue, and contributed to its ultimately getting sold off to Cingular for less than half its worth.
As we see from the above examples, in terms of their cost to the company, upgrade failures are extremely expensive and disruptive. The main question for us to ponder over is: What are the common causes of software upgrade failures, and how can they be avoided?
Causes of software upgrade failures
Let’s classify the various software upgrade faults into the following four categories:
- Simple configuration errors: For instance, a typo in a startup script prevents the upgrade from starting a service.
- Semantic configuration errors: For example, the maintainer performing the upgrade fails to create the SSL certificate that the upgrade process needs in order to complete successfully.
- Broken environmental dependencies: Imagine a scenario in which the upgraded software requires a new library version, while a coexisting software component requires an older version. The upgrade process removes the older library version and replaces it with the newer one, causing the coexisting component to fail.
- Data-access errors: These make persistent data partially or fully unavailable to the application/end-user. For instance, if the upgraded application assumes a database schema that does not match the existing one, the end-user can lose access to data.
Each of the above classes of errors can be detected, and prevented from happening, using automated tools along with greater vigilance by the software maintainer/administrator performing the software upgrade. For instance, applications should be written robustly, so that configuration errors arising from typographical mistakes or simple syntax errors can be caught and corrected. Semantic configuration errors are more difficult to catch, but should be detected during a pre-installation check, ahead of the actual software being installed, so that the upgrade does not fail.
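To make the idea of a pre-installation check concrete, here is a minimal sketch in Python. It assumes a simple 'key = value' configuration format; the key names (service_name, port, ssl_cert) and the function names are hypothetical, chosen only for illustration. The check reports simple errors (bad syntax, misspelt keys) and semantic errors (a port out of range, a missing SSL certificate) before anything is installed.

```python
import os

# Hypothetical settings that the upgraded service needs.
REQUIRED_KEYS = {"service_name", "port", "ssl_cert"}


def parse_config(text):
    """Parse 'key = value' lines; return (settings, syntax_errors)."""
    settings, errors = {}, []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" not in line:
            errors.append(f"line {lineno}: expected 'key = value', got {line!r}")
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    return settings, errors


def preinstall_check(text, file_exists=os.path.isfile):
    """Return a list of problems; an empty list means the upgrade may proceed."""
    settings, problems = parse_config(text)
    # Simple errors: missing or misspelt keys.
    for key in sorted(REQUIRED_KEYS - settings.keys()):
        problems.append(f"missing required key: {key}")
    for key in sorted(settings.keys() - REQUIRED_KEYS):
        problems.append(f"unknown key (typo?): {key}")
    # Semantic errors: syntactically valid values that cannot work.
    port = settings.get("port")
    if port is not None and not (port.isdecimal() and 1 <= int(port) <= 65535):
        problems.append(f"port out of range: {port}")
    cert = settings.get("ssl_cert")
    if cert is not None and not file_exists(cert):
        problems.append(f"SSL certificate not found: {cert}")
    return problems
```

An installer built along these lines would refuse to start the upgrade unless preinstall_check returns an empty list. The file_exists parameter is there only so that the check can be exercised without touching the real file system.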
Broken environmental dependencies are among the most difficult upgrade errors to catch. Detecting them relies upon inputs from the software maintainers, and on metadata on the system, such as the software catalogue inventory. Many operating systems provide package managers and software installation tools, which do a pre-installation check for the various dependencies, and allow the upgrade to go ahead only if all the prerequisites are satisfied.
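The dependency check described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular package manager works: the catalogue is assumed to be a plain dictionary that records, for each installed component, the range of library versions it accepts, and the component and library names are hypothetical.

```python
def version_tuple(version):
    """Turn '1.10.2' into (1, 10, 2) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_conflicts(installed, upgrade):
    """List the components whose library requirements the upgrade would break.

    installed: {component: {library: (min_version, max_version)}}
    upgrade:   {library: new_version} that the upgrade would install
    """
    conflicts = []
    for component, requirements in sorted(installed.items()):
        for library, (low, high) in sorted(requirements.items()):
            new = upgrade.get(library)
            if new is None:
                continue  # the upgrade leaves this library alone
            if not version_tuple(low) <= version_tuple(new) <= version_tuple(high):
                conflicts.append((component, library, new))
    return conflicts


# Hypothetical catalogue: 'reporting' still needs the old library series.
catalogue = {
    "billing": {"libssl": ("1.0", "3.9")},
    "reporting": {"libssl": ("1.0", "1.0.9")},
}
print(find_conflicts(catalogue, {"libssl": "3.0.2"}))
```

Here the upgrade would be blocked, because replacing libssl with version 3.0.2 breaks the coexisting reporting component, which is exactly the scenario in the third category above.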
Data-access errors, such as those arising from mismatches in database schema, can be caught by doing pre-installation testing with real-life workloads. A detailed discussion of the various software upgrade faults and their classification can be found in the paper “Why do upgrades fail, and what can we do about it?”
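Testing with real-life workloads is the thorough approach; a much cheaper first line of defence is to compare the schema that the upgraded application assumes with the one actually present. The sketch below does this for an SQLite database, using Python's standard sqlite3 module. The table and column names are hypothetical, and a real check would also need to compare column types and constraints.

```python
import sqlite3

# Columns that the upgraded application assumes (hypothetical schema).
EXPECTED_SCHEMA = {"customers": {"id", "name", "email"}}


def schema_mismatches(conn, expected=EXPECTED_SCHEMA):
    """Return {table: missing_columns} for every table that falls short."""
    mismatches = {}
    for table, columns in expected.items():
        # PRAGMA table_info yields one row per column; index 1 is its name.
        # A table that does not exist yields no rows at all.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}
        missing = columns - present
        if missing:
            mismatches[table] = missing
    return mismatches


# The existing database predates the 'email' column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
print(schema_mismatches(conn))
```

A non-empty result tells the maintainer that a schema migration must be part of the upgrade, before the new application is ever pointed at the production data.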
Note that there are different types of software upgrades that can be used on a large system that employs multiple host nodes interacting with each other. One is known as the rolling upgrade, where each node is upgraded independently, one at a time, and then added back to the system.
Another process is the Big Flip upgrade, where half the nodes in the system are taken out and upgraded to the new software, while the other 50 per cent of the nodes continue to service incoming requests. The upgraded nodes are then added back to the system, and the remaining 50 per cent of the nodes are taken out for upgrade. While a rolling upgrade causes only a small perturbation to the system, it requires that the system allow old and new software to coexist.
On the other hand, Big Flip can cause a large dip in service response times (since 50 per cent of the nodes are taken out for the upgrade), but does not require the system to support the coexistence of old and new software. For instance, an incompatible database schema change cannot be applied as a rolling upgrade, whereas it can be applied as a Big Flip upgrade.
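The trade-off between the two strategies can be seen in a small simulation. This is only a sketch: the node names are hypothetical, and each step simply lists which nodes are serving requests and which are out for upgrade.

```python
def rolling_upgrade(nodes):
    """Yield (in_service, being_upgraded) for each step, one node at a time."""
    for node in nodes:
        yield [n for n in nodes if n != node], [node]


def big_flip(nodes):
    """Yield (in_service, being_upgraded) for the two halves in turn."""
    half = len(nodes) // 2
    first, second = nodes[:half], nodes[half:]
    yield second, first  # old software serves while the first half upgrades
    yield first, second  # new software serves while the second half upgrades


def min_capacity(schedule, total):
    """Smallest fraction of nodes in service at any step of the schedule."""
    return min(len(in_service) for in_service, _ in schedule) / total


nodes = ["n1", "n2", "n3", "n4"]
print(min_capacity(rolling_upgrade(nodes), len(nodes)))  # 0.75
print(min_capacity(big_flip(nodes), len(nodes)))         # 0.5
```

With four nodes, the rolling upgrade never drops below three-quarters of capacity, but from its second step onwards the nodes in service run a mix of old and new software. The Big Flip drops to half capacity, yet at each step every node in service runs the same version.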
Offline vs online software upgrades
An important point to consider when performing a software upgrade is whether the application or system needs to be brought down in order to apply the upgrade. Based on whether downtime is required, we can classify the upgrade as online (does not require the system/application to be brought down during the upgrade) or offline (requires downtime of the system/application being upgraded).
Online upgrades allow interruption-free service even during the upgrade, though possibly with reduced performance/functionality. Since online upgrades support zero downtime for the end-user, they have been a major topic of research over the past few years. While there is considerable complexity in patching a user application dynamically without any interruption in service, consider the complexity involved in patching an entire operating system dynamically without requiring a reboot.
However, recent advances in the research on dynamic software updates have resulted in the development of tools that allow just this. In next month’s column, we will look at Ksplice, which supports dynamically patching Linux kernels. Interested readers can look up the Ksplice paper in the meantime.
This month’s takeaway question
It is centred on dynamic software updates, and is a question to ponder over, rather than a programming problem. Can you think of the scenarios in which a software application cannot be patched dynamically? Can you write a tool which can verify whether a dynamically patched application is indeed performing correctly?
My ‘must-read book/article’ for this month
This month’s recommendation comes from one of our readers, Ms Raji Sundararajan. Raji is a final-year computer science student. She recommends a paper recently presented at the Scala Days 2011 conference, titled “Loop Recognition in C++/Java/Go/Scala”.
This interesting paper benchmarks the four languages by implementing the same graph algorithm, one typically used in compiler optimisations, in each of them. It shows that while C++ does beat the other implementations in terms of performance, it needed considerable tuning, and is dismal in terms of out-of-the-box performance. Thanks, Raji, for your suggestion.
If you have a favourite programming book/article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful, so I can mention it in the column. This would help many readers who want to improve their coding skills.
If you have any favourite programming puzzles that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming, and here’s wishing you the very best!