Even after a lot of time and energy spent on testing and validating releases, some fail (or partially fail) when rolled out to PROD. In this post, we’ll look at the common causes of PROD rollbacks and how to manage the risk of software releases.

We have all seen PROD releases fail, whether as consumers or as software professionals. Failed releases are an unfortunate part of the software industry. There is no magic bullet (or recipe or process) to eliminate PROD releases being rolled back, but there are prudent steps that organizations can take.

Why Releases Fail

First, we need to understand why releases fail in the first place. In my opinion, there are 3 main categories for PROD releases failing:

  • Features not fully tested: This is the most obvious reason. Some user scenarios were missed or overlooked in the release validation process, letting bugs slip through.
  • Size: In every company that I have worked in, PROD is always the largest environment. This is true in terms of compute resources (VMs, databases, networking, etc.) as well as the amount of data.
  • Configuration: Since PROD serves the full customer base, it has the largest configuration of compute resources and the largest configuration of customer parameters.

Depending on the type of release being delivered, these 3 factors vary in importance.

An incremental feature improvement should be covered by the validation process and should not affect Size and Configuration in any meaningful way. Typically, incremental feature improvements should not fail.

By comparison, infrastructure or major feature changes are greatly affected by the above 3 factors. Take the example of a database (DB) upgrade. Since the data in PROD is typically much larger than lower environments, it is very hard to anticipate where the DB upgrade process could have issues. Data conversion during the upgrade may slow to a crawl or some customer data may be unique and therefore untested.

It is major feature changes or infrastructure changes which tend to fail the most often.

Preventing Failures

It is impossible to fully prevent failures, but there are some proactive steps that an organization can take. Note, none of these steps are easy. These steps should be viewed as an investment in producing the best customer experience possible.

  • Test, test, test! Investment in full end-to-end feature validation is key. From a practical perspective, to make sure all tests are run against all potential releases, this testing will need to be automated.
  • Monitor and audit customer use cases: When features are designed, the organization has a pretty good view of how customers will initially use the feature. However, over time, customers come up with unique configurations to solve their specific needs. Unless the organization monitors the customer use cases, some critical testing may be overlooked.
  • Make the data in test environment comparable in size and complexity to PROD: The goal here is to try and determine if there are unforeseen performance issues with the new release. Again, this is especially important if the release contains infrastructure or a major feature change.
  • Canary releases: This is a fairly common technique and goes by several names. The idea is to release the product to a small set of customers, monitor for failures and then make any adjustments before rolling out to all customers.
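To make the canary idea concrete, here is a minimal sketch of percentage-based canary routing. The function names, version labels, and 5% cohort size are all illustrative assumptions, not a prescribed implementation; real systems usually do this in a load balancer or feature-flag service.

```python
import hashlib

def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary cohort.

    Hashing the user ID (rather than sampling randomly per request)
    keeps each user pinned to the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < canary_percent

def pick_version(user_id: str, canary_percent: int = 5) -> str:
    """Route a request to the new (canary) or stable release."""
    return "v2-canary" if in_canary(user_id, canary_percent) else "v1-stable"
```

Because assignment is deterministic, you can widen the rollout by simply raising `canary_percent` (5 → 25 → 100); users already in the canary stay there, and monitoring can compare error rates between the two cohorts before going wider.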

Improve the Release Process

There are several steps organizations can take to optimize the customer experience.

  • Release schedule: Have published schedules documenting when PROD releases occur. This lets customers know a change is happening. Also, it allows the organization to plan its internal resources and communication more effectively.
  • Have a rollback strategy: This feels counter-intuitive, but it is important to have an exit strategy if the release does not go as planned. Again, the focus here is to minimize customer downtime in the event of failure.
  • Release Metrics: Which releases were successful and which failed? What was their scope? Reviewing and analyzing these trends provides an objective view of strengths and weaknesses.
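As a sketch of what tracking release metrics can look like, the snippet below computes failure rate per release scope from a hypothetical log of past releases. The record format and scope labels (which mirror the incremental vs. major-feature vs. infrastructure distinction above) are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical release history: (scope, succeeded) pairs.
releases = [
    ("incremental", True),
    ("incremental", True),
    ("major-feature", False),
    ("infrastructure", False),
    ("incremental", True),
    ("major-feature", True),
]

def failure_rate_by_scope(records):
    """Return {scope: failure_rate} from (scope, succeeded) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for scope, ok in records:
        totals[scope] += 1
        if not ok:
            failures[scope] += 1
    return {scope: failures[scope] / totals[scope] for scope in totals}
```

On the sample data this yields a 0% failure rate for incremental releases and much higher rates for major-feature and infrastructure changes, which is exactly the kind of objective signal the bullet above argues for.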

Wash, Rinse, Repeat

Releases are part of any software lifecycle. It is important to view the Release Process as a separate step with separate metrics, e.g., time to deploy, scope of change, failure rate, etc.

Organizations generally review failed releases, but they tend not to analyze which releases were successful. You need both sets of data to understand the organization's strengths and weaknesses!

By following these metrics over time, the organization will learn how to best prepare for future releases.
