Workday’s Journey to Zero Downtime: Progress Report

As our customers know, we update Workday every week. Our single codeline approach to development enables us to push out a number of changes per week to continually enhance and improve Workday. At the same time, we’re continually investing in how we can reduce the amount of planned downtime associated with these updates.

In addition to weekly updates, we group the most significant changes into twice-yearly larger updates so our customers can plan for these new capabilities. But for us, bi-annual feature release updates are mechanically identical to the weekly updates—we just happen to change more feature toggles from “preview mode” to “production mode.”
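To make the toggle idea concrete, here is a minimal sketch of a feature release that only changes toggle states. It is not Workday's implementation: the feature names, the `ToggleMode` enum, the `is_enabled` and `promote` helpers, and the notion of a "preview tenant" are all assumptions made for illustration.

```python
from enum import Enum


class ToggleMode(Enum):
    PREVIEW = "preview"
    PRODUCTION = "production"


# Hypothetical toggle registry; the feature names are invented.
toggles = {
    "new_dashboard": ToggleMode.PREVIEW,
    "faster_search": ToggleMode.PRODUCTION,
}


def is_enabled(feature: str, tenant_is_preview: bool) -> bool:
    """Production-mode features are on for everyone; preview-mode
    features are on only for tenants flagged as preview (assumption)."""
    mode = toggles[feature]
    return mode is ToggleMode.PRODUCTION or tenant_is_preview


def promote(features: list[str]) -> None:
    """A 'feature release' in this sketch: flip toggles to production mode."""
    for feature in features:
        toggles[feature] = ToggleMode.PRODUCTION
```

In this sketch the deployed code is the same before and after `promote(["new_dashboard"])`; only the toggle state differs, which is why a large release can follow the same mechanics as a weekly update.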

One of the dividends of moving to single codeline development is that we’ve already been able to significantly shorten the downtime associated with our update process. For example, where it previously took us three 36-hour waves to deliver a major update, we delivered our latest bi-annual update, Workday 24, last March in a single four-hour period.

By the end of 2015, we plan to halve downtime for weekly updates from the current four hours to a maximum of two hours. Over time we will eliminate planned downtime for service updates. Here’s how we plan to do it.

Reducing and Eliminating Service Downtime

There are a few elements to working toward this goal. The first is service decomposition: we’ve developed more logical separations among the individual sub-services that come together to make up Workday. We can now start and stop these sub-services independently of one another, and start them in parallel when we bring the service back up after an update. This will significantly reduce the overall startup latency.
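The benefit of parallel startup can be sketched in a few lines. This is an illustration only: the sub-service names and startup times are made up, and `start_service` merely sleeps in place of real startup work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-services with invented startup times, in seconds.
STARTUP_SECONDS = {"reporting": 0.03, "integrations": 0.02, "search": 0.01}


def start_service(name: str) -> str:
    """Stand-in for real startup work; returns the name once 'started'."""
    time.sleep(STARTUP_SECONDS[name])
    return name


def start_all_in_parallel(names: list[str]) -> list[str]:
    """Start every named sub-service concurrently and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(start_service, names))


# With no dependencies between services, sequential startup costs roughly
# the sum of the individual times, and parallel startup roughly the maximum.
sequential_estimate = sum(STARTUP_SECONDS.values())
parallel_estimate = max(STARTUP_SECONDS.values())
```

The estimates at the end assume the sub-services are fully independent. Where one service must wait for another, the parallel figure grows toward the length of the longest dependency chain.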

In addition, these independent services will be working in a more sophisticated shared state model, overseen by a “tenant coordinator.” Sub-services and customers’ tenants understand their dependencies and relationships in this new model, and collaborate (via Apache ZooKeeper and the tenant coordinator) to optimize their startup sequence.
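One way to picture a dependency-aware startup sequence is a topological ordering that groups services into waves, where everything in a wave can start in parallel. The sketch below uses Python's standard `graphlib` module as an in-process stand-in. It does not use ZooKeeper or model the tenant coordinator, and the service names and dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up first.
DEPENDS_ON = {
    "object_store": set(),
    "transactions": {"object_store"},
    "reporting": {"transactions"},
    "search": {"object_store"},
}


def startup_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as started, unlocking dependents
    return waves
```

For the map above, `startup_waves(DEPENDS_ON)` returns `[['object_store'], ['search', 'transactions'], ['reporting']]`. In a distributed setting the "done" signal would come from shared coordination state rather than a local call.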

In our current model, after we’ve updated Workday, no customer is granted access to its tenant until all services are ready for all tenants. This introduces unnecessary latency across all tenants.

In the new model that we will progressively move to by the end of 2015, we’ll be able to release each customer’s tenant as soon as its dependent services are active and ready. In the short term, larger tenants will take a little longer than smaller tenants to be ready, but we are rapidly moving to a model where startup time is independent of customer size.
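The difference between the two release policies can be sketched as two small functions. This is a simplified model under stated assumptions: the tenant names, the per-tenant dependency sets, and both function names are hypothetical, and readiness is reduced to a set of service names.

```python
# Hypothetical tenants and the sub-services each one depends on.
TENANT_NEEDS = {
    "small_co": {"transactions"},
    "big_corp": {"transactions", "reporting", "search"},
}


def releasable_all_or_nothing(ready: set[str], all_services: set[str]) -> set[str]:
    """Current model: no tenant is released until every service is ready."""
    return set(TENANT_NEEDS) if ready >= all_services else set()


def releasable_per_tenant(ready: set[str]) -> set[str]:
    """New model: release a tenant once its own dependencies are ready."""
    return {tenant for tenant, needs in TENANT_NEEDS.items() if needs <= ready}
```

With only `transactions` ready, the first function releases no one, while the second already releases `small_co`. This mirrors the point that tenants with fewer dependencies can come back sooner.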

Our more refined state model has ancillary benefits, too. We have recently introduced a more sophisticated degradation model in the Workday service. If an individual sub-service becomes distressed or breaks, its state changes to “degraded.” At that point we stop sending any work to that service and route around it. So in the event of a problem with a single sub-service, we can localize the impact of that problem to a small subset of a customer’s users and operations, and then rapidly and automatically restore full service.
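The routing behavior described here can be illustrated with a small round-robin router that skips anything marked "degraded." This is a sketch, not the actual Workday mechanism: the `Instance` and `Router` classes, the instance names, and the assumption that a sub-service runs as several interchangeable instances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    state: str = "healthy"  # "healthy" or "degraded"


class Router:
    """Round-robin over healthy instances, routing around degraded ones."""

    def __init__(self, instances: list[Instance]) -> None:
        self.instances = instances
        self._counter = 0

    def set_state(self, name: str, state: str) -> None:
        for instance in self.instances:
            if instance.name == name:
                instance.state = state

    def route(self) -> Instance:
        healthy = [i for i in self.instances if i.state == "healthy"]
        if not healthy:
            raise RuntimeError("no healthy instance available")
        chosen = healthy[self._counter % len(healthy)]
        self._counter += 1
        return chosen
```

Marking an instance degraded with `set_state("a", "degraded")` stops new work from reaching it, and setting it back to "healthy" returns it to the rotation. That is the localize-then-restore pattern in miniature.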

As an example, this means that if a customer has a report that behaves unusually, it will not affect any other reports or analytics other users may be running at the same time. More graceful degradation is an important overall element of service resiliency.

This move towards finer-grained services, with a more sophisticated state model and complete deployment automation, is all part of our transition to a fully virtualized elastic operating model.

In addition, we are transitioning our data center and operating infrastructure to our OpenStack-based availability zone model over the next two years, and customers should see progressive improvements to performance, availability, and service resiliency. We want not just to continue meeting our availability and performance SLAs, but to consistently beat and raise them over time.

Finally, most of the sub-services in our technical architecture can already be updated in-place without any downtime. Customers may have seen this as we respond to certain classes of issues we encounter and fix mid-week without any downtime. We are progressively enabling all of our constituent services to operate this way, including our core in-memory transaction server.

As a result of all these efforts, we plan to roll out zero downtime (ZDT) updates progressively starting in 2016.

You can check out this video for an informal look at a ZDT update of one of our development systems. If you’re interested in learning more about zero downtime, please attend my session at Workday Rising in Las Vegas, Workday Rising Europe in Dublin, or catch the recordings after the events.

As we gear up for the next 10 years of growth at Workday, we are investing heavily in novel technical approaches to improve the reliability, availability, and scalability of our service. These challenges are interesting at our level of scale and growth. Our engineering teams in the U.S. (Pleasanton, San Francisco, Portland, Boston, and Boulder) and Europe (Dublin and Munich) are all expanding. If you are an expert and/or have a point of view in these areas, we’d love to talk to you. Find out more at Workday Next, where you can also see the technical events we are hosting or attending.