3 tips for making highly available systems in Amazon's cloud

01.11.2012

Amazon Web Services makes a big deal out of its availability zones (AZs), recommending that customers use multiple AZs to build clouds that are tolerant of failures.

But experts say using a second availability zone when you're starting up an AWS instance isn't a panacea to prevent your app from going down during an outage, nor is it as easy to do as simply flipping a switch. And it will cost you more -- sometimes more than double the cost of deploying to a single AZ.

There are a variety of tools for customers to create highly available, fault tolerant systems within Amazon's cloud. One option is to use AWS's own services, specifically its Elastic Load Balancers (ELBs), which allow workloads to be moved from one availability zone into another. However, AWS acknowledged in its account of the October outage that even its ELBs were impacted.

A variety of vendors, whose products some call tools, will manage the process of creating highly available cloud systems using AWS. RightScale and enStratus are two of the most popular. RightScale offers customers prepackaged solutions that spread workloads across AZs, or even across various providers. But as Gartner IaaS analyst Kyle Hilgendorf says, "It's a cost vs. risk play. It's not easy, and it's not cheap." Highly available cloud systems simply cost more and add complexity.

Fault-tolerant systems have to be built to be horizontally scalable. In an ideal world they would be stateless, meaning they wouldn't be constantly changing as new data is entered and saved into them. In most cases, fault tolerance requires a copy of your system to be made somewhere else so it can take over during an outage. Even when all of that is taken into account, there can still be problems.

RightScale CTO Thorsten von Eicken says that during the outage, internal operations within RightScale had trouble scaling across availability zones in Amazon's cloud. AWS admitted it was "throttling" customers, meaning it limited how much data they could transfer from one AZ to another, a practice it has vowed to apply less aggressively in the future. The point is that even if a system is architected to be fault tolerant, unexpected problems can still arise.

There are multiple ways to architect fault tolerant systems, von Eicken says. Customers can create two active-active services, or create one active service and a "clone" standby, for example. Each has its own advantages and cost considerations.

Basic fault tolerance: In a basic fault tolerant architecture, there is a production architecture and a standby "clone architecture." If the master AZ fails, the system can be switched over to the cloned version. That switch-over usually has to be done manually, and the databases are usually replicated in Amazon's Simple Storage Service (S3) only about every 10 minutes, so when a switch-over does occur, you could lose about the last 10 minutes' worth of data, RightScale says.
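To make the data-loss window concrete, here is a minimal Python sketch, assuming backups are taken on a fixed schedule. The function names and timestamps are hypothetical; only the roughly 10-minute interval comes from RightScale's description, and a real replication schedule may drift.

```python
from datetime import datetime, timedelta

# Interval cited in the article; real schedules may vary.
REPLICATION_INTERVAL = timedelta(minutes=10)

def last_snapshot_before(failure_time, first_snapshot,
                         interval=REPLICATION_INTERVAL):
    """Return the time of the latest scheduled backup at or before the failure."""
    if failure_time < first_snapshot:
        raise ValueError("no backup exists before the failure")
    completed = (failure_time - first_snapshot) // interval  # whole intervals elapsed
    return first_snapshot + completed * interval

def data_loss_window(failure_time, first_snapshot,
                     interval=REPLICATION_INTERVAL):
    """Return how much recent data is missing from the standby clone."""
    return failure_time - last_snapshot_before(failure_time, first_snapshot, interval)

if __name__ == "__main__":
    first = datetime(2012, 10, 22, 10, 0)     # hypothetical first backup
    failure = datetime(2012, 10, 22, 10, 27)  # hypothetical failure time
    print(data_loss_window(failure, first))   # data written since the 10:20 backup
```

In this model the loss is always shorter than one replication interval, which is why a shorter interval narrows the exposure at the price of more frequent transfers.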

Advanced fault tolerant system: A more advanced system creates two active systems running simultaneously. In this active-active setup, any instance, or even an entire AZ, can fail and the system will automatically be able to complete all its functions from another AZ that is pre-architected and ready to take over. RightScale says this architecture will cost more than double the cost of a single AZ setup, because not only do all of the services from the single AZ have to be replicated, but there are also data transfer costs that come with keeping both systems up-to-date in real time.
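The routing side of an active-active setup can be sketched roughly as follows. This is an illustration only, not RightScale's or AWS's implementation: the class name, the zone names, and the idea that health status is reported by some external check are all assumptions.

```python
import itertools

class ActiveActiveRouter:
    """Round-robin across zones, skipping any zone marked unhealthy (hypothetical)."""

    def __init__(self, zones):
        self._zones = list(zones)
        self.health = {zone: True for zone in self._zones}
        self._cycle = itertools.cycle(self._zones)

    def mark(self, zone, healthy):
        """Record the result of an external health check for a zone."""
        self.health[zone] = healthy

    def pick_zone(self):
        """Return the next healthy zone, or raise if every zone is down."""
        for _ in range(len(self._zones)):
            zone = next(self._cycle)
            if self.health[zone]:
                return zone
        raise RuntimeError("no healthy availability zone")

if __name__ == "__main__":
    router = ActiveActiveRouter(["zone-a", "zone-b"])  # placeholder zone names
    print(router.pick_zone(), router.pick_zone())      # traffic alternates
    router.mark("zone-a", False)                       # simulate an AZ failure
    print(router.pick_zone(), router.pick_zone())      # all traffic goes to zone-b
```

Because both zones already serve traffic, losing one requires no manual switch-over, but the surviving zone needs enough capacity to absorb the full load, which is part of why the cost more than doubles.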

There are other options, too.

Sean Hull is an independent scalability and performance consultant with iHeavy in New York. Shortly after the AWS outage he authored a post titled "AirBNB didn't have to fail," referring to the travel site that was one of dozens across the Internet that went down when AWS's cloud hiccupped. In the post, Hull argues there are tools Web developers can use to make their sites tolerant of outages.

A website can be programmed to turn off certain features but keep the main part of the site up and running if parts of a system go down. In this case, someone browsing to the site would still be able to use its basic functions, but may not be able to make a purchase. If a website is hosted at multiple locations, a browse-only mode could be activated so that even if AWS does go out, a bare-bones version of the site is still accessible to users.
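A browse-only mode of this kind might look something like the Python sketch below. It is a simplified illustration under stated assumptions, not code from Hull's post: the `Site` class, its methods, and the messages are hypothetical, and a real site would flip the flag from a health check rather than by hand.

```python
class Site:
    """Toy storefront that can degrade to browse-only mode (hypothetical)."""

    def __init__(self):
        self.browse_only = False  # set to True when the write path is unavailable

    def browse(self, item):
        # Read-only pages stay available in both modes.
        return f"showing {item}"

    def purchase(self, item):
        # Features that need the failed backend are switched off, not the whole site.
        if self.browse_only:
            return "purchases are temporarily unavailable"
        return f"purchased {item}"

if __name__ == "__main__":
    site = Site()
    print(site.purchase("room"))  # normal operation
    site.browse_only = True       # simulate an outage of the ordering backend
    print(site.browse("room"))    # browsing still works
    print(site.purchase("room"))  # purchase is refused gracefully
```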

Other third-party vendors offer services within AWS's ecosystem for customers to create highly available systems. Amazon Web Services has launched a collection of third-party offerings that have been optimized to run on AWS.

Customers can choose from a variety of load balancers offered by these partners, such as Riverbed's Stingray division. Apurva Dave, VP of product marketing for the company, says there is a benefit to taking a "best of breed" approach of using third-party apps instead of simply relying on AWS for services such as load balancing.

"Have you ever been at an airport when a flight is canceled?" he says, as an analogy. "Everyone's in an immediate rush to get to the customer service desk and they wait in line. Then there are other folks, who just call their travel agent directly and the problem is taken care of. We're that travel agent that directs your traffic where it needs to go." Riverbed Stingray immediately and automatically redirects traffic over a dedicated network, avoiding the bottlenecks created in the AWS environment. Like the RightScale option, there are various levels of service that customers can choose from, ranging in price depending on how fault tolerant the system is.

As Hilgendorf says, and Dave agrees: It's a cost-benefit analysis for the business. "There are some apps where it's OK if it goes down for an hour a month," Dave says. "But there are tons of apps that can't afford that."

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be reached at BButler@nww.com and found on Twitter at @BButlerNWW.