3 tips for making highly available systems in Amazon's cloud

01.11.2012

A variety of vendors, whose products some describe as tools, will manage the process of creating highly available cloud systems on AWS. RightScale and enStratus are two of the most popular. RightScale offers customers prepackaged solutions that spread workloads across availability zones (AZs), or even across multiple providers. But as Gartner IaaS analyst Kyle Hilgendorf says, "It's a cost vs. risk play. It's not easy, and it's not cheap." Highly available cloud systems simply cost more and add complexity.

Fault-tolerant systems have to be built to be horizontally scalable. In an ideal world they would be stateless, meaning they wouldn't constantly change as new data is entered and saved into them. In most cases, fault tolerance requires keeping a copy of your system somewhere else that can take over during an outage. Even when all of that is taken into account, there can still be problems.

RightScale CTO Thorsten von Eicken says that during a past AWS outage, internal operations within RightScale had trouble scaling across availability zones in Amazon's cloud. AWS admitted it was "throttling" customers, meaning it limited how much data they could transfer from one AZ to another, a practice it has vowed to apply less aggressively in the future. The point is that even if a system is architected to be fault tolerant, unexpected problems can still arise.

There are multiple ways to architect fault-tolerant systems, von Eicken says. Customers can run two active-active services, or run one active service with a "clone" on standby, for example. Each approach has its own advantages and cost considerations.
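To make the difference between the two approaches concrete, here is a minimal Python sketch of how requests might be routed under each one. It is an illustration only, not RightScale's or AWS's implementation: the zone names, the `healthy` flag, and both routing functions are hypothetical.

```python
# Hypothetical sketch: zones are plain dicts and health is a simple flag.
# A real deployment would use a load balancer and health checks instead.

def route_active_active(zones, request_id):
    """Spread requests across every healthy zone; all zones serve traffic."""
    healthy = [z for z in zones if z["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy zone available")
    # Simple round-robin over the zones that are still up.
    return healthy[request_id % len(healthy)]["name"]


def route_active_standby(active, standby):
    """Send everything to the active zone; use the clone only if it fails."""
    return active["name"] if active["healthy"] else standby["name"]


if __name__ == "__main__":
    zone_a = {"name": "zone-a", "healthy": True}
    zone_b = {"name": "zone-b", "healthy": True}

    # Both zones up: active-active alternates, active-standby stays on zone-a.
    print([route_active_active([zone_a, zone_b], i) for i in range(4)])
    print(route_active_standby(zone_a, zone_b))

    # Simulate an outage in zone-a: both strategies end up on zone-b.
    zone_a["healthy"] = False
    print([route_active_active([zone_a, zone_b], i) for i in range(4)])
    print(route_active_standby(zone_a, zone_b))
```

The cost trade-off shows up in the structure: in the active-active case both zones carry production traffic all the time, while in the standby case the clone sits idle until the active zone fails.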

Basic fault tolerance: In a basic fault-tolerant architecture, there is a production architecture and a standby "clone architecture." If there is a failure in the master AZ, the system can be switched over to the cloned version. That switch-over usually has to be made manually. In addition, the databases are usually replicated to Amazon's Simple Storage Service (S3) only about every 10 minutes, so when a switch-over does occur, you could lose about the last 10 minutes' worth of data, RightScale says.
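The roughly 10-minute data-loss window can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real RightScale or S3 mechanism: the `Primary` and `Standby` classes, the one-write-per-minute workload, and the `simulate` function are all hypothetical, and the only figure taken from the article is the 10-minute replication interval.

```python
from dataclasses import dataclass, field

# From the article: databases are replicated "about every 10 minutes".
REPLICATION_INTERVAL_MIN = 10


@dataclass
class Primary:
    """Hypothetical production database in the master AZ."""
    writes: list = field(default_factory=list)  # (minute, payload) pairs

    def write(self, minute, payload):
        self.writes.append((minute, payload))


@dataclass
class Standby:
    """Hypothetical clone that only knows what the last backup contained."""
    snapshot: list = field(default_factory=list)
    snapshot_minute: int = 0


def replicate(primary, standby, minute):
    """Copy the primary's full state to the standby (stand-in for a backup)."""
    standby.snapshot = list(primary.writes)
    standby.snapshot_minute = minute


def simulate(failure_minute):
    """Run one write per minute until the primary fails, then report losses."""
    primary, standby = Primary(), Standby()
    for minute in range(failure_minute):
        if minute % REPLICATION_INTERVAL_MIN == 0:
            replicate(primary, standby, minute)
        primary.write(minute, f"order-{minute}")
    # After the manual switch-over, anything not in the snapshot is gone.
    lost = [w for w in primary.writes if w not in standby.snapshot]
    return standby, lost


if __name__ == "__main__":
    standby, lost = simulate(failure_minute=27)
    print("last backup taken at minute", standby.snapshot_minute)
    print("writes lost on switch-over:", [minute for minute, _ in lost])
```

With a failure at minute 27, the last backup was taken at minute 20, so the seven writes from minutes 20 through 26 are lost. In the worst case, a failure just before the next backup, the loss approaches the full 10-minute interval.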