Netflix uncages Chaos Monkey disaster testing system

30.07.2012

Netflix says it has run Chaos Monkey internally to create 65,000 failed instances across its system. "Failures happen and they inevitably happen when least desired or expected," the blog reads, continuing later: "Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic and that 'simple fix' you put in place last week could have undesired consequences."

Jeremiah Peschka, managing director at IT consultancy Brent Ozar PLF, says the clients he advises too often neglect to put disaster recovery plans in place, much less test them. "This seems like a really sane and safe way to see if you're protected that doesn't cost a ton," he says. Netflix says in the blog that the service could likely be run using Amazon SimpleDB, a non-relational data store, and is small enough that it could be run within AWS's free usage tier, which covers up to 25 SimpleDB machine hours and 1GB of storage.

For users who may be nervous about intentionally failing their systems, Peschka recommends running Chaos Monkey in a scaled testing area that mimics a production environment. Other ways to test DR systems, he says, have traditionally relied on manually built processes for shutting down servers, but Chaos Monkey is the first DR testing system he has seen released in an open source fashion.

"One of the biggest stumbling blocks of the cloud is that it can go down," he says. "People need to test this stuff and if you're really concerned about having 24/7 uptime, there's probably even more you can do." For example, highly available systems typically use elastic load balancers, which are available from Amazon Web Services, to route traffic away from failed instances. Netflix has hinted that it is also testing a tool that fails an entire availability zone, and perhaps even an entire region, in what it calls "Chaos Gorilla." The company says it hopes to open source other tools it uses as well in the future.
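
The idea of killing instances at random while a load balancer steers traffic around the failures can be illustrated with a toy simulation. The sketch below is not Netflix's code and makes no AWS calls; the `Instance`, `LoadBalancer`, `unleash_monkey` and `run_drill` names are hypothetical, invented only to show the pattern under the assumption of a simple round-robin balancer with a health flag per instance.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for a virtual machine; 'healthy' is a simulated health check."""
    name: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Round-robin balancer that only routes to instances marked healthy."""
    instances: list
    _next: int = 0

    def route(self):
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances left")
        chosen = healthy[self._next % len(healthy)]
        self._next += 1
        return chosen


def unleash_monkey(instances, rng):
    """Mark one randomly chosen healthy instance as failed and return it."""
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def run_drill(num_instances=4, kills=2, requests=10, seed=0):
    """Kill some instances, then check which instances serve the requests."""
    rng = random.Random(seed)
    pool = [Instance(f"web-{n}") for n in range(num_instances)]
    balancer = LoadBalancer(pool)
    killed = [unleash_monkey(pool, rng).name for _ in range(kills)]
    served_by = [balancer.route().name for _ in range(requests)]
    return killed, served_by


if __name__ == "__main__":
    killed, served_by = run_drill()
    print("killed:", killed)
    print("requests served by:", sorted(set(served_by)))
```

If every instance in the pool is killed, `route` raises an error, which is the simulated equivalent of the outage a real drill is meant to expose before it happens in production.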

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be reached at BButler@nww.com and found on Twitter at @BButlerNWW.