Cloud Computing Done the Netflix Way

05.04.2012

Also, applications and automated monitoring constantly check the performance and latency of services. Applications are written to call services asynchronously, so that if one fails, the application does not hang but moves on with a small piece missing or with slightly stale cached data. The monitoring mechanism constantly watches service performance and, if it observes intolerable variances, initiates a set of specific automated steps. If the performance problem persists, the system raises alerts to ensure that human attention is directed to the problem.
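To make the asynchronous-call idea concrete, here is a minimal sketch in Python. It is not Netflix's code: the service names, the in-memory `CACHE`, and the `call_with_fallback` helper are all hypothetical, chosen only to illustrate a page that renders even when one service fails and another is too slow.

```python
import asyncio

# Hypothetical cache of the last known good response per service.
CACHE = {"recommendations": ["cached title A", "cached title B"]}

async def call_with_fallback(name, service, timeout=0.05):
    """Call a service; on failure or timeout, fall back to cached data (or None)."""
    try:
        result = await asyncio.wait_for(service(), timeout)
        CACHE[name] = result  # refresh the cache on success
        return result
    except (asyncio.TimeoutError, ConnectionError):
        # Slightly stale data if we have it, otherwise a missing piece.
        return CACHE.get(name)

# Three hypothetical services: one healthy, one down, one too slow.
async def ratings():
    return {"stars": 4}

async def recommendations():
    raise ConnectionError("service down")

async def similar_titles():
    await asyncio.sleep(1)  # exceeds the timeout, so the call is cancelled
    return ["never returned"]

async def render_page():
    services = {
        "ratings": ratings,
        "recommendations": recommendations,
        "similar_titles": similar_titles,
    }
    # All calls run concurrently; no single failure blocks the page.
    results = await asyncio.gather(
        *(call_with_fallback(name, svc) for name, svc in services.items())
    )
    return dict(zip(services, results))

page = asyncio.run(render_page())
```

In this sketch the page comes back with live ratings, cached recommendations, and an empty slot for similar titles, which is the "small piece missing or slightly stale" behavior described above.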

This can be taken even further. Since the underlying infrastructure can be untrustworthy, Netflix spreads its processing across many different Amazon data centers and regions. This makes the environment more complex and more challenging to operate, but it safeguards Netflix from even large infrastructure outages. (Netflix was notably unaffected by last April's AWS outage, when many "Web 2.0" companies found themselves offline as a result of their decision not to absorb the additional cost and complexity of distributing their applications more widely across Amazon's infrastructure.)
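The basic mechanics of spreading work across regions can be sketched in a few lines. This is an illustration only, assuming a simple health map and a preference-ordered region list; the `pick_region` function is hypothetical and says nothing about how Netflix actually routes traffic.

```python
# Illustrative region list in order of preference (an assumption for this sketch).
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def pick_region(health, preferred=REGIONS):
    """Return the first region reported healthy, or None if all are down."""
    for region in preferred:
        if health.get(region, False):
            return region
    return None

# Simulate an outage in the preferred region.
health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
chosen = pick_region(health)
```

Here traffic moves to the next healthy region when the preferred one is down. The extra cost and complexity mentioned above comes from everything this sketch leaves out, such as keeping data and capacity available in every region.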

Then, if your application is composed of many failure-prone services, and your application architecture is written to tolerate service failures, it makes sense to deliberately shut down portions of your production environment to see if the application is truly robust. Netflix famously does this with what it calls its "chaos monkey," in which different service environments are randomly taken offline to confirm that the Netflix environment can continue operating in the face of resource failure. One thing that came out of the presentations is that Netflix has many monkeys, not just one. They do different things, but they all focus on validating the robustness of the environment when confronted with resource failure.
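The idea behind this kind of testing can be shown with a toy simulation. What follows is not Netflix's Chaos Monkey; the fleet dictionary and the `unleash_monkey` and `service_is_available` functions are hypothetical stand-ins for a real environment and a real health check.

```python
import random

def unleash_monkey(fleet, rng):
    """Take one randomly chosen running instance offline and return its id."""
    running = [name for name, up in fleet.items() if up]
    if not running:
        return None  # nothing left to take down
    victim = rng.choice(running)
    fleet[victim] = False
    return victim

def service_is_available(fleet, min_running=1):
    """The service counts as available while enough instances are still up."""
    return sum(fleet.values()) >= min_running

# A simulated fleet of four instances, all running.
fleet = {f"instance-{n}": True for n in range(4)}
rng = random.Random(42)  # seeded so the simulation is repeatable

victim = unleash_monkey(fleet, rng)
survived = service_is_available(fleet, min_running=2)
```

The point of the exercise is the final check: if `survived` were ever false after a single instance went away, the environment would not be as robust as its architecture claims.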

Of course, if the concepts of release to production -- and release itself -- are called into question, so too is the role of operations. Netflix does not have a separate operations group for its cloud infrastructure -- every developer is responsible for putting his or her code into production and is called when something breaks. Cockcroft has caused a bit of a ruckus in the cloud community by calling this "NoOps," in contrast to "DevOps," which many operations-focused folks feel is the future of large-scale cloud computing applications.

To my mind, the notion of fine-grained services in continuous deployment puts to rest the concept of a separate operations group responsible for putting applications into production and keeping them running. I believe Cockcroft is somewhat overstating the situation, as there are people tracking the service monitoring and ensuring that any performance and latency issues get addressed. The larger point is that the new model of applications requires a radical rethinking of application architectures, different ways of moving fine-grained services through their individual lifecycles, different ways of monitoring an "application," and different ways of ensuring robustness. As I said last week, cloud computing requires rebuilding enterprise IT for a completely new operating model.