VMware causes second outage while recovering from first

02.05.2011

A VMware official wrote that the April 25 power outage is "something that can and will happen from time to time," and that VMware has to ensure that its software, monitoring systems and operational practices are robust enough to prevent power outages from taking customer systems offline.

With that in mind, VMware began developing "a full operational playbook for early detection, prevention and restoration" the very next day.

"At 8am [April 26] this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon," Tankel wrote. "This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

The second-day outage was the more serious of the two.

"This was our first total outage, which is an event where we need to put up a maintenance page," Tankel continued. "During this outage, all applications and system components continued to run. However, with the front-end network down, we were the only ones that knew that the system was up. By 11:30 a.m. PDT the front end network was fully operational."