Skype post-mortem explains service outage

29.12.2010
Skype CIO Lars Rabbe on Wednesday offered a frank assessment of the recent 24-hour lapse in its Internet telephony service, in a blog post that also laid out what the company is now doing to make its network more robust.

Rabbe's post also served as a corporate mea culpa, saying "we know that we fell short in both fulfilling your expectations and communicating with you during this incident."

The failure of Skype's service for many of its users started at about 4 p.m. GMT on Dec. 22 and lasted through much of the 23rd, Rabbe said. On the 22nd, a cluster of servers became overloaded, and some Skype clients received delayed responses from them. In one particular version of the Skype for Windows client, the delayed responses were not processed properly, and the client software crashed.

The affected version of the Skype for Windows client was 5.0.0152, which Rabbe said about half of Skype's users were running. The crashes took down about 40 percent of those clients. Among the clients that failed were between 25 and 30 percent of the systems that provide important directory services in Skype's peer-to-peer network.

Skype worked quickly to bring these so-called supernodes back online, but even once restarted, those systems remained unavailable to the network for a time. Meanwhile, the extra load on the remaining supernodes pushed some of them over the edge, and they shut down as well. "This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near complete failures that occurred a few hours after the triggering event," Rabbe explained.
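
The feedback loop Rabbe describes can be illustrated with a toy model. The sketch below is not Skype's code and makes no claim about its real architecture; the function name, the node capacities, and the load figure are all assumptions chosen for illustration. It assumes a fixed total load shared evenly among live supernodes, and that any node asked to carry more than its capacity shuts down.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of live nodes after each round of a toy cascade.

    Hypothetical model: `total_load` is split evenly across live nodes,
    and every node whose share exceeds its capacity drops out.
    """
    live = list(capacities)
    history = [len(live)]
    while live:
        per_node = total_load / len(live)  # even share for each survivor
        survivors = [c for c in live if c >= per_node]
        if len(survivors) == len(live):
            break  # nobody is overloaded, so the network is stable
        live = survivors
        history.append(len(live))
    return history


# 100 made-up supernodes with capacities from 100 to 199 load units.
nodes = list(range(100, 200))

# Healthy network: each node carries 95 units, below every capacity.
print(simulate_cascade(nodes, 9500))       # [100]

# Triggering event: 30 nodes vanish, but the load stays the same.
print(simulate_cascade(nodes[30:], 9500))  # [70, 64, 51, 13, 0]
```

In this toy run the 30 nodes that drop out are the weakest ones, the most favorable case for the network, yet each round of failures still raises the share carried by the survivors until none are left. That is the same shape as the near complete failure Rabbe describes, though the real numbers and timing were different.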

To fix the problem, Skype engineers introduced hundreds of instances of the Skype software into the peer-to-peer network to serve as dedicated supernodes, the CIO said. To do that, they drew on resources that are normally used in Group Video calling, thus taking that service offline temporarily. It was restored in time for Christmas, Rabbe wrote.