Facebook heat maps pinpoint trouble spots

19.09.2012
Faced with the challenge of overseeing the health of large caching systems, a Facebook engineer developed heat-map software to quickly pinpoint problems in the social network's data centers.

The visualization monitoring tool, called Claspin, uses the heat map format to portray the working status of Facebook's servers.

"As Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong," wrote Sean Lynch, an engineer with Facebook's cache performance team.

The idea of using heat maps to oversee data center operations is an emerging one. At least one Oracle engineer has also raised the idea of using heat maps to quickly convey potential problems in the data center.

Whenever the popular social networking service experiences technical difficulties, the cache performance group must make sure that the caching mechanisms are not the problem, or part of the problem. A heat map can be an efficient way of representing the operational status of a large number of components. Each component is represented as a cell in a large matrix, and the color of the cell represents the health of the component. A green cell may represent a node that is operating within acceptable bounds, while a red cell may represent one not operating correctly.
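The cell-and-color idea described above can be illustrated with a minimal Python sketch. This is not Claspin's code: the per-node error rate, the 0.05 threshold, and the function names `health_color`, `build_heat_map` and `render` are all assumptions made up for the example.

```python
# Illustrative sketch only, not Claspin. The metric (an error rate per node),
# the threshold, and every name here are hypothetical.

SYMBOLS = {"green": ".", "red": "X"}  # text stand-ins for cell colors


def health_color(error_rate, threshold=0.05):
    """Return "green" if the node is within bounds, otherwise "red"."""
    return "red" if error_rate > threshold else "green"


def build_heat_map(error_rates, columns):
    """Assign one cell per node and wrap the cells into rows of `columns`."""
    cells = [health_color(rate) for rate in error_rates]
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


def render(grid):
    """Draw the matrix as text, one character per cell."""
    return "\n".join("".join(SYMBOLS[color] for color in row) for row in grid)
```

Calling `render(build_heat_map(rates, columns=40))` on a list of per-node rates would print a block of dots in which any unhealthy node stands out as an `X`, which is the at-a-glance effect the heat map format is meant to provide.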

Facebook uses two major cache systems. One uses Memcache, and the other relies on a caching graph database.