Facebook heat maps pinpoint trouble spots

19.09.2012

Both of these systems produce copious performance metrics, including various latency, request rate, and error rate statistics. According to Lynch, the caching team was already using a generic heat map to monitor performance. The software, however, could not easily fit the visual data onto a single screen. The colors it used to represent different values offered little intuitive indication of whether a server was performing adequately. Nor did the software interpret the source data in a way that could immediately indicate whether an individual host was running within acceptable bounds.

Lynch designed Claspin, named after a protein that monitors for DNA damage in cells, so that each cluster of servers would get its own heat map, ordered by rack number within a data center. Problems at the rack level or at the cluster level would thus become readily apparent simply by viewing the heat map.
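
The article does not show how Claspin arranges hosts, so the following is only a minimal sketch of the idea of one heat map per cluster, ordered by rack. It is written in Python for illustration. The `Host` record, its field names, and the sample host names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Host(NamedTuple):
    name: str     # hypothetical host name
    cluster: str  # hypothetical cluster identifier
    rack: int     # rack number within the data center


def layout_by_cluster(hosts):
    """Group hosts into one list per cluster, each sorted by rack number."""
    clusters = defaultdict(list)
    for host in hosts:
        clusters[host.cluster].append(host)
    # Sorting by (rack, name) keeps hosts from the same rack adjacent,
    # so a failing rack would show up as a contiguous run of cells.
    return {
        cluster: sorted(members, key=lambda h: (h.rack, h.name))
        for cluster, members in sorted(clusters.items())
    }


hosts = [
    Host("cache-b", "cluster-1", 7),
    Host("cache-a", "cluster-1", 2),
    Host("cache-c", "cluster-2", 1),
    Host("cache-d", "cluster-1", 2),
]
maps = layout_by_cluster(hosts)
```

Because neighbors in each list share a rack, a rack-wide fault would appear as a block of adjacent cells rather than as scattered ones.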

"On a 30-inch screen we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their color, updated in real time--usually in a matter of seconds or minutes," Lynch said. The code to parse and compile the operational metrics was written in JavaScript, and the heat maps were rendered using the SVG format.

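Claspin itself was written in JavaScript, and its source is not shown here. As a rough illustration of rendering one cell per host in SVG, the sketch below uses Python to build the markup as a string. The function name, cell size, gap, and column count are assumptions chosen for the example, not Claspin's actual values.

```python
from xml.sax.saxutils import quoteattr


def render_heat_map(colors, columns=100, cell=8, gap=1):
    """Return an SVG document string with one square per host color."""
    step = cell + gap
    rows = -(-len(colors) // columns)  # ceiling division
    width = columns * step
    height = rows * step
    rects = []
    for i, color in enumerate(colors):
        x = (i % columns) * step   # column position
        y = (i // columns) * step  # row position
        rects.append(
            f'<rect x="{x}" y="{y}" width="{cell}" height="{cell}" '
            f"fill={quoteattr(color)}/>"
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(rects) + "</svg>"
    )


svg = render_heat_map(["green", "red", "black"], columns=2)
```

With these assumed sizes, 10,000 hosts laid out 100 to a row would occupy a 900 by 900 pixel grid.
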
In this heat map, a black box indicates that an individual host is down. A green box means the host is performing adequately, and a red box means that some aspect of its operation, such as a large number of timeouts, is beyond acceptable levels. In addition to providing a visual glimpse into operations, Claspin lets users drill down to specific metrics by moving the mouse pointer over a specific host.
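
The color rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Claspin's actual logic: the metric names and limits are hypothetical, and the article does not say how the 30 or more stats are combined.

```python
# Hypothetical per-metric limits; the real thresholds are not given in the article.
LIMITS = {"timeouts_per_min": 50, "latency_ms": 20}


def violations(stats, limits):
    """Return the names of metrics whose value exceeds its limit."""
    return [name for name, limit in limits.items() if stats.get(name, 0) > limit]


def host_color(is_up, stats, limits=LIMITS):
    """Map a host's state to black (down), red (out of bounds), or green (ok)."""
    if not is_up:
        return "black"
    if violations(stats, limits):
        return "red"
    return "green"


down = host_color(False, {})
healthy = host_color(True, {"timeouts_per_min": 3, "latency_ms": 5})
troubled = host_color(True, {"timeouts_per_min": 400, "latency_ms": 5})
```

The list returned by `violations` is the kind of detail a drill-down view could show when a user inspects a red host.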