Google gets Slack with software updates

16.10.2006

Maintaining a large number of Linux servers to power its search and Web application services is at the heart of Google's business and, until now, has remained a closely guarded secret.

Speaking at the Australian Unix Users Group (AUUG) 2006 conference in Melbourne last week, corporate systems administrator Michael Still lifted the lid on some of the tools Google uses internally to manage clusters of servers.

Rather than relying on standard Linux operating system packages, Google developed its own software, dubbed "Slack", and released it as an open source project a year ago. Still said this is the first time the search giant has talked about it publicly.

"Slack is a source deployment system and it's the way we install applications on servers," Still said, adding Slack is based around a centralized configuration repository which is then deployed onto selected machines in a "pull" method. Each of the "worker" machines asks for its new configuration regularly or when a manual command is run.

"An application install is called a Slack role, so if you have an LDAP slave, you have an LDAP slave role," Still said. "You can have more than one role per machine although if the roles are going to tread on each other then your installs will have to handle how to deal with that."

With Slack, Google system administrators build changes or patches against the source control system for configuration. These changes are checked into the central repository and then passed on to the "Slackmaster", which Still said is "nothing special", just an rsync server.

Slack also supports sub-roles for specific parts of an application, and both pre- and post-install scripts.
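
To make the pull model concrete, here is a minimal Python sketch of what a worker machine might do when it installs a role. It is an illustration only, not Slack's actual code: the function name `pull_role`, the directory layout (one directory per role) and the hook callables are all assumptions, and a plain local directory copy stands in for the rsync transfer Still described.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical hook type: a callable that receives the install target.
Hook = Callable[[Path], None]


def pull_role(repo: Path, role: str, target: Path,
              preinstall: Optional[Hook] = None,
              postinstall: Optional[Hook] = None) -> Tuple[List[str], List[str]]:
    """Copy the file tree for `role` from `repo` onto `target`.

    Returns (installed, overwritten): every file the role ships, and the
    subset that already existed on the machine before this install.
    """
    source = repo / role
    if not source.is_dir():
        raise FileNotFoundError(f"no such role: {role}")

    installed = sorted(p.relative_to(source).as_posix()
                       for p in source.rglob("*") if p.is_file())
    # Files already present may belong to another role on this machine,
    # which is where two roles could "tread on each other".
    overwritten = [f for f in installed if (target / f).exists()]

    if preinstall:
        preinstall(target)
    # There is no intermediate package: the role's files are copied as-is.
    shutil.copytree(source, target, dirs_exist_ok=True)
    if postinstall:
        postinstall(target)
    return installed, overwritten
```

Running the same role twice simply copies the files again, which matches the "fix it and redeploy it everywhere" approach rather than any notion of rollback.
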

Still said there are alternatives to Slack, the most obvious being operating system packages, but one advantage of Google's system is that there is "no intermediate binary compact form" of the Slack role.

"So it's reasonably easy to go poke around with just the bit you need without going and rebuilding an entire RPM," he said.

While there is no concept of rolling back a Slack role, if something is broken "you fix it and redeploy it everywhere".

"If you really regret that a machine is not an LDAP slave for instance, you have a repeatable operating system install [so] rebuild it for whatever it was meant to be," Still said. "We can get a new server up in probably half an hour."

There is also no logging of which Slack roles were deployed when, but Still said that will be fixed soon.

On the topic of standard operating environments (SOEs), Still said he is amazed at the number of people who don't have them for servers.

"We do, obviously, and it saves our bacon all the time," he said. "If random people can wake up at 3am and know what to expect on a machine then that can stop really embarrassing stuff from happening. So if you don't have that you should get one."

In addition to having a repeatable operating system install, Still recommends having a repeatable application install "so at 3am when you lose one of your Web front ends you are able to bring up another one quickly".

"You need to monitor your machines so you know what's failed, preferably before your users know so you can start looking at it," he said. "We do a lot of that with custom code but there are lots of open source monitoring systems out there that are actually quite good. You also want to keep failure metrics so you know if your mail server is the least reliable portion of your network - it's useful data."

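The failure metrics Still mentions can be as simple as a tally of incidents per service. The Python sketch below is a hypothetical illustration, not Google's monitoring code: the function name `least_reliable` and the service names are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def least_reliable(failures: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Rank services by how often they appear in a log of failure events."""
    return Counter(failures).most_common(top)


# Hypothetical failure log: one entry per recorded incident.
events = ["mail", "web", "mail", "ldap", "mail", "web"]
print(least_reliable(events))  # [('mail', 3), ('web', 2), ('ldap', 1)]
```
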
The humble Still believes none of what Google does is gospel and is sure there are other "equally valid" ways of keeping systems in check.

"If you've only got six machines then the answer might be to spend $100,000 per machine but if you are going to build apps based on whitebox hardware then you have to assume that hardware is going to fail reasonably regularly," he said.

The exact number of servers used by Google is kept under wraps but speculation puts the total number in the tens to hundreds of thousands.

"Generally you architect things, even smaller internal corporate apps, so that when things fail the app stays up," Still said.