MIT Media Lab's baby-talk project spawns massive array

22.05.2006

Imagine a storage array that holds as much data as a stack of iPods three times the height of the Empire State Building, yet can be managed with common Ethernet networking tools.

Developing such an array is the goal of a group of MIT scientists and four storage vendors working on the Human Speechome Project, an MIT Media Lab venture looking to find out how babies develop the ability to talk.

The project began nine months ago when MIT associate professor Deb Roy began using 14 microphones and 11 cameras mounted on ceilings throughout his house to record his baby boy's everyday life. The setup gives researchers a bird's-eye view of every room in the house.

The effort requires a massive storage-area network to archive and search what is expected to be 1.4 petabytes, or 1,400TB, of data compiled over the span of the three-year project. It is being built from commodity hardware and uses a 10 Gigabit Ethernet IP network for data transfer between the back-end SAN and hundreds of servers.

"I think here what we're seeing is the future of storage. This is a great marriage between industry and the academic world," said Frank Moss, director of the MIT Media Lab and former CEO of Tivoli Systems Inc., at a press conference last week at MIT.

When completed, the Human Speechome Project's computing infrastructure is expected to include more than 300 Hammer Z-Rack storage enclosures from Bell Microproducts Inc., about 3,000 Serial ATA hard disk drives from Seagate Technology LLC, and more than one hundred 10 Gigabit Ethernet switches and 400 blade processors from Marvell Technology Group Ltd.

The project team selected the high-throughput switches to handle the storage I/O anticipated by researchers, who believe that 350GB of video will be processed during every 12-hour analytical run.
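As a rough illustration of that figure, the sketch below converts 350GB per 12-hour run into an average data rate. It assumes decimal units (1GB = 10^9 bytes) and a perfectly even load across the run; neither assumption comes from the project team, and the result says nothing about peak demand.

```python
# Back-of-envelope: average throughput implied by 350GB per 12-hour run.
# Assumptions (not from the project team): decimal units, evenly spread load.
GB = 10**9


def average_throughput(bytes_processed: int, hours: float) -> tuple[float, float]:
    """Return (megabytes per second, megabits per second)."""
    seconds = hours * 3600
    bytes_per_second = bytes_processed / seconds
    return bytes_per_second / 10**6, bytes_per_second * 8 / 10**6


mb_per_s, mbit_per_s = average_throughput(350 * GB, 12)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.1f} Mbit/s")
```

This is only an average over the whole run; bursts within a run could be far higher.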

Cost-conscious approach

To meet the project's performance requirements, 150-drive stripes will be created using the native virtualization capabilities of Bell's Z-SAN. RAID 10 mirrors of the raw video data, transform data and metadata files will protect against data loss.

"Our approach allows us to eliminate a lot of cost by using high-volume, commonly available systems," said Jeff Greenberg, senior director of product marketing at Irvine, Calif.-based Zetera Corp., the firm that's designing the SAN.

The project has been amassing several terabytes of audio and video per week, capturing early-childhood learning and socialization in order to model human language acquisition, Roy said.

"If you take all parallel tracks of data over three years, you'll have 400,000 hours of video and audio data," he said.
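The figures quoted in the article can be cross-checked with simple arithmetic. The sketch below assumes that "parallel tracks" means one track per microphone and per camera (14 + 11 = 25), that a year has 365 days and 52 weeks, and that the 1.4 petabytes accumulate evenly over the three years. These are reading assumptions, not statements from Roy.

```python
# Rough consistency check of the article's figures.
# Assumptions: one track per microphone and per camera, 365-day years,
# 52-week years, and an even recording rate across the project.
MICROPHONES = 14
CAMERAS = 11
TRACKS = MICROPHONES + CAMERAS

TOTAL_HOURS = 400_000      # hours of audio and video across all tracks
PROJECT_YEARS = 3
TOTAL_TB = 1_400           # 1.4 petabytes expressed in terabytes

# Hours each track would need to record per day to reach the total.
hours_per_track = TOTAL_HOURS / TRACKS
hours_per_track_per_day = hours_per_track / (PROJECT_YEARS * 365)

# Average weekly data volume implied by the 1.4-petabyte estimate.
tb_per_week = TOTAL_TB / (PROJECT_YEARS * 52)

print(f"{TRACKS} tracks, {hours_per_track:,.0f} hours per track")
print(f"about {hours_per_track_per_day:.1f} recording hours per track per day")
print(f"about {tb_per_week:.2f} TB per week on average")
```

Under these assumptions each track records for most of the waking day, and the average weekly volume is in line with the "several terabytes" per week cited above.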

Roy said engineers at the university have also built an application that can home in on video and audio streams that involve his 9-month-old baby's development while avoiding video playback of empty rooms or footage of mundane tasks such as making coffee.