Conquering Big Data with stream computing

10.04.2012
There is big data, and then there is mind-bogglingly enormous data; the latter is the scale on which Mahmoud Mahmoud has focused his research for the last three years. He says his work will bring a "paradigm shift" in the way businesses use big data in the future.

The AUT University computer scientist has been teaching on and off for the better part of a decade, and is currently finishing his doctorate. He came to New Zealand in 1994 from Kuwait, where he was raised and educated.

Mahmoud started his career as a graphic designer, but followed a childhood passion for computers to his current position.

"I have always been a computer geek, even as a little child I remember while the other kids were doing their reports on ancient Egypt using colouring pencils and paper, I did mine using a word processor on my computer," recalls Mahmoud.

Since 2009, Mahmoud and his team at AUT's Institute of Radio Astronomy and Space Research (IRASR) have been working on ways to glean useful information from the enormous quantity of data that is produced by mega-science projects like the Square Kilometre Array (SKA).

IRASR was a part of the joint bid with Australia's Commonwealth Scientific and Industrial Research Organisation to build the SKA radio telescope project in the Australia-New Zealand region.

Mahmoud's research led to a paper published in late 2011 on the use of stream-computing to analyse data as it is produced, instead of storing it to be mined later. He explains that stream-computing is much like putting your finger in the air to gauge which way the wind is blowing: it is quick and relatively effective.

"With stream-computing, rather than storing the data we store the queries we want to apply to it. We probe the data with questions using the queries, and are given real-time answers as it comes by," says Mahmoud.

"The idea is you don't need to wait until there is downtime to process the information. You can immediately elicit out of the stream the relevant information without needing to store it.

"When you consider that 99 percent of the data collected is likely to be nothing but noise, this saves a lot of time, and money wasted on storage."
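The idea of storing the queries rather than the data can be pictured with a small Python sketch. It is only an illustration and not the system Mahmoud's team built: the query names, the thresholds and the sample readings are all hypothetical. Each reading passes through every stored query once, matches are reported immediately, and nothing else is kept.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# A stored ("standing") query is just a predicate held in memory.
Query = Callable[[float], bool]


def run_standing_queries(
    stream: Iterable[float], queries: Dict[str, Query]
) -> Iterator[Tuple[str, float]]:
    """Yield (query_name, sample) for every sample that matches a stored query."""
    for sample in stream:
        for name, matches in queries.items():
            if matches(sample):
                yield name, sample
        # The sample is not retained: once checked, it is gone.


# Hypothetical queries and readings, chosen only for illustration.
queries: Dict[str, Query] = {
    "strong_signal": lambda x: x > 5.0,
    "negative_spike": lambda x: x < -5.0,
}

readings = iter([0.1, -0.3, 7.2, 0.0, -6.5, 0.4])  # stands in for a live feed
hits = list(run_standing_queries(readings, queries))
print(hits)
```

In this toy run four of the six readings match no query and are simply discarded, which is the saving on storage that Mahmoud describes.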

Mahmoud says stream-computing could be valuable for businesses looking to use the large amounts of data created by various online sources to make better business decisions.

"Businesses are in the age of information overload. You have stock prices, market data, Twitter, Facebook, SMS, blogs -- all the information just coming out of your ears and being wasted," says Mahmoud.

"Each one of those points of information can lead to better forecasting and decision making when harnessed correctly, but it's also important it is collected in a reasonable amount of time to give businesses agility and a competitive edge."

One example is the financial sector, where banks and other financial institutions constantly monitor market data for the latest trends. Mahmoud says stream-computing would enable those businesses to make better decisions on the fly, without needing to wait several hours for the information to be compiled and analysed.
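One way to picture on-the-fly analysis of market data is a rolling average that is updated as each price arrives, instead of being computed hours later from a stored batch. The Python sketch below is an assumption-laden illustration, not a description of any bank's system or of Mahmoud's research code: the function name, the window size and the prices are all made up.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_mean(
    prices: Iterable[float], window: int
) -> Iterator[Tuple[float, float]]:
    """Yield (price, mean of the last `window` prices) as each price arrives."""
    recent: Deque[float] = deque(maxlen=window)  # only the window is held in memory
    total = 0.0
    for price in prices:
        if len(recent) == window:
            total -= recent[0]  # this price is about to fall out of the window
        recent.append(price)
        total += price
        yield price, total / len(recent)


# Hypothetical price ticks, for illustration only.
ticks = [10.0, 11.0, 12.0, 13.0]
results = list(rolling_mean(ticks, window=3))
for price, mean in results:
    print(f"price={price:.2f} rolling_mean={mean:.2f}")
```

Memory use depends on the window size, not on how long the stream has been running, so an answer is available the moment each tick arrives.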

He says projects like SKA will help speed up research and development in stream-computing by bringing in business interest, but only if the infrastructure is interoperable with what is currently used in enterprise.

Mahmoud's research uses IBM's InfoSphere Streams technology as its parallelisation middleware to manage CPU usage and to hold queries. He says other stream-computing infrastructure, like that used by CERN for its Large Hadron Collider research, uses highly customised components which would be difficult to replicate for business use.

"At CERN they use a protocol called White Rabbit to query their data. This is a very comprehensive system, but it's not interoperable with other protocols," says Mahmoud.

"They manufacture everything right down to layer one, it needs special hardware and routers which couldn't be used by most modern businesses."

He says further research could help realise the "Holy Grail" of cloud computing, which is contextual search and answers.

"At the end of the day people should be able to use language specific to their domain or expertise, whether you are in the medical field or financial field or whatever, to ask questions to this stream, and it will give you a contextually aware answer.

"This is the Holy Grail for cloud computing, and while we are not there yet our research is heading in that direction."

http://cio.co.nz/cio.nsf/spot/conquering-big-data-by-stream-computing