Big data the NASA way

30.10.2012

Mattmann explains that Nutch could originally only scale up to 100 million web pages, whereas the big search engines such as Yahoo and Google were in the four billion page range. So Cutting set about creating a new system, which he called Hadoop, allegedly after his childhood stuffed toy.

Inspired by Cutting's work and sense of humour, Mattmann has started his own project called Tika - a text analysis tool that detects and extracts metadata and structured text content from various documents using existing parser libraries. It is named after a soft toy belonging to the daughter of his partner in the project.

"Most of the open source work I do is through Apache, a lot of it has to do with the Apache licence being a very permissive licence," Mattman says. "It allows people downstream that leverage Apache based software to use that upstream open source component in arbitrary ways. It makes it so the software I build -- when we distribute it to customers, or others we collaborate with, we don't have to give them any surprises."

Mattmann says NASA has been an active user of open source software for around 15 years but only recently has it become active on the production side. For the past two years NASA has held open source summits, outlining its contribution to open source.

NASA categorises its data in different levels, and in the next generation earth science system satellite area where Mattmann works it is publically distributed via DAACs (Distributed Active Archive Centres). He says the programs and tools used to process data vary depending on the preferences of the scientists involved in the project. "A lot times the software itself is coupled to the instrument."