EMC is using a mixture of traditional databases and newfangled NoSQL data stores to analyze public perception of the company and its products, explained Subramanian Kartik, distinguished EMC engineer, during one talk.
The process, called sentiment analysis, involves scanning hundreds of technology blogs, finding mentions of EMC and its products, and assessing if the references are positive or negative, using words in the text.
To execute the analysis, EMC gathers the full text of all the blog and Web pages mentioning EMC, and compiles them into a version of MapReduce running on its Greenplum data analysis platform. It then uses Hadoop to weed out the Web markup code and non-essential words, which slims the data set considerably. It then passes the word lists into SQL-based databases, where a more thorough quantitative analysis is done.
The NoSQL technologies are useful in summarizing a huge data set, while SQL can then be used for a more detailed analysis, Kartik said, adding that this hybrid approach can be applied to many other areas of analysis as well.
"There is all sorts of information out there, and at some point you will have to go through tokenizing, parsing and natural language processing. The way to get to any meaningful quantitative measures of this data is to put it in an environment you know can manipulate it well, in a SQL environment," Kartik said.