Didier Bourigault, R&D Director at Synomia and researcher at CNRS,
explains why he thinks unstructured data analysis is such an important
issue for Big Data analytics.
All the possible values of structured data are set and known in advance. For instance, in a database containing the results of an opinion poll, the age and the gender of respondents are structured data since age groups and gender are set beforehand.
Responses to open-ended questions are unstructured data because each response is potentially different from all the others and cannot be assigned to a category defined in advance. In an email client database, the author and the date are structured data, whereas the body of the message is unstructured data.
In general, unstructured data is text data.
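The distinction can be illustrated with a minimal sketch of the opinion poll example. The class names, age brackets, and sample comments below are hypothetical and chosen only for illustration. The structured field takes its value from a closed set fixed before the poll, so counting by value is meaningful. The free-text field accepts anything, so the same count typically produces one bucket per response.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class AgeGroup(Enum):
    # Hypothetical age brackets, fixed before the poll is run.
    YOUNG = "18-34"
    MIDDLE = "35-54"
    SENIOR = "55+"


@dataclass(frozen=True)
class PollResponse:
    age_group: AgeGroup  # structured: value comes from a closed, known set
    comment: str         # unstructured: any text the respondent types


# Invented sample data, for illustration only.
responses = [
    PollResponse(AgeGroup.YOUNG, "Delivery was late, but support was helpful."),
    PollResponse(AgeGroup.YOUNG, "Too expensive for what you get."),
    PollResponse(AgeGroup.SENIOR, "I could not find the cancel button."),
]

# Structured field: values repeat, so grouping by value summarizes the data.
age_counts = Counter(r.age_group for r in responses)

# Unstructured field: grouping by exact value gives one bucket per response.
comment_counts = Counter(r.comment for r in responses)

print(age_counts[AgeGroup.YOUNG])
print(len(comment_counts) == len(responses))
```

Summarizing the comments in any useful way requires analyzing the language itself, which is the problem the rest of this article discusses.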
The automatic processing of unstructured data has been an R&D and industrial problem since the early days of computing. Its historical application is automatic translation.
The research areas that address the automatic processing of unstructured data are Natural Language Processing, Artificial Intelligence, and Text Mining.
Currently, the most popular industry term for work on unstructured data is “Text Analytics” (see Who's Who in Text Analytics, published by Gartner in September 2012).
Currently, the main features of language processing in industrial tools are:
Two factors explain why unstructured data is not easily exploitable:
The quality of the linguistic analyses performed by the tools available on the market is poor. These tools are far from being able to characterize the content of the texts under analysis in a meaningful, precise, and comprehensive way. They reveal only a small part of the contents of the dataset.
Consequently, unstructured data is largely untapped, except in large companies that have the means to train users on industrial Text Analytics platforms.