
THE RISE OF BIG DATA SHOWS THE VALUE OF UNSTRUCTURED DATA

UNSTRUCTURED DATA

Didier Bourigault works to find new solutions to address Big Data search and development



Didier Bourigault, R&D Director at Synomia and researcher at CNRS, explains why he thinks unstructured data analysis is such an important issue for Big Data analytics.

 

 

 

Structured data vs unstructured data

All the possible values of structured data are set and known in advance. For instance, in a database containing the results of an opinion poll, the age and the gender of respondents are structured data since age groups and gender are set beforehand.

Responses to open-ended questions are unstructured data because these responses are potentially all different and cannot be assigned to categories set in advance. In an email client database, the author and the date are structured data, whereas the body of the message is unstructured data.
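As a rough illustration of the email example, here is a minimal Python sketch. The `Email` class, its field names, and the sample messages are all invented for illustration. It shows that structured fields can be counted directly, whereas message bodies are all different and give nothing to group on.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    """Hypothetical record layout, assumed for this illustration only."""
    author: str    # structured: drawn from a known set of senders
    sent_on: date  # structured: a well-defined date value
    body: str      # unstructured: free text, potentially different every time


# Invented sample data.
emails = [
    Email("alice@example.com", date(2013, 1, 7), "The delivery arrived late again."),
    Email("bob@example.com", date(2013, 1, 7), "Thanks, the new invoice layout is much clearer!"),
    Email("alice@example.com", date(2013, 1, 8), "Could you resend the contract?"),
]

# Structured fields aggregate directly into meaningful groups.
emails_per_author = Counter(e.author for e in emails)

# Grouping on the raw body yields one group per message: no summary emerges
# without some analysis of the text itself.
distinct_bodies = Counter(e.body for e in emails)

print(emails_per_author)
print(len(distinct_bodies), "distinct bodies out of", len(emails), "emails")
```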

 

In general, unstructured data is text data.


 

 

The issue of Big Data reopens the issue of unstructured data processing


The automatic processing of unstructured data has been an R&D and industrial problem since the early days of computing. Its historical application is automatic translation.

The research areas that focus on the automatic processing of unstructured data are Natural Language Processing, Artificial Intelligence, and Text Mining.

Currently, the most popular designation in the industry that deals with unstructured data is “Text Analytics” (see Who's Who in Text Analytics published by Gartner in September 2012).

Unstructured data is not exploited because the Text Analytics technologies offered by companies are not sufficiently advanced


Currently, the main features of language processing in industrial tools are: 

  • named entity recognition
  • extraction of terms (noun phrases)
  • sentiment analysis
  • extraction of relations (subject-verb-object triples)
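To give a feel for the first feature in the list, here is a deliberately naive Python sketch of candidate entity spotting. It is not how any industrial tool works: the rule (a run of capitalized words is a candidate entity), the function name `candidate_entities`, and the sample sentence are assumptions made for illustration.

```python
import re

# Toy rule, assumed for illustration: one or more consecutive capitalized
# words form a candidate named entity.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_entities(text: str) -> list[str]:
    """Return every run of capitalized words found in the text."""
    return CAPITALIZED_RUN.findall(text)


sentence = "The team met Didier Bourigault at Synomia."
print(candidate_entities(sentence))
# → ['The', 'Didier Bourigault', 'Synomia']
```

The sentence-initial "The" is a false positive, and an all-caps name such as CNRS would be missed entirely. Even on one short sentence, a simple surface rule gives an imprecise and incomplete picture of the content.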

Two factors explain why unstructured data is not easily exploitable:

  • The quality of linguistic analyses performed by the tools available on the market is poor. They are far from able to characterize the content of the texts under analysis in a meaningful, precise, and comprehensive way. They only reveal a small part of the contents of the dataset.

  • To make up for these shortcomings, the promoters of these tools provide platforms with features that let the user customize lexicons and extraction rules to fit the corpus, the line of business, and the objectives of the analysis. But the cost is very high in terms of skills: it takes skills similar to those of a Natural Language Processing (NLP) developer, which are not accessible to consultants. It also requires a lot of time, since the user must discover the relevant linguistic specificities of their corpus. This option to customize those tools is presented by the industry as an advantage, but it is actually a hindrance to the exploitation of raw data.
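The customization burden described above can be sketched with a toy lexicon-based sentiment scorer in Python. Everything here is hypothetical: the lexicon contents, the `score` function, and the car-review example are invented for illustration and do not reflect any vendor's platform.

```python
import re

# Hypothetical general-purpose lexicon shipped with a tool.
GENERIC_LEXICON = {"good": 1, "great": 1, "bad": -1, "slow": -1}

# Hypothetical entries a user adds after studying a corpus of car reviews,
# where "unpredictable" (neutral or even positive elsewhere) signals a defect.
CAR_REVIEW_OVERRIDES = {"unpredictable": -1}


def score(text: str, lexicon: dict[str, int]) -> int:
    """Sum the polarity of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


# The customized lexicon: user overrides take precedence over generic entries.
car_lexicon = {**GENERIC_LEXICON, **CAR_REVIEW_OVERRIDES}

review = "The brakes are unpredictable"
print(score(review, GENERIC_LEXICON))  # → 0
print(score(review, car_lexicon))      # → -1
```

The code change is trivial. The expensive part is knowing that "unpredictable" matters in this corpus, and that knowledge has to come from the user reading the texts.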

 

Consequently, unstructured data is largely untapped, except in large companies that have the means to train users in industrial Text Analytics platforms.

 

 

 


Synomia
63 bis rue de Sèvres
92100 Boulogne-Billancourt

Phone: +33 (0)1 46 10 06 40
