
BIG DATA REVEALS UNSTRUCTURED DATA'S VALUE

Unstructured Data



Didier Bourigault, Synomia’s R&D Director and a CNRS researcher, shares his view of why unstructured data analytics is of central importance within Big Data.

 


Structured Data vs Unstructured Data

The possible values of structured data are determined and understood upfront. For example, in a database of opinion survey results, a respondent's age or socio-professional category is structured data, because the possible options are defined beforehand. Responses to open-ended questions, on the other hand, are unstructured data: they are potentially all different and impossible to categorize in advance. In a database of customer emails, the sender name and the date are structured data, while the body of the message is unstructured data. In general, unstructured data is textual data.
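
As a minimal sketch of this distinction, the survey example could be modelled as follows. The class names, the category values, and the sample comments are all hypothetical and chosen only for illustration; they do not come from any real survey or from Synomia's tools.

```python
from dataclasses import dataclass
from enum import Enum


class AgeBand(Enum):
    # Hypothetical answer options, fixed before the survey is run.
    UNDER_25 = "under 25"
    FROM_25_TO_49 = "25-49"
    OVER_50 = "50 and over"


class Occupation(Enum):
    # Hypothetical socio-professional categories, also fixed upfront.
    EMPLOYEE = "employee"
    MANAGER = "manager"
    RETIRED = "retired"


@dataclass
class SurveyResponse:
    age_band: AgeBand        # structured: one of a known set of values
    occupation: Occupation   # structured: one of a known set of values
    comment: str             # unstructured: free text, any value possible


def count_by(responses, field):
    """Count responses per distinct value of the given field."""
    counts = {}
    for response in responses:
        key = getattr(response, field)
        counts[key] = counts.get(key, 0) + 1
    return counts


responses = [
    SurveyResponse(AgeBand.UNDER_25, Occupation.EMPLOYEE,
                   "Delivery was late but support was helpful."),
    SurveyResponse(AgeBand.UNDER_25, Occupation.MANAGER,
                   "Great product, I would just like a cheaper plan."),
    SurveyResponse(AgeBand.OVER_50, Occupation.RETIRED,
                   "The website is hard to use on my phone."),
]

# Structured field: responses fall into a few predefined groups.
age_counts = count_by(responses, "age_band")

# Unstructured field: every response is its own "category".
comment_counts = count_by(responses, "comment")
```

Grouping on `age_band` yields a small, meaningful table, whereas grouping on `comment` yields one group per response, which is why free text needs linguistic analysis before it can be aggregated.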


 

Figure: structured and unstructured data

 

 

The challenge of Big Data brings up that of unstructured data processing


Automatic processing of unstructured data has been an industrial and R&D challenge since the very beginning of IT. Its historic application is machine translation.

The research fields concerned with the automatic processing of unstructured data are Natural Language Processing, Computational Linguistics, Artificial Intelligence, and Text Mining.

The currently fashionable name for unstructured data processing in the industrial sector is “Text Analytics” (cf. Who's Who in Text Analytics, Gartner 2017).

Unstructured data has been explored very little, because the technology offered by Text Analytics companies is not yet mature enough to process it.


The main linguistic processing tasks currently offered by industrial tools are:

  • named-entity recognition
  • terminology extraction (noun phrases)
  • sentiment analysis
  • relationship extraction (subject-verb-object triples)
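
To make two of these tasks concrete, here is a deliberately naive sketch of entity spotting and sentiment analysis using only the Python standard library. It is an illustration under strong assumptions, not how industrial tools work: the word lists, function names, and sample sentence are hypothetical, and real systems rely on linguistic or statistical models rather than capitalization and keyword counts.

```python
import re

# Hypothetical, tiny sentiment word lists (assumption for illustration).
POSITIVE = {"great", "helpful", "good", "fast"}
NEGATIVE = {"late", "slow", "broken", "hard"}

# Runs of capitalized words that do not start the text or a sentence.
# This is a crude stand-in for named-entity recognition.
CAPITALIZED_RUN = re.compile(
    r"(?<!^)(?<![.!?]\s)\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_sentiment(text):
    """Return (positive word count) minus (negative word count)."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def capitalized_candidates(text):
    """Return capitalized word runs as candidate named entities."""
    return CAPITALIZED_RUN.findall(text)


sample = ("The parcel from Acme Corp arrived late in Paris, "
          "but support was helpful.")

entities = capitalized_candidates(sample)   # ['Acme Corp', 'Paris']
polarity = lexicon_sentiment(sample)        # 0: one positive, one negative word
```

The sample sentence already shows the limits of such shortcuts: a score of 0 hides the fact that the comment contains both a complaint and a compliment, about two different things.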

Two elements explain why unstructured data is not easily exploitable:

  • The linguistic analysis performed by the text analytics tools on the market is of poor quality. It is far from able to characterize the content of the analyzed text meaningfully, precisely, and completely, and it reveals only a limited portion of that content: text mining is not yet an established science.
  • To fill these gaps, vendors of text mining tools offer platforms whose features let the user customize lexicons and extraction rules for each specific text corpus, domain of experience, and analysis objective. However, the cost is very high: it requires not only considerable skill (people well-versed in NLP, who are rarely available within consulting firms) but also time, as the user must personally uncover the relevant linguistic particularities of the text corpus. This customization is presented as an advantage, but it is in fact an obstacle to exploiting raw data.
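
The kind of customization described in the second point might look like the sketch below: a hand-written set of extraction rules for one corpus. The rule labels, the patterns, and the sample texts are hypothetical and assume a telecom customer-email corpus; no specific vendor platform is being reproduced here.

```python
import re

# Hypothetical extraction rules written by hand for a telecom corpus.
# Each label maps to a pattern that someone had to discover by reading
# the corpus, which is where the skill and time costs come from.
TELECOM_RULES = {
    "PLAN": re.compile(r"\b(?:unlimited|prepaid)\s+plan\b", re.IGNORECASE),
    "ISSUE": re.compile(r"\b(?:dropped\s+calls?|no\s+signal)\b", re.IGNORECASE),
}


def apply_rules(text, rules):
    """Return (label, matched text) pairs for every rule that fires."""
    found = []
    for label, pattern in rules.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0).lower()))
    return found


email_body = ("I pay for the Unlimited plan but get dropped calls "
              "and no signal at home.")
matches = apply_rules(email_body, TELECOM_RULES)

# The same rules find nothing in a corpus from another domain,
# so the customization work has to be redone for each new corpus.
other_domain = apply_rules(
    "The hotel room was noisy and the breakfast was cold.", TELECOM_RULES
)
```

Here `matches` contains one `PLAN` hit and two `ISSUE` hits, while `other_domain` is empty, which illustrates why per-corpus customization is costly rather than a convenience.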

Consequently, unstructured data is almost never exploited. Only large corporations with the means to train users on industrial Text Analytics platforms can take advantage of this data.
