
Background

Over the decades that have marked the era of Information Technology, there has been a succession of technologies and approaches aimed at addressing diverse and innovative needs. In this recurring cycle, what has truly generated value for organizations has been the ability to collect, manage, and ultimately make intelligent use of the data at their disposal.

However, the types and formats of data to be processed can vary widely.

With Data Warehouses, for example, the goal is to provide a centralized repository that enables the extraction of useful information for decision-making processes. Regardless of the technologies and the different methods of implementing and modeling such tools, the common point is the use of data in its structured form.

Structured data
Data with a fixed schema, usually referred to as a “data model,” which must be designed before the data is used, specifying the attributes and the data types of those attributes. Thanks to this well-defined structure, the data is easily queryable using structured query languages, with SQL (Structured Query Language) being the primary example.
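
As a minimal illustration of this idea (the table and column names are invented for the example), the schema is declared up front and the data can then be queried with standard SQL:

```python
import sqlite3

# A fixed schema is defined before any data is loaded; invented example table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, 'ACME', 120.50, '2024-03-01')")

# The well-defined structure makes aggregation and filtering straightforward.
for row in conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
):
    print(row)  # ('ACME', 120.5)
```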

 

With the growth of computational capabilities on one hand and new possibilities in the production and consumption of information on the other, this initial type of data has given way to the era defined as “Big Data.” Data platforms, particularly those built around the concept of a Data Lake, have retained the goal of providing a centralized point for data analysis. However, the technologies used have changed, with the underlying idea being the need to distribute both data storage and the processing required to extract value, for reasons of scalability and reliability. In this scenario, a key aspect is the need to quickly collect any type of data produced, including in its semi-structured and unstructured forms.

Semi-structured data
Data that have a partial structure but do not follow a rigid schema as structured data does. They are primarily organized around semantic entities and use flexible formats that can easily be extended or modified; typical examples are XML and JSON. An advantage of semi-structured data is its greater readability, both for humans and for computer systems. This, however, comes at the cost of query and analysis methods that may perform worse than those available for structured data.
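
A short sketch of this flexibility, using invented field names: the second record adds and omits attributes freely, with no schema migration, and the missing pieces are handled at read time.

```python
import json

# Two records describing the same kind of entity: the second adds a field
# and omits another without any schema change. Field names are invented.
raw = """
[
  {"customer": "ACME", "email": "info@acme.example"},
  {"customer": "Globex", "phone": "+39 02 0000000", "tags": ["priority"]}
]
"""

for record in json.loads(raw):
    # Absent attributes are handled at read time rather than being
    # enforced by a predefined schema.
    print(record["customer"], record.get("email", "n/a"))
```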

Unstructured data
Data that do not fall into the other two categories: they do not adhere to a specific schema, or defining such a schema is particularly challenging. This category includes the information contained in textual documents, social media posts, or multimedia files such as images, audio, or video. These data are the closest to human natural language, but for the same reason they are more difficult for automated systems to interpret and process.

 

Regardless of the initial format, until a few years ago, it was necessary to convert data into a structured format to process and extract information from it.

Unstructured data, being further removed from automated processing methods, has always been less valued and was managed mostly by hand: the available tools were either inadequate or custom-built (machine learning models, computer vision, etc.) for specific datasets.

But is this statement still true in the era of Generative AI?

Challenges

There are several challenges in extracting value from unstructured data:

The presence of heterogeneous formats (text, images, video, audio, …) in unstructured data requires advanced processing and analysis techniques.

The lack of a standard structure makes this type of data more prone to noise, errors, and incomplete information, and it is therefore hard to define clear quality metrics of the kind that are easily established for structured data.

Its use also requires specific approaches to governance that take into account the difficulty of mapping and cataloging the content, while ensuring compliance with current regulations, especially regarding personal data that is often hidden or implicit within the content itself.

The process of transforming unstructured data into a usable structured form requires sophisticated tools and significant resources. OCR solutions that combine positional logic with computer vision models are commonly used, for example, but they often fail in contexts where document formats and types vary widely.

It is difficult to guarantee that the information extracted from unstructured data is interpreted correctly, that is, that it reflects the specific semantics defined in a particular context within an organization. Information produced at the corporate level often reflects a language specific to a particular business area, where knowledge of the context is implicit in the people who work in the organization. While for structured data it is relatively simple to define semantic reconciliation processes, for example through the use of conceptual models, this mapping is far more complex given the volume of information contained in unstructured data.

Solution

The new possibilities made available by artificial intelligence can significantly help in solving the problems of processing and managing unstructured data.

When we talk about artificial intelligence, however, we risk remaining at too high a level of generalization. What is commonly called artificial intelligence actually encompasses a considerable variety of disciplines and tools. As an initial way of organizing them, the following scheme can be used:

Machine Learning
Set of practices and algorithms with well-defined analysis and prediction objectives, aimed at “learning” from a set of data.

Deep Learning
Machine Learning algorithms with a complex, layered structure based on neural networks with a large number of parameters. They require advanced computing capabilities and larger, dedicated datasets.

Generative AI
Deep Learning techniques that, by learning the patterns present in the data seen during the training phase, are able to generate new information content, and therefore potentially a genuinely new asset (text, image, video) for the user of such tools.

Artificial General Intelligence
A theoretical concept referring to the ability of an algorithm to “understand” and act autonomously, adapting to the context.

In the transition from “traditional” Machine Learning to the use of new Generative AI technologies, of which Large Language Models constitute the main component, we have been able to witness a clear paradigm shift:

 

In the first case, one deals with models that are very different from one another, each optimized to perform a specific task. The primary activity of data scientists involves selecting the most appropriate model for the use case in question and building the highest quality dataset dedicated to that specific use case. In this context, structured data provided by data platforms is mainly used. Whether in CSV format, stored within a layer of the Data Lake, created from Excel files, or extracted from relational databases, the creation of training sets necessary to train and subsequently use Machine Learning models requires structured data.

In the second case, however, by using an LLM (Large Language Model) as a single model, it is possible to perform a wide variety of tasks. Simplifying, the primary purpose of such models is to make the most accurate prediction of the “next word” based on an input text. From this simple task, and through targeted training to produce coherent sentences aligned with human expectations, applications capable of perfectly simulating the understanding of natural language have been developed. At the core of these models remains the use of neural networks, which therefore require a training dataset.

So, is structured data also used in this case?

Actually, no, and this is precisely one of the main differences and paradigm shifts mentioned above. To build the datasets needed for these models, one can start from an entire textual document, such as a book, a web page, or an entire code repository, and simply “split” the text into prefixes of increasing length, where the correct “next word” is exactly the word that follows in the text. This is the foundation of the training process for Large Language Models, effectively turning a supervised task into a self-supervised one in which the labels come from the text itself.
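
A minimal sketch of this idea, tokenizing naively on whitespace (real pipelines use subword tokenizers and far larger corpora): each growing prefix of the text becomes an input, and the word that follows it becomes the label.

```python
# Naive whitespace tokenization: real LLM pipelines use subword tokenizers,
# but the principle of deriving labels from the text itself is the same.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Each growing prefix is an input; the next word in the text is its label.
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, next_word in training_pairs[:3]:
    print(context, "->", next_word)
# ['the'] -> quick
# ['the', 'quick'] -> brown
# ['the', 'quick', 'brown'] -> fox
```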

At this point, a natural question arises: where can a large volume of this type of data be obtained? Obviously, from the internet, using all the available text, while aiming to ensure the highest possible quality.

With these new technologies, processing data from various types of documents—especially in unstructured formats—has become increasingly significant. If this is true for the training needs of these models, it is equally true for the different usage methods and types of processing enabled by these tools.

ChatGPT was the first to demonstrate the power of Large Language Models (LLMs), thanks to their ability to understand and respond coherently to textual inputs. Initially used for recreational and exploratory purposes, with functionalities like translation, summarization, formatting, and natural language interpretation, it quickly led to the implementation of advanced features, such as support for code generation and interpretation (e.g., Copilot), tool calling, and the definition of agents, leveraging both text generation and integrated application components.

Thanks to these new capabilities and uses of the models, artificial intelligence is increasingly being adopted in enterprise contexts—not only for analytical and predictive purposes related to Machine Learning but also, and especially, to bring efficiency to various business processes through Generative AI.

Below, two of the most commonly employed use cases are described:

Named Entity Recognition
A technique for extracting structured data from unstructured text. Using an LLM in this area makes it possible to exploit the model’s text comprehension capabilities and its intrinsic linguistic knowledge to retrieve information from any type of text. A typical use case is document digitization. Consider an organization that receives a series of documents in non-digitized form and needs to extract certain information from them, whether invoices, goods transport documents, utility bills, … The combination of tools such as OCR and LLMs makes it possible to perform this task automatically with excellent levels of reliability and, above all, to process the documents and extract structured data regardless of the language or of the various document formats and templates.
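
A minimal sketch of this pattern, where `ocr_to_text` and `complete` are hypothetical placeholders for whichever OCR engine and LLM client are actually adopted, and the invoice field names are invented for the example:

```python
import json

def ocr_to_text(image_path: str) -> str:
    """Placeholder for the OCR engine in use."""
    raise NotImplementedError("plug in the OCR tool adopted by the organization")

def complete(prompt: str) -> str:
    """Placeholder for a call to the LLM provider in use."""
    raise NotImplementedError("plug in the LLM client adopted by the organization")

def extract_invoice_fields(image_path: str) -> dict:
    """Combine OCR and an LLM to turn a scanned document into structured data."""
    raw_text = ocr_to_text(image_path)
    prompt = (
        "Extract the following fields from the invoice text below and return "
        "them as a JSON object: supplier_name, invoice_number, total_amount, "
        "currency.\n\nText:\n" + raw_text
    )
    # The model reads the free text and returns the requested fields,
    # regardless of the document's language or template.
    return json.loads(complete(prompt))
```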

Retrieval Augmented Generation
A framework for using language models that enables them to perform searches on proprietary knowledge bases managed by organizations, thereby extending the models’ intrinsic knowledge with specific domain expertise. In this approach, when a user makes a request, the model can retrieve the most relevant and contextually appropriate information from the knowledge base (retrieve), incorporate it into the language model’s context (augment), and leverage it to generate a coherent response containing detailed information (generate).
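
A minimal sketch of the retrieve-augment-generate loop described above, where `embed`, `search_index`, and `complete` are hypothetical placeholders for the embedding model, the vector store, and the LLM client actually chosen:

```python
def embed(text: str) -> list[float]:
    """Placeholder for the embedding model in use."""
    raise NotImplementedError

def search_index(query_vector: list[float], top_k: int = 3) -> list[str]:
    """Placeholder for a similarity search over the organization's knowledge base."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Placeholder for the LLM provider in use."""
    raise NotImplementedError

def answer(question: str) -> str:
    # Retrieve: find the passages most relevant to the user's question.
    passages = search_index(embed(question))
    # Augment: place the retrieved passages in the model's context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(passages)
        + "\n\nQuestion: " + question
    )
    # Generate: produce a coherent response grounded in the retrieved content.
    return complete(prompt)
```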

 

In both of these examples, we can use unstructured data as input.

The first example enables the automatic collection of information from a series of documents that organizations have gathered over time, but for which metadata tagging and conversion into structured data had to be done manually, or at best, with heuristics strongly based on the structure of the documents themselves.

The second example, on the other hand, forms the basis of modern chatbots. It leverages the ability of LLMs to generate natural language text and maintain a high-quality conversation with users, allowing these chatbots to interact with the document base they need to answer questions accurately. Of course, this methodology can also be used with structured data, for example, by having the model generate a coherent SQL query based on the user’s request, thus retrieving data in a structured format. The real value, however, comes from combining both functionalities.
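
As a minimal sketch of the SQL-generation variant mentioned above (the schema description is invented, and `complete` is again a placeholder for the LLM client in use):

```python
def complete(prompt: str) -> str:
    """Placeholder for the LLM provider in use."""
    raise NotImplementedError

# Invented schema description: in practice this would come from the data catalog.
SCHEMA = "orders(order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)"

def request_to_sql(user_request: str) -> str:
    prompt = (
        f"Given the table {SCHEMA}, write a single SQL query that answers "
        f"the following request. Return only the SQL.\n\nRequest: {user_request}"
    )
    # The generated query can then be executed against the database to
    # retrieve data in structured form.
    return complete(prompt)
```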

In general, we can assert that unstructured data is becoming increasingly important: on one hand, new artificial intelligence tools make it easier to use and convert it into a structured form, which then fits into more traditional usage processes; on the other hand, this type of data is essential for both training these new tools and in use cases that bring real value to organizations.

As with all disciplines within Data Management, these new uses of artificial intelligence must be controlled and should fall within a defined Data Governance process at the organizational level.

Benefits

Data quality and governance
The quality and governance of unstructured data can be supported by techniques such as Named Entity Recognition, which make it possible to identify and structure relevant information automatically.
Simplified processing
Transforming unstructured data into structured data can be addressed with a combination of advanced OCR and large language models (LLMs). Pipelines that integrate OCR and Named Entity Recognition make it possible to handle documents with highly variable formats and layouts.
Context control
To address the difficulties of contextual interpretation, the Retrieval Augmented Generation framework can be leveraged. It allows models to integrate information extracted from proprietary, domain-specific knowledge sources, increasing the system’s ability to correctly understand and contextualize data.

