Data product for unstructured data

Extracting value from unstructured data requires more than accurate, purpose-built solutions. To turn this data into a strategic asset and manage it sustainably at scale, its integration with the existing ecosystem and its lifecycle management must be designed deliberately.

Industries:

Finance & Insurance - Retail & FMCG - Transportation - Energy & Utility - Life Science - Industrial

Solutions:

Data Products

Technologies:

Blindata - Open Data Mesh

Background

Unstructured data is a valuable resource that has long remained underexplored. In the past, the challenges of automating processes without compromising on quality and cost limited its potential. Today, thanks to artificial intelligence (AI), these barriers have been overcome. However, AI alone is not sufficient; it requires the support of effective knowledge management to provide appropriate context, preventing generalist models from veering into inaccuracies or trivialities, which can lead to superficial and ineffective solutions. Without this support, even the best technology risks becoming a blunt instrument.

How many companies are truly managing to fully exploit this opportunity?

The flexibility and immediacy of recent tools have led to an increasing number of Proofs of Concept (PoCs), which seem to pave the way for innovation. However, once the proposed approach is approved and initial results are obtained, it is important not to confuse the agility of building a prototype with the effort of integrating the final solution into the existing ecosystem in line with the organization's overall requirements and strategy. Moving from a promising idea to a scalable and well-integrated solution requires a broader vision and an attention to context that go beyond the initial phase.

Even the most promising solutions, if not managed correctly, can lead to an uncontrolled proliferation of tactical approaches that escape governance and become shadow IT. This not only undermines long-term maintainability, evolution, and security but also fuels technical debt with isolated solutions that are difficult to manage and destined to collapse like houses of cards.

It is evident that the current context is rich in opportunities but also fraught with pitfalls. Exploiting the potential of unstructured data requires a mature vision capable of finding the right balance: innovating without complicating, preserving flexibility and sustainability.

Challenges

Although unstructured data is now more accessible, the challenges are far from overcome. The key elements to address when implementing reliable and sustainable solutions are:

Effective integration with the existing ecosystem
Minimizing direct dependence on external providers
Modularity and maintainability of the solution
Compliance with security and privacy requirements
Integration with test and release management processes
Observability
Quality of inputs and outputs
Agile configuration management
Enforcement of governance policies
Interoperability with other assets
Ability to evolve quickly as requirements and functionality change
Transparency through a clear definition of responsibilities, purpose, functionality, and how the service is consumed

These aspects can be addressed with tailored solutions; however, where possible, it is essential to adopt a strategic approach that leverages tools and practices already established in the world of structured data, enabling integrated management of structured and unstructured data and converging towards a unified vision.

Solution

The data product-based approach has enabled the management of structured data as a corporate asset, made available for multiple use cases, promoting greater business agility and data democratization.

Applying this approach to unstructured data also provides an effective solution by organizing assets within an architecture characterized by strong decoupling, adaptability, and effective division of responsibilities.

Data products that work with unstructured data retain all the classic characteristics of data products and add the ability to handle this new type of data.

In particular, the basic anatomy of this type of data product includes at least:

• an input port for consuming unstructured data
• an output port for exposing structured data after processing

One example is a data product that processes billing documents in PDF format and extracts header information from an invoice, such as VAT number, amount, and date.
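As a minimal sketch of this anatomy, the Python snippet below treats already-extracted invoice text as the unstructured input and returns a structured header record as the output. The `InvoiceHeader` type, the `extract_invoice_header` function, and the regular expressions are illustrative assumptions, not part of any specific platform; extracting text from the PDF is assumed to have happened upstream.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceHeader:
    """Structured record exposed on the (hypothetical) output port."""
    vat_number: Optional[str]
    amount: Optional[float]
    date: Optional[str]


# Illustrative patterns only: real invoices need locale-aware, more robust parsing.
VAT_RE = re.compile(r"VAT\s*(?:No\.?|number)?\s*[:#]?\s*([A-Z]{2}\d{8,12})", re.IGNORECASE)
AMOUNT_RE = re.compile(r"Total\s*:?\s*(\d+(?:\.\d{2})?)", re.IGNORECASE)
DATE_RE = re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_invoice_header(text: str) -> InvoiceHeader:
    """Input: raw invoice text (unstructured). Output: structured header fields."""
    vat = VAT_RE.search(text)
    amount = AMOUNT_RE.search(text)
    date = DATE_RE.search(text)
    return InvoiceHeader(
        vat_number=vat.group(1) if vat else None,
        amount=float(amount.group(1)) if amount else None,
        date=date.group(1) if date else None,
    )


sample = "Invoice\nDate: 2024-03-15\nVAT number: IT12345678901\nTotal: 1250.00"
header = extract_invoice_header(sample)
```

Fields that cannot be found are returned as `None` rather than guessed, so that downstream consumers can apply their own quality checks on the output.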

The data product approach enables the utilization of multiple heterogeneous input and output ports, enhancing flexibility and the ability to handle unstructured data. This methodology allows for:
• Consumption of structured data through various input ports, facilitating the retrieval of information such as user profiles.
• Production of unstructured data via different output ports, enabling the distribution of original files, their segments, or unstructured derivatives to consumers.

Naturally, both input and output ports involve interaction with other data products: the paradigm achieves its ultimate goal precisely through the reuse of assets. The way a data product is described does not vary with the type of data it operates on, which allows practices and tools to converge towards centralized and unified management of assets.

However, differences do exist; in fact, interaction with unstructured data requires specific tools and precautions. The lifecycle of data products is also governed and standardized through platform services. In addition to the classic services for managing the lifecycle of data products, to operate with unstructured data, specific services are necessary:

• Services dedicated to unstructured data, such as those for extracting text from images, segmenting parts, or calculating embeddings.
• Services for interaction with semantic elements, such as those for retrieving the definition of a concept or extracting portions of an ontology.

In both cases, consistent with the data product-based approach as originally conceived, the shared platform services absorb the complexity of integrating the necessary functionality and decouple data products from the underlying infrastructure.
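The decoupling described above can be sketched with interfaces that the data product depends on, while the platform supplies the implementations. Everything in this Python sketch is hypothetical: the `TextExtractor`, `Segmenter`, and `Embedder` protocols and their toy implementations stand in for real platform services and do not reflect any particular product's API.

```python
from typing import List, Protocol, Tuple


class TextExtractor(Protocol):
    def extract_text(self, content: bytes) -> str: ...


class Segmenter(Protocol):
    def segment(self, text: str) -> List[str]: ...


class Embedder(Protocol):
    def embed(self, segment: str) -> List[float]: ...


class Utf8TextExtractor:
    """Toy extractor: assumes the content is already UTF-8 text."""
    def extract_text(self, content: bytes) -> str:
        return content.decode("utf-8")


class ParagraphSegmenter:
    """Splits text on blank lines and drops empty segments."""
    def segment(self, text: str) -> List[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]


class LengthEmbedder:
    """Toy stand-in, NOT a real embedding model: character and word counts."""
    def embed(self, segment: str) -> List[float]:
        return [float(len(segment)), float(len(segment.split()))]


def process_document(
    content: bytes,
    extractor: TextExtractor,
    segmenter: Segmenter,
    embedder: Embedder,
) -> List[Tuple[str, List[float]]]:
    """Data product logic: depends only on the interfaces, not on the infrastructure."""
    text = extractor.extract_text(content)
    return [(s, embedder.embed(s)) for s in segmenter.segment(text)]


result = process_document(
    b"First part.\n\nSecond part here.",
    Utf8TextExtractor(),
    ParagraphSegmenter(),
    LengthEmbedder(),
)
```

Because `process_document` only knows the interfaces, the platform could swap an OCR engine or an embedding model without any change to the data product's code.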

The services are invoked declaratively when the data product is defined, much as in the structured data world one specifies the use of a database table or an ETL job. The platform provides a set of interfaces through which data products and use cases can use additional tools, such as services specific to unstructured data or services for interacting with knowledge. This decoupling hides management and integration complexity and centralizes control.
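To make the declarative idea concrete, here is a minimal sketch in Python of a descriptor that lists heterogeneous ports and the platform services a data product requests, together with a check the platform might run at definition time. The field names, the service names in `PLATFORM_CATALOGUE`, and the `validate` function are assumptions made for illustration; this is not the Open Data Mesh descriptor format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Port:
    name: str
    kind: str  # "structured" or "unstructured"


@dataclass
class DataProductDescriptor:
    name: str
    input_ports: List[Port]
    output_ports: List[Port]
    platform_services: List[str] = field(default_factory=list)


# Hypothetical catalogue of services exposed by the platform.
PLATFORM_CATALOGUE = {"text-extraction", "segmentation", "embeddings", "ontology-lookup"}
VALID_KINDS = {"structured", "unstructured"}


def validate(descriptor: DataProductDescriptor) -> List[str]:
    """Return a list of problems; an empty list means the declaration is acceptable."""
    errors: List[str] = []
    if not descriptor.input_ports:
        errors.append("at least one input port is required")
    if not descriptor.output_ports:
        errors.append("at least one output port is required")
    for port in descriptor.input_ports + descriptor.output_ports:
        if port.kind not in VALID_KINDS:
            errors.append(f"invalid kind for port {port.name}: {port.kind}")
    for service in descriptor.platform_services:
        if service not in PLATFORM_CATALOGUE:
            errors.append(f"unknown platform service: {service}")
    return errors


invoice_product = DataProductDescriptor(
    name="invoice-headers",
    input_ports=[Port("invoice-pdfs", "unstructured"), Port("customer-profiles", "structured")],
    output_ports=[Port("invoice-header-table", "structured"), Port("invoice-originals", "unstructured")],
    platform_services=["text-extraction", "ontology-lookup"],
)
problems = validate(invoice_product)
```

The data product only names the services it needs; how they are provisioned and integrated stays on the platform side, which is where control is centralized.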

The data contract specifies references to concepts in the ontology, which holds the semantic elements that characterize the business context; these references are the point of contact between data products and knowledge. This connection allows data products to access semantics autonomously, yet in a regulated and centrally monitored way. It enables unstructured input data to be enriched with the contextual information needed to produce outputs.

Through platform services, the data product can interact with specific points of the ontology: for example, it can dynamically retrieve the concept of VAT number and then use it to search for instances in unstructured documents. Alternatively, it can extract portions of the ontology to enrich the context considered during processing, for example, to dynamically obtain the lexical structure of a document so that it can be processed at a fine-grained level.
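The VAT number example can be sketched as follows. This Python snippet assumes a hypothetical in-memory `OntologyService` whose concepts carry labels and a value pattern; the class names, the concept URI, and the pattern are invented for illustration and do not describe a real ontology or platform API.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    uri: str
    labels: List[str]    # business terms that refer to the concept
    value_pattern: str   # regex (without capture groups) describing valid values


class OntologyService:
    """Hypothetical platform service: looks up concepts by URI."""
    def __init__(self, concepts: Dict[str, Concept]) -> None:
        self._concepts = concepts

    def get_concept(self, uri: str) -> Concept:
        return self._concepts[uri]


def find_instances(concept: Concept, text: str) -> List[str]:
    """Search the text for values introduced by any of the concept's labels."""
    label_alternatives = "|".join(re.escape(label) for label in concept.labels)
    pattern = re.compile(
        rf"(?:{label_alternatives})\s*[:#]?\s*({concept.value_pattern})",
        re.IGNORECASE,
    )
    return pattern.findall(text)


ontology = OntologyService({
    "ex:VATNumber": Concept(
        uri="ex:VATNumber",
        labels=["VAT number", "Partita IVA"],
        value_pattern=r"[A-Z]{2}\d{11}",
    )
})

document = "Supplier VAT number: IT12345678901. Customer Partita IVA IT10987654321."
vat_concept = ontology.get_concept("ex:VATNumber")
matches = find_instances(vat_concept, document)
```

Since the labels and the value pattern come from the ontology rather than from the data product's code, a change to the business definition is picked up at the next lookup without redeploying the data product.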

The data products resulting from this approach can operate behind the scenes, processing data in batch or real-time modes, or support direct interaction with users and applications, for example, by providing an API input port to receive unstructured data. The value of this approach lies precisely in its flexibility and ability to respond to heterogeneous needs with an adaptable, modular model centered on collaboration and the reuse of resources and assets.

Benefits

Harmony

Convergence with the approach used for structured data

Organization

Separation of responsibilities

Autonomy

Rationalization and standardization of shared functionalities through platform services

Democratization

Transparency and clarity of data

Sustainability

Complexity limited to what is essential, avoiding accidental complexity

Synergy

Interoperability and reuse of components
