Data Contract

A fundamental ingredient for governing the evolution and diffusion of data that impacts tools, practices, and operational models.

Industries:

Finance & Insurance - Retail & FMCG - Energy & Utility - Life Science - Transportation - Industrial

Solutions:

Data Strategy - Data Products - Data Governance - Data Platform

Technologies:

Blindata - Open Data Mesh - Confluent - Snowflake

Context

Data Contracts are agreements between data producers and data consumers aimed at regulating both the data and their exchange. They provide transparency on data dependencies and usage within an information architecture and help ensure consistency, clarity, reliability, and quality.

A data contract makes it possible to answer fundamental questions about the data, such as:

• How often are these KPIs updated?
• Are these data reliable?
• Where and how can I access this information?
• What is the expected usage for this data?
• Who has permission to view this value?
• What does this attribute mean?

[Figure: critical points of Data Contracts]

Every data contract must meet the following characteristics:

• Addressable: It needs to be easily identifiable for usability
• Expressive: Self-descriptive, defined clearly and unequivocally
• Scoped: With a clear scope that suits its purpose, to prevent misuse, ambiguity, or excessive cognitive load on those responsible
• Stable: Unchanged for reasonable time intervals; the measures for managing any changes are also part of the data contract
• Reliable: It must be respected by all involved parties
• Computable: Ideally, it should be processable by automated services. In practice, contracts typically evolve gradually from human-readable documentation into data contracts as code that machines can process

Data contracts are in the spotlight because they sit at the intersection of three key trends in data management: Data Centricity, Data as a Product, and Data Fabric:

• According to the data-centric manifesto, applications should not be at the center of architecture; data should be. While applications generate and consume data, they shouldn’t treat it as a byproduct but should be responsible for it, and contracts are a good way to formalize this responsibility.

• Focusing on the mesh principle of Data as a Product, it’s crucial for developed assets to have clear interfaces to their consumers formally specifying scope, provided services, and how to access them: this is exactly what data contracts are designed for.

• The Data Fabric paradigm emphasizes automation and reproducibility in data management and heavily relies on metadata collection and activation: data contracts serve as a primary and proactive source of metadata.

In databases and data warehouses, schema-based data management has been the norm for decades: data undergo rigorous analysis, are divided by domain, modeled, and documented. However, their behavior is typically governed by local constraints that do not consider how the data are disseminated and how they evolve, which limits flexibility and adaptability. Long chains of dependencies are often created without guarantees, such as rules for introducing changes or a description of the information exposed by an artifact. As a result, a single schema modification can unexpectedly bring down an entire data pipeline.
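As a rough illustration of the kind of guarantee that is missing, the sketch below compares two versions of a schema and lists the changes that could break downstream consumers. Modelling a schema as a mapping from column name to type name, the function name `breaking_changes`, and the rule that added columns are safe are all assumptions made for this example, not part of any specific tool.

```python
from typing import Dict, List

# Assumption: a schema is modelled as a plain mapping from column name to type name.
Schema = Dict[str, str]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the changes in `new` that could break consumers built against `old`."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} ({old_type} -> {new[column]})")
    # Columns that exist only in `new` are treated as additive, hence not reported.
    return problems


# Hypothetical schema versions used only for illustration.
v1 = {"order_id": "string", "amount": "decimal", "created_at": "timestamp"}
v2 = {"order_id": "string", "amount": "float", "channel": "string"}

for problem in breaking_changes(v1, v2):
    print(problem)
```

A check of this kind could run before a schema change is released, so that the producer learns about the impact before the pipeline does.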

Data contracts are often misunderstood and reduced to simple schemas, that is, to the physical representation of data. Schemas are the basis of a data contract and its primary aspect, but they do not capture its entire essence. The fundamental elements of a data contract include:

• API: to describe how to access and consume exposed data. A robust API provides details on service location, supported communication protocols, authentication methods, available endpoints, and data exchange schemas.

• Constraints: to define engagement rules, describing how data is exchanged and how it should be used. They establish the perimeter within which the data contract operates.

• Semantics: to clarify the meaning of exchanged data. This aspect is crucial; however, its formalization is still an ongoing challenge.
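The three elements above can be pictured as fields of a single contract object. The following is a minimal sketch, assuming a simple in-memory representation; the class names, field names, and example values are hypothetical and do not follow any particular specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApiSpec:
    """How to access and consume the exposed data."""
    location: str            # where the service can be reached
    protocol: str            # supported communication protocol
    authentication: str      # authentication method
    schema: Dict[str, str]   # data exchange schema: field name -> type name


@dataclass
class DataContract:
    name: str
    api: ApiSpec
    # Engagement rules, kept as free text in this sketch.
    constraints: List[str] = field(default_factory=list)
    # Business meaning of each schema field: field name -> description.
    semantics: Dict[str, str] = field(default_factory=dict)

    def undocumented_fields(self) -> List[str]:
        """Schema fields that have no meaning attached yet."""
        return sorted(f for f in self.api.schema if f not in self.semantics)


# Hypothetical contract used only for illustration.
contract = DataContract(
    name="orders",
    api=ApiSpec(
        location="https://example.com/orders",
        protocol="https",
        authentication="oauth2",
        schema={"order_id": "string", "created_at": "timestamp"},
    ),
    constraints=["refreshed daily", "internal analytics use only"],
    semantics={"order_id": "Unique identifier of a customer order"},
)

print(contract.undocumented_fields())
```

Keeping semantics next to the schema makes gaps visible: here `created_at` is exposed through the API but its meaning has not been documented.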

Challenges

The adoption of data contracts brings along a series of both technological and organizational challenges.

However, managing the technological aspects is just the tip of the iceberg: the main complexities relate to the operating model and the mindset required to implement data contracts. To integrate them effectively, roles, interactions, and processes must evolve towards a culture focused on innovation, collaboration, and continuous learning.


Technological Challenges

Finding effective representations for data contracts, together with tools capable of processing and managing them, is key to ensuring operational efficiency and to implementing automation and tracking. The market offers a proliferation of standards and tools that try to bridge this gap, but there are no mature solutions yet, and none stands out above the others. However, it’s only a matter of time before competition and community enthusiasm enable us to overcome these challenges.

Organizational Challenges

The adoption of data contracts requires shifting ownership towards the data producers, who must proactively define the data contracts they expose to their consumers and adhere to them. Data producers encompass both primary sources of information and downstream applications in the data pipeline that generate derived data: in both cases, the owners of these applications need to be accountable for the data they expose and circulate. This transformation of the operating model requires strong internal support and the ability to engage and motivate stakeholders culturally. Another fundamental challenge is negotiating data contracts, which in turn involves several hurdles:

• Diverse perspectives: Communication is fundamental.
• Ambiguity: Clarity is important.
• Risks/Benefits: Assessing the benefits of data access versus risks.
• Cognitive load: Simplifying complex concepts.
• Alignment: Ensuring everyone is on the same page.

Solutions

The importance of a common standard

One of the key aspects to address technological challenges is the establishment of standards upon which data contracts can be based. Here are the main reasons:

• Simplifies Understanding: Reduces cognitive load and allows the team to focus on priorities
• Enables Enforcement of Best Practices: Allows sharing of experience, avoiding the need to reinvent the wheel each time
• Facilitates Automation: Organizes permissible content and its format to support automated processing services
• Enhances Collaboration: Provides a common foundation, promoting teamwork
• Increases Interoperability and Scalability: Makes assets more compatible and processes replicable

Standardization can be achieved through the definition of a specification, whose key characteristics should include:

• Technologically Agnostic: Compatible with any technological stack
• Incrementally Adoptable: To facilitate diffusion without insurmountable initial barriers
• Declarative and Modular: Makes usage agile
• Extensible: For example, allowing custom properties and integration with external standards

At Quantyca, we have launched the Open Data Mesh Initiative and open-sourced our internal specification for describing data products, the Data Product Descriptor Specification (DPDS).

Within the DPDS, the broader concept of a service agreement is utilized: data contracts are a specific type of service agreement. Indeed, on one side, there can be services that do not expose data, and on the other side, there can be agreements between parties that are not formalized with a contract. The DPDS formally describes the components of a data product’s interface, including:

• Intent: the purpose of the interface
• Expectations: how it should be used by consumers
• Contract: regulating the behavior of both the data product itself and its consumers

Data Product Entity

Data Platform Support

Managing the lifecycle of data contracts requires two processes: validation upon creation and continuous monitoring. Validation acts as an admission check: if a data contract does not meet the requirements, it is rejected and reported to the responsible parties. This measure ensures that the platform only admits valid data contracts.
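The validation step can be sketched as a gate in front of the contract registry. This is a minimal illustration under stated assumptions: contracts are plain dictionaries, the registry is an in-memory mapping, and the required keys (`name`, `owner`, `schema`) are hypothetical platform requirements chosen for the example.

```python
from typing import Any, Dict, List

# Hypothetical platform requirements; a real platform would define its own.
REQUIRED_KEYS = ("name", "owner", "schema")


class ContractRejected(Exception):
    """Raised when a contract fails validation at registration time."""


def validate_contract(contract: Dict[str, Any]) -> List[str]:
    """Return the list of validation errors (empty when the contract is valid)."""
    errors = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in contract]
    if "schema" in contract and not contract["schema"]:
        errors.append("schema must declare at least one field")
    return errors


def register(contract: Dict[str, Any], registry: Dict[str, Dict[str, Any]]) -> None:
    """Admit the contract into the registry only if it passes validation."""
    errors = validate_contract(contract)
    if errors:
        # The error text is what would be reported to the responsible parties.
        raise ContractRejected("; ".join(errors))
    registry[contract["name"]] = contract


registry: Dict[str, Dict[str, Any]] = {}
register({"name": "orders", "owner": "sales", "schema": {"order_id": "string"}}, registry)

try:
    register({"name": "shipments"}, registry)
except ContractRejected as rejection:
    print(f"rejected: {rejection}")
```

Because rejected contracts never reach the registry, everything downstream can assume that a registered contract is well formed.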

Data contracts, like living organisms, evolve over time as their lifecycle is intertwined with that of data and applications. Therefore, it is essential to monitor the validity of contracts even after their initial registration, signaling any divergences and applying corrective actions where necessary.
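Continuous monitoring can be pictured as a periodic comparison between what a contract declares and what is actually observed. The sketch below assumes records arrive as dictionaries and that the contract schema uses three illustrative type names; the function name `find_divergences` and the type mapping are assumptions for this example only.

```python
from typing import Any, Dict, Iterable, List

# Assumed mapping from contract type names to Python types.
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def find_divergences(schema: Dict[str, str], records: Iterable[Dict[str, Any]]) -> List[str]:
    """Report every place where observed records diverge from the declared schema."""
    divergences = []
    for index, record in enumerate(records):
        for column, type_name in schema.items():
            if column not in record:
                divergences.append(f"record {index}: missing {column}")
            elif not isinstance(record[column], PYTHON_TYPES[type_name]):
                divergences.append(f"record {index}: {column} should be {type_name}")
    return divergences


# Hypothetical contract schema and observed data.
schema = {"order_id": "string", "quantity": "integer"}
observed = [
    {"order_id": "A1", "quantity": 2},
    {"order_id": "A2", "quantity": "two"},
    {"quantity": 1},
]

for divergence in find_divergences(schema, observed):
    print(divergence)
```

In a real platform the reported divergences would feed alerts or corrective actions rather than being printed.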

To govern the processes of validation and continuous monitoring across different technologies and use cases, three architectural components are required:

[Figure: data contract architectural components]

Contract Enforcing Point (CEP)

The director behind the scenes: it orchestrates the entire data contract lifecycle and acts as the single interface for applications and users.

Contract Decision Point (CDP)

Locally responsible for integration with specific technologies.

Contract Information Point (CIP)

Metadata repository storing the data contracts, making them accessible to the entire platform.

Managing data contracts effectively and efficiently requires an approach capable of overcoming the barriers of various technologies. Indeed, data contracts can encompass a broad and heterogeneous perimeter, including Spark jobs, database tables, object store buckets, and Kafka topics, to name a few.

If managing a data contract locally for a single product is already a challenge, adopting a different solution for each technology is impractical: the management effort would grow quickly and put such initiatives at risk of failure. The architecture based on Contract Enforcing Point, Contract Decision Point, and Contract Information Point enables centralized and holistic management of data contracts that can adapt flexibly to diverse technologies and requirements.
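One way to read this architecture is as three cooperating objects: a single entry point that delegates technology-specific checks and persists accepted contracts in a shared repository. The sketch below is an interpretation under stated assumptions, not Quantyca's implementation: all class and method names are hypothetical, the repository is in-memory, and the Kafka check is a placeholder for a real integration.

```python
from typing import Dict, List, Protocol


class ContractDecisionPoint(Protocol):
    """CDP: technology-specific adapter that evaluates a contract."""

    def check(self, contract: dict) -> List[str]: ...


class KafkaTopicDecisionPoint:
    """Hypothetical CDP for Kafka topics; a real one would query the cluster."""

    def check(self, contract: dict) -> List[str]:
        if not contract.get("topic"):
            return ["Kafka contract must name a topic"]
        return []


class ContractInformationPoint:
    """CIP: in-memory stand-in for the metadata repository."""

    def __init__(self) -> None:
        self._contracts: Dict[str, dict] = {}

    def save(self, contract: dict) -> None:
        self._contracts[contract["name"]] = contract

    def get(self, name: str) -> dict:
        return self._contracts[name]


class ContractEnforcingPoint:
    """CEP: single interface that routes to the right CDP and stores via the CIP."""

    def __init__(self, cip: ContractInformationPoint, cdps: Dict[str, ContractDecisionPoint]) -> None:
        self._cip = cip
        self._cdps = cdps

    def submit(self, contract: dict) -> List[str]:
        technology = contract.get("technology")
        cdp = self._cdps.get(technology)
        if cdp is None:
            return [f"no decision point for technology: {technology}"]
        violations = cdp.check(contract)
        if not violations:
            self._cip.save(contract)
        return violations


cip = ContractInformationPoint()
cep = ContractEnforcingPoint(cip, {"kafka": KafkaTopicDecisionPoint()})
print(cep.submit({"name": "orders", "technology": "kafka", "topic": "orders.v1"}))
print(cep.submit({"name": "clicks", "technology": "kafka"}))
```

Supporting a new technology in this layout means adding one more decision point, while applications and users keep talking to the same enforcing point.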

The Evolution of the Operating Model

For the adoption of data contracts, it is essential to address data management initiatives by carefully examining the organizational structure, the operating model, and the practices adopted by workgroups. This analysis allows for the evaluation of a sustainable transition plan capable of incrementally evolving processes, roles, and interactions in line with the implementation of new paradigms. At Quantyca, we assist our clients in defining and implementing the data strategy, focusing on organizational aspects and tailor-designing the solution to align not only with needs and objectives but also, and above all, with the real status and characteristics of the involved parties.

Benefits

Enhanced and More Efficient Data Governance

Proactive Metadata Generation

Data Transparency and Clarity

Achieved through documenting assets and their semantic value

Interoperability

Facilitates and safeguards data exchange and interconnection between applications

Data Quality

Ensures data integrity and appropriate handling through the inclusion of constraints and validation rules

Collaboration

Platform-provided validation and monitoring enable teams to collaborate with greater confidence

Resources

Video (free, 11/07/2024): Quantyca Podcast: CoE Organizational & Change Governance