THE PROBLEM

What is the persistent problem in the data ecosystem?

The current data stack is littered with so many potholes that their presence is considered ‘normal.’ Even when organizations own extremely rich data, it does not reach its destination at the right time, or in the right state, because the pipelines carrying it are in poor shape 🚧. Upstream change events are common in data ecosystems; the problem arises when all the downstream pipelines break due to simple yet untraceable changes.
🏗️ The common drill after a pipeline breaks falls on a few heroic data engineers who take the heat: debugging, locating the exact upstream change event, and fixing the resulting problem. And, of course, to derail the whole process, there’s the unavoidable blame game over why the right information was not distributed to downstream teams before the pipeline broke.

Tackling even one simple upstream change event easily consumes weeks. Moreover, if the issue impacts popular data pipelines, the revenue stream is directly affected.

The lack of information distribution is not even the fault of producers, who are most often unaware of how and where their data is being channelled. Moreover, once a pipeline breaks, data engineers are bombarded with requests from both producers and consumers to fix the data on priority.

What do Data Personas want?

As a data producer, I want to produce data strategically to complement business requirements. But I don’t want to be constrained during data generation, since constraints obstruct the speed and richness of the data.

As a business user or data consumer, I want quick access to reliable data that adheres to my business/use-case requirements without week/month-long iterations with data teams. I want to use high-quality data at the right time to achieve maximum impact on fast-paced business decisions.

As a data engineer, I want to cut down the flood of requests I receive from both producers and consumers and channel incoming data systematically through a reliable yet adaptable pipeline that enables dynamic self-service for both producers and consumers of data.

Such disparate communication between data personas results in:

⛑️ Poor data quality

🌪️ Unmanageable data ecosystem

⚔️ A gap between business logic and physical data

🩹 Fragile data pipelines

🥪 Sandwiched data engineers

☠️ Death of Data Modeling

💪🏼 Lack of data ownership or accountability

Without any direct consensus between data producers and data consumers, the data lacks schema fit, governance, quality, and, consequently, reliability.

THE SOLUTION

What is the least disruptive solution?

To enable high-quality, trustworthy data that truly impacts the business, the data ecosystem needs to adopt the concept of API handshakes. We call these handshakes Data Contracts. Data Contracts give ownership of data semantics back to the business, enabling data that is rich with business context, democratized, declarative, and highly operable.

APIs enforce rules so that two software applications can talk to each other. API handshakes are so standard today that anybody with any digital device, including a phone or a laptop, has already made over a hundred API requests while reading this article.
The hope for data contracts is nothing less. Open Data Contracts (ODC) aim to form a similar level of reliable communication between data producers and consumers by establishing standard data practices.
And the wonderful thing is that ODCs can establish this by acting as a simple bridge between already existing data resources. Thus, from a resource perspective, ODCs do not require organizations to replace the data infrastructure in which they have locked large investments over the years. Nor do ODCs require data to be constrained.
Instead, they provide a set of concrete expectations that producers and consumers can implement to enable backward compatibility.

WHAT

What are Open Data Contracts?

Open Data Contracts or ODCs are expectations on data. These expectations can be in the form of business meaning, data quality, or data security. Think of them as guidelines to which the data may or may not adhere. In other words, an ODC is an agreement between data producers and data consumers that documents and declaratively ensures the fulfilment of data expectations.
“Open” in open data contracts stands for open standards wherein the community can pitch in to improve or customize the implementation to their specific needs. Open standards are crucial, particularly for new innovations.
An ODC does not just describe data or an entity; it also adds meaning by defining semantics, the logical business expectations from the data. It is enforceable at any chosen point in the data ecosystem, bringing automation, standardization, control, and reliability to the whole data infrastructure. Through an ODC, data stewards can decide which parties or systems to expose the data to, and in what format.
From a resource perspective, the contract layer helps data engineers shed the burden of countless high-priority requests from both producers and consumers by decoupling databases and services from the analytical or consumption layer.
The outcome of implementing ODCs is contextually rich, high-quality, well-governed data that is trustworthy and available for use at the right business opportunity. Consumers can now simply focus on describing the data they need instead of reconstructing it over weeks through complex and buggy JOINs and aggregations.

The Four Cornerstones of ODC: Expectations fulfilled by contracts

Specs and their descriptions:

Schema
• Column name
• Data type

Semantics
• Entity and column description
• Tags. Using tags is how we will bridge taxonomy and ODC. This will enable users to manage governance, checks, and automation efficiently.
• Integrity checks. Validation rules which guarantee the integrity of an attribute in the entity.

Security
• Access policies
• Data policies

Business assertions
• Quality checks
• The cadence of checks
Open Data Contracts are required to be well-structured and templated. Each contract defines and logically represents a single entity, helping producers and consumers understand the schema and semantics around it. Taxonomy, the logical classification of entities, is a layer on top of ODCs and represents a hierarchical structure for entities.
While the purpose of ODCs is not to define the ontology/taxonomy/glossary, they establish a connection to the taxonomical layer through semantics. Due to the hierarchical structure of the taxonomy, entities can inherit semantics from parent entities or other objects (every element in the taxonomy layer need not be an entity but can relate to ODCs through semantic definitions or tags).
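To make the four cornerstones concrete, here is a minimal sketch of what a single-entity contract could look like in YAML. The layout and every field name (schema, semantics, security, assertions, and so on) are hypothetical illustrations, not a published ODC specification.

```yaml
# Hypothetical ODC for a single entity; every field name here is
# illustrative, not a published specification.
name: customer_contract
entity: customer
schema:                            # Cornerstone 1: schema
  columns:
    - name: customer_id
      type: string
    - name: email
      type: string
semantics:                         # Cornerstone 2: semantics
  description: A person or organization that has made at least one purchase.
  tags: [pii, customer-360]        # tags bridge the contract to the taxonomy layer
  integrity_checks:
    - customer_id is unique and not null
security:                          # Cornerstone 3: security
  access_policy: allow roles tagged customer-support
  data_policy: mask email for all other roles
assertions:                        # Cornerstone 4: business assertions
  quality_checks:
    - null percentage of email below 1%
  cadence: hourly                  # how often the checks run
```

Note how the contract is scoped to one entity and stays purely declarative: nothing here dictates how the data is produced, only what is expected of it.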

WHY

Why Implement Open Data Contracts?

Prevention is better than cure, and Open Data Contracts are how you prevent mishaps in the Universe of Data.
The current infrastructure often inflames, rather than solves, the issue of untrustworthy and dormant data. ODCs address this in the following ways:

Manageable Data Ecosystem

Manage and modify data structures, relationships, and business logic with a backspace button and a couple of YAML lines.

Concrete and Adaptable Data Pipelines

Every change to the data triggered by business teams is governed and selectively enforced before the pipeline gets a chance to break.

Bridge between Business Logic and Physical data

Enforce right-to-left data modelling to declare requirements even before data teams encounter or process the data.

Optimizing Data Modeling

Rid yourself of subpar data models created by data engineers who only have a partial view of the business landscape.

Happy Data Engineers

Outsource the mundane activity of maintaining data definitions and behaviour, and ensure that expectations are propagated across all teams.

High Data Quality

Make it impossible for any user/machine to feed data that does not follow quality standards.

HOW

How to Implement Open Data Contracts?

Without the implementation of Open Data Contracts, models and the apps in the activation layer cannot channel data that has quality, semantics, and governance guarantees.
ODCs enable right-to-left data modelling, where the business user defines the data requirements based on the use case instead of consuming whatever IT feeds them. This holds even when the data demanded by the data model and the contract does not exist or has not yet arrived in the org’s data store.
Business users can ensure that the data channelled through the data model is of high quality and well-governed through contractual handshakes. The sole task of the data engineer is to map the available data to the contextually rich data model. This does not require them to struggle with or scrape domain knowledge from vague sources.
This also implies that the use cases on top of the data models are powered by the right data that specifically aligns with the high-value requirements.
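As a rough sketch of this right-to-left sequence (all names below, such as orders_contract, are hypothetical), the contract exists and is enforceable before any physical data arrives; binding the data in later is the engineer's only task.

```yaml
# Step 1 (business user): declare expectations before any data exists.
contract: orders_contract
entity: order
expects:
  columns: [order_id, customer_id, order_value, placed_at]
  quality:
    - order_value is never negative
# At this point no physical data is mapped; the model exists right-to-left.
---
# Step 2 (data engineer, later): bind available physical data to the model.
dataview: orders_raw
source: kafka://orders             # illustrative source reference
fulfils: orders_contract           # adherence is checked, not assumed
```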

ODC Landscape

Objects in the landscape and their descriptions:

Dataview
A set of data points. A Dataview acts as a bridge between physical data and contracts. It can reference a table stored in any lakehouse, a Kafka topic, a Snowflake view, a complex query extracting information from multiple sources, etc.

Contract
Contains strongly typed data expectations like column type, business meaning, quality, and security. Think of contracts as guidelines to which a Dataview may or may not adhere.

Entity
The logical representation of a business object, mapped by a Dataview adhering to a specific contract with a certain degree of confidence. It also contains information on what can be measured and categorized into dimensions.
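For instance, a Dataview might be declared roughly as follows. The field names and the adherence mechanism shown are assumptions for illustration only.

```yaml
# Hypothetical Dataview declaration: the bridge between physical data
# and contracts. Field names are assumptions, not a spec.
dataview: customer_master
source:
  type: snowflake_view             # could equally be a lakehouse table,
  reference: analytics.customer_v  # a Kafka topic, or a multi-source query
adheres_to:
  - contract: customer_contract    # the contract sketched earlier
    adherence: 92%                 # current degree of adherence
```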


Other entity specs and their descriptions:

Dimensions
Categorical or time-based data that adds context to the measures. Users can slice and dice measures by dimensions to get deeper insights.

Measures
Aggregated values across dimensions. Dimensions here can be internal as well as external dimensions belonging to other entities. The formula or description for a measure is expected from the persona building the Dataview.

Relations
A list of 1:1, 1:many, or many:1 relations with other entities to enable calculations and eventual analysis of measures and metrics.
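Putting these specs together, an Entity definition could look like the following sketch; the names and syntax are hypothetical.

```yaml
# Hypothetical Entity definition: a business object mapped by a Dataview
# that adheres to a contract with a certain degree of confidence.
entity: customer
mapped_by:
  dataview: customer_master
  contract: customer_contract
  confidence: high                 # degree of confidence in the mapping
dimensions:                        # categorical / time-based context
  - region
  - signup_month
measures:                          # aggregated values across dimensions
  - name: lifetime_value
    formula: sum of order_value across related orders
relations:                         # enable cross-entity calculations
  - to: order
    cardinality: 1:many            # one customer places many orders
```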

Federated responsibilities

Dataview, Contract, and Entity each fall under the ownership of a different persona. However, no persona is restricted from setting up any of the components. The boundary is loosely defined based on the typical roles of three different personas:

Personas and their responsibilities:

Data Engineer
• Data engineers, who work with the actual physical data, can independently produce Dataviews without any obstruction from Contracts.
• Data engineers can refer to Contracts as guidelines to produce Dataviews that adhere to the requirements from the business end.
• Data engineers can run CRUD operations on Dataviews to progressively increase percentage adherence to a Contract or a set of Contracts.

Domain User
• Domain users are responsible for defining Contracts since they own the domain knowledge.
• The domain structure is backed by a data mesh architecture that supports domain ownership by giving control of niche data to domain teams.
• Allocating responsibility for contracts to domain teams removes the bottleneck on central data teams, which are now solely responsible for the central control plane that takes care of governance, metadata, and orchestration.

Business User
• Business users are responsible for defining Entities with three attributes: Measures, Dimensions, and Relations.
• The responsibility for these three features falls on business teams since they work directly with data applications on the customer side and are aware of the connections, equations, and columns that have to be operationalized for specific business outcomes.
• Business users can create queries that are matched against the available contracts and dataviews. The contracts with the highest match to the query are surfaced and project the Dataviews with the highest match to those contracts (see the sketch below).
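To make that last point concrete, here is a hypothetical sketch of such a query: the business user describes what they want in terms of the entity model, and the system resolves it against contracts and dataviews. None of these field names come from a published ODC spec; they only illustrate the matching idea.

```yaml
# Hypothetical business-user query: describe the data needed in terms of
# the entity model; the user never writes JOINs against physical tables.
query:
  entity: customer
  measures: [lifetime_value]
  dimensions: [region, signup_month]
# Conceptual resolution: rank contracts by their match to the requested
# entity, measures, and dimensions; then serve the result from the Dataview
# with the highest adherence to the best-matching contract.
```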

Contracts & The Three-Layered Data Operating System Landscape

The Data Operating System (DOS) is segregated into three logical layers: the Data Layer, the Knowledge Layer, and the Semantic Layer.
Contracts integrate with each layer to facilitate a governed, high-quality exchange of data across them.
DOS layers and their contract implementations:

Data Layer
You establish a connection with your heterogeneous, multi-cloud data environments to make data consistently addressable. The layer provides users access to data via a standard SQL interface, automatically catalogues the metadata, and implements data privacy using the ABAC engine.

Knowledge Layer
With dynamic metadata management capabilities, DOS identifies associations between data assets and surfaces their lineage. Auto-classification finds the hidden business meaning in datasets and propagates that information, including sensitive identifiers, to downstream consumers. All of this information is stored in a graph structure to utilize the network effect, and, with open APIs, users can augment the data programmatically.

Semantic Layer
A semantic layer capable of accessing and modelling data across disparate sources. It allows you to view and express various business concepts through different lenses. Users can also define key business KPIs, and the system automatically surfaces anomalies around them. With contracts, business users can express their data expectations and enforce them while modelling business entities.