Datakin is announcing OpenLineage, its initiative to define industry standards for data lineage, at the Open Core Summit today. OpenLineage’s end-to-end approach to lineage management is intended to make data operations more efficient and trustworthy for enterprises.
Datakin is developing OpenLineage in collaboration with contributors to other open source projects, including Amundsen, DataHub, Pandas, and Spark.
Data lineage is the flow of data over time across an ecosystem; as the foundation for data operations, it provides visibility into data’s origin, inventory, and availability. Data technologies are growing in both number and complexity, and data’s use cases have evolved in recent years to be both analytical and operational, rather than only analytical, making data more central to enterprises’ products. For example, targeted advertisements rely on data models for personalized recommendations.
But the existing tools for handling the growing number of data technologies can be inefficient and untrustworthy, limiting data availability and quality. Data-driven decisions become difficult to make after disruptions such as models collecting the wrong data or dashboards breaking. When data processing is more observable, problems can be identified and fixed more quickly.
OpenLineage aims to meet this need by building an end-to-end data management layer. This approach would include data catalogs, comprehensive operational tools, access control for data privacy, and governance and compliance solutions, making lineage metadata easier to collect and analyze.
In an interview with VentureBeat, Datakin chief technology officer Julien Le Dem described OpenLineage’s mission to get industry players on the same page. “One goal is, OK, let’s share this effort on those integrations and reuse that independently of the use case, whether it’s governance or privacy or operations. We all need the same data,” he said.
When new versions of processing systems such as Apache Spark are released, the integrations that extract their metadata may break. Le Dem’s second goal is to fix this issue by making the projects themselves depend on a data lineage standard. “Let’s flip the dependency … that [standard] becomes core to Spark and core to the data warehouse to actually stay in sync and whenever they change something to keep respecting the standard,” Le Dem said. “And now that we inverted this dependency … we are making it a lot more robust.”
OpenLineage could reduce fragmentation and duplication of data lineage efforts across enterprises. Its flexible, industry-wide standards could help enterprises guarantee their metadata’s consistency and compatibility.
The initiative aims to capture information about a data pipeline while it is running. In the interview with VentureBeat, Datakin CEO Laurent Paris compared this to how smartphone pictures include as much information as possible, such as GPS coordinates; that metadata enables observability and lets other tools, such as software on a computer, show the pictures on a map.
According to Paris, “You can have an ecosystem of tools that use that information to process your picture, but it’s only possible because everybody agreed on one single spot on how to express the metadata about the picture, and that’s what we’re trying to do.”
OpenLineage follows observability efforts like OpenTelemetry and OpenTracing. It applies a similar community-oriented concept to the data processing ecosystem. “It’s really getting all together, planting the seed, starting defining the models … and this is the type of project where the more people contribute to it, the more everybody gets out of it,” Le Dem said.