In a rapidly evolving world of big data, data discovery, governance and lineage are essential aspects of data management. As organizations modernize their workloads into multi-cloud and hybrid environments, data becomes distributed across cloud data lakes and SaaS applications. With that, organizations are trying to answer key questions:
- How do I find the right dataset?
- How do I ensure the data is of high quality?
- How do I move faster to deliver insights for analytics and ML workloads?
- How do I comply with regulations and deliver trusted data?
Achieving data discovery, lineage and reliability – at enterprise scale – is an opportunity for organizations. To help enterprises build a strong foundation for data management, we’ve partnered with Informatica to provide an end-to-end lineage solution. This joint solution provides complete visibility and traceability into data pipelines on Delta Lake, the open-source storage layer for reliable data lakes at scale.
Building End-to-End Pipelines with Data Discovery, Governance and Lineage in the Cloud on Delta Lake
Take a moment to think about all the applications we use every day: email, web, mobile, social media, SaaS applications, BI, reporting dashboards and many others. Data engineers spend vast amounts of time in these applications finding datasets and tracing data transformations, which delays analytics and machine learning projects.
The joint solution by Databricks and Informatica addresses this problem by enabling data engineers and data scientists to find, validate and trace datasets as they move through data pipelines. Informatica’s Enterprise Data Catalog (EDC) connects with Delta Lake to scan and index metadata, so teams can discover and profile data and view its detailed lineage as it moves through pipelines. This allows data engineers and data scientists to track data movement, including column/metric-level lineage, and to identify related tables, views and domains.
Along with EDC, Databricks integrates with Informatica Data Engineering Integration (DEI). DEI uses dynamic mappings and data transformations to ingest data from multiple source systems and applications into Delta tables with complete data lineage tracking. Once the data is in Delta tables, EDC performs scanning, profiling and discovery to help data engineers find the right data sets.
Viewing Lineage of Delta Datasets Engineered with Informatica Data Engineering Integration in EDC
Data Governance at Enterprise Scale
With new and upcoming regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), data governance becomes integral to any data management initiative. For example, GDPR mandates a ‘Right to Access’ that allows a customer to view their personal information across the entire enterprise. Similarly, the ‘Right to Erasure’ (also known as the ‘Right to be Forgotten’) requires that all of a customer’s personal data be deleted without undue delay. Because Delta Lake is transactional, specific data (e.g., rows in tables) can be deleted using DELETE commands in these compliance scenarios, without the burden of coding elaborate pipelines.
Overall, with data dispersed across disparate sources, it can be challenging to keep track of what data resides where and which on-premises and cloud workflows touch that data. A cloud-based data platform and discovery program for the entire organization can future-proof its data governance discipline.
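As a rough illustration of the DELETE-based approach, the sketch below builds one Delta Lake `DELETE FROM ... WHERE ...` statement per table for a single customer and hands each statement to a caller-supplied executor. The helper names (`build_erasure_statement`, `erase_customer`), the table names and the `customer_id` column are hypothetical and not part of the Databricks-Informatica solution; on a Databricks cluster the executor would typically be `spark.sql`.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical helper: allows only plain or dotted identifiers (e.g. "crm.customers")
# so that table and column names cannot smuggle extra SQL into the statement.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def build_erasure_statement(table: str, key_column: str, customer_id: int) -> str:
    """Return a DELETE statement that removes one customer's rows from a Delta table."""
    for name in (table, key_column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # int() guarantees the literal in the WHERE clause is numeric.
    return f"DELETE FROM {table} WHERE {key_column} = {int(customer_id)}"


def erase_customer(
    run_sql: Callable[[str], object],
    tables: Iterable[str],
    key_column: str,
    customer_id: int,
) -> List[str]:
    """Run one DELETE per table through run_sql and return the statements issued."""
    statements = [build_erasure_statement(t, key_column, customer_id) for t in tables]
    for statement in statements:
        run_sql(statement)
    return statements


if __name__ == "__main__":
    # Dry run: print the statements instead of executing them.
    # The table names below are assumptions made up for this example.
    erase_customer(print, ["crm.customers", "sales.orders"], "customer_id", 42)
```

On a cluster, the same call with `spark.sql` in place of `print` would issue the deletes. Returning the list of statements gives a simple record of what was run for a given erasure request.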
Getting Started with the Databricks-Informatica End-to-End Data Lineage Solution
Building intelligent data pipelines to bring data from different silos, tracing its origin and creating a complete view of data movement in the cloud is critical to enterprise organizations. The Databricks and Informatica partnership enables modern data teams to leverage data assets to scale and document datasets and data pipelines for analytics and ML. It is a powerful integration for data engineers and data scientists looking to automate their governance processes while achieving speed and agility of data management for the future.
Check out this webinar for an in-depth demo of the Databricks and Informatica joint solution for data lineage.