A data lake is a centralized data repository that can store both traditional structured data and unstructured, raw data in its native format (such as videos, images, and binary files).
Data lakes are often used to consolidate all of an organization’s data in a single, central location, where new data can be saved “as is” until it is ready for processing. The goal of this architecture is to create a “single source of truth” that eliminates data silos, streamlines data security, and democratizes access to the data for all users.
Data lakes enable use cases that include data analysis and visualization, business intelligence and reporting, data science, machine learning, deep learning, data mining, and more.
Whereas data warehouses are designed for the analysis of highly structured data, modern data lakes have the unique ability to store virtually unlimited amounts of unstructured data, which is the raw material needed for next generation machine learning and data science. Experts estimate that 80% or more of the data that organizations collect is unstructured¹, so companies that can capitalize upon it using data lakes are at a monumental advantage.
Data lakes can also serve as a single source of truth that centralizes and consolidates an organization’s data assets, breaking down data silos along the way. When properly architected, a data architecture with a data lake as its foundation enables you to:
- Power data science and machine learning. Data lakes allow you to automatically transform raw data into structured data that is ready for analytics, data science, and machine learning with low latency. Raw data can be retained indefinitely at low cost for future use in machine learning.
- Centralize, consolidate, and catalogue your data. A centralized data lake eliminates problems with data silos (like data duplication, multiple security policies, and difficulty with collaboration), offering downstream users a single place to look for all sources of data.
- Quickly and seamlessly integrate diverse data sources and formats. Any and all data types can be collected and retained indefinitely in a data lake, including batch and streaming data, video, image, and binary files, and more.
- Democratize your data by offering users self-service tools. Data lakes are incredibly flexible, enabling users with completely different skills, tools, and languages to perform different analytics tasks all at once. Machine learning engineers can build models with Python and Databricks/Jupyter notebooks, while data analysts are querying the data with Presto, and BI analysts are building beautiful dashboards with Tableau.
| |Data Lake|Data Warehouse|
|---|---|---|
|Types of data|All types: Structured data, semi-structured data, unstructured (raw) data|Structured data only|
|Scalability|Scales to hold any amount of data, regardless of type|Scale limited by vendor cost and inability to store unstructured data|
|Intended users|Data analysts, data scientists, business users|Data analysts|
|Advantages|Cost, flexibility, allows storage of the raw data needed for machine learning|Fast query performance, even with many concurrent users|
|Disadvantages|Sorting through large amounts of raw data can be difficult without tools to organize and catalog the data|Expensive, proprietary software, cannot hold unstructured (raw) data needed for machine learning|
| |Data Lake|Relational Database|
|---|---|---|
|Types of data|All types: Structured data, semi-structured data, unstructured (raw) data|Structured data only|
|Scalability|Scales to hold any amount of data, regardless of type|Limited; difficult to scale to big data|
|Intended users|Data analysts, data scientists, business users|Data analysts, business users|
|Advantages|Cost, flexibility, allows storage of the raw data needed for machine learning, scalability|Fast query performance, even with many concurrent users|
|Disadvantages|Sorting through large amounts of raw data can be difficult without tools to organize and catalog the data|Cannot scale to big data, cannot hold unstructured (raw) data needed for machine learning|
Traditionally, companies have turned to data warehouses as the primary way to manage big data. Data warehouses bring an organization’s collection of relational databases under a single umbrella that allows the data to be queried and viewed as a whole. These data warehouses were typically run on expensive, on-premises hardware from vendors like Teradata and Vertica. The primary advantages of this technology included:
- Integration of many data sources
- Data optimized for read access
- Ability to run quick ad hoc analytical queries
Data warehouses served their purpose well, but over time, the downsides to this technology became apparent.
- Inability to store unstructured, raw data
- Expensive, proprietary hardware and software
- Difficulty scaling due to the tight coupling of storage and compute power
To address concerns about cost and vendor lock-in, Hadoop MapReduce emerged as an open source technology that was a precursor to the data lakes of today. Early data lakes built on MapReduce and the Hadoop Distributed File System (HDFS) had some limited success, but many failed due to poor cloud integration, difficulty scaling, and other factors.
The modern era
The rise of cloud computing and the era of big data have changed the landscape of data analytics as we know it. Today, data warehouses still play an important role in many organizations for analyzing highly structured data, but data lakes are becoming an increasingly popular option because of their ability to store raw data and serve many more analytics use cases. As machine learning becomes an increasingly important part of data analytics, we believe this trend will not only continue, but accelerate, and that companies that do not invest to capitalize on it will lose out in the long run.
A data swamp is a term that was coined to describe a data lake that has failed because of poor metadata management, whereby the data sources and tables themselves are not properly annotated and categorized so that users can understand what’s in them. When there’s no way for downstream users to find and make use of the data that they need, good data languishes in the data lake, unanalyzed and undiscovered. This is to be avoided at all costs.
To stop your data lake from becoming a data swamp, catalog the data in your data lake and use Delta Lake.
Catalog the data in your data lake
In order to implement a successful data lake strategy, it’s important for users to properly catalog new data as it enters your data lake, and continually curate it to ensure that it remains updated. The data catalog is an organized, comprehensive store of table metadata, including table and column descriptions, schema, data lineage information, and more. It is the primary way that downstream consumers (for example, BI & data analysts) can discover what data is available, what it means, and how to make use of it. It should be available to users on a central platform or in a shared repository.
There are a number of software offerings that can make data cataloguing easier. Apache Atlas is available as open source software, whereas other proprietary options in the space include AWS Glue, Azure Data Catalog, Alation, Collibra, and Informatica.
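As a rough illustration of the kind of metadata a catalog holds, the sketch below models table and column descriptions, schema, and simple lineage in memory, with a keyword search that a downstream user might run. The class names, table names, and source URIs are all hypothetical, and none of this reflects the API of Apache Atlas or any other product named above.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    name: str
    dtype: str
    description: str = ""


@dataclass
class TableEntry:
    name: str
    description: str
    columns: list
    upstream_sources: list = field(default_factory=list)  # simple lineage info


class DataCatalog:
    """Toy in-memory catalog; a real catalog persists and curates this metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[table.name] = table

    def search(self, keyword):
        """Return the names of tables whose metadata mentions the keyword."""
        needle = keyword.lower()
        hits = []
        for table in self._tables.values():
            haystack = [table.name, table.description]
            haystack += [c.name for c in table.columns]
            haystack += [c.description for c in table.columns]
            if any(needle in text.lower() for text in haystack):
                hits.append(table.name)
        return sorted(hits)


catalog = DataCatalog()
catalog.register(TableEntry(
    name="raw.web_clicks",
    description="Clickstream events as received from the website",
    columns=[ColumnEntry("user_id", "string", "Pseudonymous visitor ID"),
             ColumnEntry("url", "string", "Page that was viewed")],
    upstream_sources=["kafka://clicks"],  # hypothetical source
))
catalog.register(TableEntry(
    name="sales.orders",
    description="One row per completed order",
    columns=[ColumnEntry("order_id", "string"),
             ColumnEntry("amount", "double", "Order total in USD")],
    upstream_sources=["rds://orders_db"],  # hypothetical source
))

matches = catalog.search("visitor")  # finds tables via column descriptions
```

Searching across descriptions as well as names is the point of the exercise: a table is only discoverable if someone has annotated it.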
Use the data lake as a landing zone for raw data
As you add new data into your data lake, it’s important not to perform any data transformations on your raw data (with one exception for personally identifiable information – see below). Data should be saved in its native format, so that no information is inadvertently lost by aggregating or otherwise modifying it. Even cleansing the data of null values, for example, can be detrimental to good data scientists, who can often extract additional analytical value not just from data, but even from the lack of it.
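One way to picture a landing zone is a function that writes each incoming payload byte-for-byte, without parsing or cleansing it. The sketch below uses a local temporary directory as a stand-in for cloud object storage; the `raw/<source>/<date>/` layout and the `land_raw` helper are assumptions made for illustration.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, filename, payload, arrival):
    """Write the payload unchanged under raw/<source>/<arrival date>/."""
    target_dir = lake_root / "raw" / source / arrival.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, cleansing, or aggregation
    return target


# A temporary local directory stands in for the object store here.
lake_root = Path(tempfile.mkdtemp())
payload = b'{"sensor": "a1", "temp": null}\n{"sensor": "a2", "temp": 21.5}\n'
stored = land_raw(lake_root, "iot", "readings.jsonl", payload, date(2024, 1, 15))

# The stored bytes match the input exactly, null reading included.
unchanged = (hashlib.sha256(stored.read_bytes()).digest()
             == hashlib.sha256(payload).digest())
```

Any cleaned or aggregated version would then be written as a separate, derived data set, leaving the raw copy intact.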
Mask data containing private information before it enters the data lake
Data engineers need to strip out PII (personally identifiable information) from any data sources that contain it, replacing it with a unique ID, before those sources can be saved to the data lake. This process maintains the link between a person and their data for analytics purposes while protecting user privacy and supporting compliance with data regulations like the GDPR and CCPA. Since one of the major aims of the data lake is to persist raw data assets indefinitely, this step enables the retention of data that would otherwise need to be thrown out. Read more about how to protect PII when using data lakes to analyze big data.
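A minimal sketch of this masking step follows, assuming a keyed hash (HMAC-SHA256) is an acceptable way to derive the unique ID. The text above does not prescribe a method, so the choice of HMAC, the field names, and the `pseudonymize` helper are all illustrative assumptions.

```python
import hashlib
import hmac

PII_FIELDS = {"name", "email"}  # hypothetical PII field names


def pseudonymize(record, secret_key, id_source="email"):
    """Drop PII fields and add a stable keyed-hash ID derived from one of them."""
    normalized = record[id_source].lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    masked = {k: v for k, v in record.items() if k not in PII_FIELDS}
    masked["person_id"] = digest
    return masked


# Placeholder key for the example; a real key belongs in a secrets manager.
key = b"example-key-for-illustration-only"

a = pseudonymize({"name": "Ada", "email": "ada@example.com", "plan": "pro"}, key)
b = pseudonymize({"name": "Ada L.", "email": "ADA@example.com", "plan": "free"}, key)

# Both records map to the same person_id, so they can still be joined.
same_person = a["person_id"] == b["person_id"]
```

Because the same input always yields the same ID under a given key, records for one person remain linkable for analytics even though the name and email never reach the lake.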
Secure your data lake with role- and view-based access controls
Traditional role-based access controls (like IAM roles on AWS and Role Based Access Controls on Azure) provide a good starting point for managing data lake security, but they’re not fine-grained enough for many applications. In comparison, view-based access controls allow precise slicing of permission boundaries down to the individual column, row, or notebook cell level, using SQL views. SQL is the easiest way to implement such a model, given its ubiquity and easy ability to filter based upon conditions and predicates.
View-based access controls are available on modern unified data platforms, and can integrate with cloud native role-based controls via credential passthrough, eliminating the need to hand over sensitive cloud provider credentials. Once set up, administrators can begin by mapping users to role-based permissions, then layer in finely tuned view-based permissions to expand or contract the permission set based upon each user’s specific circumstances. You should review access control permissions periodically to ensure they do not become stale.
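To make the idea of a view-based permission boundary concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table and `orders_eu_analyst` view are hypothetical. SQLite has no user or grant system, so the sketch only shows how a view narrows columns and rows; on a platform with permissions, users would be given access to the view rather than to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER, region TEXT, amount REAL, card_number TEXT
    );
    INSERT INTO orders VALUES
        (1, 'EU', 120.0, '4111-xxxx'),
        (2, 'US', 75.5,  '5500-xxxx'),
        (3, 'EU', 42.0,  '3400-xxxx');

    -- Column-level restriction: the view omits card_number.
    -- Row-level restriction: the predicate keeps only EU rows.
    CREATE VIEW orders_eu_analyst AS
        SELECT order_id, region, amount
        FROM orders
        WHERE region = 'EU';
""")

cursor = conn.execute("SELECT * FROM orders_eu_analyst ORDER BY order_id")
visible_columns = [col[0] for col in cursor.description]
visible_rows = cursor.fetchall()
```

A query against the view can see neither the `card_number` column nor the US row, which is the column- and row-level slicing described above.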
Creating a data lake using elastic cloud infrastructure
The cloud is the best place to build a data lake because of its cost effectiveness, elasticity, and scalability. At its core, a data lake is nothing more than an object storage repository that can scale out to any size, so setting up a data lake is easier than you think. If you decide to use Microsoft Azure, you can set up a data lake by creating an Azure Data Lake Storage Gen2 repository, and if you decide to use AWS, you can set up an S3 bucket to serve as your data lake.
Connecting existing data sources to your data lake
Use a firehose approach to enable quick, continuous ingestion of all types of data into your data lake – batch or streaming, relational or non-relational, structured or unstructured – in its raw, native format. Simply point streaming sources like Apache Kafka, Azure Event Hubs or AWS Kinesis at your cloud provider’s data storage endpoint, and let the data flow in. For batch sources, like RDS databases or Redshift, use pre-built connectors to enable batch ingestion of existing data assets.
Use off-the-shelf connectors or plugins to transfer data from your existing sources to your data lake.
AWS and Azure offer a wide array of data transfer options for every use case, ranging from SSH and simple drag-and-drop tools to specially designed 18-wheeler semi-trailers that physically transport your data from your on-prem servers. For more information about uploading your data, see the documentation for Azure Data Lake Storage and for AWS S3.
Building data pipelines that combine batch and streaming data
Batch data is data that is accumulated and stored until it is processed at a designated point-in-time (like quarterly sales data). Streaming data is real-time event data (like a push notification) that is processed immediately. With the ever-increasing volume of streaming data like clickstreams and Internet of Things (IoT) data, data lakes need to be able to quickly and easily combine batch data with streaming data to enable analytics that are updated in near real-time.
Traditionally, data engineers built lambda architectures to combine batch and streaming data for analytics, but lambda architectures are prone to breakage, and require maintenance of two code bases.
Instead of using a lambda architecture, use a Delta architecture built on open source Delta Lake to combine batch and streaming data for analytics. Delta Lake tables can serve as a source and sink for both types of data, and support multiple concurrent readers and writers, so the same table can also serve correct views on the data to users at all times, even while ingesting new data. Get started with a webinar about Delta architecture found here.
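The idea of one table serving as a sink for both batch and streaming writes, while readers keep a consistent view, can be pictured with a toy model. The sketch below is not Delta Lake's API or implementation; `VersionedTable` is a hypothetical in-memory class in which every commit creates a new immutable version and readers work from a snapshot.

```python
import threading


class VersionedTable:
    """Toy append-only table: each commit produces a new immutable version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = [()]  # version 0 is the empty table

    def commit(self, rows):
        """Atomically append rows (a batch load or a streaming micro-batch)."""
        with self._lock:
            self._versions.append(self._versions[-1] + tuple(rows))
            return len(self._versions) - 1

    def latest_version(self):
        with self._lock:
            return len(self._versions) - 1

    def snapshot(self, version=None):
        """Return the rows as of a version; later commits never alter it."""
        with self._lock:
            return self._versions[-1 if version is None else version]


table = VersionedTable()

# A batch job loads historical data in one commit.
v_batch = table.commit([{"source": "batch", "sales": 100},
                        {"source": "batch", "sales": 250}])

# A reader starts a query and takes a snapshot.
reader_view = table.snapshot()

# A streaming micro-batch arrives while that query is still running.
table.commit([{"source": "stream", "sales": 5}])

# The reader's view is unchanged; a new query sees all three rows.
rows_seen_by_reader = len(reader_view)
rows_seen_by_new_query = len(table.snapshot())
```

Both writers use the same `commit` path, so there is a single code base for batch and streaming data rather than the two that a lambda architecture maintains.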
A data lake provides the raw data foundation for a next-generation analytics architecture, but you still need an engine to power the processing of all that data, at scale. Apache Spark™ is the de facto open source big data processing standard, enabling rapid distributed processing of data sets of any size. Learn more about running Spark on Databricks here.
Learn more about building a data lake using Amazon Web Services and S3 here.
Azure Data Lake Gen2
Simplify and strengthen your data architecture by using Delta Lake to ensure data validity and consistent views at petabyte scale.
Until recently, ACID transactions were not possible on data lakes. They are now available with the introduction of open source Delta Lake, which brings the reliability and consistency of data warehouses to data lakes.
ACID properties (atomicity, consistency, isolation, and durability) are properties of database transactions that are typically found in traditional relational database management systems (RDBMSes). They’re desirable for databases, data warehouses, and data lakes alike because they ensure data reliability, integrity, and trustworthiness by preventing sources of data contamination such as partially completed or conflicting writes.
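Atomicity, the first of these properties, is easy to see in a traditional relational database. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `accounts` table: a transfer either applies both of its updates or neither. It illustrates the property in an RDBMS, not how Delta Lake implements transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move funds in one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates commit
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK

# The credit from the failed transfer was rolled back along with the debit.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Without atomicity, the failed transfer would have left the credit applied and the debit missing, which is exactly the kind of partial write that contaminates data.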
Delta Lake builds upon the speed and reliability of open source Parquet (already a highly performant file format), adding transactional guarantees, scalable metadata handling, and batch and streaming unification to it. It’s also 100% compatible with the Apache Spark API, so it works seamlessly with the Spark unified analytics engine. Learn more about Delta Lake with Michael Armbrust’s webinar entitled Delta Lake: Open Source Reliability for Data Lakes, or take a look at a quickstart guide to open source Delta Lake here.
Companies are already processing over 2 exabytes (that’s 2 billion gigabytes) of data per month with Delta Lake. Its users include companies such as Comcast, Viacom, Booz Allen, and McAfee.
Read about how companies are using Delta Lake to make better analytics decisions.