As data technologies continue to advance, modern companies are ingesting, storing, and processing more data than ever before to make better-informed business decisions. While relational databases may have been enough for the data demands of 25 years ago, the continual growth of data operations has led to the emergence of new technologies built for the era of big data. These days, there is a host of cloud products for data teams to choose from, many of which describe themselves as data warehouses, data lakes, or data lakehouses. With such similar names, it can be difficult to understand what vendors actually mean by each of them. In this post, we’ll break down what these terms mean, then discuss how real-time data streaming plays a role in the big data landscape.

What is a Data Warehouse?

A data warehouse is a storage and processing hub, primarily intended for generating reports and performing historical analysis. Data stored in a data warehouse is structured and well-defined, allowing the warehouse to run fast, performant analytical queries over its datasets. Data from relational databases, streaming storage systems, backend systems, and other sources is loaded into the data warehouse through ETL (extract, transform, load) processes, where it is cleaned and otherwise transformed to meet the integrity requirements the warehouse expects. Most data warehouses allow users to access data through SQL clients, business intelligence (BI) tools, or other analytical tools.
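
To make the ETL flow concrete, here is a minimal sketch using pandas and SQLAlchemy. The connection string, file, table, and column names are all hypothetical, and it assumes a warehouse that speaks the PostgreSQL wire protocol (as Amazon Redshift does); a production pipeline would typically use a dedicated ETL tool or the warehouse’s bulk-loading facilities instead.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection string for a PostgreSQL-compatible warehouse.
engine = create_engine("postgresql://user:password@warehouse-host:5439/analytics")

# Extract: pull raw order records exported from an operational system.
raw = pd.read_csv("daily_orders.csv")

# Transform: enforce types and drop rows that would violate the warehouse's
# integrity expectations.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_id", "order_date", "amount"])

# Load: append into a structured fact table.
clean.to_sql("fact_orders", engine, if_exists="append", index=False)

# Analysts can then query the warehouse through any SQL client or BI tool.
with engine.connect() as conn:
    monthly_revenue = pd.read_sql(
        text(
            "SELECT date_trunc('month', order_date) AS month, SUM(amount) AS revenue "
            "FROM fact_orders GROUP BY 1 ORDER BY 1"
        ),
        conn,
    )
    print(monthly_revenue)
```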

Data warehouses are a great choice for organizations that primarily need to do historical analytics and reporting on structured data. However, the ETL process adds complexity to data ingestion, and the requirement for structured data can be limiting for some use cases. Popular data warehouse offerings include Snowflake, Amazon Redshift, Google BigQuery, and Oracle Autonomous Data Warehouse.

What is a Data Lake?

A data lake is a massive storage system designed to store both structured and unstructured data at any scale. Similar to data warehouses, data lakes can ingest data from many different sources. However, data lakes are designed to be flexible, so users can store their raw data as-is, without needing to clean, reformat, or restructure it first. By utilizing cheap object storage and accommodating a wide range of data formats, data lakes make it easy for developers to simply store their data. This ultimately results in organizations accumulating large repositories of data that can power use cases such as machine learning, aggregations over large datasets, and exploring patterns across data from different sources. One of the challenges of working with data lakes, however, is that downstream tasks need to make sense of differently formatted data in order to analyze it. Further, if poorly maintained, data quality can quickly become an issue in a data lake. Tools like Apache Hadoop and Apache Spark are popular for analyzing a data lake, as they allow developers to write custom logic to make sense of different kinds of data, but they require more expertise, which limits the set of people who can feasibly work with the data lake.
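
As a rough illustration, the PySpark sketch below reads differently formatted raw data straight out of object storage. The bucket paths, column names, and join key are hypothetical; the point is simply that the reconciliation logic lives in downstream code rather than in the storage layer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Raw data lands in the lake as-is, in whatever format the source produced.
clicks = spark.read.json("s3://example-lake/raw/clickstream/")   # semi-structured JSON
orders = spark.read.parquet("s3://example-lake/raw/orders/")     # columnar files
logs = spark.read.text("s3://example-lake/raw/app-logs/")        # unstructured text lines

# Downstream code has to reconcile the differing shapes itself, for example by
# joining clickstream events to orders on a shared key before aggregating.
joined = clicks.join(orders, on="user_id", how="inner")
joined.groupBy("order_status").count().show()
```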

Data lakes are a good choice for organizations that need to store large volumes of both structured and unstructured data, but analyzing and maintaining a data lake can be a challenge. Data lakes are commonly built on cheap cloud storage services such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

What is a Data Lakehouse?

Data lakehouses merge the features of data warehouses and data lakes into a single system, hence the name. As data warehouses began adding features found in data lakes, and data lakes began adding features found in data warehouses, the distinction between the two concepts became somewhat blurred. Before data lakehouses, organizations would typically need both a data lake for storage and a data warehouse for processing, but this setup could create a lot of overhead for data teams, as data would often need to be processed or duplicated from one location to the other before data engineers could perform complete analyses. By merging the two concepts into a single system, data lakehouses aim to remove these silos and offer the best of both worlds. Similar to data lakes, storing data in a data lakehouse is still cheap, scalable, and flexible, but metadata layers are also provided to enforce things like schemas and data validation where necessary. This allows the data lakehouse to remain performant for querying and analytics, as data warehouses are.
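
As a sketch of what that metadata layer looks like in practice, the example below assumes Delta Lake (the open table format behind the Databricks Lakehouse Platform) running on Spark; the paths and configuration are hypothetical and require the delta-spark package to be available.

```python
from pyspark.sql import SparkSession

# These configs enable Delta Lake on a Spark session; they assume the
# delta-spark package is installed on the cluster.
spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Raw events are still stored as cheap object-storage files...
events = spark.read.json("s3://example-lake/raw/events/")   # hypothetical path

# ...but writing them as a Delta table adds a transaction log and schema metadata.
events.write.format("delta").mode("append").save("s3://example-lake/lakehouse/events")

# Later appends whose columns or types do not match the table's schema are
# rejected by default, which is the kind of validation a lakehouse layer provides.
```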

Since data is typically loaded into a data lakehouse in its raw format, it’s common to use a medallion architecture. The medallion architecture describes a series of queries or processing steps that transform raw data (bronze) into filtered and cleaned data (silver), and finally into business-ready aggregated results (gold), where the gold datasets can be easily queried for BI purposes.
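
The sketch below walks through those three tiers with PySpark. The paths, column names, and business logic are hypothetical, and a real pipeline would typically write each tier as managed lakehouse tables rather than plain Parquet files.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: raw events landed as-is from the source systems.
bronze = spark.read.json("s3://example-lake/bronze/orders/")

# Silver: de-duplicated, filtered, and cleaned records.
silver = (
    bronze.dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull())
)

# Gold: business-ready aggregates that BI tools can query directly.
gold = (
    silver.groupBy("customer_id")
    .agg(F.sum("amount").alias("lifetime_spend"))
)

gold.write.mode("overwrite").parquet("s3://example-lake/gold/customer_spend/")
```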

While the exact distinctions between a data lakehouse, a data lake, and a data warehouse are somewhat nuanced, popular cloud offerings with data lakehouse capabilities include the Databricks Lakehouse Platform, Snowflake, Amazon Redshift Spectrum, and Google Cloud BigLake. While data lakehouses can handle a wide range of use cases, they can be complex to manage and still require skilled data experts to extract their full benefits.

Impacts of Real-time Streaming Data

As big data technologies continue to evolve, there has been an increasing demand for real-time data products. Users are becoming more accustomed to getting results instantly, and in order to support these use cases, companies have been adopting streaming technologies such as Apache Kafka and Apache Flink.

The Challenges of Streaming Data in the Current Ecosystem

Apache Kafka is a real-time event log that uses a publish/subscribe model. Microservices, clients, and other systems with real-time data produce events to Kafka topics, and those events are then consumed by other real-time services that act on them. Kafka and other streaming storage systems typically set an expiration (retention) period for their data events, so to keep real-time data long-term, organizations typically load it into a data lake, data warehouse, or data lakehouse for later analysis. However, streaming data coming from IoT sensors, financial services, and web interactions can add up to a large volume of data, and doing computation on the raw form of this data can be too slow or too computationally expensive to be viable. To address this, data engineers will typically downsample or otherwise transform the raw data to prepare it for end users. In the case of data lakehouses, a medallion architecture, as mentioned earlier, is recommended to prepare the data for general consumption. For data lakes, a compute engine such as a data warehouse, or some Spark/Hadoop infrastructure, is needed to transform the data into more consumable results.
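
For illustration, here is a minimal sketch of that produce-and-consume flow using the confluent-kafka Python client; the broker address, topic name, and payload are hypothetical.

```python
import json
from confluent_kafka import Producer, Consumer

# A service with real-time data publishes events to a topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "page-views",
    json.dumps({"user_id": 42, "page": "/pricing"}).encode("utf-8"),
)
producer.flush()

# A downstream service subscribes to the topic and acts on each event, for
# example forwarding it to a lake or warehouse before retention expires.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])

msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```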

A setup that requires constant recomputation comes with an inherent tradeoff. Real-time data is constantly arriving into the data lake or data lakehouse, so users need to choose between recomputing results often, which can be computationally expensive, or recomputing less frequently, which results in stale datasets. Another issue with this setup is that the computed results need to be stored as well. In the medallion architecture, for example, where raw data goes through multiple processing steps before it is ready for warehouse-like querying, this can mean storing the same data multiple times. The result is higher storage costs and higher latencies, as each processing step needs to be scheduled for recomputation.

Using Stream Processing to Prepare Streaming Data

This is where a stream processing solution, such as Apache Flink, becomes beneficial. Stream processing jobs are long-lived and can produce analytical results incrementally, as new data events arrive. Contrast this with the medallion architecture, where new result datasets need to be completely recomputed. By adding stream processing to the data stack, streaming data can be filtered, transformed, and aggregated before it ever arrives at the data lake, data warehouse, or data lakehouse layer. This results in lower computational costs and lower end-to-end latencies.
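
As a rough sketch of what this looks like with Apache Flink’s Table API in Python, the example below maintains a one-minute windowed count over a Kafka topic. The topic, fields, and connector options are hypothetical, and it assumes the Flink Kafka SQL connector is on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare the incoming stream as a table backed by a Kafka topic.
t_env.execute_sql("""
    CREATE TABLE page_views (
        user_id BIGINT,
        page STRING,
        view_time TIMESTAMP(3),
        WATERMARK FOR view_time AS view_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'page-views',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# This aggregate is maintained incrementally as events arrive, so only the
# already-summarized results need to land in the lake, warehouse, or lakehouse.
t_env.execute_sql("""
    SELECT window_start, page, COUNT(*) AS views
    FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(view_time), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end, page
""").print()
```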

One of the main burdens of Apache Flink and other stream processing frameworks is their complexity. Understanding how to develop, manage, scale, and provide fault tolerance for stream processing applications requires skilled personnel and time. With DeltaStream, we take all of that complexity away so that users can focus on their processing logic. DeltaStream is a fully managed serverless stream processing solution that is powered by Apache Flink. If you’re interested in how DeltaStream can help you manage your streaming data, schedule a demo with us or reach out to us on one of our socials.