14 Nov 2023


Key Components of a Modern Data Stack: An Overview

Data is one of the most valuable assets of any company in the digital age. Drawing insights from data can help companies better understand their market and make well-informed business decisions. However, to unlock the full potential of data, you need a data stack that can collect, store, process, analyze, and act on data efficiently and effectively.

When considering a data stack, it’s important to understand what your needs are and how your data requirements may change in the future, then make decisions that are best suited for your particular business. For almost every business, making the most out of data collection and data insights is essential, and you don’t want to find yourself stuck with a legacy data system. Data systems can quickly become outdated, and migrating to emerging technologies can be painful. Using modern cloud-based data solutions is one way to keep your data stack flexible and scalable, and ultimately to save time and money down the road.

In this blog post, we’ll cover the benefits of building a modernized data stack, the main components of a data stack, and how DeltaStream fits into a modern data stack.

What makes a Modern Data Stack and Why You Need One

The modern data stack is built on cloud-based services and low-code or no-code tools. In contrast to legacy systems, modern data stacks are typically built in a distributed manner and aim to improve on the flexibility, scalability, latency, and security of older systems.

How we build a data stack to manage and process data has evolved greatly in the last few years alone. Changing standards, new regulations, and the latest tech have all played a role in how we handle our data. Some of the core properties of a modern data stack include being:

  • Cloud-based: A cloud-based product is something that users can pay for and use over the internet, whereas before users would have to buy and manage their own hardware. The main benefit for users is scalability of their applications and reduced operational maintenance. Modern data solutions offer elastic scalability, meaning users can easily increase or decrease the resources running for their applications by simply adjusting some configuration. More recently, modern data solutions have been offering a serverless option, where users don’t need to worry about resources at all, and automatic scaling is handled behind the scenes by the solution itself. Compare this with legacy systems, where scaling up requires planning, hardware setup, and maintenance of the software and new hardware. Elastic scaling enables users to be flexible with the workloads they plan to run, as resources are always available to use. To ensure availability of their products, modern data solutions typically guarantee SLAs, so users can expect these products to work without worrying about system maintenance or outages. Without the burdens of resource management and system maintenance, users can completely focus on writing their applications, which will improve developer velocity and allow businesses to innovate faster.
  • Performant: In order for modern data solutions to be worthwhile, they need to be performant. Low latency, failure recovery, and security are all requirements for modern data solutions. While legacy systems may have been sufficient to meet the standards of the past, not all of them have evolved to meet the requirements of data workloads today. Many of the current modern data products utilize the latest open-source technologies in their domain. For example, Databricks utilizes and contributes back to the Apache Spark project, Confluent does the same for Apache Kafka, and MongoDB does the same for the MongoDB open source project. For modern data solutions that aren’t powered by open source projects, they typically feature advanced state-of-the-art software and provide benchmarks on how their performance compares to the status quo. Modern data solutions make the latest technologies accessible, enabling businesses to build powerful tools and applications that get the most out of their data.
  • Easy to use: The rapid development and advancement of technologies to solve emerging use cases has meant that the most powerful tech is often also the most complex and specialized. While these technologies have made it possible to solve many new use cases, they’re oftentimes only accessible to a handful of experts. Because of this, the modern trend is towards building democratized solutions that are low-code and easy to use. That’s why modern solutions abstract away the complexities of their underlying tech and expose easy-to-use interfaces to users. Consider AWS S3 as an example: users can use the AWS console to store or retrieve files without writing any code. However, behind the scenes, there is a highly complex system that provides strong consistency guarantees, high availability, virtually unlimited scale, and a low-latency experience for users. Businesses using modern data solutions no longer need to hire or train experts to manage these technologies, which ultimately saves time and money.

Components of a Data Stack

At a high level, a modern data stack consists of four components: collection, storage, processing and analysis/visualization. While most modern data products loosely fit into a single component, it’s also not uncommon for solutions to span multiple components. For example, data warehouses such as Databricks and Snowflake act as both Data Storage (with data governance) and Data Processing.

  • Data Collection: These are services or systems that ingest data from your data sources into your data platform. This includes data coming from APIs, events from user facing applications, and data from databases. Data can come structured or unstructured and in various data formats.
  • Data Storage: These are systems that hold data for extended periods of time, either to be kept indefinitely or for processing later on. This includes databases like MongoDB, or streaming storage platforms like RedPanda and Kinesis. Data warehouses and data lakes such as Snowflake and AWS S3 can also be considered data storage.
  • Data Processing: These are the systems that perform transformations on your data, such as enrichment, filtering, aggregations, and joins. Many databases and data warehouses have processing capabilities, which is why you may see some products in multiple categories below. For example, Snowflake is a solution where users can store their data and run SQL queries to transform the data in their tables. For stream processing use cases, users may look at products like DeltaStream that use frameworks such as Apache Flink to handle their processing.
  • Data Analysis and Visualization: These are the tools that enable you to explore, visualize, and share data insights. These include BI platforms, analytics software, and chart building software. Some popular tools include Tableau and Microsoft Power BI among others. Data analysis and visualization tools can help users draw insights from their data, discover patterns, and communicate findings.

Overview of the current landscape of modern data technologies

How DeltaStream fits into your Modern Data Stack

DeltaStream is a serverless data platform that unifies, governs, and processes real-time data. DeltaStream acts as both an organizational and compute layer on top of users’ data resources, specifically streaming resources. In the modern data stack, DeltaStream fits in the Data Storage and Data Processing layers. In particular, DeltaStream makes it easier to manage and process data for streaming and stream processing applications.

In the Data Storage layer, DeltaStream is a data platform that helps users properly manage their data through unification and governance. By providing proper data management tools, DeltaStream helps make the entire data stack more scalable and easier to use, allowing developers to iterate faster. Specifically, DeltaStream improves on the flat namespace model that most streaming storage systems, such as Kafka, currently have. DeltaStream instead uses a hierarchical namespace model such that data resources exist within a “Database” and “Schema”, and data from different storage systems can exist within the same database and schema in DeltaStream. This way, the logical representation of your streaming data is decoupled from the physical storage systems. Then, DeltaStream provides Role-Based Access Control (RBAC) on top of this relational organization of data so that roles can be created for certain permissions and users can inherit the roles that they’ve been granted. While DeltaStream isn’t actually storing any raw data itself, providing data unification and data governance to otherwise siloed data makes managing and securing all of a user’s streaming data easier. The graphic below is a good representation of DeltaStream’s federated data governance model.

DeltaStream federated data governance
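To make this concrete, the sketch below shows how streaming data might be organized and secured in this model. It is a hypothetical example: the statement and privilege names are illustrative, and DeltaStream’s actual DDL may differ.

-- Hypothetical sketch: organizing streaming data into a database/schema
-- hierarchy and granting access with RBAC. Statement and privilege names
-- are illustrative and may differ from DeltaStream's actual DDL.
CREATE DATABASE analytics;
CREATE SCHEMA analytics.clickstream;

-- A stream defined over a Kafka topic lives inside a database and schema,
-- decoupled from the physical storage system that holds the topic.
CREATE STREAM analytics.clickstream.pageviews (
  ts BIGINT, uid VARCHAR, pid VARCHAR
) WITH ('topic' = 'pageviews', 'value.format' = 'JSON');

-- A role is granted read access to the stream, and users inherit that role.
CREATE ROLE clickstream_reader;
GRANT SELECT ON analytics.clickstream.pageviews TO ROLE clickstream_reader;
GRANT ROLE clickstream_reader TO USER analyst;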

In the Data Processing layer, DeltaStream provides an easy-to-use SQL interface to define long-lived stream processing jobs. DeltaStream is a serverless platform, so fault tolerance, deployment, resource scaling, and operational maintenance of these queries are all taken care of by the DeltaStream platform. Under the hood, DeltaStream leverages Apache Flink, which is a powerful open source stream processing framework that can handle large volumes of data and supports complex data transformations. DeltaStream users get all the benefits of Flink’s powerful stream processing capabilities without having to understand Flink or write any code. With simple SQL queries, users can deploy long-lived stream processing jobs in minutes without having to worry about scaling, operational maintenance, or the complexities of stream processing. This fits well with the scalable, performant, and easy-to-use principles of modern data stacks.
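As a quick illustration, a continuous transformation in DeltaStream follows the CREATE STREAM … AS SELECT (CSAS) pattern. The stream and column names below are hypothetical; treat it as a sketch rather than exact syntax.

-- Hypothetical sketch: a long-lived job that continuously filters one
-- stream into another. Stream and column names are illustrative.
CREATE STREAM high_value_orders AS
SELECT order_id, customer_id, amount, order_ts
FROM orders
WHERE amount > 1000;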

Conclusion

In this blog post we covered the core components of a data stack and discussed the benefits of investing in a modern data stack. For streaming storage and stream processing, we discussed how DeltaStream fits nicely in the modern data stack for unifying, governing, and processing data. DeltaStream’s integrations with non-streaming databases and data warehouses such as PostgreSQL, Databricks, and Snowflake make it a great option to run alongside these already popular modern data products. If you have any questions about modern data stacks or are interested in learning more about DeltaStream, reach out to us or schedule a demo.

01 Nov 2023


A Guide to Repartitioning Kafka Topics Using PARTITION BY in DeltaStream

In this blog post we are going to highlight why and how to use the PARTITION BY clause in queries in DeltaStream. While we are going to focus on repartitioning Kafka data in this post, any data storage layer that partitions its data by key can benefit from PARTITION BY.

We will first cover how Kafka data partitioning works and explain why a Kafka user may need to repartition their data. Then, we will show off how simple it is to repartition Kafka data in DeltaStream using a PARTITION BY query.

How Kafka partitions data within a topic

Kafka is a distributed and highly scalable event logging platform. A topic in Kafka is a category of data representing a single log of records. Kafka is able to achieve its scalability by allowing each topic to have 1 or more partitions. When a particular record is produced to a Kafka topic, Kafka determines which partition that record belongs to and the record is persisted to the broker(s) assigned to that partition. With multiple partitions, writes to a Kafka topic can be handled by multiple brokers, given that the records being produced will be assigned to different partitions.

In Kafka, each record has a key payload and a value payload. In order to determine which partition a record should be produced to, Kafka uses the record’s key. Thus, all records with the same key will end up in the same topic partition. Records without a key are spread across partitions (round-robin, or sticky partitioning in newer Kafka versions).

Let’s see how this works with an example. Consider you have a Kafka topic called ‘pageviews’ which is filled with records with the following schema:

{
  ts: long,    // timestamp of pageviews event
  uid: string, // user id
  pid: string  // page id
}

The topic has the following records (omitting ts for simplicity):

If we partition by the uid field by setting it as the key, then the topic with 3 partitions will look like the following:

If we partition by the pid field by setting it as the key, then the topic with 3 partitions will look like the following:

Why repartition your Kafka topic

The relationship between partitions and consumers in Kafka for a particular application (consumer group) is such that each partition is read by at most one consumer, but a consumer can read from multiple partitions. What this means is that if our Kafka topic has 3 partitions and our consumer group has 4 consumers, then one of the consumers will sit idle. In the inverse case, if our Kafka topic has 3 partitions and our consumer group has 2 consumers, then one of the consumers will read from 2 partitions while the other reads from only 1 partition.

In most cases, users will set up their applications that consume from a Kafka topic to have a number of consumers that is a divisor of the number of partitions so that one consumer won’t be overloaded relative to other consumers. However, data can still be distributed unevenly to different partitions if there is a hotkey or poor partitioning strategy, and repartitioning may be necessary in these cases.

To showcase how data skew can be problematic, let’s look again at our pageviews example. Imagine that half of the records have a pid value of A and we partition by the pid field. In a 3-partition topic, ~50% of the records will be sent to one partition while the other two partitions each get ~25% of the records. While data skew might hurt performance and reliability for the Kafka topic itself, it can also make life difficult for downstream applications that consume from this topic. With data skew, one or more consumers will be overloaded with a disproportionate amount of data to process. This can have a direct impact on how well downstream applications perform and result in problems such as heavily out-of-order records, exploding application state sizes, and high latencies (see what Apache Flink has implemented to address some of the problems caused by data skew in sources). By repartitioning your Kafka topic and picking a field with more balanced values as the key to partition your data, data skew can be reduced if not eliminated.

Another reason you may want to repartition your Kafka data is to align your data according to its context. In the pageviews example, if we choose the partition key to be the uid field, then all data for a particular user id will be sent to the same partition and thus the same Kafka broker. Similarly, if we choose the partition key to be the pid field, then all data for a particular page id will be sent to the same partition and Kafka broker. If our use case is to perform analysis based on users, then it makes more sense to partition our data using uid rather than pid, and downstream applications will actually process data more efficiently.

Consider a case where we are counting the number of pages a user visits in a certain time window and the topic is partitioned by pid. If the application that aggregates the data has 3 parallel threads to perform the aggregation, each of these threads will be required to read records from all partitions, as the data belonging to a particular uid can exist in many different partitions. If our topic was partitioned by uid instead, then each thread could process data from its own distinct set of partitions, as all data for a particular uid would be available in a single partition. Stream processing systems like Flink and Kafka Streams require some kind of repartition step in their jobs to handle cases where operator tasks need to process data based on a key and the source Kafka topic is not partitioned by that key. In the case of Flink, the source operators need to map data to the correct aggregation operators over the network. The disk I/O and network traffic involved for stream processing jobs to repartition and shuffle data can become very expensive at scale. By properly partitioning your source data to fit the context, you can avoid this overhead for downstream operations.
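For illustration, the per-user count described above could be written as a windowed aggregation similar to the following Flink-SQL-style sketch. The window syntax varies by engine and version, and it assumes ts has been exposed as an event-time attribute, so treat it as a sketch rather than exact syntax.

-- Sketch: count distinct pages viewed per user in 1-minute tumbling windows.
-- Assumes ts is declared as an event-time (rowtime) attribute; exact window
-- syntax varies by engine and version.
SELECT
  uid,
  COUNT(DISTINCT pid) AS pages_viewed,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start
FROM pageviews
GROUP BY uid, TUMBLE(ts, INTERVAL '1' MINUTE);

When the source topic is already keyed by uid, all records for a given user arrive on the same partition, so an aggregation like this can be computed without an expensive network shuffle.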

PARTITION BY in DeltaStream

Now the question is, how do I repartition or rekey my Kafka topic? In DeltaStream, it’s made simple by PARTITION BY. Given a Kafka topic, you can define a Stream on this topic and write a single PARTITION BY query that rekeys the data and produces the results to a new topic. Let’s see how to repartition a keyless Kafka topic ‘pageviews’.

First, define the ‘pageviews’ Stream on the ‘pageviews’ topic by writing a CREATE STREAM statement:

CREATE STREAM pageviews (
  ts BIGINT, uid VARCHAR, pid VARCHAR
) WITH (
  'topic' = 'pageviews', 'value.format' = 'JSON'
);

Next, create a long-running CREATE STREAM AS SELECT (CSAS) query to rekey the ‘pageviews’ Stream using uid as the partition key and output the results to a different Stream:

CREATE STREAM pageviews_keyed AS
SELECT *
FROM pageviews
PARTITION BY uid;

The output Stream, ‘pageviews_keyed’, will be backed by a new topic with the same name. If we PRINT the input ‘pageviews’ topic and the output ‘pageviews_keyed’ topic, we can see the input has no key assigned and the output has the uid value defined as the key.

db.public/cc_kafka# PRINT TOPIC pageviews;
| {"ts":46253951,"uid":"User_7","pid":"Page_90"}
| {"ts":46253961,"uid":"User_5","pid":"Page_18"}
| {"ts":46253971,"uid":"User_9","pid":"Page_64"}

db.public/cc_kafka# PRINT TOPIC pageviews_keyed;
{"uid":"User_6"} | {"ts":46282721,"uid":"User_6","pid":"Page_79"}
{"uid":"User_7"} | {"ts":46282731,"uid":"User_7","pid":"Page_56"}
{"uid":"User_1"} | {"ts":46282741,"uid":"User_1","pid":"Page_70"}

As you can see, with a single query, you can repartition your Kafka data using DeltaStream in a matter of minutes. This is one of the many ways we remove barriers to make building streaming applications easy.
If you want to learn more about DeltaStream or try it for yourself, you can request a demo or join our free trial.

24 Oct 2023


Seamless Data Flow: How Integrations Enhance Stream Processing

Data processing systems can be broadly classified into two categories: batch processing and stream processing. Enterprises often use both streaming and batch processing systems because they serve different purposes and have distinct advantages, and using them together can provide a more comprehensive and flexible approach to data processing and analysis. Stream processing platforms help organizations process data as close to real time as possible, which is important for latency-sensitive use cases. Some examples include monitoring IoT sensors, fraud detection, and threat detection. Batch processing systems are useful for a different set of situations – for example, when you want to analyze past data to find patterns or handle a lot of data from different sources.

Integrating between systems

The need for both stream and batch processing is exactly why you see companies implementing a “lambda architecture”, which brings the best of both worlds together. In this architecture you generally see batch processing and stream processing systems working together to serve multiple use cases. So for these enterprises it’s important to be able to:

  1. Move data between these systems seamlessly – ETL
  2. Make data products available to users in their preferred format/platform
  3. Continue using existing systems while leveraging new technologies to improve overall efficiency

Having the ability to process and prepare data as soon as it arrives for downstream consumption is an extremely critical function in the data landscape. For this you need to be able to (1) EXTRACT data from source systems for processing and, after processing, (2) LOAD the data into your platform of choice. This is precisely where stream processing platforms and integrations come into the picture. While integrations help you extract and load data, stream processing helps you with the TRANSFORM step.

Every enterprise uses multiple disparate systems, each serving its own purpose, to manage its data. For these enterprises, it is important for data teams to produce data products in real time and serve them across the multiple data platforms that are in use. This will require a certain level of sophisticated processing, data governance, and integration with commonly used data systems.

Real-world integration uses

Let’s take a look at an example which can help us understand how companies manage their data across multiple platforms and how they process, access and transfer data across them.

Consider a scenario at a bank. You have an incoming stream of transactions from Kafka. This stream of transactions is connected to DeltaStream, where you can analyze transactions as they arrive, flag fraudulent activity in real time based on predefined rules, and alert your users as soon as possible. This is extremely time sensitive, and a stream processing platform is best suited for such use cases.
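As a rough illustration, such a real-time flagging job could be expressed as a continuous SQL query. The stream, column names, and the rule itself are hypothetical.

-- Hypothetical sketch: flag suspicious transactions as they arrive.
-- Stream names, columns, and the threshold rule are illustrative only.
CREATE STREAM flagged_transactions AS
SELECT txn_id, account_id, amount, txn_ts
FROM transactions
WHERE amount > 10000                        -- unusually large transaction
   OR merchant_country <> home_country;     -- transaction far from home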

Now, other teams within the bank, such as the marketing team, may want to understand trends based on customer activity and customize how they market their products to a given customer. For this, you need to look at transactions going back a week or a month and process them all to generate enough context. Instead of going back to the ‘source’ system, you can now have DeltaStream send all the processed transactions, in the right format, to your batch processing systems using our integrations so that you can:

  1. Have the data ready in the right format & schema for processing 
  2. Reduce the back and forth between multiple platforms as data transits the pipeline just once – reduction in data transfer costs 
  3. Eliminate the duplication of data sets across multiple platforms for processing
  4. Easily manage compliance – for example, by reducing the footprint of your PII data.

By integrating the batch processing and stream processing systems, the entire pipeline becomes more manageable, and the complexity of managing data across different systems is reduced.

Integrating with DeltaStream

It’s evident that we need integration across different platforms to enable data teams to process and manage data. DeltaStream provides all the ingredients required for such an operation. Our stream processing platform is powered by Apache Flink. DeltaStream’s RBAC enables you to govern and securely share data products. The integrations with Databricks and Snowflake allow data products to be used in the data systems that your teams are already using. With the launch of these integrations, you can do MORE with your data. To unlock the power of your streaming data, reach out to us or request a free trial!

30 Aug 2023


Choosing Serverless Over Managed Services: When and Why to Make the Switch

Consider a storage service where you store and retrieve files. In on-prem environments, HDFS (Hadoop Distributed File System) has been one of the most common storage platforms. As with any service in an on-prem environment, you as a user are responsible for all aspects of operations. This includes bringing up the HDFS cluster, scaling the cluster up and down based on usage requirements, and dealing with different types of failures, including but not limited to server failures and network failures. Of course, ideally you would like to just use the storage service and focus on your business without dealing with the complexity of operating such infrastructure. If you use a cloud environment instead of on-prem, you have the option of using storage services provided by a variety of vendors rather than running HDFS in the cloud yourself. However, there are different ways to provide the same service in the cloud, and they can significantly affect the user experience and ease of use.

A Look at Managed Services

Let’s go back to our storage service and assume we are now using a cloud environment, where we can take advantage of services offered by vendors instead of running our required services ourselves. One common option is to provide a managed version of the on-prem software. In this case, the service provider takes the same platform that is used in the on-prem environment and makes some improvements to run it in the cloud. While this takes away some of the operational burden, the user is still involved in many other aspects of operating such managed services. For the storage service we are considering here, a managed HDFS service would be an example of this approach. When using a “fully managed” cloud HDFS, as a user you still have to make decisions such as provisioning an HDFS cluster through the service. This means that you need to have a good understanding of the amount of storage resources you will be using and let the service provider know if you need more or less of those resources after provisioning the initial cluster. Requiring the user to provide such information often results in confusion; in most cases the initial decision won’t be accurate, and as usage continues the provisioned resources will need to be adjusted. You cannot expect a user to accurately know how much storage they will need in the next six months or a year.

In addition, the notion of a cluster brings many limitations. A cluster has a finite amount of resources, and as usage continues, those resources get consumed and more are needed. In the case of our “managed” HDFS service, the provisioned storage (disk space) is one of the limited resources, and as more and more data is stored in the cluster, the available storage shrinks. At this point, the user has to decide between scaling up the existing cluster by adding more nodes and disk space or adding a new cluster to accommodate the growing need for storage. To avoid such issues, users may over-provision resources, which in turn can result in unused resources and extra cost. Finally, once a cluster is provisioned, the user starts incurring the cost of the whole cluster regardless of whether half or all of the cluster’s resources are utilized. In short, a managed cloud service in most cases puts the burden of resource provisioning on the user, which in turn requires the user to have a deep understanding of the required resources not just for now, but for the short-term and long-term future.

Unlocking Cloud Simplicity: The Serverless Advantage

Now let’s assume that instead of taking the managed service path, a vendor takes a different route and builds a new cloud-only storage service from the ground up, where all the complexity and operations are handled under the hood by the vendor and users don’t have to think about complexities such as resource provisioning as described above. In the case of storage, object store services such as AWS S3 are great examples of this approach, which is called serverless. As an S3 user, you just need to set up buckets and folders and read and write your files. There is no need to manage anything or provision any cluster, and no need to worry about having enough disk space or nodes. All operational aspects of the service, including making sure the service is always available with the required storage space, are handled by S3. This is a huge win for users since they can focus on building their applications instead of worrying about provisioning and sizing clusters correctly. With such simplicity of use, we can see why almost every cloud user uses object stores such as S3 for their storage needs unless there is an explicit requirement to use anything else. S3 is a great example of the superiority of cloud-native serverless architecture over a “fully managed” version of on-prem products.

Another benefit of serverless platforms such as S3 compared to managed services is that S3 enables users to access, organize, and secure their data in one place instead of dealing with multiple clusters. In S3 you can organize your data in buckets and folders, have a unified view of all of your data, and control access to your data in one place. The same cannot be said for a managed HDFS service if you have more than one cluster! In that case, users have to keep track of which cluster has which data and how to control access to data across multiple clusters, which is a much more complex and error-prone process.

Choosing Serverless for Stream Processing

We can make the same argument in favor of serverless offerings over managed services for many other platforms, including stream processing and streaming databases. You can have “fully managed” services where the user has to provision clusters and specify the amount of resources each cluster will have before writing any query. Indeed, managed services for stream processing involve much more complexity than the managed HDFS example explained above. The cluster in the stream processing case is shared among multiple queries, which means imperfect isolation and the possibility of one bad query bringing down the whole cluster, disrupting the other queries even though they were healthy and running with no issues. To exacerbate the situation, streaming queries are long-running jobs, so as you add more of them to the cluster, eventually all of the cluster’s resources will be used and you will need to scale up your cluster or launch a new cluster to accommodate newer queries. The first option results in a larger cluster with more queries sharing the same cluster resources and potentially interfering with one another.

The second option, on the other hand, results in a growing number of clusters to keep track of, along with tracking which query is running on which cluster. So anytime you have to provision or declare a cluster in a “fully managed” stream processing or streaming database service, you are dealing with a managed service along with the above-mentioned restrictions and many more. Even worse, once you provision a cluster, billing for the cluster starts regardless of whether zero or several queries are running in it.

We built DeltaStream as a serverless platform because we believe that such a service should be as easy to use as S3. You can think of DeltaStream as the S3 of stream processing. In DeltaStream there is no notion of a cluster or provisioning. You just connect to your streaming stores such as Apache Kafka or AWS Kinesis and you are up and running, ready to run queries. You only pay for queries that are running, and since there is no concept of a cluster, you won’t be charged for idle or underutilized clusters! Focus on building your streaming applications and pipelines and leave the infrastructure to us. Launch as many queries as you want; there is no notion of running out of resources. Your queries run in isolation, and we can scale them up and down independently without them interfering with each other.
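As a rough sketch of that workflow, the statements below register a streaming store, declare a stream over an existing topic, and run a query against it. The property names are placeholders and the exact DDL may differ; treat it as an illustration of the “connect and query” experience rather than precise syntax.

-- Rough sketch of the serverless workflow: register a store, declare a
-- stream, and query it. Property names are placeholders; exact DDL may differ.
CREATE STORE my_kafka WITH (
  'type' = 'KAFKA',
  'uris' = 'broker1:9092'
);

CREATE STREAM pageviews (
  ts BIGINT, uid VARCHAR, pid VARCHAR
) WITH ('store' = 'my_kafka', 'topic' = 'pageviews', 'value.format' = 'JSON');

SELECT uid, pid FROM pageviews WHERE pid = 'Page_1';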

DeltaStream is a unified platform that provides stream processing (streaming analytics) and a streaming database in one place. You can build streaming pipelines, event-based applications, and always up-to-date materialized views with familiar SQL syntax. In addition, DeltaStream enables you to organize and secure your streaming data across multiple streaming storage systems. If you are using any flavor of Apache Kafka, including Confluent Cloud, AWS MSK, or RedPanda, or if you are using AWS Kinesis, you can now try DeltaStream for free by signing up for our free trial.

22 Jun 2023


What is Streaming ETL and how does it differ from Batch ETL?

In today’s data-driven world, organizations are seeking effective and reliable ways to extract insights to make timely decisions based on the ever-increasing volume and velocity of data. ETL (Extract, Transform, Load) is a process where data is extracted from various sources, transformed to fit specific requirements, such as cleaning, formatting, and aggregating, and loaded into a target system or data warehouse. ETL ensures data consistency, quality, and usability, and enables organizations to effectively analyze their data. Traditional batch processing, while effective for certain use cases, falls short in meeting the demands of real-time and event-driven data processing and analysis. This is where streaming ETL emerges as a powerful solution.

Streaming ETL: An Overview

Unlike traditional batch processing, which operates at fixed intervals, streaming ETL operates on a continuous stream of data as records arrive, allowing for real-time analysis. A streaming ETL pipeline begins with the continuous data ingestion phase, in which records are collected from different sources, ranging from databases to event streaming platforms like Apache Kafka and Amazon Kinesis. Once data is ingested, it goes through real-time transformation operations for cleaning, normalization, enrichment, etc. Stream processing frameworks such as Apache Flink, Kafka Streams, ksqlDB, and Apache Spark provide tools and APIs to apply these transformations to prepare the data. The same frameworks allow processing data in real time and support operations such as real-time aggregations and complex machine learning operations. Finally, the results of the streaming ETL are delivered to downstream systems and applications for immediate consumption, or they are stored in data warehouses and data stores for future use.
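For a sense of what the transformation step can look like, the sketch below expresses a simple streaming ETL transformation as a continuous SQL query that cleans, filters, and lightly normalizes records. All names are illustrative, and the exact dialect depends on the stream processing engine.

-- Sketch of a streaming ETL transformation as a continuous SQL query:
-- cleaning, filtering, and light normalization. All names are illustrative.
CREATE STREAM clean_orders AS
SELECT
  order_id,
  UPPER(country_code) AS country_code,            -- normalize casing
  CAST(amount AS DECIMAL(10, 2)) AS amount,       -- clean up types
  order_ts
FROM raw_orders
WHERE order_id IS NOT NULL                        -- drop malformed records
  AND amount > 0;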

Streaming ETL can be applied in various domains, including fraud detection and prevention, real-time analytics and personalization for targeted advertisements, and IoT data processing and monitoring to handle high velocity and volume of data generated by devices such as sensors and smart appliances.

Why should you use Streaming ETL?

Streaming ETL offers numerous advantages in real-time data processing. Here is a list of the most important ones:

  • Streaming ETL provides real-time insights into emerging trends, anomalies, and critical events in the data as they happen. It operates with low latency and ensures the processing results are up to date, reducing the gap between the time data arrives and the time it is processed. This facilitates accurate and timely decision-making, and enables organizations to capitalize on time-sensitive opportunities or address emerging issues promptly.
  • Streaming ETL frameworks are designed to scale horizontally, which is crucial for handling increased data volumes and processing requirements in real-time applications. This elasticity allows for seamless scaling of resources based on demand, and enables the system to manage spikes in data volume and velocity without sacrificing the performance.

With all its advantages, Streaming ETL also presents some challenges:

  • The streaming ETL process typically introduces additional complexity to data processing. Real-time data ingestion, continuous transformations, and persisting results while maintaining performance and data consistency require careful design and implementation.
  • Streaming ETL pipelines run in a distributed streaming environment, which introduces new challenges to data processing. Unless an appropriate delivery guarantee such as exactly-once or at-least-once is used, there is a risk of delay or data loss during the ingestion and delivery stages, due to parallel and asynchronous processing when applying transformations. Ordering events and maintaining data consistency are complex in such situations, and if not handled properly, they may impact the accuracy of certain computations that rely on event order. Using fault-tolerant mechanisms, such as replication, checkpointing, and backup strategies, is essential to prevent data loss and ensure reliability and correctness of results.

Using a modern stream processing platform, such as DeltaStream, can help address the above challenges and enable organizations to benefit from all the advantages of Streaming ETL.

Differences between Streaming ETL and Batch ETL

Data processing model: Batch ETL starts with collecting large volumes of data over a time period and processes these batches in fixed intervals. Therefore, it applies transformations on the entire dataset as a batch. Streaming ETL operates on data as it arrives in real-time and continuously processes and transforms data as individual records or small chunks.

Latency: Batch ETL introduces inherent latency since data is processed in intervals. This latency normally ranges from minutes to hours. Streaming ETL processes data in real-time and offers low latency. Results are available immediately and are updated continuously.

Data volume and velocity: Batch ETL is well-suited for processing large volumes of data collected over time. Therefore, it is effective when dealing with historical data. Streaming ETL, on the other hand, is designed for high-velocity data streams and is effective for use cases that require immediate processing.

Processing frameworks: Batch ETL typically utilizes frameworks like Apache Hadoop, Apache Spark, and traditional ETL and data warehouse tools. These frameworks are optimized for processing large volumes of data in parallel, but not necessarily for real-time use cases. Streaming ETL leverages specialized stream processing frameworks such as Apache Flink and Kafka Streams. These frameworks are optimized for processing continuous streams of data in real time. With recent changes, some of these frameworks, such as Apache Flink, are now capable of processing batch workloads too. As these efforts and improvements continue, the overlap between the frameworks that process these two kinds of workloads is expected to grow.

Fault tolerance: Batch ETL typically processes large volumes of data in fixed intervals. If a failure occurs, all the data within that batch may be affected, which could lead to partially written results. This makes failure recovery challenging in Batch ETL, as it involves cleaning up partial results and reprocessing the entire batch. Removing partial results and state and starting a new run is a complicated process that normally involves manual intervention. Reprocessing a batch is time-consuming and resource-intensive and can take so long that processing of the next batch is affected, since producing results for the current batch has fallen behind. Moreover, rerunning some tasks could have unexpected side effects which may impact the correctness of final results. Such issues need to be handled properly during a job restart.

Streaming ETL does not involve many jobs that run repeatedly over time; instead, a single long-running job maintains its state and performs incremental computation as data arrives. Therefore, Streaming ETL is generally better equipped to handle failures and partial results. Given that results are generated incrementally in the Streaming ETL case, a failure does not force discarding already generated results and reprocessing sources from the beginning. Stream processing frameworks provide transaction-like processing, exactly-once semantics, and write-ahead logs, ensuring atomicity and data consistency. They have built-in mechanisms for fault recovery, handle out-of-order events, and ensure end-to-end reliability by leveraging distributed messaging systems.

Choosing Between Streaming ETL and Batch ETL

There are several factors to consider when deciding between Streaming ETL and Batch ETL for a data processing use case. The most important factor is latency requirements. Consider the desired latency of insights and actions. If a real-time response is critical, then Streaming ETL is the correct choice. The other important factor is data volume and velocity, along with the cost of processing. You should evaluate the volume of data and the rate at which it arrives. Streaming ETL is capable of processing fast data immediately. However, due to its inherent complexity and higher resource demand, it is more difficult to maintain.

Streaming ETL also introduces challenges related to maintaining the correct order of events, especially in distributed environments. Batch ETL processes data in a much more deterministic and mostly sequential manner, which ensures data consistency and ordered processing. A modern stream processing platform is a viable solution to handle these challenges and difficulties when picking Streaming ETL as a solution. Finally, you need to consider how often the data sources evolve over time in your use case as that can impact the structure of incoming records. Data processing pipelines need to handle schema evolution properly to prevent disruptions and errors. Managing schema changes, versioning, and implementing schema inference mechanisms become crucial to ensure correctness and reliability. Using a stream processing framework enables you to address these changes in a streaming ETL pipeline, which is intended to run continuously with no interruption.

Conclusion

Choosing between streaming ETL and batch ETL requires a thorough understanding of the specific requirements and trade-offs of each. While both approaches have their strengths and weaknesses, they are effective in different use cases and for different data processing needs. Streaming ETL offers real-time processing with low latency and high scalability to handle high-velocity data. On the other hand, batch ETL is well-suited for historical analysis, scheduled reporting, and scenarios where near-real-time results are not critical. In this blog post, we covered the specifics as well as the pros and cons of each approach, and explained the important factors to consider when deciding which one to choose for a given data processing use case.

DeltaStream provides a comprehensive stream processing platform to manage, secure, and process all your event streams. You can easily use it to create your streaming ETL solutions. It is easy to operate, and it scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking out our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

05 Jun 2023


All About Streaming Data Mesh

The Current State of Streaming Data

The adoption of streaming data technologies has grown exponentially over the past 5 years. Every industry understands the importance of accessing and understanding data in real time, so they can make decisions about their products, services, and customers. The sooner you can access an event – whether it’s a customer purchasing an item or an IoT device relaying data – the faster you can process and react to it. Making all this happen in real time is where technologies like Apache Kafka, AWS Kinesis, Google Pub/Sub, Apache Flink, and many more come into play. Most of these are available as a managed service, making it easy for companies to adopt them. For example, Confluent and AWS offer Apache Kafka as a managed service.

With rapid adoption of these services, many companies now have access to real-time data.

By their very nature, technologies like Apache Kafka interact with multiple types of data producers and consumers. While data lakes and data warehouses meant storing large amounts of data in centralized repositories, data streaming technologies have always promoted decentralization of services and the data behind them. As data sets transit these systems, being able to manage, access, and self-provision data is extremely important. This is where Data Mesh comes into the picture.

What is a Data Mesh?

Data Mesh, as first defined by Zhamak Dehghani, is a modern data management architecture that facilitates and promotes a decentralized way to manage, access, and share data. The four core principles of a data mesh are:

  • Domain ownership 
  • Data as a product
  • Self-serve data platform 
  • Federated computational governance

Image from datamesh-architecture.com

A Data Mesh gives you the agility to work with the data that flows in and out of complex systems across multiple organizations. It reduces dependencies and removes bottlenecks that arise while accessing data; thereby improving the organization’s ability to respond and make business decisions.

While there are tools and technologies that can help you build a Data Mesh over “Data at Rest”, it isn’t the same with “Data in Motion”. This is where DeltaStream comes into the picture, enabling organizations to develop a Data Mesh over and across all their streaming data – which could span multiple platforms and multiple public cloud providers (Apache Kafka, Confluent Cloud, Kinesis, Pub-Sub, RedPanda, etc.).

Streaming Data Mesh with DeltaStream

What we have built at DeltaStream is an extremely powerful framework which provides a single platform to access your operational and analytical data, in real time, across your entire organization in a decentralized manner. No more expensive pipelines, duplicating your data, or building central teams to manage data lakes and data warehouses.

Let’s revisit the core tenets of Data Mesh and how DeltaStream helps you achieve each one of those and more.

Domain Ownership

  • In DeltaStream, the data boundaries are clearly defined using our namespaces and RBAC. Each stream is isolated with strict access control, and any query against that stream inherits the same permissions and boundaries. This way, your core data sets remain under your purview and you can control what others can access. The queries are isolated at the infrastructure level, which means queries are scaled independently, reducing cost and operational complexity. With all these controls in place, the core objective of decentralized ownership of data streams is achieved while keeping your data assets secure.

Data as a Product

  • Making data discoverable and usable while maintaining quality is central to serving data as a product to your internal teams. Being able to do it as close to the source as possible is extremely beneficial and in line with the data mesh philosophy of domain owners owning the quality of data. This is exactly what you can achieve with DeltaStream. Using our powerful SQL interface you can quickly alter and enrich your data and get it ready to use in real time. As your data models change, DeltaStream’s Data Catalog, along with the schema registry, can track your data’s evolution and help you iterate on your data assets as you grow and evolve.

Self-serve Data Platform

  • Once you achieve “Domain ownership” and “Data as a Product”, self-service follows. The main objective of self-service is to remove the need for a centralized team to coordinate data distribution across the entire company. Leveraging DeltaStream’s catalog, you can make your high-quality data streams discoverable, and combining this with our RBAC, you can securely share them with both internal and external users. This means the data will continuously flow from its source systems to end users without ever needing to be stored, extracted, or loaded. This is very powerful in that you get to securely democratize streaming data across your company, while also having the ability to share it with users outside of your organization.

Federated Computational Governance

  • With everything decentralized – from data ownership to data delivery – governance becomes a very important requirement. This is to ensure that data originating in a particular domain is consumable in any part of the organization. Schemas and a schema registry go a long way toward ensuring that. Data, just like the organization it serves, changes and evolves over time. Integrating your data streams with a schema registry is critical to maintaining a common language for communication. Also, interoperability is a key component of governance, and DeltaStream’s ability to operate across multiple streaming stores is a great capability.

Conclusion

Decentralization of data ownership and delivery is critical for organizations to be nimble. As Data Mesh gains traction, there are tools and technologies readily available to implement it over your data at rest. But the same can’t be said when you are dealing with streaming data. This is what DeltaStream provides. With tooling built around its core processing framework, it enables companies to implement a data mesh on top of streaming data.

As streaming data continues its rapid growth, it is extremely important for organizations to set up the right foundations to maximize the value of their real-time data. While architectures like Data Mesh provide the right framework, platforms like DeltaStream offer a lot to bring it all together.

We offer a free trial of DeltaStream that you can sign up for to start using DeltaStream today. If you have any questions, please feel free to reach out to us.

10 May 2023


Why Apache Flink is the Industry Gold Standard

What is Apache Flink?

Apache Flink is a distributed data processing engine that is used for both batch processing and stream processing. Although Flink supports both batch and streaming processing modes, it is first and foremost a stream processing engine used for real-time computing. By processing data on the fly as a stream, Flink and other stream processing engines are able to achieve extremely low latencies. Flink in particular is a highly scalable system, capable of ingesting high throughput data to perform stateful stream processing jobs while also guaranteeing exactly-once processing and fault tolerance. Initially released in 2011, Flink has grown into one of the largest open source projects. Large and reputable companies such as Alibaba, Amazon, Capital One, Uber, Lyft, Yelp, and others have relied on Flink to provide real-time data processing to their organizations.

Flink’s place in the data industry

Before stream processing, companies relied on batch jobs to fulfill their big data needs, where jobs would only perform computations on a finite set of data. However, many use cases don’t naturally fit a batch model. Consider fraud detection for example, where a bank may want to act if anomalous credit card behavior is detected. In the batch world, the bank would need to collect credit card transactions over a period of time, then run a batch job over that data to determine if a transaction is potentially fraudulent. Since the batch job runs after the data is collected, it might be a few hours before a transaction is marked as fraudulent, depending on how frequently the job is scheduled. In the streaming world, transactions can be processed in real time, and the latency to action is much lower. Being able to freeze credit card accounts in real time can save time and money for the bank. You can read more about the distinction between streaming and batch in our other blog post.

Without Apache Flink or other stream processing engines, you could still write code to read from a streaming message broker (e.g. Kafka, Kinesis, RabbitMQ) and do stream processing from scratch. However, with this approach you would miss out on many of the features that modern stream processing engines give you out of the box, such as fault tolerance, the ability to perform stateful queries, and APIs that vastly simplify processing logic. There are many stream processing frameworks in addition to Apache Flink, such as Kafka Streams, ksqlDB, Apache Spark (Spark Streaming), Apache Storm, Apache Samza, and dozens of others. Each has its own pros and cons, and comparing them could easily become its own blog post, but Apache Flink has become an industry favorite because of these key features:

  • Flink’s design is truly based on processing streams instead of relying on micro-batching
  • Flink’s checkpointing system and exactly-once guarantees
  • Flink’s ability to handle large throughputs of data from many different types of data sources at a low latency
  • Flink’s SQL API, which leverages Apache Calcite, allows users to define their Flink jobs using SQL, making it easier to build applications

How Apache Flink works

Flink is designed to be a highly scalable distributed system, capable of being deployed in most common setups, that can perform stateful computations at in-memory speed. To get a high-level understanding of Flink’s design, we can look further into the Flink ecosystem and execution model.

The Flink ecosystem can be broken down into 4 layers:

  • Storage: Flink doesn’t have its own storage layer and instead relies on a number of connectors to connect to data. These include but are not limited to HDFS, AWS S3, Apache Kafka, AWS Kinesis, and SQL databases. You can similarly use managed products such as Amazon MSK, Confluent Cloud, and Redpanda among others.
  • Deploy: Flink is a distributed system and integrates seamlessly with resource managers like YARN and Kubernetes, but it can also be deployed in standalone mode or in a single JVM.
  • Runtime: Flink’s runtime layer is where the data processing happens. Applications can be split into tasks that are parallelizable and thus highly scalable across many cluster nodes. With Flink’s incremental and asynchronous checkpointing algorithm, Flink is also able to provide fault tolerance for applications with large state and provide exactly-once processing guarantees while minimizing the effect on processing latency.
  • APIs and Libraries: The APIs and libraries layer is provided to make it easy to work with Flink’s runtime. The DataStream API can be used to declare stream processing jobs and offers operations such as windowing, updating state, and transformations. The Table and SQL APIs can be used for batch processing or stream processing. The SQL API allows users to write SQL queries to define their Flink jobs, and the Table API allows users to call functions on their data based on the Table schemas used to define them.
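To make the SQL API concrete, here is a small Flink SQL sketch that declares a Kafka-backed table with an event-time watermark and runs a windowed aggregation over it. Connector properties vary by Flink version, so treat the WITH options as illustrative.

-- Declare a Kafka-backed table with an event-time attribute and watermark.
-- Connector properties vary by Flink version; treat them as illustrative.
CREATE TABLE pageviews (
  uid STRING,
  pid STRING,
  ts  TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'pageviews',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- Count page views per user in 1-minute tumbling windows.
SELECT
  uid,
  COUNT(*) AS views,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start
FROM pageviews
GROUP BY uid, TUMBLE(ts, INTERVAL '1' MINUTE);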

In the execution architecture, there is a client and the Flink cluster. Two kinds of nodes make up the Flink cluster – the Job manager and the Task manager. There is always a Job manager and any number of Task managers.

  • Client: Responsible for generating the Flink job graph and submitting it to the Job manager.
  • Job manager: Receives the job graph from the Client then schedules and oversees the execution of the job’s tasks on the Task managers.
  • Task manager: Responsible for executing all the tasks assigned to it by the Job manager and reporting status updates back to the Job manager.

In Figure 1, you’ll notice that storage and deployment are somewhat detached from the Flink runtime and APIs. The benefit of designing Flink this way is that Flink can be deployed in any number of ways and connect to any number of storage systems, making it a good fit for almost any tech stack. In Figure 2, you’ll notice that there can be multiple Task managers and they can be scaled horizontally, making it easy to add resources to run jobs at scale.

What makes Apache Flink the data industry gold standard

Apache Flink has become the most popular real-time computing engine and the industry gold standard for many reasons, but its ability to process large volumes of real-time data with low latency is its main draw. Other features include the following:

  • Stateful processing: Flink is a stateful stream processor, allowing for transformations such as aggregations and joins of continuous data streams.
  • Scalability: Flink can scale up to thousands of Task manager nodes to handle virtually any workload without sacrificing latency.
  • Fault tolerance: Flink’s state of the art checkpointing system and distributed runtime makes it both fault tolerant and highly available. Checkpoints are used to save the application state, so when your application inevitably encounters a failure, the application can be resumed from the latest checkpoint. For applications that use transactional sinks, exactly-once end to end processing can be achieved. Also, since Flink is compatible with most resource management systems like YARN and Kubernetes, when nodes crash they will automatically be restarted. Flink also has high availability setup options based on Apache Zookeeper to help prevent failure states as well.
  • Time mechanisms: Flink has a concept called “watermarks” for working with event time; a watermark represents the application’s progress in event time. Different watermark strategies can be applied to configure how out-of-order or late data is handled. Watermarks tell the system when it is safe to close windows and emit results for stateful operations (illustrated in the SQL sketch after this list).
  • Connector ecosystem: Since Flink uses external systems for storage, it has a rich ecosystem of connectors. Wherever your data lives, Flink likely already has a library to connect to it. Popular connectors include Apache Kafka, AWS Kinesis, Apache Pulsar, JDBC (for relational databases), Elasticsearch, and Apache Cassandra. If the connector you need doesn’t exist, Flink provides APIs to help you build a new one.
  • Unified batch and streaming: Flink is both a batch and a stream processor, making it an ideal candidate for applications that need both kinds of processing. With Flink, you can use the same application logic for both cases, reducing the headache of managing two different technologies.
  • Large open source community: In 2022, Flink had over 1,600 contributions, and the GitHub project currently has over 21,000 stars. To put this in perspective, Apache Kafka currently has around 24,800 stars and Apache Hadoop around 13,500. Having a large and active open source community is extremely valuable for fixing bugs and staying up to date with the latest tech requirements.
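As a small illustration of the event-time and unified-execution points above, the sketch below reuses the hypothetical orders table from earlier. The watermark declared in its DDL tells Flink when it can close each ten-minute window, and the same statement can also run as a bounded batch job by switching Flink’s execution mode, assuming a bounded source such as files.

```sql
-- Event-time tumbling-window aggregation over the hypothetical orders table.
-- The watermark on order_time governs when each window is closed and emitted.
SELECT window_start, window_end, customer_id, COUNT(*) AS order_count
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end, customer_id;

-- The same statement can run in batch mode over bounded input
-- (Flink SQL client setting; the source must be bounded, e.g. files):
SET 'execution.runtime-mode' = 'batch';
```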

Conclusion

In this blog post, we’ve introduced Apache Flink at a high level and highlighted some of its features. While we would suggest that anyone interested in adding stream processing to their organization survey the different stream processing frameworks available, we’re also confident that Flink would be a top choice for any use case. At DeltaStream, although we are not tied to any particular compute engine, we have chosen Flink for the reasons explained above. DeltaStream is a unified, serverless stream processing platform that takes the pain out of managing your streaming stores and launching stream processing queries. If you’re interested in learning more about DeltaStream, you can check out our blog series or reach out to our team.

17 Apr 2023

Min Read

Batch vs. Stream Processing: How to Choose

Rapid growth in the volume, velocity, and variety of data has made data processing a critical component of modern business operations. Batch processing and stream processing are two major and widely used methods for performing data processing. In this blog, we will explain batch processing and stream processing and go over their differences. We will also explore the pros and cons of each and discuss the importance of choosing the right approach for your use cases. If you are looking for Streaming ETL vs. Batch ETL, we have a blog on that too.

What is Batch Processing?

Batch processing is a data processing method that operates on large volumes of data that have already been collected and stored. Depending on the specifics of the use case, a batch processing pipeline typically consists of multiple steps, including data ingestion, data preprocessing, data analysis, and data storage. Data goes through various transformations for cleaning, normalization, and enrichment, and different tools and frameworks can be used for analysis and value extraction. The final processed data is stored in a new location and format, such as a database, a data warehouse, a file, or a report.
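As a concrete, deliberately simplified example, a typical batch step might be a scheduled SQL job that aggregates the previous day’s raw events into a summary table. All table and column names here are hypothetical, and the exact date arithmetic varies by SQL dialect.

```sql
-- Illustrative nightly batch job: roll up yesterday's raw events into a
-- daily summary table. Typically triggered on a schedule by an orchestrator.
INSERT INTO daily_page_views
SELECT
  CAST(event_time AS DATE)  AS view_date,
  page_id,
  COUNT(*)                  AS views,
  COUNT(DISTINCT user_id)   AS unique_visitors
FROM raw_events
WHERE CAST(event_time AS DATE) = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY CAST(event_time AS DATE), page_id;
```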

Batch processing is used in a wide range of applications where large volumes of data need to be processed efficiently, cost-effectively, and with high throughput. Examples include ETL processes for data aggregation, log processing, and training predictive ML models.

Pros and cons of batch processing

Batch processing has its advantages and disadvantages. Here are some of its pros and cons.
Pros:

  • Batch processing is efficient and cost-effective for applications with large volumes of already stored data.
  • The process is predictable, reliable, and repeatable, which makes it easier to schedule, maintain, and recover from failures.
  • Batch processing is scalable and can handle workloads with complex transformations on large volumes of data.

Cons:

  • Batch processing has a high latency, and processing time can be long depending on the volume of data and complexity of the workload.
  • Batch processing is generally not interactive while a job is running: users need to wait until the whole process completes before they can access the results. This means batch processing does not provide real-time insights into data, which can be a disadvantage in applications where real-time access to (partial) results is necessary.

What is Stream Processing?

Stream processing is a data processing method where processing is done in real time as data is being generated and ingested. It involves analyzing and processing continuous streams of data in order to extract insights and information from them. A stream processing pipeline typically consists of several phases including data ingestion, data processing, data storage, and reporting. While the steps in a stream processing pipeline may look similar to those in batch processing, these two methods are significantly different from each other.

Stream processing pipelines are designed to have low latency and to process data continuously. When it comes to the complexity of transformations, batch processing normally involves more complex, resource-intensive transformations that run over large, discrete batches of data. In contrast, stream processing pipelines run simpler transformations on smaller chunks of data as soon as the data arrives. Because stream processing is optimized for continuous, low-latency processing, it is well suited to interactive use cases where users need immediate feedback. Examples include fraud detection and online advertising.
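To make the fraud-detection example concrete, here is a rough sketch in Flink-style streaming SQL: it counts card transactions in short event-time windows and surfaces cards whose transaction rate looks suspicious. The stream, columns, and threshold are all hypothetical.

```sql
-- Continuous fraud check: flag cards with more than 10 transactions in any
-- one-minute event-time window. Assumes a `transactions` stream with a
-- watermark declared on txn_time.
SELECT window_start, card_id, COUNT(*) AS txn_count
FROM TABLE(
  TUMBLE(TABLE transactions, DESCRIPTOR(txn_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, card_id
HAVING COUNT(*) > 10;
```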

Pros and cons of stream processing

Here is a list of pros and cons for stream processing.
Pros:

  • Stream processing allows for real-time processing of data with low latency, which means results are available quickly for use cases and applications with real-time processing demands.
  • Stream processing pipelines are flexible and can be quickly adapted to changes in data or processing needs.

Cons:

  • Stream processing pipelines tend to be more expensive because they must handle long-running jobs and process data in real time, which requires more powerful hardware and faster processing capabilities. However, with serverless platforms such as DeltaStream, stream processing can be simplified significantly.
  • Stream processing pipelines require more maintenance and tuning because any change or error in the data, as it is processed in real time, needs to be addressed immediately. They also need more frequent updates and modifications to adapt to changes in the workload. Serverless stream processing platforms like DeltaStream simplify this aspect of stream processing as well.

Choosing between Stream Processing and Batch Processing

There are several factors to consider when deciding whether to use stream processing or batch processing for your use case. First, consider the nature and specifics of the application and its data. If the application is highly time-sensitive and requires real-time analysis and immediate response (low latency), then stream processing is the option to choose. On the other hand, if the application can rely on offline and periodic processing of large amounts of data, then batch processing is more appropriate.

The second factor is the volume and velocity of the data. Stream processing is suitable for processing continuous streams of data arriving at high velocity. However, if the data volume is too high to be processed in real time and requires extensive, complex processing, batch processing is most likely the better choice, though it comes at the cost of sacrificing low latency; scaling stream processing up to thousands of workers to handle petabytes of data is very expensive and complex. Finally, you should consider the cost and resource constraints of the application. Stream processing pipelines are typically more expensive to set up, maintain, and tune. Batch processing pipelines tend to be more cost-effective and scalable, as long as no major changes occur in the workload or the nature of the data.

Conclusion

Batch processing and stream processing are two widely used data processing methods for data-intensive applications. Choosing the right approach for a use case depends on the characteristics of the data being processed, the complexity of the workload, the frequency of data and workload changes, and the use case’s latency and cost requirements. It is important to evaluate your data processing requirements carefully to determine which approach is the best fit. Moreover, keep an eye on emerging technologies and solutions in the data processing space, as they may provide new options and capabilities for your applications. While batch processing technologies have been around for several decades, stream processing technologies have seen significant growth and innovation in recent years. Besides open-source stream processing frameworks such as Apache Flink, which have gained widespread adoption, cloud-based stream processing services are also emerging that aim to provide easy-to-use and scalable data processing capabilities.

DeltaStream is a unified serverless stream processing platform to manage, secure, and process all your event streams. DeltaStream provides a comprehensive stream processing platform that is easy to use, easy to operate, and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking out our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

22 Mar 2023

Min Read

When to use Streaming Analytics vs Streaming Databases

Event streaming infrastructure has become an important part of the modern data stack in many organizations, enabling them to rapidly ingest and process streaming data to gain real-time insight and take action accordingly. Streaming storage systems such as Apache Kafka provide a storage layer to ingest data streams and make them available in real time. Streaming analytics (stream processing) systems and streaming databases are two kinds of platforms that enable users to gain real-time insight from their data streams. However, each of these systems is suitable for different use cases, and one system alone may not be the right choice for all of your stream processing needs. In this blog, we will provide a brief introduction to streaming analytics (stream processing) and streaming database systems and discuss how they work and which use cases each is suitable for. We then show how DeltaStream provides the capabilities of both on one unified platform, along with capabilities unique to DeltaStream.

What is Streaming Analytics? 

Streaming analytics, also known as stream processing, continuously processes and analyzes event streams as they arrive. Streaming analytics systems do not have their own storage but provide a compute layer on top of other storage systems. These systems usually read data streams from streaming storage systems such as Apache Kafka, process them, and write the results back to the streaming storage system. Streaming analytics systems provide both stateless (e.g., filtering, projection, transformation) and stateful (e.g., aggregation, windowed aggregation, join) processing capabilities on data streams. Many also provide a SQL interface where users can express their processing logic in familiar SQL statements. Apache Flink, Kafka Streams/Confluent’s ksqlDB, and Spark Streaming are some of the commonly used streaming analytics (stream processing) platforms. The following figure depicts a high level overview of a streaming analytics system and its relation to streaming stores.

Streaming analytics systems are used in a wide variety of use cases. Building real-time data pipelines is one of the main ones: data streams are ingested from source stores, processed, and written into destination stores. Streaming data integration, real-time ML pipelines, and real-time data quality assurance are just a few solutions that can be built on real-time data pipelines powered by streaming analytics systems. Other use cases for streaming analytics systems include data mesh deployments and event-driven microservices.

An important characteristic of streaming analytics systems is their focus on processing. In these systems, queries, also known as jobs, are a fundamental part of the platform. These systems provide fine-grained control over queries: users can start, pause, and terminate queries individually without having to make any changes to the sources and sinks of those queries. This focus on queries is one of the main differences between streaming analytics systems and streaming databases, as we explore in the next section.
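A typical pipeline in such a system is expressed as a long-running INSERT query that continuously reads from one stream and writes to another. The sketch below uses hypothetical Kafka-backed relations and Flink-style SQL; the resulting job can then be started, monitored, or stopped without touching the source or sink.

```sql
-- A streaming-analytics pipeline as a continuous query: filter and project
-- click events from a source stream into a destination stream. The query runs
-- as a long-lived job managed independently of `clicks` and `eu_clicks`.
INSERT INTO eu_clicks
SELECT user_id, page_id, event_time
FROM clicks
WHERE region = 'EU';
```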

What is a Streaming Database?

Streaming databases combine stream processing and the serving of results in one system. Unlike streaming analytics systems, which are compute-only platforms without their own storage, streaming databases include their own storage, where the results of continuous computations are stored and served to user applications. Materialized views are a foundational concept in streaming databases: they represent the results of continuous queries and are updated incrementally as input data arrives. These materialized views can then be queried with SQL.
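The sketch below shows the general pattern in generic SQL with hypothetical names; exact syntax varies across streaming databases. A materialized view is defined once over a stream of page views and kept up to date incrementally, and applications then query it like an ordinary table.

```sql
-- Define a continuously and incrementally maintained view over a stream.
CREATE MATERIALIZED VIEW page_view_counts AS
SELECT page_id, COUNT(*) AS views
FROM page_views
GROUP BY page_id;

-- Applications read the latest results with plain SQL lookups.
SELECT views FROM page_view_counts WHERE page_id = 'home';
```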

The following figure depicts a high level view of a streaming database architecture.

The role of queries is different in streaming databases. As mentioned above, continuous queries are a fundamental part of streaming analytics systems; streaming databases, however, do not provide fine-grained control over queries. In streaming databases, the lifecycle of a continuous query is tied to its materialized view: queries are started and terminated implicitly by creating and dropping materialized views. This significantly limits users’ ability to manage such queries, which is an essential part of building and managing streaming data pipelines. Therefore, streaming analytics systems are much better suited to building real-time data pipelines than streaming databases.

While streaming databases may not be the right tool for building real-time streaming pipelines, their incrementally updated materialized views can be used to serve real-time data that powers user-facing dashboards and applications. They can also be used to build real-time feature stores for machine learning use cases.

Can we have both? With DeltaStream, yes

While streaming analytics systems are ideal for building real-time data pipelines, streaming databases are great at building incrementally updated materialized views and serving them to applications. Customers are left having to choose between the two systems or, even worse, buy multiple solutions. The question is: can one system reduce this complexity and provide a comprehensive, powerful platform for building real-time applications on streaming data? The answer, as you expected, is yes!

DeltaStream does this seamlessly as a serverless service. As the following diagram shows, DeltaStream provides capabilities of both streaming analytics and streaming databases along with additional features to provide a complete platform to manage, process, secure and share streaming data.

Using DeltaStream’s SQL interface, you can build streaming applications and pipelines. You can also build continuously updated materialized views and serve them to applications and dashboards in real time. DeltaStream also provides fine-grained control over continuous queries: you can start, restart, and terminate each query individually. In addition to compute capabilities on streaming data, DeltaStream provides familiar hierarchical namespacing for your streams. In DeltaStream, you declare relations such as streams and changelogs on top of your streaming data, such as Apache Kafka topics, and organize them in databases and schemas the same way you organize tables in traditional databases. You can then control access to your streaming data with the familiar role-based access control (RBAC) mechanism you would use in a traditional relational database. This significantly simplifies using DeltaStream, since you don’t need to learn a new security model for your streaming data. Finally, DeltaStream enables users to securely share real-time data with internal and external users.
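As a rough, illustrative sketch only (the exact DeltaStream DDL and grant syntax may differ; see the documentation for authoritative examples), declaring a stream over a Kafka topic and then granting access to a role could look something like this:

```sql
-- Hypothetical sketch: relations are declared over Kafka topics, organized
-- into databases and schemas, and access is controlled with role-based grants.
CREATE STREAM pageviews (
  viewtime BIGINT,
  userid   VARCHAR,
  pageid   VARCHAR
) WITH ('topic' = 'pageviews', 'value.format' = 'JSON');

-- Grant read access on the relation to an analyst role (names are illustrative).
GRANT SELECT ON my_db.analytics.pageviews TO ROLE analyst;
```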

So if you are using streaming storage systems such as Apache Kafka (Confluent Cloud, Amazon MSK, or any other Apache Kafka deployment) or AWS Kinesis, you should check out DeltaStream as the platform for processing, organizing, and securing your streaming data. You can schedule a demo to see all these capabilities in the context of a real-world streaming application.
