05 Dec 2023

Exploring Pattern Recognition using MATCH_RECOGNIZE

Pattern recognition is a common use case in data processing. Detecting trend reversals, identifying anomalies, and finding sequences in data are all examples of pattern recognition problems. In SQL, Row Pattern Recognition (RPR) became part of the SQL standard in 2016 (ISO/IEC 9075:2016) and introduced the MATCH RECOGNIZE SQL syntax. Using this new syntax, users can write concise SQL queries to solve pattern recognition problems.

While pattern recognition is an important challenge in the batch world, there are also many pattern recognition use cases in the real-time context. That’s why DeltaStream, as a leading streaming platform, supports MATCH_RECOGNIZE in its SQL syntax.

In our previous blog post Analyzing Real-Time NYC Bus Data with DeltaStream, we showed how we could write a SQL query in DeltaStream to solve the pattern recognition use case of detecting bus trips where the delays are significantly increasing. In this blog post we will do a deep dive into that query and explain the purpose and meaning behind each line.

DSQL MATCH_RECOGNIZE Query

Below is the same pattern recognition SQL query from our Analyzing Real-Time NYC Bus Data with DeltaStream post:

  1. CREATE STREAM trips_delay_increasing AS
  2. SELECT
  3. trip,
  4. vehicle,
  5. start_delay,
  6. end_delay,
  7. start_ts,
  8. end_ts,
  9. CAST(FROM_UNIXTIME((start_epoch_secs + end_epoch_secs) / 2) AS TIMESTAMP) AS avg_ts
  10. FROM trip_updates
  11. MATCH_RECOGNIZE(
  12. PARTITION BY trip
  13. ORDER BY "ts"
  14. MEASURES
  15. C.row_timestamp AS row_timestamp,
  16. C.row_key AS row_key,
  17. C.row_metadata AS row_metadata,
  18. C.vehicle AS vehicle,
  19. A.delay AS start_delay,
  20. C.delay AS end_delay,
  21. A.ts AS start_ts,
  22. C.ts AS end_ts,
  23. A.epoch_secs AS start_epoch_secs,
  24. C.epoch_secs AS end_epoch_secs
  25. ONE ROW PER MATCH
  26. AFTER MATCH SKIP TO LAST C
  27. PATTERN (A B C)
  28. DEFINE
  29. A AS delay > 0,
  30. B AS delay > A.delay + 30,
  31. C AS delay > B.delay + 30
  32. ) AS MR WITH ('timestamp'='ts')
  33. QUERY WITH ('state.ttl.millis'='3600000');

Breakdown of MATCH_RECOGNIZE Query

First of all, you can find all of our documentation on MATCH RECOGNIZE here. In this section, we’ll be breaking down the above query and discussing the thought process behind each line of the query.

In this query, the source Stream is bus trip updates with a field called delay which represents the number of seconds a bus is delayed on its current route. The goal of the query is to output bus trips where the delay is significantly increasing so that we can get a better understanding of when buses will actually arrive at their upcoming stops.

CSAS (lines 1-2)

This query is what we call a CREATE STREAM AS SELECT (CSAS) query, where we are creating a new Stream that will be the output of running the continuous SELECT statement. Running a continuous query will launch a long-lived stream processing job behind the scenes and write the results to the physical storage that is backing the new Stream. As explained in the original blog post Analyzing Real-Time NYC Bus Data with DeltaStream, a Kafka topic is the physical storage layer that is backing this Stream.

Projection (lines 2-9)

The first few lines of the SELECT query list the fields that are projected into the resulting Stream trips_delay_increasing. These fields are made available by the MEASURES and PARTITION BY clauses in the MATCH_RECOGNIZE statement. The following fields are being projected:

  • trip and vehicle identify the particular bus trip that is experiencing increasing delays
  • start_delay, start_ts, end_delay, and end_ts give insights into the pattern that was matched and how fast delays are increasing
  • avg_ts is the midpoint between start_epoch_secs and end_epoch_secs, which also represents the midpoint of the matched pattern. To evaluate this field, we use built-in functions to convert from epoch seconds, an INTEGER, into a TIMESTAMP. In the original blog post, this field was used in a follow-up query to find the positions of the buses during the time that the delays were increasing.

Source (lines 10-11, 32)

The FROM clause defines the source of data for the query. In this query, we are sourcing from the result of the MATCH_RECOGNIZE clause on the trip_updates Stream. This source also has a WITH clause to specify a timestamp field, ts, which is a field in the trip_updates Stream. By specifying the timestamp field, we are setting the event time of incoming records to be equal to the value of that field, so later on in the ORDER BY clause we can use the correct timestamp for ordering events in our patterns.

MATCH_RECOGNIZE – PARTITION BY (line 12)

The first line in the MATCH_RECOGNIZE query is the optional PARTITION BY clause. This subclause groups the underlying data based on the partitioning expression, and optimizes DeltaStream’s compute resources for parallel processing of the source data. In this query, we are partitioning the events by the trip field, so any detected patterns are unique to a particular bus trip. In other words, we want to know for each particular bus trip if there are increasing delays. The PARTITION BY clause is necessary for our use case because the delays for one bus trip are not relevant to other bus trips. Note that the fields in the PARTITION BY clause are available for projection, as shown in this query by selecting the trip field in the SELECT statement.

MATCH_RECOGNIZE – ORDER BY (line 13)

The ORDER BY clause is required for MATCH_RECOGNIZE. This clause defines the order in which the data should be sorted for each partition before they are evaluated for matched patterns. For our query, the ordering field is ts, so bus trip updates will be ordered in ascending order according to the value of the ts field. One requirement for the ORDER BY subclause is that the ordering field must be the same as the timestamp field for the source Relation, which is set on line 32 (also mentioned in the Source section above).

MATCH_RECOGNIZE – MEASURES (lines 14-24)

The MEASURES clause defines the output schema for the MATCH_RECOGNIZE statement and plays a similar role to a SELECT clause. The fields specified in the MEASURES subclause are made available to the SELECT clause. The MEASURES subclause and the DEFINE subclause (lines 28-31) are closely related, as the MEASURES subclause projects fields from rows defined by the DEFINE subclause. For example, on line 19 we define start_delay as A.delay. In this case, the delay for the row matching A’s definition is being projected as start_delay, whereas on line 20 the delay for the row matching C’s definition is being projected as end_delay. There are 3 fields in the MEASURES subclause that aren’t being used in our query’s SELECT clause. These are the row metadata columns – row_timestamp, row_key, and row_metadata. Since the MATCH_RECOGNIZE operator alters the projection columns of its input Relation, a PATTERN variable must be chosen for these special fields as part of the MEASURES subclause, which we do on lines 15-17. See the MATCH_RECOGNIZE – PATTERN section below for more information on the PATTERN variables.

MATCH_RECOGNIZE – Output Mode (line 25)

ONE ROW PER MATCH is the only supported output mode, which means for a given sequence of events that matches our pattern, we should output one result. So in our query, for a bus trip with significantly increasing delays, we should output one event for this trip.

MATCH_RECOGNIZE – AFTER MATCH strategy (line 26)

The after match strategy defines where to begin looking for the next pattern. In this query, we specify AFTER MATCH SKIP TO LAST C, which means after finding a pattern match, look for the next pattern match starting with the last event of the current match. Other match strategies could inform our query to start looking for the next pattern from a different event, such as the event after a pattern match. However, in our case, we want to make sure we are capturing continuous patterns. Specifically, the pattern we are looking for is 3 trip updates in a row with increasing delay (see the MATCH_RECOGNIZE – PATTERN section below). So, if there was a bus trip with 5 trip updates with strictly increasing delay, then there would be 2 results from our MATCH_RECOGNIZE statement with our after match strategy. The first result would be for updates 1-3, and the second result for updates 3-5. The first match’s C would also be the second match’s A in this case. See other after match strategies in our documentation.
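For comparison, here is a minimal sketch of the same pattern with a different after match strategy, assuming SKIP PAST LAST ROW is supported (it comes from the SQL standard; see our documentation for the full list of strategies). With this strategy, the search resumes at the row after the current match, so in the 5-update example above only one result (for updates 1-3) would be produced.

  -- Sketch: same pattern, but consecutive matches never share an event.
  SELECT trip, start_delay, end_delay
  FROM trip_updates
  MATCH_RECOGNIZE(
    PARTITION BY trip
    ORDER BY "ts"
    MEASURES A.delay AS start_delay, C.delay AS end_delay
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW  -- resume after the last row of the match
    PATTERN (A B C)
    DEFINE
      A AS delay > 0,
      B AS delay > A.delay + 30,
      C AS delay > B.delay + 30
  ) AS MR WITH ('timestamp'='ts');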

MATCH_RECOGNIZE – PATTERN (line 27)

The PATTERN clause specifies the pattern of events that should be considered a match. The PATTERN subclause contains pattern variables, which can each be associated with a quantifier to specify how many rows of that variable to allow in the pattern (see the documentation for more details). In our query, we have a simple pattern of (A B C) without any quantifiers, meaning that in order for a series of events to be considered a match, there needs to be 3 consecutive events with the first one matching A’s definition, the second one matching B’s definition, and the third one matching C’s definition.

MATCH_RECOGNIZE – DEFINE (lines 28-31)

The DEFINE subclause defines each of the pattern variables from the PATTERN subclause. If a variable is not defined, it evaluates to true for any event, so any row can contribute to a pattern match as that variable. This clause is similar to the WHERE clause in SQL, in that it specifies the conditions an event must meet in order to be considered as one of the pattern variables. When defining these pattern variables, we can access fields from the original events of the source Relation, in our case the trip_updates Stream, and define expressions for evaluation. Aggregation and offset functions can also be applied here. In our query, we are defining our pattern variables based on their delay. For the first event in our pattern, A, we want a bus trip that is already delayed (a delay greater than 0 seconds). B is defined as an event that is more than 30 seconds more delayed than A, and similarly C is defined as an event that is more than 30 seconds more delayed than B. Combined with our PATTERN of (A B C), our query is essentially finding patterns where the delay increases by more than 30 seconds with each trip update, across 3 trip updates in a row.
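As a rough sketch of how quantifiers and offset functions can extend this query (assuming Flink-style quantifiers and the LAST() navigation function are available; check the documentation for exact support), the variant below matches a run of one or more intermediate increases instead of exactly one:

  -- Sketch: B+ accepts one or more intermediate increases before C.
  SELECT trip, start_delay, end_delay
  FROM trip_updates
  MATCH_RECOGNIZE(
    PARTITION BY trip
    ORDER BY "ts"
    MEASURES A.delay AS start_delay, C.delay AS end_delay
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST C
    PATTERN (A B+ C)
    DEFINE
      A AS delay > 0,
      -- the first B row is compared against A, later B rows against the previous B
      B AS (LAST(B.delay, 1) IS NULL AND delay > A.delay + 30)
           OR delay > LAST(B.delay, 1) + 30,
      C AS delay > B.delay + 30  -- B.delay refers to the last row mapped to B
  ) AS MR WITH ('timestamp'='ts');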

QUERY WITH (line 33)

The last line of our query is the QUERY WITH clause. This optional clause is used to set query properties. For our query, we are setting the state.ttl.millis which is used to inform the query when it is safe to purge its state. Another way to limit the state size for MATCH_RECOGNIZE queries is the optional WITHIN subclause that specifies a duration for patterns. Without specifying some way for the query to purge state, the amount of information the query needs to keep in memory will grow indefinitely and the query will eventually fail.
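As a sketch of that WITHIN alternative (assuming Flink-style syntax, where the interval attaches to the PATTERN subclause; check our documentation for the exact form supported), the pattern itself can be bounded in time so that stale partial matches are discarded:

  -- Sketch: a match must complete within 1 hour, bounding the state the
  -- query has to keep for partial matches.
  SELECT trip, start_delay, end_delay
  FROM trip_updates
  MATCH_RECOGNIZE(
    PARTITION BY trip
    ORDER BY "ts"
    MEASURES A.delay AS start_delay, C.delay AS end_delay
    ONE ROW PER MATCH
    AFTER MATCH SKIP TO LAST C
    PATTERN (A B C) WITHIN INTERVAL '1' HOUR
    DEFINE
      A AS delay > 0,
      B AS delay > A.delay + 30,
      C AS delay > B.delay + 30
  ) AS MR WITH ('timestamp'='ts');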

Conclusion

Queries using the MATCH_RECOGNIZE syntax can seem very complex at first glance. This blog post aims to bring clarity to the different parts of a MATCH_RECOGNIZE query by using a specific, easy-to-follow example. Hopefully this post makes pattern recognition queries easier to understand, as MATCH_RECOGNIZE is a powerful operator that can solve many use cases.

If you’re interested in trying out MATCH_RECOGNIZE queries in DeltaStream yourself, sign up for a free trial or schedule a demo with us.

29 Nov 2023

Top 3 Challenges of Stream Processing and How to Address Them

Real-time shipment tracking, truck fleet management, real-time training of ML models, and fraud detection are all real use-cases that are powered by streaming technologies. As companies try to keep up with the trends of building real-time products, they require a data stack capable of handling and processing streaming data. Unlocking stream processing allows businesses to do advanced data analytics at low latencies, which ultimately allows for faster business insights and more informed decision making. However, many of the popular stream processing frameworks, like Apache Flink, are quite complex and can be challenging to operate. In this blog post we’ll cover 3 of the main challenges with operating a stream processing platform and how to address them.

Challenge 1: Resource Management

Stream processing jobs are long lived in order to continuously process data. Because these jobs are long lived, it’s important that your stream processing platform properly allocates resources for them. On one hand, over-allocating resources results in overspending, while on the other hand, under-allocating resources causes jobs to fall behind or fail. The proper allocation for stream processing jobs varies case by case. For instance, jobs with high throughputs that need to hold a lot of state should receive more memory, while jobs with complex transformations should receive more CPU. Many workloads also fluctuate, and in such cases dynamic resource allocation is necessary to match the workload. Take a stream of data for page visits on a website for example – it’s likely that the website will be visited more during the day when people aren’t asleep. So, stream processing jobs that source from this data should scale up during the day, then scale down for the night.

Solution: Utilization Metrics

Exposing resource utilization metrics is an important first step to tackling the challenges of resource management. Having visibility into the resource utilization trends of your jobs allows your stream processing platform to have rules for resource allocation. In the simplest case, if your job’s resource usage is stable, you can allocate resources to match what’s shown in the metrics. For jobs with predictable fluctuations, such as a job sourcing from data that peaks during the day and dips during the night, you can set up a system that adjusts the resource allocation on a schedule. In the most complex case, for jobs with unpredictable resource fluctuations, the best approach is to add an auto-scaling service that can automatically resize compute resources based on resource metrics. Building a platform that exposes the correct metrics, can safely resize your stream processing jobs, and includes an auto-scaling service is necessary to generically support stream processing workloads, but it can take a lot of engineering time and effort. If building these systems is too costly of an engineering investment, you can also consider fully managed third-party solutions that can partially or fully address these challenges.

Challenge 2: Data Heterogeneity

For production use cases, data can come from many different sources and in many different formats. Streaming data coming from sensors, cloud providers, databases, and backend services will differ from each other, which makes the data difficult to compare. It is not easy to create a stream processing platform that can handle various data formats and quality levels. The engineering team that supports this platform will need to understand the nuances of different data sources and provide tools and features that help make variable data more uniform. However, such a platform opens up many possibilities, as businesses can use data from sources that were previously isolated.

Solution: Data Standardization

Standardizing the data across your organization and implementing quality control over the data are the best solutions for dealing with data heterogeneity. Providing data standards and encouraging data schematization at the data’s source is the best practice, but if that’s not possible, stream processing can also help to transform a set of data into a standardized format that can be easily processed and integrated with other data streams. Stream processing platforms that enable users to filter out bad or duplicate data, enrich data with missing fields, and mutate data to fit standardized data formats can mitigate many of the issues of variable data.

One tip for dealing with data of variable quality is to provide different configurations for error handling. For many stream processing frameworks, if there is a record that a job doesn’t know how to deserialize or make sense of, the job will simply fail. For data sources that don’t have great data integrity, having options to skip over those records or produce them to a dead-letter queue for further analysis can be better for your overall application.
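As one concrete example, Flink SQL (one of the frameworks mentioned above) exposes a format option for skipping records that fail to parse. The sketch below is trimmed for brevity and only shows the Kafka connector and JSON format options; routing bad records to a dead-letter queue would need additional wiring.

  -- Minimal Flink SQL sketch: drop records that cannot be parsed as JSON
  -- instead of failing the job. Connector properties are trimmed for brevity.
  CREATE TABLE orders_raw (
    order_id STRING,
    amount DOUBLE,
    order_ts TIMESTAMP(3)
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'broker:9092',
    'format' = 'json',
    'json.ignore-parse-errors' = 'true'
  );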

Challenge 3: Streaming Knowledge Gaps

Most stream processing frameworks are highly complex pieces of software. Building up expertise to understand the challenges of streaming, versus the traditional challenges of the batch world, takes time. For some organizations, having engineers ramp up on streaming technologies may not be a worthwhile or affordable investment. For organizations that do end up investing in a team of streaming experts, a knowledge gap between the streaming team and other teams often forms. Even with a stream processing platform available to them, product engineers in many cases may not have much exposure to streaming concepts or know how to best leverage the streaming tools available to them. In these cases, engineers working on product features or applications may require a lot of back and forth with the team of streaming engineers to realize their projects, or they may not even realize the benefits of adding stream processing to their projects in the first place. These situations can lead to lost business potential and reduced developer velocity.

Solution: Education and Democratization

Two ways to help address these challenges are investing in developer education and democratizing the streaming and stream processing platforms. Setting up regular knowledge sharing sessions and encouraging collaboration between teams can go a long way toward reducing knowledge gaps. From the platform perspective, democratizing streaming and stream processing by making these platforms easy to use lowers the barrier to entry for these technologies. Popular stream processing frameworks such as Flink and Spark Streaming have SQL APIs for defining data processing jobs. Exposing SQL to abstract away some of the complexities of the underlying system is one way to make the platform easier to use.
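For instance, a filtering job that would otherwise require custom Flink or Spark code can often be expressed in a few lines of streaming SQL. The sketch below is generic, and the stream and field names are illustrative.

  -- Generic streaming SQL sketch: keep only error-level events from a
  -- stream of application logs. Names are illustrative.
  CREATE STREAM error_logs AS
  SELECT service, message, log_ts
  FROM application_logs
  WHERE level = 'ERROR';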

Conclusion

In this blog post we highlighted 3 of the main challenges we’ve seen organizations face when building and operating their own stream processing platforms. Overcoming each of these challenges requires engineering time and effort. While some organizations may be able to spend the upfront time and money to build their own in-house data streaming platforms, others may not be able to afford to. This is where fully managed cloud services, such as DeltaStream, can help.

Our aim at DeltaStream is to provide an easy to use data streaming platform to unify, process, and govern your data. Here’s how DeltaStream addresses each of the challenges above:

  1. Resource Management: DeltaStream is a serverless platform, meaning resource scaling and operations are completely taken care of. No cluster sizing or resource provisioning.
  2. Data Heterogeneity: Out of the box, DeltaStream has support for all major data serialization formats – JSON, ProtoBuf, and Avro. DeltaStream also has native support for many data storage systems including Kafka, Kinesis, Delta Lake, Snowflake, and PostgreSQL. DeltaStream’s rich processing capabilities also allow users to filter, enrich, and transform data to mitigate data quality issues.
  3. Streaming Knowledge Gaps: DeltaStream exposes an easy to use SQL interface for interacting with all streaming resources and streaming queries. Tools for sampling data and testing queries are also provided to help users iterate faster.

If you want to learn more about how DeltaStream can enable stream processing in your organization, schedule a demo with us. If you want to try DeltaStream out yourself, sign up for a free trial.

14 Nov 2023

Key Components of a Modern Data Stack: An Overview

Data is one of the most valuable assets of any company in the digital age. Drawing insights from data can help companies better understand their market and make well-informed business decisions. However, to unlock the full potential of data, you need a data stack that can collect, store, process, analyze, and act on data efficiently and effectively. When considering a data stack, it’s important to understand what your needs are and how your data requirements may change in the future, then make decisions that are best suited for your particular business. For almost every business, making the most of data collection and data insights is essential, and you don’t want to find yourself stuck with a legacy data system. Data systems can quickly go out of date, and updating to new emerging technologies can be painful. Using modern cloud-based data solutions is one way to keep your data stack flexible and scalable, and ultimately save time and money down the road.

In this blog post, we’ll cover the benefits of building a modernized data stack, the main components of a data stack, and how DeltaStream fits into a modern data stack.

What makes a Modern Data Stack and Why You Need One

The modern data stack is built on cloud-based services and low-code or no-code tools. In contrast to legacy systems, modern data stacks are typically built in a distributed manner and aim to improve on the flexibility, scalability, latency, and security of older systems.

How we build a data stack to manage and process data has developed and morphed greatly in the last few years alone. Changing standards, new regulations, and the latest tech have all played a role in how we handle our data. Some of the core properties of a modern data stack include being:

  • Cloud-based: A cloud-based product is something that users can pay for and use over the internet, whereas before users would have to buy and manage their own hardware. The main benefit for users is scalability of their applications and reduced operational maintenance. Modern data solutions offer elastic scalability, meaning users can easily increase or decrease the resources running for their applications by simply adjusting some configuration. More recently, modern data solutions have been offering a serverless option, where users don’t need to worry about resources at all, and automatic scaling is handled behind the scenes by the solution itself. Compare this with legacy systems, where scaling up requires planning, hardware setup, and maintenance of the software and new hardware. Elastic scaling enables users to be flexible with the workloads they plan to run, as resources are always available to use. To ensure availability of their products, modern data solutions typically guarantee SLAs, so users can expect these products to work without worrying about system maintenance or outages. Without the burdens of resource management and system maintenance, users can completely focus on writing their applications, which will improve developer velocity and allow businesses to innovate faster.
  • Performant: In order for modern data solutions to be worthwhile, they need to be performant. Low latency, failure recovery, and security are all requirements for modern data solutions. While legacy systems may have been sufficient to meet the standards of the past, not all of them have evolved to meet the requirements of data workloads today. Many of the current modern data products utilize the latest open-source technologies in their domain. For example, Databricks utilizes and contributes back to the Apache Spark project, Confluent does the same for Apache Kafka, and MongoDB does the same for the MongoDB open source project. For modern data solutions that aren’t powered by open source projects, they typically feature advanced state-of-the-art software and provide benchmarks on how their performance compares to the status quo. Modern data solutions make the latest technologies accessible, enabling businesses to build powerful tools and applications that get the most out of their data.
  • Easy to use: The rapid development and advancement of technologies to solve emerging use cases has led to the most powerful tech also becoming the most complex and specialized. While these technologies have made it possible to solve many new use cases, they’re oftentimes only accessible to a handful of experts. Because of this, the modern trend is toward building democratized solutions that are low-code and easy to use. That’s why modern solutions abstract away the complexities of their underlying tech and expose easy-to-use interfaces to users. Consider AWS S3 as an example: users can use the AWS console to store or retrieve files without writing any code. However, behind the scenes there is a highly complex system that provides strong consistency guarantees and high availability, scales indefinitely, and delivers a low-latency experience for users. Businesses using modern data solutions no longer need to hire or train experts to manage these technologies, which ultimately saves time and money.

Components of a Data Stack

At a high level, a modern data stack consists of four components: collection, storage, processing and analysis/visualization. While most modern data products loosely fit into a single component, it’s also not uncommon for solutions to span multiple components. For example, data warehouses such as Databricks and Snowflake act as both Data Storage (with data governance) and Data Processing.

  • Data Collection: These are services or systems that ingest data from your data sources into your data platform. This includes data coming from APIs, events from user facing applications, and data from databases. Data can come structured or unstructured and in various data formats.
  • Data Storage: These are systems that hold data for extended periods of time, either to be kept indefinitely or for processing later on. This includes databases like MongoDB, or streaming storage platforms like RedPanda and Kinesis. Data warehouses and data lakes such as Snowflake and AWS S3 can also be considered data storage.
  • Data Processing: These are the systems that perform transformations on your data, such as enrichment, filtering, aggregations, and joins. Many databases and data warehouses have processing capabilities, which is why you may see some products in multiple categories below. For example Snowflake is a solution where users can store their data and run SQL queries to transform the data in their tables. For stream processing use cases, users may look at products like DeltaStream that use frameworks such as Apache Flink to handle their processing.
  • Data Analysis and Visualization: These are the tools that enable you to explore, visualize, and share data insights. These include BI platforms, analytics software, and chart building software. Some popular tools include Tableau and Microsoft Power BI among others. Data analysis and visualization tools can help users draw insights from their data, discover patterns, and communicate findings.

Overview of the current landscape of modern data technologies

How DeltaStream fits into your Modern Data Stack

DeltaStream is a serverless data platform that unifies, governs, and processes real-time data. DeltaStream acts as both an organizational and compute layer on top of users’ data resources, specifically streaming resources. In the modern data stack, DeltaStream fits in the Data Storage and Data Processing layers. In particular, DeltaStream makes it easier to manage and process data for streaming and stream processing applications.

In the Data Storage layer, DeltaStream is a data platform that helps users properly manage their data through unification and governance. By providing proper data management tools, DeltaStream helps make the entire data stack more scalable and easier to use, allowing developers to iterate faster. Specifically, DeltaStream improves on the flat namespace model that most streaming storage systems, such as Kafka, currently have. DeltaStream instead uses a hierarchical namespace model in which data resources exist within a “Database” and “Schema”, and data from different storage systems can exist within the same database and schema in DeltaStream. This way, the logical representation of your streaming data is decoupled from the physical storage systems. Then, DeltaStream provides Role-Based Access Control (RBAC) on top of this relational organization of data, so that roles can be created with specific permissions and users inherit the roles they’ve been granted. While DeltaStream isn’t actually storing any raw data itself, providing data unification and data governance to otherwise siloed data makes managing and securing all of a user’s streaming data easier. The graphic below is a good representation of DeltaStream’s federated data governance model.

DeltaStream federated data governance
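As a rough sketch of what this hierarchical organization and role-based access might look like in SQL (the statements and privilege names below are hypothetical and only illustrate the idea; refer to the DeltaStream documentation for the actual DDL), data can be organized into a database and schema and then shared by role:

  -- Hypothetical sketch: organize streaming data logically and grant access by role.
  CREATE DATABASE analytics_db;
  CREATE SCHEMA analytics_db.rides;

  -- A Stream defined in this schema points at a topic in a physical store,
  -- decoupling the logical name from the storage system.
  CREATE STREAM analytics_db.rides.trip_updates (
    trip VARCHAR, vehicle VARCHAR, delay INTEGER, "ts" TIMESTAMP
  ) WITH ('topic' = 'trip_updates', 'value.format' = 'JSON');

  -- Role-based access: analysts may read from the schema but not modify it.
  CREATE ROLE analyst;
  GRANT USAGE ON DATABASE analytics_db TO ROLE analyst;
  GRANT SELECT ON SCHEMA analytics_db.rides TO ROLE analyst;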

In the Data Processing layer, DeltaStream provides an easy-to-use SQL interface to define long-lived stream processing jobs. DeltaStream is a serverless platform, so fault tolerance, deployment, resource scaling, and operational maintenance of these queries are all taken care of by the DeltaStream platform. Under the hood, DeltaStream leverages Apache Flink, a powerful open source stream processing framework that can handle large volumes of data and supports complex data transformations. DeltaStream users get all the benefits of Flink’s powerful stream processing capabilities without having to understand Flink or write any code. With simple SQL queries, users can deploy long-lived stream processing jobs in minutes without having to worry about scaling, operational maintenance, or the complexities of stream processing. This fits well with the scalable, performant, and easy-to-use principles of modern data stacks.

Conclusion

In this blog post we covered the core components of a data stack and discussed the benefits of investing in a modern data stack. For streaming storage and stream processing, we discussed how DeltaStream fits nicely into the modern data stack for unifying, governing, and processing data. DeltaStream’s integrations with non-streaming databases and data warehouses such as PostgreSQL, Databricks, and Snowflake make it a great option to run alongside these already popular modern data products. If you have any questions about modern data stacks or are interested in learning more about DeltaStream, reach out to us or schedule a demo.

01 Nov 2023

A Guide to Repartitioning Kafka Topics Using PARTITION BY in DeltaStream

In this blog post we are going to highlight why and how to use the PARTITION BY clause in queries in DeltaStream. While we will be focusing on repartitioning Kafka data in this post, any data storage layer that uses a key to partition its data can benefit from PARTITION BY.

We will first cover how Kafka data partitioning works and explain why a Kafka user may need to repartition their data. Then, we will show off how simple it is to repartition Kafka data in DeltaStream using a PARTITION BY query.

How Kafka partitions data within a topic

Kafka is a distributed and highly scalable event logging platform. A topic in Kafka is a category of data representing a single log of records. Kafka is able to achieve its scalability by allowing each topic to have 1 or more partitions. When a particular record is produced to a Kafka topic, Kafka determines which partition that record belongs to and the record is persisted to the broker(s) assigned to that partition. With multiple partitions, writes to a Kafka topic can be handled by multiple brokers, given that the records being produced will be assigned to different partitions.

In Kafka, each record has a key payload and a value payload. To determine which partition a record should be produced to, Kafka hashes the record’s key. Thus, all records with the same key will end up in the same topic partition. Records without a key are spread across partitions (round-robin or sticky partitioning, depending on the producer version).

Let’s see how this works with an example. Consider you have a Kafka topic called ‘pageviews’ which is filled with records with the following schema:

  {
    ts: long,    // timestamp of pageviews event
    uid: string, // user id
    pid: string  // page id
  }

The topic has the following records (omitting ts for simplicity):

If we partition by the uid field by setting it as the key, then the topic with 3 partitions will look like the following:

If we partition by the pid field by setting it as the key, then the topic with 3 partitions will look like the following:

Why repartition your Kafka topic

The relationship between partitions and consumers in Kafka is such that, within a single consumer group, each partition is consumed by at most 1 consumer, but a consumer can read from multiple partitions. This means that if our Kafka topic has 3 partitions and our consumer group has 4 consumers, then one of the consumers will sit idle. In the inverse case, if our Kafka topic has 3 partitions and our consumer group has 2 consumers, then one of the consumers will read from 2 partitions while the other reads from only 1 partition.

In most cases, users will set up their applications that consume from a Kafka topic to have a number of consumers that is a divisor of the number of partitions so that one consumer won’t be overloaded relative to other consumers. However, data can still be distributed unevenly to different partitions if there is a hotkey or poor partitioning strategy, and repartitioning may be necessary in these cases.

To showcase how data skew can be problematic, let’s look again at our pageviews example. Imagine that half of the records have a pid value of A and we partition by the pid field. In a 3-partition topic, ~50% of the records will be sent to one partition while the other two partitions each get ~25% of the records. While data skew might hurt performance and reliability for the Kafka topic itself, it can also make it difficult for downstream applications that consume from this topic. With data skew, one or more consumers will be overloaded with a disproportionate amount of data to process. This can have a direct impact on how well downstream applications perform and result in problems such as many out-of-order records, exploding application state sizes, and high latencies (see what Apache Flink has implemented to address some of the problems caused by data skew in sources). By repartitioning your Kafka topic and picking a field with more balanced values as the key, data skew can be reduced if not eliminated.

Another reason you may want to repartition your Kafka data is to align your data according to its context. In the pageviews example, if we choose the partition key to be the uid field, then all data for a particular user id will be sent to the same partition and thus the same Kafka broker. Similarly, if we choose the partition key to be the pid field, then all data for a particular page id will be sent to the same partition and Kafka broker. If our use case is to perform analysis based on users, then it makes more sense to partition our data using uid rather than pid, and downstream applications will actually process data more efficiently.

Suppose we are counting the number of pages a user visits in a certain time window and are partitioning by pid. If the application that aggregates the data has 3 parallel threads performing the aggregation, each of these threads will be required to read records from all partitions, as the data belonging to a particular uid can exist in many different partitions. If our topic was partitioned by uid instead, then each thread could process data from its own distinct set of partitions, as all data for a particular uid would be available in a single partition. Stream processing systems like Flink and Kafka Streams require some kind of repartition step in their jobs to handle cases where operator tasks need to process data based on a key and the source Kafka topic is not partitioned by that key. In the case of Flink, the source operators need to map data to the correct aggregation operators over the network. The disk I/O and network traffic involved in repartitioning and shuffling data can become very expensive at scale. By properly partitioning your source data to fit the context, you can avoid this overhead for downstream operations.
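To make the example concrete, an aggregation like the one described above might look like the sketch below (assuming Flink-style tumbling-window syntax and that ts is the stream’s event-time column; DeltaStream’s exact windowing syntax may differ). Partitioning the pageviews topic by uid lets each parallel instance of this aggregation work on its own distinct set of users.

  -- Sketch: count distinct pages per user in 1-minute tumbling windows.
  SELECT
    uid,
    window_start,
    window_end,
    COUNT(DISTINCT pid) AS pages_visited
  FROM TABLE(
    TUMBLE(TABLE pageviews, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
  GROUP BY uid, window_start, window_end;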

PARTITION BY in DeltaStream

Now the question is, how do I repartition or rekey my Kafka topic? In DeltaStream, it’s made simple by PARTITION BY. Given a Kafka topic, you can define a Stream on this topic and write a single PARTITION BY query that rekeys the data and produces the results to a new topic. Let’s see how to repartition a keyless Kafka topic ‘pageviews’.

First, define the ‘pageviews’ Stream on the ‘pageviews’ topic by writing a CREATE STREAM query:

  CREATE STREAM pageviews (
    ts BIGINT, uid VARCHAR, pid VARCHAR
  ) WITH (
    'topic' = 'pageviews', 'value.format' = 'JSON'
  );

Next, create a long-running CREATE STREAM AS SELECT (CSAS) query to rekey the ‘pageviews’ Stream using uid as the partition key and output the results to a different Stream:

  CREATE STREAM pageviews_keyed AS
  SELECT
    *
  FROM pageviews
  PARTITION BY uid;

The output Stream, ‘pageviews_keyed’, will be backed by a new topic with the same name. If we PRINT the input ‘pageviews’ topic and the output ‘pageviews_keyed’ topic, we can see the input has no key assigned and the output has the uid value defined as the key.

  db.public/cc_kafka# PRINT TOPIC pageviews_ctan;
   | {"ts":46253951,"uid":"User_7","pid":"Page_90"}
   | {"ts":46253961,"uid":"User_5","pid":"Page_18"}
   | {"ts":46253971,"uid":"User_9","pid":"Page_64"}

  db.public/cc_kafka# PRINT TOPIC pageviews_keyed;
  {"uid":"User_6"} | {"ts":46282721,"uid":"User_6","pid":"Page_79"}
  {"uid":"User_7"} | {"ts":46282731,"uid":"User_7","pid":"Page_56"}
  {"uid":"User_1"} | {"ts":46282741,"uid":"User_1","pid":"Page_70"}

As you can see, with a single query, you can repartition your Kafka data using DeltaStream in a matter of minutes. This is one of the many ways we remove barriers to make building streaming applications easy.

If you want to learn more about DeltaStream or try it for yourself, you can request a demo or join our free trial.

24 Oct 2023

Seamless Data Flow: How Integrations Enhance Stream Processing

Data processing systems can be broadly classified into two categories: batch processing and stream processing. Enterprises often use both because they serve different purposes and have distinct advantages, and using them together provides a more comprehensive and flexible approach to data processing and analysis. Stream processing platforms help organizations process data as close to real time as possible, which is important for latency-sensitive use cases. Some examples include monitoring IoT sensors, fraud detection, and threat detection. Batch processing systems are useful for a different set of situations – for example, when you want to analyze past data to find patterns or handle a lot of data from different sources.

Integrating between systems

The need for both stream and batch processing is exactly why you see companies implementing a “lambda architecture”, which brings the best of both worlds together. In this architecture, batch processing and stream processing systems generally work together to serve multiple use cases. For these enterprises, it’s important to be able to:

  1. Move data between these systems seamlessly – ETL
  2. Make data products available to users in their preferred format/platform
  3. Continue using existing systems while leveraging new technologies to improve overall efficiency

Having the ability to process and prepare data for downstream consumption as soon as it arrives is a critical function in the data landscape. For this, you need to be able to (1) extract data from source systems for processing and (2) load the processed data into your platform of choice. This is precisely where stream processing platforms and integrations come into the picture: integrations help you extract and load data, while stream processing handles the transformation.

Every enterprise uses multiple disparate systems, each serving its own purpose, to manage its data. For these enterprises, it is important for data teams to produce data products in real time and serve them across the multiple data platforms in use. This requires a certain level of sophisticated processing, data governance, and integration with commonly used data systems.

Real-world integration uses

Let’s take a look at an example which can help us understand how companies manage their data across multiple platforms and how they process, access and transfer data across them.

Consider a scenario at a bank. You have an incoming stream of transactions from Kafka. This stream of transactions is connected to DeltaStream, where you can analyze transactions as they come in, flag fraudulent activity in real time based on various predefined rules, and alert your users as soon as possible. This is extremely time sensitive, and a stream processing platform is best suited for such use cases.
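As a rough illustration of such a predefined rule (the stream names, fields, and threshold below are hypothetical), the real-time flagging step could be expressed as a simple continuous query:

  -- Hypothetical sketch: flag unusually large card-not-present transactions
  -- as they arrive. Stream names, fields, and the threshold are illustrative.
  CREATE STREAM flagged_transactions AS
  SELECT txn_id, account_id, amount, txn_ts
  FROM transactions
  WHERE amount > 10000
    AND card_present = FALSE;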

Now, other teams within the bank, for example the marketing team, may want to understand trends based on customer activity and customize how they market products to a given customer. For this, you need to look at transactions going back a week or a month and process them all to generate enough context. Instead of going back to the ‘source’ system, you can have DeltaStream send all the processed transactions, in the right format, to your batch processing systems using our integrations so that you can:

  1. Have the data ready in the right format and schema for processing
  2. Reduce the back and forth between multiple platforms, since data transits the pipeline just once – a reduction in data transfer costs
  3. Eliminate the duplication of data sets across multiple platforms for processing
  4. Easily manage compliance – for example, by reducing the footprint of your PII data

By integrating the batch processing and stream processing systems, the entire pipeline becomes more manageable, reducing the complexity of managing data across different systems.

Integrating with DeltaStream

It’s evident that we need integration across different platforms to enable data teams to process and manage data. DeltaStream provides all the ingredients required for such an operation. Our stream processing platform is powered by Flink, DeltaStream’s RBAC enables you to govern and securely share data products, and the integrations with Databricks and Snowflake allow data products to be used in the data systems your teams already rely on. With the launch of these integrations, you can do more with your data. To unlock the power of your streaming data, reach out to us or request a free trial!

30 Aug 2023

Choosing Serverless Over Managed Services: When and Why to Make the Switch

Consider a storage service where you store and retrieve files. In on-prem environments, HDFS (Hadoop Distributed File System) has been one of the most common storage platforms. As with any service in an on-prem environment, as a user you are responsible for all aspects of operations. This includes bringing up the HDFS cluster, scaling the cluster up and down based on usage requirements, and dealing with different types of failures, including but not limited to server failures and network failures. Of course, in an ideal situation you would like to just use the storage service and focus on your business without dealing with the complexity of operating such infrastructure. If you use a cloud environment instead of on-prem, you have the option of using storage services provided by a variety of vendors instead of running HDFS in the cloud yourself. However, there are different ways to provide the same service in the cloud, and they can significantly affect the user experience and ease of use of such services.

A Look at Managed Services

Let’s go back to our storage service and assume we are now using a cloud environment, where we can rely on services offered by vendors instead of running the required services ourselves. One common option is a managed version of the on-prem service. In this case, the service provider takes the same platform used in the on-prem environment and makes some improvements to run it in the cloud. While this takes away some of the operational burden from the user, the user is still involved in many other aspects of operating such managed services. For the storage service we are considering here, a managed HDFS service would be an example of this approach. When using a “fully managed” cloud HDFS, as a user you still have to make decisions such as provisioning an HDFS cluster through the service. This means you need a good understanding of the amount of storage you will be using and must let the service provider know if you need more or less after provisioning the initial cluster. Requiring the user to provide such information often results in confusion; in most cases the initial decision won’t be accurate, and as usage continues the provisioned resources will need to be adjusted. You cannot expect a user to accurately know how much storage they will need in the next six months or a year.

In addition, the notion of a cluster brings many limitations. A cluster has a finite amount of resources, and as usage continues, those resources are consumed and more are needed. In the case of our “managed” HDFS service, the provisioned storage (disk space) is one of the limited resources, and as more and more data is stored in the cluster, the available storage shrinks. At this point, the user has to decide between scaling up the existing cluster by adding more nodes and disk space or adding a new cluster to accommodate the growing need for storage. To get ahead of such issues, users may over-provision resources, which in turn can result in unused resources and extra cost. Finally, once a cluster is provisioned, the user starts incurring the cost of the whole cluster regardless of whether half or all of the cluster’s resources are utilized. In short, a managed cloud service in most cases puts the burden of resource provisioning on the user, which in turn requires the user to have a deep understanding of the resources required not just now, but for the short- and long-term future.

Unlocking Cloud Simplicity: The Serverless Advantage

Now let’s assume that instead of taking the managed service path, a vendor takes a different route and builds a new cloud-only storage service from the ground up, where all the complexity and operations are handled under the hood by the vendor and users don’t have to think about concerns such as resource provisioning as described above. In the case of storage, object store services such as AWS S3 are great examples of this approach, which is called serverless. As an S3 user, you just need to set up buckets and folders and read and write your files. There is no need to manage anything or provision any cluster, and no need to worry about having enough disk space or nodes. All operational aspects of the service, including making sure it is always available with the required storage space, are handled by S3. This is a huge win for users, since they can focus on building their applications instead of worrying about provisioning and sizing clusters correctly. With such simplicity of use, we can see why almost every cloud user turns to object stores such as S3 for their storage needs unless there is an explicit requirement to use something else. S3 is a great example of the superiority of cloud-native serverless architecture over providing a “fully managed” version of on-prem products.

Another benefit of serverless platforms such as S3 compared to managed services is that S3 enables users to access, organize, and secure their data in one place instead of dealing with multiple clusters. In S3 you can organize your data in buckets and folders, have a unified view of all of your data, and control access to your data in one place. The same cannot be said for a managed HDFS service if you have more than one cluster! In that case, users have to keep track of which cluster has which data and how to control access to data across multiple clusters, which is a much more complex and error-prone process.

Choosing Serverless for Stream Processing

We can make the same argument in favor of serverless offerings over managed services for many other platforms, including stream processing and streaming databases. You can have “fully managed” services where the user has to provision a cluster, along with specifying the amount of resources that cluster will have, before writing any query. Indeed, managed services for stream processing involve far more complexity than the managed HDFS example above. The cluster in the stream processing case is shared among multiple queries, which means imperfect isolation and the possibility of one bad query bringing down the whole cluster, disrupting other queries even though they were healthy and running with no issues. To exacerbate the situation, as you add more streaming queries to the cluster, its resources will eventually all be used, since streaming queries are long-running jobs, and you will need to scale up your cluster or launch a new one to accommodate newer queries. The first option results in a larger cluster with more queries that share the same resources and can interfere with each other.

The second option, on the other hand, results in a growing number of clusters to keep track of, along with which query is running on which cluster. So anytime you have to provision or declare a cluster in a “fully managed” stream processing or streaming database service, you are dealing with a managed service along with the above-mentioned restrictions and many more. Even worse, once you provision a cluster, billing for the cluster starts regardless of whether no queries or several queries are running in it.

We built DeltaStream as a serverless platform because we believe such a service should be as easy to use as S3. You can think of DeltaStream as the S3 of stream processing. In DeltaStream there is no notion of a cluster or provisioning. You just connect to your streaming stores, such as Apache Kafka or AWS Kinesis, and you are up and running, ready to run queries. You only pay for queries that are running, and since there is no concept of a cluster, you won’t be charged for idle or underutilized clusters! Focus on building your streaming applications and pipelines and leave the infrastructure to us. Launch as many queries as you want; there is no notion of running out of resources. Your queries run in isolation, and we can scale them up and down independently without them interfering with each other.

DeltaStream is a unified platform that provides stream processing (streaming analytics) and a streaming database in one place. You can build streaming pipelines, event-based applications, and always up-to-date materialized views with familiar SQL syntax. In addition, DeltaStream enables you to organize and secure your streaming data across multiple streaming storage systems. If you are using any flavor of Apache Kafka (including Confluent Cloud, AWS MSK, or RedPanda) or AWS Kinesis, you can now try DeltaStream for free by signing up for our free trial.

22 Jun 2023

What is Streaming ETL and how does it differ from Batch ETL?

In today’s data-driven world, organizations are seeking effective and reliable ways to extract insights to make timely decisions based on the ever-increasing volume and velocity of data. ETL (Extract, Transform, Load) is a process where data is extracted from various sources, transformed to fit specific requirements, such as cleaning, formatting, and aggregating, and loaded into a target system or data warehouse. ETL ensures data consistency, quality, and usability, and enables organizations to effectively analyze their data. Traditional batch processing, while effective for certain use cases, falls short in meeting the demands of real-time and event-driven data processing and analysis. This is where streaming ETL emerges as a powerful solution.

Streaming ETL: An Overview

Unlike traditional batch processing, which operates on fixed intervals, streaming ETL operates on a continuous stream of data as records arrive, allowing for real-time analysis. A streaming ETL pipeline begins with the continuous data ingestion phase, in which records are collected from different sources, varying from databases to event streaming platforms like Apache Kafka and Amazon Kinesis. Once data is ingested, it goes through real-time transformation operations for cleaning, normalization, enrichment, etc. Stream processing frameworks such as Apache Flink, Kafka Streams, ksqlDB, and Apache Spark provide tools and APIs to apply these transformations to prepare the data. The same frameworks allow processing data in real time and support operations such as real-time aggregation and complex machine learning operations. Finally, the results of the streaming ETL are delivered to downstream systems and applications for immediate consumption, or they are stored in data warehouses and data stores for future use.
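As a small illustration of the transformation step (the stream names and fields below are hypothetical), a streaming ETL job might clean and normalize raw events as they are ingested and write the results out for downstream consumers:

  -- Hypothetical sketch of the "transform" step of a streaming ETL pipeline:
  -- clean and normalize raw order events as they arrive.
  CREATE STREAM orders_clean AS
  SELECT
    order_id,
    UPPER(TRIM(country_code)) AS country_code,  -- normalize formatting
    CAST(amount AS DOUBLE) AS amount,           -- enforce a numeric type
    order_ts
  FROM orders_raw
  WHERE order_id IS NOT NULL                    -- drop malformed records
    AND amount IS NOT NULL;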

Streaming ETL can be applied in various domains, including fraud detection and prevention, real-time analytics and personalization for targeted advertisements, and IoT data processing and monitoring to handle high velocity and volume of data generated by devices such as sensors and smart appliances.

Why should you use Streaming ETL?

Streaming ETL offers numerous advantages in real-time data processing. Here is a list of the most important ones:

  • Streaming ETL provides real-time insights into the emerging trends, anomalies and critical events in the data as they happen. It operates with low latency, and ensures the processing results are up to date. This reduces the gap between the time data arrives and the time it is processed. This facilitates accurate and timely decision-making, and enables organizations to capitalize on time-sensitive opportunities or address emerging issues promptly.
  • Streaming ETL frameworks are designed to scale horizontally, which is crucial for handling increased data volumes and processing requirements in real-time applications. This elasticity allows for seamless scaling of resources based on demand, and enables the system to manage spikes in data volume and velocity without sacrificing the performance.

With all its advantages, Streaming ETL also presents some challenges:

  • The streaming ETL process typically introduces additional complexity to data processing. Real-time data ingestion, continuous transformations, and persisting results while maintaining performance and data consistency require careful design and implementation.
  • Streaming ETL pipelines run in a distributed streaming environment, which introduces new challenges to data processing. Unless an appropriate delivery guarantee such as exactly-once or at-least-once is used, there is a risk of delays or data loss during the ingestion and delivery stages due to parallel and asynchronous processing when applying transformations. Ordering events and maintaining data consistency are complex in such situations, and if not handled properly, they may impact the accuracy of computations that rely on event order. Using fault-tolerant mechanisms such as replication, checkpointing, and backup strategies is essential to prevent data loss and ensure the reliability and correctness of results.

Using a modern stream processing platform, such as DeltaStream, can help address the above challenges and enable organizations to benefit from all the advantages of Streaming ETL.

Differences between Streaming ETL and Batch ETL

Data processing model: Batch ETL starts with collecting large volumes of data over a time period and processes these batches in fixed intervals. Therefore, it applies transformations on the entire dataset as a batch. Streaming ETL operates on data as it arrives in real-time and continuously processes and transforms data as individual records or small chunks.

Latency: Batch ETL introduces inherent latency since data is processed in intervals. This latency normally ranges from minutes to hours. Streaming ETL processes data in real-time and offers low latency. Results are available immediately and are updated continuously.

Data volume and velocity: Batch ETL is well-suited for processing large volumes of data, collected over time. Therefore it is effective when dealing with historical data. Streaming ETL, on the other hand, is designed for high-velocity data streams and is effective for use cases that require immediate processing.

Processing frameworks: Batch ETL typically utilizes frameworks like Apache Hadoop, Apache Spark, and traditional ETL and data warehouse tools. These frameworks are optimized for processing large volumes of data in parallel, but not necessarily for real-time use cases. Streaming ETL leverages specialized stream processing frameworks such as Apache Flink and Apache Kafka, which are optimized for processing continuous streams of data in real time. With recent improvements, some of these frameworks, such as Apache Flink, are now capable of processing batch workloads too, and as these efforts continue, the overlap between the frameworks that handle the two kinds of workloads is expected to grow.
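
Flink's unified runtime illustrates this convergence: the same SQL statement can run in streaming or batch execution mode with a single configuration switch. Below is a minimal sketch from the Flink SQL client; the table name and column are hypothetical, and batch mode requires the source to be bounded.

  -- run continuously over an unbounded source, emitting incremental results
  SET 'execution.runtime-mode' = 'streaming';
  SELECT country_code, COUNT(*) AS order_count FROM orders GROUP BY country_code;

  -- run the same statement once over a bounded source, producing one final result
  SET 'execution.runtime-mode' = 'batch';
  SELECT country_code, COUNT(*) AS order_count FROM orders GROUP BY country_code;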

Fault tolerance: Batch ETL typically processes large volumes of data in fixed intervals. If a failure occurs, all the data within that batch may be affected, which can lead to partially written results. This makes failure recovery challenging in Batch ETL, as it involves cleaning up partial results and reprocessing the entire batch. Removing partial results and state and starting a new run is a complicated process that normally involves manual intervention. Reprocessing a batch is time-consuming and resource-intensive, and it can delay the next batch because the current one has fallen behind. Moreover, rerunning some tasks can have unexpected side effects that impact the correctness of the final results. Such issues need to be handled properly during a job restart.

Streaming ETL does not consist of many short-lived jobs that run repeatedly over time; instead, a single long-running job maintains its state and performs incremental computation as data arrives. Therefore, Streaming ETL is generally better equipped to handle failures and partial results. Because results are generated incrementally, a failure does not force discarding already produced results and reprocessing the sources from the beginning. Stream processing frameworks provide transaction-like processing, exactly-once semantics, and write-ahead logs, ensuring atomicity and data consistency. They have built-in mechanisms for fault recovery, handle out-of-order events, and ensure end-to-end reliability by leveraging distributed messaging systems.
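
In Flink, for instance, this incremental fault tolerance is driven by periodic checkpoints. A sketch of enabling them from the SQL client follows; the interval and the storage path are illustrative assumptions, not recommendations.

  -- take a consistent snapshot of all operator state every 60 seconds
  SET 'execution.checkpointing.interval' = '60 s';
  -- store checkpoints durably so a restarted job can resume from the latest one (hypothetical path)
  SET 'state.checkpoints.dir' = 's3://my-bucket/flink-checkpoints';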

Choosing Between Streaming ETL and Batch ETL

There are several factors to consider when deciding between Streaming ETL and Batch ETL for a data processing use case. The most important factor is the latency requirement: consider the desired latency of insights and actions. If a real-time response is critical, then Streaming ETL is the correct choice. The other important factors are data volume and velocity, along with the cost of processing. You should evaluate the volume of data and the rate at which it arrives. Streaming ETL is capable of processing fast data immediately; however, due to its inherent complexity and higher resource demand, it is more difficult to maintain.

Streaming ETL also introduces challenges related to maintaining the correct order of events, especially in distributed environments. Batch ETL processes data in a much more deterministic and mostly sequential manner, which ensures data consistency and ordered processing. A modern stream processing platform is a viable way to handle these challenges when picking Streaming ETL as a solution. Finally, you need to consider how often the data sources evolve over time in your use case, as that can impact the structure of incoming records. Data processing pipelines need to handle schema evolution properly to prevent disruptions and errors; managing schema changes, versioning, and implementing schema inference mechanisms become crucial to ensure correctness and reliability. Using a stream processing framework enables you to address these changes in a streaming ETL pipeline, which is intended to run continuously without interruption.

Conclusion

Choosing between streaming ETL and batch ETL requires a thorough understanding of the specific requirements and trade-offs of each. While both approaches have their strengths and weaknesses, they are effective in different use cases and for different data processing needs. Streaming ETL offers real-time processing with low latency and high scalability to handle high-velocity data. On the other hand, batch ETL is well-suited for historical analysis, scheduled reporting, and scenarios where near-real-time results are not critical. In this blog post, we covered the specifics as well as the pros and cons of each approach, and explained the important factors to consider when deciding which one to choose for a given data processing use case.

DeltaStream provides a comprehensive stream processing platform to manage, secure, and process all your event streams. You can easily use it to create your streaming ETL solutions; it is easy to operate and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.

05 Jun 2023

Min Read

All About Streaming Data Mesh

The Current State of Streaming Data

The adoption of streaming data technologies has grown exponentially over the past 5 years. Every industry understands the importance of accessing and understanding data in real time, so it can make decisions about its products, services, and customers. The sooner you can access an event, whether it is a customer purchasing an item or an IoT device relaying data, the faster you can process and react to it. Making all this happen in real time is where technologies like Apache Kafka, AWS Kinesis, Google Pub/Sub, Apache Flink, and many more come into play. Most of these are available as a managed service, making it easy for companies to adopt them. For example, Confluent and AWS offer Apache Kafka as a managed service.

With rapid adoption of these services, many companies now have access to real-time data.

By their very nature, technologies like Apache Kafka interact with multiple types of data producers and consumers. While data lakes and data warehouses meant storing large amounts of data in centralized repositories, data streaming technologies have always promoted decentralization of services and the data behind them. As data sets transit these systems, being able to manage, access, and self-provision data is extremely important. This is where Data Mesh comes into the picture.

What is a Data Mesh?

Data Mesh, as first defined by Zhamak Dehghani, is a modern data management architecture that facilitates and promotes a decentralized way to manage, access, and share data. The four core principles of a data mesh are:

  • Domain ownership 
  • Data as a product
  • Self-serve data platform 
  • Federated computational governance

Image source: datamesh-architecture.com

A Data Mesh gives you the agility to work with the data that flows in and out of complex systems across multiple organizations. It reduces dependencies and removes bottlenecks that arise while accessing data; thereby improving the organization’s ability to respond and make business decisions.

While there are tools and technologies that can help you build a Data Mesh over "Data at Rest", the same isn't true for "Data in Motion". This is where DeltaStream comes into the picture: it enables organizations to develop a Data Mesh over and across all their streaming data, which may span multiple platforms and multiple public cloud providers (Apache Kafka, Confluent Cloud, Kinesis, Pub/Sub, Redpanda, etc.).

Streaming Data Mesh with DeltaStream

What we have built at DeltaStream is an extremely powerful framework that provides a single platform for accessing your operational and analytical data, in real time, across your entire organization in a decentralized manner. No more expensive pipelines, duplicated data, or central teams to manage data lakes and data warehouses.

Let’s revisit the core tenets of Data Mesh and how DeltaStream helps you achieve each one of those and more.

Domain Ownership

  • In DeltaStream, data boundaries are clearly defined using our namespaces and RBAC. Each stream is isolated with strict access control, and any query against that stream inherits the same permissions and boundaries. This way, your core data sets remain under your purview and you control what others can access. Queries are isolated at the infrastructure level, which means they scale independently, reducing cost and operational complexity. With all these controls in place, the core objective of decentralized ownership of data streams is achieved while keeping your data assets secure.

Data as a Product

  • Making data discoverable and usable while maintaining its quality is central to serving data as a product to your internal teams. Being able to do this as close to the source as possible is extremely beneficial and in line with the data mesh philosophy of domain owners owning the quality of their data. This is exactly what you can achieve with DeltaStream: using our powerful SQL interface, you can quickly alter and enrich your data and get it ready to use in real time. As your data models change, DeltaStream's Data Catalog, along with the schema registry, can track your data's evolution and help you iterate on your data assets as you grow.

Self-serve Data Platform

  • Once you achieve "Domain Ownership" and "Data as a Product", self-service follows. The main objective of self-service is to remove the need for a centralized team to coordinate data distribution across the entire company. Leveraging DeltaStream's catalog, you can make your high-quality data streams discoverable, and by combining this with our RBAC you can securely share them with both internal and external users. This means data flows continuously from its source systems to end users without ever needing to be stored, extracted, or loaded. This is powerful in that you get to securely democratize streaming data across your company, while also being able to share it with users outside of your organization.

Federated Computational Governance

  • With everything decentralized, from data ownership to data delivery, governance becomes a very important requirement. It ensures that data originating in a particular domain is consumable in any part of the organization, and schemas and a schema registry go a long way toward making that possible. Data, just like the organization it serves, changes and evolves over time, so integrating your data streams with a schema registry is critical to maintaining a common language for communication. Interoperability is also a key component of governance, and DeltaStream's ability to operate across multiple streaming stores is a great capability here.

Conclusion

Decentralization of data ownership and delivery is critical for organizations to be nimble. As Data Mesh gains traction, there are tools and technologies readily available to implement it over your data at rest, but the same can't be said for streaming data. This is what DeltaStream provides: with tooling built around its core processing framework, it enables companies to implement a data mesh on top of streaming data.

As streaming data continues its rapid growth, it is extremely important for organizations to set up the right foundations to maximize the value of their real-time data. While architectures like Data Mesh provide the right framework, platforms like DeltaStream offer a lot to bring it all together.

We offer a free trial of DeltaStream that you can sign up for to start using DeltaStream today. If you have any questions, please feel free to reach out to us.

10 May 2023

Min Read

Why Apache Flink is the Industry Gold Standard

Apache Flink is a distributed data processing engine that is used for both batch processing and stream processing. Although Flink supports both batch and streaming processing modes, it is first and foremost a stream processing engine used for real-time computing. By processing data on the fly as a stream, Flink and other stream processing engines are able to achieve extremely low latencies. Flink in particular is a highly scalable system, capable of ingesting high throughput data to perform stateful stream processing jobs while also guaranteeing exactly-once processing and fault tolerance. Initially released in 2011, Flink has continuously been updated and grown into one of the largest open source projects. Large and reputable companies such as Alibaba, Amazon, Capital One, Uber, Lyft, Yelp, and others have relied on Flink to provide real-time data processing to their organizations.

Before stream processing, companies relied on batch jobs to fulfill their big data needs, where jobs would only perform computations on a finite set of data. However, many use cases don't naturally fit a batch model. Consider fraud detection, for example, where a bank may want to act if anomalous credit card behavior is detected. In the batch world, the bank would need to collect credit card transactions over a period of time and then run a batch job over that data to determine whether a transaction is potentially fraudulent. Since the batch job runs after the data collection, it might be hours before a transaction is marked as fraudulent, depending on how frequently the job is scheduled. In the streaming world, transactions can be processed in real time and the latency to action is much lower. Being able to freeze credit card accounts in real time can save time and money for the bank. You can read more about the distinction between streaming and batch in our other blog post. A streaming-SQL sketch of this kind of check appears below.
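
For illustration only, here is a minimal continuous query that flags suspicious transactions as they arrive. It is written in generic streaming SQL; the table name, columns, and threshold are hypothetical and not a real fraud model.

  -- flag transactions above a (hypothetical) amount threshold as they arrive
  SELECT card_id, txn_id, amount, txn_time
  FROM transactions
  WHERE amount > 10000;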

Without Apache Flink or other stream processing engines, you could still write code to read from a streaming message broker (e.g. Kafka, Kinesis, RabbitMQ) and do stream processing from scratch. However, with this approach you would miss out on many of the features that modern stream processing engines give you out of the box, such as fault tolerance, the ability to perform stateful queries, and APIs that vastly simplify processing logic. There are many stream processing frameworks in addition to Apache Flink, such as Kafka Streams, ksqlDB, Apache Spark (Spark Streaming), Apache Storm, Apache Samza, and dozens of others. Each has its own pros and cons, and comparing them could easily become its own blog post, but Apache Flink has become an industry favorite because of these key features:

  • Flink’s design is truly based on processing streams instead of relying on micro-batching
  • Flink’s checkpointing system and exactly-once guarantees
  • Flink’s ability to handle large throughputs of data from many different types of data sources at a low latency
  • Flink's SQL API, which leverages Apache Calcite, allows users to define their Flink jobs using SQL, making it easier to build applications

Flink is designed to be a highly scalable distributed system, capable of being deployed with most common setups, that can perform stateful computations at in-memory speed. To get a high level understanding about Flink’s design, we can look further into the Flink ecosystem and execution model.

The Flink ecosystem can be broken down into 4 layers:

  • Storage: Flink doesn’t have its own storage layer and instead relies on a number of connectors to connect to data. These include but are not limited to HDFS, AWS S3, Apache Kafka, AWS Kinesis, and SQL databases. You can similarly use managed products such as Amazon MSK, Confluent Cloud, and Redpanda among others.
  • Deploy: Flink is a distributed system and integrates seamlessly with resource managers like YARN and Kubernetes, but it can also be deployed in standalone mode or in a single JVM.
  • Runtime: Flink’s runtime layer is where the data processing happens. Applications can be split into tasks that are parallelizable and thus highly scalable across many cluster nodes. With Flink’s incremental and asynchronous checkpointing algorithm, Flink is also able to provide fault tolerance for applications with large state and provide exactly-once processing guarantees while minimizing the effect on processing latency.
  • APIs and Libraries: The APIs and libraries layer is provided to make it easy to work with Flink's runtime. The DataStream API can be used to declare stream processing jobs and offers operations such as windowing, state updates, and transformations. The Table and SQL APIs can be used for both batch and stream processing: the SQL API allows users to write SQL queries to define their Flink jobs, and the Table API allows users to call functions on their data based on the table schemas that define them. A small SQL API sketch follows this list.
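
As a rough illustration of the SQL API, the sketch below declares a Kafka-backed source table and a continuous query over it. The topic, broker address, and columns are hypothetical, and connector options can differ across Flink and connector versions.

  CREATE TABLE pageviews (
    user_id STRING,
    url     STRING,
    view_ts TIMESTAMP(3)
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'pageviews',
    'properties.bootstrap.servers' = 'broker:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
  );

  -- a continuous query: per-URL counts keep updating as new pageview events arrive
  SELECT url, COUNT(*) AS views
  FROM pageviews
  GROUP BY url;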

In the execution architecture, there is a client and the Flink cluster. Two kinds of nodes make up the Flink cluster – the Job manager and the Task manager. There is always a Job manager and any number of Task managers.

  • Client: Responsible for generating the Flink job graph and submitting it to the Job manager.
  • Job manager: Receives the job graph from the Client then schedules and oversees the execution of the job’s tasks on the Task managers.
  • Task manager: Responsible for executing all the tasks assigned to it by the Job manager and for reporting status updates back to the Job manager.

In Figure 1, you’ll notice that storage and deployment are somewhat detached from the Flink runtime and APIs. The benefit of designing Flink this way is that Flink can be deployed in any number of ways and connect to any number of storage systems, making it a good fit for almost any tech stack. In Figure 2, you’ll notice that there can be multiple Task managers and they can be scaled horizontally, making it easy to add resources to run jobs at scale.

Apache Flink has become the most popular real-time computing engine and the industry gold standard for many reasons, but its ability to process large volumes of real-time data with low latency is its main draw. Other features include the following:

  • Stateful processing: Flink is a stateful stream processor, allowing for transformations such as aggregations and joins of continuous data streams.
  • Scalability: Flink can scale up to thousands of Task manager nodes to handle virtually any workload without sacrificing latency.
  • Fault tolerance: Flink's state-of-the-art checkpointing system and distributed runtime make it both fault tolerant and highly available. Checkpoints are used to save the application state, so when your application inevitably encounters a failure, it can be resumed from the latest checkpoint. For applications that use transactional sinks, exactly-once end-to-end processing can be achieved. Also, since Flink is compatible with most resource management systems like YARN and Kubernetes, nodes that crash will automatically be restarted. Flink also has high-availability setup options based on Apache ZooKeeper to help prevent failure states.
  • Time mechanisms: Flink has a concept called "watermarks" for working with event time, which represents the application's progress with respect to time. Different watermark strategies can be applied, allowing users to configure how they want to handle out-of-order or late data. Watermarks are important for the system to understand when it is safe to close windows and emit results for stateful operations (see the SQL sketch after this list).
  • Connector ecosystem: Since Flink uses external systems for storage, Flink has a rich ecosystem of connectors. Wherever your data is, Flink likely already has a library to connect to it. Popular existing connectors include connectors for Apache Kafka, AWS Kinesis, Apache Pulsar, JDBC connector for databases, Elasticsearch, and Apache Cassandra. If a connector you need doesn’t exist, Flink provides APIs to help you create a new connector.
  • Unified batch and streaming: Flink is both a batch and stream processor, making it an ideal candidate for applications that need both streaming and batch processing. With Flink, you can use the same application logic for both cases reducing the headache of managing two different technologies.
  • Large open source community: In 2022, Flink had over 1,600 contributions and the GitHub project currently has over 21,000 stars. To put this in perspective, Apache Kafka currently has about 24,800 stars and Apache Hadoop about 13,500 stars. Having a large and active open source community is extremely valuable for fixing bugs and staying up to date with the latest tech requirements.
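
To make the watermark idea concrete, here is a small Flink SQL sketch: the source table declares an event-time column with a five-second watermark delay, which tells the runtime when each one-minute window can safely close. The table name, columns, topic, and delay are hypothetical.

  CREATE TABLE sensor_readings (
    sensor_id   STRING,
    temperature DOUBLE,
    event_time  TIMESTAMP(3),
    -- tolerate events arriving up to 5 seconds out of order
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'sensor-readings',
    'properties.bootstrap.servers' = 'broker:9092',
    'format' = 'json'
  );

  -- each one-minute window is emitted once the watermark passes its end
  SELECT sensor_id, window_start, AVG(temperature) AS avg_temp
  FROM TABLE(TUMBLE(TABLE sensor_readings, DESCRIPTOR(event_time), INTERVAL '1' MINUTES))
  GROUP BY sensor_id, window_start;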

Conclusion

In this blog post, we've introduced Apache Flink at a high level and highlighted some of its features. While we would suggest that those interested in adding stream processing to their organizations survey the different stream processing frameworks available, we're also confident that Flink would be a top choice for almost any use case. At DeltaStream, although we are not tied to any particular compute engine, we have chosen to use Flink for the reasons explained above. DeltaStream is a unified, serverless stream processing platform that takes the pain out of managing your streaming stores and launching stream processing queries. If you're interested in learning more, get a free trial of DeltaStream.

17 Apr 2023

Min Read

Batch vs. Stream Processing: How to Choose

Rapid growth in the volume, velocity and variety of data has made data processing a critical component of modern business operations. Batch processing and stream processing are two major and widely used methods to perform data processing. In this blog, we will explain batch processing and stream processing and will go over their differences. Moreover, we will explore the pros and cons of each, and discuss the importance of choosing the right approach for your use cases. If you are looking for Streaming ETL vs. Batch ETL, we have a blog on that too.

What is Batch Processing?

Batch processing is a data processing method that operates on large volumes of fully stored data. Depending on the specifics of the use case, a batch processing pipeline typically consists of multiple steps, including data ingestion, data preprocessing, data analysis, and data storage. Data goes through various transformations for cleaning, normalization, and enrichment, and different tools and frameworks can be used for analysis and value extraction. The final processed data is stored in a new location and format, such as a database, a data warehouse, a file, or a report.

Batch processing is used in a wide range of applications where large volumes of data need to be processed efficiently and cost-effectively with high throughput. Examples include ETL processes for data aggregation, log processing, and training predictive ML models.

Pros and cons of batch processing

Batch processing has its advantages and disadvantages. Here are some of its pros and cons.
Pros:

  • Batch processing is efficient and cost-effective for applications with large volumes of already stored data.
  • The process is predictable, reliable and repeatable which makes it easier to schedule, maintain, and recover from failures.
  • Batch processing is scalable and can handle workloads with complex transformations on large volumes of data.

Cons:

  • Batch processing has a high latency, and processing time can be long depending on the volume of data and complexity of the workload.
  • Batch processing is generally not interactive while the processing is running; users need to wait until the whole process is complete before they can access the results. This means batch processing does not provide real-time insights into data, which can be a disadvantage in applications where real-time access to (partial) results is necessary.

What is Stream Processing?

Stream processing is a data processing method where processing is done in real time as data is being generated and ingested. It involves analyzing and processing continuous streams of data in order to extract insights and information from them. A stream processing pipeline typically consists of several phases including data ingestion, data processing, data storage, and reporting. While the steps in a stream processing pipeline may look similar to those in batch processing, these two methods are significantly different from each other.

Stream processing pipelines are designed to have low latency and process the data in a continuous mode. When it comes to the complexity of transformations, batch processing normally involves more complex and resource intensive transformations which run over large, discrete batches of data. In contrast, stream processing pipelines run simpler transformations on smaller chunks of data, as soon as the data arrives. Given that stream processing is optimized for continuous processing with low latency, it is suitable for interactive use cases where users need to receive immediate feedback. Examples include fraud detection and online advertising.

Pros and cons of stream processing

Here is a list of pros and cons for stream processing.
Pros:

  • Stream processing allows for real-time processing of data with low latency, which means results are available quickly to serve use cases and applications with real-time processing demands.
  • Stream processing pipelines are flexible and can be quickly adapted to changes in data or processing needs.

Cons:

  • Stream processing pipelines tend to be more expensive because they must run long-lived jobs and handle data in real time, which requires more powerful hardware and faster processing capabilities. However, with serverless platforms such as DeltaStream, stream processing can be simplified significantly.
  • Stream processing pipelines require more maintenance and tuning because any change or error in the data must be addressed immediately as it is processed in real time. They also need more frequent updates and modifications to adapt to changes in the workload. Serverless stream processing platforms like DeltaStream simplify this aspect of stream processing as well.

Choosing between Stream Processing and Batch Processing

There are several factors to consider when deciding whether to use stream processing or batch processing for your use case. First, consider the nature and specifics of the application and its data. If the application is highly time-sensitive and requires real-time analysis and immediate response (low latency), then stream processing is the option to choose. On the other hand, if the application can rely on offline and periodic processing of large amounts of data, then batch processing is more appropriate.

The second factor is the volume and velocity of the data. Stream processing is suitable for processing continuous streams of data arriving at high velocity. However, if the data volume is too high to be processed in real time and requires extensive, complex processing, batch processing is most likely the better choice, though it comes at the cost of sacrificing low latency. Scaling stream processing up to thousands of workers to handle petabytes of data is very expensive and complex. Finally, you should consider the cost and resource constraints of the application. Stream processing pipelines are typically more expensive to set up, maintain, and tune, while batch processing pipelines tend to be more cost-effective and scalable, as long as no major changes occur in the workload or the nature of the data.

Conclusion

Batch processing and stream processing are two widely used data processing methods for data-intensive applications. Choosing the right approach for a use case depends on the characteristics of the data being processed, the complexity of the workload, the frequency of data and workload changes, and the use case's requirements in terms of latency and cost. It is important to evaluate your data processing requirements carefully to determine which approach is the best fit. Moreover, keep an eye on emerging technologies and solutions in the data processing space, as they may provide new options and capabilities for your applications. While batch processing technologies have been around for several decades, stream processing technologies have seen significant growth and innovation in recent years. Besides open-source stream processing frameworks such as Apache Flink, which have gained widespread adoption, cloud-based stream processing services are emerging as well, aiming to provide easy-to-use and scalable data processing capabilities.

DeltaStream is a unified serverless stream processing platform to manage, secure, and process all your event streams. It is easy to use, easy to operate, and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking our blog series. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo and start using the system.
