18 Jan 2024

Processing Streaming Data with WarpStream & DeltaStream

WarpStream is an Apache Kafka® compatible data streaming platform built directly on top of object storage. It’s delivered as a single, stateless Go binary, so there are no local disks to manage, no brokers to rebalance, and no ZooKeeper to operate. WarpStream clusters can scale out, and back in, instantaneously and without rebalancing.

DeltaStream is a serverless stream processing platform powered by Apache Flink. With features including Role-Based Access Control (RBAC), a Streaming Catalog, and Materialized Views, DeltaStream offers the capabilities needed to process, share, and govern all of your streaming data across your streaming storage systems.

In this blog post, we will walk you through a simple but powerful use case that demonstrates how to process a stream of data produced by a WarpStream cluster using DeltaStream. 

Before getting started, sign up for DeltaStream and WarpStream. Also, make sure to install the WarpStream CLI by running the following command in your terminal:

  curl https://console.warpstream.com/install.sh | sh

We’re also going to use a producer client that’s written in Go, so your environment needs to be set up to run Go. The easiest way to do this is using Homebrew:

  brew install go

WarpStream Setup

After signing up for WarpStream, we’re going to log into the WarpStream Console and create a cluster. Then, we’ll use the ClickHouse ClickPipes data loader to produce data to WarpStream.

Step 1: Create a WarpStream cluster

Log into the WarpStream Console, and click New Cluster.

In the dropdown on the cluster creation screen, select Serverless.

Then, navigate to the Connect tab, and click Create Credentials. 

Store these values, as well as the Bootstrap URL, as environment variables for easy reference:

  export BOOTSTRAP_URL=<YOUR_BOOTSTRAP_URL> \
    SASL_USERNAME=<YOUR_SASL_USERNAME> \
    SASL_PASSWORD=<YOUR_SASL_PASSWORD>;

Note: Do not copy the port (“:9092”) for your Bootstrap URL. Just copy the URL.

Also, download the .txt file containing your credentials (you’ll need these later). 

Step 2: Produce some Data

After your cluster is set up, use the ClickHouse ClickPipes data loader to produce some data to your WarpStream cluster.

Auto-topic creation is enabled by default in WarpStream, so you won’t need to create a topic manually.

The data loader is a producer client written in Go that reads data from a CSV file and writes it to a topic using the Kafka protocol. Simply download the contents of the repo locally, and download one of the compressed datasets linked in the README. Then, make sure that the file var in main.go contains the path to your compressed CSV. By default, we will be reading logs-534.csv.gz, but if you downloaded a different dataset, just replace the file name when you edit the path. There’s no need to uncompress the file.

Once your producer is running, navigate to the Topics screen in the WarpStream Console and click on View Details next to the newly-created dns_logs_topic_test topic to confirm that the producer is producing data:

Congrats! You’re producing a stream of 10,000 records per second to WarpStream. Now it’s time to configure DeltaStream to process this stream of data.

DeltaStream Setup

Once you sign up and log in to DeltaStream, you’ll need to complete two steps to start processing data from WarpStream.

Step 1: Connect WarpStream to DeltaStream

From the Stores tab, click New Store and provide the WarpStream Bootstrap URL (you can get this from the Connect tab in the WarpStream Console; it ends in :9092).

Navigate to the next step and provide the Username, Password, and SASL Hash Function. Choose PLAIN from the SASL Hash Function drop-down, and for the Username and Password use the credentials you obtained from the WarpStream Console when you created your cluster.

If you forgot to save your SASL password, that’s OK; just go back to the WarpStream Console and create another set of credentials.

Click Save, and now your WarpStream cluster is connected to DeltaStream.

Step 2: Create a Database

Create a Database from the Catalog tab. Give it a name and click Save.

Once this is done, you can start writing your SQL queries in the SQL Editor tab.

Process your Data Stream

Now, let’s do some stream processing. In this example, we are going to inspect a stream of DNS logs to capture errors based on the “status” code. Here’s a sample message from the dns_logs_topic_test topic on WarpStream.

  {
    "key": {},
    "value": {
      "referer": "https://www.zanbil.ir/m/filter/b8%2Ct31",
      "remote_addr": "5.214.135.81",
      "remote_user": "-",
      "request_path": "/image/59053/productModel/200x200",
      "request_protocol": "HTTP/1.1",
      "request_type": "GET",
      "runtime": "9540",
      "size": "6788",
      "status": "200",
      "time_local": "2019-02-01 09:56:13",
      "user_agent": "Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-A800I Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/8.2 Chrome/63.0.3239.111 Mobile Safari/537.36"
    }
  }

To start processing the incoming messages, we are going to CREATE a STREAM called dns_logs using SQL:

  CREATE STREAM dns_logs
    (referer VARCHAR, request_type VARCHAR, remote_addr VARCHAR, status VARCHAR, user_agent VARCHAR)
    WITH ('topic'='dns_logs_topic_test', 'value.format'='json');

From here, we can filter for the error messages as they arrive using a simple SQL statement:

  SELECT * FROM dns_logs WHERE status != '200';

You can take this further and store all the “filtered” messages in a new topic in WarpStream, so that you can use them for more downstream processing. This can be done using the following statement:

  CREATE STREAM dns_logs_error AS SELECT * FROM dns_logs WHERE status != '200';

This statement runs continuously, filtering the incoming messages as they arrive, and creates a new dns_logs_error topic on your WarpStream cluster to store the filtered messages.
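
As a quick sanity check, you can peek at the filtered records as they land in the new topic. A minimal sketch, assuming your DeltaStream session is pointed at the WarpStream Store:

  -- inspect a few of the filtered error records as they arrive
  PRINT TOPIC dns_logs_error;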

You can also perform aggregations and create Materialized Views. These Materialized Views are accessible over APIs, so you can quickly build real-time dashboards and applications powered by them.

  CREATE MATERIALIZED VIEW log_request_count AS
  SELECT remote_addr, count(request_type) AS request_count
  FROM dns_logs WHERE status != '200' GROUP BY remote_addr;

This statement creates a Materialized View (which in DeltaStream is an always up-to-date snapshot of your data) of the error messages aggregated by IP address (represented by remote_addr in the JSON). With this Materialized View, you can:

  • Geolocate where the errors are happening to help with diagnosis
  • Stop malicious attacks
  • Build dashboards to monitor errors and alert the right personnel
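
For example, a dashboard could read the noisiest client IPs straight from the view. A minimal sketch, assuming standard SQL reads (with ORDER BY and LIMIT) are supported against Materialized Views:

  -- the IP addresses generating the most errors, suitable for a dashboard or alert
  SELECT remote_addr, request_count
  FROM log_request_count
  ORDER BY request_count DESC
  LIMIT 10;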

Conclusion

By now you should have a pretty solid understanding of how to create a streaming data pipeline between WarpStream and DeltaStream.

Next, learn how the WarpStream architecture is a better fit than Apache Kafka for streaming workloads running in the cloud, or learn more about how DeltaStream makes it easy to process streams of data.

With WarpStream and DeltaStream you have everything you need to build a real-time streaming platform. Get started and take the first steps towards becoming an event-driven organization.

Thank you to the WarpStream team for co-authoring this blog.

16 Jan 2024

Streaming Data Governance and DeltaStream

Data Governance, the practice of ensuring your data is accessible, compliant, and secure across your systems, is increasingly vital for organizations. For real-time stream processing, there is a growing need to govern data assets, but to date this has proven difficult to achieve. In this post, we will explore what Data Governance is, why it’s important, and how Data Unification and Data Governance work hand in hand when it comes to stream processing.

What is Data Governance and why is it Important?

Data Governance refers to the rules, policies and systems designed to make data available, secure and compliant across the organization. This includes determining which users are responsible for particular data assets and how these data assets can be accessed. Businesses that have a strong Data Governance system are able to extract the most value out of their data because they are able to easily make data available while ensuring the quality, security, and privacy of their data. This ultimately leads to faster data-driven decision making, improved efficiency for internal operations, and enhanced data security.

The main elements of Data Governance include:

  • Data cataloging – the metadata that is used to organize the different data assets in an organization to improve data discovery and management
  • Access control and security of data – the mechanisms that ensure only authorized users have access to data assets
  • Secure data sharing – the ability to make data resources available to other users
  • Data quality – the measurement of how accurate, complete, timely, and consistent the data is

Data Governance is more Powerful with Unified Data

In the data streaming space, it’s not uncommon for companies to utilize multiple data streaming stores. Having multiple Kafka clusters or some combination of storage systems, such as Kafka with Kinesis, is a pattern we’ve seen many companies adopt across the industry. The reasons for choosing multiple data stores vary, but can typically be attributed to one or more of the following: ensuring data isolation, tech debt as the result of data migrations, different teams preferring different technologies, and different technologies being better suited for different use cases.

However, Data Governance only goes as far as the data platform’s reach. For example, Kafka provides Access Control Lists (ACLs) as a mechanism for Data Governance. Using ACLs, users can define access policies for topics, but those ACLs are limited to the topics in a single Kafka cluster. As we mentioned above, most companies don’t operate a single Kafka cluster as their entire data streaming storage layer, meaning they’ll need to govern each streaming data store individually. If there are two Kafka clusters, this means two separate sets of ACLs that need to be maintained. Herein lies the problem: Data Governance at the individual storage system level has a large overhead, is cumbersome, and is error-prone.

What is really needed is a unified view over all of the data storage systems, so that Data Governance can be applied in one place. Providing capabilities like data namespacing and access control at a layer above the individual storage systems allows users to have more control over how they organize their data, as opposed to being forced to manage data resources based on where that data is physically stored. This way, if a team is responsible for topics in two different Kafka clusters, they could organize them into the same namespace in the unified view. By operating with this unified view, the data silos are eliminated, making it easier for organizations to categorize, build/share data products, and provide policies for their data. Also, having a single place to apply Data Governance relieves the burden on data users who would otherwise need to understand details of the individual storage systems, leading to improved productivity with fewer mistakes.

DeltaStream’s Approach – Unification and Governance of Streaming Data

DeltaStream is a stream processing platform that both Unifies and Governs your streaming data across all your streaming storage systems. Using DeltaStream, organizations are able to get real-time insights across all their data assets in one platform. DeltaStream can connect to both streaming and relational data stores, and provide a unified view of all your data in a Streaming Catalog. Role-Based Access Control (RBAC) is then applied to the Streaming Catalog to Govern user access to specific data assets.

Overview of Data Governance in DeltaStream’s platform (shown in the middle, highlighted in light purple)

Streaming Catalog

The Streaming Catalog organizes your data assets in the platform. DeltaStream can connect to any of your streaming or relational data stores, including your Kafka clusters, Kinesis, PostgreSQL databases, and others. This provides a global view of all your data sources. DeltaStream then allows users to categorize the data into hierarchical namespaces to isolate the data that users and teams need access to. DeltaStream allows for an unlimited number of namespaces. For an organization, this creates a central data platform for stream processing. 

In DeltaStream, Stores define the connectivity to data storage systems such as Kafka, RedPanda, Kinesis, and PostgreSQL. Users can then define Relations to represent data entities in the Store – in the case of Kafka a data entity is a topic, in the case of PostgreSQL a data entity is a table. Similarly to other data catalogs, when these Relations are created, they are added to a specific Database and Schema that the user can specify. The Streaming Catalog in an Organization can contain any number of Databases, and any number of Schemas within those Databases. The Database and Schema provide two namespacing levels for users to organize their Relations. Data from different storage systems can exist within the same Database and Schema in DeltaStream, so the logical organization of data in the Streaming Catalog is decoupled from the data’s physical storage systems. This way, teams can organize their data to align with the context in which that data is being used.
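
As a concrete illustration of this decoupling, two Relations backed by topics in different Kafka clusters can live side by side in the same Database and Schema. A minimal sketch, where the Store, Database, Schema, topic, and column names are all hypothetical:

  -- both Relations live in payments.prod, even though their topics sit in different Kafka clusters
  CREATE STREAM payments.prod.card_authorizations
    (auth_id VARCHAR, amount BIGINT, card_type VARCHAR)
    WITH ('store'='us_east_kafka', 'topic'='card_authorizations', 'value.format'='json');

  CREATE STREAM payments.prod.chargebacks
    (auth_id VARCHAR, reason VARCHAR)
    WITH ('store'='us_west_kafka', 'topic'='chargebacks', 'value.format'='json');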

Role-Based Access Control (RBAC)

RBAC within DeltaStream is used to manage user access to the Streaming Catalog. RBAC also applies to securable objects within DeltaStream, including Databases, Schemas, Relations, Queries, UDFs, Stores, and others. Within the Streaming Catalog, Databases, Schemas, and Relations can be global, or can be scoped to specific individuals, teams, or other units of organization. This enables the right level of privileged access to the underlying data. Using RBAC, users can easily define policies to secure and share their data resources. We chose RBAC as our access control strategy because it strikes a balance between usability and scalability. Roles are set up to clearly define permissions and ownership for data assets, and users are granted one or several roles. The roles granted to users make it clear which permissions each user has, and as new data assets are added to the Organization, they can be granted to an existing role or to a new one.

RBAC addresses a major pain point in accessing streaming storage systems such as Apache Kafka and Apache Pulsar. As mentioned earlier, these systems use access control lists (ACLs), and although ACLs are straightforward to use at first, they quickly become unmanageable as more users and data objects are added to the platform, because access control must be specified for each user and topic. RBAC directly addresses these shortcomings of ACLs by assigning user roles that authorize access to specific namespaces. As new data objects are added to a namespace, the roles with access to that namespace are automatically granted access to those objects.

The Streaming Catalog combined with RBAC enables a single, powerful platform for all your teams: a central data platform for stream processing that improves the sharing of data products, data visibility, data security, data compliance, and stream quality.

Wrapping up

In the first half of this post, we discussed what Data Governance is, why it is important, and how Data Governance over unified data solves many of the pain points of governing streaming data today. In the second half of this post, we discussed DeltaStream’s approach towards Data Unification and Data Governance of streaming data, specifically highlighting the Streaming Catalog and RBAC features which power these concepts.

In the following weeks, expect more content from us showing off these concepts in action. Meanwhile, if you want to learn more about DeltaStream, reach out to us to schedule a demo or start your free trial.

09 Jan 2024

Bundle Queries with Applications by DeltaStream

Stream processing has turned into an essential part of modern data management solutions. It provides real-time insights which enable organizations to make informed decisions in a timely manner. Stream processing workloads are often complex to write and expensive to run. This is due to the high volumes of data that are constantly flowing into these workloads and the need for the results to be produced with minimal delay.

In the data streaming world, it’s common to think about your stream processing workload as pipelines. For instance, you may have a stream processing job ingest from one data stream, process the data, then write the results to another data stream. Then, another query will ingest the results from the first query, do further processing, and write another set of results to a third stream. Depending on the use case, this pattern of reading, processing, and writing continues until eventually you end up with the desired set of data. However, these intermediate streams may not be needed for anything other than being ingested by the next query in the pipeline. Reading, storing, and writing these intermediate results costs money in the form of network I/O and storage. For SQL-based stream processing platforms, one solution is to write nested queries or queries containing common table expressions (CTEs), but for multi-step pipelines, queries written in this way can become overly complex and hard to reason through. Furthermore, it may not even be possible to use nested queries or CTEs to represent some use cases, in which case materializing the results and running multiple stream processing jobs is necessary.

To make stream processing simpler, more efficient, and more cost effective, we wanted to have the capability of combining multiple statements together. We did this by creating a new feature called Applications, which has the following benefits:

  • Simplify the workload: Users can simplify a complex computation logic by dividing it into several statements without additional costs. This helps break down a complicated workload into multiple steps to improve readability of the computation logic and reusability of results. Smaller, distinct statements will also facilitate debugging by isolating the processing steps.
  • Reduce the load on streaming Stores: An Application in DeltaStream optimizes I/O operations on streaming Stores in two ways. First, the Application will only read from a unique source Relation’s topic once. This reduces the read operations overhead when multiple queries in the Application consume records from the same Relation. Second, users can eliminate the reads/writes from intermediate queries by specifying “Virtual Relations” in the Application. “Virtual Streams” and “Virtual Changelogs” are similar to regular Streams and Changelogs in DeltaStream, but they are not backed by any physical streaming Store. Instead, Virtual Relations are for intermediate results and other statements in the Application are free to read/write to them.
  • Reduce overall execution cost and latency: All statements in an Application run within a single runtime job. This not only reduces the overall execution cost by minimizing the total number of jobs needed for a workload, but also enhances resource utilization. Packing several statements together facilitates efficient resource sharing and lowers scheduling overhead for the shared resources. Additionally, the optimized I/O operations on streaming Stores (as previously mentioned) along with less network traffic from/to those Stores contribute to the overall cost and latency reduction.

Simplifying Data Workloads with Applications

Let’s go over an example to show how Applications can help users write a workload in a simpler and more efficient manner.

Assume we are managing an online retail store and we are interested in extracting insights on how users visit different pages on our website. There are two Kafka Stores/clusters, one in the US East region and one in the US West to store page views in each region. Registered users’ information is also stored separately in the US West Store. We have the following Relations defined on topics from these Stores:

  • “pageviews_east” and “pageviews_west” are two Streams defined on the topics in the US East and US West stores, respectively
  • “users_log” is a Changelog defined on the users’ information topic in the US West Store, using the “userid” column as the primary key

You can find more details about Stores, Streams and Changelogs in DeltaStream and how to create them here.
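
For reference, DDL for these source Relations might look roughly like the following. This is a sketch only: the Store names, topic names, column lists, and the exact Changelog primary key syntax are assumptions for illustration.

  -- hypothetical DDL for the three source Relations
  CREATE STREAM pageviews_east
    (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
    WITH ('store'='us_east_kafka', 'topic'='pageviews', 'value.format'='json');

  CREATE STREAM pageviews_west
    (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
    WITH ('store'='us_west_kafka', 'topic'='pageviews', 'value.format'='json');

  CREATE CHANGELOG users_log
    (userid VARCHAR, gender VARCHAR, contactinfo STRUCT<`state` VARCHAR>, PRIMARY KEY(userid))
    WITH ('store'='us_west_kafka', 'topic'='users', 'value.format'='json');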

Our online advertisement team is curious to find out which product pages are popular as users browse the website. A page is popular if it is visited by at least 3 different female users from California in a short duration. Using the three Relations we defined above, we’ll introduce a solution to find popular pages without using an Application, then compare that with an approach that uses an Application.

“No Application” Solution

One way to find popular pages is by writing 4 separate queries: 

  • Query 1 & Query 2: Combine pageview records from the “pageviews_west” and “pageviews_east” Streams into a single relation. The resulting Stream is called “combined_pageviews”.
  • Query 3: Join “combined_pageviews” records with records from “users_log” to enrich each pageviews record with its user’s latest information. The resulting Stream is called “enriched_pageviews”.
  • Query 4: Group records in “enriched_pageviews” by their “pageid” column, aggregate their views, and find those pages that meet our popular page criteria.    

Figure 1 shows how the data flows between the different Relations (shown as rounded boxes) and the queries (shown as colored boxes). Each query results in a separate runtime job and requires its own dedicated resources to run. The dashed arrows between the Relations and the Stores indicate read and write operations against the Kafka topics backing each of the Relations. Moreover, given that each Relation is backed by a topic, all records are written into a persistent Store, including records in the “combined_pageviews” and “enriched_pageviews” Streams.
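
For comparison, the first two queries of this approach might look like the following sketch, where “combined_pageviews” is a regular Stream backed by its own Kafka topic (the Store and topic properties here are assumptions, mirroring the Application example below):

  -- Query 1: create a physically-backed Stream holding the US East pageviews
  CREATE STREAM combined_pageviews WITH
    ('store'='us_west_kafka', 'topic.partitions'=1, 'topic.replicas'=3) AS
  SELECT * FROM pageviews_east;

  -- Query 2: a second continuous query writing the US West pageviews into the same Stream
  INSERT INTO combined_pageviews
  SELECT * FROM pageviews_west;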

Solution with Application

Ideally, we are interested in reducing the cost of running our workload without modifying its computation logic. In the “No Application” solution above, while the records for “combined_pageviews” and “enriched_pageviews” are persisted, we don’t really need them outside the Application context. They are intermediate results, only computed to prepare the data for finding popular pages. “Virtual Relations” can help us skip materializing intermediate query results.

We can write an Application with Virtual Relations to find popular pages, as shown below. This Application creates a single long-running job and helps reduce costs in two ways:

  1. By using Virtual Relations, we can avoid the extra cost of writing intermediate results into a Store and reading them again. This will reduce the network traffic and the read/write load on our streaming Stores.
  2. By packing several queries into a single runtime job, we can use our available resources more efficiently. This will reduce the number of streaming jobs we end up creating in order to run our computation logic.

Here is the Application code for our example:

  BEGIN APPLICATION popular_pages_app

  -- statement 1
  CREATE VIRTUAL STREAM virtual.public.combined_pageviews AS
  SELECT *
  FROM pageviews_east;

  -- statement 2
  INSERT INTO virtual.public.combined_pageviews
  SELECT *
  FROM pageviews_west;

  -- statement 3
  CREATE VIRTUAL STREAM virtual.public.enriched_pageviews AS
  SELECT v.userid, v.pageid,
         u.gender, u.contactinfo->`state` AS user_state
  FROM virtual.public.combined_pageviews v JOIN users_log u
    ON u.userid = v.userid;

  -- statement 4
  CREATE CHANGELOG popular_pages WITH ('store'='us_west_kafka', 'topic.partitions'=1, 'topic.replicas'=3) AS
  SELECT pageid, count(DISTINCT userid) AS cnt
  FROM virtual.public.enriched_pageviews
  WHERE (UNIX_TIMESTAMP() - (rowtime()/1000) < 30) AND
        gender = 'FEMALE' AND user_state = 'CA'
  GROUP BY pageid
  HAVING count(DISTINCT userid) > 2;

  END APPLICATION;

There are 4 statements in this Application. They look similar to the 4 queries used in the “no Application” solution above. However, the “combined_pageviews” and “enriched_pageviews” Streams are defined as Virtual Relations in the Application.

Figure 2 illustrates how the data flows between the different statements and Relations in the Application. Compared to Figure 1, note that the “combined_pageviews” and “enriched_pageviews” Virtual Streams (shown in white boxes) do not have dashed arrows leading to a Kafka topic in the storage layer. This is because Virtual Relations are not backed by physical storage, and thus reads and writes to the Store for these Virtual Relations are eliminated, reducing I/O and storage costs. In addition, the 4 queries in Figure 1 generate 4 separate streaming jobs, whereas all processing happens within a single runtime job in the solution using an Application.

Platform for Efficient Stream Processing

In this blog, we compared two solutions for a stream processing use case, one with an Application and one without. We showed how Applications and Virtual Relations can be used to run workloads more efficiently, resulting in reduced costs. The SQL syntax for Applications helps users simplify their complex computation logic by breaking it into several statements, and allows reusing intermediate results at no additional cost. Stay tuned for more content on Applications in the future, where we’ll dive more deeply into the individual benefits of running Applications and include more detailed examples.

DeltaStream is easy to use, easy to operate, and scales automatically. If you are ready to try a modern stream processing solution, you can reach out to our team to schedule a demo or start your free trial.

19 Dec 2023

Securing Real-Time Streaming Data: DeltaStream’s Approach

The introduction of the General Data Protection Regulation (GDPR) in 2018, followed by other laws such as the California Privacy Rights Act (CPRA), has made it a requirement that data security and data privacy are handled properly for businesses and consumers. While data security in the industry has improved since the passing of these laws, “data leakage” events are still not uncommon. The fallout from such events can have devastating impacts for both the company that had the data leakage and for any customers or partners associated with that company.

Since the beginning of DeltaStream, data security has been one of its foundational pillars. In order to build a data platform that users can trust in production, we knew that we had to design our system with security at the forefront of our minds.

In this blog post, we discuss how DeltaStream keeps user data safe and how DeltaStream provides tools to help users safely share their data.

How DeltaStream Keeps your Data Safe: Zero Trust

DeltaStream is highly secure, and private connections can be set up between DeltaStream and other networks. However, if a security breach were to occur, the exposure from that breach should be minimal. In other words, a zero trust strategy should be taken, where security verification exists at multiple levels and security risks are actively avoided. Here are three ways DeltaStream’s design aims to keep your data safe.

Overview of DeltaStream’s architecture

Queries for Data Processing are Run in Isolated Environments

In DeltaStream, users write SQL statements to define long running queries to process their streaming data. Behind the scenes, these long running queries are powered by Apache Flink. On its own, the Flink framework does not have a security model, so it’s DeltaStream’s responsibility to ensure that Flink runs in a secure manner.

One way this is done is by securing the network for the query’s runtime environment. Outside network calls into the runtime environment are not allowed and only necessary network calls from the runtime environment to the outside are allowed, such as connecting to a specific Kafka broker if the query requires reading or writing to Kafka.

Another way we keep a query’s runtime secure is by dedicating separate environments for each query. This way, if an attacker is able to gain access to a query’s runtime environment, then the attacker will only have access to a single query’s environment and cannot affect other queries that may be running. This also means that queries do not compete for resources and a faulty query cannot adversely affect another query.

Only Required Data is Held and Encrypted by DeltaStream

The best way to keep users’ data safe is to not store it in the first place. Of course there is some amount of data that DeltaStream must store in order to support its feature set, but any data that is not absolutely required is not stored.

Let’s consider the common use-case query that does the following:

  1. Read data from a Kafka topic
  2. Mutate the data (according to the SQL query)
  3. Write the results to sink storage

In this example, DeltaStream connects to the source Kafka topic and reads the source records into memory. Mutations for that data are performed in-memory, then the data is written to the sink destination (another Kafka topic, S3 in Delta format for Databricks, Snowflake, etc). User data is kept entirely in memory and at no point in this scenario does DeltaStream persist the user data to disk.
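
A query of this shape is just an ordinary continuous DeltaStream query. A minimal sketch of the kind of statement described above, with hypothetical relation and column names:

  -- read from a Kafka-backed source, transform in memory, write the results to a sink topic
  CREATE STREAM purchases_eu AS
  SELECT orderid, amount, country
  FROM purchases
  WHERE country = 'DE';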

There are three caveats where some form of user data is written to disk, and in each of these cases, the at-rest data is stored in encrypted storage.

  • Stateful queries, such as queries that perform aggregations, have their operational state occasionally snapshotted and stored for fault-tolerance purposes
  • Data given by the user to add connectivity to their storage layer, such as Kafka
  • Queries sinking to a Materialized View write data to an encrypted data store where the view is created

Finally, any data that is in transit is encrypted with TLS to ensure that there is end-to-end data security.

BYOC and Dedicated Data Plane Deployment Options

The DeltaStream platform is implemented with a control plane and a data plane. At a high level, the control plane decides how data is managed and processed, and the data plane is where the actual management and processing of data occurs. In DeltaStream, data never leaves the data plane. The only communication between the control plane and data plane includes the data plane pulling instructions from the control plane, and the data plane pushing metrics and status updates to the control plane.

A user can choose between three different data plane deployment models: Public, Dedicated, and Bring Your Own Cloud (BYOC).

  • The Public data plane shares network and other resources with other DeltaStream customers.
  • A Dedicated data plane can be set up for users who want their own isolated network and resources in a cloud managed by DeltaStream.
  • The BYOC deployment is for users who require network and resources to be in their own cloud account so that data never leaves their cloud account.

In all three options, DeltaStream manages the network and resources so that users still get a serverless DeltaStream experience; the difference is which VPC the data plane runs in.

Although most users may be satisfied with the Public deployment option, at DeltaStream we recognize that security requirements differ for different customers in different industries. That’s why we’ve chosen to provide BYOC and Dedicated as options to allow users to choose which model best suits their security needs. Read more about BYOC (also called Private SaaS).

Federated Data Governance and DeltaStream

Federated data governance is the union of data federation and data governance. Why both? Data federation is the concept of viewing multiple data sources in a unified view. Data governance is the idea of managing and ensuring data availability, security, and integrity. Only with both data federation and data governance together will users have things like access control over a unified view of all their data sources. This means there is a single place to define access control policies for all your data leading to less management overhead and fewer mistakes when defining policies.

For existing streaming storage like Kafka, the data is structured into a flat namespace. Access control is typically managed by Access Control Lists (ACLs), which are created for each user and resource (i.e. a single user has access to a single topic). Managing ACLs is cumbersome: if you need to give 3 users access to 50 topics, you have to create 150 ACLs. This time-consuming process is prone to error, especially as the storage layer grows larger, and it poses a security risk; for example, ACLs can easily be misconfigured and give a user more permissions than they ought to be granted. Further, managing consistent access control across storage systems, such as multiple Kafka clusters or multiple streaming platforms, becomes extremely difficult.

To address the deficiencies of a flat namespace, DeltaStream brings a hierarchical namespace model such that data resources exist within a “Database” and “Schema”. In this model, data from different storage systems can exist within the same database and schema in DeltaStream, so the logical organization of data is decoupled from the data’s physical storage systems. DeltaStream then provides Role Based Access Control (RBAC) on top of this relational organization of data, where roles can define access policies for particular resources and users can inherit one or many roles.

Diagram of the unified view that DeltaStream offers users over their streaming resources; RBAC can be applied to Databases, Schemas, and Relations

By simplifying and standardizing access control with namespacing and RBAC, DeltaStream empowers users to implement granular and sophisticated access control policies across multiple data sets. This means within your organization, data can be shared easily on a need-to-know basis.

Wrapping up

Designing data systems with security in mind is essential. We have covered at a high level how DeltaStream secures user data and provides federated data governance to help users secure data within their organization. Announcing our SOC II compliance a few months ago was just the first step towards proving our commitment to data security. If you want to learn more about DeltaStream’s approach to security or want to try DeltaStream for yourself, reach out to us to schedule a demo.

19 Oct 2023

Capturing Data Changes in Real-Time

Change Data Capture (CDC) is a data integration technique that identifies changes in a data source in real time and captures them in downstream systems such as databases or data lakes, ensuring consumers always have access to the most up-to-date data. Changes in the source data are continuously recorded as they occur and are then made available for different purposes such as replication, synchronization, migration, or transformation to other systems. CDC is well-suited for relational databases, which are often a critical part of enterprise systems. CDC captures data changes at the operation level and provides fine-grained control over them, which is useful for tracking historical data and handling complex data synchronization scenarios. Moreover, CDC often maintains the transactional context of the source, ensuring that changes are delivered in a consistent order that reflects the order of transactions in the source system. CDC pipelines are designed for high reliability and data consistency; they include built-in mechanisms for error handling and retries for failed operations, and they ensure data integrity.

What does a Change Data Capture pipeline look like?

A CDC pipeline typically consists of multiple components and stages designed to efficiently transfer data changes from a source system to a target system. The “source system” represents the original data where changes must be identified and captured. Within the pipeline, a “CDC capture agent” plays a pivotal role by continuously monitoring data changes in the source, capturing inserts, updates or deletions. Every change event is recorded and subsequently, in accordance with the chosen CDC delivery method, this ‘change log’ is published to the ‘target system’ where the captured changes are stored in a format optimized for subsequent processing.

Change Data Capture pipelines in DeltaStream

Just like you can use DeltaStream to process streaming data from sources like Kafka, you can now use DeltaStream to efficiently create pipelines to propagate CDC logs from PostgreSQL to a sink. Such a pipeline tracks data changes (INSERT, DELETE, UPDATE), at the row level in a table defined in a PostgreSQL database and pushes them into a downstream sink such as a Kafka topic or downstream analytics database such as Databricks.

Here, we cover an example to show how you can define and run such a pipeline. Assume you have a table named pageviews in a PostgreSQL database called visits. Each row in pageviews has 3 columns: viewtime, userid, and pageid, and shows when a specific user visited a given page.

  visits=# SELECT * FROM pageviews;

   viewtime   | userid | pageid
  ------------+--------+--------
   1623090132 | User_1 | Page_2
   1659896532 | User_3 | Page_1
   1592418132 | User_2 | Page_3

Over time, the rows in pageviews change. For each change, a CDC log is generated in JSON format. Below is an example CDC log for an UPDATE operation. The “op” field captures the operation type, and the “before” and “after” fields capture the column values in the row before and after the update. The “source” field contains some metadata such as the database, schema, and table names. Similar CDC logs are generated per INSERT or DELETE operation. For an INSERT there is no “before” value, and for a DELETE there is no “after”.

  {
    "op": "u",
    "ts_ms": 1623096418,
    "before": {
      "viewtime": 1592418132,
      "userid": "User_1",
      "pageid": "Page_3"
    },
    "after": {
      "viewtime": 1623090132,
      "userid": "User_1",
      "pageid": "Page_2"
    },
    "source": {
      "db": "visits",
      "schema": "public",
      "table": "pageviews"
    }
  }

Source definition

As the first step to define a CDC pipeline in DeltaStream, you need to create a “store” to access the pageviews CDC logs. This can be done via the web UI or by using the CREATE STORE command in the DeltaStream CLI:

  CREATE STORE pgstore
  WITH (
    'type' = POSTGRESQL,
    'uris' = 'postgresql://my.postgresql.uri:5432/visits',
    'postgres.username' = 'username',
    'postgres.password' = 'password');

The next step is defining a “stream” in DeltaStream on top of the pageviews changelog records in pgstore. This stream is used as the CDC source to read CDC logs for data changes in pageviews. The statement below is an example DDL statement to define the CDC source stream for pageviews. This statement includes all the fields in the CDC logs’ JSON records. Depending on the use case, you may add or remove some of the columns.

  CREATE STREAM pageviews_cdc(
    op VARCHAR,
    ts_ms BIGINT,
    `before` STRUCT<viewtime BIGINT, userid VARCHAR, pageid VARCHAR>,
    `after` STRUCT<viewtime BIGINT, userid VARCHAR, pageid VARCHAR>,
    `source` STRUCT<db VARCHAR, `table` VARCHAR, `lsn` BIGINT>)
  WITH (
    'store'='pgstore',
    'value.format'='json',
    'postgresql.db.name'='visits',
    'postgresql.schema.name'='public',
    'postgresql.table.name'='pageviews');

Once the pageviews_cdc stream is created successfully, you can see it among the relations in DeltaStream, and it can be used like any other stream within DSQL queries.
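
For example, you can interactively inspect a few change events before building the rest of the pipeline. A quick sketch:

  -- peek at incoming change events: the operation type plus the before and after row images
  SELECT op, `before`, `after` FROM pageviews_cdc;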

Change Data Capture pipeline definition

Now that the CDC source is defined, you need to define the DSQL query that creates the CDC sink and starts the query to propagate the source data changes into the sink. Assume you have already added a Kafka store in DeltaStream and it is marked as your current store. You can find the details on how you can define and use such a store in this previous blog post. The CREATE STREAM AS SELECT statement (CSAS) can be used to create a new stream backed by a Kafka topic, as the CDC sink, in this store and start running the CDC pipeline. Here is an example of such a CSAS statement:

  CREATE STREAM cdc_sink WITH (
    'topic.partitions' = 1,
    'topic.replicas' = 1) AS
  SELECT * FROM pageviews_cdc;

The above statement creates a new topic, named cdc_sink, in the current Kafka store and starts writing the CDC logs for the changes in pageviews_cdc into it. If you want to use a different name for the sink topic or you want to use an already existing topic, you can add that to the WITH clause using the topic property.
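
For example, to write into an existing topic (the topic name pageviews_changes here is hypothetical), the same query could name it explicitly; everything else behaves as described above:

  -- reuse an existing topic for the sink by naming it with the 'topic' property
  CREATE STREAM cdc_sink WITH (
    'topic' = 'pageviews_changes') AS
  SELECT * FROM pageviews_cdc;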

The query in the CSAS specifies the structure of the records written into the sink. In the above example, the query selects all columns from the CDC source, so records in cdc_sink will have the exact same structure as pageviews_cdc records. You can check the CDC logs as they are written to the sink using the PRINT TOPIC command on cdc_sink:

  db1.public/kafka_store# PRINT TOPIC cdc_sink;
  {"op":"c","ts_ms":1693430399726,"before":null,"after":{"viewtime":1693430399292,"userid":"User_1","pageid":"Page_94"},"source":{"db":"visits","schema":"public","table":"pageviews","lsn":38990016}}
  {"op":"u","ts_ms":1693430400749,"before":{"viewtime":1693430390134,"userid":"User_1","pageid":"Page_94"},"after":{"viewtime":1693430399292,"userid":"User_1","pageid":"Page_81"},"source":{"db":"visits","schema":"public","table":"pageviews","lsn":38990384}}
  {"op":"c","ts_ms":1693430401772,"before":null,"after":{"viewtime":1693430401306,"userid":"User_2","pageid":"Page_83"},"source":{"db":"visits","schema":"public","table":"pageviews","lsn":38990528}}
  {"op":"d","ts_ms":1693430402792,"before":{"viewtime":1693430399112,"userid":"User_8","pageid":"Page_14"},"after":null,"source":{"db":"visits","schema":"public","table":"pageviews","lsn":38990720}}

If you are not interested in adding all the fields in the CDC logs to the sink, you can modify the query definition in the CSAS to match your use case. For example, if you are only interested in CDC logs for INSERT operations and you want to capture just the new record along with the lsn of the change from PostgreSQL, you can write the CSAS statement as below:

  CREATE STREAM cdc_sink WITH (
    'topic.partitions' = 1,
    'topic.replicas' = 1) AS
  SELECT
    `after`,
    `source`->`lsn` AS `lsn`
  FROM pageviews_cdc
  WHERE op = 'c';

Given that both pageviews_cdc and cdc_sink are defined as streams in DeltaStream, you can simply use them in any DSQL query with other relations to filter, aggregate and join their records and create more complex pipelines.
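
For instance, a simple follow-on aggregation over the CDC source could count change events by operation type; a sketch using a CREATE CHANGELOG … AS SELECT statement:

  -- count how many inserts, updates, and deletes have been captured, per operation type
  CREATE CHANGELOG pageviews_change_counts AS
  SELECT op, COUNT(*) AS change_count
  FROM pageviews_cdc
  GROUP BY op;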

Get Started with DeltaStream and Change Data Capture

CDC is a vital component in modern data management and integration solutions. It is especially valuable in scenarios where maintaining data consistency and keeping downstream systems synchronized with changes in the source data is critical.

In this blog post, we showed how a CDC pipeline can be created in DeltaStream to capture changes in a PostgreSQL data source and propagate them into a Kafka sink in real-time.

DeltaStream provides a comprehensive stream processing platform and can be used to create, manage, and secure complex CDC pipelines. It is easy to operate and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking our blog series. You can reach out to our team to schedule a demo and start using the system.

16 Oct 2023

Always Fresh: Snowflake and DeltaStream Integrate

Snowflake is a popular data warehouse destination for streaming data. It allows users to process large amounts of data for a variety of purposes, such as dashboards, machine learning, and applications. However, to keep these use cases up to date, users need to keep the data fresh. DeltaStream’s Snowflake integration makes it easy to keep your data fresh in Snowflake with a simple SQL query.

How can I use Snowflake with DeltaStream?

Our new Snowflake integration provides a way to write pre-processed data into Snowflake without the need to store the data in intermediary storage:

Using DeltaStream queries, data from one or more sources can be transformed into the shape and form that best suits your Snowflake tables; DeltaStream then uses your Snowflake Store to continuously write the resulting data into the target Snowflake table.

DeltaStream’s new Snowflake Store uses the Snowpipe Streaming APIs to provide the best streaming experience to your Snowflake account without any data loss. Stay tuned for more information on this in the upcoming blog posts.

How DeltaStream integrates with Snowflake

Creating a Snowflake Store

In DeltaStream, a Snowflake Store can be created for connectivity with a specific Snowflake Account. Use the SNOWFLAKE store type to create a new Snowflake Store:

  CREATE STORE sf WITH (
    'type' = SNOWFLAKE,
    'uris' = 'https://<account_identifier>.snowflakecomputing.com',
    'access_region' = 'AWS us-east-1',
    'snowflake.account_id' = '<account_identifier>',
    'snowflake.username' = '<account_username>',
    'snowflake.role_name' = 'ACCOUNTADMIN',
    'snowflake.client.key_file' = '/path/to/key.p8',
    'snowflake.client.key_passphrase' = '<key_passphrase_if_any>',
    'snowflake.warehouse_name' = '<warehouse_name>'
  );

Just like any other Store creation in DeltaStream, properties.file can also be used to offload the extra store information to a properties file. See Store Parameters for more information.
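
For instance, some of the parameters above could be moved out of the statement. This is a sketch only, assuming the remaining snowflake.* parameters are supplied through the file referenced by properties.file (see Store Parameters for the exact format):

  -- keep credentials out of the statement by pointing at a properties file
  CREATE STORE sf WITH (
    'type' = SNOWFLAKE,
    'uris' = 'https://<account_identifier>.snowflakecomputing.com',
    'properties.file' = '/path/to/snowflake_store.properties'
  );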

Once the Snowflake Store is created, you get access to any CRUD operation on databases, schemas, and tables on the specified Snowflake account without leaving your DeltaStream account.

For example, you can list available Snowflake databases:

  main_db.public/sf# LIST ENTITIES;
  Entity name
  -------------------------
  DELTA_STREAMING
  SNOWFLAKE
  SNOWFLAKE_SAMPLE_DATA

Or operate on the Snowflake schemas and tables:

  main_db.public/sf# LIST ENTITIES IN "DELTA_STREAMING";
  Entity name
  ----------------------
  INFORMATION_SCHEMA
  PUBLIC

  main_db.public/sf# CREATE ENTITY "DELTA_STREAMING"."MY_SCHEMA";
  main_db.public/sf# LIST ENTITIES IN "DELTA_STREAMING";
  Entity name
  ----------------------
  INFORMATION_SCHEMA
  MY_SCHEMA
  PUBLIC

  main_db.public/sf# LIST ENTITIES IN "DELTA_STREAMING"."PUBLIC";
  Entity name
  -------------------
  STREAM_DATA_TBL
  sf_pv_table

The Snowflake entities can also be described for additional information:

  main_db.public/sf# DESCRIBE ENTITY "DELTA_STREAMING"."PUBLIC".sf_pv_table;
  Type | Name | Created at | Last DDL At | Last DDL By | Retention time | Updated At | Comment
  --------+----------------------------------------+----------------------+----------------------+-------------------------+----------------+----------------------+----------
  Table | "DELTA_STREAMING"."PUBLIC".sf_pv_table | 2023-09-05T22:26:24Z | 2023-09-05T22:26:24Z | SNOWPIPE_STREAMING_USER | 1 | 2023-09-05T22:26:25Z |

  Columns:
  Name | Type | Nullable | Policy name | Primary key | Unique key | Comment
  -----------+-------------------+----------+-------------+-------------+------------+----------
  VIEWTIME | NUMBER(38,0) | ✓ | | | |
  USERID | VARCHAR(16777216) | ✓ | | | |
  PAGEID | VARCHAR(16777216) | ✓ | | | |

Materializing Streaming Data with Snowflake

Once your Snowflake Store has been created, you are ready to write queries that’ll enable you to apply any necessary computation to an upstream stream of data and continuously write the result of that computation to a corresponding Snowflake table.

Let’s take a look at a query where we count the number of transactions per type of credit card:

  CREATE TABLE "CC_TYPE_USAGE" WITH (
    'snowflake.db.name' = 'DELTA_STREAMING',
    'snowflake.schema.name' = 'PUBLIC'
  ) AS SELECT
    cc_type AS "CREDIT_TYPE",
    COUNT(*) AS "TOTAL_SO_FAR"
  FROM transactions
  GROUP BY cc_type;

In this example, we’re using a CREATE TABLE AS SELECT (CTAS) query to let the system know that we want to create a Table in Snowflake under the “DELTA_STREAMING”.”PUBLIC” namespace with the name CC_TYPE_USAGE. Once the CTAS query is created, we can describe the new Snowflake table of the same name:

  main_db.public/sf# DESCRIBE ENTITY "DELTA_STREAMING"."PUBLIC"."CC_TYPE_USAGE";
  Type | Name | Created at | Last DDL At | Last DDL By | Retention time | Updated At | Comment
  --------+--------------------------------------------+----------------------+----------------------+----------------+----------------+----------------------+----------
  Table | "DELTA_STREAMING"."PUBLIC"."CC_TYPE_USAGE" | 2023-09-06T22:23:18Z | 2023-09-06T22:23:18Z | SFACCOUNTOWNER | 1 | 2023-09-06T22:23:18Z |

  Columns:
  Name | Type | Nullable | Policy name | Primary key | Unique key | Comment
  ---------------+-------------------+----------+-------------+-------------+------------+----------
  CREDIT_TYPE | VARCHAR(16777216) | ✓ | | | |
  TOTAL_SO_FAR | NUMBER(38,0) | ✓ | | | |

By creating the above CTAS query, we have started a process that continuously counts the number of transactions per credit card type in our product. The query appends a new row to the CC_TYPE_USAGE table for every updated count for a specific cc_type, so we can use the latest count for each type to perform the required analysis for the business.

The result set can be previewed in the CLI:

  main_db.public/sf# PRINT ENTITY "DELTA_STREAMING"."PUBLIC"."CC_TYPE_USAGE";
  CREDIT_TYPE | TOTAL_SO_FAR
  --------------+---------------
  VISA | 1
  VISA | 2
  DISC | 1
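
Because the running totals for a given type only grow, the latest value per credit card type can be read back in Snowflake with a simple aggregate. A sketch you would run in your Snowflake warehouse:

  -- run in Snowflake: the most recent running total per credit card type
  SELECT "CREDIT_TYPE", MAX("TOTAL_SO_FAR") AS "LATEST_TOTAL"
  FROM "DELTA_STREAMING"."PUBLIC"."CC_TYPE_USAGE"
  GROUP BY "CREDIT_TYPE";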

Modern Streaming and Analytics

At DeltaStream we believe in providing a modern and unified stream processing solution that can unlock the full potential of your data, leading to improved operational efficiency, data security, and ultimately a competitive advantage for every organization when developing products. We hope that with the new Snowflake integration we open yet another door for providing data unity across your organization.

If you have any questions or want to try out this integration, reach out to us or request a free trial!

10 Oct 2023

Seamless Data Flow: Integrating DeltaStream and Databricks

DeltaStream is powered by Apache Flink and processes streaming data from sources such as Kafka and Kinesis. The processed data can be used in downstream applications, materialized in a view, or written to data lakes and other analytical systems. Some common use cases of DeltaStream are latency sensitive applications where processed streaming data is needed in real time and data preparation use cases where users want to use a subset of the data contained within a stream.

In this blog post, we are introducing Databricks as our latest integration with the DeltaStream platform. Now users can process data from their streaming data sources and write the results directly to Delta Lake.

Users are familiar with Apache Spark as a great batch solution for processing massive amounts of data. Real-time stream processing, however, is latency sensitive, which makes a real-time processing engine a better fit. In these cases, DeltaStream can be leveraged to continuously process streaming data and write to the Delta Lake, so that Delta Tables are always up to date and data is never late.

For applications such as those that alert on fraudulent activities, having both streaming and batch processing capabilities is useful. On the streaming side, DeltaStream can process data events in real time and generate alerts for fraudulent events, for example when there are 3 or more withdrawals from the same bank account within 10 minutes. With our latest integration, these fraudulent events generated by DeltaStream can then be written into the Delta Lake and made available to Databricks for any further batch processing or historical analysis. For example, a data scientist may want to observe how many fraudulent events occurred in the past month, and with these alerts already available in their Delta Lake, they can easily do this analysis in the Databricks ecosystem.

Where does Databricks fit in with DeltaStream?

Databricks is a unified data and analytics platform. Databricks users can process, store, clean, share, and analyze their datasets to solve use cases ranging from business insights to machine learning. It has become an increasingly popular solution for data warehousing, but it presents challenges with streaming data. Users with streaming data must first manage a connection to load that data into Databricks, then run batch ETL jobs to process and refine it. For latency sensitive use cases there is a better way. With the new integration with Databricks, DeltaStream now sits in between users’ streaming data and Databricks. In the DeltaStream platform, users can transform and prepare their streaming data and write the results into Databricks with a single SQL query, meaning that any real-time transformations that need to take place on the streaming data can happen before the data is loaded into Databricks.

How DeltaStream integrates with Delta Lake

Delta Lake is the storage layer that powers Databricks’ Lakehouse platform. To integrate with Databricks, DeltaStream needs to be able to write to Delta Lakes in the Delta Table format. Behind the scenes, DeltaStream uses Apache Flink as its compute engine. One of the advantages of using Apache Flink is the rich connector ecosystem that comes along with it, including a connector for Delta Lake. Using the Delta sink connector, we were able to easily integrate our platform with Databricks.

The figure above displays the architecture at a high level. The user, interacting with DeltaStream’s CLI or Web UI, can issue a CREATE TABLE AS SELECT SQL query which the DeltaStream platform receives. DeltaStream will then perform validations and create a Databricks Table in the user’s Databricks workspace, using an associated S3 Delta Lake as the external location for the Table. A continuous DeltaStream query is then launched to perform the SQL computations. This continuous query will read from one of DeltaStream’s accepted sources, such as a Kafka topic, do any necessary transformations, and produce the results in Delta format to the AWS S3 account that is associated with the user’s Databricks workspace. At this point, the continuous query is constantly updating the user’s Delta Lake as records are arriving in the source, keeping the user’s Databricks Table up to date with the latest records. Users can also issue commands in DeltaStream to interact with Databricks such as listing, creating, or dropping Databricks Catalogs, Schemas and Tables. Users can also print their Databricks Table to view the rows.

Use Case: Loading data into Databricks

Let’s walk through an example use case to show off how easy it is to work with Databricks in DeltaStream. Suppose you have a Kafka topic which contains logging data from your stage environments and you want to load the data into Databricks. In this example, you’ll see how you can have an end-to-end solution to solve this use-case using DeltaStream in minutes.

Adding Databricks as a Store

Like Kafka and Kinesis, Databricks would be considered a Store in DeltaStream. The first step in starting the integration is to create the Databricks Store in your DeltaStream organization. The store can be created with the following SQL statement:

  CREATE STORE databricks_store WITH (
    'type' = databricks,
    'access_region' = 'AWS us-east-1',
    'uris' = 'https://dbc-abcd1234-5678.cloud.databricks.com',
    'databricks.app_token' = 'dapifakeapptoken12345678',
    'databricks.warehouse_id' = '12345678abcdefgh',
    'aws.access_key_id' = 'FAKE_ACCESS_KEY',
    'aws.secret_access_key' = 'FAKE_SECRET_KEY',
    'databricks.cloud.s3.bucket' = 'deltastream-databricks-bucket1',
    'databricks.cloud.region' = 'AWS us-west-2'
  );

After adding the Store, you can interact with the Databricks Unity Catalog to perform commands such as create, describe, list, and drop.

  1. db.public/databricks_store# LIST ENTITIES;
  2. Entity name
  3. ------------------------------------------------
  4. cat1
  5. main
  6. system
  7. db.public/databricks_store# LIST ENTITIES IN cat1;
  8. Entity name
  9. ----------------------
  10. default
  11. information_schema
  12. sch1
  13. sch2
  14. db.public/databricks_store# DROP ENTITY cat1.sch2;
  15. db.public/databricks_store# LIST ENTITIES IN cat1;
  16. Entity name
  17. ----------------------
  18. default
  19. information_schema
  20. sch1
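
You can also create new Entities from DeltaStream. As a minimal sketch (the exact options can vary by store type, sch3 is just a hypothetical schema name, and the precise CREATE ENTITY syntax is documented in the DeltaStream reference):

  1. db.public/databricks_store# CREATE ENTITY cat1.sch3;
  2. db.public/databricks_store# LIST ENTITIES IN cat1;

Listing the catalog again should now include sch3 alongside the existing schemas.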

CTAS Query

After adding your Databricks workspace as a Store in DeltaStream, you can use it to write CREATE TABLE AS SELECT queries (CTAS). These queries create DeltaStream Table Relations by selecting columns from one or multiple source Relations. In this example, let’s assume we already have a Stream of data called “my_logs” which is defined by the following DDL:

  1. CREATE STREAM my_logs (
  2. log_ts BIGINT, "level" VARCHAR, message VARCHAR, env VARCHAR
  3. ) WITH (
  4. 'topic' = 'logging_topic', 'value.format' = 'JSON'
  5. );

In DeltaStream, a CTAS query that keeps only records from the “stage” environment would look like the following:

  1. CREATE TABLE stage_logs WITH (
  2. 'store' = 'databricks_store',
  3. 'databricks.catalog.name' = 'cat1',
  4. 'databricks.schema.name' = 'sch1',
  5. 'databricks.table.name' = 'stage_logs',
  6. 'table.data.file.location' = 's3://deltastream-databricks-bucket1/stage_logs'
  7. ) AS SELECT * FROM my_logs WHERE env='stage';

Validating results

After running this CTAS query, a new Databricks Table will be created with the fully qualified path of cat1.sch1.stage_logs. You can confirm this by describing and printing the Table Entity, which reflects the underlying Databricks Table:

  1. db.public/databricks_store# DESCRIBE ENTITY cat1.sch1.stage_logs;
  2. Type | Name | Created at | Created by | Updated at | Updated by | Storage location | Comment
  3. --------+----------------------+----------------------+---------------------+----------------------+---------------------+------------------------------------------------+----------
  4. Table | cat1.sch1.stage_logs | 2023-09-11T17:55:22Z | [email protected] | 2023-09-11T17:55:22Z | [email protected] | s3://deltastream-databricks-bucket1/stage_logs |
  5.  
  6. Columns:
  7. Position | Name | Type | Nullable
  8. -----------+----------+--------+-----------
  9. 0 | log_ts | bigint | ✓
  10. 1 | level | string | ✓
  11. 2 | message | string | ✓
  12. 3 | env | string | ✓

  1. db.public/databricks_store# PRINT ENTITY cat1.sch1.stage_logs;
  2. log_ts | level | message | env
  3. ---------------------+--------+---------------+------
  4. 1.694537741684e+12 | INFO | Listening on | stage
  5. 1.694537745903e+12 | WARN | Retrying abc | stage
  6. 1.694537746943e+12 | INFO | Starting x | stage
  7. 1.694537747963e+12 | INFO | Listening on | stage
  8. 1.694537749124e+12 | INFO | Connecting | stage
  9. 1.694537750125e+12 | INFO | committing y | stage
  10. 1.694537751163e+12 | ERROR | Commit failed | stage
  11. 1.694537752223e+12 | INFO | Retrying y | stage
  12. 1.694537753263e+12 | INFO | ABC 148 bytes | stage
  13. 1.694537754283e+12 | INFO | Listening on | stage

If you log into your Databricks workspace, you’ll also find the same information there.

The backing storage for the newly created table will be in S3 under s3://deltastream-databricks-bucket1/stage_logs. You can validate that the data is arriving in S3 from your AWS console, or from DeltaStream by issuing the PRINT command above.
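
For a quick spot check from the Databricks side, you can also run a query against the new table in a Databricks SQL editor. A minimal sketch, assuming the catalog, schema, and table names from the example above:

  1. SELECT env, count(*) AS record_count
  2. FROM cat1.sch1.stage_logs
  3. GROUP BY env;

Since the CTAS query filters on env='stage', you should see a single “stage” row whose count keeps growing as the continuous query delivers new records.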

Get Started with DeltaStream and Databricks

Databricks has provided an incredible platform for data engineers, making it simple and intuitive to manage and process data in your data lake or data warehouse. With DeltaStream’s new Databricks integration, we aim to make it easier for DeltaStream users and Databricks users to use both platforms where they shine and to provide a more complete user experience for streaming and non-streaming use-cases. We’re extremely excited to share this new integration. If you have any questions or want to try it out, reach out to us or try DeltaStream’s free trial!

17 Aug 2023

Min Read

Building Upon Apache Flink for Better Stream Processing

Making streaming data easy to manage, secure, and process has been at the heart of DeltaStream’s problem statement since day one. When it came time to choose how we wanted to process streaming data, it was clear that leveraging Apache Flink was the correct choice. In our previous blog post, we covered Why Apache Flink is the Industry Gold Standard. In this post, we’ll reiterate some of the reasons why we at DeltaStream chose Flink as our underlying processing engine, and explain how our platform improves upon Flink to provide a powerful and intuitive way to build stream processing applications for our customers.

Why Choose Flink for DeltaStream?

The use cases for stream processing are boundless (pun intended). From anomaly detection, to gaming, to IoT, to machine learning, there is a vast range of use cases that streaming can solve, and each of these use cases has its own set of requirements. When choosing the right stream processing engine, we want to ensure that (1) we can support a wide range of use cases, (2) the performance of our system is on par with the industry’s cutting edge, and (3) there is an opportunity to contribute back to the community and fix any bugs we may find. With these requirements, Flink is the only viable choice. At a high level, Apache Flink is a distributed real-time data processing engine that is highly scalable, highly efficient, and has seen massive growth in the past 5 years.

Supporting a wide range of use cases

Flink’s set of low-level and high-level APIs gives us flexibility when building features. When we want to expose certain configurations to our end users, we can take advantage of the low-level APIs; on the flip side, Flink’s high-level APIs let us support powerful data transformations that the Flink engine has optimized.

Flink also has a very rich connector ecosystem. For almost every popular data storage system, whether it’s a streaming platform such as Apache Kafka or an at-rest database such as Postgres, there is likely already an open source Flink connector. If a connector is missing, Flink provides APIs that allow developers to write their own. This enables Flink to integrate seamlessly with most data infrastructures.
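
For illustration, here is a generic Flink SQL sketch (not DeltaStream syntax) that declares a Kafka-backed table with the open source Kafka connector and runs a windowed aggregation over it; the table name, topic, and columns are hypothetical:

  1. -- Declare a table backed by a Kafka topic using Flink's Kafka connector
  2. CREATE TABLE page_views (
  3. user_id STRING,
  4. page_url STRING,
  5. view_time TIMESTAMP(3),
  6. WATERMARK FOR view_time AS view_time - INTERVAL '5' SECOND
  7. ) WITH (
  8. 'connector' = 'kafka',
  9. 'topic' = 'page_views',
  10. 'properties.bootstrap.servers' = 'broker:9092',
  11. 'scan.startup.mode' = 'earliest-offset',
  12. 'format' = 'json'
  13. );
  14. -- Count views per page in one-minute tumbling windows via the high-level SQL API
  15. SELECT window_start, window_end, page_url, COUNT(*) AS views
  16. FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(view_time), INTERVAL '1' MINUTE))
  17. GROUP BY window_start, window_end, page_url;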

System performance

Flink is both low latency and highly scalable, which means resources can be right-sized for the workload, saving cost while maintaining performance. Flink’s state-of-the-art checkpointing and savepointing mechanisms also make it fault tolerant and highly available.

Large open source community

Flink is one of the most popular Apache projects, with over 1,600 contributions during 2022. Its active mailing list and numerous committers make it easy to find support for any issues as well as to take part in the community. The DeltaStream team has made a number of contributions to the Flink project to fix bugs and make improvements. Choosing a project that we can both receive support from and contribute back to was a big factor in choosing Apache Flink.

If you want a more in-depth analysis of how Flink works and why we think it’s so great, check out our previous blog post.

Enhancing Flink Beyond the Barriers

While Flink is an incredibly powerful technology, it is not the quick stream processing solution you may be searching for on its own. It can be difficult to operate, can take months to learn, and requires stream processing expertise to design a Flink-based service that actually meets your demands. We at DeltaStream have become experts in Flink and have built a complete stream processing cloud offering that goes beyond Flink. The DeltaStream platform addresses not only the difficulties users face when starting out with Flink, but also the difficulties they face when trying to integrate stream processing into their systems in general.

A serverless stream processing platform

While Flink is great in many ways, users may find it, and other stream processing frameworks, quite complex, especially if they aren’t familiar with other distributed processing systems. Operating Flink has a learning curve: users need to understand Flink’s memory model, tune savepointing and checkpointing, and diagnose issues when things fail. There are plenty of gotchas that come along with Flink and other stream processing frameworks, such as data skew and inefficient serialization. For companies trying to add stream processing, this means a longer ramp-up time for engineers to learn and build the system. A serverless system such as DeltaStream removes all of this operational overhead, which means companies can set up and trust stream processing applications in minutes instead of months or even years.

Lower barrier to entry with extended SQL and modern user interface

To work with Flink directly, developers need to be experts in Java or Scala. Flink has also been developing its Python API, but as of this writing it is much less mature, has fewer features, and is less performant than JVM Flink. DeltaStream removes this barrier by exposing our own SQL grammar for defining stream processing jobs. Because we own the SQL grammar, the user experience is abstracted away from Flink’s complexities while still taking advantage of Flink’s performance. Accessible through our scriptable CLI, web application, or REST API, the DeltaStream platform lowers the barrier to entry for stream processing while also providing a richer developer experience. Visibility into query metrics, secure data sharing capabilities, and RBAC for queries and other entities in the platform are just a few of the ways that DeltaStream makes it easy for users to manage their streaming resources.
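
For example, a simple routing or filtering job that would otherwise require a full Java or Scala Flink program can be expressed in DeltaStream as a single continuous query. A minimal sketch, with hypothetical stream and column names:

  1. CREATE STREAM high_value_orders AS
  2. SELECT order_id, customer_id, amount
  3. FROM orders
  4. WHERE amount > 1000;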

Support for real-time materialized views

Materialized views have been a staple of at-rest databases for a long time, but keeping them constantly up to date puts a lot of strain on the underlying database. Flink does not support materialized views out of the box (although it does have Dynamic Tables). While Flink is great for building stream processing applications for event-driven use cases such as anomaly detection, use cases that aren’t event-driven, such as querying for the most-clicked ad in an ad campaign, may be better served by materialized views. DeltaStream provides materialized views as a core feature, so both kinds of use cases are supported in the same platform.
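
Continuing the ad campaign example, a continuously maintained view of clicks per ad might look like the following in DeltaStream (a minimal sketch; the stream and column names are hypothetical):

  1. CREATE MATERIALIZED VIEW ad_click_counts AS
  2. SELECT ad_id, COUNT(*) AS click_count
  3. FROM ad_clicks
  4. GROUP BY ad_id;

A plain SELECT against ad_click_counts then returns the latest counts without replaying the stream.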

Wrapping Up

Flink is the industry gold standard for stream processing and that’s the main reason why we use it as our underlying processing engine. However, it comes with its complexities, both in development and operations. By managing the operations of Flink and using a more streaming-friendly SQL grammar, DeltaStream aims to make stream processing intuitive and pain free.

Make sure you visit our blogs page if you want to read more about what’s going on at DeltaStream. If you’re ready to try out our system, you can schedule a demo with us or sign up for a free trial. We’d love to show you what we’ve built.

08 Aug 2023

Min Read

Bringing Private SaaS to Streaming Data Customers in DeltaStream

Data security and privacy are top of mind for all customers. This concern is not exclusive to highly regulated industries such as financial services; it applies to all companies. The number of incidents related to “data leakage” continues to increase, and customers are more vigilant than ever about protecting their data. SaaS has been the dominant pattern for quickly adopting new solutions and technologies: it has simplified the complexity and costs of deploying and managing technology, freeing up teams to focus on business-critical activities. However, with SaaS, data has to leave the customer’s network. With private connections that don’t traverse the public internet, the risk of data leakage is low, yet we see customers increasingly looking for solutions that keep data isolated within their own network.

When we designed DeltaStream, we architected it to enable flexibility in how we deploy our platform. In 2022 we launched our SaaS service, which is serverless and provides both dedicated and multi-tenant deployments. Our SaaS offering is highly secure and enabled only through private connections to a customer’s streaming data store (e.g., Apache Kafka). However, for some customers, data leaving their network is a non-starter. We completely understand that customers have different tolerances for data security risk based on the type of data, the industry they operate in, and their customers’ requirements.

A Modern Solution

To address our customers’ security and privacy concerns, last week we announced the launch of our Private SaaS offering. With Private SaaS, customers can now process their streaming data without the data ever leaving their network, while the service is fully managed by DeltaStream. This does two things:

  1. Guarantees the privacy and security of data while eliminating the overhead and complexity of managing the platform.
  2. Gives customers the ability to process their streaming data within minutes.

We did this by making our data plane portable while connecting it to the same control plane as our SaaS product. Our SaaS and Private SaaS offerings share the same code base, guaranteeing that customers have the latest set of features regardless of the product they choose.

As a seed-stage company, we deliberately made this investment early to address customers’ pressing data security and privacy concerns.

How Private SaaS Works

As a serverless, cloud-native platform, DeltaStream includes a control plane and a data plane. The control plane includes services such as authentication, access control, and infrastructure and metadata management, and it orchestrates many data planes. The data plane is where DeltaStream accesses and processes customer data: reading and writing data from and to stores such as Apache Kafka, running streaming queries, and building materialized views. In a typical SaaS model, both the control plane and the data plane run in the service provider’s Virtual Private Cloud (VPC) account, requiring the transfer of customer data between the customer VPC and the service provider VPC. The following figure depicts a typical SaaS architecture.

While this is acceptable to many customers, some do not allow data to leave their network. For such cases, Private SaaS is the right solution, since it guarantees data won’t leave the customer VPC at any point in time. In this model, all access and processing of customer data happens inside the customer VPC, achieved by running the data plane within the customer VPC. The following figure depicts the architecture for DeltaStream Private SaaS.

With the Private SaaS model, DeltaStream provides the same serverless user experience that customers love, but with the guarantee that data remains within the customer VPC. DeltaStream runs the data plane in the customer’s VPC through a secure connection and handles all aspects of running stream processing workloads.

The Benefits of Private SaaS

1. Data Sovereignty 

As public clouds have become the norm, questions around data sovereignty have never been more pertinent. Numerous jurisdictions have data export restrictions; the most widely known regulation is GDPR, which protects personal data collected by customers operating in the EU. Customers cannot export this personal data and must implement procedures to protect it. Private SaaS allays this concern by allowing the data plane to reside in EU data centers. Ownership of and responsibility over the data never changes hands, and authority always lies with the customer.

2. SaaS adoption made easy

For some organizations, adopting SaaS services that run outside of their VPC is not an option, regardless of local laws. That means undertaking complex initiatives and taking on the overhead of building, operating, and maintaining not only the required platforms, but also disparate centers of excellence to keep the lights on. This translates into inflexibility and a longer time to market.

3. Serverless platform in your VPC

The platform is serverless, meaning we scale the infrastructure for the data plane that runs in your network. You get all the benefits of our SaaS offering while keeping the data in your VPC.

4. Reduce Costs

Private SaaS enables our customers to leverage their existing relationships with their cloud service providers (CSPs) and apply their negotiated discounts and other benefits to their DeltaStream bill. This brings much-needed transparency to your cloud costs and enables you to manage them better.

Try DeltaStream Private SaaS

DeltaStream provides a comprehensive stream processing platform to manage, secure and process all your event streams. It is easy to operate, and scales automatically. You can get more in-depth information about DeltaStream features and use cases by checking the DeltaStream blog. If you are ready to try a modern stream processing solution, you can get started with a free trial of DeltaStream or reach out to our team to schedule a demo.

29 Mar 2023

Min Read

What is DeltaStream and When Should You Use It?

As the creator of ksqlDB at Confluent, I know firsthand the power of stream processing, but I also understand its limitations and complexities. I founded DeltaStream to build a comprehensive platform that truly revolutionizes stream processing. The platform we built is based on three themes:

  • A complete stream management and processing solution that enables users to see value in minutes while driving down operating costs.
  • A familiar and unified view of streams across leading streaming storage services (e.g., Kafka or Kinesis).
  • A security model that elegantly enables and restricts access to data streams.

What is DeltaStream?

DeltaStream is a unified, serverless stream processing platform that integrates with streaming storage services including Apache Kafka, AWS Kinesis, Confluent Cloud, AWS MSK, and Redpanda. Think of it as the compute layer on top of your streaming storage. It provides the functionality of streaming analytics (stream processing) and streaming databases, along with additional features, to deliver a complete platform to manage, process, secure, and share streaming data.

DeltaStream provides a SQL-based interface where you can easily create stream processing applications such as streaming pipelines, materialized views, microservices, and more. It has a pluggable processing engine and currently uses Apache Flink as its primary stream processing engine. However, DeltaStream is more than just a query processing layer on top of Kafka or Kinesis. It brings relational database concepts to the data streaming world, including namespacing and role-based access control, enabling you to securely access, process, and share your streaming data regardless of where it is stored. Unlike existing solutions that mainly focus on processing capabilities, DeltaStream provides a holistic solution for both processing and operating/managing your streaming data.

Here’s a summary of DeltaStream’s main capabilities that make it uniquely suited for processing and managing data streams:

  • DeltaStream is serverless. The user no longer has to worry about clusters or servers, or about architecting and scaling infrastructure to run real-time applications. Gone are the days of cluster sizing, keeping track of which cluster your queries run in, or deciding how many tasks to allocate to your applications. Unlike many platforms that run multiple queries in one cluster and share cluster resources, queries in DeltaStream run in isolation, can scale up and down independently, and seamlessly recover from failures! DeltaStream takes care of all those complexities so you can focus on building the core products that bring value to you and your organization.
  • SQL as the primary interface. SQL is the primary interface for DeltaStream. From creating databases and streams, to running continuous queries or building materialized views on these streams, you can do it all in a simple and familiar SQL interface. DeltaStream provides SQL extensions that enable users to express streaming concepts that don’t have equivalents in traditional SQL. Additionally, if your compute logic requires more than SQL, you can use DeltaStream’s UDFs/UDAFs to define and perform such computations.
  • Always up-to-date materialized views. Materialized views are a native capability in DeltaStream. You can build always up-to-date materialized views using continuous queries. Once a materialized view is created, you can query it the same way you query materialized views in relational databases!
  • Unified view over multiple streaming stores. DeltaStream gives you a single view into all your streaming data across all your streaming stores. For example, whether you are using one Kafka cluster, multiple Kafka clusters, or multiple platforms like Kafka and Kinesis, DeltaStream provides a unified view of the streaming data, and you can write queries on these streams regardless of where they are stored.
  • Intuitive namespacing. Streaming storage systems such as Apache Kafka have a flat namespace – you can think of this as a file system with no folders! This makes it very challenging to organize streams in such systems. By providing namespacing, DeltaStream enables users to organize their streams into databases and schemas, similar to the way they organize their tables in relational databases. And with the storage abstraction described above, you can organize your streaming data across all your streaming storage systems.
  • Fine-grained security that you know and love. You can define fine-grained access privileges to determine who can access and perform which operations on objects in DeltaStream. With DeltaStream’s role-based access control (RBAC), you can define roles and assign them to users. All of this can be done in the SQL that you know and love. For instance, you can give read privileges on a specific stream to a given role with a one-line statement (see the sketch after this list)!
  • Break down silos for your streaming data with secure sharing. With namespacing, storage abstraction, and role-based access control, DeltaStream breaks down silos for your streaming data and enables you to share it securely across multiple teams in your organization.
  • Push notifications. You can create notifications on the results of your continuous queries and push them to a variety of services such as Slack, email, or PagerDuty, or have the results call custom APIs. For instance, suppose you have a stream of sensor data from vehicles. You can have a query compute the average speed of each vehicle and, if the average exceeds a threshold for a given time window, send a notification to the driver.
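
As a rough sketch of the namespacing and RBAC capabilities above (the object names are hypothetical, the GRANT mirrors the example later in this post, and the exact grammar may differ slightly; see the DeltaStream documentation):

  1. -- Organize streams into a database and schema instead of a flat namespace
  2. CREATE DATABASE analytics;
  3. CREATE SCHEMA clickstream;
  4. -- Give a role read access to a single stream with a one-line statement
  5. GRANT USAGE, SELECT PRIVILEGE ON page_views TO analyst_role;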

How Users Interact with DeltaStream

Users can interact with DeltaStream through its REST API, a web application or CLI. The following figure shows a screenshot of the DeltaStream web application. Also, using our REST API, you can have your own application call the API or tools like GitHub Actions submit a set of statements that define an application or pipeline.

When should you use DeltaStream?

With the aforementioned capabilities, you can quickly and easily build streaming applications and pipelines on your streaming data. If you are already using a streaming storage service such as Apache Kafka, AWS Kinesis, Confluent Cloud, AWS MSK, or Redpanda, you should consider using DeltaStream. Here are a couple of use cases where DeltaStream fits well.

Assume you have a vehicle information topic in your production Kafka cluster where you ingest real-time information such as GPS coordinates, speed, and other vehicle data. Suppose you want to share this stream in real time with another team, but only want to share information from vehicles in a certain geographic region and obfuscate some of the data. Also, you don’t want to give access to the production Kafka cluster; instead, you’d like to provide the shared information as a topic in a separate, already existing Kafka cluster called test_kafka. Using DeltaStream, you can easily write a SQL query, like the one shown below, to read the original stream, perform the desired projections, transformations, and filtering, and continuously write the result into a new stream backed by a topic in test_kafka.

  1. CREATE STREAM resultStream WITH ('store' = 'test_kafka') AS
  2. SELECT
  3. vid, lat, lon, mask(pii, '*')
  4. FROM
  5. vehicleStream
  6. WHERE
  7. isInGeoFence(lat, lon) = true;

Once you have the result stream, you can use the following statement to grant read privileges on it to the other team (represented here by alice). They will only see the result stream, without seeing the source stream or the production Kafka cluster.

  1. GRANT USAGE, SELECT PRIVILEGE ON resultStream TO alice;

As another example, consider a wiki service where all user interactions with every wiki page are streamed into a Kinesis stream. Let’s assume we want to provide real-time page statistics, such as the number of edits per wiki page. You can easily build a materialized view in DeltaStream using an aggregate query like the following:

  1. CREATE MATERIALIZED VIEW wiki_edit_count AS
  2. SELECT
  3. page_id, count(*) AS edit_count
  4. FROM
  5. wiki_events
  6. WHERE
  7. wiki_event_type = 'edit'
  8. GROUP BY
  9. page_id;

This creates a materialized view in DeltaStream that holds the edit count per wiki page; every time an edit event is appended to the wiki_events stream, the view is updated in real time. You can then show the up-to-date edit count on a wiki page each time it loads by querying the materialized view. DeltaStream ensures that users always see the latest edit counts when they open a page.
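
Serving the count on page load then becomes a simple lookup against the view, for example (the page_id value is hypothetical):

  1. SELECT edit_count
  2. FROM wiki_edit_count
  3. WHERE page_id = 'kafka_101';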

Now that you have seen some of DeltaStream’s capabilities along with a few example use cases, check out DeltaStream as the platform for processing, organizing, and securing your streaming data. You can schedule a demo to see all of these capabilities in the context of a real-world streaming application! Browse our blogs for more in-depth information on features and use cases. Once you are ready, get in touch to build your streaming applications with DeltaStream, the next-generation serverless streaming platform.
