21 Nov 2022

DeltaStream 101 Part 4 – Data Serialization Formats

So far in our DeltaStream 101 blog series, we’ve covered the basics of DeltaStream as well as a couple of case studies centered around creating materialized views and processing data from different streaming stores like Apache Kafka and AWS Kinesis. If you’ve missed any of the previous posts, check them out below:

You may recall from previous DeltaStream 101 posts that users create DeltaStream stores on top of streaming storage systems like Apache Kafka and AWS Kinesis. After creating a store, the data that lies within it ultimately needs to be read and deserialized from bytes for processing. The most popular data formats are JSON, Protocol Buffers (Protobuf), and Apache Avro, all of which are supported by DeltaStream. Each of these formats requires its own serialization/deserialization logic, which can add complexity when you have to work with several of them at once. With DeltaStream, however, these complexities are handled behind the scenes, creating a streamlined development process for the user.

In this blog post, we will explore how DeltaStream seamlessly integrates with different data serialization formats, and walk through an example use case.

Imagine that you work for an eCommerce website and you have two streams of data – one that represents transactional data and another that represents customer information. The transactional data is encoded using Protobuf and the customer data is encoded using Avro. The goal is to build a real-time analytics dashboard of key performance indicators so your business can visualize the most up-to-date information about how your products are doing in the marketplace. Let’s also assume that the service that creates the dashboard expects the data to come in JSON format. In our analytics dashboard, we want the following metrics:

  • Revenue Metric: real-time dollar sum of all transactions per hour
  • Geographic Traffic Metric: real-time count of transactions from customers by state per hour

Diagram 1: Overview of SQL pipelines
Query 1 aggregates revenue from transactions,
Query 2 enriches transactions by joining them with customers,
Query 3 aggregates transactions from transactions_enriched by state

Set Up Data Formats: Descriptors and Schema Registries

In our use case we have two streams, the Transactions stream in Protobuf format, and the Customers stream in Avro format. We’ll first cover how to set these streams up as sources for DeltaStream.

Transactions Stream

Below in Code 1a, you’ll see an example of a record from the transactions stream. A new record is created every time a customer buys an item. The record itself contains a "tx_time" timestamp of when the transaction occurred, a "tx_id" that is unique per transaction, fields describing which item was purchased and for how much, and a "customer_id" that identifies which customer the transaction belongs to. Code 1b shows the Protobuf message used to create the Protobuf descriptor that serializes and deserializes these transactions.

  {
    "tx_time": 1667260111428,
    "tx_id": "fe92644b-973a-4b65-ae4c-4b4eed23b5a0",
    "item_id": "Item_7",
    "price": 11,
    "quantity": 1,
    "customer_id": "Customer_9"
  }

Code 1a: An example transactions record

  syntax = "proto3";

  message Transactions {
    int64 tx_time = 1;
    string tx_id = 2;
    string item_id = 3;
    int32 price = 4;
    int32 quantity = 5;
    string customer_id = 6;
  }

Code 1b: Protobuf message for transactions

For the Transactions stream, you can upload the Protobuf descriptor as a DeltaStream descriptor. A DeltaStream descriptor is an object that holds the resources a streaming SQL query needs to serialize and deserialize data, such as a Protobuf descriptor. After creating the DeltaStream descriptor, you can attach it to the relevant DeltaStream topic.

  CREATE DESCRIPTOR_SOURCE pb WITH (
    'file' = '/path/to/protos/transactions_value.proto'
  );

  SHOW DESCRIPTORS;

  # Name
  # -------------------------
  # pb.transactions_value

Code 1c: Commands to create and show DeltaStream descriptors

Now, let’s observe the contents of our Protobuf topic before and after we attach a descriptor:

  PRINT transactions;

  # |�����0$fe92644b-973a-4b65-ae4c-4b4eed23b5a0Item_8%H�2A(2 Customer_7
  # |�����0$b6c7bba9-753d-46ec-85e3-32856df574faItem_4%���@(2 Customer_4

Code 1d: Print DeltaStream topic before attaching descriptor

  UPDATE TOPIC transactions WITH (
    'value.descriptor' = pb.transactions_value
  );

Code 1e: Command to update DeltaStream topic with descriptor

  PRINT transactions;

  # | {"eventTime":"1667324708830","txId":"fe92644b-973a-4b65-ae4c-4b4eed23b5a0","itemId":"Item_9","price":8,"quantity":3,"customerId":"Customer_2"}
  # | {"eventTime":"1667324709830","txId":"b6c7bba9-753d-46ec-85e3-32856df574fa","itemId":"Item_4","price":13,"quantity":4,"customerId":"Customer_7"}

Code 1f: Print DeltaStream topic after attaching descriptor

Notice how in Code 1d, before the DeltaStream descriptor was attached, the contents of the topic are indiscernible bytes. After the DeltaStream descriptor is attached to the topic, the contents are properly deserialized, as shown in Code 1f.

Finally, we can define a stream from this topic as shown in DDL 1:

  CREATE STREAM transactions (
    tx_time BIGINT, tx_id VARCHAR, item_id VARCHAR,
    price INTEGER, quantity INTEGER, customer_id VARCHAR
  ) WITH (
    'topic' = 'transactions', 'value.format' = 'PROTOBUF',
    'timestamp' = 'tx_time'
  );

DDL 1: A DDL statement for operating on the transactions records

Customers Stream

Below in Code 2a, you’ll see an example of a record from the customers stream. The data in this stream describes information about a particular customer, and a new record is created each time a customer’s information is updated. In each record, there is an "update_time" field with a timestamp of when the update occurred, an "id" that maps to a unique customer, the customer’s name, and the up-to-date address for the customer. Code 2b shows the Avro schema used to serialize and deserialize the records in the customers stream.

  {
    "update_time": 1667260173792,
    "id": "Customer_1",
    "name": "Jill",
    "address": {
      "state": "AZ",
      "city": "Tucson",
      "zipcode": "85721"
    }
  }

Code 2a: An example customers record

  {
    "fields": [
      {
        "name": "update_time",
        "type": {
          "format_as_time": "unix_long",
          "type": "long"
        }
      },
      {
        "name": "id",
        "type": {
          "type": "string"
        }
      },
      {
        "name": "name",
        "type": {
          "type": "string"
        }
      },
      {
        "name": "address",
        "type": {
          "type": "record",
          "name": "addressUSRecord",
          "fields": [
            {
              "name": "state",
              "type": "string"
            },
            {
              "name": "city",
              "type": "string"
            },
            {
              "name": "zipcode",
              "type": "string"
            }
          ]
        }
      }
    ],
    "name": "customers",
    "namespace": "deltastream",
    "type": "record"
  }

Code 2b: Avro schema for customers

For serialization of Avro records, it’s common to use a schema registry for storing Avro schemas. DeltaStream makes it easy to integrate with an external schema registry. You can import a schema registry by providing a name, the type of the schema registry, and any required connectivity-related configuration. The imported schema registry is then attached to a store, so any data from that store can be serialized and deserialized using the schemas in the configured registry. Note that even though a schema registry is attached at the store level, it operates at the topic level. This means a store can contain both topics whose data formats require the schema registry and topics that don’t, such as topics carrying JSON data whose schema lives outside the registry. The schema registry is simply used for the topics that require it and ignored for the others. Currently, we require a schema registry if your data is serialized with Avro.

  CREATE SCHEMA_REGISTRY sr WITH (
    'type' = CONFLUENT_CLOUD, 'availability_zone' = 'us-east-1',
    'uris' = 'https://abcd-efghi.us-east-2.aws.confluent.cloud',
    'confluent_cloud.key' = 'fake_key',
    'confluent_cloud.secret' = 'fake_secret'
  );

Code 2c: Command to create DeltaStream schema registry

  PRINT customers;

  # ��Customer_5 | �����ΆaCustomer_Jane
  # cstateNcitySanta Fezipcode
  # 87505
  # ��Customer_4 | �����ΆaCustomer_Jill
  # stateCcity
  # Irvinezipcode
  # 92612

Code 2d: Print DeltaStream topic before attaching schema registry

  UPDATE STORE kafkastore
  WITH ('schema_registry.name' = sr);

Code 2e: Command to update DeltaStream store with schema registry

  PRINT customers;

  # {"id":"Customer_1"} | {"update_time":1667335058024,"id":"Customer_1","name":"Jane","address":{"zipcode":"92612","city":"Irvine","state":"CA"}}
  # {"id":"Customer_5"} | {"update_time":1667335059489,"id":"Customer_5","name":"Jane","address":{"zipcode":"87505","city":"Santa Fe","state":"NM"}}

Code 2f: Print DeltaStream topic after attaching schema registry

Similar to how we needed the DeltaStream descriptor to deserialize data in transactions, the schema registry must be attached to the store to properly deserialize data in customers. In Code 2d, before the schema registry is added to the store, the records in the customers topic are indiscernible. After updating the store with the schema registry, we can see the contents are properly deserialized, as shown in Code 2f.

Since the customers data is really keyed data, where information is updated per customer “id”, it makes sense to create a changelog on this topic. We can define that changelog and specify “id” as the primary key as shown in DDL 2:

  CREATE CHANGELOG customers (
    update_time BIGINT,
    id VARCHAR,
    "name" VARCHAR,
    address STRUCT < "state" VARCHAR, city VARCHAR, zipcode VARCHAR >,
    PRIMARY KEY(id)
  ) WITH (
    'topic' = 'customers', 'value.format' = 'AVRO',
    'timestamp' = 'update_time'
  );

DDL 2: A changelog DDL to capture the latest customer information from the customers topic

Revenue Metric: real-time dollar sum of all transactions per hour

For our hourly dollar sum revenue metric, we need to perform a windowed aggregation on the transactions data. We’ve already created and attached a descriptor for this topic and defined a stream relation on top of it. From there, in Query 1, we can aggregate the hourly dollar sum of all transactions with a short SQL query:

  CREATE STREAM hourly_revenue WITH ('value.format' = 'json') AS
  SELECT
    window_start,
    window_end,
    SUM(price * quantity) AS revenue
  FROM
    tumble(transactions, SIZE 1 hour)
  GROUP BY
    window_start,
    window_end;

Query 1: Aggregation of hourly revenue from transactions stream

By default, a new stream created from an existing stream inherits the properties of the source stream. However, in this query we specify 'value.format'='json' in the WITH clause, which instructs the output stream to serialize its records as JSON.
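
For contrast, here is a minimal sketch of the same aggregation without that override; assuming the default property inheritance just described, the resulting stream would keep the source’s Protobuf format (the hourly_revenue_pb name is hypothetical):

  -- With no 'value.format' in the WITH clause, the new stream is assumed to
  -- inherit its serialization format (Protobuf) from the transactions source.
  CREATE STREAM hourly_revenue_pb AS
  SELECT
    window_start,
    window_end,
    SUM(price * quantity) AS revenue
  FROM
    tumble(transactions, SIZE 1 hour)
  GROUP BY
    window_start,
    window_end;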

We can inspect the results of the new stream using an interactive query, which prints the results to our console:

  SELECT * FROM hourly_revenue;

  # | {"window_start":"2022-11-01T21:18:40","window_end":"2022-11-01T22:18:40","revenue":1459}
  # | {"window_start":"2022-11-01T21:18:50","window_end":"2022-11-01T22:18:50","revenue":2232}

Geographic Traffic Metric: real-time count of transactions from customers by state per hour

The customers changelog defined earlier provides the information we need to compute geographic traffic for customers using our eCommerce website, but we also need the transactions stream to generate the number of transactions per state.

We can achieve this by joining the transactions stream with the customers information. The persistent SQL statement in Query 2 shows how to enrich the transactions stream as intended:

  CREATE STREAM transactions_enriched WITH (
    'value.format' = 'json', 'timestamp' = 'tx_time'
  ) AS
  SELECT
    transactions.tx_time,
    transactions.tx_id,
    transactions.item_id,
    transactions.price,
    transactions.quantity,
    transactions.customer_id,
    customers.name AS "name",
    customers.address
  FROM
    transactions
  JOIN customers ON transactions.customer_id = customers.id;

Query 2: Enrich transaction records with customers information for the metrics by state

Running a simple interactive query to inspect the results, we can see the enriched stream includes the customer address information with the transaction information:

  SELECT * FROM transactions_enriched;

  # | {"tx_time":1667433311829,"tx_id":"fe92644b-973a-4b65-ae4c-4b4eed23b5a0","item_id":"Item_5","price":7,"quantity":3,"customer_id":"Customer_1","name":"Jill","address":{"state":"CA","city":"San Mateo","zipcode":"94401"}}

Using Query 2, we were able to join transactions in Protobuf format with customers in Avro format, and write the result to transactions_enriched in JSON format without worrying about what the format requirements are for each of the source or destination relations. Now that we have the transactions_enriched stream, we can perform a simple aggregation to produce our desired metric:

  CREATE STREAM hourly_tx_count_by_state AS
  SELECT
    window_start,
    window_end,
    count(tx_id) AS tx_count,
    address -> state AS "state"
  FROM
    tumble(transactions_enriched, SIZE 1 hour)
  GROUP BY
    window_start,
    window_end,
    address -> state;

Query 3: Aggregation of hourly count of unique transactions by state

Inspecting the records in the hourly_tx_count_by_state relation, we can see aggregated transaction counts broken down by state and time window:

  SELECT * FROM hourly_tx_count_by_state;

  # | {"window_start":"2022-11-02T22:58:40","window_end":"2022-11-02T23:58:40","tx_count":512,"state":"AZ"}
  # | {"window_start":"2022-11-02T22:58:40","window_end":"2022-11-02T23:58:40","tx_count":330,"state":"NM"}
  # | {"window_start":"2022-11-02T22:58:40","window_end":"2022-11-02T23:58:40","tx_count":956,"state":"CA"}

Conclusion

In this post, we demonstrated how DeltaStream makes it easy to work with different serialization formats, whether you need to attach descriptors that describe your data or link your schema registry. The example above demonstrated how a user can set up pipelines in minutes to transform data from one format to another, or to join data available in different formats. DeltaStream eliminates the complexity of managing streaming applications and dealing with complicated serialization/deserialization logic, so the user can focus on what matters most: writing easy-to-understand SQL queries and generating valuable data for real-time insights or features.

Expect more blog posts in the coming weeks as we showcase more of DeltaStream’s capabilities for a variety of use cases. Meanwhile, if you want to try this yourself, you can request a demo.

01 Nov 2022

DeltaStream 101 Part 3 – Enriching Apache Kafka Topics with Amazon Kinesis Data Streams

In Part 1 of our DeltaStream 101 series, we uncovered how DeltaStream connects to your existing streaming storage, Apache Kafka or Amazon Kinesis, using the DeltaStream Store. In this part of the series, we’re going to expand on that concept and use a real-life example of how you can enrich, filter, and aggregate your data across different streaming stores to simplify your product’s data needs.

As you may remember, we created a clicks stream backed by a clicks topic in an Apache Kafka cluster, and ran an aggregate query to count the number of clicks per URL and device type. In this post, we’re going to enrich the clicks stream with user data. The user data comes from an Amazon Kinesis data stream, on which we will declare a users changelog in DeltaStream. A changelog represents a stream of upserts or deletions to our users’ information, which allows the resulting enriched stream(s) to include the registered user’s information as well. Using the enriched user clicks, we’re going to aggregate the number of clicks per URL and region.

This pipeline is demonstrated in Diagram 1:

Diagram 1: Query 1 enriches the clicks stream in Apache Kafka with the users changelog in Amazon Kinesis, and Query 2 aggregates user clicks per region.

Accessing the Enrichment Information

First, we need to set up a store to access our Amazon Kinesis data streams:  

  $ cat ./kinesis.properties
  'kinesis.access_key_id'='[AWS access key ID]'
  'kinesis.secret_access_key'='[AWS secret access key]'

The following statement creates a store named prod_kinesis with the provided configurations:

  CREATE STORE prod_kinesis
  WITH (
    'type' = KINESIS,
    'availability_zone'='us-east-2',
    'uris'='https://kinesis.us-east-2.amazonaws.com:443',
    'config_file'='./kinesis.properties'
  );

Once we declare the prod_kinesis store, as with any DeltaStream store, we can inspect the Kinesis data stream that holds our user information, users, by printing it as a topic:

  PRINT TOPIC users;

Printing the users topic shows the following content:

  [
    {
      "registertime": 1665780360439,
      "name": "Edna Hook",
      "email": "[email protected]",
      "userid": "User_4",
      "regionid": "Region_6",
      "gender": "OTHER",
      "interests": [
        "News",
        "Movies"
      ],
      "contactinfo": {
        "phone": "6503349999",
        "city": "San Mateo",
        "state": "CA",
        "zipcode": "94403"
      }
    },
    {
      "registertime": 1665780361439,
      "name": "Shaan Gough",
      "email": "[email protected]",
      "userid": "User_6",
      "regionid": "Region_9",
      "gender": "OTHER",
      "interests": [
        "Game",
        "Sport"
      ],
      "contactinfo": {
        "phone": "6503889999",
        "city": "Palo Alto",
        "state": "CA",
        "zipcode": "94301"
      }
    }
  ]

Using the values in the data stream, we can create a changelog with the following Data Definition Language (DDL) statement. Note that for the newly declared changelog we’re using the same DeltaStream database and schema, clickstream_db.public, that we declared in Part 1 of this series:

  CREATE CHANGELOG users (
    registertime BIGINT,
    name VARCHAR,
    email VARCHAR,
    userid VARCHAR,
    regionid VARCHAR,
    gender VARCHAR,
    interests ARRAY<VARCHAR>,
    contactinfo STRUCT<phone VARCHAR, city VARCHAR, "state" VARCHAR, zipcode VARCHAR>,
    PRIMARY KEY(userid)
  )
  WITH ('store'='prod_kinesis', 'topic'='users', 'value.format'='json');

Every CHANGELOG defines a PRIMARY KEY, which identifies the entity whose changes the changelog captures.

Enriching the Clicks

Let’s now use our user information to enrich the click events in the clicks stream, and publish the results back into the prod_kafka store from our previous post in this series:

  CREATE STREAM user_clicks
  WITH ('store'='prod_kafka')
  AS SELECT
    u.registertime AS user_registertime,
    u.userid AS uid,
    u.regionid AS user_regionid,
    u.gender AS user_gender,
    u.interests AS user_interests,
    c.event_time AS click_time,
    c.device_id AS device_type,
    c.url AS click_url,
    c.ip AS click_location
  FROM clicks c
  JOIN users u ON c.userid = u.userid;

Query 1: Enriching product clicks with users information so we can expand the clicks report with region

Using just a single persistent SQL statement, Query 1, we were able to:

  • Enrich the click events by joining the clicks and users relations on the userid column from Kafka and Kinesis, respectively.
  • Project only the non-PII data from the enriched clicks stream, since we don’t want the sensitive user data to leave our Kinesis store.
  • Write back the result of the enrichment into the prod_kafka store, creating a new user_clicks stream backed by a Kafka topic configured the same as the underlying topic for the clicks stream.

Since we’re joining a stream with a changelog, a temporal join is implied. In other words, click events are enriched with the correct version of the user information, updating the resulting user_clicks stream and any other downstream streams with the latest user information.

We can inspect the result of the temporal join between clicks and users using the following query:

  SELECT * FROM user_clicks;

Showing the following records in the user_clicks stream:

  [
    {
      "user_registertime": 1665780360439,
      "uid": "User_4",
      "user_regionid": "Region_6",
      "user_gender": "OTHER",
      "user_interests": [
        "News",
        "Movies"
      ],
      "click_time": 1497014222380,
      "device_type": "mobile",
      "click_url": "./home",
      "click_location": "12.12.12.12"
    },
    {
      "user_registertime": 1665780361439,
      "uid": "User_6",
      "user_regionid": "Region_9",
      "user_gender": "OTHER",
      "user_interests": [
        "Game",
        "Sport"
      ],
      "click_time": 1497014222385,
      "device_type": "desktop",
      "click_url": "./home",
      "click_location": "12.12.12.12"
    }
  ]

Clicks per URL and Region

We can now create a new persistent SQL statement, Query 2, to continuously aggregate user_clicks and count the number of clicks per URL per region, and publish the result back into our Kafka store, prod_kafka, under a new url_region_click_count relation, which is a changelog:

  CREATE CHANGELOG url_region_click_count
  AS SELECT
    click_url,
    user_regionid,
    count(*) AS url_region_click_count
  FROM user_clicks
  GROUP BY click_url, user_regionid;

Query 2: Aggregating number of clicks per URL and region
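
As with user_clicks, we could sanity-check this aggregation with an interactive query before pointing a dashboard or another downstream consumer at it. A minimal sketch, following the same interactive-query pattern used above (output omitted):

  SELECT * FROM url_region_click_count;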

User Experience and Beyond

In this post, we looked at a case study where we enriched, transformed, and aggregated data from multiple streaming stores, namely Apache Kafka and Amazon Kinesis. We built a pipeline that was up and running in seconds, without writing streaming applications that could take much longer to develop and would require ongoing maintenance. This is just a simple example of how DeltaStream makes it possible for developers to implement complex streaming applications.

21 Sep 2022

DeltaStream 101 Part 2 – Always Up-to-date Materialized Views for Kafka and Kinesis

If you recall in DeltaStream 101 Part 1, we introduced DeltaStream, a serverless stream processing platform to manage, secure, and process all your streams in the cloud, and walked through a simple clickstream analytics use case.

In this post, we will continue to build on that base. Here, we’ll walk through how you can build materialized views that are continuously updated based on the results of the streaming queries in our previous post, and serve the results of those views to a user on a web page. The event streams that we start from can live in popular event streaming storage platforms such as Apache Kafka, Confluent Cloud, Amazon MSK, or Amazon Kinesis.

This is a sample of a web page built with materialized views in DeltaStream to serve user statistics to a visitor.

Before we dive in, what is a materialized view? In short, a materialized view is the result of a query, stored as a table. Sounds simple enough. But in a database built for streaming data, queries must produce the most up-to-date results in a real-time manner whenever called, so creating and updating a materialized view becomes more complex. Fortunately, DeltaStream takes care of all of these concerns under the hood, serving results with sub-second latency. For the DeltaStream user, everything behind creating materialized views for streaming data looks like familiar SQL. The following figure shows how continuous queries in DeltaStream can build materialized views from a stream of events in Apache Kafka.

Materialized View #1: Number of times a URL has been visited

Let’s take a look at our first materialized view. Here, we are using the queries from our previous post to create a materialized view that represents the number of times every url has been visited. While this looks like standard SQL, the fact is that if a user visited that URL only half a second ago, the visit will already be reflected in our materialized view.

  CREATE MATERIALIZED VIEW url_visit_count AS
  SELECT
    url,
    count(*) AS url_visit_count
  FROM
    clicks_dev
  GROUP BY
    url;

Once we create the materialized view in DeltaStream, we can query it for the latest result the same way we would query a table in a relational database. For instance, the following query returns the number of times a url with address “./home” has been visited:

  SELECT
    *
  FROM
    url_visit_count
  WHERE
    url = './home';

With another query, we could find the url with the most views. This can be computed easily using the following query on the materialized view. Note that, again, since our materialized view is continuously updated as click events are received, the result of this query will be the accurate real-time value.

  SELECT
    url,
    url_visit_count
  FROM
    url_visit_count
  ORDER BY
    url_visit_count DESC
  LIMIT 1;

Materialized View #2: Number of visits for each URL via every device type

Now we will build another materialized view. This time we want to build a view to store the number of visits a url has on different devices. If you recall from our previous post, we had filtered out events with a desktop device ID and computed the number of events per url and device ID using a continuous query. Here is how we turn that into a materialized view that DeltaStream updates with very low latency.

  CREATE MATERIALIZED VIEW url_device_visit_count AS
  SELECT
    url,
    device_id,
    count(*) AS url_device_visit_count
  FROM
    clicks_dev
  GROUP BY
    url, device_id;
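
Once created, this view can be queried just like the first one. As a sketch, here is how we might look up the number of visits the './home' url has received from mobile devices (the filter values are illustrative only):

  SELECT
    url,
    device_id,
    url_device_visit_count
  FROM
    url_device_visit_count
  WHERE
    url = './home' AND device_id = 'mobile';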

Materialized View #3: Number of times a user visited a url on each device

And finally, we want to create a materialized view that we can query to know how many times a user visited the website on each device. For example, we may want to see how many times a given user visited various urls in the website on laptops, mobile devices, and tablets. Below is the SQL we can use to create this materialized view in DeltaStream.

  CREATE MATERIALIZED VIEW user_device_visit_count AS
  SELECT
    user_id,
    device_id,
    count(*) AS user_device_visit_count
  FROM
    clicks_dev
  GROUP BY
    user_id, device_id;

Similar to the examples above, once we create the materialized view in DeltaStream, it is ready for querying. Here is an example of querying the last materialized view we created to get the number of visits the user with user_id 'User_9' made from mobile devices.

  SELECT
    user_id,
    user_device_visit_count
  FROM
    user_device_visit_count
  WHERE
    user_id = 'User_9' AND device_id = 'mobile';

I hope you enjoyed these examples of how you can use DeltaStream to go from raw event streams to materialized views that serve the latest, accurate results to a web page. In future posts, we’ll cover more capabilities for building, managing, and securing real-time applications and pipelines. In the meantime, if you want to try this yourself, please request a demo.

22 Aug 2022

DeltaStream 101

DeltaStream is a serverless stream processing platform to manage, secure, and process all your streams in the cloud. One of the main goals of DeltaStream is to make stream processing fast and easy by providing a familiar SQL interface for building real-time streaming applications, pipelines, and materialized views, and by eliminating the complexity of running infrastructure for such applications.

This is the first blog post in a series where we will show some of the capabilities and features of the DeltaStream platform through real-world use cases. In this post, we will use a simple clickstream analytics use case.

Imagine we have a stream of click events continuously appended to a topic named ‘clicks’ in our production Kafka cluster. The following is a sample click event in JSON format:

  {
    "event_time": 1658710466005,
    "device_id": "mobile",
    "user_id": "User_16",
    "url": "./home",
    "ip": "12.12.12.12"
  }

Let’s assume we have a new project to build where we will compute different metrics over the click events in real time. We prefer to build our project on a separate Kafka cluster since we don’t want to make any changes or write any data into the production Kafka cluster while developing our new application. Also let’s assume our Kafka clusters are on Confluent Cloud and DeltaStream can access these clusters through their public endpoints (in future posts we will show how to configure private-link for such connectivity). And finally, we want to have the results in protobuf format.

DeltaStream exposes a REST API with GraphQL through which users can interact with the service. We provide a web-based client along with a CLI client; however, users can also interact with the service directly through the API. In this blog we will use the DeltaStream CLI, and we assume we have already logged into the service.

For our clickstream project we will:

  • Replicate the clicks events from the production Kafka cluster into the development Kafka cluster with the following changes:
    • Convert the event_time to timestamp type
    • Drop the ip field
    • Filter out events with a desktop device id
    • Convert the format to protobuf
  • Compute the number of events per url using a continuous query
  • Compute the number of events per url and device id using a continuous query

The following figure depicts a high-level overview of what we plan to accomplish. Query 1 performs the first item above, while Query 2 and Query 3 perform the second and third bullet points, respectively.

Create stores

The first step to accessing your data in a streaming storage service, such as a Kafka cluster, is to declare stores in DeltaStream. This can be done using the create store statement. In our project, we have two Kafka clusters, so we will declare two stores in DeltaStream. A store in DeltaStream is an abstraction that represents a streaming storage service such as an Apache Kafka cluster or AWS Kinesis. Note that DeltaStream does not create a new Kafka cluster in this case; it simply defines a store for an existing cluster. Once you define a store, you will be able to explore its contents and inspect the data stored there.

The following statements declare our production Kafka cluster. A store has a name along with the metadata that will be used to access the store.

  $ cat ./confluent_cloud.properties
  'kafka.sasl.hash_function'=PLAIN
  'kafka.sasl.username'='[cluster API key]'
  'kafka.sasl.password'='[cluster API secret]'

  <no-db>/<no-store>$ create store prod_kafka WITH (
    'type' = KAFKA,
    'availability_zone'='us-east-1',
    'uris'='pkcxxxxxx.gcp.confluent.cloud:9092',
    'config_file'='./confluent_cloud.properties'
  );

We also need to declare a store for our development Kafka cluster.

  <no-db>/prod_kafka$ create store dev_kafka WITH (
    'type' = KAFKA,
    'availability_zone'='us-east-1',
    'uris'='pkcxxxxxx.gcp.confluent.cloud:9092',
    'config_file'='./confluent_cloud_dev.properties'
  );

Now that we have declared our stores, we can inspect them. In the case of Kafka stores, for instance, we can list topics, create and delete topics with the desired partitions and replication factors, and print the content of topics. Note that we can only perform these operations on a given store if the credentials we provided while declaring the store grant enough permissions for them.

As an example, the following listing shows how we can list the topics in the production Kafka cluster and print the content of the clicks topic.

  <no-db>/prod_kafka$ SHOW TOPICS;
  Topic name
  --------------
  clicks
  pageviews
  userid
  <no-db>/prod_kafka$ PRINT TOPIC clicks;
  | {"event_time":1497014222380,"device_id":"mobile","user_id":"User_16","url":"./home","ip":"12.12.12.12"}
  | {"event_time":1497014222385,"device_id":"desktop","user_id":"User_18","url":"./home","ip":"12.12.12.12"}
  | {"event_time":1497014222390,"device_id":"mobile","user_id":"User_1","url":"./home","ip":"12.12.12.12"}

Once we have our stores declared and tested, we can go to the next step where we will use the relational capabilities of DeltaStream to build our clickstream analysis application.

Create databases and streams

Similar to other relational databases, DeltaStream uses databases and schemas to organize relational entities such as streams. The first step in using the relational capabilities of DeltaStream is to create a database. For our clickstream analysis application, we create a new database using the following statement.

  CREATE DATABASE clickstream_db;

Similar to other relational databases, DeltaStream creates a default schema named ‘public’ when a database is created. Once we create the first database in DeltaStream, it becomes the default database. Now we can create a stream for our source topic, which is in our production Kafka cluster. The following DDL statement declares a new stream over the clicks topic in the prod_kafka store.

  CREATE STREAM clicks (
    event_time BIGINT,
    device_id VARCHAR,
    user_id VARCHAR,
    url VARCHAR,
    ip VARCHAR
  )
  WITH (
    'store' = 'prod_kafka',
    'topic' = 'clicks',
    'value.format' = 'JSON'
  );

Queries

Once we declare a stream over a topic we will be able to build our application by writing continuous queries to process the data in real time.

The first step is to transform and replicate the clicks data from the production Kafka cluster into the development Kafka cluster. In DeltaStream, this can easily be done with a simple query like the following.

  CREATE STREAM clicks_dev
  WITH (
    'store' = 'dev_kafka',
    'value.format' = 'protobuf'
  ) AS
  SELECT
    toTimestamp(event_time) AS event_timestamp,
    device_id,
    user_id,
    url
  FROM clicks
  WHERE device_id <> 'desktop';

The above query creates a new stream backed by a topic named clicks_dev in the dev_kafka cluster. It continuously reads the click events from the clicks stream in the production Kafka cluster, applies the transformations, projection, and filtering, and writes the results into the clicks_dev stream.
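
To verify that the pipeline is running, we could print the backing topic in the development cluster the same way we printed clicks earlier. This is only a sketch, and since clicks_dev is protobuf-encoded, the raw records will not be as readable as the JSON ones above:

  PRINT TOPIC clicks_dev;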

Now that we have clicks_dev, we can write aggregate queries and build their results in the dev_kafka cluster. The first query creates a CHANGELOG that continuously computes and updates the number of events per url.

  CREATE CHANGELOG click_count_per_url AS
  SELECT
    url,
    count(*) AS url_visit_count
  FROM clicks_dev
  GROUP BY url;

Finally, the following query computes the number of visits per url and per device.

  CREATE CHANGELOG click_count_per_url_per_device AS
  SELECT
    url,
    device_id,
    count(*) AS url_device_visit_count
  FROM clicks_dev
  GROUP BY url, device_id;

In this blog post, we showed how you can build a simple clickstream analytics application using DeltaStream. We showed DeltaStream’s capabilities for reading from and writing into different streaming data stores, and how easily you can build stream processing applications with a few SQL statements. This is the first blog post in a series where we will show some of the capabilities and features of the DeltaStream platform through real-world use cases. In future posts we will show more capabilities of DeltaStream in building, managing, and securing real-time applications and pipelines. In the meantime, if you want to try this yourself, please request a demo.
