23 Jun 2025

Min Read

Stream Smarter, Spend Less: Shift-Left with DeltaStream and Cut Snowflake Costs by 75%

Snowflake is one of the most powerful cloud data platforms available today. But as organizations increasingly rely on it to power business intelligence, AI, and real-time applications, many are discovering a costly tradeoff. Using Snowflake as the primary engine for all ELT (Extract, Load, Transform) processing can quickly balloon cloud spend.

Every Dynamic Table refresh, every intermediate table, and every downstream aggregation adds up—especially when data volumes or update frequencies grow.

This is why a growing number of modern data teams are shifting left to simplify streaming ETL and slash compute and storage bills.

Why Shift Left?

Most pipelines today follow an ELT model: extract raw data, land it in Snowflake, then transform it using tools like Dynamic Tables. While convenient, this pattern introduces hidden inefficiencies. You pay to store intermediate layers—often called Bronze and Silver tables—and rack up warehouse credits with every scheduled refresh. And because those transformations are tied to batch-based triggers, you’re often stuck waiting minutes (or longer) for updated insights.

DeltaStream offers a better way. By shifting left, teams can move from ELT to ETL, transforming data before it hits the warehouse. DeltaStream processes raw files as they land—cleaning, enriching, and aggregating them in motion. With just SQL, teams can build real-time pipelines that send only the final, analytics-ready results—Gold tables—into Snowflake. The result? A leaner, faster, and far more cost-effective architecture.

A Real-World Benchmark Using NYC Taxi Data

To prove the difference, we ran a 24-hour benchmark using NYC Yellow Taxi trip data. We tested two different pipeline strategies: one with Snowflake doing all the transformation work (ELT), and another where DeltaStream handled real-time transformations before the data reached Snowflake (ETL). In the Snowflake path, data landed in raw form and passed through multiple Dynamic Tables to get cleaned and enriched. Each table refreshed every minute, consuming compute resources even when no new data arrived. The result was a full transformation pipeline inside Snowflake—functional, but costly.

In the DeltaStream path, we ingested raw data directly from S3 into Kafka using a simple SQL statement. DeltaStream then joined and enriched the data in real time, skipping intermediate Silver tables altogether. Only the final Gold aggregates were streamed into Snowflake.

Ingest Once, Stream Forever

Both pipelines began by ingesting the same set of raw JSONL files dropped into an S3 bucket.

s3://aurora-demo-deltastream-e2e-s3-bucket/yellow-taxi/
├── yellow_taxi_2023-01.jsonl
├── yellow_taxi_2023-02.jsonl
├── ...
└── taxi_zone_lookup.jsonl

Using DeltaStream, the ingestion process was automatic and serverless. New files were picked up as they landed, schemas were versioned and validated, and no custom Spark jobs or manual scripts were needed. In contrast to traditional batch jobs, the system was truly event-driven: ingest once, and the stream keeps flowing. Once data was ingested, we landed it into a Snowflake Bronze table using Snowpipe Streaming.

CREATE TABLE yt_2023_bronze
WITH (
  'store' = 'snow_bench', 
  'snowflake.db.name' = 'DEMO_DB',
  'snowflake.schema.name' = 'SNOW_BENCH'
)
 
AS SELECT * FROM yellow_taxi_2023;

This stage was identical in both setups and created a consistent starting point. From there, however, the approaches diverged dramatically.

The Snowflake ELT Path: Functional but Expensive

Inside Snowflake, we used two Dynamic Tables to transform and enrich the data. One table cleaned the raw trip data and calculated basic metrics like trip speed. Another joined it with lookup tables to add zone and borough information.

SILVER_TAXI_CLEAN: cleans up trips, calculates mph
SILVER_TAXI_ENRICHED: adds zone and borough names

Because Dynamic Tables refresh on a fixed schedule, these transformations ran every 60 seconds, regardless of whether new data had arrived.

Next, we built three additional Dynamic Tables for analytics: one for 15-minute zone aggregates, one for hourly borough stats, and one for daily top zones. While this delivered useful business insights, the cost of constantly running these transformations was substantial.

The cost profile:

Silver Table Storage: 371 MB
Warehouse Usage for Dynamic Table Refreshes: $17.72/day
Snowpipe Streaming: $0.16

Compute was always on, and storage requirements grew with every new version of the Silver and Gold tables.

The DeltaStream ETL Path: Leaner, Faster, Cheaper

There were no intermediate Silver tables to maintain. No refresh schedules to manage. As soon as a new file landed in S3, DeltaStream loaded it into Kafka, ran the transformation, and streamed only the final Gold aggregates into Snowflake.

CREATE STREAM SILVER_TAXI_ENRICHED 
AS
SELECT *
FROM yellow_taxi_2023 AS c
JOIN yt_lookup_cl AS pu ON c.PULocationID = pu.LocationID
...

The cost profile:

Gold Table Storage: 0.64 MB
Warehouse Usage: $0.17 (initial load only)
Snowpipe Streaming: $0.40

This approach not only simplified the architecture, but eliminated unnecessary compute and storage costs.

Key Results: 75% Cost Savings!

Over 24 hours, we tracked compute and storage costs across both pipelines:

Snowflake storage (Silver tables = 371 MB vs Gold tables = .64MB)
Warehouse usage ($17.72 /day just for refresh vs $.17 for initial Gold table creation)
Snowpipe Streaming usage (.16 for ELT vs .4 for ETL )

Key Difference: 1-Minute Dynamic Table Refresh vs. Real-Time Updates

The Bottomline: The Snowflake ELT path consumed 4x–10x more compute resources than DeltaStream, a cost savings of more than 75%.

Bonus Benefit: DeltaStream Simplifies Streaming with SQL

Perhaps the most surprising outcome? Simplicity. With DeltaStream, you don’t need to learn Java or wrestle with Flink SDKs. You write SQL, just like in Snowflake. There’s no need to manage watermarks, orchestrate batch windows, or worry about how stream processing frameworks handle state. DeltaStream takes care of all of that—giving you clean, governed, and real-time data with far less operational burden.

You get all the power of streaming ETL—without the learning curve.

The Takeaway: Shift Left and Save Big

This benchmark confirms what many data teams are already realizing: doing everything inside Snowflake might be simple, but it’s not always efficient. With DeltaStream, you can reduce compute and storage costs, shrink latency, and streamline your architecture—all while using familiar SQL.

Shift left. Get fresher data. Cut your Snowflake bill. Stream smarter—with DeltaStream.

Want to see how much we can save you on your Snowflake bill?

Contact us for a complimentary stream assessment with DeltaStream CEO Hojjat Jafarpour.
We’ll evaluate your current architecture, identify quick wins, and deliver a custom action plan to reduce costs, simplify pipelines, and accelerate time to insight.

13 May 2025

Min Read

Medical IOT Data with Kafka and Iceberg

Have you ever spent time in the hospital or had biometric tests done? Those devices generate and save a lot of information that should prompt some action or process. We often look for anomalies in alert situations, but there are significantly more uses for that data. You might be wearing a heart monitor to record information for a possible procedure or a sleep study that records your vitals to identify a problem. These devices are constantly improving their ability to collect actionable information and improve our lives.
DeltaStream provides a powerful ability to process data in Apache Kafka topics and write to various destinations, including Apache Iceberg. The AWS Glue and Apache Polaris (incubating) catalogs are supported. For this example, we’ll be using Polaris. If you are unfamiliar with DeltaStream, this short interactive demo will walk you through it. With that out of the way, let’s look at our setup.

Use Case: Aggregating IOT Data to Iceberg

For this example, we’ll use a stream of simulated data containing a timestamp, patient ID, and assorted vital information. The description will be in the solution section. The data stream will be aggregated into a new table written to Iceberg, which will be our source of truth.

Our data is comprised of the following:

Inbound health_data Kafka topic with the following data:
- 100 patients
- Sensor data emitted every 10 minutes for 6 months starting June 1, 2024
- Sensor data fields are randomly generated between normal bounds for the fields
Outbound to Iceberg aggregated patient and heart rate data

Our queries do the following:

Aggregate the hourly average heart rate per patient from Kafka and write to Iceberg
Find the top 3 patients with the highest daily heart rate for each month

Setup and Solution

First, let’s define the data we are working with. Our Kafka topic health_data looks like this:

{
	"key": NULL,
	"value": {
		"event_timestamp": 17260295530,
		"patientId": "Patient-1",
		"heartRate": 74,
		"bodyTemperature": 99.3,
		"bloodPressureSystolic": 94,
		"bloodPressureDiastolic": 68,
		"respiratoryRate": 16,
		"oxygenLevel": 96,
		"bodyPosition": "Standing"
	}
}

That means we need to create a DeltaStream Stream-type Object. We do this with a CSAS statement:

CREATE STREAM health_data (
    event_timestamp BIGINT,
    "patientId" STRING,
    "heartRate" INT,
    "bodyTemperature" DOUBLE,
    "bloodPressureSystolic" INT,
    "bloodPressureDiastolic" INT,
    "respiratoryRate" INT,
    "oxygenLevel" INT,
    "bodyPosition" STRING
)
WITH (
      'starting.position' = 'earliest',	
	'topic' = 'health_metrics_sg',
	'value.format' = 'JSON'
);

Our next step is to create an Iceberg table to hold our hourly average heart rate data. The following CTAS does that for us, and this is where the magic happens. We can aggregate our data in flight before we land it in Iceberg. This dramatically reduces latency, eliminates compute costs for transformation, and reduces storage size. If you are unfamiliar with configuring Iceberg in DeltaStream, this short interactive demo can walk you through it.

CREATE TABLE avg_heart_rate WITH(
      'iceberg.rest.catalog.namespace.name'='sgns',
      'iceberg.rest.catalog.table.name'='avg_heart_rate')
AS SELECT 
   window_start, window_end, "patientId", AVG("heartRate") AS avg_hr
FROM 
   TUMBLE(health_data, SIZE 1 HOUR) 
WITH('timestamp'='event_timestamp', 'starting.position'='earliest')
GROUP BY 
   window_start, window_end, "patientId";

Our resulting data looks like this:

{
	"window_start": "2024-08-22T23:00",
	"window_end": "2024-08-23T00:00",
	"patientId": "Patient-1",
	"avg_hr": 83
}

Now that we have an Iceberg table with our patient ID, the average hourly heart rate, and the time window, we can look at the patients with the highest heart rate in a given month. DeltaStream enables us to immediately perform analytics on the data it just processed from Kafka to Iceberg without leaving the DeltaStream environment or performing any additional configurations – using only familiar SQL.

WITH ranked_heart_rates AS (
    SELECT 
        "patientId",
        avg_hr,
        window_start,
        window_end,
        MONTH(window_start) AS MONTH,
        YEAR(window_start) AS YEAR,
        ROW_NUMBER() OVER (
            PARTITION BY YEAR(window_start), MONTH(window_start)
            ORDER BY avg_hr DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rank
    FROM 
        avg_heart_rate
    WHERE 
        MONTH(window_start) = MONTH(window_end)
        AND YEAR(window_start) = YEAR(window_end)
)
SELECT 
    "patientId",
    avg_hr,
    MONTH,
    YEAR
FROM 
    ranked_heart_rates
WHERE 
    rank <= 3
ORDER BY 
    YEAR, MONTH, rank;

This query can look daunting, but it’s not a lot of code for some powerful results. Let’s break it down.

Step 1: Creating a Common Table Expression (CTE)

The query begins with a CTE named ranked_heart_rates. This temporary result set performs the initial data preparation and ranking.

Step 2: Extracting Time Components

We extract the month and year from the window_start timestamp, creating separate columns for easier grouping and sorting using DeltaStream Temporal Functions:

   MONTH(window_start) AS MONTH,
   YEAR(window_start) AS YEAR,

Step 3: Ranking Records with Window Functions

The heart of this query uses the ROW_NUMBER() window function to assign a rank to each patient’s heart rate within its month-year group:

       ROW_NUMBER() OVER (
            PARTITION BY YEAR(window_start), MONTH(window_start)
            ORDER BY avg_hr DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rank

This function:

PARTITION BY divides the data into month-year segments
ORDER BY avg_hr DESC ranks records with highest heart rates first
The window frame specification ensures proper calculation across the partition

Step 4: Filtering for Valid Time Windows

We only include records where the start and end timestamps fall within the same calendar month and year. Our sample data is all in 2024, but this takes into account crossing the year boundary:

   WHERE 
        MONTH(window_start) = MONTH(window_end)
        AND YEAR(window_start) = YEAR(window_end)

Step 5: Selecting the Top 3 Per Month

Finally, the main query filters the ranked results only to include the top 3 entries per month and orders them chronologically:

SELECT patientId, avg_hr, window_start, MONTH, YEAR 
FROM ranked_heart_rates 
WHERE rank <= 3 
ORDER BY YEAR, MONTH, rank;

Processing Medical Device Data

We just walked through a scenario for processing medical device data coming in via a Kafka topic by consolidating average hourly data per patient and writing it to Iceberg. We then immediately queried the data in Iceberg from DeltaStream to find patients with the highest average heart rate per month. A visualization tool could be used on that same Iceberg data, further taking advantage of open table formats. Notably, we performed a “shift-left” with our processing. We were able to aggregate data in flight. Then we landed that data in Iceberg, thus reducing latency and additional storage and compute costs, seamlessly preparing the data for usage in a single place.

22 Apr 2025

Min Read

Enable “Shift-Left” with Apache Kafka and Iceberg

In the past few years, the Apache Iceberg table format has become the 800-pound gorilla in the data space. DeltaStream supports reading and writing Iceberg using either the AWS Glue catalog or the Apache Polaris (incubating) catalog. This blog walks you through a data scenario in which data in Apache Kafka topics are read, filtered, and enriched with data from another Kafka topic, then written to Iceberg and queried from DeltaStream.

Writing Data Tables to Iceberg

When you sign up for a DeltaStream demo, you’re provided with a demo Kafka cluster called trial_store. If you are unfamiliar with DeltaStream, this short interactive demo walks you through it. For the Iceberg catalog implementation, we’ll be using REST and the Snowflake Polaris implementation, the docs for which can be found here. AWS Glue is also supported, as is any other REST implementation.

For this example, we’ll use two of the topics in that cluster and follow the first part of the Quick Start guide, but we’ll enhance it to write two new tables of processed data to Iceberg.

Our demo includes the following:

Users Kafka topic
Pageviews Kafka topic
Enrich the pageviews with user information
Write to two tables in Iceberg, comprised of:
- Pageviews per city per minute
- Pageviews per user per hour

Our queries do the following:

Top 3 cities with highest pageviews per hour
Top 5 users with the highest pageviews per hour

Streaming Lakehouse with DeltaStream Fusion

First, let’s define the data we are working with. Our Kafka topic pageviews looks like this:

{
	"key": {
		"userid": "User_5"
	},
	"value": {
		"viewtime": 1742335218442,
		"userid": "User_5",
		"pageid": "Page_67"
	}
}

That means we need to create a DeltaStream, Stream-type Object. We do this with a CSAS statement:

CREATE STREAM pageviews (
    viewtime BIGINT, 
    userid VARCHAR, 
    pageid VARCHAR
)WITH (
    'topic'='pageviews', 
    'value.format'='JSON'
);

That now sets up our pageviews topic so you can access it via SQL. Let’s look at the data structure in our users topic. Note that a userid field will tie the two streams together.

{
	"key": {
		"userid": "User_2"
	},
	"value": {
		"registertime": 1742597129390,
		"userid": "User_2",
		"regionid": "Region_9",
		"gender": "OTHER",
		"interests": [
			"News",
			"Movies"
		],
		"contactinfo": {
			"phone": "6503889999",
			"city": "Palo Alto",
			"state": "CA",
			"zipcode": "94301"
		}
	}
}

Next, we define a DeltaStream Changelog object for our users topic named users_log. Note that we’ve created an array from interests and a struct from the contactinfo. Also note that state is a reserved word in DeltaStream, so it is enclosed in quotes. Our command looks like this:

CREATE CHANGELOG users_log (
    registertime BIGINT, 
    userid VARCHAR, 
    regionid VARCHAR, 
    gender VARCHAR, 
    interests ARRAY<VARCHAR>, 
    contactinfo STRUCT<phone VARCHAR, city VARCHAR, "state" VARCHAR, zipcode VARCHAR>, 
    PRIMARY KEY(userid)
)WITH (
    'topic'='users', 
    'key.format'='json', 
    'key.type'='STRUCT<userid VARCHAR>', 
    'value.format'='json'
);

Next, we join our STREAM and CHANGELOG into a new enriched Kafka topic defined as a DeltaStream STREAM object named csas_enriched_pv. This combines user data with the pageviews information that is written to our Iceberg table.

CREATE STREAM csas_enriched_pv AS 
SELECT 
    TO_TIMESTAMP_LTZ(viewtime, 3) AS viewtime,  
    p.userid AS userid, 
    pageid, 
    TO_TIMESTAMP_LTZ(registertime, 3) AS registertime, 
    regionid, 
    gender, 
    interests, 
    Contactinfo -> city as user_city,
    Contactinfo -> "state" as user_state
FROM pageviews p
    JOIN users_log u ON u.userid = p.userid;

Here is what that data looks like for reference:

{
	"key": {
		"userid": "User_5"
	},
	"value": {
		"viewtime": "2025-03-24 21:18:23.526Z",
		"userid": "User_5",
		"pageid": "Page_22",
		"registertime": "2025-03-24 21:18:23.232Z",
		"regionid": "Region_1",
		"gender": "MALE",
		"interests": [
			"News",
			"Movies"
		],
            "user_city": "Irvine",
		"user_state": "CA"
	}
}

Now that the Kafka topic csas_enriched_pv is available, the fun part begins. This creates an Iceberg table with our “Pageviews per state per minute” scenario. If you are unfamiliar with configuring Iceberg in DeltaStream, this short interactive demo can walk you through it.

CREATE TABLE pv_per_city_per_minute WITH(
       'iceberg.rest.catalog.namespace.name'='sgns',
       'iceberg.rest.catalog.table.name'='pv_per_city_per_minute'
)
AS SELECT 
 user_city, 
 count(pageid) AS viewcount, 
 window_start,
 window_end
FROM TUMBLE(csas_enriched_pv, size 1 minutes)
GROUP BY user_city, window_start, window_end;

Let’s break down what is happening here for those unfamiliar:

CREATE TABLE pv_per_city_per_minute: Creates a new table named “pv_per_city_per_minute”
WITH(…): Contains table properties or configurations. Further explanation is below.
SELECT user_city, count(pageid) AS viewcount, window_start, window_end:

Selects the user’s city
Counts the number of page IDs and names this count “viewcount”
Includes the start and end times of each window interval

TUMBLE(csas_enriched_pv, size 1 minutes): Applies a tumbling window function to the “csas_enriched_pv” table with a window size of 1 minute. A tumbling window divides data into non-overlapping, fixed-size time intervals.
GROUP BY user_city, window_start, window_end: Groups the results by state and time window, so the count is calculated separately for each state within each one-minute interval

One possible initial confusion is where that ‘iceberg.rest.catalog.namespace.name’ value comes from. Iceberg supports an arbitrary number of namespaces to organize tables in a catalog. Our data store that maps to the Iceberg catalog is named sgir; by right-clicking on it we get a dialog that includes add, which prompts for a name. This becomes your new namespace; in our case, it is sgns.

To further explain the WITH parameters, the ‘iceberg.rest.catalog.table.name’ specifies the same value as the CREATE TABLE. It can be a different value to alias the table name you’ll be using in your queries to the table in Iceberg – for example, if you want to shorten the name or organize your query in a particular fashion. In our case, though, we’re keeping the values the same.
With that query running, let’s launch our second query to generate our table for “Pageviews per user per hour.” What we’re doing here is very similar, but we’ve changed our TUMBLE window to 1 hour and are using userid instead of user_city.

CREATE TABLE pv_per_user_per_hour WITH(
       'iceberg.rest.catalog.namespace.name'='sgns',
       'iceberg.rest.catalog.table.name'='pv_per_user_per_hour'
)
AS SELECT 
 userid, 
 COUNT(pageid) AS viewcount, 
 window_start,
 window_end
FROM TUMBLE(csas_enriched_pv, SIZE 1 HOUR)
GROUP BY userid, window_start, window_end;

Without leaving DeltaStream, we can query the Iceberg table for our “Pageviews per city per minute” scenario directly in the same workspace and get results without moving to another tool.

SELECT user_city, HOUR(window_start) AS hour_value, SUM(viewcount) AS total             
FROM pv_per_city_per_minute             
GROUP BY user_city, HOUR(window_start)             
ORDER BY total DESC             
LIMIT 3;

Next, we perform our second query to get the results from our Iceberg table for our top 5 users per hour.

SELECT userid, window_start, SUM(viewcount) AS total             
FROM pv_per_user_per_hour             
GROUP BY userid, window_start             
ORDER BY total DESC             
LIMIT 5;

Like our first scenario, our second scenario, “Pageviews per user per hour, ” is all within DeltaStream. Since this data is now in Iceberg, you can also use any other compatible compute engine or BI tool without any lock-in.

Shift-left and So Much More

We’ve just walked through a classic “shift-left” scenario where we have moved enrichment and filtering to the streaming architecture and the popular Iceberg table format on AWS, making it available to query from any compatible compute engine. We’ve reduced latency by not waiting for a batch process to move the data through a medallion architecture, and we’ve reduced costs by eliminating that transformation compute cost and additional storage. We can even query those Iceberg tables from DeltaStream to do additional joins and enhancements and write them back out to Iceberg if we want. This is just the tip of the, um, Iceberg when it comes to what is possible with DeltaStream.

11 Mar 2025

Min Read

Real-time Anomaly Detection with Sensor Data

Real-time sensor data monitoring is critical for industries ranging from manufacturing to IoT-enabled vehicle sensors for transport. Identifying anomalies, unexpected deviations that could signal equipment failure, environmental hazards, or system inefficiencies, becomes a key challenge when dealing with a stream of sensor readings flowing through a Kafka topic. This blog dives into the process of performing anomaly detection on a set of sensors publishing to a Kafka topic, guiding you through the steps to harness streaming data, apply detection techniques, and ensure timely insights. Whether you’re a data engineer or a curious developer, you’ll discover practical methods to spot the unusual in the constant hum of sensor activity.

Use Case: Missing Data as an Anomaly

For our scenario, we will work with 100 sensors that send a heartbeat every 5 minutes. The sensor is considered down if a heartbeat isn’t detected after 15 minutes. There are two scenarios this leads to: either the sensor is faulty and is sending heartbeats after the window, or it stops sending a heartbeat entirely. To simulate this data, I wrote a Python program with the following parameters:

100 sensors comprised of:
- UUID as a sensor ID
- Unix epoch timestamp
- Generate sensor records for 100 days starting from November 1, 2024
Each sensor will start at random times over 48 hours from the start date
10% of the sensors will fail to generate a timestamp for a random amount of time between 15 and 120 minutes on a random number of days
2% of the sensors will stop sending a timestamp between 15 and 60 days from the first timestamp.

Here is the Python code if you want to play with it yourself in a free DeltaStream trial; you just need a Kafka cluster of your own available:

Our resulting JSON file looks like this:

import json
import time
import random
import uuid
import datetime
from kafka import KafkaProducer

# Kafka Configuration
KAFKA_BROKER = “your bootstrap server”
TOPIC = “sensor_data”

# Kafka Producer Setup

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BROKER,
    security_protocol=’SASL_SSL’,
    sasl_mechanism=’SCRAM-SHA-512′,
    sasl_plain_username=’your username’,
    sasl_plain_password=’your password’,
    value_serializer=lambda v: json.dumps(v).encode(“utf-8”),
    key_serializer=lambda k: k.encode(“utf-8″) if k is not None else None,
)

# Simulation Constants
SENSOR_COUNT = 100
TEMP_FAILURE_PERCENTAGE = 0.10  # 10% of sensors fail randomly
PERM_FAILURE_PERCENTAGE = 0.02  # 2% of sensors fail permanently
INTERVAL_SECONDS = 5 * 60  # 5 minutes
TEMP_FAILURE_MIN = 15 * 60  # 15 minutes
TEMP_FAILURE_MAX = 120 * 60  # 120 minutes
PERM_FAILURE_MIN_DAYS = 15  # 15 days
PERM_FAILURE_MAX_DAYS = 60  # 60 days
START_DATE = datetime.datetime(2024, 11, 1, 0, 0, 0)
SIMULATION_DAYS = 100
SIMULATION_DURATION = SIMULATION_DAYS * 24 * 60 * 60  # 100 days in seconds
RANDOM_START_WINDOW = 48 * 60 * 60  # 48 hours in seconds

# Generate 100 Unique Sensor IDs
sensors = [str(uuid.uuid4()) for _ in range(SENSOR_COUNT)]

# Select 10% of sensors for temporary failures
temp_fail_sensors = set(random.sample(sensors, int(SENSOR_COUNT * TEMP_FAILURE_PERCENTAGE)))

# Select 2% of sensors for permanent failures
perm_fail_sensors = set(random.sample(sensors, int(SENSOR_COUNT * PERM_FAILURE_PERCENTAGE)))

# Assign each sensor a random start time within the first 48 hours
sensor_start_times = {
    sensor: START_DATE.timestamp() + random.randint(0, RANDOM_START_WINDOW)
    for sensor in sensors
}

# Generate failure windows (15-120 min) for temporary failures
temp_fail_windows = {}
for sensor in temp_fail_sensors:
    num_failures = random.randint(3, 10)  # Number of random failure periods
    failure_days = random.sample(range(1, SIMULATION_DAYS + 1), num_failures)

temp_fail_windows[sensor] = [
        (
            START_DATE.timestamp() + (day * 24 * 60 * 60) + random.randint(0, 24 * 60 * 60),
            START_DATE.timestamp() + (day * 24 * 60 * 60) + random.randint(0, 24 * 60 * 60) + random.randint(TEMP_FAILURE_MIN, TEMP_FAILURE_MAX)
        )
        for day in failure_days
    ]

# Assign permanent failure times (between 15 and 60 days from start)
perm_fail_times = {
    sensor: sensor_start_times[sensor] + random.randint(PERM_FAILURE_MIN_DAYS * 24 * 60 * 60, PERM_FAILURE_MAX_DAYS * 24 * 60 * 60)
    for sensor in perm_fail_sensors
}

# Track permanently failed sensors
perm_failed_sensors = set()

# Track last failure timestamp for each sensor (prevents duplicate logs)
last_fail_timestamp = {sensor: None for sensor in temp_fail_sensors}

# Start simulation at the given start date
current_time = START_DATE.timestamp()

while current_time = start_time:
            # Check if the sensor has permanently failed
            if sensor in perm_fail_sensors and current_time >= perm_fail_times[sensor]:
                if sensor not in perm_failed_sensors:
                    print(f”❌ Sensor {sensor} permanently stopped at {datetime.datetime.utcfromtimestamp(perm_fail_times[sensor])}”)
                    perm_failed_sensors.add(sensor)
                continue  # Sensor is permanently down

# Check if the sensor is temporarily failing
            if sensor in temp_fail_sensors:
                for fail_start, fail_end in temp_fail_windows[sensor]:
                    if fail_start <= current_time <= fail_end:
                        if last_fail_timestamp[sensor] != fail_start:
                            print(f"⚠️ Sensor {sensor} temporarily down from {datetime.datetime.utcfromtimestamp(fail_start)} to {datetime.datetime.utcfromtimestamp(fail_end)}")
                            last_fail_timestamp[sensor] = fail_start  # Prevent duplicate logs
                        continue  # Skip sending data

# Normal sensor data (only sensor_id and timestamp)
            data = {
                "sensor_id": sensor,
                "time_stamp": int(current_time)  # Use integer UNIX timestamp for clarity
            }

# Send data to Kafka
            producer.send(TOPIC, value=data, key=sensor)

# Move simulation time forward by 5 minutes
    current_time += INTERVAL_SECONDS

# Close producer after all data is sent
producer.close()

import json
import time
import random
import uuid
import datetime
from kafka import KafkaProducer
 
# Kafka Configuration
KAFKA_BROKER = "your bootstrap server"
TOPIC = "sensor_data"
 
# Kafka Producer Setup
 
producer = KafkaProducer(
    bootstrap_servers=KAFKA_BROKER,
    security_protocol='SASL_SSL',
    sasl_mechanism='SCRAM-SHA-512',
    sasl_plain_username='your username',
    sasl_plain_password='your password',
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8") if k is not None else None,
)
 
# Simulation Constants
SENSOR_COUNT = 100
TEMP_FAILURE_PERCENTAGE = 0.10  # 10% of sensors fail randomly
PERM_FAILURE_PERCENTAGE = 0.02  # 2% of sensors fail permanently
INTERVAL_SECONDS = 5 * 60  # 5 minutes
TEMP_FAILURE_MIN = 15 * 60  # 15 minutes
TEMP_FAILURE_MAX = 120 * 60  # 120 minutes
PERM_FAILURE_MIN_DAYS = 15  # 15 days
PERM_FAILURE_MAX_DAYS = 60  # 60 days
START_DATE = datetime.datetime(2024, 11, 1, 0, 0, 0)
SIMULATION_DAYS = 100
SIMULATION_DURATION = SIMULATION_DAYS * 24 * 60 * 60  # 100 days in seconds
RANDOM_START_WINDOW = 48 * 60 * 60  # 48 hours in seconds
 
# Generate 100 Unique Sensor IDs
sensors = [str(uuid.uuid4()) for _ in range(SENSOR_COUNT)]
 
# Select 10% of sensors for temporary failures
temp_fail_sensors = set(random.sample(sensors, int(SENSOR_COUNT * TEMP_FAILURE_PERCENTAGE)))
 
# Select 2% of sensors for permanent failures
perm_fail_sensors = set(random.sample(sensors, int(SENSOR_COUNT * PERM_FAILURE_PERCENTAGE)))
 
# Assign each sensor a random start time within the first 48 hours
sensor_start_times = {
    sensor: START_DATE.timestamp() + random.randint(0, RANDOM_START_WINDOW)
    for sensor in sensors
}
 
# Generate failure windows (15-120 min) for temporary failures
temp_fail_windows = {}
for sensor in temp_fail_sensors:
    num_failures = random.randint(3, 10)  # Number of random failure periods
    failure_days = random.sample(range(1, SIMULATION_DAYS + 1), num_failures)
 
    temp_fail_windows[sensor] = [
        (
            START_DATE.timestamp() + (day * 24 * 60 * 60) + random.randint(0, 24 * 60 * 60),
            START_DATE.timestamp() + (day * 24 * 60 * 60) + random.randint(0, 24 * 60 * 60) + random.randint(TEMP_FAILURE_MIN, TEMP_FAILURE_MAX)
        )
        for day in failure_days
    ]
 
# Assign permanent failure times (between 15 and 60 days from start)
perm_fail_times = {
    sensor: sensor_start_times[sensor] + random.randint(PERM_FAILURE_MIN_DAYS * 24 * 60 * 60, PERM_FAILURE_MAX_DAYS * 24 * 60 * 60)
    for sensor in perm_fail_sensors
}
 
# Track permanently failed sensors
perm_failed_sensors = set()
 
# Track last failure timestamp for each sensor (prevents duplicate logs)
last_fail_timestamp = {sensor: None for sensor in temp_fail_sensors}
 
# Start simulation at the given start date
current_time = START_DATE.timestamp()
 
while current_time <= START_DATE.timestamp() + SIMULATION_DURATION:
    for sensor in sensors:
        start_time = sensor_start_times[sensor]
 
        if current_time >= start_time:
            # Check if the sensor has permanently failed
            if sensor in perm_fail_sensors and current_time >= perm_fail_times[sensor]:
                if sensor not in perm_failed_sensors:
                    print(f"❌ Sensor {sensor} permanently stopped at {datetime.datetime.utcfromtimestamp(perm_fail_times[sensor])}")
                    perm_failed_sensors.add(sensor)
                continue  # Sensor is permanently down
 
            # Check if the sensor is temporarily failing
            if sensor in temp_fail_sensors:
                for fail_start, fail_end in temp_fail_windows[sensor]:
                    if fail_start <= current_time <= fail_end:
                        if last_fail_timestamp[sensor] != fail_start:
                            print(f"⚠️ Sensor {sensor} temporarily down from {datetime.datetime.utcfromtimestamp(fail_start)} to {datetime.datetime.utcfromtimestamp(fail_end)}")
                            last_fail_timestamp[sensor] = fail_start  # Prevent duplicate logs
                        continue  # Skip sending data
 
            # Normal sensor data (only sensor_id and timestamp)
            data = {
                "sensor_id": sensor,
                "time_stamp": int(current_time)  # Use integer UNIX timestamp for clarity
            }
 
            # Send data to Kafka
            producer.send(TOPIC, value=data, key=sensor)
 
    # Move simulation time forward by 5 minutes
    current_time += INTERVAL_SECONDS
 
# Close producer after all data is sent
producer.close()

{
	"key": {
		"key": "268aded0-0cd4-49ce-ab49-cfc123fdabf9"
	},
	"value": {
		"sensor_id": "268aded0-0cd4-49ce-ab49-cfc123fdabf9",
		"time_stamp": 1730447400
	}
}

Solution: Detecting failed Sensors

All of the SELECT statements are displayed in the following screenshot, as well as some of the results of glitching sensors and the number of minutes they failed to send their heartbeat.

Let’s break down each of the statements and explain what we’re doing:
This first statement creates a DeltaStream Object of type STREAM that defines the layout of the sensor_data topic in our Kafka cluster. This gives us an entity that we can perform actions on against the topic.

CREATE STREAM sensor_data (
              sensor_id VARCHAR, time_stamp BIGINT ) 
WITH ('topic' = 'sensor_data', 'value.format' = 'json');

This part has the secret sauce to making this work: the ds_lag_bigint function. It is a built-in function to access the value of a column from a previous row within the same relation. This is particularly useful for scenarios where you need to compare or compute differences between rows. In our case, we use it to determine the time interval between consecutive sensor events. The function requires specifying the column from which to retrieve values and the number of rows from which to look back. For example, ds_lag_bigint(time_stamp, 1) returns the time_stamp from the previous row, allowing us to calculate the time difference between consecutive sensor readings and thus determine if the difference is outside our acceptable bound of 15 minutes.
Other than that, we’re creating the stream topic with a retention of 72 hours (in milliseconds) to ensure the data lasts long enough to run my tests. Finally, the FROM clause forces the SELECT to read from the beginning of the topic, and what field is the timestamp.

-- This calculates the diff between current time_stamp and previous time_stamp per sensor
CREATE STREAM sensor_history WITH ('kafka.topic.retention.ms' = '172800000') AS
SELECT
  sensor_id,
  time_stamp,
  ds_lag_bigint (time_stamp, 1) OVER (
    partition by sensor_id
    ORDER BY time_stamp ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
  ) AS prev_ts
FROM
  sensor_data WITH ('starting.position' = 'earliest', 'timestamp' = 'time_stamp');

Next, we create a new Kafka Topic, bad_sensors, for any record in sensor_history with two timestamps greater than 15 minutes. This will be our list of sensors with anomalies that need to be researched. If they go online and offline outside of our acceptable range, they are likely failing, or something else needs to be investigated.

-- This computes the diff between current and prev time_stamp values and picks the bad ones
CREATE STREAM bad_sensors
WITH
  ('kafka.topic.retention.ms' = '172800000') AS
SELECT
  sensor_id,
  time_stamp,
  prev_ts,
  (time_stamp - prev_ts) AS diff
FROM
  sensor_history
WITH
  ('starting.position' = 'earliest')
WHERE
  prev_ts IS NOT NULL
  AND (time_stamp - prev_ts) > (15 * 60);

Finally, let’s take a look at the data in bad_sensors. This formats out the current and previous heartbeat timestamps and the difference between them in minutes.

SELECT 
  sensor_id, 
  FROM_UNIXTIME(prev_ts) AS fmt_prev, 
  FROM_UNIXTIME(time_stamp) AS fmt_time, 
  (time_stamp - prev_ts) AS diff
FROM 
  bad_sensors 
WITH 
  ('starting.position' = 'earliest');

In a production environment, it would make sense for the bad_sensors data to be fed into a location that populates a dashboard where decisions can be made in near real-time. This could be something like Iceberg, Clickhouse, Databricks, Snowflake, or whatever you use.

Summary

This blog explores anomaly detection for real-time sensor data on a Kafka topic, using a 100-sensor setup with heartbeats every 5 minutes. We define anomalies as misses exceeding 15 minutes, simulate 100 days of data with Python, and apply DeltaStream to process it, racking history, calculating gaps with ds_lag_bigint, and pinpointing failures via SQL. This approach is streamlined for engineers and easily adaptable to production dashboards.

17 Dec 2024

Min Read

Enhancing Fraud Detection with PuppyGraph and DeltaStream

The banking and finance industry has been one of the biggest beneficiaries of digital advancements. Many technological innovations find practical applications in finance, providing convenience and efficiency that can set institutions apart in a competitive market. However, this ease and accessibility have also led to increased fraud, particularly in credit card transactions, which remain a growing concern for consumers and financial institutions.

Traditional fraud detection systems rely on rule-based methods that struggle in real-time scenarios. These outdated approaches are often reactive, identifying fraud only after it occurs. Without real-time capabilities or advanced reasoning, they fail to match fraudsters’ rapidly evolving tactics. A more proactive and sophisticated solution is essential to combat this threat effectively.

This is where graph analytics and real-time stream processing come into play. Combining PuppyGraph, the first and only graph query engine, with DeltaStream, a stream processing engine powered by Apache Flink, enables institutions to improve fraud detection accuracy and efficiency, including real-time capabilities. In this blog post, we’ll explore the challenges of modern fraud detection and the advantages of using graph analytics and real-time processing. We will also provide a step-by-step guide to building a fraud detection system with PuppyGraph and DeltaStream.

Let’s start by examining the challenges of modern fraud detection.

Common Fraud Detection Challenges

Credit card fraud has always been a game of cat and mouse. Even before the rise of digital processing and online transactions, fraudsters found ways to exploit vulnerabilities. With the widespread adoption of technology, fraud has only intensified, creating a constantly evolving fraud landscape that is increasingly difficult to navigate. Key challenges in modern fraud detection include:

Volume: Daily credit card transactions are too vast to review and identify suspicious activity manually. Automation is critical to sorting through all that data and identifying anomalies.
Complexities: Fraudulent activity often involves complex patterns and relationships that traditional rule-based systems can’t detect. For example, fraudsters may use stolen credit card information to make a series of small transactions before a large one or use multiple cards in different locations in a short period.
Real-time: The sooner fraud is detected, the less financial loss there will be. Real-time analysis is crucial in detecting and preventing transactions as they happen, especially when fraud can be committed at scale in seconds.
Agility: Fraudsters will adapt to new security measures. Fraud detection systems must be agile, even learning as they go, to keep up with the evolving threats and tactics.
False positives: While catching fraudulent transactions is essential, it’s equally important to avoid flagging legitimate transactions as fraud. False positives can frustrate customers, especially when a card is automatically locked out due to legitimate purchases. As a consequence, they can adversely affect revenue.

To tackle these challenges, businesses require a solution that processes large volumes of data in real-time, identifies complex patterns, and evolves with new fraud tactics. Graph analytics and real-time stream processing are essential components of such a system. By mapping and analyzing transaction networks, businesses can more effectively detect anomalies in customer behavior and identify potentially fraudulent transactions.

Leveraging Graph Analytics for Fraud Detection

Traditional fraud detection methods analyze individual transactions in isolation. This can miss connections and patterns that emerge when we examine the bigger picture. Graph analytics allows us to visualize and analyze transactions as a network of connected things.

Think of it like a social network. Each customer, credit card, merchant, and device becomes a node in the graph, and each transaction connects those nodes. We can find hidden patterns and anomalies that indicate fraud by looking at the relationships between nodes.

Figure: an example schema for fraud detection use case

Here’s how graph analytics can be applied to fraud detection:

Finding suspicious connections: Graph algorithms can discover unusual patterns of connections between entities. For example, if the same person uses multiple credit cards in different locations in a short period or a single card is used to buy from a group of merchants known for fraud, those connections will appear in the graph and be flagged as suspicious.
Uncovering fraud rings: Fraudsters often work within the same circles, using multiple identities and accounts to carry out scams. Graph analytics can find those complex networks of people and their connections, helping to identify and potentially break up entire fraud rings.
Surfacing identity theft: When a stolen credit card is used, the spending patterns will generally be quite different from the cardholder’s normal behavior. By looking at the historical and current transactions within a graph, you can see sudden changes in spending habits, locations, and types of purchases that may indicate identity theft.
Predicting future fraud: Graph analytics can predict future fraud by looking at historical data and the patterns that precede a fraudulent transaction. By predicting fraud before it happens, businesses can take action to prevent it.

Of course, all of these benefits are extremely helpful. However, the biggest hurdle to realizing them is the complexity of implementing a graph database. Let’s look at some of those challenges and how PuppyGraph can help users avoid them entirely.

Challenges of Implementing and Running Graph Databases

As shown, graph databases can be an excellent tool for fraud detection. So why aren’t they used more frequently? This usually boils down to implementing and managing them, which can be complex for those unfamiliar with the technology. The hurdles that come with implementing a graph database can far outweigh the benefits for some businesses, even stopping them from adopting this technology altogether. Here are some of the issues generally faced by companies implementing graph databases:

Cost: Traditional relational databases have been the norm for decades, and many organizations have invested heavily in their infrastructure. Switching to a graph database or even running a proof of concept requires a significant upfront investment in new software, hardware, and training.
Implementing ETL: Extracting, transforming, and loading (ETL) data into a graph database can be tricky and time-consuming. Data needs to be restructured to fit into a graph model, which requires knowledge of the underlying data to be moved over and how to represent these entities and relationships within a graph model. This requires specific skills and adds to the implementation time and cost, meaning the benefits may be delayed.
Bridging the skills gap: Graph databases require a different data modeling and querying approach from traditional databases. In addition to the previous point regarding ETL, finding people with the skills to manage, maintain, and query the data within a graph database can also be challenging. Without these skills, graph technology adoption is mostly dead in the water.
Integration challenges: Integrating a graph database with existing systems and applications is complex. This usually involves taking the output from graph queries and mapping them into downstream systems, which requires careful planning and execution. Getting data to flow smoothly and be compatible with different systems is significant.

These challenges highlight the need for solutions that make graph database adoption and management more accessible. A graph query engine like PuppyGraph addresses these issues by enabling teams to integrate their data and query it as a graph in minutes without the complexity of ETL processes or the need to set up a traditional graph database. Let’s look at how PuppyGraph helps teams become graph-enabled without ETL or the need for a graph database.

How PuppyGraph Solves Graph Database Challenges

PuppyGraph is built to tackle the challenges that often hinder graph database adoption. By rethinking graph analytics, PuppyGraph removes many entry barriers, opening up graph capabilities to more teams than otherwise possible. Here’s how PuppyGraph addresses many of the hurdles mentioned above:

Zero-ETL: One of PuppyGraph’s most significant advantages is connecting directly to your existing data warehouses and data lakes—no more complex and time-consuming ETL. There is no need to restructure data or create separate graph databases. Simply connect the graph query engine directly to your SQL data store and start querying your data as a graph in minutes.
Cost: PuppyGraph reduces the expenses of graph analytics by using your existing data infrastructure. There is no need to invest in new database infrastructure or software and no ongoing maintenance costs of traditional graph databases. Eliminating the ETL process significantly reduces the engineering effort required to build and maintain fragile data pipelines, saving time and resources.
Reduced learning curve: Traditional graph databases often require users to master complex graph query languages for every operation, including basic data manipulation. PuppyGraph simplifies this by functioning as a graph query engine that operates alongside your existing SQL query engine using the same data. You can continue using familiar SQL tools for data preparation, aggregation, and management. When more complex queries suited to graph analytics arise, PuppyGraph handles them seamlessly. This approach saves time and allows teams to reserve graph query languages specifically for graph traversal tasks, reducing the learning curve and broadening access to graph analytics.
Multi-query language support: Engineers can continue to use their existing SQL skills and platform, allowing them to leverage graph querying when needed. The platform offers many ways to build graph queries, including Gremlin and Cypher support, so your existing team can quickly adopt and use graph technology.
Effortless scaling: PuppyGraph’s architecture separates compute and storage so it can easily handle petabytes of data. By leveraging their underlying SQL storage, teams can effortlessly scale their compute as required. You can focus on extracting value from your data without scaling headaches.
Fast deployment: With PuppyGraph, you can deploy and start querying your data as a graph in 10 minutes. There are no long setup processes or complex configurations. Fast deployment means you can start seeing the benefits of graph analytics and speed up your fraud detection.

In short, PuppyGraph removes the traditional barriers to graph adoption so more institutions can use graph analytics for fraud detection use cases. By simplifying, reducing costs, and empowering existing teams with effortless graph adoption, PuppyGraph makes graph technology accessible for all teams and organizations.

Real-Time Fraud Prevention with DeltaStream

Speed is key in the fight against fraud, and responsiveness is crucial to preventing or minimizing the impact of an attack. Systems and processes that act on events with minimal latency can mean the difference between successful and unsuccessful cyber attacks. DeltaStream empowers businesses to analyze and respond to suspicious transactions in real-time, minimizing losses and preventing further damage.

Why Real-Time Matters:

Immediate Response: Rapid incident response means security and data teams can detect, isolate, and trigger mitigation protocols, minimizing their vulnerability window faster than ever. With real-time data and sub-second latency, the Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR) can be significantly reduced.
Proactive Prevention: Data and security teams can identify behavior patterns as they emerge and implement mitigation tactics. Real-time allows for continuous monitoring of system health and security with predictive models.
Improved Accuracy: Real-time data provides a more accurate view of customer behavior for precise detection. Threats are more complex than ever and often involve multi-stage attack patterns; streaming data aids in identifying these complex and ever-evolving threat tactics.

DeltaStream’s Key Features:

Speed: Increase the speed of your data processing and your team’s ability to create data applications. Reduce latency and cost by shifting your data transformations out of your warehouse and into DeltaStream. Data teams can also quickly write queries in SQL to create analytics pipelines with no other complex languages to learn.
Team Focus: Eliminate maintenance tasks with our continually optimizing Flink operator. Your team isn’t focused on infrastructure, meaning they can focus on building and strengthening pipelines.
Unified View: An organization’s data rarely comes from just one source. Process streaming data from multiple sources in real-time to get a complete picture of activities. This means transaction data, user behavior, and other relevant signals can be analyzed together as they occur.

By combining PuppyGraph’s graph analytics with DeltaStream’s real-time processing, businesses can create a dynamic fraud detection system that stays ahead of evolving threats.

Step-by-Step tutorial: DeltaStream and PuppyGraph

In this tutorial, we go through the high-level steps of integrating DeltaStream and PuppyGraph.

The detailed steps are available at:

Starting a Kafka Cluster

We start a Kafka Server as the data input. (Later in the tutorial, we’ll send financial data through Kafka.)

We create topics for financial data like this:

bin/kafka-topics.sh --create --topic kafka-Account --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

Setting up DeltaStream

Connecting to Kafka

Log in to the Deltastream console. Then, navigate to Resources and add a Kafka Store – for example, kafka_demo – with the Kafka Cluster parameters we created in the previous step.

Next, in the Workspace, create a deltastream database – for example: kafka_db
After that, we use DeltaStream SQL to create streams for the Kafka topics we created in the previous step. The stream describes the topic’s physical layout so it can be easily referenced with SQL. Here is an example of one of the streams we create in DeltaStream for a Kafka topic. Once we declare the streams, we can build streaming data pipelines to transform, enrich, aggregate, and prepare streaming data for analysis in PuppyGraph. First, we’ll define the account_stream from the kafka-Account topic.

CREATE STREAM account_stream (
  "label" STRING,
  "accountId" BIGINT,
  "createTime" STRING,
  "isBlocked" BOOLEAN,
  "accoutType" STRING,
  "nickname" STRING,
  "phonenum" STRING,
  "email" STRING,
  "freqLoginType" STRING,
  "lastLoginTime" STRING,
  "accountLevel" STRING
) WITH (
  'topic' = 'kafka-Account',
  'value.format' = 'JSON'
);

Next, we’ll define the accountrepayloan_stream from the kafka-AccountRepayLoan topic:

CREATE STREAM accountrepayloan_stream (
  "label" STRING,
  "accountrepayloandid" BIGINT,
  "loanId" BIGINT,
  "amount" DOUBLE,
  "createTime" STRING
) WITH (
  'topic' = 'kafka-AccountRepayLoan',
  'value.format' = 'JSON'
);

And finally, we’ll show the accounttransferaccount_stream from the kafka-AccountTransferAccount. You’ll note there is both a fromid and toid that will like to the loanId. This allows us to enrich data in the account payment stream with account information from the account_stream and combine it with the account transfer stream.

With DeltaStream, this can then easily be written out as a more succinct and enriched stream of data to our destination, such as Snowflake or Databricks. We combine data from three streams with just the information we want, preparing the data in real-time from multiple streaming sources, which we then graph using PuppyGraph.

CREATE STREAM accounttransferaccount_stream (
  "label" VARCHAR,
  "accounttransferaccountid", BIGINT,
  "fromd" BIGINT,
  "toid" BIGINT,
  "amount" DOUBLE,
  "createTime" STRING,
  "ordernum" BIGINT,
  "comment" VARCHAR,
  "paytype" VARCHAR,
  "goodstype" VARCHAR
) WITH (
  'topic' = 'kafka-AccountTransferAccount',
  'value.format' = 'JSON'
);

Adding a Store for Integration

PuppyGraph will connect to the stores and allow querying as a graph.

Once our data is ready in the desired format, we can write streaming SQL queries in DeltaStream to write data continuously in the desired storage. In this case, we can use DeltaStream’s native integration with Snowflake or Databricks, where we will use PoppyGraph. Here is an example of writing data continuously into a table in Snowflake or Databricks from DeltaStream:

CREATE TABLE ds_account
WITH
(
'store' = '<store_name>'
<Storage parameters>
) AS
SELECT * FROM account_stream;

For Databricks integration, refer to the Databricks integration documentation for detailed steps.
For Snowflake integration, refer to Snowflake integration documentation for detailed steps.

Starting data processing

Now, you can start a Kafka Producer to send the financial JSON data to Kafka. For example, to send account data, run:

kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-Account < json_data/Account.json

DeltaStream will process the data, and then we will query it as a graph.

Query your data as a graph

You can start PuppyGraph using Docker. Then upload the Graph schema, and that’s it! You can now query the financial data as a graph as DeltaStream processes it.

Start PuppyGraph using the following command:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
-e DATAACCESS_DATA_CACHE_STRATEGY=adaptive \
-e <STORAGE PARAMETERS> \
--name puppy --rm -itd puppygraph/puppygraph:stable

Log into the PuppyGraph Web UI at http://localhost:8081 with the following credentials:

Username: puppygraph

Password: puppygraph123

Upload the schema:Select the file schema_<storage>.json in the Upload Graph Schema JSON section and click Upload.

Navigate to the Query panel on the left side. The Gremlin Query tab offers an interactive environment for querying the graph using Gremlin. For example, to query the accounts owned by a specific company and the transaction records of these accounts, you can run:

g.V("Company[237]")
  .outE('CompanyOwnAccount').inV()
  .outE('AccountTransferAccount').inV()
  .path()

Conclusion

As this blog post explores, traditional fraud detection methods simply can’t keep pace with today’s sophisticated criminals. Real-time analysis and the ability to identify complex patterns are critical. By combining the power of graph analytics with real-time stream processing, businesses can gain a significant advantage against fraudsters.

PuppyGraph and DeltaStream offer robust and accessible solutions for building real-time dynamic fraud detection systems. We’ve seen how PuppyGraph unlocks hidden relationships and how DeltaStream analyzes real-time data to quickly and accurately identify and prevent fraudulent activity. Ready to take control and build a future-proof, graph-enabled fraud detection system? Try PuppyGraph and DeltaStream today. Visit PuppyGraph and DeltaStream to get started!

06 Jun 2024

Min Read

Real-time Airline Data Pipeline for on time Flight Status with SQL

Air travel is one of the most popular modes of transportation, but it doesn’t come without the risk of flight delays from weather, technical malfunctions, or other reasons. These delays, while most of the time is out of the airline’s control, can be frustrating for passengers who only wish for stress-free travel. Accurate and timely communication of flight delays, gate changes, or other status updates are essential for maintaining customer satisfaction. To achieve this, airlines need a robust data platform that can handle flight status updates in real-time.

In this post, we will show off how DeltaStream can be used to easily set up streaming pipelines that process real-time airline data. As a fully managed service powered by Apache Flink, users are given the capabilities of Flink without needing to deal with the complexities of running and scaling Flink jobs. In fact, Flink is just an implementation detail, as users can simply write SQL queries to set up their stream processing pipelines. For our example, we will transform flight status to create an always up-to-date view of current flight statuses.

Raw Source Data for Airline Flight Status

For our example, we have 2 raw source streams:

flights: flight information, including scheduled departure and arrival times.
flight_updates: updates to flight status, including a new departure time.

In DeltaStream, we can create Relations to represent these data:

CREATE CHANGELOG flights (
  flight_id VARCHAR, 
  event_ts TIMESTAMP, 
  origin VARCHAR, 
  destination VARCHAR, 
  scheduled_dep TIMESTAMP, 
  scheduled_arr TIMESTAMP, 
  PRIMARY KEY (flight_id)
) WITH (
  'topic' = 'flights', 
  'value.format' = 'json', 
  'timestamp' = 'event_ts'
);

CREATE STREAM flight_updates (
  flight_id VARCHAR, 
  event_ts TIMESTAMP, 
  updated_departure TIMESTAMP, 
  "description" VARCHAR
) WITH (
  'topic' = 'flight_updates',
  'value.format' = 'json', 
  'timestamp' = 'event_ts'
);

Use Case: Create an Always Up-to-Date View of Current Flight Status

In order to get the latest complete flight status information, we need to enrich our flight_updates Stream by joining it with our flights Changelog. The result of this join will be a stream of flight updates which include the latest departure and arrival times. We can achieve this with the following query:

CREATE STREAM enriched_flight_updates AS 
SELECT 
  u.flight_id, 
  u.event_ts, 
  f.origin, 
  f.destination, 
  (DS_TOEPOCH(u.updated_departure) - DS_TOEPOCH(f.scheduled_dep)) / 60000 AS mins_delayed, 
  u.updated_departure AS current_departure, 
  CAST(TO_TIMESTAMP_LTZ(
      (
        DS_TOEPOCH(f.scheduled_arr) + 
        (DS_TOEPOCH(u.updated_departure) - DS_TOEPOCH(f.scheduled_dep))
      ), 3) AS TIMESTAMP
  ) AS current_arrival, 
  u."description" 
FROM 
  flight_updates u 
  JOIN flights f ON u.flight_id = f.flight_id;

We want to eventually materialize these update statuses into a view, using flight_id as the primary key. So, we can define a Changelog that is backed by the same data as the enriched_flight_updates Stream. Notice in the WITH clause of the following statement that the topic parameter is set to enriched_flight_updates.

CREATE CHANGELOG enriched_flight_updates_log (
  flight_id VARCHAR, 
  event_ts TIMESTAMP, 
  origin VARCHAR, 
  destination VARCHAR, 
  mins_delayed BIGINT, 
  current_departure TIMESTAMP, 
  current_arrival TIMESTAMP, 
  "description" VARCHAR, 
  PRIMARY KEY (flight_id)
) WITH (
  'topic' = 'enriched_flight_updates', 
  'value.format' = 'json',
  'timestamp' = 'event_ts'
);

Now, the enriched_flight_updates_log Changelog will contain all flight updates with the flight information that the original flight_updates Stream was missing. However, only updates are currently included and there are no events for flights that don’t have delays or updates in this Stream. To fix this, we can write an INSERT INTO query to generate updates from the flights. This will ensure that our enriched_flight_updates_log Changelog will capture all the statuses of all flights.

INSERT INTO enriched_flight_updates_log 
SELECT 
  flight_id, 
  event_ts, 
  origin, 
  destination, 
  CAST(0 AS BIGINT) AS mins_delayed, 
  scheduled_dep AS current_departure, 
  scheduled_arr AS current_arrival, 
  CAST(NULL AS VARCHAR) AS "description" 
FROM 
  flights;

Finally, we can materialize our enriched_flight_updates_log into a materialized view that users can query at any time and get the most up-to-date information. Since our enriched_flight_updates_log Changelog has the primary key of flight_id, the view will be updated with UPSERT mode on the flight_id key. If we had instead created our materialized view from the enriched_flight_updates Stream, then the view would be created in append mode where each event is a new row in the view. Using upsert mode, we can update the existing row, if any, based on the Changelog’s primary key.

CREATE MATERIALIZED VIEW flight_status_view AS 
SELECT 
  * 
FROM 
  enriched_flight_updates_log;

After creating our materialized view, we can query it at any moment and get the latest flight statuses. The view can be queried directly from the DeltaStream console or CLI, as well as through the DeltaStream REST API for other applications to access programmatically. Let’s look at some sample input and output data below.

Input for flights:

{"flight_id": "Flight_1", "event_ts": "2024-03-28 10:12:13.489", "origin": "LAX", "destination": "YVR", "scheduled_dep": "2024-05-29 10:33:00", "scheduled_arr": "2024-05-29 13:37:00"}
{"flight_id": "Flight_2", "event_ts": "2024-04-11 11:58:56.489", "origin": "JFK", "destination": "SFO", "scheduled_dep": "2024-05-29 12:30:00", "scheduled_arr": "2024-05-29 19:10:00"}
{"flight_id": "Flight_3", "event_ts": "2024-04-23 10:12:13.489", "origin": "SIN", "destination": "NRT", "scheduled_dep": "2024-05-30 09:25:00", "scheduled_arr": "2024-05-30 17:30:00"}
{"flight_id": "Flight_4", "event_ts": "2024-05-30 15:52:13.837", "origin": "AUS", "destination": "ORD", "scheduled_dep": "2024-07-20 09:15:00", "scheduled_arr": "2024-07-20 11:30:00"}
…

Input for flight_updates:

{"flight_id": "Flight_1", "event_ts": "2024-05-28 15:52:13.837", "updated_departure": "2024-05-29 12:30:00", "description": "Thunderstorms" }
{"flight_id": "Flight_2", "event_ts": "2024-05-29 12:30:13.837", "updated_departure": "2024-05-29 13:30:00", "description": "Waiting for connecting passengers" }
{"flight_id": "Flight_1", "event_ts": "2024-05-29 12:52:13.837", "updated_departure": "2024-05-29 13:30:00", "description": "More Thunderstorms" }
{"flight_id": "Flight_3", "event_ts": "2024-05-30 06:52:13.837", "updated_departure": "2024-05-30 12:30:00", "description": "Mechanical delays" }
…

Query flight_status_view:

SELECT * FROM flight_status_view ORDER BY flight_id;

Output results:

flight_id |         event_ts         | origin | destination | mins_delayed |  current_departure   |   current_arrival    |          description
------------+--------------------------+--------+-------------+--------------+----------------------+----------------------+---------------------------------
  Flight_1  | 2024-05-29T12:52:13.837Z | LAX    | YVR         |          177 | 2024-05-29T13:30:00Z | 2024-05-29T16:34:00Z | More Thunderstorms
  Flight_2  | 2024-05-29T12:30:13.837Z | JFK    | SFO         |           60 | 2024-05-29T13:30:00Z | 2024-05-29T20:10:00Z | Waiting FOR connecting
            |                          |        |             |              |                      |                      | passengers
  Flight_3  | 2024-05-30T06:52:13.837Z | SIN    | NRT         |          185 | 2024-05-30T12:30:00Z | 2024-05-30T20:35:00Z | Mechanical delays
  Flight_4  | 2024-05-30T15:52:13.837Z | AUS    | ORD         |            0 | 2024-07-20T09:15:00Z | 2024-07-20T11:30:00Z | <nil>

Conclusion

When it comes to air travel, travelers have an expectation that airlines will communicate flight delays and flight status effectively. In this post, we demonstrated a simple use case of how airlines can set up real-time data pipelines to process and join flight and flight updates data, with just a few simple SQL queries using DeltaStream. As a fully managed and serverless service, DeltaStream enables its users to easily create powerful real-time applications without any of the overhead.

If you want to learn more about DeltaStream, sign up for our free trial or reach out to us.

27 Mar 2024

Min Read

Maximizing Performance: Processing Real-time Online Gaming Data

The gaming industry has seen immense growth in the past decade. According to an analysis by EY, the global market for gaming in 2021 was $193.4b, up from $93.6b in 2016, and the market is only expected to continue growing with an estimated market of $210.7b in 2025. With billions of people playing video games, gaming companies need to ensure that their data platforms can handle the demands and requirements of this massive amount of data. Online gaming, which makes up a large portion of the gaming market, often has to handle and act on millions of real-time events per second. Consider all the real-time player interactions, chatrooms, leaderboards, and telemetry data that are part of modern online games. For this reason, game developers need a real-time streaming and stream processing platform that can seamlessly scale, process, and govern these data. This is where DeltaStream comes in – to manage and process all of the streaming data in your organization. In this blog post, we’ll cover how DeltaStream can help game developers for two use cases:

Keeping leaderboards up to date
Temporarily ban players for leaving games early

Connecting a Streaming Store

Although DeltaStream can source real-time data from many different data stores such as Kafka, Kinesis, PostgreSQL (as CDC data), and others, for our use cases we’ll be using RedPanda. As mentioned in an article on gaming from RedPanda, RedPanda is a cost-effective, easily scalable, and very performant alternative to using Kafka. These attributes make it a great streaming storage option for real-time gaming data.

Since RedPanda is compatible with Kafka’s APIs, users can add RedPanda as a Store in DeltaStream with the following statement:

Keeping Leaderboards Up to Date with SQL and Materialized Views

Let’s assume we have a topic called “game_results” in our RedPanda Store. We can think of the events in this topic to be the results of playing some game. So, every time a player finishes a game, a new record is logged into the topic which includes the timestamp of the end of the game, the player ID, and whether or not they won the game. This topic contains records that look like the following:

We can define a DeltaStream Stream that is backed by this topic with the following query:

Next, let’s create a Materialized View to keep track for each player how many games they have completed and how many games they have won:

By creating this Materialized View, DeltaStream launches a Flink job behind the scenes which continuously ingests from the “game_results” topic and updates the view with the latest data. In the world of online gaming, thousands of games could be finished every minute, and as these games are completed, the view will stay up to date with the latest high scores.

Next, a leaderboard can be generated by querying the Materialized View. For example, the following query will return the 10 players with the most wins:

If we wanted to find out which 10 players have the highest win/loss ratio, then we can run the following query:

Temporarily Ban Players for Leaving Games Early with Applications

Although most people enjoy online gaming, sometimes our competitive nature can bring out the worst in us. It’s not uncommon for players to leave games early, phenomenon commonly known as “rage quitting.” For team-based competitive online games however, rage quitters can be detrimental to maintaining a fun and balanced competitive game, as the teammates of the rage quitter have to deal with the consequences of being a man down. To deal with this, gaming developers often add a timeout to players who continuously quit games early to dissuade this behavior.

For this use case, we want to detect when a player has quit 2 of their last 4 games. Let’s assume that there is a topic called “player_game_actions” in our RedPanda Store. Below is an example of a record in this topic:

The action field here describes the interaction between the player and the game lobby. Possible values include JOIN, QUIT, COMPLETE. We can define a Stream backed by this topic:

Now, let’s process this Stream of data to find out which players have left 2 of their last 4 games. While we can solve this problem writing a query with very nested subqueries, we’ll use DeltaStream’s latest Application feature for this example:

Use DeltaStream to Unlock Real-time Gaming Analytics

Real-time interactions are one of the core components of online gaming, and as the industry continues to grow, it becomes increasingly necessary for these gaming companies to find cost-effective, scalable, and low-latency data solutions. As a fully-managed solution powered by Apache Flink, DeltaStream is an easy-to-use, scalable, and resilient system for all real-time stream processing workloads.

In the examples in this post, we built solutions using DeltaStream to process real-time gaming data, ingesting from storage systems such as RedPanda or Confluent Cloud. In the first use case, we used Materialized Views to build a real-time user-driven solution that keeps track of each player’s wins and losses. In the second use case, we built a real-time event-driven solution to detect when a player is being unsportsmanlike, so that downstream backend services can decide how to act on these players with minimal latency. As a system that can do Streaming Analytics and act as a Streaming Database, DeltaStream is a system built for all stream processing workloads. If you want to learn more about how DeltaStream can help unlock real-time insights for your gaming data, reach out to us or get a trial.

28 Feb 2024

Min Read

Up-to-date Data Pipelines for Rideshare App Driver Ratings

Markets like e-commerce, the gig economy, and local businesses all rely on ratings to give an accurate representation of how good a product or service is. Ratings are not only useful for consumers, but they’re also useful for the companies managing these goods and services. While the use of ratings don’t typically require low-latencies, there are still some cases where having an up-to-date rating can unlock opportunities. For example, in a rideshare app, a company may want to block a driver whose rating has fallen below a certain threshold from taking more trips. This is to prevent riders from taking rides from unsafe or unpleasant drivers. Without a rating that is up to date, the systems that validate if a driver is eligible to drive more passengers are working with outdated data and cannot accurately determine if a driver should be allowed to drive or not.

In this post, we’ll showcase how DeltaStream can be used to keep ratings up to date to solve this rideshare use case.

Connect a Store and Create a Stream

Let’s assume we have a Kafka cluster with a topic called “driver_ratings” which contains the data that we’ll be sourcing from, and we’ve already defined a DeltaStream Store for this Kafka cluster (see tutorial for how to create a Store in DeltaStream) called “kafka_cluster.” We can create a Stream to represent this data with the following statement in DeltaStream:

CREATE STREAM driver_ratings (
  event_ts TIMESTAMP,
  driver_id VARCHAR,
  rider_id VARCHAR,
  rating DOUBLE
) WITH (
  'store' = 'kafka_store',
  'topic' = 'driver_ratings',
  'value.format' = 'json',
  'timestamp' = 'event_ts',
  'timestamp.format' = 'iso8601'
);

Real-Time Driver Ratings

For our use case, we want to keep an up-to-date rating for drivers, as well as an up-to-date count of how many reviews a driver has. These values will help downstream applications determine if a driver should be suspended. If a driver has a certain number of rides and their rating is below some threshold, then they should be suspended.

Notice in the schema of the data, from the Stream definition in the setup, that there is a driver_id and rating field. The driver_id field specifies which driver is being rated, and the rating field specifies the rating that the driver has received for a trip. To determine the driver’s rating, we need to keep an up-to-date average of all of a driver’s rides for each driver. We can do this in SQL by grouping by the driver_id field, then using the AVG function to calculate the average. Similarly, for finding the number of reviews, we can use the COUNT function. These results will be persisted to a materialized view so that data consumers can easily query the view to find the latest driver ratings.

CREATE MATERIALIZED VIEW AS SELECT Query:

CREATE MATERIALIZED VIEW avg_driver_ratings AS 
SELECT 
  driver_id, 
  AVG(rating) AS avg_driver_rating, 
  COUNT(*) AS num_reviews 
FROM 
  driver_ratings 
GROUP BY 
  driver_id;

Since the above query performs an aggregation grouping by the driver_id field, the result has a primary key of driver_id. This creates a materialized view in UPSERT mode, such that there is always one row per driver_id that reflects that driver’s current ratings.

By submitting the query, DeltaStream launches a long-lived continuous job in the background which constantly reads from the driver_ratings topic, computes the latest averages and counts, then updates the materialized view. This way, as new ratings arrive in the source topic, the materialized view is updated immediately.

Users can use DeltaStream’s Web App, CLI, or REST API to query the materialized view. Using one of these methods, downstream data consumers, such as the team responsible for driver suspensions in the rideshare company, can query the materialized view for the latest results. For example, we can query the materialized view for driver IDs with a rating below 4 and at least 15 rides.

Query 1 against avg_driver_ratings materialized view:

SELECT 
  * 
FROM 
  avg_driver_ratings 
WHERE 
  avg_driver_rating < 4
  AND num_reviews >= 15;

  driver_id | avg_driver_rating  | num_reviews
------------+--------------------+--------------
  Driver_5  |               3.75 |          16
  Driver_1  | 3.8823529411764706 |          17

Let’s also run a query to select all of our rows in our materialized view to see what the full result set looks like.

Query 2 against avg_driver_ratings materialized view:

SELECT * FROM avg_driver_ratings ORDER BY avg_driver_rating;

  driver_id | avg_driver_rating  | num_reviews
------------+--------------------+--------------
  Driver_6  | 3.5714285714285716 |          14
  Driver_5  |               3.75 |          16
  Driver_1  | 3.8823529411764706 |          17
  Driver_3  |  4.111111111111111 |          18
  Driver_7  |  4.166666666666667 |          18
  Driver_2  |  4.166666666666667 |          18
  Driver_9  |  4.208333333333333 |          24
  Driver_4  |  4.222222222222222 |           9
  Driver_8  |               4.25 |          16

Wrapping Up

In this post, we showcased how DeltaStream can keep average ratings up to date in real time. While calculating average reviews is typically a job done with batch processing, recreating this use case with stream processing creates opportunities for features and data products to be built on the new real-time assumptions. Although we focused on ridesharing in this example, ratings are used in plenty of different contexts and the same pipeline can be used to keep those ratings up to date in DeltaStream.

DeltaStream is the platform to unify, process, and govern all of your streaming data. If you want to learn more about DeltaStream’s materialized views or other capabilities, come schedule a demo with us or sign up for a free trial.

21 Feb 2024

Min Read

Stream Processing for Blockchain Data

Cryptocurrencies, smart contracts, NFTs, and Web3 have infiltrated mainstream media as the newest hot tech (other than generative AI of course). These technologies are backed by blockchains, which are distributed ledgers that rely on cryptography and decentralization to make secure transactions. In this blog post, we’ll be building a stream processing application using DeltaStream to inspect the gas fees associated with Ethereum transactions.

Ethereum is one of the most popular cryptocurrencies. Using Ethereum, users can set up wallets, transfer ether between accounts, interact with smart contracts, and more. With over a million transactions occurring on the Ethereum network every day (the latest Ethereum usage statistics), there is a lot of real-time data getting processed by the network. However, off-chain analytics can also play a role – to extract insights from the blockchain or the metadata associated with the blockchain.

For this use case, we are going to be doing real-time analysis from the transactions data persisted to Ethereum’s blockchain. Any time an Ethereum user wants to perform an action in the network, whether it’s running a smart contract or simply transferring ether from their wallet to another wallet, a transaction needs to be created. The user will send this transaction to what are called “validators” who will persist this transaction to the blockchain. Once a transaction is part of a block on the blockchain, that transaction is completed as blocks are generally irreversible. However, each block on the blockchain has a gas limit, which essentially caps how many transactions can be in a particular block. Each transaction requires a certain amount of gas – a simple transfer of ether from one wallet to another costs 21,000 gas for example. Running complex smart contracts will require more gas and will fill up a block’s gas limit more quickly. This means that not every transaction automatically gets persisted to the blockchain as validators pick which set of transactions they want to include in the next block (read more about gas fees in Ethereum).

After the Ethereum Improvement Proposal (EIP) 1559 upgrade was added to Ethereum, the gas fee structure changed to include priority fees. In order to make your own transaction more attractive for validators to pick, a priority fee can be attached to the transaction. This priority fee is a tip to the validator if they write your transaction to a block. So, the larger the priority fee, the more likely a validator will include the transaction. The question we want to help solve in this post is what should we set the priority fee to be?

Using real Ethereum transactions data from Infura.io, we want to look at what transactions are being persisted to Ethereum’s blockchain in real-time and get a sense of how big the priority fees are for these transactions. So, in the following sections you’ll see how we can create an application to analyze gas fees in real time using DeltaStream. As an Ethereum user, these insights will be valuable for setting reasonable priority fees on transactions without overpaying.

Setup and Assumptions for Analyzing Real-Time Data

Using Infura.io’s APIs, we are able to get Ethereum’s entire blockchain block data. We wrote a small program to wait for new blocks, get the transactions in each block, and write these transactions as JSON-formatted payloads to Kafka. We’ll be using these transactions as the source data for our use case. However, there are some assumptions that we’ll be making to simplify our use case. These assumptions are listed out below for completeness:

We are ignoring legacy transactions that are not type 2 (read more about Ethereum’s transaction types). Type 2 transactions are those that adhere to the post EIP-1559 specifications and use priority fees. Since the EIP-1559 upgrade is backwards compatible, users can still send transactions using the old specifications, but we are ignoring these transactions to simplify our use case.
For our transactions data, we are enriching each transaction payload with the respective block’s timestamp and the transaction’s hash. These fields are not part of the transaction message itself.
Users set the maxPriorityFeePerGas and the maxFeePerGas when making a transaction. While it’s not always the case, for simplicity we will assume that the (maxPriorityFeePerGas + baseFeePerGas) > maxFeePerGas.
Occasionally, transactions with maxPriorityFeePerGas set to 0 or very low values will make it onto the blockchain. The reason validators choose these transactions is likely because users have bribed the validators (learn more about bribes in Ethereum in this research paper). As you’ll see later in the setup, we are going to filter out any transactions with maxPriorityFeePerGas <= 100.

Let’s get on with the setup of our use case. In DeltaStream, after setting up Kafka as a Store, we can create a Stream that is backed by the transactions data in our Kafka topic. The Stream that we create is metadata that informs DeltaStream how to deserialize the records in the Kafka topic. The following CREATE STREAM statement creates a new Stream called eth_txns that is backed by the ethereum_transactions topic.

CREATE STREAM eth_txns (
  "txn_hash" VARCHAR, 
  "block_ts" BIGINT, 
  "blockNumber" BIGINT, 
  "gas" BIGINT, 
  "maxFeePerGas" BIGINT, 
  "maxPriorityFeePerGas" BIGINT, 
  "value" BIGINT
) WITH (
  'topic' = 'ethereum_transactions', 
  'value.format' = 'json', 
  'timestamp' = 'block_ts'
);

Now that we have our eth_txns source Stream defined, we first need to filter out transactions that don’t fit our assumptions. We can write a CREATE STREAM AS SELECT (CSAS) query that will continuously ingest data from the eth_txns Stream, filter out records that don’t meet our criteria, and sink the resulting records to a new Stream backed by a new Kafka topic. In the following query, note in the WHERE clause that we only accept transactions that have maxPriorityFeePerGas > 100 (transactions chosen due to bribes) and maxPriorityFeePerGas < maxFeePerGas (transactions whose priority fee is not accurately represented by maxPriorityFeePerGas).

CSAS query to create eth_txns_filtered Stream:

CREATE STREAM eth_txns_filtered AS 
SELECT 
  "txn_hash", 
  "block_ts", 
  "blockNumber", 
  "maxPriorityFeePerGas", 
  "gas" 
FROM eth_txns WITH ('source.deserialization.error.handling'='IGNORE')
WHERE  "maxPriorityFeePerGas" > 100 AND "maxPriorityFeePerGas" < "maxFeePerGas";

Analyzing Ethereum’s Blockchain Transaction Gas Fee Data

When forming the next block, Ethereum’s validators basically get to select whichever transactions they want from a pool of pending transactions. Since new blocks are only added every several seconds and blocks have a gas limit, we can expect validators to choose transactions that are in their best financial interest and choose the transactions with the highest priority fees per gas. So, we’ll analyze the maxPriorityFeePerGas field over time as transactions flow in to get a sense of what priority fees are currently being accepted.

The following query is a CREATE CHANGELOG AS SELECT (CCAS) query that is calculating the moving average, min, max, and standard deviation of priority fees over a 2 minute window.

CCAS query to create eth_txns_priority_fee_analysis Changelog:

CREATE CHANGELOG eth_txns_priority_fee_analysis AS 
SELECT 
  window_start, 
  window_end, 
  COUNT(*) AS txns_cnt, 
  MIN("maxPriorityFeePerGas") AS min_priority_fee, 
  MAX("maxPriorityFeePerGas") AS max_priority_fee, 
  AVG("maxPriorityFeePerGas") AS avg_priority_fee, 
  STDDEV_SAMP("maxPriorityFeePerGas") AS priority_fee_stddev
FROM HOP(eth_txns_filtered, SIZE 2 MINUTES, ADVANCE BY 15 SECONDS)
GROUP BY window_start, window_end;

Let’s see some result records in the eth_txns_priority_fee_analysis topic:

{
  "window_start": "2023-12-18 21:14:45",
  "window_end": "2023-12-18 21:16:45",
  "txns_cnt": 368,
  "min_priority_fee": 50000000,
  "max_priority_fee": 32250000000,
  "avg_priority_fee": 859790456,
  "priority_fee_stddev": 97003259
}
{
  "window_start": "2023-12-18 21:15:00",
  "window_end": "2023-12-18 21:17:00",
  "txns_cnt": 514,
  "min_priority_fee": 50000000,
  "max_priority_fee": 219087884691,
  "avg_priority_fee": 1951491416,
  "priority_fee_stddev": 79308531
}

Using these results, users can be better informed at setting priority fees for their own transactions. For example, if the transactions are more urgent, they can choose to set the priority fee to a value greater than the average. Similarly, if users want to save money and don’t mind waiting for some time for their transactions to make it onto the blockchain, they can choose a priority fee that is less than the average. These results are also useful for follow on use cases, such as tracking priority fees over a period of time. DeltaStream’s pattern recognition capabilities also allow users to track patterns in the priority fees. For example, users could set up a pattern recognition query to detect when priority fees stop trending upwards or when priority fees experience a sudden drop off.

Intersecting Web3 and Stream Processing

In this blog post, we put together a real-time streaming analytics pipeline to analyze Ethereum’s gas fees. With DeltaStream’s easy-to-use platform, we were able to solve the use case and deploy our pipeline within minutes, using only a few simple SQL queries. Although this is an entry level example, we illustrate a use case at the intersection of these two emerging technologies.

If you are interested in learning more about DeltaStream, schedule a demo with us or sign up for a free trial.

13 Feb 2024

Min Read

Stream Processing for IoT Data

The Internet of Things (IoT) refers to sensors and other devices that share and exchange data over a network. IoT has been on the rise for years and only seems to continue in its growing popularity with other technological advances, such as 5G cellular networks and more “smart” devices. From tracking patient health to monitoring agriculture, the applications for IoT are plentiful and diverse. Other sectors where IoT are used include security, transportation, home automation, and manufacturing.

Oracle defines Big Data as “data that contains greater variety, arriving in increasing volumes and with more velocity.” This definition is simply described with the 3 Vs – volume, velocity, and variety. IoT definitely matches this description, as sensors can emit a lot of data from numerous sensors and devices.

A platform capable of processing IoT data needs to be scalable in order to keep up with the volume of Big Data. It’s very common for many IoT applications to have new sensors added. Consider a drone fleet for package deliveries as an example – you may start off with 10 or 20 drones, but as demands for deliveries increases the size of your drone fleet can grow by orders of magnitude. The underlying systems processing these data needs to be able to scale horizontally to match the increase in data volume.

Many IoT use cases such as tracking patient health and monitoring security feeds require low latency insights. Sensors and devices providing real-time data often need to be acted on in real-time as well. For this reason, streaming and stream processing technologies have become increasingly popular and perhaps essential for solving these use cases. Streaming storage technologies such as Apache Kafka, Amazon Kinesis, and RedPanda can meet the low latency data transportation requirements of IoT. On the stream processing side, technologies such as Apache Flink and managed solutions such as DeltaStream can provide low latency streaming analytics.

IoT data can also come in various types and structures. Different sensors can have different data formats. Take a smart home for example, the cameras in a smart home will likely send very different data from a light or a thermometer. However, these sensors are all related to the same smart home. It’s important for a data platform handling IoT use cases to be able to join across different data sets and handle any variations in data structure, format, or type.

DeltaStream as a Streaming Analytics Platform and a Streaming Database

DeltaStream is a platform to unify, process, and govern streaming data. DeltaStream sits as the compute and governance layer on top of streaming storage systems such as Kafka. Powered by Apache Flink, DeltaStream is a fully managed solution that can process streaming data with very low latencies.

In this blog post we’ll cover 2 examples to show how DeltaStream can solve real-time IoT use cases. In the first use case, we’ll use DeltaStream’s Materialized Views to build a real-time request driven application. For the second use case, we’ll use DeltaStream to power real-time event-driven pipelines.

Use Case Setup: Transportation Sensor Data

For simplicity, both use cases will use the same source data. Let’s assume that our data is available in Apache Kafka and represents updates and sensor information for a truck fleet. We’ll first define Relations for the data in 2 Kafka topics.

The first Relation represents truck information. This includes an identifier for the truck, the speed of the truck, which thermometer is in the truck, and a timestamp for this update event represented as epoch milliseconds. Later on, we will use this event timestamp field to perform a join with data from other sensors. Since we expect regular truck information updates, we’ll define a Stream for this data.

Create truck_info Stream:

CREATE STREAM truck_info (
  event_ts BIGINT, 
  truck_id INT, 
  speed_kmph INT, 
  thermometer VARCHAR
) WITH (
  'topic' = 'truck_info', 'value.format' = 'json', 'timestamp' = 'event_ts'
);

The second Relation represents a thermometer sensor’s readings. The fields include an identifier for the thermometer, the temperature reading, and a timestamp for when the temperature was taken that is represented as epoch milliseconds. Later on, the event timestamp will be used when joining with the truck_info Stream. We will define a Changelog for this data using sensor_id as the primary key.

Create temperature_sensor Changelog:

CREATE CHANGELOG temperature_sensor (
  "time" BIGINT, 
  temperature_c INTEGER, 
  sensor_id VARCHAR, 
  PRIMARY KEY (sensor_id)
) WITH (
  'topic' = 'temperature_sensor', 'value.format' = 'json', 'timestamp' = 'time'
);

Using the Relations we have just defined, we want to find out what the latest temperature readings are in each truck. We can achieve this by using a temporal join to enrich our truck_info updates with the latest temperature readings from the temperature_sensor Changelog. The result of this join will be a Stream of enriched truck information updates with the latest temperature readings in the truck. The following SQL statement will launch a long-lived continuous query that will continually join these two Relations and write the results to a new Stream that is backed by a new Kafka topic.

Create truck_info_enriched Stream using CSAS:

CREATE STREAM truck_info_enriched AS 
SELECT 
  truck_info.event_ts, 
  truck_info.truck_id, 
  truck_info.speed_kmph, 
  temp.sensor_id AS thermometer, 
  temp.temperature_c 
FROM truck_info 
  JOIN temperature_sensor temp
  ON truck_info.thermometer = temp.sensor_id;

While a truck fleet in a real-world environment will likely have many more sensors, such as cameras, humidity sensors, and others, we’ll keep this use case simple and just use a thermometer as the additional sensor. However, users could continue to enrich their truck information events with joins for each additional sensor data feed.

Use Case Part 1: Powering a real-time dashboard

Monitoring and health metrics are essential for managing a truck fleet. Being able to check on the status of particular trucks and generally see that trucks are doing fine can provide peace of mind for the truck fleet manager. This is where a real-time dashboard can be helpful – to have the latest metrics readily available on the status of the truck fleet.

So for our first use case, we’ll use Materialized Views to power a real-time dashboard. By materializing our truck_info_enriched Stream into a queryable view, we can build charts that can query the view and get the latest truck information. We’ll build the Materialized View in two steps. First we’ll define a new Changelog that mirrors the truck_info_enriched Stream, then we’ll create a Materialized View from this Changelog.

Create truck_info_enriched_changelog Changelog:

CREATE CHANGELOG truck_info_enriched_changelog (
  event_ts BIGINT, 
  truck_id INT, 
  speed_kmph INT, 
  thermometer VARCHAR,
  temperature_c INTEGER, 
  PRIMARY KEY (truck_id)
) WITH (
  'topic' = 'truck_info_enriched', 
  'value.format' = 'json'
);

Create truck_info_mview Materialized View using CVAS:

CREATE MATERIALIZED VIEW truck_info_mview AS 
SELECT * FROM truck_info_enriched_changelog;

Note that we could have created this Materialized View sourcing from the truck_info_enriched Stream, but if we created the Materialized View from the Stream, then each event would be a new row in the Materialized View (append mode). Instead we are building the Materialized View from a Changelog so that each event will add a new row or update an existing one based on the Changelog’s primary key (upsert mode). For our example, we only need to know the current status of each truck, so building the Materialized View with upsert mode better suits our use case.

A continuous query will power this Materialized View, constantly ingesting records from the truck_info_enriched Stream and sinking the results to truck_info_mview. Then, we can write queries to SELECT from the Materialized View. A dashboard can easily be built that simply queries this Materialized View to get the latest statuses for trucks. Here are some example queries that might be helpful when building a dashboard for the truck fleet.

Query to get truck IDs with the highest 10 temperatures:

SELECT truck_id, temperature_c 
FROM truck_info_mview 
ORDER BY temperature_c DESC 
LIMIT 10;

Query to get all information about a truck:

SELECT * 
FROM truck_info_mview 
WHERE truck_id = 3;

Query to count the number of trucks that are speeding:

SELECT COUNT(truck_id) AS num_speeding_trucks 
FROM truck_info_mview 
WHERE speed_kmph > 90;

Use Case Part 2: Building a real-time alerting pipeline

While it’s great to be able to pull for real-time metrics for our truck fleet (Use Case Part 1), there are also situations where we may want the truck update events themselves to trigger actions. In our example, we’ve included thermometers as one of the sensors in each of our delivery trucks. Groceries, medicines, and some chemicals need to be delivered in refrigerated trucks. If the trucks aren’t able to stay within a desired temperature range, it could cause the items inside to go bad or degrade. This can be quite serious, especially for medicines and hazardous materials that can have a direct impact on people’s health.

For our second use case, we want to build out a streaming analytics pipeline to power an alerting service. We can use a CSAS to perform real-time stateful transformations on our data set, then sink the results into a new Stream backed by a Kafka topic. Then the sink topic will contain alertable events that the truck fleet company can feed into their alerting system or other backend systems. Let’s stick to our refrigeration example and write a query that detects if a truck’s temperature exceeds a certain threshold.

Create overheated_trucks Stream using CSAS:

CREATE STREAM overheated_trucks AS 
SELECT * FROM truck_info_enriched WHERE temperature_c > 10;

Submitting this CSAS will launch a long-lived continuous query that ingests from the truck_info_enriched Stream, filters for only events where the truck’s temperature is greater than 10 degrees celsius, and sink the results to a new Stream called overheated_trucks. Downstream, the truck fleet company can ingest these records and send alerts to the correct teams or use these records to trigger actions in other backend systems.

Processing IoT Data with DeltaStream

IoT data can be challenging to process due to the high volume of data, the inherent real-time requirements of many IoT applications, and the distributed nature of collecting data from many different sources. While we often treat IoT use cases as their own category, they really span many sectors and use cases. That’s why using a real-time streaming platform, such as DeltaStream, that is able to keep up with the processing demands of IoT and can serve as both a streaming database and streaming analytics platform is essential.

If you want to learn more about how DeltaStream can help your business, schedule a demo with us. We also have a free trial available if you want to try out DeltaStream yourself.

Why Shift Left?

A Real-World Benchmark Using NYC Taxi Data

Ingest Once, Stream Forever

The Snowflake ELT Path: Functional but Expensive

The DeltaStream ETL Path: Leaner, Faster, Cheaper

Key Results: 75% Cost Savings!

Key Difference: 1-Minute Dynamic Table Refresh vs. Real-Time Updates

Bonus Benefit: DeltaStream Simplifies Streaming with SQL

The Takeaway: Shift Left and Save Big

Want to see how much we can save you on your Snowflake bill?

Use Case: Aggregating IOT Data to Iceberg

Setup and Solution

Step 1: Creating a Common Table Expression (CTE)

Step 2: Extracting Time Components

Step 3: Ranking Records with Window Functions

Step 4: Filtering for Valid Time Windows

Step 5: Selecting the Top 3 Per Month

Processing Medical Device Data

Writing Data Tables to Iceberg

Streaming Lakehouse with DeltaStream Fusion

Shift-left and So Much More

Use Case: Missing Data as an Anomaly

Solution: Detecting failed Sensors

Summary

Common Fraud Detection Challenges

Leveraging Graph Analytics for Fraud Detection

Challenges of Implementing and Running Graph Databases

How PuppyGraph Solves Graph Database Challenges

Real-Time Fraud Prevention with DeltaStream

Step-by-Step tutorial: DeltaStream and PuppyGraph

Starting a Kafka Cluster

Setting up DeltaStream

Connecting to Kafka

Adding a Store for Integration

Starting data processing

Query your data as a graph

Conclusion

Raw Source Data for Airline Flight Status

Use Case: Create an Always Up-to-Date View of Current Flight Status

Conclusion

Connecting a Streaming Store

Keeping Leaderboards Up to Date with SQL and Materialized Views

Temporarily Ban Players for Leaving Games Early with Applications

Use DeltaStream to Unlock Real-time Gaming Analytics

Connect a Store and Create a Stream

Real-Time Driver Ratings

Wrapping Up

Setup and Assumptions for Analyzing Real-Time Data

Analyzing Ethereum’s Blockchain Transaction Gas Fee Data

Intersecting Web3 and Stream Processing

DeltaStream as a Streaming Analytics Platform and a Streaming Database

Use Case Setup: Transportation Sensor Data

Use Case Part 1: Powering a real-time dashboard

Use Case Part 2: Building a real-time alerting pipeline

Processing IoT Data with DeltaStream

Request Submitted

Share this blog post