02 Jul 2024

A Guide to RBAC vs ABAC vs ACL

Access control is necessary for data platforms to securely share data. For users to confidently share their data resources with the intended parties, access control should be easy to understand and scalable, especially as more data objects and users are added. Without a sensible access control model, users run a higher risk of inadvertently sharing data objects with the wrong parties or failing to notice incorrect permissions. Choosing the right access control model depends heavily on the use case, so it’s important to understand the benefits and drawbacks of popular options.

In this post, we’ll cover three different access control models: access control lists (ACL), role-based access control (RBAC), and attribute-based access control (ABAC). We’ll discuss what they are, their pros and cons, and what to consider when choosing an access control model.

Access Control List (ACL)

An ACL is a list of permissions for a particular resource and is the simplest of the access control models that we’ll cover. When a user attempts an action on a resource, such as a read or write, the ACL associated with that resource is used to allow or deny the attempt. To add or remove permissions for a resource, an entry in the ACL is added or deleted. ACLs are simple to understand and implement; however, they can be difficult to manage when there are many users and resources, as these lists can grow quickly.

To illustrate how ACLs work, let’s consider an example of a university with professors, teaching assistants, and students:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As you can see from the diagram, each individual is given specific permissions for what they’re able to do. If another student were to join, the ACL would need to be updated to grant the new student the privilege to submit assignments and view their grades.
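
To make the bookkeeping concrete, here is a rough sketch of what these ACL-style, per-user permissions could look like if the university stored its data in a SQL database. The table names, user names, and privilege choices are purely illustrative; the point is that every user and resource pair needs its own entry:

  -- ACL-style access: every user is granted access to every resource directly
  -- (hypothetical tables and users, standard SQL GRANT syntax)
  GRANT INSERT ON assignments TO alice;        -- student: submit assignments
  GRANT SELECT ON grades      TO alice;        -- student: view their grades
  GRANT UPDATE ON assignments TO tom;          -- teaching assistant: grade assignments
  GRANT UPDATE ON assignments TO prof_smith;   -- professor: grade assignments
  GRANT SELECT ON grades      TO prof_smith;   -- professor: view student grades

  -- every new student means repeating the per-user entries
  GRANT INSERT ON assignments TO bob;
  GRANT SELECT ON grades      TO bob;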

Pros:

  • Simple and easy to understand: User privileges for a particular resource are stated plainly in a list.
  • Allows for fine-grained access control to resources: ACLs typically allow different types of access to be defined (e.g., read, write, share).

Cons:

  • Does not scale well: As more users, user groups, and resources are added, access must be individually specified in ACLs each time.
  • Low visibility into a user’s permissions: Checking a particular user’s privileges requires a lookup in every ACL in the organization.
  • Error-prone when used at scale: When ACLs are used at scale, it can be cumbersome to add the proper permissions for users, or detect if a user has been given permissions they shouldn’t have. The difficulty in managing ACLs at scale makes it more likely that errors will occur.

Role-based Access Control (RBAC)

RBAC manages permissions with roles, where roles act as an intermediary between users and resources. In this model, users are assigned a set of roles, and roles are given permissions on resources. This model works well when there are clear groups of users who need the same set of privileges and permissions. Compared to ACLs where every permission needs to be explicitly defined, RBAC scales well with new users and resources. New users can be assigned their relevant roles and adopt all the privileges associated with those roles. Similarly, permissions for new resources can be added to existing roles and users with those roles will automatically inherit the permissions for the new resource.

Using the example from earlier, we can see how RBAC might be applied to a university setting:

  • Students are able to submit assignments and view their grades
  • Teaching assistants are able to grade assignments
  • Professors are able to grade assignments and view student grades

As we can see, the relationships in this diagram are simpler than in the diagram with ACLs. Instead of specifying direct access to resources, users are assigned roles which have privileges on resources. If a new student were to join the class, they would just need to be assigned the student role, and all the permissions they need would be inherited through that role.
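
Here is a rough sketch of the same setup expressed with roles, again with hypothetical names and using PostgreSQL-style syntax. Privileges are granted to roles once, and users simply receive the appropriate role:

  -- RBAC: privileges are attached to roles, users are assigned roles
  -- (hypothetical tables, roles, and users)
  CREATE ROLE student;
  CREATE ROLE teaching_assistant;
  CREATE ROLE professor;

  GRANT INSERT ON assignments TO student;            -- submit assignments
  GRANT SELECT ON grades      TO student;            -- view their grades
  GRANT UPDATE ON assignments TO teaching_assistant; -- grade assignments
  GRANT UPDATE ON assignments TO professor;          -- grade assignments
  GRANT SELECT ON grades      TO professor;          -- view student grades

  -- onboarding a new student is a single statement
  GRANT student TO bob;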

Pros:

  • Easy-to-manage policy enforcement: Updating a privilege for a role automatically applies to all users with that role, making it easier to enforce policies at scale.
  • Scalable: New users can be granted the roles that apply to them and inherit all the privileges associated with those roles. As new resources are created, access to them can be granted to existing roles, or additional roles can easily be created.
  • Better security and compliance: RBAC ensures that users only have access to the roles relevant to them and, by extension, only the privileges given to those roles. This results in users having only the necessary permissions and reduces the risk of unauthorized access.
  • Widely adopted: RBAC has been around for decades and is used in many popular databases and data products, including PostgreSQL, MySQL, MongoDB, and Snowflake.

Cons:

  • Role explosion: While RBAC is generally quite scalable, role explosion can occur when group privileges are not clearly differentiated and too many roles get created, at which point RBAC becomes difficult to manage. Organizations should define and enforce best practices for creating roles to avoid role explosion.
  • Limited flexibility: For use cases where the privileges of roles are very dynamic, RBAC can feel rigid. For instance, if an organization restructures its teams, new roles may need to be created and existing roles may need their permissions changed. The process of safely adding and removing permissions from roles, cleaning up deprecated roles, and restructuring the role hierarchy can be cumbersome, slow down productivity, and result in tech debt.

Attribute-based Access Control (ABAC)

ABAC gates access to resources based on attributes, as opposed to users or roles. Attributes, such as who the user is, what action they’re trying to perform, which environment they are performing the action in, and what resource they are trying to perform the action on, are all considered when deciding whether or not access should be permitted. Rules are set up such that access is only allowed when conditions, determined by attributes, are met. For example, a rule can be set up such that a teaching assistant can only view grades if they’re in the grading room and it’s between 4:00 pm and 8:00 pm.

Let’s see how ABAC might be applied to the university example:

In this diagram, we can see how the ABAC policy works for a student who is trying to submit their assignment. For a student to submit their assignment under this policy, the student needs to have specific attributes, such as being enrolled and not being suspended. There are also contextual constraints, such as the submission needing to be before the deadline. If all of the conditions in the policy are satisfied, then the student can successfully submit their assignment.
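
There is no single standard syntax for ABAC policies, but as one way to illustrate condition-based rules, the submission policy above could be approximated with PostgreSQL row-level security, where access is granted only when checks against user and contextual attributes pass. The tables, columns, and attribute names below are hypothetical, and a dedicated policy engine would normally express such rules outside the database:

  -- Hypothetical condition-based policy using PostgreSQL row-level security
  ALTER TABLE submissions ENABLE ROW LEVEL SECURITY;

  CREATE POLICY student_submission ON submissions
      FOR INSERT
      WITH CHECK (
          -- user attributes: the submitting student is enrolled and not suspended
          EXISTS (
              SELECT 1 FROM students s
              WHERE s.student_id = submissions.student_id
                AND s.enrolled
                AND NOT s.suspended
          )
          -- contextual constraint: the submission is made before the deadline
          AND now() <= deadline
      );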

Pros:

  • Highly scalable: New rules and attributes can easily be added as business needs evolve. As resources evolve, administrators can simply assign attributes to the resource, as opposed to creating a new role or changing an existing one.
  • Flexible custom policies: Rules are highly customizable, enabling administrators to easily set up access policies based on context, such as time of day and location.
  • Attributes to ensure compliance with data regulations: Administrators can add attributes to sensitive resources, such as labels for personally identifiable information (PII) or HIPAA-regulated healthcare information. This makes it easier to set up rules that ensure data privacy and compliance with the relevant regulations.

Cons:

  • Complex to implement and maintain: Attributes and policies need to be carefully defined and governed. The initial design and assignment of attributes for users and resources can be a time-consuming and complex process. Then, continuing to maintain the attributes and access policies as business needs and applications change can require significant time and effort.
  • Difficult to assess risk exposure: Although it’s generally beneficial to be able to create highly customizable access policies, it can make it difficult to audit and assess risk exposure. For instance, understanding the full access a particular user has can be difficult since policies can be complex and contingent on context-specific conditions.

Choosing an Access Control Model

When it comes to choosing an access control model, users should consider how their organization may scale in the future, who will be responsible for maintaining the access control system, and if their needs actually require going with a more complex model. If there are a limited number of users and resources, ACLs may be the best approach as they are simple to understand and implement. If access policies need to be highly customized and dynamic, then ABAC may be a better approach. For something more scalable than ACLs but without the complexity of ABAC, then RBAC is probably sufficient. Organizations may also find that a hybrid approach of these models best serves their needs, such as RBAC and ABAC together.

At DeltaStream, we’ve taken the approach of adding RBAC to our platform. DeltaStream is a real-time stream processing platform that allows users to share, process, and govern their streaming data. In the data streaming space, Apache Kafka has been one of the leading open source projects for building streaming data pipelines and storing real-time events. However, access control with Kafka is managed through ACLs, and as the number of topics and users grows, managing these ACLs has been a pain point for Kafka users. As a data streaming platform that can connect to any streaming data source, DeltaStream allows users to manage and govern their streaming resources with RBAC. RBAC strikes the balance of improving on the scalability issues of ACLs without overcomplicating access control.

If you’re interested in discussing access control or learning more about DeltaStream, feel free to reach out or get a free trial.

23 May 2024

Workload Isolation: Everything You Need to Know

In cloud computing, workload isolation is critical for providing efficiency and security when running business workloads. Workload isolation is the practice of separating computing tasks into their own resources and/or infrastructure. By providing physical and logical separations, one compromised workload or resource cannot impact the others. This offers security and performance benefits and may be necessary to comply with regulatory requirements for certain applications.

Benefits of Workload Isolation

  • Security: By isolating workloads, organizations can reduce the ‘blast radius’ of security breaches. For instance, if an attacker were able to compromise the workload in one environment, workload isolation would protect the other workloads because they are being run in different environments. This helps to minimize, contain, and resolve potential security issues.
  • Performance: Isolated workloads can operate without interference from other tasks, ensuring that resources are dedicated and performance is optimized for each specific task. By isolating workloads, task performance becomes more predictable as tasks don’t need to compete for shared resources, making it easier to provide service level agreements (SLAs). Without workload isolation, a sudden spike in resource utilization for one task could negatively impact the performance of other tasks running on the same resources.
  • Compliance: Workload isolation simplifies compliance with various regulations by clearly defining boundaries between different data sets and processing activities.

Achieving workload isolation

Workload isolation can take many different forms and can be achieved with different approaches. When thinking about workload isolation, it is best to consider the multiple ways your workloads can be isolated, and to take a combined approach.

  • Resource Governance: Resource Governance is the ability to specify boundaries and limits for computing task resources. Popular container orchestration systems, such as Kubernetes, allow users to set resource limits on their services and workloads. Containerizing and limiting the resources for specific tasks removes the “noisy neighbor” problem, where one task can starve other tasks by consuming all of the resources.
  • Governance and Access Control: Providing access controls on data sets and compute environments ensures that only necessary individuals and services can access specific workloads. Most data systems have some form of access control that can be defined, whether that is in the form of an access control list (ACL), role-based access control (RBAC), or attribute-based access control (ABAC). Defining access control for users is essential to protect against unauthorized access.
  • Network Level Isolation: Network isolation aims to create distinct boundaries within a network, creating subnetworks with limited access between them. This practice improves security by limiting access to particular environments and helps ensure that an attacker cannot affect workloads on different subnetworks.

Workload isolation for Streaming Resources with DeltaStream

DeltaStream is a stream processing platform that is fully managed and serverless, allowing users to easily govern and process their streaming data from sources such as Apache Kafka or AWS Kinesis. As a security-minded stream processing solution, DeltaStream’s workload isolation plays a significant role in ensuring that computational queries are secure and performant. Below are some ways DeltaStream provides workload isolation:

  • Each Query Runs in its Own Environment: Powered by Apache Flink, each DeltaStream query runs in its own Flink cluster with its own dedicated resources and network. This ensures that users’ data is the only data being processed in a particular environment, minimizing the risk of sensitive data leakage. It also boosts performance, as each query can be scaled and tuned independently.
  • Multiple Deployment Options: DeltaStream offers various deployment options, including dedicated deployment and private SaaS deployment (also known as bring your own cloud or BYOC), catering to security-sensitive users. With the dedicated deployment option, a DeltaStream data plane runs in a cloud account dedicated to a single organization. In the private SaaS deployment option, a DeltaStream data plane operates within an organization’s cloud account. These options provide users with an additional level of assurance that their data is confined to a non-shared network — in the case of private SaaS, the data never leaves the user’s own network.
  • Role-based Access Control (RBAC): Access to queries and data objects within the DeltaStream Catalog is managed through DeltaStream’s RBAC. This gives users an easy-to-use and scalable system for properly governing and restricting access to their streaming data and workloads.

Workload isolation is essential for maintaining security and compliance in cloud products, with the added benefit of protecting workload performance. At DeltaStream, we have designed a stream processing platform that fully embraces workload isolation. If you’re interested in giving it a try, sign up for a free trial or contact us for a demo.

24 Apr 2024

Prepare Data for ClickHouse Using Apache Flink

ClickHouse and Apache Flink are two powerful tools used for high-performance data querying and real-time data processing. By using these tools together, businesses can significantly improve the efficiency of their data pipelines, enabling data teams to get insights into their datasets more quickly.

ClickHouse is a fast and resource efficient column-based database management system (DBMS). It specializes in online analytical processing (OLAP) and can handle many queries with minimal latency. With ClickPipes, users who have streaming data, such as data in Apache Kafka, can easily and efficiently build ClickHouse tables from their Kafka topics.

Apache Flink is a stream processing framework that allows users to perform stateful computations over their real-time data. It is fast, scalable, and has become an industry standard for event time stream processing. As a system with a rich connector ecosystem, Flink also integrates easily with Apache Kafka.

Typical Flink and ClickHouse Architecture

ClickHouse and Flink have been used together across the industry at companies like GoldSky, InstaCart, Lyft, and others. The typical infrastructure is as follows:

  1. Data from user product interactions, backend services, or database events via CDC are produced to a streaming data storage system (e.g. Kafka, Kinesis, Pulsar).
  2. Data in streaming storage is ingested by Flink, where it can be cleaned, filtered, aggregated, or otherwise sampled down.
  3. Flink produces the data back to the streaming storage where it is then loaded into ClickHouse via ClickPipes.
  4. Data scientists and data engineers can query ClickHouse tables for the latest data and take advantage of ClickHouse’s high-performance querying capabilities.

How Flink Adds More Value to ClickHouse

You may be wondering why Flink is needed in this architecture. Since ClickPipes already enable users to load data from streaming stores directly into ClickHouse, why not just skip Flink altogether?

The answer is that although ClickHouse is a highly optimized DBMS for querying data, performing queries such as aggregations over large data sets still forces the ClickHouse query engine to bring the relevant columns of every entry into memory to perform the aggregation, which can affect query latency. In this ClickHouse blog, they listed that the following query took 15 seconds to complete:

  SELECT
      project,
      sum(hits) AS h
  FROM wikistat
  WHERE date(time) = '2015-05-01'
  GROUP BY project
  ORDER BY h DESC
  LIMIT 10

One feature that ClickHouse has to reduce latencies for commonly run queries is Materialized Views (ClickHouse docs on creating Views). In their blog, they first created a materialized view to compute the result, then ran the same query against the materialized view. The result was computed in 3ms as opposed to 15s.
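
As a rough sketch of that idea (not the exact view from the ClickHouse blog; the view name, engine, and column choices are assumptions), a materialized view that pre-aggregates hits per project and day might look like the following, after which the original query only needs to scan the much smaller pre-aggregated table:

  -- Hypothetical pre-aggregation of the wikistat data per project and day
  CREATE MATERIALIZED VIEW wikistat_daily
  ENGINE = SummingMergeTree
  ORDER BY (date, project)
  POPULATE AS
  SELECT
      toDate(time) AS date,
      project,
      sum(hits) AS hits
  FROM wikistat
  GROUP BY date, project;

  -- the same top-10 question now scans far fewer rows
  SELECT project, sum(hits) AS h
  FROM wikistat_daily
  WHERE date = '2015-05-01'
  GROUP BY project
  ORDER BY h DESC
  LIMIT 10;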

Users who load their raw streaming data directly into a ClickHouse table can utilize materialized views to transform and prepare the data for consumption. However, these views need to be maintained by ClickHouse, and this overhead can add up, especially if many views are created. Having too many materialized views and putting too much computational load onto ClickHouse can lead to performance degradation, resulting in lower write throughput and increased query latencies.

Introducing a stream processing engine, such as Flink, lets users transform and prepare streaming data before loading it into ClickHouse. This helps alleviate pressure from ClickHouse and allows users to take advantage of the features that come with Flink. For instance, ClickHouse is known to struggle with queries that include joins. By using Flink, datasets can be joined and transformed in Flink before being loaded into ClickHouse. This way, instead of resources being diverted into data preparation queries, ClickHouse can focus on serving high-volume OLAP queries, which it excels at. Since Flink is built to be able to efficiently handle large and complex stream processing workloads, offloading complex computations from ClickHouse to Flink ultimately makes data available more quickly and reduces computational expenses.
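
As a hedged sketch of what that preparation step could look like in Flink SQL, the continuous windowed aggregation below rolls raw events up before they ever reach ClickHouse. The source and sink tables are hypothetical Kafka-backed tables assumed to be declared elsewhere, with event_time defined as a time attribute:

  -- Continuously roll raw events up into 1-minute buckets before loading into ClickHouse
  INSERT INTO page_hits_1m
  SELECT
      project,
      TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
      SUM(hits) AS hits
  FROM wikistat_raw
  WHERE project IS NOT NULL
  GROUP BY
      project,
      TUMBLE(event_time, INTERVAL '1' MINUTE);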

Building with Cloud Products

There are many benefits to utilizing this architecture for real-time analytics, but the reality for many companies is that the systems involved require too many resources to maintain and operate. This is the classic build vs buy dilemma. If your company does decide to go with the buy route, here are the cloud offerings we recommend for the 3 main components of this architecture:

  1. Streaming Storage: For Kafka compatible solutions, Confluent Cloud, RedPanda, Amazon MSK, and WarpStream are all viable options with different tradeoffs. Other streaming storage solutions include Amazon Kinesis and StreamNative for managed Pulsar among others.
  2. Stream Processing: DeltaStream is a great serverless solution to handle stream processing workloads. Powered by Apache Flink, DeltaStream users can benefit from the capabilities of Flink without having to worry about the complexity of learning, managing, and deploying Flink themselves.
  3. ClickHouse: ClickHouse Cloud is a serverless ClickHouse solution that is simple to set up, reliable, and has an intuitive SQL-based user interface.

Conclusion

In this post, we discussed a popular architecture involving Kafka, Flink, and ClickHouse that many companies have been adopting across the industry. These systems work together to enable high-performance analytics for real-time data. In particular, we touched on how Flink complements both Kafka and ClickHouse in this architecture.

If you’re looking for a cloud-based stream processing solution, DeltaStream is a serverless platform that is powerful, intuitive, and easy to set up. Stay tuned for our next blog post as we cover a use case using this architecture, with DeltaStream in place of Flink. Meanwhile, if you want to give DeltaStream a try yourself, you can sign up for a free trial.

17 Apr 2024

Data Warehouse vs Data Lake vs Data Lakehouse: What’s the difference?

As data technologies continue to advance, modern companies are ingesting, storing, and processing more data than ever before in order to make the most informed business decisions. While relational databases may have been enough for the data demands of 25 years ago, the continual increase in data operations has led to the emergence of new data technologies to support the era of big data. These days, there are a host of cloud products for data teams to choose from, many of which describe themselves as data warehouses, data lakes, or data lakehouses. With such similar names, it can be difficult to understand what vendors mean by each of these terms. In this post, we’ll break down what these terms mean, then discuss how real-time data streaming plays a role in the big data landscape.

What is a Data Warehouse?

A data warehouse is a storage and processing hub, primarily intended for generating reports and performing historical analysis. Data stored in a data warehouse is structured and well-defined, allowing the warehouse to perform fast, efficient analysis on its datasets. Data from relational databases, streaming storage systems, backend systems, and other sources is loaded into the data warehouse through ETL (extract, transform, load) processes, where it is cleaned and otherwise transformed to match the data integrity requirements expected by the warehouse. Most data warehouses allow users to access data through SQL clients, business intelligence (BI) tools, or other analytical tools.

Data warehouses are a great choice for organizations that primarily need to do historical data analytics and reporting on structured data. However, the ETL process adds complexity to the ingestion of data into the data warehouse and the requirements for structured data can make the system limiting for some use cases. Popular data warehouse vendors include Snowflake, Amazon Redshift, Google BigQuery, and Oracle Autonomous Data Warehouse.

What is a Data Lake?

A data lake is a massive storage system designed to store both structured and unstructured data at any scale. Similar to data warehouses, data lakes can ingest data from many different sources. However, data lakes are designed to be flexible so that users are able to store their raw data as-is, without needing to clean, reformat, or restructure the data first. By utilizing cheap object data storage and accommodating a wide range of data formats, data lakes make it easy for developers to simply store their data. This ultimately results in organizations accumulating large repositories of data that can be used to power use cases such as machine learning analytics, aggregations on large datasets, and exploring patterns in data from different data sources. One of the challenges of working with data lakes, however, is that downstream tasks need to make sense of differently formatted data to perform analysis on them. Further, if poorly maintained, data quality can very easily become an issue in data lakes. Tools like Apache Hadoop and Apache Spark are popular for doing analysis with a data lake, as these tools allow developers to write custom logic to make sense of different kinds of data, but they require more expertise to work with which limits the set of people who can feasibly work with the data lake.

Data lakes are a good choice for organizations that have a lot of data they need to store, accommodating both structured and unstructured data, but analyzing and maintaining the data lake can be a challenge. Data lakes are commonly built on cheap cloud storage solutions such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage.

What is a Data Lakehouse?

Data lakehouses merge the features of data warehouses and data lakes into a single system, hence the name. As data warehouses began adding more features found in data lakes, and as data lakes began adding more features found in data warehouses, the distinction between the two concepts became somewhat blurred. Before data lakehouses, organizations would typically need both a data lake for storage and a data warehouse for processing, but this setup could end up causing data teams a lot of overhead, as data from one location would often need to be processed or duplicated to the other location for data engineers to perform complete analyses. By merging the two concepts into a single system, data lakehouses aim to remove these silos and get the benefits of both worlds. Similar to data lakes, storing data in a data lakehouse is still cheap, scalable, and flexible, but metadata layers are also provided to enforce things like schemas and data validation where necessary. This allows the data lakehouse to still be performant for querying and analytics, like data warehouses are.

Since data is typically loaded into a data lakehouse in its raw format, it’s common for a medallion architecture to be used. The medallion architecture describes a series of queries or processing steps that transform raw data (bronze) into filtered and cleaned data (silver), and finally into business-ready aggregated results (gold), where the gold set of data can be easily queried for BI purposes.
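
As a simplified sketch in generic SQL with hypothetical table names, the medallion layers often amount to a chain of queries, each reading from the previous layer:

  -- bronze -> silver: drop malformed events and normalize fields
  CREATE TABLE silver_orders AS
  SELECT
      CAST(order_id AS BIGINT) AS order_id,
      LOWER(country)           AS country,
      amount
  FROM bronze_orders
  WHERE order_id IS NOT NULL;

  -- silver -> gold: business-ready aggregate for BI dashboards
  CREATE TABLE gold_revenue_by_country AS
  SELECT country, SUM(amount) AS total_revenue
  FROM silver_orders
  GROUP BY country;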

While the actual distinctions of what makes a system a data lakehouse instead of a data lake or data warehouse are somewhat nuanced, popular cloud vendors that have data lakehouse capabilities include Databricks Lakehouse Platform, Snowflake, Amazon Redshift Spectrum, and Google Cloud BigLake. While data lakehouses can handle a wide range of use cases, they can be complex to manage and still require skilled data experts to extract their full benefits.

Impacts of Real-time Streaming Data

As big data technologies continue to evolve, there has been an increasing demand for real-time data products. Users are becoming more accustomed to getting results instantly, and in order to support these use cases, companies have been adopting streaming technologies such as Apache Kafka and Apache Flink.

The Challenges of Streaming Data in the Current Ecosystem

Apache Kafka is a real-time event log that uses a publisher/consumer model. Microservices, clients, and other systems with real-time data produce events to Kafka topics, and the data events in these topics are then consumed by other real-time services that act on them. Kafka and other streaming storage systems typically set an expiration period for their data events, so in order to keep their real-time data long-term, organizations typically load this data into a data lake, data warehouse, or data lakehouse for analysis later on. However, streaming data coming from IoT sensors, financial services, and web interactions can add up to a large volume of data, and doing computation on the raw form of this data can be too slow or too computationally expensive to be viable. To address this, data engineers will typically do downsampling or other transformations to prepare the raw data for end users. In the case of data lakehouses, a medallion architecture, as mentioned earlier, is recommended to prepare the data for general consumption. For data lakes, a compute engine such as a data warehouse, or some Spark/Hadoop infrastructure, is needed to transform the data into more consumable results.

A setup that requires constant recomputation comes with an inherent tradeoff. Real-time data is constantly arriving in the data lake or data lakehouse, so users need to choose between recomputing results often, which can be computationally expensive, or recomputing less frequently, which results in stale datasets. Another issue with this setup is that computed results need to be stored as well. In the medallion architecture, for example, where raw data needs to go through multiple steps of processing before being ready for warehouse-like querying, this could involve storing the same data multiple times. This results in higher storage costs and higher latencies, as each processing step needs to be scheduled for recomputation.

Using Stream Processing to Prepare Streaming Data

This is where a stream processing solution, such as Apache Flink, can become beneficial. Stream processing jobs are long-lived and can produce analytical results incrementally, as new data events arrive. Contrast this with the medallion architecture, where new result datasets need to be completely recomputed. By adding stream processing to the data stack, streaming data can be filtered, transformed, and aggregated before it ever arrives at the data lake, data warehouse, or data lakehouse layer. This results in lower computational costs and lower end-to-end latencies.

One of the main burdens of Apache Flink and other stream processing frameworks is their complexity. Understanding how to develop, manage, scale, and provide fault tolerance for stream processing applications requires skilled personnel and time. With DeltaStream, we take all of that complexity away so that users can focus on their processing logic. DeltaStream is a fully managed serverless stream processing solution that is powered by Apache Flink. If you’re interested in how DeltaStream can help you manage your streaming data, schedule a demo with us or reach out to us on one of our socials.

13 Mar 2024

How to Read Kafka Source Offsets with Flink’s State Processor API

Apache Flink is one of the most popular frameworks for data stream processing. As a stateful processing engine, Flink is able to handle processing logic with aggregations, joins, and windowing. To ensure that Flink jobs are recoverable with exactly-once semantics, Flink has a state-of-the-art state snapshotting mechanism, so in the event of a failure, the job can be resumed from the latest snapshot.

In some advanced use cases, such as job migrations or job auditing, users may be required to inspect or modify their Flink job’s state snapshots (called Savepoints and Checkpoints in Flink). For this purpose, Flink provides the State Processor API. However, this API is not always straightforward to use and requires deep understanding of Flink operator states.

In this post, we’ll cover an example of using the State Processor API, broken up into 3 parts:

  1. Introduce our Flink job which reads data from an Apache Kafka topic
  2. Deep dive into how Flink’s KafkaSource maintains its state
  3. Use the State Processor API to extract the Kafka partition-offset state from the Flink job’s savepoint/checkpoint

If you want to see an example of the State Processor API in use, feel free to skip ahead to the last section.

Note that this post is a technical tutorial for those who want to get started with the State Processor API, and is intended for readers who already have some familiarity with Apache Flink and stream processing concepts.

Creating a Flink Job

Below is the Java code for our Flink job. This job simply reads from the “source” topic in Kafka, deserializes the records as simple Strings, then writes the results to the “sink” topic.

  1. public class FlinkTest {
  2.  
  3. public static void main(String[] args) throws Exception {
  4. StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  5.  
  6. KafkaSource<String> source = KafkaSource.<String>builder()
  7. .setBootstrapServers("localhost:9092")
  8. .setTopics("source")
  9. .setGroupId("my-group")
  10. .setStartingOffsets(OffsetsInitializer.latest())
  11. .setValueOnlyDeserializer(new SimpleStringSchema())
  12. .build();
  13.  
  14. DataStream<String> sourceStream = env.fromSource(
  15. source, WatermarkStrategy.forMonotonousTimestamps(), "Kafka Source")
  16. .uid("kafkasourceuid");
  17.  
  18. KafkaRecordSerializationSchema<String> serializer = KafkaRecordSerializationSchema.builder()
  19. .setValueSerializationSchema(new SimpleStringSchema())
  20. .setTopic("sink")
  21. .build();
  22. Properties kprops = new Properties();
  23. kprops.setProperty("transaction.timeout.ms", "300000"); // e.g., 5 mins
  24. KafkaSink<String> sink = KafkaSink.<String>builder()
  25. .setBootstrapServers("localhost:9092")
  26. .setRecordSerializer(serializer)
  27. .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  28. .setKafkaProducerConfig(kprops)
  29. .setTransactionalIdPrefix("txn-prefix")
  30. .build();
  31.  
  32. sourceStream.sinkTo(sink);
  33. env.enableCheckpointing(10000L);
  34. env.getCheckpointConfig().setCheckpointTimeout(60000);
  35. env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
  36. env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500L);
  37. env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1);
  38. env.getCheckpointConfig().setCheckpointStorage("file:///tmp/checkpoints");
  39. env.execute("tester");
  40. }
  41. }

There are a few important things to note from this Flink job:

  1. We are creating a KafkaSource object on line 6. The KafkaSource is then given to the StreamExecutionEnvironment’s fromSource method on line 14, which returns a DataStreamSource object representing the actual Flink source operator.
  2. We set the operator ID for our KafkaSource operator using the uid method. This is set on line 16. It’s best practice to set all the IDs for all Flink operators when possible, but we’re emphasizing this here because we’ll actually need to refer to this ID when we use the State Processor API to inspect the state snapshots.
  3. Flink checkpointing is turned on. On lines 33-38, we configure our Flink environment’s checkpointing settings to ensure that the Flink job will take periodic checkpoints. We’ll be analyzing these checkpoints later on.

Understanding the KafkaSource State

Before we inspect the checkpoints generated from our test Flink job, we first need to understand how the KafkaSource Flink operator saves its state.

As we’ve already mentioned, we’re using Flink’s KafkaSource to connect to our source Kafka data. Flink sources have 3 main components – Split, SourceReader, and SplitEnumerator (Flink docs). A Split represents a portion of data that a source consumes and is the granularity at which the source can parallelize reading data. For the KafkaSource, each Kafka partition corresponds to a separate Split, represented by the KafkaPartitionSplit class. The KafkaPartitionSplit is serialized by the KafkaPartitionSplitSerializer class. The logic for this serializer is pretty simple: it writes out a byte array containing the Split’s topic, partition, and offsets.

KafkaPartitionSplitSerializer’s serialize method:

  @Override
  public byte[] serialize(KafkaPartitionSplit split) throws IOException {
      try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
              DataOutputStream out = new DataOutputStream(baos)) {
          out.writeUTF(split.getTopic());
          out.writeInt(split.getPartition());
          out.writeLong(split.getStartingOffset());
          out.writeLong(split.getStoppingOffset().orElse(KafkaPartitionSplit.NO_STOPPING_OFFSET));
          out.flush();
          return baos.toByteArray();
      }
  }

At runtime, Flink will instantiate all of the operators, including the SourceOperator objects. For each stateful Flink operator, there is a name associated with each stateful object. In the case of a source operator, the name associated with the split states is defined by SPLITS_STATE_DESC.

  static final ListStateDescriptor<byte[]> SPLITS_STATE_DESC =
          new ListStateDescriptor<>("SourceReaderState", BytePrimitiveArraySerializer.INSTANCE);

We can inspect the SourceOperator class further to see where these split states are initialized, in the initializeState method.

SourceOperator’s initializeState method:

  @Override
  public void initializeState(StateInitializationContext context) throws Exception {
      super.initializeState(context);
      final ListState<byte[]> rawState =
              context.getOperatorStateStore().getListState(SPLITS_STATE_DESC);
      readerState = new SimpleVersionedListState<>(rawState, splitSerializer);
  }

The Flink state that source operators use is the SimpleVersionedListState, which uses the SimpleVersionedSerialization class. In the SimpleVersionedListState class, the serialize method calls the writeVersionAndSerialize method to ultimately serialize the state.

Finally, if we inspect the writeVersionAndSerialize method in the SimpleVersionedSerialization, we can see that before writing the actual data associated with our source operator, we first write out a few bytes for the serializer version and the data’s length.

SimpleVersionedSerialization’s writeVersionAndSerialize method:

  public static <T> void writeVersionAndSerialize(
          SimpleVersionedSerializer<T> serializer, T datum, DataOutputView out)
          throws IOException {
      checkNotNull(serializer, "serializer");
      checkNotNull(datum, "datum");
      checkNotNull(out, "out");

      final byte[] data = serializer.serialize(datum);

      out.writeInt(serializer.getVersion());
      out.writeInt(data.length);
      out.write(data);
  }

Let’s quickly recap the important parts from above:

  1. The KafkaSource operator stores its state in KafkaPartitionSplit objects.
  2. The KafkaPartitionSplit keeps track of the current topic, partition, and offset that the KafkaSource has last processed.
  3. When Flink savepointing/checkpointing occurs, a byte array representing the KafkaSource state gets written to the state snapshot. The byte array has a header which includes the serializer version and the length of data. Then the actual state data, which is a serialized version of the KafkaPartitionSplit, makes up the rest of the state byte array.

Now that we have some idea of how data is being serialized into Flink savepoints and checkpoints, let’s see how we can use the State Processor API to extract the Kafka source operator information from these state snapshots.

State Processor API to Inspect Kafka Source State

For Maven projects, you can add the following dependency to your pom.xml file to start using the Flink State Processor API.

  <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-state-processor-api</artifactId>
      <version>1.18.0</version>
  </dependency>

The following class showcases the full example of how we can use the State Processor API to read KafkaSource offsets from a Flink savepoint or checkpoint.

  1. public class StateProcessorTest {
  2.  
  3. public static void main(String[] args) throws Exception {
  4. StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  5.  
  6. String savepointPath = Path.of("/tmp/checkpoints/609bc335486ca6cfcc8692e4c1ff8782/chk-8").toString();
  7. SavepointReader savepoint = SavepointReader.read(env, savepointPath, new HashMapStateBackend());
  8. DataStream<byte[]> listState = savepoint.readListState(
  9. OperatorIdentifier.forUid("kafkasourceuid"),
  10. "SourceReaderState",
  11. PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
  12. CloseableIterator<byte[]> states = listState.executeAndCollect();
  13. while (states.hasNext()) {
  14. byte[] s = states.next();
  15. KafkaPartitionSplitSerializer serializer = new KafkaPartitionSplitSerializer();
  16. KafkaPartitionSplit split = serializer.deserialize(serializer.getVersion(), Arrays.copyOfRange(s, 8, s.length));
  17. System.out.println(
  18. String.format("topic=%s, partition=%s, startingOffset=%s, stoppingOffset=%s, topicPartition=%s",
  19. split.getTopic(), split.getPartition(),
  20. split.getStartingOffset(), split.getStoppingOffset(), split.getTopicPartition()));
  21. }
  22.  
  23. System.out.println("DONE");
  24. }
  25. }

First, we’ll load the savepoint. The SavepointReader class from the State Processor API allows us to load a full savepoint or checkpoint. On line 7, we are loading a checkpoint that was created in “/tmp/checkpoints” as a result of running the test Flink job. As we mentioned in the previous section, the source operators use a SimpleVersionedListState, which the SavepointReader can read using the readListState method. When reading the list states, we need to know 3 things:

  1. Operator ID: “kafkasourceuid” set in our test Flink job
  2. State Name: “SourceReaderState” set in Flink’s SourceOperator class
  3. State TypeInformation: PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO set in Flink’s SourceOperator class

After we get our list states, we can simply iterate through each of the states, which are given as byte arrays. Since the SimpleVersionedSerialization serializer first writes the version and data length, which we don’t care about, we need to skip those headers. You’ll see on line 16 that we deserialize the byte array as a KafkaPartitionSplit after skipping the first 8 bytes of the state byte array.

Running the above code example gives the following result:

  topic=source, partition=0, startingOffset=3, stoppingOffset=Optional.empty, topicPartition=source-0
  DONE

Conclusion

In this post, we explained how Flink’s KafkaSource state is serialized into savepoints and covered an example of reading this state with the State Processor API. Flink’s State Processor API can be a powerful tool to analyze and modify Flink savepoints and checkpoints. However, it can be confusing for beginners to use and requires some in-depth knowledge about how the Flink operators manage their individual states. Hopefully this guide will help you understand the KafkaSource and serve as a good tutorial for getting started with the State Processor API.

For more content about Flink and stream processing, check out more content from DeltaStream’s blog. DeltaStream is a platform that simplifies the unification, processing, and governance of streaming data.

13 Feb 2024

Stream Processing for IoT Data

The Internet of Things (IoT) refers to sensors and other devices that share and exchange data over a network. IoT has been on the rise for years and only seems to continue growing in popularity alongside other technological advances, such as 5G cellular networks and more “smart” devices. From tracking patient health to monitoring agriculture, the applications for IoT are plentiful and diverse. Other sectors where IoT is used include security, transportation, home automation, and manufacturing.

Oracle defines Big Data as “data that contains greater variety, arriving in increasing volumes and with more velocity.” This definition is often summarized with the 3 Vs – volume, velocity, and variety. IoT data certainly matches this description, as numerous sensors and devices can emit large volumes of data at high velocity.

A platform capable of processing IoT data needs to be scalable in order to keep up with the volume of Big Data. It’s very common for IoT applications to have new sensors added over time. Consider a drone fleet for package deliveries as an example – you may start off with 10 or 20 drones, but as demand for deliveries increases, the size of your drone fleet can grow by orders of magnitude. The underlying systems processing this data need to be able to scale horizontally to match the increase in data volume.

Many IoT use cases such as tracking patient health and monitoring security feeds require low latency insights. Sensors and devices providing real-time data often need to be acted on in real-time as well. For this reason, streaming and stream processing technologies have become increasingly popular and perhaps essential for solving these use cases. Streaming storage technologies such as Apache Kafka, Amazon Kinesis, and RedPanda can meet the low latency data transportation requirements of IoT. On the stream processing side, technologies such as Apache Flink and managed solutions such as DeltaStream can provide low latency streaming analytics.

IoT data can also come in various types and structures. Different sensors can have different data formats. Take a smart home, for example: the cameras in a smart home will likely send very different data from a light or a thermometer. However, these sensors are all related to the same smart home. It’s important for a data platform handling IoT use cases to be able to join across different data sets and handle any variations in data structure, format, or type.

DeltaStream as a Streaming Analytics Platform and a Streaming Database

DeltaStream is a platform to unify, process, and govern streaming data. DeltaStream sits as the compute and governance layer on top of streaming storage systems such as Kafka. Powered by Apache Flink, DeltaStream is a fully managed solution that can process streaming data with very low latencies.

In this blog post we’ll cover 2 examples to show how DeltaStream can solve real-time IoT use cases. In the first use case, we’ll use DeltaStream’s Materialized Views to build a real-time request driven application. For the second use case, we’ll use DeltaStream to power real-time event-driven pipelines.

Use Case Setup: Transportation Sensor Data

For simplicity, both use cases will use the same source data. Let’s assume that our data is available in Apache Kafka and represents updates and sensor information for a truck fleet. We’ll first define Relations for the data in 2 Kafka topics.

The first Relation represents truck information. This includes an identifier for the truck, the speed of the truck, which thermometer is in the truck, and a timestamp for this update event represented as epoch milliseconds. Later on, we will use this event timestamp field to perform a join with data from other sensors. Since we expect regular truck information updates, we’ll define a Stream for this data.

Create truck_info Stream:

  CREATE STREAM truck_info (
      event_ts BIGINT,
      truck_id INT,
      speed_kmph INT,
      thermometer VARCHAR
  ) WITH (
      'topic' = 'truck_info', 'value.format' = 'json', 'timestamp' = 'event_ts'
  );

The second Relation represents a thermometer sensor’s readings. The fields include an identifier for the thermometer, the temperature reading, and a timestamp for when the temperature was taken that is represented as epoch milliseconds. Later on, the event timestamp will be used when joining with the truck_info Stream. We will define a Changelog for this data using sensor_id as the primary key.

Create temperature_sensor Changelog:

  CREATE CHANGELOG temperature_sensor (
      "time" BIGINT,
      temperature_c INTEGER,
      sensor_id VARCHAR,
      PRIMARY KEY (sensor_id)
  ) WITH (
      'topic' = 'temperature_sensor', 'value.format' = 'json', 'timestamp' = 'time'
  );

Using the Relations we have just defined, we want to find out what the latest temperature readings are in each truck. We can achieve this by using a temporal join to enrich our truck_info updates with the latest temperature readings from the temperature_sensor Changelog. The result of this join will be a Stream of enriched truck information updates with the latest temperature readings in the truck. The following SQL statement will launch a long-lived continuous query that will continually join these two Relations and write the results to a new Stream that is backed by a new Kafka topic.

Create truck_info_enriched Stream using CSAS:

  CREATE STREAM truck_info_enriched AS
  SELECT
      truck_info.event_ts,
      truck_info.truck_id,
      truck_info.speed_kmph,
      temp.sensor_id AS thermometer,
      temp.temperature_c
  FROM truck_info
  JOIN temperature_sensor temp
  ON truck_info.thermometer = temp.sensor_id;

While a truck fleet in a real-world environment will likely have many more sensors, such as cameras, humidity sensors, and others, we’ll keep this use case simple and just use a thermometer as the additional sensor. However, users could continue to enrich their truck information events with joins for each additional sensor data feed.

Use Case Part 1: Powering a real-time dashboard

Monitoring and health metrics are essential for managing a truck fleet. Being able to check on the status of particular trucks and generally see that trucks are doing fine can provide peace of mind for the truck fleet manager. This is where a real-time dashboard can be helpful – to have the latest metrics readily available on the status of the truck fleet.

So for our first use case, we’ll use Materialized Views to power a real-time dashboard. By materializing our truck_info_enriched Stream into a queryable view, we can build charts that can query the view and get the latest truck information. We’ll build the Materialized View in two steps. First we’ll define a new Changelog that mirrors the truck_info_enriched Stream, then we’ll create a Materialized View from this Changelog.

Create truck_info_enriched_changelog Changelog:

  CREATE CHANGELOG truck_info_enriched_changelog (
      event_ts BIGINT,
      truck_id INT,
      speed_kmph INT,
      thermometer VARCHAR,
      temperature_c INTEGER,
      PRIMARY KEY (truck_id)
  ) WITH (
      'topic' = 'truck_info_enriched',
      'value.format' = 'json'
  );

Create truck_info_mview Materialized View using CVAS:

  CREATE MATERIALIZED VIEW truck_info_mview AS
  SELECT * FROM truck_info_enriched_changelog;

Note that we could have created this Materialized View sourcing from the truck_info_enriched Stream, but if we created the Materialized View from the Stream, then each event would be a new row in the Materialized View (append mode). Instead we are building the Materialized View from a Changelog so that each event will add a new row or update an existing one based on the Changelog’s primary key (upsert mode). For our example, we only need to know the current status of each truck, so building the Materialized View with upsert mode better suits our use case.

A continuous query will power this Materialized View, constantly ingesting records from the truck_info_enriched Stream and sinking the results to truck_info_mview. Then, we can write queries to SELECT from the Materialized View. A dashboard can easily be built that simply queries this Materialized View to get the latest statuses for trucks. Here are some example queries that might be helpful when building a dashboard for the truck fleet.

Query to get truck IDs with the highest 10 temperatures:

  SELECT truck_id, temperature_c
  FROM truck_info_mview
  ORDER BY temperature_c DESC
  LIMIT 10;

Query to get all information about a truck:

  SELECT *
  FROM truck_info_mview
  WHERE truck_id = 3;

Query to count the number of trucks that are speeding:

  SELECT COUNT(truck_id) AS num_speeding_trucks
  FROM truck_info_mview
  WHERE speed_kmph > 90;

Use Case Part 2: Building a real-time alerting pipeline

While it’s great to be able to pull real-time metrics for our truck fleet (Use Case Part 1), there are also situations where we may want the truck update events themselves to trigger actions. In our example, we’ve included thermometers as one of the sensors in each of our delivery trucks. Groceries, medicines, and some chemicals need to be delivered in refrigerated trucks. If the trucks aren’t able to stay within a desired temperature range, it could cause the items inside to go bad or degrade. This can be quite serious, especially for medicines and hazardous materials that can have a direct impact on people’s health.

For our second use case, we want to build out a streaming analytics pipeline to power an alerting service. We can use a CSAS to perform real-time stateful transformations on our data set, then sink the results into a new Stream backed by a Kafka topic. Then the sink topic will contain alertable events that the truck fleet company can feed into their alerting system or other backend systems. Let’s stick to our refrigeration example and write a query that detects if a truck’s temperature exceeds a certain threshold.

Create overheated_trucks Stream using CSAS:

  CREATE STREAM overheated_trucks AS
  SELECT * FROM truck_info_enriched WHERE temperature_c > 10;

Submitting this CSAS will launch a long-lived continuous query that ingests from the truck_info_enriched Stream, filters for only events where the truck’s temperature is greater than 10 degrees Celsius, and sinks the results to a new Stream called overheated_trucks. Downstream, the truck fleet company can ingest these records and send alerts to the correct teams or use these records to trigger actions in other backend systems.

Processing IoT Data with DeltaStream

IoT data can be challenging to process due to the high volume of data, the inherent real-time requirements of many IoT applications, and the distributed nature of collecting data from many different sources. While we often treat IoT use cases as their own category, they really span many sectors and use cases. That’s why using a real-time streaming platform, such as DeltaStream, that is able to keep up with the processing demands of IoT and can serve as both a streaming database and streaming analytics platform is essential.

If you want to learn more about how DeltaStream can help your business, schedule a demo with us. We also have a free trial available if you want to try out DeltaStream yourself.

01 Feb 2024

Secure Sharing for Real-Time Data Streams

Real-time data has become increasingly valuable across many industries, including healthcare, IoT, machine learning, and cybersecurity. Stream processing frameworks such as Apache Flink enable organizations to gain immediate insights into their streaming data. For example, in the case of IoT, companies can understand within sub-seconds if a sensor has failed. However, what if you want to share that processed data (stream) with another team who needs it for a downstream application? The processed stream is what we refer to as a Data Product, and sharing these Data Products is difficult because of the underlying permissioning in a streaming store such as Kafka. If we build a system for governing these source and processed streams, we unlock the ability to securely share Data Products. This enables cross-functional and external sharing use cases.

What does it mean to Securely Share Data?

Secure data sharing is a technique that enables data owners to share their data resources in a secure and controlled way with data consumers. Secure sharing for real-time data streams involves the following aspects:

  • Data privacy: ensuring that the data streams are protected from unauthorized access, disclosure, and modification, and that the data owners have control over who can access their data and for what purpose.
  • Data access: ensuring that the data streams are available, accessible, and usable for the intended data consumers, and that the data sharing is scalable, efficient, and cost-effective.
  • Data auditing: ensuring that the usage and sharing of data streams is transparent, accountable, and auditable.

Secure data sharing improves both data collaboration and data innovation. With secure data sharing, data owners and consumers can safely exchange real-time data and insights, encouraging cross team collaboration and extracting more value out of the otherwise siloed data.

Current State of Sharing Streaming Data

Managing data authorization is one of the most common ways to securely share real-time streaming data. Consider Apache Kafka, for example, one of the most popular real-time event streaming storage systems. Data events in Kafka are organized into topics, and access to these topics is managed through Access Control Lists (ACLs). However, at scale, ACLs are difficult to manage and prone to mistakes.

To process real-time streaming data, users often look to stream processing frameworks like Flink. ACLs need to be configured in the streaming store to allow the Flink job to read from and write to topics. Then, to share the topic containing the processed data with other users, the ACL needs further additions. There are many streaming and stream processing technologies that make up a streaming data ecosystem, and all of these systems put stress on the ACLs used to control access to data streams.

In order to overcome these challenges, there needs to be a data platform that can manage data processing, access control, and data sharing capabilities at a higher level.

Best Practices for Securely Sharing Streaming Data

There are two main types of data sharing that data platforms should support – internal data sharing and 3rd party data sharing.

Internal Data Sharing

Internal data sharing refers to the process of making data accessible to other users, teams, or applications within an organization. Secure internal data sharing capabilities pair nicely with data meshes, where data owners can be any user or team within an organization. These data owners can authorize who has access to their data and determine what access permissions each party has.

One of the common ways we see popular data systems, such as Snowflake, provide internal data sharing capabilities is through Role-Based Access Control (RBAC). RBAC is scalable and easy to understand, making it an effective tool for controlling access to data. While access to streaming data has typically been defined with ACLs, an RBAC-based approach would address the scalability issues seen in streaming storage systems today. A data platform that sits on top of streaming storage systems, such as Kafka, and provides an RBAC interface for managing access to the underlying Kafka topics would provide a much more intuitive access control experience for real-time data users.
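
To make the contrast with ACLs concrete, the statements below sketch what RBAC-style grants over topic-backed data might look like. This is generic, illustrative SQL rather than any specific platform’s DDL, and the role, user, and object names are invented:

  -- Grant read access once at the role level; new users simply receive the role
  CREATE ROLE fleet_analyst;
  GRANT SELECT ON truck_telemetry TO ROLE fleet_analyst;
  GRANT ROLE fleet_analyst TO USER alice;

Adding a new analyst then requires a single role grant, rather than an ACL entry for every topic they need.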

3rd Party Data Sharing

3rd party data sharing refers to the process of making data accessible to parties outside of an organization. This enables organizations to collaborate with one another without giving direct access to their data ecosystems. In the current streaming landscape, this kind of data sharing is not natively supported. Kafka, for instance, only supports internal data sharing through ACLs.

Snowflake, for example, enables secure 3rd party data sharing through the concept of shares, allowing data objects in a Snowflake organization to be made shareable with other Snowflake organizations. The data providers in this case can specify which accounts can consume from these data objects. This is just one example out of many possible ways to implement 3rd party sharing. Providing such functionality for streaming data would unlock opportunities for organizations to collaborate with real-time streaming data as well.
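
As a rough sketch, Snowflake-style sharing looks something like the statements below; the database, table, and account identifiers are made up, and the exact grants required may differ, so treat this as illustrative rather than a copy-paste recipe:

  CREATE SHARE fleet_share;
  GRANT USAGE ON DATABASE fleet_db TO SHARE fleet_share;
  GRANT USAGE ON SCHEMA fleet_db.public TO SHARE fleet_share;
  GRANT SELECT ON TABLE fleet_db.public.truck_telemetry TO SHARE fleet_share;
  ALTER SHARE fleet_share ADD ACCOUNTS = partner_org.partner_account;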

Conclusion

Secure sharing for real-time data streams improves data accessibility and enhances collaboration with real-time analytics, while enabling teams to maintain the privacy of their data assets. However, the data sharing capabilities in the current state of real-time streaming data are not up to par with the capabilities seen in the batch world for at-rest data. 

DeltaStream is a stream processing data platform that aims to provide an intuitive way to share streaming data, both internally and with 3rd parties. At DeltaStream, we use RBAC as the approach for access control and provide capabilities for sharing data between organizations. Data Governance and secure data sharing are essential for providing an easy-to-use data ecosystem that allows users to focus on their data products. If you are interested in learning more about DeltaStream, reach out to us for a demo or sign up for a free trial.

24 Jan 2024

Min Read

The Importance of Data Unification in Real-Time Stream Processing

In our previous blog post, Streaming Data Governance in DeltaStream, we discussed what Data Governance is, why it’s important, and how it works hand-in-hand with Data Unification in the context of real-time streaming data. In this blog, we’ll dive deeper into the concept of Data Unification.

What is Data Unification and why is it Important?

Stream processing solutions have relied on connecting directly to streaming storage systems, such as Apache Kafka and RedPanda, and running transformations on a series of topics. For the stream processing user, this requires an understanding of where and how the data (i.e. topics) are stored. An ops team would then need to authorize access to these data assets. The current stream processing approach is similar to running individual Spark jobs as opposed to using a platform such as Databricks.

The fundamental problems with this approach include the following:

  1. Access to streaming storage is limited to a small number of individuals due to complex and disjointed permissioning.
  2. Data analysts are required to understand which data storage systems and which topics contain the data needed for analysis.
  3. Stream processing workloads are created in silos by different teams. It’s very common for teams to be running their own Flink or other stream processing workloads.
  4. Reusing and sharing new data streams and data assets is difficult without a central platform that enables collaboration.

Data Unification for stream processing is needed to provide an organizational layer on top of the streaming stores, one that provides a complete view of streaming, relational, and other data stores. Once a unified view is created, it unlocks the ability to seamlessly govern, access, and run stream processing workloads across an organization’s data footprint.

The Current Data Landscape

The technologies that make up the current landscape of real-time streaming data were not built with Data Unification in mind. This isn’t a criticism of these technologies, as they have enabled real-time data products and solve complex distributed systems problems, but more of a statement to point out what’s been missing in the streaming data landscape.

Let’s consider Apache Kafka, which is currently the most popular streaming storage system. Topics are the organizational objects in Kafka that hold data events, and these topics are stored in clusters. Access to these topics is granted through Access Control Lists (ACLs). In most cases, organizations with real-time data will have multiple Kafka clusters or utilize a combination of streaming storage solutions, including RedPanda, Kinesis, and Apache Pulsar. Performing streaming analytics on these data sources requires users to work directly with the storage system APIs or use a stream processing framework, such as Apache Flink. This setup has 3 problems for Data Unification:

  1. Managing access control through ACLs is cumbersome and error prone. Access must be given per user, per topic. As the number of users and the number of topics grow, these lists can easily become unmanageable, resulting in security risks. Also, ACLs can’t be applied across multiple Kafka clusters, so access control operations are still siloed to the individual clusters or systems.
  2. Organization of data assets (topics) is flat. Without any namespacing capabilities, there is no way to organize or categorize topics. A flat namespace results in challenges with data discovery and data governance.
  3. Connecting directly to source/sink topics for each stream processing job is redundant and error prone. Writing stream processing applications that interact directly with the storage layer results in a large overhead to configure/maintain permissions. This can easily lead to mistakes in providing the wrong data access, resulting in organizations limiting the set of users that have access to data assets.

In the batch world, products like Databricks and Snowflake address these exact issues. Let’s consider Databricks for example. Databricks’s Unity Catalog provides a hierarchical namespace to Databricks Tables, such that each Table exists within a Catalog and Schema. While the Table is backed by parquet files existing in some S3 location (in the case of using Databricks on AWS), the logical representation of the Table in the Unity Catalog can be namespaced into any Catalog and Schema. This is very similar to the organizational structure of relational databases. Another similarity to relational databases is Databricks’s support of RBAC on their Unity Catalog. A particular Databricks user or team can be authorized access to a Catalog, Schema, or Table. Databricks also allows users to define SQL queries for data processing, which utilizes Apache Spark behind the scenes. As a result of having a Unity Catalog to view all of a user’s Tables, when a user writes such SQL queries, the queries can simply source from or write to Tables in the Unity Catalog. By operating at this logical abstraction layer with the Unity Catalog, the burden of obtaining S3 access, setting up connectivity to S3, and interacting directly with the storage layer is eliminated for users.
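
For illustration, the snippet below shows the three-level namespace and an RBAC-style grant as they might appear in Databricks SQL; the catalog, schema, table, and group names are hypothetical:

  -- catalog.schema.table addressing via the Unity Catalog namespace
  SELECT * FROM main.fleet.truck_telemetry;

  -- privilege granted to a group rather than to individual users
  GRANT SELECT ON TABLE main.fleet.truck_telemetry TO `analysts`;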

When compared to a data streaming system like Kafka, it becomes clear that Kafka is more akin to a physical storage system than a platform like Databricks, which offers products built on top of the storage layer. What is missing in real-time data stream processing is a solution that builds on top of streaming storage systems, such as Kafka and Kinesis, and allows users to organize and represent their streaming data in a single unified data catalog.

DeltaStream as the platform for Streaming Data Unification

DeltaStream is a complete stream processing platform to Unify, Process, and Govern all of your streaming data. Taking after the examples in the batch world, DeltaStream utilizes concepts such as data catalogs and RBAC to provide a unified and governed data platform for real-time streaming data.

DeltaStream can connect to any data streaming storage system, including Kafka, RedPanda, and Kinesis, as well as relational databases, such as PostgreSQL. A Store in DeltaStream is used to define the connection to these storage systems. Once a Store has been defined, users can create Relations that are backed by the data objects in each Store. For example, a Relation can be created that is backed by a topic in a Kafka Store. These Relations are organized into DeltaStream’s Streaming Catalog. The Streaming Catalog has two levels of namespacing in the form of Databases and Schemas. Relations can belong to any Database and Schema, and Relations from different storage systems can be co-located into the same Database and Schema. Since Relations in the Streaming Catalog can be backed by data in different storage systems, the Streaming Catalog becomes the singular unified place to view and interact with all of this data.
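
The sketch below shows the general shape of this workflow, from defining a Store to registering a Relation inside a Database and Schema. It is only an approximation of the DDL involved (see the DeltaStream documentation for the exact syntax), and the store, namespace, and topic names are invented:

  -- Define a connection to a Kafka cluster
  CREATE STORE kafka_prod WITH ('type' = 'KAFKA', 'uris' = 'broker-1:9092');

  -- Namespaces in the Streaming Catalog
  CREATE DATABASE logistics;
  CREATE SCHEMA logistics.fleet;

  -- A Relation backed by a topic in the Kafka Store
  CREATE STREAM logistics.fleet.truck_info (
    truck_id BIGINT,
    temperature_c DOUBLE,
    event_ts TIMESTAMP
  ) WITH ('store' = 'kafka_prod', 'topic' = 'truck_info', 'value.format' = 'JSON');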

With a Streaming Catalog, DeltaStream users can write queries that read from and write to Relations in the Streaming Catalog. Access to the Streaming Catalog is managed through a robust RBAC model that grants privileges only to the data a user or team needs. With RBAC, users can easily and securely share their data assets. By writing DeltaStream queries using Relations, users can simply focus on their business logic rather than dealing with the underlying storage layer connectivity.

Overview of DeltaStream’s Streaming Catalog

Bring Data Unification to your Organization

In this post, we covered what Data Unification is and why it is important in the context of streaming data and stream processing. The part of Data Unification that often gets overlooked is having a unified view in the form of a data catalog. With a unified data catalog, the Data Governance and data processing features built on top of it become simpler and more intuitive to use. This is exactly why DeltaStream not only connects to different data storage systems, but also provides a Streaming Catalog that gives users this unified view of their data.

If you want to learn more about DeltaStream, reach out to us and schedule a demo.

05 Dec 2023

Min Read

Exploring Pattern Recognition using MATCH_RECOGNIZE

Pattern recognition is a common use case in data processing. Detecting trend reversals, identifying anomalies, and finding sequences in data are all examples of pattern recognition problems. In SQL, Row Pattern Recognition (RPR) became part of the SQL standard in 2016 (ISO/IEC 9075:2016) and introduced the MATCH_RECOGNIZE syntax. Using this syntax, users can write concise SQL queries to solve pattern recognition problems.

While pattern recognition is an important challenge to solve in the batch world, there are also many pattern recognition use cases in the real-time context. That’s why, as a leading streaming platform, DeltaStream supports MATCH_RECOGNIZE in its SQL syntax.

In our previous blog post Analyzing Real-Time NYC Bus Data with DeltaStream, we showed how we could write a SQL query in DeltaStream to solve the pattern recognition use case of detecting bus trips where the delays are significantly increasing. In this blog post we will do a deep dive into that query and explain the purpose and meaning behind each line.

DSQL MATCH_RECOGNIZE Query

Below is the same pattern recognition SQL query from our Analyzing Real-Time NYC Bus Data with DeltaStream post:

  1. CREATE STREAM trips_delay_increasing AS
  2. SELECT
  3. trip,
  4. vehicle,
  5. start_delay,
  6. end_delay,
  7. start_ts,
  8. end_ts,
  9. CAST(FROM_UNIXTIME((start_epoch_secs + end_epoch_secs) / 2) AS TIMESTAMP) AS avg_ts
  10. FROM trip_updates
  11. MATCH_RECOGNIZE(
  12. PARTITION BY trip
  13. ORDER BY "ts"
  14. MEASURES
  15. C.row_timestamp AS row_timestamp,
  16. C.row_key AS row_key,
  17. C.row_metadata AS row_metadata,
  18. C.vehicle AS vehicle,
  19. A.delay AS start_delay,
  20. C.delay AS end_delay,
  21. A.ts AS start_ts,
  22. C.ts AS end_ts,
  23. A.epoch_secs AS start_epoch_secs,
  24. C.epoch_secs AS end_epoch_secs
  25. ONE ROW PER MATCH
  26. AFTER MATCH SKIP TO LAST C
  27. PATTERN (A B C)
  28. DEFINE
  29. A AS delay > 0,
  30. B AS delay > A.delay + 30,
  31. C AS delay > B.delay + 30
  32. ) AS MR WITH ('timestamp'='ts')
  33. QUERY WITH ('state.ttl.millis'='3600000');

Breakdown of MATCH_RECOGNIZE Query

First of all, you can find all of our documentation on MATCH_RECOGNIZE here. In this section, we’ll break down the above query and discuss the thought process behind each line.

In this query, the source Stream is bus trip updates with a field called delay which represents the number of seconds a bus is delayed on its current route. The goal of the query is to output bus trips where the delay is significantly increasing so that we can get a better understanding of when buses will actually arrive at their upcoming stops.

CSAS (lines 1-2)

This query is what we call a CREATE STREAM AS SELECT (CSAS) query, where we are creating a new Stream that will be the output of running the continuous SELECT statement. Running a continuous query will launch a long-lived stream processing job behind the scenes and write the results to the physical storage that is backing the new Stream. As explained in the original blog post Analyzing Real-Time NYC Bus Data with DeltaStream, a Kafka topic is the physical storage layer that is backing this Stream.

Projection (lines 2-9)

The first few lines of the SELECT query are the fields that are being projected for the resulting Stream trips_delay_increasing. These fields are made available by the MEASURES and PARTITION BY clauses in the MATCH_RECOGNIZE statement. The following fields are being projected:

  • trip and vehicle identify the particular bus trip that is experiencing increasing delays
  • start_delay, start_ts, end_delay, and end_ts give insights into the pattern that was matched and how fast delays are increasing
  • avg_ts is the midpoint between start_epoch_secs and end_epoch_secs, which also represents the midpoint of the matched pattern. To evaluate this field, we use built-in functions to convert from epoch seconds, an INTEGER, into a TIMESTAMP. In the original blog post, this field was used in a follow-up query to find the positions of the buses during the time that the delays were increasing.

Source (lines 10-11, 32)

The FROM clause defines the source of data for the query. In this query, we are sourcing from the result of the MATCH_RECOGNIZE clause on the trip_updates Stream. This source also has a WITH clause to specify a timestamp field, ts, which is a field in the trip_updates Stream. By specifying the timestamp field, we are setting the event time of incoming records to be equal to the value of that field, so later on in the ORDER BY clause we can use the correct timestamp for ordering events in our patterns.

MATCH_RECOGNIZE – PARTITION BY (line 12)

The first line in the MATCH_RECOGNIZE query is the optional PARTITION BY clause. This subclause groups the underlying data based on the partitioning expression, and optimizes DeltaStream’s compute resources for parallel processing of the source data. In this query, we are partitioning the events by the trip field, so any detected patterns are unique to a particular bus trip. In other words, we want to know for each particular bus trip if there are increasing delays. The PARTITION BY clause is necessary for our use case because the delays for one bus trip are not relevant to other bus trips. Note that the fields in the PARTITION BY clause are available for projection, as shown in this query by selecting the trip field in the SELECT statement.

MATCH_RECOGNIZE – ORDER BY (line 13)

The ORDER BY clause is required for MATCH_RECOGNIZE. This clause defines the order in which the data should be sorted for each partition before they are evaluated for matched patterns. For our query, the ordering field is ts, so bus trip updates will be ordered in ascending order according to the value of the ts field. One requirement for the ORDER BY subclause is that the ordering field must be the same as the timestamp field for the source Relation, which is set on line 32 (also mentioned in the Source section above).

MATCH_RECOGNIZE – MEASURES (lines 14-24)

The MEASURES clause defines the output schema for the MATCH_RECOGNIZE statement and has a similar meaning to a SELECT clause. The fields specified in the MEASURES subclause are made available to the SELECT clause. The MEASURES subclause and the DEFINE subclause (lines 28-31) are closely related, as the MEASURES subclause projects fields from rows defined by the DEFINE subclause. For example, on line 19 we define start_delay as A.delay. In this case, the delay for the row matching A’s definition is being projected as start_delay, whereas on line 20 the delay for the row matching C’s definition is being projected as end_delay. There are 3 fields in the MEASURES subclause that aren’t being used in our query’s SELECT clause. These are the row metadata columns – row_timestamp, row_key, and row_metadata. Since the MATCH_RECOGNIZE operator alters the projection columns of its input Relation, a PATTERN variable must be chosen for these special fields as part of the MEASURES subclause, which we do on lines 15-17. See the MATCH_RECOGNIZE – PATTERN section below for more information on the PATTERN variables.

MATCH_RECOGNIZE – Output Mode (line 25)

ONE ROW PER MATCH is the only supported output mode, which means for a given sequence of events that matches our pattern, we should output one result. So in our query, for a bus trip with significantly increasing delays, we should output one event for this trip.

MATCH_RECOGNIZE – AFTER MATCH strategy (line 26)

The after match strategy defines where to begin looking for the next pattern. In this query, we specify AFTER MATCH SKIP TO LAST C, which means after finding a pattern match, look for the next pattern match starting with the last event of the current match. Other match strategies could inform our query to start looking for the next pattern starting from a different event, such as the event after a pattern match. However in our case, we want to make sure we are capturing continuous patterns. Specifically for our query, the pattern that we are looking for is 3 trip updates with increases in delay in a row (see MATCH_RECOGNIZE – PATTERN section below). So, if there was a bus trip with 5 trip updates with strictly increasing delay, then there would be 2 results from our MATCH_RECOGNIZE statement with our after match strategy. The first result would be for updates 1-3, and the second result for updates 3-5. The first match’s C would also be the second match’s A in this case. See other after match strategies in our documentation.
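
To make the strategy concrete, suppose a single trip emits five updates with delays of 40, 80, 120, 160, and 200 seconds (a hypothetical sequence that satisfies the DEFINE conditions covered below):

  • Updates 1-3 (delays 40, 80, 120) match as A, B, and C, producing the first result.
  • With SKIP TO LAST C, matching resumes at update 3, so updates 3-5 (delays 120, 160, 200) match as A, B, and C, producing the second result.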

MATCH_RECOGNIZE – PATTERN (line 27)

The PATTERN clause specifies the pattern of events that should be considered a match. The PATTERN subclause contains pattern variables, which can each be associated with a quantifier to specify how many rows of that variable to allow in the pattern (see the documentation for more details). In our query, we have a simple pattern of (A B C) without any quantifiers, meaning that in order for a series of events to be considered a match, there needs to be 3 consecutive events with the first one matching A’s definition, the second one matching B’s definition, and the third one matching C’s definition.

MATCH_RECOGNIZE – DEFINE (lines 28-31)

The DEFINE subclause defines each of the pattern variables from the PATTERN subclause. If a variable is not defined, then it will evaluate to true for an event contributing to a pattern match. This clause is similar to the WHERE clause in SQL, in that it specifies conditions an event must meet in order to be considered as one of the pattern variables. When defining these pattern variables, we can access fields from the original events of the source Relation, in our case the trip_updates Stream, and define expressions for evaluation. Aggregation and offset functions can also be applied here. In our query, we are defining our pattern variables based on their delay. For the first event in our pattern, A, we want bus trips that are already delayed (a delay greater than 0 seconds). B is defined as an event that is more than 30 seconds more delayed than A, and similarly C is defined as an event that is more than 30 seconds more delayed than B. Combined with our PATTERN of (A B C), our query is essentially finding patterns where the delay increases by more than 30 seconds with each trip update for 3 trip updates in a row.

QUERY WITH (line 33)

The last line of our query is the QUERY WITH clause. This optional clause is used to set query properties. For our query, we are setting the state.ttl.millis property, which informs the query when it is safe to purge its state. Another way to limit the state size for MATCH_RECOGNIZE queries is the optional WITHIN subclause, which specifies a maximum duration for patterns. Without specifying some way for the query to purge state, the amount of information the query needs to keep in memory will grow indefinitely and the query will eventually fail.
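
For example, bounding each pattern to a one-hour window would mean changing line 27 of the query to something like the following (this mirrors the WITHIN syntax found in Flink SQL; check the DeltaStream documentation for the exact form supported):

  PATTERN (A B C) WITHIN INTERVAL '1' HOUR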

Conclusion

Queries using the MATCH_RECOGNIZE syntax can seem complex at first glance. This blog post aims to bring clarity to the different parts of a MATCH_RECOGNIZE query by walking through a specific, easy-to-follow example. Hopefully this post makes pattern recognition queries easier to understand, as MATCH_RECOGNIZE is a powerful operator that can solve many use cases.

If you’re interested in trying out MATCH_RECOGNIZE queries in DeltaStream yourself, sign up for a free trial or schedule a demo with us.

29 Nov 2023

Min Read

Top 3 Challenges of Stream Processing and How to Address Them

Real-time shipment tracking, truck fleet management, real-time training of ML models, and fraud detection are all real use-cases that are powered by streaming technologies. As companies try to keep up with the trends of building real-time products, they require a data stack capable of handling and processing streaming data. Unlocking stream processing allows businesses to do advanced data analytics at low latencies, which ultimately allows for faster business insights and more informed decision making. However, many of the popular stream processing frameworks, like Apache Flink, are quite complex and can be challenging to operate. In this blog post we’ll cover 3 of the main challenges with operating a stream processing platform and how to address them.

Challenge 1: Resource Management

Stream processing jobs are long lived in order to continuously process data. Because these jobs are long lived, it’s important that your stream processing platform properly allocates resources for the stream processing jobs. On one hand, over-allocating resources will result in overspending, while on the other hand, under-allocating resources will cause jobs to fall behind or fail. The proper allocation for stream processing jobs varies case by case. For instance, jobs with high throughputs that need to hold a lot of state should receive more memory, while jobs with complex transformations should receive more CPU. Many workloads can also fluctuate and a dynamic allocation of resources in such cases is necessary to match the workload. Take a stream of data for page visits on a website for example – it’s likely that the website will be visited more during the day when people aren’t asleep. So, stream processing jobs that source from this data should scale up during the day, then scale down for the night.

Exposing resource utilization metrics is an important first step toward tackling the challenges of resource management. Having visibility into the resource utilization trends of your jobs allows your stream processing platform to define rules for resource allocation. In the simplest case, if your job’s resource utilization is stable, you can allocate resources to match what’s shown in the metrics. For jobs with predictable fluctuations, such as a job sourcing from data that peaks during the day and dips during the night, you can set up a system that adjusts the resource allocation on a schedule. In the most complex case, for jobs with unpredictable resource fluctuations, the best approach is to add an auto-scaling service that can automatically resize compute resources based on resource metrics. Building a platform that exposes the right metrics, can safely resize your stream processing jobs, and includes an auto-scaling service is necessary to support stream processing workloads generically, but it can take a lot of engineering time and effort. If building these systems is too costly an engineering investment, you can also consider fully managed third-party solutions that can partially or fully address these challenges.

Challenge 2: Data Heterogeneity

For production use cases, data can come from many different sources and in many different formats. Streaming data coming from sensors, cloud providers, databases, and backend services will differ from one another, which makes the data difficult to compare. It is not easy to create a stream processing platform that can handle various data formats and quality levels. The engineering team that supports this platform will need to understand the nuances of different data sources and provide tools and features that help make variable data more uniform. However, such a platform creates many possibilities, as businesses can use data from sources that were previously isolated.

Standardizing the data across your organization and implementing quality control over the data are the best solutions for dealing with data heterogeneity. Providing data standards and encouraging data schematization at the data’s source is the best practice, but if that’s not possible, stream processing can also help to transform a set of data into a standardized format that can be easily processed and integrated with other data streams. Stream processing platforms that enable users to filter out bad or duplicate data, enrich data with missing fields, and mutate data to fit standardized data formats can mitigate many of the issues of variable data.
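
As a hedged illustration of that last point, a CSAS could normalize a heterogeneous sensor feed into a standard shape; the Stream and field names below are hypothetical:

  CREATE STREAM sensor_readings_standardized AS
  SELECT
    device_id,
    LOWER(sensor_type) AS sensor_type,
    (reading_f - 32) * 5.0 / 9.0 AS reading_c,   -- normalize Fahrenheit readings to Celsius
    CAST(event_time AS TIMESTAMP) AS event_ts
  FROM raw_sensor_readings
  WHERE device_id IS NOT NULL;                   -- drop records missing a required key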

One tip for dealing with data of variable quality is to provide different configurations for error handling. For many stream processing frameworks, if there is a record that a job doesn’t know how to deserialize or make sense of, the job will simply fail. For data sources that don’t have great data integrity, options to skip over those records or produce them to a dead-letter queue for further analysis can be better choices for your overall application.
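
Dead-letter handling for records that fail deserialization is typically a job- or framework-level configuration, but a similar idea can be approximated in SQL for records that parse yet fail validation. A rough sketch, again with hypothetical names and thresholds:

  -- Valid readings continue down the main pipeline
  CREATE STREAM sensor_readings_valid AS
  SELECT * FROM sensor_readings_standardized
  WHERE reading_c BETWEEN -90.0 AND 60.0;

  -- Suspect readings are routed to a separate Stream for later analysis
  CREATE STREAM sensor_readings_suspect AS
  SELECT * FROM sensor_readings_standardized
  WHERE reading_c IS NULL OR reading_c < -90.0 OR reading_c > 60.0;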

Challenge 3: Streaming Knowledge Gaps

Most stream processing frameworks are highly complex pieces of software. Building up the expertise to understand the challenges of streaming versus the traditional challenges of the batch world takes time. For some organizations, having engineers ramp up on streaming technologies may not be a worthwhile or affordable investment. For organizations that do invest in a team of streaming experts, a knowledge gap between the streaming team and other teams often forms. Even with a stream processing platform available to them, product engineers may not have much exposure to streaming concepts or know how to best leverage the streaming tools available to them. In these cases, engineers working on product features or applications may require a lot of back and forth with the team of streaming engineers to deliver their projects, or they may not realize the benefits of adding stream processing to their projects in the first place. These situations can lead to lost business potential and reduced developer velocity.

Two ways to help address these challenges are investing in developer education and democratizing the streaming and stream processing platforms. Setting up regular knowledge sharing sessions and encouraging collaboration between teams can go a long way toward reducing knowledge gaps. From the platform perspective, democratizing streaming and stream processing by making these platforms easy to use lowers the barrier to entry for these technologies. Popular stream processing frameworks such as Flink and Spark Streaming have SQL APIs for defining data processing jobs. Exposing SQL to abstract away some of the complexities of the underlying system is one way to make the platform easier to use.
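
For example, a windowed count that might take a significant amount of code against a lower-level streaming API can be expressed in a few lines of SQL. The snippet below uses a generic group-window style; exact windowing syntax varies by engine, and the Stream and field names are hypothetical:

  CREATE STREAM page_visit_counts AS
  SELECT
    page_id,
    COUNT(*) AS visits
  FROM page_visits
  GROUP BY page_id, TUMBLE(visit_ts, INTERVAL '1' MINUTE);   -- one count per page per minute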

Conclusion

In this blog post we highlighted 3 of the main challenges we’ve seen organizations face when building and operating their own stream processing platforms. Overcoming each of these challenges requires engineering time and effort. While some organizations may be able to spend the upfront time and money to build their own in-house data streaming platforms, others may not. This is where fully managed cloud services, such as DeltaStream, can help.

Our aim at DeltaStream is to provide an easy to use data streaming platform to unify, process, and govern your data. Here’s how DeltaStream addresses each of the challenges above:

  1. Resource Management: DeltaStream is a serverless platform, meaning resource scaling and operations are completely taken care of. No cluster sizing or resource provisioning.
  2. Data Heterogeneity: Out of the box, DeltaStream has support for all major data serialization formats – JSON, ProtoBuf, and Avro. DeltaStream also has native support for many data storage systems including Kafka, Kinesis, Delta Lake, Snowflake, and PostgreSQL. DeltaStream’s rich processing capabilities also allow users to filter, enrich, and transform data to mitigate data quality issues.
  3. Streaming Knowledge Gaps: DeltaStream exposes an easy to use SQL interface for interacting with all streaming resources and streaming queries. Tools for sampling data and testing queries are also provided to help users iterate faster.

If you want to learn more about how DeltaStream can enable stream processing in your organization, schedule a demo with us. If you want to try DeltaStream out yourself, sign up for a free trial.
