Real-time data has become increasingly valuable across many industries, including healthcare, IoT, machine learning, and cybersecurity. Stream processing frameworks such as Apache Flink enable organizations to gain immediate insights into their streaming data. For example, in the case of IoT, companies can learn in less than a second that a sensor has failed. However, what if you want to share that processed data (stream) with another team that needs it for a downstream application? The processed stream is what we refer to as a Data Product, and sharing these Data Products is difficult because of the underlying permissioning in a streaming store such as Kafka. If we build a system for governing these source and processed streams, we unlock the ability to securely share Data Products. This enables cross-functional and external sharing use cases.
What does it mean to Securely Share Data?
Secure data sharing is a technique that enables data owners to share their data resources in a secure and controlled way with data consumers. Secure sharing for real-time data streams involves the following aspects:
- Data privacy: ensuring that the data streams are protected from unauthorized access, disclosure, and modification, and that the data owners have control over who can access their data and for what purpose.
- Data access: ensuring that the data streams are available, accessible, and usable for the intended data consumers, and that the data sharing is scalable, efficient, and cost-effective.
- Data auditing: ensuring that the usage and sharing of data streams is transparent, accountable, and auditable.
Secure data sharing improves both data collaboration and data innovation. With secure data sharing, data owners and consumers can safely exchange real-time data and insights, encouraging cross-team collaboration and extracting more value from otherwise siloed data.
Current State of Sharing Streaming Data
Managing data authorization is one of the most common ways to securely share real-time streaming data. Consider Apache Kafka, one of the most popular real-time event streaming storage systems. Data events in Kafka are organized into topics, and access to these topics is managed through Access Control Lists (ACLs). However, at scale, ACLs are difficult to manage and prone to mistakes.
To process real-time streaming data, users often turn to stream processing frameworks like Flink. ACLs would need to be configured in the streaming store to allow the Flink job to read from and write to topics. Then, to share the topic containing the processed data with other users, the ACLs need further additions. Many streaming and stream processing technologies make up a streaming data ecosystem, and all of these systems put stress on the ACLs used to control access to data streams.
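To make the ACL bookkeeping described above concrete, here is a minimal sketch in Python. It does not call Kafka. `AclEntry`, the helper functions, and every topic, group, and principal name are hypothetical and chosen only for illustration. It also assumes that each reader uses a consumer group and therefore needs READ on that group in addition to READ on the topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """Hypothetical stand-in for a single ACL binding in a streaming store."""
    principal: str
    operation: str      # "READ" or "WRITE"
    resource_type: str  # "TOPIC" or "GROUP"
    resource_name: str


def acls_for_processing_job(principal, source_topic, sink_topic, group):
    # The processing job reads the source stream and writes the processed one.
    return [
        AclEntry(principal, "READ", "TOPIC", source_topic),
        AclEntry(principal, "READ", "GROUP", group),
        AclEntry(principal, "WRITE", "TOPIC", sink_topic),
    ]


def acls_for_consumer(principal, topic, group):
    # Each downstream consumer of the processed stream needs its own entries.
    return [
        AclEntry(principal, "READ", "TOPIC", topic),
        AclEntry(principal, "READ", "GROUP", group),
    ]


# Assumed names: one job turns "sensor-readings" into "sensor-failures".
acls = acls_for_processing_job(
    "User:flink-job", "sensor-readings", "sensor-failures", "flink-job-group"
)

# Sharing the processed topic with three teams adds entries per team.
for team in ["analytics", "alerting", "maintenance"]:
    acls += acls_for_consumer(f"User:{team}", "sensor-failures", f"{team}-group")

print(len(acls))  # prints 9
```

Even in this toy setup, one job and three consumers already require nine entries, and every new topic or consumer adds more entries that must be kept correct by hand.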
In order to overcome these challenges, there needs to be a data platform that can manage data processing, access control, and data sharing capabilities at a higher level.
Best Practices for Securely Sharing Streaming Data
There are two main types of data sharing that data platforms should support – internal data sharing and 3rd party data sharing.
Internal Data Sharing
Internal data sharing refers to the process of making data accessible to other users, teams, or applications within an organization. Secure internal data sharing capabilities pair well with data meshes, where data owners can be any user or team within an organization. These data owners can authorize who has access to data and determine what access permissions each party has.
One common way that popular data systems, such as Snowflake, provide internal data sharing capabilities is through Role-Based Access Control (RBAC). Because permissions are granted to roles rather than to individual users, RBAC is scalable and easy to understand, making it an effective tool for controlling access to data. While access to streaming data has typically been defined with ACLs, an RBAC-based approach would address the scalability issues seen in streaming storage systems today. A data platform that sits on top of streaming storage systems, such as Kafka, and provides an RBAC interface for managing access to the underlying Kafka topics would provide a much more intuitive access control experience for real-time data users.
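As a rough illustration of the RBAC idea, the sketch below models roles that hold privileges on topics and users that hold roles. It is an in-memory toy, not the API of Kafka, Snowflake, or any other product. `RbacPolicy`, its method names, and the role, user, and topic names are all assumptions made for this example.

```python
from collections import defaultdict


class RbacPolicy:
    """Hypothetical in-memory RBAC model layered over topic names."""

    def __init__(self):
        self._role_grants = defaultdict(set)  # role -> {(privilege, topic)}
        self._user_roles = defaultdict(set)   # user -> {role}

    def grant_privilege(self, role, privilege, topic):
        # Privileges attach to roles, never directly to users.
        self._role_grants[role].add((privilege, topic))

    def grant_role(self, user, role):
        self._user_roles[user].add(role)

    def revoke_role(self, user, role):
        self._user_roles[user].discard(role)

    def is_allowed(self, user, privilege, topic):
        # A user is allowed if any of their roles carries the privilege.
        return any(
            (privilege, topic) in self._role_grants.get(role, ())
            for role in self._user_roles.get(user, ())
        )


policy = RbacPolicy()
policy.grant_privilege("failure_reader", "READ", "sensor-failures")
policy.grant_role("alice", "failure_reader")

print(policy.is_allowed("alice", "READ", "sensor-failures"))
print(policy.is_allowed("alice", "WRITE", "sensor-failures"))
print(policy.is_allowed("bob", "READ", "sensor-failures"))
```

In this model, onboarding another reader is a single `grant_role` call, and changing what readers can access means editing one role instead of many per-user entries.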
3rd Party Data Sharing
3rd party data sharing refers to the process of making data accessible to parties outside of an organization. This enables organizations to collaborate with other organizations without giving them direct access to their data ecosystem. In the current streaming landscape, this kind of data sharing is not natively supported. For instance, Kafka only allows internal data sharing, through ACLs.
Snowflake, for example, enables secure 3rd party data sharing through the concept of shares, which allow data objects in a Snowflake organization to be made shareable with other Snowflake organizations. The data providers in this case can specify which accounts can consume from these data objects. This is just one example out of many possible ways to implement 3rd party sharing. Providing such functionality for streaming data would unlock opportunities for organizations to collaborate with real-time streaming data as well.
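One way to picture a share applied to streams is as a named bundle of data objects plus a list of outside parties allowed to consume them. The sketch below is a hypothetical model only. It is not Snowflake's API or any streaming product's API, and the `Share` class and all organization and stream names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Share:
    """Hypothetical share: streams a provider exposes to chosen consumers."""
    name: str
    provider_org: str
    streams: set = field(default_factory=set)
    consumer_orgs: set = field(default_factory=set)

    def add_stream(self, stream):
        self.streams.add(stream)

    def add_consumer(self, org):
        self.consumer_orgs.add(org)

    def remove_consumer(self, org):
        self.consumer_orgs.discard(org)

    def can_read(self, org, stream):
        # Only listed consumers may read, and only streams placed in the share.
        return org in self.consumer_orgs and stream in self.streams


share = Share(name="failure_alerts", provider_org="acme-iot")
share.add_stream("sensor-failures")
share.add_consumer("partner-maintenance-co")

print(share.can_read("partner-maintenance-co", "sensor-failures"))
print(share.can_read("partner-maintenance-co", "sensor-readings"))
print(share.can_read("unknown-org", "sensor-failures"))
```

The consumer never receives credentials for the provider's wider data ecosystem. It sees only what the provider explicitly placed in the share, and the provider can withdraw access by removing the consumer.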
Secure sharing for real-time data streams improves data accessibility and enhances collaboration with real-time analytics, while enabling teams to maintain the privacy of their data assets. However, the data sharing capabilities in the current state of real-time streaming data are not up to par with the capabilities seen in the batch world for at-rest data.
DeltaStream is a stream processing data platform that aims to provide an intuitive way to share streaming data, both internally and with 3rd parties. At DeltaStream, we use RBAC as the approach for access control and provide capabilities for sharing data between organizations. Data governance and secure data sharing are essential for providing an easy-to-use data ecosystem that allows users to focus on their data products. If you are interested in learning more about DeltaStream, reach out to us for a demo or sign up for a free trial.