Consider a storage service where you store and retrieve files. In on-prem environments, HDFS (Hadoop Distributed File System) has been one of the most common storage platforms. As with any service in an on-prem environment, as a user you are responsible for every aspect of operations: bringing up the HDFS cluster, scaling it up and down as usage changes, and dealing with failures of all kinds, including but not limited to server failures and network failures. Ideally, of course, you would like to simply use the storage service and focus on your business without dealing with the complexity of operating such infrastructure. If you use a cloud environment instead of on-prem, you have the option of choosing storage services provided by a variety of vendors instead of running HDFS on the cloud yourself. However, there are different ways to provide the same service on the cloud, and they can significantly affect the user experience and ease of use of such services.
A Look at Managed Services
Let’s go back to our storage service and assume that we are now using a cloud environment, so we can take advantage of services that vendors offer instead of running the required infrastructure ourselves. One common option is to provide a managed version of the on-prem service: the service provider takes the same platform that is used in the on-prem environment and makes some improvements to run it in the cloud. While this takes away some of the operational burden, the user is still involved in many other aspects of operating such managed services. For the storage service we are considering here, a managed HDFS service would be an example of this approach. When using a “fully managed” cloud HDFS, as a user you still have to make decisions such as provisioning an HDFS cluster through the service. This means that you need a good understanding of the amount of storage you will be using, and you must tell the service provider if you need more or fewer resources after provisioning the initial cluster. Requiring the user to provide such information often results in confusion; in most cases the initial decision won’t be accurate, and as usage continues the provisioned resources will need to be adjusted. You cannot expect a user to accurately know how much storage they will need in the next six months or a year.
In addition, the notion of a cluster brings many limitations. A cluster has a finite amount of resources, and as usage continues those resources are consumed, creating a need for more. In the case of our “managed” HDFS service, the provisioned storage (disk space) is one of the limited resources, and as more and more data is stored in the cluster, the available storage shrinks. At this point, the user has to decide between scaling up the existing cluster by adding more nodes and disk space, or adding a new cluster to accommodate the growing need for storage. To get ahead of such issues, users may over-provision resources, which in turn can result in unused resources and extra cost. Finally, once a cluster is provisioned, the user starts incurring the cost of the whole cluster regardless of whether half or all of its resources are utilized. In short, a managed cloud service in most cases puts the burden of resource provisioning on the user, which requires the user to have a deep understanding of the required resources not just for now, but for the short-term and long-term future.
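The cost impact of over-provisioning is easy to quantify. The sketch below uses entirely hypothetical prices and usage figures (none come from any real vendor) to compare paying for a whole provisioned cluster against paying only for the storage actually used:

```python
# Toy cost comparison: provisioned cluster vs. pay-per-use storage.
# All numbers are made-up illustrations, not real vendor pricing.

node_cost_per_month = 500.0   # hypothetical monthly cost of one cluster node
nodes_provisioned = 10        # sized for projected peak, not current use
storage_used_tb = 12.0        # what the user actually stores today
price_per_tb_month = 25.0     # hypothetical pay-per-use storage price

cluster_bill = nodes_provisioned * node_cost_per_month
usage_bill = storage_used_tb * price_per_tb_month

print(f"Provisioned cluster: ${cluster_bill:.2f}/month")
print(f"Pay-per-use:         ${usage_bill:.2f}/month")
# The cluster bill is identical whether the 10 nodes are 5% or 95% full;
# the pay-per-use bill tracks only the 12 TB actually stored.
```

The gap between the two bills is exactly the cost of the idle headroom the user had to guess at up front.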
Unlocking Cloud Simplicity: The Serverless Advantage
Now let’s assume that instead of taking the managed-service path, a vendor takes a different route and builds a new cloud-only storage service from the ground up, where all the complexity and operations are handled under the hood by the vendor and users don’t have to think about concerns such as the resource provisioning described above. For storage, object store services such as AWS S3 are great examples of this approach, which is called serverless. As an S3 user, you just set up buckets and folders and read and write your files. There is no need to manage anything or provision any cluster, and no need to worry about having enough disk space or nodes. All operational aspects of the service, including making sure it is always available with the required storage space, are handled by S3. This is a huge win for users, since they can focus on building their applications instead of worrying about provisioning and sizing clusters correctly. With such simplicity of use, it’s easy to see why almost every cloud user relies on object stores such as S3 for their storage needs unless there is an explicit requirement to use something else. S3 is a great example of the superiority of cloud-native serverless architecture over a “fully managed” version of an on-prem product.
Another benefit of serverless platforms such as S3 compared to managed services is that S3 enables users to access, organize, and secure their data in one place instead of dealing with multiple clusters. In S3 you can organize your data in buckets and folders, have a unified view of all of your data, and control access to it in one place. The same cannot be said for a managed HDFS service if you have more than one cluster! In that case, users have to keep track of which cluster holds which data and how to control access to data across multiple clusters, which is a much more complex and error-prone process.
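To make the “one place” point concrete, here is a minimal stdlib-only sketch that models an object store the way S3 does: a flat key space per bucket in which “folders” are just key prefixes, so a single listing gives a unified view of all data. This is not the real S3 API, and the bucket and key names are invented for illustration:

```python
# Minimal model of an object store's flat namespace, in the spirit of S3.
# Not the real S3 API; bucket/key names are invented for illustration.

store: dict[str, dict[str, bytes]] = {}  # bucket -> {key: object bytes}

def put_object(bucket: str, key: str, body: bytes) -> None:
    store.setdefault(bucket, {})[key] = body

def list_objects(bucket: str, prefix: str = "") -> list[str]:
    # "Folders" are nothing more than shared key prefixes.
    return sorted(k for k in store.get(bucket, {}) if k.startswith(prefix))

put_object("analytics", "logs/2024/app.log", b"...")
put_object("analytics", "logs/2024/web.log", b"...")
put_object("analytics", "models/v1.bin", b"...")

print(list_objects("analytics"))           # unified view of everything
print(list_objects("analytics", "logs/"))  # one "folder" by prefix
```

With multiple HDFS clusters, by contrast, there is no single namespace to list: the query "what data do I have?" has to be answered per cluster, and access rules have to be kept consistent across all of them by hand.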
Choosing Serverless for Stream Processing
We can make the same argument in favor of serverless offerings over managed services for many other platforms, including stream processing and streaming databases. You can have “fully managed” services where the user has to provision a cluster, along with specifying the amount of resources the cluster will have, before writing any query. Indeed, managed services for stream processing involve far more complexity than the managed HDFS example above. The cluster in the stream processing case is shared among multiple queries, which means imperfect isolation and the possibility of one bad query bringing down the whole cluster, disrupting the other queries even though they were healthy and running with no issues. To exacerbate the situation, as you add more streaming queries to the cluster, the cluster’s resources will eventually all be used, since streaming queries are long-running jobs, and you will need to scale up your cluster or launch a new one to accommodate newer queries. The first option results in a larger cluster with more queries sharing the same resources and interfering with each other.
The second option, on the other hand, results in a growing number of clusters to keep track of, along with tracking which query is running on which cluster. So any time you have to provision or declare a cluster in a “fully managed” stream processing or streaming database service, you are dealing with a managed service along with the restrictions mentioned above, and many more. Even worse, once you provision a cluster, billing for the cluster starts regardless of whether no queries or several queries are running on it.
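The cluster-sprawl dynamic described above can be sketched with a toy scheduler. Assuming a hypothetical fixed capacity in “slots” per cluster (the slot count and query names are invented), each new long-running query either fits an existing cluster or forces a new one to be provisioned, and the user inherits the query-to-cluster bookkeeping:

```python
# Toy illustration of long-running queries exhausting fixed-size clusters.
# Slot counts and query names are invented for illustration.

CLUSTER_SLOTS = 4  # hypothetical capacity of one provisioned cluster

def place_queries(queries: list[str]) -> list[list[str]]:
    """Greedy placement: open a new cluster whenever the current one is full."""
    clusters: list[list[str]] = []
    for q in queries:
        if not clusters or len(clusters[-1]) == CLUSTER_SLOTS:
            clusters.append([])  # current cluster full -> provision another
        clusters[-1].append(q)
    return clusters

# Streaming queries never finish, so slots are never freed.
layout = place_queries([f"query_{i}" for i in range(10)])
print(f"{len(layout)} clusters needed")
for i, cluster in enumerate(layout):
    print(f"cluster {i}: {cluster}")  # mapping the user must now track
```

Ten never-ending queries already require three clusters here, and every cluster bills from the moment it is provisioned, whether it holds four queries or none. A serverless design removes this mapping entirely: there is nothing for the user to place queries onto.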
We built DeltaStream as a serverless platform because we believe such a service should be as easy to use as S3. You can think of DeltaStream as the S3 of stream processing. In DeltaStream there is no notion of a cluster or of provisioning. You just connect to your streaming stores, such as Apache Kafka or AWS Kinesis, and you are up and running, ready to write queries. You only pay for queries that are running, and since there is no concept of a cluster, you won’t be charged for idle or underutilized clusters! Focus on building your streaming applications and pipelines and leave the infrastructure to us. Launch as many queries as you want; there is no notion of running out of resources. Your queries run in isolation, and we can scale them up and down independently without them interfering with each other.
DeltaStream is a unified platform that provides stream processing (streaming analytics) and a streaming database in one place. You can build streaming pipelines, event-based applications, and always up-to-date materialized views with familiar SQL syntax. In addition, DeltaStream enables you to organize and secure your streaming data across multiple streaming storage systems. If you are using any flavor of Apache Kafka (including Confluent Cloud, AWS MSK, or Redpanda) or AWS Kinesis, you can now try DeltaStream by signing up for our free trial.