Building event-driven systems at scale in Kubernetes with Dapr — Part III: What does “at scale” really mean?
In Part I of this series, I discussed why a team would choose to use Dapr to build applications. In Part II, I discussed how Dapr works. In Part III, I will describe my experience leading a team that built a system in Azure on Kubernetes, using Kafka for distributed messaging and Cosmos DB for high-throughput, elastic state management. Dapr is the glue that holds all the infrastructure and architecture pieces together.
A brief introduction: I am a Senior Software Engineer at a company that develops software we present as a platform to our customers (almost like PaaS, but before it was cool). I was given the opportunity to tech lead a newly formed team whose mandate was to build a near realtime, event-driven system that would use AI/ML to intervene at points along a player’s journey where the ML sees potential to improve their experience on our platform. Of course, how that actually works is the secret sauce, which we will not be digging into publicly. In this post, I will discuss how Dapr solved some of the inherently complex problems that present themselves when building distributed systems, and give a brief summary of the kind of scale we’re running at. Let’s get started.
What problems did Dapr solve for me?
Making it easy to avoid building a distributed monolith
This is a common trap that developers fall into when building microservices. We think about how to split the system based on the types of resources the services abstract, but we don’t give thought to how the services interact with each other. What typically gets built is a bunch of services that each have a small blast radius but await the response of their dependencies, which await the responses of their dependencies, and so on and so forth. What you end up with is a chain of awaited HTTP calls and, as a result, a system that is no better, and typically slower, than a monolith. While Dapr can’t fix this problem for you, as it’s fundamentally an architectural problem, Dapr certainly makes it easier to avoid this mistake by letting you plug a Pub/Sub component into your system with ease. Furthermore, if you do need to make an awaited call to a dependency, Dapr uses gRPC rather than HTTP to invoke requests between services, which comes with the inherent performance benefits of HTTP/2 over HTTP/1.1.
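To make that concrete, here is a minimal sketch of what publishing an event through the Dapr sidecar’s Pub/Sub API looks like instead of awaiting a chain of downstream HTTP calls. The component name kafka-pubsub, the topic player-events, and the payload are hypothetical stand-ins, and the sketch assumes the sidecar is listening on its default HTTP port of 3500.

```python
import json
import requests

DAPR_HTTP_PORT = 3500            # default Dapr sidecar HTTP port
PUBSUB_NAME = "kafka-pubsub"     # hypothetical Pub/Sub component name
TOPIC = "player-events"          # hypothetical topic name


def publish_player_event(event: dict) -> None:
    """Fire-and-forget publish via the Dapr sidecar's Pub/Sub API."""
    url = f"http://localhost:{DAPR_HTTP_PORT}/v1.0/publish/{PUBSUB_NAME}/{TOPIC}"
    resp = requests.post(
        url,
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    publish_player_event({"playerId": "1234", "type": "level-completed"})
```

The caller returns as soon as the broker accepts the event; subscribers process it on their own schedule, which is exactly what breaks the chain of awaited calls.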
Service discovery to reliably find a route to your destination
This is a problem developers run into almost immediately when building distributed systems: how do I get my app to talk to that other app? Dapr has a lot of overlap with service meshes, but it is not a service mesh in and of itself. However, from an application developer’s perspective, Dapr makes it super easy to discover other applications by their Dapr application identifier (or dapr-appId). Because each application has to register itself with Dapr under a unique identifier, Dapr can offer a simple way for services to interact: you just address the target by its dapr-appId. This is hugely powerful, as it ensures requests are load balanced and communication is secured by applying the principle of least privilege, so only those who should have access to your services do have access. It’s difficult to mess up, to be completely honest.
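As an illustration, here is roughly what calling another service by its dapr-appId looks like through the sidecar’s service invocation API. The app id player-profile and the method get-profile are hypothetical; the point is that the sidecar resolves the app id to a healthy instance and routes the call for you.

```python
import requests

DAPR_HTTP_PORT = 3500              # default Dapr sidecar HTTP port
TARGET_APP_ID = "player-profile"   # hypothetical dapr-appId of the target service
METHOD = "get-profile"             # hypothetical method exposed by that service


def get_player_profile(player_id: str) -> dict:
    """Invoke another Dapr application by its app id; Dapr handles discovery and routing."""
    url = (
        f"http://localhost:{DAPR_HTTP_PORT}"
        f"/v1.0/invoke/{TARGET_APP_ID}/method/{METHOD}"
    )
    resp = requests.get(url, params={"playerId": player_id})
    resp.raise_for_status()
    return resp.json()
```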
Parallel, independent compute with virtual actors
Using virtual actors is an easy way to scale a system safely when you have a use case for distributed processing of data in a safe, single-threaded, and atomic way. Virtual actors do this by acquiring a lock on themselves when they begin processing and releasing that lock when processing is complete. While an individual actor can only process one event at a time, many actors can be addressed concurrently. This eliminates the need for complex locking and idempotency checks.
In the context of a near realtime, event-driven ML system, we needed to keep a trailing buffer of events for an individual player in order to run meaningful inference. For complex technical reasons, we decided to keep this trailing buffer of events in virtual actors, and virtual actors are what made this possible. Doing this without virtual actors would have been highly complex, and dare I say it, nigh on impossible.
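For a flavour of what that looks like in practice, here is a minimal sketch of routing an event to a per-player actor through the sidecar’s actors API. The actor type PlayerActor and the method RecordEvent are hypothetical stand-ins for our real implementation; the important part is that Dapr serialises calls to any given actor id, so two events for the same player are never processed concurrently.

```python
import json
import requests

DAPR_HTTP_PORT = 3500        # default Dapr sidecar HTTP port
ACTOR_TYPE = "PlayerActor"   # hypothetical actor type that holds the trailing buffer
METHOD = "RecordEvent"       # hypothetical actor method that appends to the buffer


def record_event_for_player(player_id: str, event: dict) -> None:
    """Send an event to the actor for this player; Dapr guarantees turn-based access."""
    url = (
        f"http://localhost:{DAPR_HTTP_PORT}"
        f"/v1.0/actors/{ACTOR_TYPE}/{player_id}/method/{METHOD}"
    )
    resp = requests.post(
        url,
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
```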
Reliable and traceable
Possibly the most understated part of Dapr is how easy it is to build a system that is fault tolerant while being easily observed. What does that really mean? It means Dapr has built-in concepts for handling failures and environmental outages in ways that let the system heal itself while remaining observable during the outage. We leverage Pub/Sub, for which Dapr implements CloudEvents, giving us the ability to observe a single event flow through the entire system. Given that we are in Azure, and so are our dependencies, our distributed traces span the entire processing pipeline: from the event reaching the system, to the ML-driven inference response being generated, to the action taken and the player’s interactions on the client.
I cannot overstate how massively important doing this right is.
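As a rough sketch of what the subscriber side looks like, the handler below receives the CloudEvents envelope that Dapr wraps around each message. It reuses the same hypothetical component and topic names from the earlier sketch, and the exact trace fields (traceid / traceparent) can vary between Dapr versions, but the idea is that the trace context rides along with the event so it can be stitched into the end-to-end trace.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


# Dapr calls this endpoint on startup to discover programmatic subscriptions.
@app.route("/dapr/subscribe", methods=["GET"])
def subscribe():
    return jsonify([{
        "pubsubname": "kafka-pubsub",   # hypothetical Pub/Sub component name
        "topic": "player-events",       # hypothetical topic name
        "route": "/player-events",
    }])


# Dapr delivers each message here, wrapped in a CloudEvents envelope.
@app.route("/player-events", methods=["POST"])
def on_player_event():
    envelope = request.get_json()
    trace_id = envelope.get("traceid") or envelope.get("traceparent")
    app.logger.info(
        "event %s on topic %s, trace %s",
        envelope.get("id"), envelope.get("topic"), trace_id,
    )
    # ... hand envelope["data"] to the processing pipeline ...
    return jsonify({"status": "SUCCESS"})


if __name__ == "__main__":
    app.run(port=6000)
```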
So what does “at scale” really mean?
320 million events per day. 3700 events per second, give or take a few. Not all of them are processed, but that’s what Dapr can do. Well, actually, Dapr can do a whole lot more than that. To be more precise, the virtual actors are processing around 1000 events per second. There are other auxiliary services that are used to apply constraints and boundaries and such, but their volumes are significantly lower, as we have designed the system to essentially funnel and eliminate events at each step of the way. All of this runs on an Azure Kubernetes Service (AKS) cluster sitting on around 40 VMs, with a maximum of 80 VMs, and Azure Cosmos DB provisioned up to 400K RU/s. For those curious, Cosmos DB averages a 1.49ms response time on the getter calls and around 40ms on the setter calls.
By far the biggest limitation on the system is not Dapr, but rather how fast one can pragmatically run near realtime inference through an ML decision-making module.