
How virtual actor frameworks deal with cluster topology change


When nodes are added to or removed from the cluster, we say that the topology of the cluster changes. With the virtual actor model, this introduces interesting challenges that are not present in typical stateless applications. Let’s take a look at how different frameworks handle this situation.

When does the topology change?

The main scenarios for topology change are:

  • scaling the cluster up or down in response to changing load
  • failure of a node in the cluster
  • application upgrade, e.g. rolling update in Kubernetes

What is the challenge with virtual actors and topology change?

With a typical stateless application, the event of an application instance being added or removed updates the load balancing mechanism. The next request served by the application might be forwarded to an instance that has just joined the load balancer. This instance will serve the request in the same way as any of the other instances (perhaps a bit slower due to a cold start). If an instance has left the load balancer, there are no undesirable side effects (worst case, some in-flight requests will fail, which can be avoided with a proper implementation).

  • After adding an instance, business continues as usual
  • We can also remove an instance without much consequence

The situation is a bit more complex when a virtual actor framework is involved, because the application is stateful. In this case, a message is sent to a specific activation of the actor, which resides on a specific node. The client does not need to know which node that is exactly, but it still matters, since the message needs to be routed there by the framework.

There are some interesting questions here:

  • If we add a new node to our actor cluster, how can we take advantage of this additional processing power? Do we move existing actor activations or should we just wait for new actors to spawn on it?
  • What is the consequence of a node leaving the cluster?
  • What happens if we need to rotate all the nodes in the cluster, e.g. during application upgrade? 

Default approach

In many cases actor activations are considered to be “sticky”. This means that once they are activated on a node, they will not move unless absolutely necessary. Such an approach prevents slowdowns related to actor migration, but in some scenarios it causes unbalanced load among the cluster nodes.

  • A node was added to the cluster, but existing activations did not move
  • A node was removed, so the actors had to be reactivated on the remaining nodes

Assuming the default approach, the consequences might differ depending on the lifetime of your virtual actors. Think about these two different applications:

  • The first application processes a stream of data from IoT devices. Each device produces messages at a constant or varying frequency, but frequently enough for the corresponding actors to remain permanently in memory. The list of devices does not change much, with a device being added or removed every now and then.
  • The second application uses virtual actors for a game matchmaking process. New games are started all the time, and existing ones expire. A lot of activations and deactivations of virtual actors happen in this system.

Let’s now consider what happens when we want to scale up the cluster. In the first scenario, the newly added node will not be utilized at all, because existing actor activations will not be moved by default. In the second scenario, the new node should gradually take over part of the load as new virtual actors are created and old ones are deactivated.

What about a node leaving the cluster? Of course the activations need to move. Most commonly, the actors will be unloaded from memory before node shutdown. Each actor will then be recreated on one of the remaining nodes upon the first message routed to it. This of course means the actor state needs to be reloaded from persistent storage (if persistence is used). In a scenario where every cluster node is rotated, e.g. due to an application upgrade, actors might need to be deactivated and reactivated multiple times. That potentially puts a lot of load on your persistence mechanism, and you must account for such a scenario when planning capacity. It also might result in a situation where most of the actors are activated on the nodes that were upgraded first!

So there are some problems with the default “sticky” approach. In practice, however, the virtual actor frameworks have features that allow you to work around these problems. Let’s take a look at what Orleans, Proto.Actor, Akka.Net and Dapr have in store for dealing with topology updates.

Orleans

By default, Orleans virtual actors are “sticky” as described in the previous section. The silent assumption is that actors will get deactivated at some point (e.g. due to inactivity). When a message arrives and activates an actor again, it will be spawned on a node according to the configured placement strategy. This poses a challenge when the actors are long-lived.

At the time of writing, there is a proposal for adding grain migration functionality that will enable auto balancing of the cluster.

What the user can also do is force actor deactivation by calling the grain’s DeactivateOnIdle() method, which deactivates the grain just after processing of the current message completes. This way, the developer can craft a smart strategy to force actor rebalancing. On the downside, this approach might not work well for actors that receive messages very frequently, due to Orleans’ internal caching mechanism.
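To illustrate, here is a minimal sketch of the technique, assuming a hypothetical device grain; DeactivateOnIdle() is the actual Orleans API, while the message-count threshold is just an illustrative trigger for rebalancing:

```csharp
using System.Threading.Tasks;
using Orleans;

public interface IDeviceGrain : IGrainWithStringKey
{
    Task Process(string payload);
}

public class DeviceGrain : Grain, IDeviceGrain
{
    private int _messagesSinceActivation;

    public Task Process(string payload)
    {
        // ... handle the message ...

        // Illustrative trigger: after enough messages, ask Orleans to deactivate
        // this grain once the current message completes. The next message will
        // activate it again, possibly on a different silo.
        if (++_messagesSinceActivation >= 10_000)
        {
            DeactivateOnIdle();
        }

        return Task.CompletedTask;
    }
}
```

Where the re-activation lands depends on the configured placement strategy, which is what makes this trick spread load after a topology change.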

There is a community project, OrleansContrib.ActivationShedding, that leverages the forced deactivation technique. It monitors the number of activations on each cluster member and forces actor deactivations if an imbalance is detected.

Proto.Actor

With its default Partition Identity Lookup strategy (as well as with DB Identity Lookup), the virtual actors are “sticky”, which means there is no automated way to balance the cluster. There are, however, some tools in Proto.Actor’s toolbox that can help.

Partition Activator Lookup

The first one is the Partition Activator Lookup strategy, which you can use as a substitute for Partition Identity Lookup. With this strategy, actor activations are assigned to particular members by means of a consistent hashing algorithm.

Source: https://proto.actor/docs/cluster/partition-activator-lookup/
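Opting in is a matter of passing a different identity lookup when the cluster is configured. A sketch, assuming the PartitionActivatorLookup type and the in-memory test provider from the Proto.Cluster packages; exact type names may differ between versions:

```csharp
using Proto;
using Proto.Cluster;
using Proto.Cluster.PartitionActivator;
using Proto.Cluster.Testing;
using Proto.Remote.GrpcNet;

// Swapping PartitionIdentityLookup for PartitionActivatorLookup opts into
// consistent-hash-based ownership of activations (type name assumed here,
// check the docs for your Proto.Actor version).
var system = new ActorSystem()
    .WithRemote(GrpcNetRemoteConfig.BindToLocalhost())
    .WithCluster(ClusterConfig.Setup(
        "my-cluster",
        new TestProvider(new TestProviderOptions(), new InMemAgent()),
        new PartitionActivatorLookup()));

await system.Cluster().StartMemberAsync();
```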

This means that when a new member is added to the cluster, some actors will be moved to it automatically, because the new node assumes ownership according to the consistent hash. Thanks to that, the cluster will rebalance. On the downside, all of the actors that changed ownership will be moved at once. This will result in a temporary slowdown of the system and increased load on the database when actor state is restored from persistent storage.
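The mechanics can be illustrated with a toy hash ring (not Proto.Actor’s actual implementation): nodes and actor identities are hashed onto the same ring, and an actor is owned by the first node clockwise from its hash.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Toy consistent hash ring for illustration only.
public class HashRing
{
    private readonly SortedDictionary<uint, string> _ring = new();

    public void AddNode(string node) => _ring[Hash(node)] = node;
    public void RemoveNode(string node) => _ring.Remove(Hash(node));

    public string GetOwner(string actorId)
    {
        var h = Hash(actorId);
        // First node at or after the actor's hash, wrapping around to the start.
        foreach (var entry in _ring)
            if (entry.Key >= h) return entry.Value;
        return _ring.First().Value;
    }

    private static uint Hash(string s)
    {
        using var md5 = MD5.Create();
        return BitConverter.ToUInt32(md5.ComputeHash(Encoding.UTF8.GetBytes(s)), 0);
    }
}
```

Adding a third node changes GetOwner() only for the identities whose hash falls into the new node’s arc; everything else stays put. Production implementations typically hash each node to many virtual points on the ring to keep the arcs evenly sized.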

Local Affinity

Another interesting option is the Local Affinity actor placement strategy. It prefers spawning actors on the same node as the sender of a message. It also attempts to migrate the actor if the sender is detected to be on a remote node. This can be useful when:

  • a small subset of actors collaborates on processing a request, new actors are spawned for this purpose
  • system consumes messages from a message broker, where specific partitions are assigned to specific processing nodes (e.g. Kafka, Azure Event Hub)

Let’s take a closer look at the latter case. Assume we have an IoT application and the messages produced by devices are sent to a Kafka topic, which is partitioned by device id. Assume Node 1 processes partitions 1, 2 and 3. Local Affinity will help to spawn actors with device ids belonging to partitions 1, 2 and 3 on Node 1. This requires a bit of configuration, because usually only some of the message types should participate in the Local Affinity mechanism, while others should not.

Because actors are spawned on the local node, there is no need for cross-node communication. This saves you serialization overhead and a network hop, so the processing is very efficient. The great thing about this mechanism is that nothing changes from the perspective of the developer. Virtual actor location is still transparent, and the actors can still be reached over the network, e.g. to retrieve their state.

  • IoT devices send messages to Kafka topic partitioned by device id. Actors representing the devices are spawned on the node that consumes the related partition. No cross-node communication is needed.
  • After a new node joins the consumer group, some partitions are re-assigned to it. Cross-node communication is now needed to send the messages to corresponding actors.
  • Eventually, Local Affinity migrates actors to proper node, so no cross-node communication is needed anymore.

When a new member joins the consumer group, Kafka will assign some of the partitions to it. However, since the activations stay on the existing nodes, the full overhead of network communication is now incurred to process messages from those re-assigned partitions.

A properly configured Local Affinity mechanism will help migrate actors to the proper nodes by deactivating them selectively. The next message that arrives for an actor will activate it again on the correct node. You can configure throttling to ensure the migration happens gradually over time, limiting the strain on actor state storage.
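To make the setup concrete, here is a sketch of the consuming side, assuming Confluent.Kafka and a hypothetical “device” cluster kind; with Local Affinity configured, most of these requests end up being served by actors on the same node that owns the partition:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Confluent.Kafka;
using Proto.Cluster;

public static class DeviceIngress
{
    public static async Task Run(Cluster cluster, IConsumer<string, byte[]> consumer, CancellationToken ct)
    {
        consumer.Subscribe("device-telemetry"); // hypothetical topic, keyed by device id

        while (!ct.IsCancellationRequested)
        {
            var result = consumer.Consume(ct);

            // The topic is partitioned by device id, so all messages for one device
            // arrive on the node that owns the partition. With Local Affinity, the
            // "device" actor is (eventually) activated on this same node, making the
            // request local. Deserialization of the payload is omitted for brevity.
            await cluster.RequestAsync<object>(
                result.Message.Key,    // actor identity = device id
                "device",              // hypothetical cluster kind
                result.Message.Value,
                ct);
        }
    }
}
```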

Akka.Net

Akka.Net is unique in the way it handles actor distribution across the cluster. It groups actors into shards (you specify the number of shards and the grouping rules) and then assigns shards to cluster nodes. This approach (called Cluster Sharding in Akka.Net) is a bit different from canonical virtual actors, but it helps achieve similar goals.

Automatic load balancing

Akka.Net Cluster Sharding is able to automatically load balance the cluster after a topology change. The shard orchestration process gradually migrates actors between nodes, not one by one, but rather by shards. This means that a proper sharding strategy is important. E.g. you can use the HashCodeMessageExtractor to distribute actors evenly across a defined number of shards, as in the sketch below.

  • After adding a node, Akka.Net will try to load balance the cluster with shard granularity
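For illustration, here is a sketch of a message extractor and the typical way sharding is started; DeviceEnvelope and DeviceActor are hypothetical, while HashCodeMessageExtractor and the Start call are part of the Akka.Net API:

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;

public record DeviceEnvelope(string DeviceId, object Payload);

// Maps each message to an entity id; the base class hashes that id
// into one of the configured number of shards.
public class DeviceMessageExtractor : HashCodeMessageExtractor
{
    public DeviceMessageExtractor() : base(100) { } // 100 shards

    public override string EntityId(object message)
        => (message as DeviceEnvelope)?.DeviceId;
}

// Somewhere during startup:
// var region = ClusterSharding.Get(system).Start(
//     typeName: "device",
//     entityProps: Props.Create<DeviceActor>(),   // hypothetical actor
//     settings: ClusterShardingSettings.Create(system),
//     messageExtractor: new DeviceMessageExtractor());
```

Because rebalancing moves whole shards, an even hash distribution across shards is what keeps the per-node load even after a move.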

One caveat is that in-flight requests will time out when actors change location (remember to set the timeout!), as there is no automatic retry mechanism in the message routing. Developers need to handle retries themselves.

Remembering Entities

Another unique feature of Akka.Net is the ability to remember which actors were active before a migration and to reactivate them automatically after their shard is assigned to a new node. Normally, an actor will be reactivated when the next message arrives for it. However, this might not be the desired behavior in some cases. E.g. let’s say the actor starts a timer when it is first activated, to periodically check a condition or poll data. When the actor gets deactivated by the load balancing mechanism, the periodic process is interrupted. It will only be resumed when the actor is reactivated, which normally requires a message arriving for the actor. Akka.Net comes with a mechanism to automatically reactivate the actors that were deactivated because of the migration.
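Enabling the feature is a configuration switch. A minimal sketch using the documented HOCON setting (the bootstrap code around it is illustrative, and a real cluster-sharding setup needs more configuration than shown):

```csharp
using Akka.Actor;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString(@"
    # Reactivate remembered entities after their shard moves to another node,
    # without waiting for the next incoming message.
    akka.cluster.sharding.remember-entities = on
");

var system = ActorSystem.Create("my-cluster", config);
```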

Dapr

Dapr uses the consistent hashing with bounded loads algorithm to assign actor instances to nodes in the cluster. This means that the actor-to-node assignment may change when nodes are added to or removed from the cluster. The cluster is kept in a balanced state, while at the same time the number of re-assignments is kept to a minimum.

The auto-balancing of the cluster is a nice feature of Dapr. Bear in mind, however, that calls to actors may fail when a node leaves the cluster, and it’s the developer’s responsibility to retry the communication.
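A minimal retry sketch using the Dapr .NET SDK, assuming a hypothetical IDeviceActor interface registered as the “DeviceActor” type; the retry count and backoff are illustrative, and a resilience library such as Polly would work here too:

```csharp
using System;
using System.Threading.Tasks;
using Dapr.Actors;
using Dapr.Actors.Client;

public interface IDeviceActor : IActor   // hypothetical actor interface
{
    Task Process(string payload);
}

public static class ResilientCalls
{
    public static async Task ProcessWithRetry(string deviceId, string payload)
    {
        var proxy = ActorProxy.Create<IDeviceActor>(new ActorId(deviceId), "DeviceActor");

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await proxy.Process(payload);
                return;
            }
            catch (Exception) when (attempt < 3)
            {
                // The placement may be mid-rebalance; back off and try again.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt));
            }
        }
    }
}
```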

Conclusion

Each of the virtual actor frameworks has rather unique features when it comes to dealing with cluster topology change. Think about:

  • what is your application’s traffic pattern?
  • can it tolerate a lot of actor deactivations and reactivations in a short period of time?
  • can you take advantage of local (within node) messaging?
  • do you need to automatically reactivate actors without waiting for the next message?

There is a high chance that one of the frameworks has features that cover your specific needs.
 

Marcin Budny, R&D Lead

Marcin is a software architect and developer focused on distributed systems, cloud, APIs and event sourcing. He worked on projects in the field of IoT, telemetry, remote monitoring and management, system integration and computer vision with deep learning. Passionate about the latest tech and good music.