Grid

Using vlingo/lattice as a data and compute grid.

The vlingo/lattice component provides an API for distributed computations and data processing across multiple nodes in a cluster. It supports distributed parallel processing by sending computational execution requests to actors on any node in a vlingo/cluster with the potential to receive results in return. This is embodied in the vlingo/lattice Grid.

Using the Grid

The vlingo/lattice Grid is a distributed compute construct that is implemented as a vlingo/actors Stage. The Grid API is the same as you would expect from Stage with one exception: the Actor implementations that should be started on a Grid must implement the GridActor abstract class rather than Actor.

To start the Grid use one of the static Grid.start overloads. You may then start GridActor implementations as you would normally do on a Stage

GridActor and StatelessGridActor

The GridActor adds two new methods to the Actor API.

  • S provideRelocationSnapshot()

  • void applyRelocationSnapshot(S snapshot)

This snapshot is not the same as that provided for EventSourced or CommandSourced entities. Instead, it is what Grid will use when your actor must be relocated to another node, such as when continuous redistribution takes place by adding nodes to or removing nodes from the cluster. That is, the Grid uses these as a means to re-balance the actors running on it when the cluster composition changes.

package io.vlingo.actors;
public abstract class GridActor<S extends Serializable> extends Actor {
implements RelocationSnapshotSupplier<S>,
RelocationSnapshotConsumer<S> {
...
}
@FunctionalInterface
public interface RelocationSnapshotSupplier<S> {
S provideRelocationSnapshot();
}
@FunctionalInterface
public interface RelocationSnapshotConsumer<S> {
void applyRelocationSnapshot(S snapshot);
}

The provideRelocationSnapshot() must be implemented by the developer in order to provide a logical, serializable snapshot of the actor’s state.

The applyRelocationSnapshot() must be implemented by the developer in order to reestablish the actor’s state on a remote host. This method should be treated as a constructor, and usually, it should mirror the implementation of the start() method message handler.

If your actor implementation is stateless, a convenience abstract class StatelessGridActor is provided, which implements these relocation methods as no-ops.

The GridActor and StatelessGridActor can also be used on a traditional, non-distributed implementation of Stage via the World.actorFor() API. Yet, to make use of the distributed actor features, you must explicitly reference a Grid and create actor instances using one of the Grid actorFor() overrides.

Using Models On the Grid

Since the introduction of Grid, all vlingo/lattice model implementations have been updated to implement GridActor. This includes most abstract actor types, but not all. Those changed are as follows.

io.vlingo.lattice.model.object.ObjectEntity
io.vlingo.lattice.model.process.ObjectProcess
io.vlingo.lattice.model.process.SourcedProcess
io.vlingo.lattice.model.process.StatefulProcess
io.vlingo.lattice.model.sourcing.CommandSourced
io.vlingo.lattice.model.sourcing.EventSourced
io.vlingo.lattice.model.stateful.StatefulEntity

All model variants provide default implementations for provideRelocationSnapshot(), using the String id or streamName property of the abstract model implementation respectively. Users of these classes must implement applyRelocationSnapshot() since the implementation of that method is specific to the implementation of the entity’s state. One must remember to call restore() after the relocation snapshot is applied so that the entity state can be loaded from the backing storage.

Grid Implementation Details and Guarantees

Each Grid node in the cluster has a copy of a hash ring data structure. Without going into details on how a hash ring works, it is used to determine where in the cluster a given actor is currently located. This means that a given actor purposely has a single instance in the entire distributed Grid, and that one instance will be pinned to a given node. When a message is sent to a GridActor, the Grid looks up in the hash ring on which node that actor is located. The message is then serialized and sent over the network to the node that houses that actor. The Grid instance there looks up the actor locally and delivers the message through the actor's mailbox.

There is, of course, an optimization if the message is sent to an actor that is on the same node as the sender. Such a condition delivers the message directly through the target actor's mailbox without any network overhead.

At its current state, vlingo/lattice Grid is a preview feature. There are a few limitations currently that will soon be eliminated, as discussed here. Please use the current implementation with this awareness.

  1. The Grid currently does not take into consideration the cluster’s quorum in order to prevent multiple instances of the same logical actor doing work in split-brain scenarios. The prevention of split-brain consequences will be corrected soon.

  2. Furthermore, losing a node from your cluster will take down all the actor instances that were previously running on that node. A node failure situation would cause messages sent from clients to that actor to be irrecoverably lost. This lost node limitation will also be addressed in a near-term future release.

These two current limitations are to be corrected in the following ways.

  1. Split-brain scenario: A node that has lost the quorum with the other cluster nodes (as per the vlingo/cluster specification) enters into an illegal state. Attempting to send/deliver a message to an actor from this illegal state node causes exceptions to be thrown to the client each time it attempts to use any actor, whether previously managed locally or remotely. Requests to start new actors on the Grid also fail with an exception. The user must handle these errors in a way that is meaningful for their application (e.g. return an error response to the user of the application, log, etc.)

  2. Lost node scenario: Losing a node will be addressed by extending our internal protocol to support recovery. For stateless and persistent actors this is as simple as starting a new instance of the actor on another node. Stateful actors that are not persistent would have to replicate their state over a number of secondary nodes, other than the one that is primarily responsible for handling messages sent from the clients so that work can resume in the event of a node’s failure.

Entity Examples