Architecture

Last updated 3 months ago

Learn about the vlingo/platform architecture and that of DDD-based microservices built with it.

The vlingo/platform is built fully on a highly concurrent, reactive, message-driven foundation. The bedrock of the platform is actor-based using vlingo/actors. This foundation implements the Actor Model of computation. All other platform components are actor-based. Thus, it's appropriate to first discuss the architecture of the actor-based, message-driven runtime.

Message-Driven Runtime

The Actor Model provides an outstanding, non-leaky, abstraction of concurrency. Given that your hardware supports multiple cores, and even many cores, you can expect tremendous levels of parallelism among many actors that carry out application-level service requests.

The major architectural abstractions of the vlingo/actors Actor Model toolkit are:

  • World: This is the container abstraction within which actors live and operate. A World can have a number of Stage instances in which the life cycles of a subset of live actors are managed. When the World is started, actors can be created. When the World terminates, all Stage and Actor instances are terminated/stopped.

  • Stage: Every World has at least one Stage, known as the default. The Stage is specifically where actors are managed and within which they play or execute. There may be multiple Stage instances, each with several or many actors (even millions) under its management. Each Stage has an internal Directory within which live actors are held, and through which they can be found.

  • Actor: Each actor is an object, but one that reacts to incoming asynchronous messages, and that sends outgoing messages asynchronously. Each actor is assigned a Mailbox, and a Mailbox delivers messages though a Dispatcher. When the actor receive a message, it performs some business behavior internally, and then completes that specific message-handling context. All actors are type safe in that their behaviors are defined by Java method signatures rather than arbitrary object types. Every actor may implement any (practical) number of behavioral interfaces, known as actor protocols.

  • Supervisor: Actors are supervised in order to deal with failure. When an actor experiences an exception while handling a message, the exception is caught by the message dispatcher and is relayed to the actor’s supervisor as a message. When received by the supervisor, the exceptional message is interpreted to determine the appropriate step to correct the actor’s problem. The corrective action can be one of the following: resume, restart, stop, or escalate. In the case of supervision escalation, the exceptional message is relayed to this supervisor’s supervisor, for it to take some action. There are four kinds of supervision: direct parent, default public root, override of public root, and registered common supervisors for specific actor protocols (behavioral interfaces).

  • Scheduler: Each Stage has a scheduler that can be used to schedule future tasks, and that may by repeated on time intervals. An actor determines the timeframe that is needed, and each occasion on which the interval is reached, the actor receives an interval signal message indicating the the it is time to perform some task. The receiving actor can then execute some necessary behavior in response to the interval signal. The actor that creates the scheduled task need not be the target actor of the interval signal message.

  • Logger: Logging capabilities are provided to every actor, and different loggers may be assigned to different actors. Log output occurs asynchronously because loggers are run as actors.

  • Plugins: The vlingo/platform in general, and vlingo/actors specifically, support a plugin architecture. New kinds of plugins can be created at any time to extend the features of vlingo/actors, and any other platform component. In particular, mailboxes and dispatchers, supervision, and logging can be extended via plugins.

  • Testkit: There is a very simple testkit that accompanies vlingo/actors, making it quite easy to test individual and collaborating actors for adherence to protocol and for correctness.

The following is a vlingo/actors architecture diagram.

The components seen in this diagram can be traced back to the above names and descriptions. In the default Stage notice that there are three special actors, one that is {bright yellow} with a #, one that is {bright yellow} with a *, and one that is {red} with an X. Respectively, these are:

# The private root actor, which is the parent of the public root actor and the dead letters actor

* The public root actor, which is the default parent and supervisor if no others are specified

X The dead letters actor, which receives actor messages that could not be delivered

Next the vlingo/cluster is discussed, and following the are comments on the overall platform architecture.

Scale and Resilience with the Multi-Node Cluster Architecture

The vlingo/cluster is a key component that sits on top of vlingo/actors, to support the development of scalable and fault-tolerant tools and applications. A number of additional tools building out the vlingo/platform are built on top of vlingo/cluster. Further, it is intended that you implement and deploy your services/applications in clusters.

Generally a cluster will be composed of multiple nodes of an odd number (not just one, but for example, 3, 5, 21, or 127). The reason for choosing an odd number of nodes is to make it possible to determine whether there is a quorum of nodes (totalNodes / 2 + 1) that can form a healthy cluster. Choosing an even number of nodes works, but in that case, when loosing one node, it doesn’t improve the quorum determination, nor does it strengthen the cluster when all nodes are available.

When a quorum of nodes are available and communicating with one another, a leader is elected. The leader is responsible for making certain decisions in behalf of the cluster, and also announces newly joined nodes and those that have left the cluster.

The cluster consensus protocol is one known by the name Bully Algorithm. Although an unpleasant name, the protocol is simple but powerful. This consensus protocol chooses a leader by determining the node with the "greatest id value." The identity may be numeric or alphanumeric. If numeric the "greatest node id" is determined with numeric comparisons. If alphanumeric, the "greatest node id" is determined lexicographically. Any node that senses that the previous leader has been lost (left the cluster for any reason), messages the other known "nodes with greater ids" asking for an election to take place. Any "nodes with greater ids" tell the "lesser id node" that it will not be the leader. Finally when there are no nodes of "greater id" than a given node--known because it receives no "you aren't the leader" messages--that "greatest id node" declares itself leader. In essence the "node of the greatest id" bullies its way into the leadership position, asserting that it is currently greatest. Any latent message received from a "node of even greater id" than the current declared leader will not take leadership away from it, because it is a healthy leader node.

It's possible that our consensus protocol will be augmented in the future with the Gossip Protocol to support very large clusters. We also have the option to swap out the Bully Algorithm for another, or to provide multiple consensus strategies.

The vlingo/cluster works in full-duplex mode, with two kinds of cluster node communication channels. There are the operational channels, which the cluster nodes use to maintain the health of the cluster. There are also application channels, which the services/applications use to pass messages between nodes. Using two different channel types allows for tuning each type in different ways. For example, application channels may be assigned to a faster network than the operational channels. It also opens the possibility for either channel type to use TCP or UDP independently of the other channel type.

Besides scalable fault-tolerance, the vlingo/cluster also provides cluster-wide, synchronizing attributes of name-value pairs. This enables the cluster to share live and mutating operational state among all nodes. Likewise, application-level services can also make use of cluster-wide, synchronizing attributes in order to enhance shared application/service values.

The following diagram shows a three-node vlingo/cluster.

The cluster depicted here is composed of three nodes. If one node is lost the cluster will still maintain a quorum. However, if two nodes are lost and only one node remains in a running, healthy state, the quorum is lost and the cluster as a whole is considered unhealthy. In that case the one remaining node goes into idle state and awaits one or more of the other nodes to return to a healthy operational state, which will again constitute a quorum and enable a healthy running cluster. A leader is elected using the cluster consensus protocol.

Inter-Node Messaging

Each cluster member maintains communication with the others by means of an operational channel. Over this channel is sent cluster node health information and notifications of nodes joining and leaving the cluster. When a leader is elected, whether when the cluster first starts or when a leader node leaves the cluster, the election and final declaration of leadership are announced over the operational channel. At appropriate times the leader sends cluster directory messages to all nodes so that each node can be aware of all other available nodes.

There is a second kind of channel on each cluster node, the application channel. This channel is used strictly for application-level messages, those between actors that reside on different nodes. There are benefits to separating operational and application messages. Potentially the traffic of each channel could be separated by different networks and use different protocols. For example, the operational channels could be on a slower network because operational messages are fewer and relatively infrequent. The application channel, on the other hand, may need to transport many millions of messages over short timeframes, and thus may require a faster network than the operational one. Further, the protocol used for application channel messages could be different; one having the capacity for much higher throughput.

Using file-based cluster configuration, this defines the nodes of a three-node cluster, each with a unique op or operations port and app or application port.

node.accounts1.id = 1
node.accounts1.name = accounts1
node.accounts1.host = accounts-svr1
node.accounts1.op.port = 37371
node.accounts1.app.port = 37372
node.accounts2.id = 2
node.accounts2.name = accounts2
node.accounts2.host = accounts-svr2
node.accounts2.op.port = 37371
node.accounts2.app.port = 37372
# highest id, default leader
node.accounts3.id = 3
node.accounts3.name = accounts3
node.accounts3.host = accounts-svr3
node.accounts3.op.port = 37371
node.accounts3.app.port = 37372

When one node connects with another node, it opens a connection on both its operational and application channels. These channels facilitate only unidirectional communication. In other words, when a node receives a message on its incoming operational channel, it does not send an outgoing response to the originating node on that same bidirectional channel. Instead, the responding node uses the operational channel of the client node that originated the message. The directory messages circulated by the cluster leader contain the operational and application address information for each node in the cluster.

Cluster-Wide Attributes

The vlingo/cluster platform component also provides cluster-wide, synchronizing attributes. This enables the cluster to share live and mutating operational state among all nodes. These attributes may be operational in nature, or application attributes. Either way, the creation, modification, and removal of an attribute is handled over the operational channel (see previous section). Each attribute has a name, any supported simple datatype such as String, Boolean, Integer, Double, etc., and a value.

Any time that a new attribute is created, or when its value is modified, or when it is deleted, the operation is made know to all live cluster nodes, allowing them to synchronize. The attribute operational protocol includes configurable retries, and any node to newly join the cluster is informed of all current live attributes, and then will be included in ongoing activity notifications.

Platform Architecture

More content follows herein...