Q: My actors seem to hang or deadlock. What's wrong?
A: Several things can cause these kinds of problems. Consider the following possibilities.
When you use actors, you have a finite number of physical threads on which to run message handlers. If all of the available threads are occupied at the same time and stay blocked for relatively long periods, your actors will appear to hang or deadlock. This may be due to blocking or otherwise slow message handling, and it is sometimes an environmental problem. This is the overarching problem; the following points provide more detail.
Blocking a thread means that it is parked like a non-moving car in a garage. There are no two ways about it. If you block on file or network I/O, or wait for any other operation to complete, your physical thread is blocked. There is nothing that makes a Java thread block but frees the physical thread behind it. The physical thread blocks if your Java thread blocks. If all or most of your threads are blocked, you aren't processing messages at the rate that would be possible if blocking were eliminated.
You may be using Completes::await() or Future::get(), which are blocking operations. If your actors use these, message processing on the affected thread stops for as long as the blocking operation takes to finish. Note that both forms of Completes::await() are meant for tests only, and only the test case itself should call either form. Never use them in a production environment.
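The cost of blocking inside a handler can be seen with plain JDK classes. The following is a minimal sketch, not XOOM code: a two-thread pool stands in for the limited physical threads, CompletableFuture stands in for an asynchronous reply, and the class name, handler count, and 100 ms delay are all assumptions chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class BlockingVersusCallback {
  // Runs 8 simulated message handlers on 2 dispatcher threads.
  // Every handler needs a reply that arrives 100 ms after it is requested.
  static long elapsedMillis(boolean block) throws Exception {
    final int threads = 2;
    final int handlers = 8;
    final long ioMillis = 100;
    ExecutorService dispatcher = Executors.newFixedThreadPool(threads);
    ScheduledExecutorService io = Executors.newSingleThreadScheduledExecutor();
    CountDownLatch done = new CountDownLatch(handlers);
    long start = System.nanoTime();
    for (int i = 0; i < handlers; i++) {
      dispatcher.execute(() -> {
        // Simulated slow I/O: the reply completes later on another thread.
        CompletableFuture<String> reply = new CompletableFuture<>();
        io.schedule(() -> reply.complete("data"), ioMillis, TimeUnit.MILLISECONDS);
        if (block) {
          reply.join(); // like Future::get(): parks this dispatcher thread
          done.countDown();
        } else {
          // Registers a continuation and returns at once; the thread is free.
          reply.thenAccept(data -> done.countDown());
        }
      });
    }
    done.await();
    long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    dispatcher.shutdown();
    io.shutdown();
    return elapsed;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("blocking handlers: " + elapsedMillis(true) + " ms");
    System.out.println("callback handlers: " + elapsedMillis(false) + " ms");
  }
}
```

With blocking handlers the two threads can serve only two requests per 100 ms, so the eight handlers need about four rounds. With callbacks all eight requests are in flight almost immediately.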
You simply have too few machine cores to service all your actors' messages in a timely manner. This would be rare, but it's possible. We have seen the use of very thinly provisioned cloud server machines, such as dual core, used to handle long-running message handling with mandatory cloud node timeouts. Given two cores, you likely have four physical threads available (due to hyper-threading; 2 cores x 2 threads per core). If you are trying to service several or many actors with only four threads and three or four of them are blocking on I/O or other slow processing, your cloud node stands a high risk of health timeout and being shut down by the infrastructure.
Related to the previous point, but not obvious, is that there may be fewer physical threads than you think. For example, if you are running on AWS, certain provisioning choices change the standard 2 threads per core. Specifically, AWS Fargate provides 1 thread per core, not 2. Thus, 2 cores yield 2 physical threads, not 4. This is typically unexpected by DevOps. Be sure you understand what you are getting out of your cloud VM.
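One way to verify what a node actually provides is to ask the JVM at startup. This is a minimal sketch using the standard Runtime::availableProcessors() call; the class name and the minimum of 4 threads are assumptions for illustration, not recommended values.

```java
final class HardwareThreadCheck {
  // Number of processors the JVM reports as available to it.
  static int availableThreads() {
    return Runtime.getRuntime().availableProcessors();
  }

  // True when the node offers fewer threads than the service was sized for.
  static boolean thinlyProvisioned(int available, int minimumExpected) {
    return available < minimumExpected;
  }

  public static void main(String[] args) {
    int available = availableThreads();
    int minimumExpected = 4; // hypothetical sizing assumption
    System.out.println("JVM sees " + available + " hardware thread(s)");
    if (thinlyProvisioned(available, minimumExpected)) {
      System.out.println("Warning: fewer threads than the expected " + minimumExpected);
    }
  }
}
```

Logging this value when the service starts makes a surprise such as 2 threads instead of 4 visible before it shows up as stalled actors.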
Too many actors are dependent on limited resources, such as pooled read/write buffers. Even if your actor messages are handled rapidly, the actors may still exhaust a resource pool: too many of them hold a resource while idle, waiting for messages that will never arrive because other actors cannot obtain the resources they need to send them. To solve this, allocate resource pools for smaller sets of related actors with predictable message processing, make your resource pool(s) elastic, or both. An elastic resource pool grows its number of pooled resources on high demand and shrinks that number when demand decreases.
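An elastic pool as described above can be sketched in a few lines. This is an illustrative sketch only, not a XOOM component: the class ElasticBufferPool, its maxIdle limit, and the choice of ByteBuffer as the pooled resource are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

final class ElasticBufferPool {
  private final ArrayDeque<ByteBuffer> idle = new ArrayDeque<>();
  private final int bufferSize;
  private final int maxIdle;
  private int allocated; // buffers currently alive: in use plus idle

  ElasticBufferPool(int bufferSize, int maxIdle) {
    this.bufferSize = bufferSize;
    this.maxIdle = maxIdle;
  }

  // Grows on demand: reuses an idle buffer, or allocates a new one.
  synchronized ByteBuffer acquire() {
    ByteBuffer buffer = idle.pollFirst();
    if (buffer == null) {
      buffer = ByteBuffer.allocate(bufferSize);
      allocated++;
    }
    return buffer;
  }

  // Shrinks when demand drops: keeps at most maxIdle buffers for reuse
  // and lets the garbage collector reclaim the rest.
  synchronized void release(ByteBuffer buffer) {
    if (idle.size() < maxIdle) {
      buffer.clear();
      idle.addFirst(buffer);
    } else {
      allocated--;
    }
  }

  synchronized int allocated() { return allocated; }

  synchronized int idle() { return idle.size(); }

  public static void main(String[] args) {
    ElasticBufferPool pool = new ElasticBufferPool(1024, 2);
    ByteBuffer[] inUse = new ByteBuffer[5];
    for (int i = 0; i < inUse.length; i++) inUse[i] = pool.acquire();
    System.out.println("under load, allocated: " + pool.allocated());
    for (ByteBuffer buffer : inUse) pool.release(buffer);
    System.out.println("after load, allocated: " + pool.allocated());
  }
}
```

A production pool would likely also cap total allocation and shrink on a timer rather than on release, but the grow and shrink behavior is the same idea.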
Sometimes a service can become unresponsive due to inactivity of actors it depends on. This happens when a client actor sends a message to a service-providing actor and the service-providing actor fails to respond. This may occur when the service-providing actor crashes and loses the context of the message it was handling when it crashed. It may be possible for the actor that crashes to maintain the context across the supervisor's intervention and, if resumed from or ignoring the crash, the actor could take corrective measures and respond. In some cases this may be impractical or impossible, and it could be that the client actor must resend its request message. The client actor can detect this situation by means of a scheduled timer interval, giving the client actor intermittent opportunities to check on expected progress and take corrective measures if stalled.
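The client-side check described above can be reduced to a small piece of bookkeeping that runs on every timer interval. This is a hedged sketch in plain Java: PendingRequest, its Action values, and the timeout and resend limits are hypothetical names and numbers, and the call to onInterval() is assumed to come from whatever scheduled timer the client actor uses.

```java
final class PendingRequest {
  enum Action { WAIT, RESEND, GIVE_UP, DONE }

  private final long timeoutMillis;
  private final int maxResends;
  private long sentAtMillis;
  private int resends;
  private boolean answered;

  PendingRequest(long nowMillis, long timeoutMillis, int maxResends) {
    this.sentAtMillis = nowMillis;
    this.timeoutMillis = timeoutMillis;
    this.maxResends = maxResends;
  }

  // Called when the service-providing actor's reply arrives.
  void answered() {
    answered = true;
  }

  // Called on every timer interval; decides what the client should do next.
  Action onInterval(long nowMillis) {
    if (answered) return Action.DONE;
    if (nowMillis - sentAtMillis < timeoutMillis) return Action.WAIT;
    if (resends >= maxResends) return Action.GIVE_UP;
    resends++;
    sentAtMillis = nowMillis; // the timeout restarts with the resend
    return Action.RESEND;
  }

  public static void main(String[] args) {
    PendingRequest request = new PendingRequest(0, 100, 1);
    for (long now = 50; now <= 200; now += 50) {
      System.out.println("t=" + now + " ms: " + request.onInterval(now));
    }
  }
}
```

Passing the current time in as a parameter keeps the logic free of blocking and easy to test; the client actor resends its request message when the result is RESEND and escalates or reports failure on GIVE_UP.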
There may be other possibilities, but they are probably somehow related to the above. You may benefit by using one or more of the following:
Review the section Considering Scheduler Latency.
Use separate Scheduler instances.
Consider dealing with scheduling using different strategies.
Q: My actor has experienced a race condition. Is this a bug in XOOM Actors?
A: Not likely. We suggest that it's theoretically impossible given the design of our actor foundation, which ensures that only one thread can dispatch one message to a given actor at any given time. OK, so why does it look like a race condition? One of the following is likely true.
You aren't, but have a bug that makes it appear that you have experienced a race.
You are, but it's caused by some client making a direct method invocation on an actor at the same time that a separate assigned thread is delivering a message. We try to make it very difficult to break the rules of the Actor Model, which include actors only accepting asynchronous messages one at a time. Still, we can't prevent a programmer from exposing the actor's this reference outside the actor. For example, it is a bug to pass this to the Scheduler or to some service actor that our actor depends on. It will compile and appear to work, because the actor implements whatever protocol, such as SomeProtocol, the receiver expects. But that's wrong and a sure way to experience races. You should always pass selfAs(SomeProtocol.class) as a parameter to the Scheduler or another actor.
You are, but it's because you have a Completes<T> expression evaluating inside your actor that mutates state at the same time that a separate assigned thread is delivering a message that mutates state. Some overlook the fact that Completes<T> delivers asynchronous outcomes via a pooled actor designed for that purpose. This means that a separate thread may be entering your actor because you have invited it in with the Completes<T> outcome. You can solve this problem in one of two ways:
Never use Completes<T> from inside an actor. Design collaborating actors to accept dependents as expected protocols. For example, SomeProtocol provides a service message (method) that takes a SomeProtocolInterest as a parameter. When the actor handling the SomeProtocol message delivery has finished, it replies to its dependent using SomeProtocolInterest. Of course the dependent actor implements the SomeProtocolInterest interface, and passes a reference using selfAs(SomeProtocolInterest.class).
If you do use Completes<T> inside your actor, and admittedly this may be necessary given the design of some protocols, never modify your actor's state from a Completes<T> outcome handler, such as andThen(function). Rather, send yourself a message that will be handled with exclusive access to your state. We suggest not exposing the protocol used for this purpose outside your actor and documenting the protocol as being for internal use only. As an example, your actor may implement MyInternalProtocol (by a different name) and the Completes<T> outcome pipeline handler dispatches to one of its messages (methods) such as: selfAs(MyInternalProtocol.class).accept(someOutcomeData).
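The send-yourself-a-message rule can be illustrated without XOOM types. In this sketch, which is an analogy and not the XOOM implementation, a single-thread executor plays the role of the actor's mailbox, CompletableFuture plays the role of Completes<T>, and the field named self plays the role of selfAs(MyInternalProtocol.class). All class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SelfMessageSketch {
  interface InternalProtocol { void accept(int outcome); }

  static final class Counter implements InternalProtocol {
    // One thread drains the mailbox, so messages are handled one at a time.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private int total; // unsynchronized state, touched only on the mailbox thread

    // Stand-in for selfAs(...): enqueues a message, never invokes directly.
    private final InternalProtocol self =
        outcome -> mailbox.execute(() -> accept(outcome));

    // The outcome handler runs on a foreign thread, so it only sends a message.
    CompletableFuture<Void> requestFrom(CompletableFuture<Integer> service) {
      return service.thenAccept(self::accept);
    }

    // Message handler: must be reached only through the mailbox.
    @Override
    public void accept(int outcome) { total += outcome; }

    // Reads the state via the mailbox, after all earlier messages.
    int totalAfterDrain() throws Exception {
      return mailbox.submit(() -> total).get();
    }

    void stop() { mailbox.shutdown(); }
  }

  // Sends `outcomes` asynchronous results of 1 each and returns the total.
  static int run(int outcomes) throws Exception {
    Counter counter = new Counter();
    List<CompletableFuture<Void>> sent = new ArrayList<>();
    for (int i = 0; i < outcomes; i++) {
      sent.add(counter.requestFrom(CompletableFuture.supplyAsync(() -> 1)));
    }
    for (CompletableFuture<Void> each : sent) each.join(); // all messages enqueued
    int total = counter.totalAfterDrain();
    counter.stop();
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("total: " + run(100));
  }
}
```

Replacing self::accept with this::accept in requestFrom() would let pool threads mutate total directly, which is the race described above.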
Q: How do Java-based XOOM Actors and Erlang BEAM processes differ in how fairness of message processing is managed?
A: First note that there are some terminology differences between XOOM Actors and the Erlang BEAM. The term Scheduler in XOOM Actors is used for timers that schedule one-time or repeating future signals to an actor. The XOOM Actors term for delivering messages to actors on threads is dispatching. The Erlang BEAM uses the terms scheduler and scheduling to describe how time slices are used to manage processes (a.k.a. actors).
The Erlang BEAM doesn't assign OS threads to processes in the way that Java does. Instead it has a virtual OS of its own, which enables all kinds of ways to control fairness. Think of how Un*x and Windows operating systems work. These literally interrupt the execution of a process that is running on a thread and dynamically assign that thread to another process. That's what the Erlang BEAM does to manage its runnable processes. Thus, the Erlang BEAM uses preemptive multitasking via its scheduler by giving processes time slices, and then task switching across any number of runnable processes. And, yes, the BEAM virtual OS is interrupted by the actual machine controlling OS to assign physical CPU threads to other processes running on the machine controlled by the OS.
On the other hand, Java uses cooperative multitasking, which relies on code to give up a thread (complete a message reaction) quickly. A JVM is interrupted by the machine-controlling OS in mid-execution of some or many code paths in order to assign physical CPU threads to other processes running on the machine controlled by the OS. Cooperative multitasking is far less fair than OS task switching based on fixed time slices because much depends on how long any given actor's message delivery and handling require before the actor voluntarily gives up its assigned thread.
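The effect of cooperative multitasking on fairness can be observed with a single dispatcher thread. This sketch uses only JDK executors and is not how XOOM dispatches messages; the class name, the three-step "long" handler, and the idea of splitting it into re-enqueued chunks are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CooperativeFairness {
  // Returns the order in which work ran on one dispatcher thread.
  static List<String> run(boolean chunked) throws InterruptedException {
    ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    List<String> log = new ArrayList<>(); // written only by the dispatcher thread
    int steps = 3;
    CountDownLatch done = new CountDownLatch(steps + 1);
    // Seed task: enqueues the long work first, then a short message behind it.
    dispatcher.execute(() -> {
      if (chunked) {
        dispatcher.execute(chunk(dispatcher, log, done, 1, steps));
      } else {
        dispatcher.execute(() -> {
          // Monolithic handler: holds the thread for all steps.
          for (int i = 1; i <= steps; i++) {
            log.add("long-" + i);
            done.countDown();
          }
        });
      }
      dispatcher.execute(() -> {
        log.add("short");
        done.countDown();
      });
    });
    done.await();
    dispatcher.shutdown();
    return log;
  }

  // One step of the long work; gives up the thread and re-enqueues the rest.
  private static Runnable chunk(ExecutorService dispatcher, List<String> log,
                                CountDownLatch done, int step, int last) {
    return () -> {
      log.add("long-" + step);
      if (step < last) dispatcher.execute(chunk(dispatcher, log, done, step + 1, last));
      done.countDown();
    };
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("monolithic: " + run(false));
    System.out.println("chunked:    " + run(true));
  }
}
```

In the monolithic run the short message waits until every step of the long handler has finished. In the chunked run the long handler voluntarily yields after its first step, so the short message is handled second. Nothing preempts the long handler in either case; fairness depends entirely on how the handler is written.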