How to force the Akka system to run on its own thread pool, and mixed use of ClusterClient #4419

Ralf1108 · 2020-05-18T07:49:14Z

Hi,

We are running Akka.NET v1.4.5 on Windows and experiencing random network partitions (sometimes twice an hour, sometimes only after 3 to 4 hours).
Occasionally we see missing heartbeats in the log:

Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [19170,2226] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.

Sometimes everything keeps working, but at other times we see:

[31] WARN  - 2020-05-18 01:39:45,690: Cluster Node [akka.tcp://[email protected]:13178] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://[email protected]:51747, Uid=1251870918 status = Up, role=[Web], upNumber=2)]. Node roles [So3]
[33] INFO  - 2020-05-18 01:39:53,925: A network partition detected - unreachable nodes: [akka.tcp://[email protected]:51747], remaining: [akka.tcp://[email protected]:13178]
[31] INFO  - 2020-05-18 01:39:53,972: A network partition has been detected. KeepMajority(role: 'So3') decided to down following nodes: [Member(address = akka.tcp://[email protected]:51747, Uid=1251870918 status = Up, role=[Web], upNumber=2)]
[24] INFO  - 2020-05-18 01:39:53,987: Cluster Node [akka.tcp://[email protected]:13178] - Marking unreachable node [akka.tcp://[email protected]:51747] as [Down]
[24] INFO  - 2020-05-18 01:39:54,362: Cluster Node [akka.tcp://[email protected]:13178] - Leader is moving node [akka.tcp://[email protected]:51772] to [Up]
[24] INFO  - 2020-05-18 01:39:54,362: Cluster Node [akka.tcp://[email protected]:13178] - Leader is removing unreachable node [akka.tcp://[email protected]:51747]
[24] INFO  - 2020-05-18 01:39:54,362: Member removed [akka.tcp://[email protected]:51747]
[31] WARN  - 2020-05-18 01:39:54,440: Association to [akka.tcp://[email protected]:51747] having UID [1251870918] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[33] INFO  - 2020-05-18 01:39:54,503: Removing receive buffers for [akka.tcp://[email protected]:13178]->[akka.tcp://[email protected]:51747]
[35] ERROR - 2020-05-18 01:40:03,206: No response from remote for outbound association. Associate timed out after [15000 ms].

We then programmatically restart the disassociated Akka system, and the nodes reconnect.

Interestingly, this also happens when there is no load on the system.

Because of the log message, we think the Akka system itself may not be getting enough thread time to manage its heartbeats (a heartbeat delayed by almost 20 seconds is rather long).

What we did so far:

Moved every long-running operation into a task and piped the result back to its actor
Ensured that all remaining message handlers in the actors do as little work as possible

But this didn't fix the issue.
What we now want to do is run the whole actor system on a dedicated thread pool, not only the user actors. The documentation has a section on using dedicated threads for actors (Dispatchers).
That section describes a "Global Dispatcher":

By default, all actors share a single Global Dispatcher. Unless you change the configuration, this dispatcher uses the .NET Thread Pool behind the scenes, which is optimized for most common scenarios. That means the default configuration should be good enough for most cases.

My question:
How can we change this "Global dispatcher" to use its own thread pool so we can figure out if this fixes our problem?

Our other theory concerns ClusterClient. When we introduced Akka.NET into our system, we started with a separate Akka process and connected to it from our web server via ClusterClient. Recently we integrated SignalR into our web server. To be able to send messages from the Akka process to the SignalR component running on the web server, we started a second Akka system there and joined it to the Akka process. So the web server is currently part of the cluster and also uses the ClusterClient in our legacy code to talk to the Akka process.

In the cluster client documentation it says:

ClusterClient should not be used when sending messages to actors that run within the same cluster. Similar functionality as the ClusterClient is provided in a more efficient way by Distributed Publish Subscribe in Cluster for actors that belong to the same cluster.

It says this is not recommended, but not why, and there is no hint about what could go wrong.
Does anybody know more about how this non-recommended approach could affect system stability?

Ralf1108 · 2020-05-25T10:01:22Z

We managed to reproduce the problem, although not reliably, with a small test program. This should help to diagnose the error:

See #4432

Ralf1108 · 2020-05-25T10:03:11Z

Maybe this issue can be closed, because the reproduction sample could be enough to find the problem.

Ralf1108 · 2020-05-27T14:42:51Z

FYI: it is possible to use dedicated dispatchers for the different system actor groups.
According to Cluster config documentation there is a config key called "use-dispatcher" to set own custom dispatchers.
This approach is also used in other akka sections.

I will give it a try with a dedicated ForkJoinDispatcher for the cluster stuff. Maybe there is some improvement.
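
A minimal sketch of what such a setup might look like, assuming the ForkJoinDispatcher type and dedicated-thread-pool settings described in the Akka.NET dispatcher documentation. The dispatcher name cluster-internal-dispatcher and the thread count are illustrative assumptions, not values taken from this thread:

```hocon
# Hypothetical dispatcher name; any unused config path would do.
cluster-internal-dispatcher {
  type = ForkJoinDispatcher
  dedicated-thread-pool {
    thread-count = 2          # assumed value, tune for the host
    threadtype = background
  }
}

# Point the cluster's internal actors at the dedicated dispatcher.
akka.cluster.use-dispatcher = cluster-internal-dispatcher
```

With this in place the cluster actors would run on their own threads rather than on the shared .NET thread pool. Whether that is enough to stop the delayed heartbeats is not confirmed here.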

Aaronontheweb · 2020-05-27T22:34:37Z

Sorry for not replying sooner @Ralf1108 - but yes, you should be able to customize the dispatcher for the ClusterClient. The Props fluent interface also supports specifying a dispatcher that way.

ismaelhamed · 2020-06-19T18:01:01Z

#26816 would probably help

to11mtm · 2020-07-10T01:03:14Z

#26816 would probably help

I think it would. I tried an ad-hoc-ish implementation and found that, while max performance seemed a little slower (possibly due to machine constraints, since it has a low core count), all sorts of stability issues went away in our stress tests, both around Cluster heartbeats and around DData.

Aaronontheweb · 2020-07-10T01:20:38Z

@to11mtm @ismaelhamed I was thinking about this exact issue today actually - that we have a couple of outstanding issues in Akka.NET's internal actors:

Akka.Remote runs on its own internal threadpool and that fixed a number of performance issues;
Other systems, like Akka.Persistence, run on their own but probably shouldn't - they should use a dispatcher that also gets shared with other pieces of system infrastructure.

We should port that PR and just put everything on the same dispatcher. And we should up the max internal concurrency setting to 64 just like on the JVM.

ismaelhamed · 2020-07-10T05:25:25Z

Yep. The key point of that PR is to protect Akka internals from user code. This is especially true in Akka Cluster, where failure to respond to heartbeats due to thread starvation can cause a lot of unnecessary temporary unreachables.

to11mtm · 2020-07-11T20:19:53Z

@Aaronontheweb I was curious if you could clarify:

Other systems, like Akka.Persistence, run on their own but probably shouldn't - they should use a dispatcher that also gets shared with other pieces of system infrastructure.

Is the thought here that Persistence should be running in the 'internal-dispatcher'?

Also, on that same note, I notice many persistence plugins (e.g. the SQL ones) usually specify 'default-dispatcher' in their configs. While I know SQL Server is the worst culprit here with its high risk of blocking during both reads and writes, this seems like something else that should be revisited, as DB Access is almost always blocking to some level.

Aaronontheweb · 2020-07-21T19:05:37Z

@to11mtm

Is the thought here that Persistence should be running in the 'internal-dispatcher'?

That's correct.

as DB Access is almost always blocking to some level.

I think we use await methods on the native drivers for running most queries, which should use overlapped I/O, I/O completion ports, and I/O threads - that shouldn't have much of an impact on the execution system. However, the way their continuation tasks get scheduled is a different story. We could theoretically marshal all of those Tasks onto a TaskScheduler that runs on the same threads as the actor, but that would have to be done explicitly since you can't set the TaskScheduler.Default ambiently without already being inside a Task.

Aaronontheweb · 2020-07-21T19:12:05Z

@Ralf1108 I think what you're asking for can actually be done already via the use-dispatcher HOCON:

https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.Cluster.Tools/Client/reference.conf#L29

That will allow the ClusterClientReceptionist to run inside its own dispatcher, effectively excluding its workload from the work queue that is shared by other actors inside Akka.NET.
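
As a rough sketch, assuming the key at that line is akka.cluster.client.receptionist.use-dispatcher, the override could look like the following. The dispatcher name receptionist-dispatcher and its thread count are hypothetical:

```hocon
# Hypothetical dedicated dispatcher for the receptionist.
receptionist-dispatcher {
  type = ForkJoinDispatcher
  dedicated-thread-pool {
    thread-count = 1          # assumed value
    threadtype = background
  }
}

# Run the ClusterClientReceptionist on the dispatcher defined above.
akka.cluster.client.receptionist.use-dispatcher = receptionist-dispatcher
```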

As for this:

ClusterClient should not be used when sending messages to actors that run within the same cluster. Similar functionality as the ClusterClient is provided in a more efficient way by Distributed Publish Subscribe in Cluster for actors that belong to the same cluster.

This is poorly worded - what it means is: don't use ClusterClient within the cluster, for inter-node communication. Use it when you have an external system that itself isn't part of the cluster (e.g. a web UI) that needs to communicate with an Akka.NET cluster (e.g. some back-end nodes).

Will changing the dispatcher for the receptionist help?

Swoorup · 2020-08-03T07:37:06Z

Does this solve the issue for now?

FYI: it is possible to use dedicated dispatchers for the different system actor groups.
According to Cluster config documentation there is a config key called "use-dispatcher" to set own custom dispatchers.
This approach is also used in other akka sections.

I will give it a try with a dedicated ForkJoinDispatcher for the cluster stuff. Maybe there is some improvement.

Swoorup · 2020-08-05T06:58:44Z

For anyone who might be looking for a workaround to this, here is the config I am using for now:

        akka {
          cluster {
            failure-detector {
              heartbeat-interval = 5 s
            }
            use-dispatcher = cluster-dispatcher
            run-coordinated-shutdown-when-down = on
            auto-down-unreachable-after = 10s
            seed-nodes = [ %s ]
            roles = ["importer"]
          }
        }
        cluster-dispatcher {
          type = "Dispatcher"
          executor = "fork-join-executor"
          fork-join-executor {
            parallelism-min = 2
            parallelism-max = 4
          }
        }

Aaronontheweb · 2020-10-06T02:13:31Z

close via #4511

Swoorup · 2020-10-26T04:48:13Z

@Aaronontheweb does this mean that we don't need any additional config to prevent starvation issues of this kind?

Aaronontheweb · 2020-10-26T16:01:14Z

@Swoorup correct, no additional config should be needed - the newest nightly builds of Akka.NET should handle this automatically.

Ralf1108 mentioned this issue May 25, 2020

How prevent "Scheduled sending of heartbeat was delayed" and occasionally network partitions #4432

Closed

Arkatufus mentioned this issue Jul 10, 2020

Port scala akka PR #26816 to Akka.NET #4511

Merged

Aaronontheweb closed this as completed Oct 6, 2020

Aaronontheweb added this to the 1.4.11 milestone Oct 6, 2020

Aaronontheweb added the akka-actor label Oct 6, 2020
