How to force the Akka system to run on its own thread pool, and mixed use of ClusterClient #4419

Closed
Ralf1108 opened this issue May 18, 2020 · 16 comments

@Ralf1108
Contributor

Ralf1108 commented May 18, 2020

Hi,

We are running Akka.NET v1.4.5 on Windows and experiencing random network partitions (sometimes twice an hour, sometimes only after 3 to 4 hours).
Occasionally we see warnings about delayed heartbeats in the log:

Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [19170,2226] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.

Sometimes everything keeps working, but at other times we see:

[31] WARN  - 2020-05-18 01:39:45,690: Cluster Node [akka.tcp://[email protected]:13178] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://[email protected]:51747, Uid=1251870918 status = Up, role=[Web], upNumber=2)]. Node roles [So3]
[33] INFO  - 2020-05-18 01:39:53,925: A network partition detected - unreachable nodes: [akka.tcp://[email protected]:51747], remaining: [akka.tcp://[email protected]:13178]
[31] INFO  - 2020-05-18 01:39:53,972: A network partition has been detected. KeepMajority(role: 'So3') decided to down following nodes: [Member(address = akka.tcp://[email protected]:51747, Uid=1251870918 status = Up, role=[Web], upNumber=2)]
[24] INFO  - 2020-05-18 01:39:53,987: Cluster Node [akka.tcp://[email protected]:13178] - Marking unreachable node [akka.tcp://[email protected]:51747] as [Down]
[24] INFO  - 2020-05-18 01:39:54,362: Cluster Node [akka.tcp://[email protected]:13178] - Leader is moving node [akka.tcp://[email protected]:51772] to [Up]
[24] INFO  - 2020-05-18 01:39:54,362: Cluster Node [akka.tcp://[email protected]:13178] - Leader is removing unreachable node [akka.tcp://[email protected]:51747]
[24] INFO  - 2020-05-18 01:39:54,362: Member removed [akka.tcp://[email protected]:51747]
[31] WARN  - 2020-05-18 01:39:54,440: Association to [akka.tcp://[email protected]:51747] having UID [1251870918] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
[33] INFO  - 2020-05-18 01:39:54,503: Removing receive buffers for [akka.tcp://[email protected]:13178]->[akka.tcp://[email protected]:51747]
[35] ERROR - 2020-05-18 01:40:03,206: No response from remote for outbound association. Associate timed out after [15000 ms].

We then programmatically restart the disassociated actor system, and the nodes reconnect.

Interestingly, this also happens when there is no load on the system.

Because of the log message, we think the actor system itself may not be getting enough thread time to manage its heartbeats (a heartbeat delayed by roughly 20 seconds is a long time).

What we did so far:

  • Moved every long-running operation into a task and piped the result back to its actor
  • Ensured that all remaining message handlers in the actors do as little work as possible

But this didn't fix the issue.
What we now want to try is running the whole actor system on a dedicated thread pool, not only the user actors. The documentation has a section, Dispatchers, on how to use dedicated threads for actors.
That section mentions a "Global Dispatcher":

By default, all actors share a single Global Dispatcher. Unless you change the configuration, this dispatcher uses the .NET Thread Pool behind the scenes, which is optimized for most common scenarios. That means the default configuration should be good enough for most cases.

My question:
How can we change this "Global dispatcher" to use its own thread pool so we can figure out if this fixes our problem?

Our other theory concerns how we introduced Akka.NET into our system. We started with a separate Akka process and connected to it from our web server via ClusterClient. Recently we integrated SignalR into the web server. To be able to send messages from the Akka process to the SignalR component running on the web server, we started a second actor system there and connected it to the Akka process. So the web server is currently part of the cluster and also uses the ClusterClient in our legacy code to talk to the Akka process.

In the cluster client documentation it says:

ClusterClient should not be used when sending messages to actors that run within the same cluster. Similar functionality as the ClusterClient is provided in a more efficient way by Distributed Publish Subscribe in Cluster for actors that belong to the same cluster.

It says this is not recommended, but not why, and there is no hint about what could go wrong.
Does anybody know more about how this non-recommended approach could affect system stability?

@Ralf1108
Contributor Author

We managed to reproduce the problem, though not reliably, with a small test program. This should help to diagnose the error:

See #4432

@Ralf1108
Contributor Author

Maybe this issue can be closed, because the reproduction sample could be enough to find the problem.

@Ralf1108
Contributor Author

FYI: it is possible to use dedicated dispatchers for the different system actor groups.
According to the Cluster config documentation, there is a config key called "use-dispatcher" for setting a custom dispatcher.
This approach is also used in other akka sections.

I will give it a try with a dedicated ForkJoinDispatcher for the cluster stuff. Maybe there is some improvement.
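
A minimal sketch of that idea, assuming the `akka.cluster.use-dispatcher` key mentioned above and the `ForkJoinDispatcher` form from the Akka.NET dispatcher documentation. The dispatcher name `cluster-dedicated-dispatcher` and the thread count are illustrative assumptions, not tested values:

```hocon
# Hypothetical dispatcher name; any unique id works as long as
# akka.cluster.use-dispatcher refers to the same path.
cluster-dedicated-dispatcher {
  type = ForkJoinDispatcher
  throughput = 100
  dedicated-thread-pool {
    thread-count = 2        # illustrative; size this to the host's core count
    threadtype = background
  }
}

# Run the cluster's own actors (heartbeats, gossip) on the dispatcher above
# instead of the shared default dispatcher.
akka.cluster.use-dispatcher = cluster-dedicated-dispatcher
```

The point of the dedicated pool is isolation: heartbeat scheduling no longer competes with user actors for .NET thread pool threads, at the cost of a few extra OS threads.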

@Aaronontheweb
Member

Sorry for not replying sooner @Ralf1108 - but yes, you should be able to customize the dispatcher for the ClusterClient. The Props fluent interface also supports specifying a dispatcher that way.

@ismaelhamed
Member

#26816 would probably help

@to11mtm
Member

to11mtm commented Jul 10, 2020

#26816 would probably help

I think it would. I tried an ad-hoc-ish implementation and found that, while max performance seemed a little slower (possibly due to machine constraints, since it has a low core count), all sorts of stability issues went away in our stress tests, both around Cluster heartbeats and around DData.

@Aaronontheweb
Member

@to11mtm @ismaelhamed I was thinking about this exact issue today actually - that we have a couple of outstanding issues in Akka.NET's internal actors:

  1. Akka.Remote runs on its own internal threadpool and that fixed a number of performance issues;
  2. Other systems, like Akka.Persistence, run on their own but probably shouldn't - they should use a dispatcher that also gets shared with other pieces of system infrastructure.

We should port that PR and just put everything on the same dispatcher. And we should up the max internal concurrency setting to 64 just like on the JVM.
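
For orientation, the shared internal dispatcher in JVM Akka 2.6 is configured roughly as sketched below. Whether the Akka.NET port uses the same key (`akka.actor.internal-dispatcher`) and the same defaults is an assumption here, not something confirmed in this thread:

```hocon
# Sketch modeled on the JVM reference configuration; key names and
# values in Akka.NET are assumed, not verified.
akka.actor.internal-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  throughput = 5
  fork-join-executor {
    parallelism-min = 4
    parallelism-factor = 1.0
    parallelism-max = 64    # the upper bound of 64 mentioned above
  }
}
```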

@ismaelhamed
Member

ismaelhamed commented Jul 10, 2020

Yep. The key point of that PR is to protect Akka internals from user code. This is especially true in Akka Cluster, where failure to respond to heartbeats due to thread starvation can cause a lot of unnecessary temporary unreachables.

@to11mtm
Member

to11mtm commented Jul 11, 2020

@Aaronontheweb was curious if you could clarify:

Other systems, like Akka.Persistence, run on their own but probably shouldn't - they should use a dispatcher that also gets shared with other pieces of system infrastructure.

Is the thought here that Persistence should be running in the 'internal-dispatcher'?

Also, on that same note, I notice many persistence plugins (e.g. the SQL ones) usually specify 'default-dispatcher' in their configs. While I know SQL Server is the worst culprit here, with its high risk of blocking during both reads and writes, this seems like something else that should be revisited, as DB access is almost always blocking to some level.
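
As a sketch of what revisiting this could look like, the journal plugin can be pointed at a dedicated dispatcher instead. This assumes the SQL Server journal's `plugin-dispatcher` key; the dispatcher name `persistence-dispatcher` and the thread count are hypothetical:

```hocon
# Hypothetical dedicated dispatcher for journal I/O.
persistence-dispatcher {
  type = ForkJoinDispatcher
  throughput = 100
  dedicated-thread-pool {
    thread-count = 4        # illustrative value
    threadtype = background
  }
}

akka.persistence.journal.sql-server {
  # Assumed key; overrides the 'default-dispatcher' setting noted above.
  plugin-dispatcher = "persistence-dispatcher"
}
```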

@Aaronontheweb
Member

@to11mtm

Is the thought here that Persistence should be running in the 'internal-dispatcher'?

That's correct.

as DB Access is almost always blocking to some level.

I think we use await methods on the native drivers for running most queries, which should use overlapped I/O, I/O completion ports, and I/O threads - that shouldn't have much of an impact on the execution system. However, the way their continuation tasks get scheduled is a different story. We could theoretically marshal all of those Tasks onto a TaskScheduler that runs on the same threads as the actor, but that would have to be done explicitly since you can't set the TaskScheduler.Default ambiently without already being inside a Task.

@Aaronontheweb
Member

@Ralf1108 I think what you're asking for can actually be done already via the use-dispatcher HOCON:

https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.Cluster.Tools/Client/reference.conf#L29

That will allow the ClusterClientReceptionist to run inside its own dispatcher, effectively excluding its workload from the work queue that is shared by other actors inside Akka.NET.
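
A minimal sketch of that setting, assuming the `use-dispatcher` key sits under `akka.cluster.client.receptionist` as in the linked reference.conf. The dispatcher name `receptionist-dispatcher` and the parallelism values are illustrative:

```hocon
# Hypothetical dispatcher reserved for the ClusterClientReceptionist.
receptionist-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2     # illustrative values
    parallelism-max = 4
  }
}

akka.cluster.client.receptionist {
  use-dispatcher = receptionist-dispatcher
}
```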

As for this:

ClusterClient should not be used when sending messages to actors that run within the same cluster. Similar functionality as the ClusterClient is provided in a more efficient way by Distributed Publish Subscribe in Cluster for actors that belong to the same cluster.

This is poorly worded - what it means is: don't use ClusterClient within the cluster, for inter-node communication. Use it when you have an external system that itself isn't part of the cluster (e.g. a web UI) that needs to communicate with an Akka.NET cluster (e.g. some back-end nodes).

Will changing the dispatcher for the receptionist help?

@Swoorup

Swoorup commented Aug 3, 2020

Does this solve the issue for now?

FYI: it is possible to use dedicated dispatchers for the different system actor groups.
According to the Cluster config documentation, there is a config key called "use-dispatcher" for setting a custom dispatcher.
This approach is also used in other akka sections.

I will give it a try with a dedicated ForkJoinDispatcher for the cluster stuff. Maybe there is some improvement.

@Swoorup

Swoorup commented Aug 5, 2020

For anyone who might be looking for a workaround to this, here is the config I am using for now:

        akka {
          cluster {
            failure-detector {
              heartbeat-interval = 5 s
            }
            use-dispatcher = cluster-dispatcher
            run-coordinated-shutdown-when-down = on
            auto-down-unreachable-after = 10s
            seed-nodes = [ %s ]
            roles = ["importer"]
          }
        }
        cluster-dispatcher {
          type = "Dispatcher"
          executor = "fork-join-executor"
          fork-join-executor {
            parallelism-min = 2
            parallelism-max = 4
          }
        }

@Aaronontheweb
Member

close via #4511

@Aaronontheweb Aaronontheweb added this to the 1.4.11 milestone Oct 6, 2020
@Swoorup

Swoorup commented Oct 26, 2020

@Aaronontheweb does this mean that we don't need any additional config to prevent starvation issues of this kind?

@Aaronontheweb
Member

@Swoorup nope, the newest nightly builds of Akka.NET should automatically handle this.
