
Performance regression on upgrade to 1.4 #4417


Closed
Zetanova opened this issue May 17, 2020 · 14 comments

Comments

@Zetanova
Contributor

I upgraded all nodes from 1.3.X to 1.4.6. All nodes are working fine, but every node uses 100% of a core while idling.

Even a WebApi node that hosts more or less no custom actors.

I am using Docker 19.03.8 and .NET Core 3.1. Normally I would use procexp to check which thread is using 100%, but under docker/linux I don't know how to do that.

(screenshot: thread list of the WebAPI node)

I would be glad to get tips on how to debug or resolve this.

@Zetanova
Contributor Author

Zetanova commented May 17, 2020

With the help of dotnet-trace I created a CPU perf trace and opened it in VS.

The hot path is marked as "external". How do I load the rest of the debug symbols?


Edit: found the filter "show external code path"
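For anyone trying the same thing, a minimal sketch of how such a trace can be collected from a running container (assuming the dotnet-trace global tool can be installed in the container; PIDs and paths depend on your setup):

```sh
# install the diagnostics tool (inside the container, or in a sidecar sharing the PID namespace)
dotnet tool install --global dotnet-trace

# list traceable .NET processes and note the target PID
dotnet-trace ps

# collect a CPU-sampling trace; stop with Ctrl+C after ~30 seconds of idle load
dotnet-trace collect --process-id <pid> --profile cpu-sampling

# the resulting .nettrace file can be opened in Visual Studio or PerfView
```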

@Zetanova
Contributor Author

One hot path is in DotNetty.Common.dll, around a slim lock.

The second hot path is in Helios.Concurrency.DedicatedThreadPool and is most likely related to the DotNetty WaitHandle.

A third hot path is in HashWheelTimerSchedule.WaitForNextTick().

Diagnosis

All hot paths seem to use the ManualResetEventSlim WaitHandle. It always spins first (effectively a busy-wait) and only then switches to a normal WaitHandle.

ManualResetEventSlim should only be used if the event is assumed to be set most of the time and the wait time is very short. If it is waited on frequently while it is usually still reset, it will burn a lot of CPU cycles spinning.
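To illustrate the point, a minimal sketch (my own illustration, not code from Akka.NET or DotNetty) of how the spin count affects an idle waiter:

```csharp
using System.Threading;

// A default ManualResetEventSlim uses a non-zero spin count: every Wait() on a
// reset event first spins on the CPU before falling back to the kernel wait handle.
var spinningEvent = new ManualResetEventSlim(initialState: false);

// spinCount: 0 makes Wait() block on the kernel handle immediately, trading a
// little wake-up latency for far fewer wasted cycles when the event is usually
// reset (an idle timer tick or thread-pool signal, for example).
var nonSpinningEvent = new ManualResetEventSlim(initialState: false, spinCount: 0);

// Typical consumer loop: if signals are rare, the spinning variant pays the spin
// cost on every iteration even though there is no work to do.
static void WaitForWork(ManualResetEventSlim signal)
{
    while (true)
    {
        signal.Wait();   // spins first (unless spinCount is 0), then blocks
        signal.Reset();
        // ... drain the work queue here ...
    }
}
```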

@Zetanova
Contributor Author

I have now tested the HashWheelTimerSchedule.WaitForNextTick() path and could not reproduce the behavior in a demo project: https://github.com/Zetanova/AkkaTimerTest

It is very strange.

My current status is that if I start 1-5 nodes in debug/docker, they consume around 80% of 2 cores while just idling.

@Zetanova
Contributor Author

Debug or production builds make no difference.

The seed node starts at ~2.5% CPU.
When the second node joins the cluster, both nodes use ~6% CPU while idling.
When the 3rd node joins the cluster, all nodes use ~11% CPU.
When the 4th and 5th nodes join the cluster, all nodes use 12-17% CPU.

5 nodes x 15-17% => ~75-85% CPU

@Aaronontheweb
Member

As we tried to show everyone as loudly as we could in all Akka.NET v1.4 release notes and documentation: https://getakka.net/articles/remoting/performance.html

Turn off remote batching if you're running a low-traffic system.
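For reference, a minimal sketch of the HOCON override described on that page (setting names as documented for Akka.NET 1.4; verify against the linked performance docs before relying on them):

```hocon
akka.remote.dot-netty.tcp {
  batching {
    # disable I/O batching for low-traffic systems
    enabled = false
  }
}
```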

@Aaronontheweb
Member

Please do that and update us with the results.

@Zetanova
Contributor Author

I already tried disabling the DotNetty buffering, without any change.
It is not about the latency, it is about the sudden high CPU consumption.
In a release build it is not that high.

I even changed back to 1.3.6 and have the same issue.

Then I tried to change an image in the k8s cluster:
the idling container/pod jumped from 20-40m to 115-150m CPU.

I am currently trying to run it on older dotnet versions.
Maybe there was something...

@Aaronontheweb
Member

Ok, if it's not an issue with the DotNetty batching system then that's a bit of a mystery - might be that .NET Core changed part of the underlying runtime itself. We didn't touch many of the concurrency primitives, other than moving onto .NET Standard 2.0.

@Zetanova
Contributor Author

I tried going back as far as mcr.microsoft.com/dotnet/core/aspnet:3.1.1-buster-slim and all of them show this new issue.

mcr.microsoft.com/dotnet/core/aspnet:3.1-alpine3.11
mcr.microsoft.com/dotnet/core/aspnet:3.1-alpine3.10
have the same issue, but only use 60MB of memory (debian-buster used 100MB).

I think it's a kernel patch or something: all the distro images got rebuilt 20 days ago, and that triggered MS to rebuild the new and old dotnet versions too.

Maybe someone can confirm the high CPU usage,
but don't forget to pull the latest builds first:
docker pull mcr.microsoft.com/dotnet/core/aspnet:3.1
docker pull mcr.microsoft.com/dotnet/core/sdk:3.1

@Zetanova
Contributor Author

It is very easy to check (a sketch of the commands follows this list):

  1. Start the seed node/container of the cluster.
     docker stats => seed node at ~2.6% CPU
  2. Start a second node/container and let it join the cluster.
     docker stats => both nodes at ~6% CPU each
  3. Start a 3rd node/container and let it join the cluster.
     docker stats => all nodes at ~10-13% CPU each
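A rough sketch of those steps as commands (the image name and network are placeholders for whatever your cluster actually uses):

```sh
# shared network so the containers can reach each other (names are hypothetical)
docker network create akka-net

# 1. start the seed node and check its idle CPU
docker run -d --name seed --network akka-net my-akka-image
docker stats --no-stream seed

# 2. start a second node, let it join via the seed, then compare
docker run -d --name node2 --network akka-net my-akka-image
docker stats --no-stream seed node2

# 3. repeat for a third node and watch the per-container CPU% column climb
docker run -d --name node3 --network akka-net my-akka-image
docker stats --no-stream
```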

@Aaronontheweb
Member

I think it's a kernel patch or something: all the distro images got rebuilt 20 days ago, and that triggered MS to rebuild the new and old dotnet versions too.

So you don't think this is an Akka.NET issue? Just want to be clear.

@Aaronontheweb
Member

Might not be a bad idea to revisit #4032 cc @akkadotnet/contributors

@Zetanova
Contributor Author

Yes, not an Akka issue.

@Aaronontheweb
Member

@Zetanova looks like there's evidence that this is an Akka.NET issue - follow #4434 for updates on it. User added a pretty convincing reproduction sample.
