[Tracking issue] go-libp2p resource manager critical post release fixes #9442

@BigLep

This is the tracking issue for the streams of work that need to be fixed as a result of the Kubo 0.17 release, which enabled the go-libp2p resource manager by default: #8761

Must complete before 0.18 RC

Theme 2: reports of default limits not working well for users

Theme 3: improve RM errors coming from other peers

This is related to reports of a disabled go-libp2p resource manager still managing resources. In fact, this is an error message from a remote go-libp2p peer that has exceeded its own resource limits. When the message is printed on the local peer, it's hard to differentiate between the two.

Things we can do:

  1. Adjust messaging for resource manager on the local Kubo side so it's easier to differentiate between local and remote peer resource manager errors.
    • It looks like the message already is differentiable. Feel free to do more though.
  2. Add some docs on how to differentiate, and point to the go-libp2p issue for handling this better in go-libp2p: quic: Error from peer should be explicit that it is coming from remote peer rather than a local error libp2p/go-libp2p#1928

Theme 4: confusion around "magic values"

This is about how "4611686018427388000" looks like a random number when it is actually our "infinity".

Things we can do:

  1. Allow -1 to denote infinity for the go-libp2p resource manager. Requested: Enable specifying infinite limits with a more "intuitive" magic number like -1 libp2p/go-libp2p#1935
  2. Document what Kubo's magic value is:
  3. Use another number that is more obviously a magic number, like 999999999999999999

Options not on the table:

  1. Changing the ipfs swarm limits all output from "4611686018427388000" to "infinity", because that isn't valid JSON.
  2. Similarly, adding a comment after the magic number, because comments aren't valid JSON and the Go JSON parser we use doesn't parse non-spec-compliant input.
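
As a rough illustration of the -1 option above, here is a minimal sketch of how a user-facing -1 could be translated to the internal "infinity" while the JSON output stays a plain number. The names kuboInfinity, unlimitedSentinel, normalizeLimit and describeLimit are hypothetical and not part of Kubo or go-libp2p; the constant simply reuses the value quoted in this issue.

```go
package main

import "fmt"

// kuboInfinity reuses the value quoted in this issue as Kubo's "infinity".
// The name is hypothetical.
const kuboInfinity int64 = 4611686018427388000

// unlimitedSentinel is the proposed user-facing marker for "no limit".
// This is an assumption about how the proposal could look, not current behavior.
const unlimitedSentinel int64 = -1

// normalizeLimit converts a user-supplied limit into the value used internally.
// -1 becomes kuboInfinity, other negative values are rejected.
func normalizeLimit(v int64) (int64, error) {
	switch {
	case v == unlimitedSentinel:
		return kuboInfinity, nil
	case v < 0:
		return 0, fmt.Errorf("invalid limit %d: use %d for unlimited", v, unlimitedSentinel)
	default:
		return v, nil
	}
}

// describeLimit renders a limit for logs or docs (not for JSON output,
// which has to stay numeric).
func describeLimit(v int64) string {
	if v >= kuboInfinity {
		return "unlimited"
	}
	return fmt.Sprintf("%d", v)
}

func main() {
	for _, in := range []int64{-1, 128, -5} {
		out, err := normalizeLimit(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d -> %s\n", in, describeLimit(out))
	}
}
```

Keeping the translation at the config boundary means the JSON printed by ipfs swarm limits all would remain valid, which is the constraint behind the options listed as not on the table.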

Theme 5: be clearer on startup about what limits are being set and why

Add a log message like:

First computing default go-libp2p resource manager limits based on Swarm.ResourceMgr.MaxMemory of {Swarm.ResourceMgr.MaxMemory} and then applying any user-supplied overrides on top. Run ipfs swarm limits all to see the resulting limits.
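
A minimal sketch of how such a message could be assembled, assuming the configured maximum memory is available as a string at startup. The function name startupLimitsMessage and the example value "4GB" are hypothetical; the config key is written as documented in Kubo's docs/config.md.

```go
package main

import "fmt"

// maxMemoryKey is the config key named in the message.
const maxMemoryKey = "Swarm.ResourceMgr.MaxMemory"

// startupLimitsMessage builds the proposed startup log line.
// maxMemory is the configured value as a string, e.g. "4GB" (example value only).
func startupLimitsMessage(maxMemory string) string {
	return fmt.Sprintf(
		"First computing default go-libp2p resource manager limits based on %s of %s "+
			"and then applying any user-supplied overrides on top. "+
			"Run 'ipfs swarm limits all' to see the resulting limits.",
		maxMemoryKey, maxMemory,
	)
}

func main() {
	fmt.Println(startupLimitsMessage("4GB"))
}
```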

Theme 6: Wrong tone about resource manager getting in the way (being a bug) vs. being a feature

UX angle about being clear that in general this is a feature not a bug:

Theme 7: clarity around the "error message" meaning

It isn't clear what "system: cannot reserve inbound connection: resource limit exceeded" means. In this example, it means Swarm.ResourceMgr.Limits.System.ConnsInbound is exceeded.

Things we can do:

  1. Provide docs on how to interpret this message.
  2. Reverse engineer the message based on https://github.com/libp2p/go-libp2p/blob/master/p2p/host/resource-manager/scope.go so we can map it back to Swarm.ResourceMgr.Limits.$scope.$limit. If we do that, we can then print what the limit value is.
    • 2022-12-06 maintainer conversation: not going to do this
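
For the docs in option 1, the mapping from message to config key can be illustrated with a small sketch. It covers only the single example given in this issue; the lookup tables and the function name limitKeyForError are hypothetical, and any further entries would have to be taken from scope.go rather than guessed.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the one mapping stated in this issue is included here.
var scopeNames = map[string]string{
	"system": "System",
}

var limitNames = map[string]string{
	"cannot reserve inbound connection": "ConnsInbound",
}

// limitKeyForError maps a message of the form
// "<scope>: <action>: resource limit exceeded" to a config key.
// It returns false for anything it does not recognize.
func limitKeyForError(msg string) (string, bool) {
	msg = strings.TrimSuffix(strings.TrimSpace(msg), ".")
	parts := strings.Split(msg, ": ")
	if len(parts) != 3 || parts[2] != "resource limit exceeded" {
		return "", false
	}
	scope, okScope := scopeNames[parts[0]]
	limit, okLimit := limitNames[parts[1]]
	if !okScope || !okLimit {
		return "", false
	}
	return "Swarm.ResourceMgr.Limits." + scope + "." + limit, true
}

func main() {
	key, ok := limitKeyForError("system: cannot reserve inbound connection: resource limit exceeded")
	fmt.Println(key, ok)
}
```

String matching like this is brittle against upstream wording changes, which is one plausible reason to prefer documentation over code here.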

Theme 8: Provide actionable advice when resource limits are hit

When a resource limit is hit, we point users to https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr. We could provide a better documentation path for how someone debugs this situation.

Theme 9: have the ConnMgr limits be set under Swarm.ResourceMgr.Limits.System.ConnsInbound

As discussed in #9468, this allows low-priority idle connections to get cleaned up to make space for higher-priority connections.

Things we can do:

  1. Lower the default limits
  2. Log when ConnMgr limits aren't under Swarm.ResourceMgr.Limits.System.ConnsInbound
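
The logging check in option 2 could look roughly like the sketch below. It assumes the relevant ConnMgr limit is the high-water mark (Swarm.ConnMgr.HighWater); that choice, the function name checkConnMgrBelowInbound, and the example numbers are assumptions made for illustration.

```go
package main

import "fmt"

// checkConnMgrBelowInbound returns a warning and true when the ConnMgr limit
// is not strictly below the resource manager's inbound connection limit.
// connMgrHighWater is assumed to come from Swarm.ConnMgr.HighWater.
func checkConnMgrBelowInbound(connMgrHighWater, connsInbound int) (string, bool) {
	if connMgrHighWater < connsInbound {
		return "", false
	}
	return fmt.Sprintf(
		"ConnMgr limit (%d) is not under Swarm.ResourceMgr.Limits.System.ConnsInbound (%d); "+
			"idle connections may not be cleaned up before inbound connections are refused",
		connMgrHighWater, connsInbound,
	), true
}

func main() {
	// Example values only.
	if warning, ok := checkConnMgrBelowInbound(900, 800); ok {
		fmt.Println("WARN:", warning)
	}
}
```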

Theme 10: resource manager doing its job of protecting a node is alarming

This was brought up throughout #9432, but generally users have found the resource manager "ERRORs" spammy. That they are printed as ERRORs also runs counter to our narrative about this being a feature. Should a feature doing its job be an error?

Things we can do:

Theme 11: fix bugs in the swarm stats command

Theme 12: remove additional footguns around (soft) ConnMgr and (hard) ResourceMgr limits and their interactions

Theme 13: clarify and improve handling of zeroes

Ideally completing before the 0.18 final release

Theme 1: usability issues in entering config

It's too easy for someone to enter invalid config and not get any feedback that they have done so.
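
One way to surface such mistakes, sketched here under assumptions, is to decode the limits strictly so that unknown keys produce an error instead of being silently ignored. The systemLimits struct is a hypothetical, heavily trimmed stand-in for the real config types, and parseStrict is not an existing Kubo function; Decoder.DisallowUnknownFields is part of Go's standard encoding/json package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// systemLimits is a hypothetical, trimmed stand-in for a limits section.
type systemLimits struct {
	ConnsInbound  int `json:"ConnsInbound"`
	ConnsOutbound int `json:"ConnsOutbound"`
}

// parseStrict decodes raw JSON and rejects keys that do not match a field,
// so a misspelled key is reported instead of being dropped.
func parseStrict(raw string) (systemLimits, error) {
	var limits systemLimits
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&limits); err != nil {
		return systemLimits{}, fmt.Errorf("invalid limits config: %w", err)
	}
	return limits, nil
}

func main() {
	// "ConnsInbond" is a deliberate typo.
	if _, err := parseStrict(`{"ConnsInbond": 100}`); err != nil {
		fmt.Println(err)
	}
}
```

Note that Go matches JSON keys to struct fields case-insensitively, so this catches misspellings but not differences in capitalization.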

Theme 11: Improve usability in the swarm stats command
