Libp2p connections return stream reset but no disconnections notified

Howdy,

We’ve been seeing an elevated number of stream resets that do not originate from our application-layer code on top of libp2p. To be clear, the protocol messages never seem to reach the stream handlers that could reject them with a stream reset; the streams appear to get reset before the messages trickle up to our stack. To make matters worse, the connection then seems to hang in an asymmetric way: one peer believes it is still connected to the other, while the other does not share that view.

Things appeared to get worse after the update to libp2p v0.16.0, and with v0.17.0 the resets are now pervasive, causing higher-level protocols that are meant to enforce SLAs to go haywire and impose all sorts of peer sanctions. We never get a disconnection notification from the Host's Network. These nodes sit together on the same k8s cluster, so this is quite peculiar behavior that we haven’t seen before.

Could you point out what can lead to this sort of situation? Have you seen this sort of behavior before?

I am assuming you are using go-libp2p?

Maybe @marten or @vyzo can help here.

That is correct, we are using go-libp2p.

  1. In v0.17.0, we decreased the concurrent stream limit in yamux from 1000 to 256 (the limit applies per yamux session).
  2. This probably points to a bug / leak in how you use streams. You might want to check that you’re properly closing streams once you’re done with them, otherwise you’ll leak resources (and run into the limit). See the sketch after this list.
  3. In v0.18.0, we will introduce a resource manager (see the release notes for rc1 for more details), which will (dynamically) limit the number of streams. We just released rc4, which lifts the yamux limit in favor of the limits imposed by the resource manager. These limits can be adjusted (see the resource manager documentation for details). Note that if there’s a leak in your code, you’ll eventually run into those limits as well, so it would pay off to investigate the leak regardless.
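A minimal sketch of the pattern, assuming a request/response style protocol (the function name, protocol ID, and wire format here are placeholders, not your actual code): every return path either fully closes the stream or resets it, so no stream slot in the yamux session is leaked.

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/protocol"
)

// sendRequest opens a transient stream, writes a request, reads the reply,
// and always releases the stream, so the per-session yamux stream limit
// isn't exhausted by leaked streams.
func sendRequest(ctx context.Context, h host.Host, p peer.ID, proto protocol.ID, req []byte) ([]byte, error) {
	s, err := h.NewStream(ctx, p, proto)
	if err != nil {
		return nil, fmt.Errorf("opening stream: %w", err)
	}

	if _, err := s.Write(req); err != nil {
		s.Reset() // abort on error: frees the stream on both sides
		return nil, err
	}
	// Signal that we're done writing so the remote sees EOF.
	if err := s.CloseWrite(); err != nil {
		s.Reset()
		return nil, err
	}

	resp, err := io.ReadAll(s)
	if err != nil {
		s.Reset()
		return nil, err
	}
	// Fully close the stream once done; forgetting this (or the Reset calls
	// above) is what leaks stream slots in the yamux session.
	if err := s.Close(); err != nil {
		return nil, err
	}
	return resp, nil
}
```

The important part is that every code path ends in either Close or Reset; a stream that is neither closed nor reset keeps occupying a slot until the session is torn down.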

We do use a lot of transient streams, but I am not sure how many we use concurrently. We definitely do not multiplex a protocol over a single long-lived stream as some suggest (although, as far as I remember, the eth2 implementation examples don’t do that either and use transient, disposable streams).
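In case it helps others hitting the same thing, here is a rough sketch of how one can count the currently open streams per connection via the host's Network API, to see how close a session gets to the limit (the grouping by protocol is just for debugging):

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p-core/host"
)

// logOpenStreams prints how many streams are currently open on each
// connection, grouped by protocol ID. Streams that are never closed
// show up here and creep toward the per-session yamux limit.
func logOpenStreams(h host.Host) {
	for _, c := range h.Network().Conns() {
		byProto := make(map[string]int)
		for _, s := range c.GetStreams() {
			byProto[string(s.Protocol())]++
		}
		fmt.Printf("peer %s: %d open streams %v\n",
			c.RemotePeer().Pretty(), len(c.GetStreams()), byProto)
	}
}
```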

Can you elaborate on what the expected behavior is when the stream limit is reached? Is the data just dropped silently, are we supposed to get a stream reset, or is there another explicit error that is expected to bubble up the stack? It might be useful to return an explicit error in that case (just a suggestion).