Rust <-> Go interop - idle connection timeout race condition?

Hi everyone,

I have a question regarding rust-libp2p ↔ go-libp2p interoperability.

I’m trying to query this DHT bootstrap node:

/dns/bootnode.1.lightclient.mainnet.avail.so/tcp/37000/p2p/12D3KooW9x9qnoXhkHAjdNFu92kMvBRSiFBMAoC5NnifgzXjsuiM

It is written in Rust, and the source code is in the availproject/avail-light-bootstrap repository on GitHub (the bootstrap node for the Avail Light Client).

I’m trying to build a tool that can bootstrap into Avail’s DHT network. This tool will be written in Go, so I’m using go-libp2p to connect to that bootstrapper.

I noticed that I can successfully connect to the bootstrapper, but the connection gets pruned immediately. In go-libp2p I call .Connect(...) on a BasicHost, and this doesn’t return an error. But when I list the open connections right afterwards, it reports 0.

To debug the issue, I ran the bootstrapper node locally and could see in the logs that every time I connect, the connection is established but immediately followed by a TCP KeepAliveTimeout. From my investigation I found that this keep-alive/idle timeout is set to 0 by default.

The condition to prune a connection is computed here and depends on the idle timeout. The tl;dr:

As long as we’re still negotiating substreams or have any active streams shutdown is always postponed.

So, a solution would be to just configure a non-zero idle_connection_timeout.

However, I searched the Ethereum Lighthouse code and couldn’t find a reference that sets the idle_connection_timeout to a non-zero value. There’s also this PR, which seems to discuss this, and at a quick glance a 0 idle connection timeout seems to be the desired behaviour. This makes me question whether I’m on the right path with the above solution or whether something else needs to be configured differently.

From my understanding above, it looks to me like there is a race condition. The BasicHost in go-libp2p waits until the identify exchange has completed before handing control back to the user (source). This means that even if I open a stream right after the Connect call has returned, there’s a brief moment where there are no open streams and nothing is being negotiated. In that moment the remote rust-libp2p peer can prune the connection.

When querying my local bootstrapper (as opposed to the hosted one), I could confirm that it sometimes works and sometimes doesn’t (which hints at racy behaviour). I could also make it work consistently by just dialing the peer and immediately opening the DHT stream.

I confirmed that Prysm also uses a libp2p BasicHost under the hood (source). Does this mean the Prysm ↔ Lighthouse interaction is also susceptible to the above race condition?

To end with some clear questions:

  • Is there really a race condition or am I completely wrong here?
  • Should users set a non-zero idle connection timeout? What’s the recommendation?
  • If 0 is the recommended value, wouldn’t this make go ↔ rust interop flaky because of the above reasons?

Some notes for experimentation with the Avail bootstrapper:
  • There are some requirements that my host needs to satisfy in order to connect:
    • the host must have an allowed agent version like: avail-light-client/bootstrap/0.1.3/rust-client
    • the host must support the DHT protocol: /avail_kad/id/1.0.0-b91746 (regardless of being a DHT client or not (which doesn’t make sense imo but that’s a different discussion))
  • For my experimentation I’m using vole.

Hey!

From what it sounds like, this doesn’t sound like a race condition; instead it sounds like rust-libp2p is disconnecting due to an idle connection, which happens right away because the idle timeout is set to zero by default. There have been discussions about increasing the idle timeout, but in this case you might want to set it yourself via Config::with_idle_connection_timeout in SwarmBuilder::with_swarm_config (see the ping example). I find 10 to 15 seconds to be enough as a default, but this could be smaller or larger depending on your use case.

Hi @darius, thanks for your input!

From what it sounds like, this doesn’t sound like a race condition; instead it sounds like rust-libp2p is disconnecting due to an idle connection, which happens right away because the idle timeout is set to zero by default.

I called it a race condition because from go-libp2p’s perspective rust-libp2p’s idle timeout is racing against its own efforts to open a new stream. Sometimes rust wins and go gets disconnected and sometimes go wins and we have an open stream which keeps the connection open.

This obviously applies not only to the rust ↔ go interaction but in general. I’m mentioning go-libp2p here because its default BasicHost explicitly waits until the identify exchange has completed before the user can open a new stream, which amplifies this issue.

And, as said above, because Prysm uses the BasicHost under the hood, I believe this could have consequences for Prysm → Lighthouse connectivity in the Ethereum network. I wasn’t able to confirm this with João from Lighthouse yet though. Lighthouse nodes usually have many open connections to Prysm nodes but I hypothesise that they are mostly outbound (instead of inbound) from Lighthouse’s perspective. This is what we couldn’t confirm yet.

There have been discussions about increasing the idle timeout, but in this case you might want to set it yourself via Config::with_idle_connection_timeout in SwarmBuilder::with_swarm_config (see the ping example). I find 10 to 15 seconds to be enough as a default, but this could be smaller or larger depending on your use case.

I wasn’t aware of the discussion, thanks for the pointer! I think changing the default idle timeout in rust-libp2p to something > 0 is the right solution here. In the meantime, manually setting it works as well of course :+1: I just wanted to raise the general issue (which already seems to be under discussion).