Hi everyone,
I have a question regarding rust-libp2p ↔ go-libp2p interoperability.
I’m trying to query this DHT bootstrap node:
/dns/bootnode.1.lightclient.mainnet.avail.so/tcp/37000/p2p/12D3KooW9x9qnoXhkHAjdNFu92kMvBRSiFBMAoC5NnifgzXjsuiM
It is written in Rust, and the source code is here: GitHub - availproject/avail-light-bootstrap: Bootstrap node for the Avail Light Client
I’m trying to build a tool that can bootstrap into Avail’s DHT network. This tool will be written in Go, so I’m using go-libp2p to connect to that bootstrapper.
I noticed that I can successfully connect to the bootstrapper, but the connection gets pruned immediately. In go-libp2p I call .Connect(...) on a BasicHost, and this doesn’t return an error. But when I list the open connections right afterwards, there are 0.
To debug the issue, I ran the bootstrapper node locally and could see in the logs that every time I connect, a connection is established but immediately followed by a TCP KeepAliveTimeout. From my investigation, I found that this keep-alive/idle timeout is set to 0 by default.
The condition to prune a connection is computed here and depends on the idle timeout. The tl;dr:
As long as we’re still negotiating substreams or have any active streams shutdown is always postponed.
So, one solution would be to just configure a non-zero idle_connection_timeout.
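To make the rule above concrete, here is a minimal Go sketch of how I understand it. This is a simplified model, not rust-libp2p's actual code: the connState type, its fields, and the shouldClose function are hypothetical names made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// connState is a hypothetical, simplified view of a connection as seen by the
// remote peer. It is not a type from rust-libp2p.
type connState struct {
	negotiating   int           // substreams currently being negotiated
	activeStreams int           // fully negotiated, open streams
	idleFor       time.Duration // time since the connection last had any activity
}

// shouldClose models the quoted rule: shutdown is postponed while anything is
// negotiating or active; otherwise the connection is closed once it has been
// idle for at least idleTimeout.
func shouldClose(c connState, idleTimeout time.Duration) bool {
	if c.negotiating > 0 || c.activeStreams > 0 {
		return false
	}
	return c.idleFor >= idleTimeout
}

func main() {
	idle := connState{idleFor: 5 * time.Millisecond}
	busy := connState{activeStreams: 1, idleFor: 5 * time.Millisecond}

	fmt.Println(shouldClose(idle, 0))              // true
	fmt.Println(shouldClose(busy, 0))              // false
	fmt.Println(shouldClose(idle, 10*time.Second)) // false
}
```

Under this model, a timeout of 0 means a connection with no streams can be closed the moment it is checked, while any non-zero timeout gives the dialer a grace period.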
However, I searched the Ethereum Lighthouse code and couldn’t find a reference that sets the idle_connection_timeout to a non-zero value. There’s also this PR which seems to discuss this, and from a quick glance, a 0 idle connection timeout seems to be the desired behaviour. This makes me question whether I’m on the right path with the above solution or whether something else needs to be configured differently.
From my understanding above, it looks to me like there is a race condition. The basichost in go-libp2p waits until the identify exchange has completed before handing control back to the user (source). This means that even if I open a stream right after the Connect call has returned, there’s a brief moment where there are no open streams and nothing is being negotiated. During that moment, the remote rust-libp2p peer can prune the connection.
When querying my local bootstrapper (as opposed to the hosted one), I could confirm that it’s sometimes working and sometimes not (which hints at some racy behaviour). I could also make it work consistently when I’m just dialing the peer and immediately open the DHT stream.
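The timing I suspect can be sketched as a small deterministic model in Go. Everything here is an assumption for illustration: idleWindow and canBePruned are hypothetical helpers, the millisecond values are made up, and the real outcome also depends on when the remote happens to evaluate its keep-alive condition.

```go
package main

import (
	"fmt"
	"time"
)

// idleWindow returns how long the connection sits without any stream between
// the end of the identify exchange and the dialer's first stream. Both
// arguments are offsets from the start of the dial. A zero result means the
// stream was opened before identify finished.
func idleWindow(identifyDone, firstStream time.Duration) time.Duration {
	if firstStream <= identifyDone {
		return 0
	}
	return firstStream - identifyDone
}

// canBePruned reports whether the remote is allowed to close the connection
// during that window, given its idle timeout.
func canBePruned(window, idleTimeout time.Duration) bool {
	return window > 0 && window >= idleTimeout
}

func main() {
	ms := time.Millisecond

	// Connect returns after identify (20ms); the stream follows at 22ms.
	fmt.Println(canBePruned(idleWindow(20*ms, 22*ms), 0)) // true

	// The stream is opened while identify is still running.
	fmt.Println(canBePruned(idleWindow(20*ms, 5*ms), 0)) // false

	// Same 2ms gap, but the remote tolerates 10s of idleness.
	fmt.Println(canBePruned(idleWindow(20*ms, 22*ms), 10*time.Second)) // false
}
```

This would match what I observe: opening the DHT stream immediately while dialing closes the window, whereas waiting for Connect to return leaves a gap that a 0 timeout does not tolerate.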
I confirmed that Prysm also uses a libp2p basichost under the hood (source). Does this mean the Prysm ↔ Lighthouse interaction would also be susceptible to the above race condition?
To end with some clear questions:
- Is there really a race condition or am I completely wrong here?
- Should users set a non-zero idle connection timeout? What’s the recommendation?
- If 0 is the recommended value, wouldn’t this make go ↔ rust interop flaky because of the above reasons?
Some notes for experimentation with the Avail bootstrapper:
- There are some requirements that my host needs to satisfy in order to connect:
  - the host must have an allowed agent version like: avail-light-client/bootstrap/0.1.3/rust-client
  - the host must support the DHT protocol: /avail_kad/id/1.0.0-b91746 (regardless of being a DHT client or not (which doesn’t make sense imo but that’s a different discussion))
- For my experimentation I’m using vole.