I am deploying Wasp nodes (Wasp is a program made for the IOTA blockchain) on separate AWS EC2 machines. Everything works when the network consists of a small number of machines. However, when the number of machines is around one hundred (or even fewer), some of the nodes are unable to connect to each other. Every node has the same firewall policies, so it should not be a reachability problem. Wasp uses go-libp2p for its peer connections, and I am getting the following errors:
WARN Peering.peer:15.228.161.235:4000 Failed to send outgoing message, unable to allocate stream, reason=failed to dial 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU:
* [/ip4/15.228.161.235/tcp/4000] failed to negotiate security protocol: EOF
WARN Peering.peer:15.228.161.235:4000 Failed to send outgoing message, unable to allocate stream, reason=failed to dial 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU:
* [/ip4/15.228.161.235/tcp/4000] dial backoff
Since I have been trying to solve this problem for a long time now, I also checked what the swarm logs at debug level, and I see the following:
2023-09-06T10:47:33.429Z DEBUG basichost basic/basic_host.go:739 host 12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59 dialing 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
2023-09-06T10:47:33.429Z DEBUG swarm2 swarm/swarm_dial.go:243 dialing peer {"from": "12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59", "to": "12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU"}
2023-09-06T10:47:33.429Z DEBUG swarm2 swarm/limiter.go:193 [limiter] adding a dial job through limiter: /ip4/15.228.161.235/tcp/4000
2023-09-06T10:47:33.429Z DEBUG swarm2 swarm/limiter.go:161 [limiter] taking FD token: peer: 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU; addr: /ip4/15.228.161.235/tcp/4000; prev consuming: 27
2023-09-06T10:47:33.429Z DEBUG swarm2 swarm/limiter.go:167 [limiter] executing dial; peer: 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU; addr: /ip4/15.228.161.235/tcp/4000; FD consuming: 28; waiting: 0
2023-09-06T10:47:33.429Z DEBUG swarm2 swarm/swarm_dial.go:490 12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59 swarm dialing 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU /ip4/15.228.161.235/tcp/4000
2023-09-06T10:47:33.465Z DEBUG swarm2 swarm/limiter.go:73 [limiter] freeing FD token; waiting: 0; consuming: 92
2023-09-06T10:47:33.465Z DEBUG swarm2 swarm/limiter.go:100 [limiter] freeing peer token; peer 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU; addr: /ip4/15.228.161.235/tcp/4000; active for peer: 1; waiting on peer limit: 0
2023-09-06T10:47:33.465Z DEBUG swarm2 swarm/swarm_dial.go:281 network for 12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59 finished dialing 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
2023-09-06T10:47:33.465Z DEBUG swarm2 swarm/limiter.go:201 [limiter] clearing all peer dials: 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
2023-09-06T10:47:33.582Z DEBUG basichost basic/basic_host.go:739 host 12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59 dialing 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
2023-09-06T10:47:33.582Z DEBUG swarm2 swarm/swarm_dial.go:243 dialing peer {"from": "12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59", "to": "12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU"}
2023-09-06T10:47:33.582Z DEBUG swarm2 swarm/swarm_dial.go:281 network for 12D3KooWDaPpyUadtWoQpr2Kw2v8axCsPjA7LaJoq3nVtrVT2i59 finished dialing 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
2023-09-06T10:47:33.582Z DEBUG swarm2 swarm/limiter.go:201 [limiter] clearing all peer dials: 12D3KooWJj52NW8UbZ2pXK88CQ7FHzvfVE3TVvi8KCyHHakP35ZU
...
Since this problem only happens when a high number of machines is deployed, I was wondering whether it could be caused by timeouts during dials, or whether there is a limit on the number of connections. What steps could I take to try to solve this issue? It is an important one for me.