Nebula libp2p DHT crawler

Hi everyone,

Over the last two weeks I’ve been building a libp2p DHT crawler, and I’d love to discuss some of the results so far. You can find the source code here:

There is also a Grafana dashboard available here: https://nebula.dtrautwein.eu/

Just send me a PM (if that’s possible here on Discourse) and I’m happy to provide login credentials to view the dashboard. In the remainder of this post, I’m just using screenshots.

The crawler runs every 30 minutes. Each crawl starts from the standard DHT bootstrap nodes and then recursively follows all entries in the peers’ k-buckets until all peers have been visited.
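
Conceptually, a crawl is a breadth-first traversal over the peers’ routing tables. Here is a minimal sketch of that idea - fetchClosestPeers is a hypothetical helper standing in for the FIND_NODE requests that enumerate a peer’s k-buckets; the actual implementation is in the repository:

// Sketch only - not the actual implementation.
// Assumed imports: context, github.com/libp2p/go-libp2p-core/host,
// github.com/libp2p/go-libp2p-core/peer
func crawl(ctx context.Context, h host.Host, bootstrap []peer.AddrInfo) map[peer.ID]peer.AddrInfo {
  visited := make(map[peer.ID]peer.AddrInfo)
  queue := append([]peer.AddrInfo{}, bootstrap...)

  for len(queue) > 0 {
    pi := queue[0]
    queue = queue[1:]

    if _, ok := visited[pi.ID]; ok {
      continue
    }
    visited[pi.ID] = pi

    if err := h.Connect(ctx, pi); err != nil {
      continue // undialable peer - the error reason is persisted
    }

    // fetchClosestPeers is hypothetical: it issues FIND_NODE requests to
    // enumerate the entries of the peer's k-buckets.
    neighbors, err := fetchClosestPeers(ctx, h, pi.ID)
    if err != nil {
      continue
    }
    queue = append(queue, neighbors...)
  }
  return visited
}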

The crawler tracks and persists the following information:

  • “Connectable” peers - all peers that we could connect to AND whose DHT entries we could fetch.
    • Their supported protocols
    • Their agent versions
    • Their peer IDs and associated multi-addresses
  • Undialable peers - peers that refused the connection or for which the connection attempt timed out (timeout 60s)
    • Connection error reasons (a sketch of how these could be classified follows after this list)
      • i/o timeout
      • connection refused
      • protocol not supported
      • peer id mismatch
      • no route to host
      • network is unreachable
      • no good addresses
      • context deadline exceeded
      • no public IP address
      • max dial attempts exceeded
      • unknown
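
To illustrate how these reasons could be derived, here is a hedged sketch that classifies a dial error by matching substrings of its error message - the substrings and reason labels are illustrative, Nebula’s actual mapping may differ:

// Illustrative sketch - maps a dial error onto one of the reasons listed above.
// Assumed import: strings
func dialErrorReason(err error) string {
  if err == nil {
    return ""
  }
  msg := err.Error()
  known := []struct{ substr, reason string }{
    {"i/o timeout", "io_timeout"},
    {"connection refused", "connection_refused"},
    {"protocol not supported", "protocol_not_supported"},
    {"peer id mismatch", "peer_id_mismatch"},
    {"no route to host", "no_route_to_host"},
    {"network is unreachable", "network_unreachable"},
    {"no good addresses", "no_good_addresses"},
    {"context deadline exceeded", "context_deadline_exceeded"},
    {"no public IP address", "no_public_ip_address"},
    {"max dial attempts exceeded", "max_dial_attempts_exceeded"},
  }
  for _, k := range known {
    if strings.Contains(msg, k.substr) {
      return k.reason
    }
  }
  return "unknown"
}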

For every peer that the crawler could connect to, it creates a Session entry in the database. This entry includes:

type Session struct {
  // A unique id that identifies a particular session
  ID int
  // The peer ID in the form of Qm... or 12D3...
  PeerID string
  // When was the peer successfully dialed the first time
  FirstSuccessfulDial time.Time
  // When was the most recent successful dial
  LastSuccessfulDial time.Time
  // When should we try to dial the peer again
  NextDialAttempt null.Time
  // When did we notice that this peer is not reachable.
  // This cannot be null because otherwise the unique constraint
  // uq_peer_id_first_failed_dial would not work (nulls are distinct).
  // An unset value corresponds to the timestamp 1970-01-01
  FirstFailedDial time.Time
  // The duration that this peer was online due to multiple subsequent successful dials
  MinDuration null.String
  // The duration from the first successful dial to the point where the peer became unreachable
  MaxDuration null.String
  // Indicates whether this session is finished or not. Equivalent to checking for
  // 1970-01-01 in the first_failed_dial field.
  Finished bool
  // How many subsequent successful dials could we track
  SuccessfulDials int
  // When was this session instance updated the last time
  UpdatedAt time.Time
  // When was this session instance created
  CreatedAt time.Time
}

Then there is a second mode of the crawler, which I call monitor-mode. In this mode, Nebula queries the database every 10 seconds for all sessions that are due to be dialed within the next 10 seconds (based on the NextDialAttempt timestamp) or are already overdue. It then attempts to dial all of these peers using the saved multi-addresses and updates their session instances according to whether they were dialable or not.
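
In rough pseudo-Go, that loop looks like this - loadDueSessions and dialAndUpdate are hypothetical helpers standing in for the actual SQL query and the dial/update logic:

// Rough sketch of monitor-mode - not the actual code.
// Assumed imports: context, database/sql, time, github.com/libp2p/go-libp2p-core/host
func monitor(ctx context.Context, h host.Host, db *sql.DB) {
  ticker := time.NewTicker(10 * time.Second)
  defer ticker.Stop()

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      // Sessions whose next_dial_attempt lies within the next 10s or already in the past.
      sessions, err := loadDueSessions(ctx, db, 10*time.Second) // hypothetical helper
      if err != nil {
        continue
      }
      for _, s := range sessions {
        // Hypothetical helper: dial the saved multi-addresses and update the session row.
        go dialAndUpdate(ctx, h, db, s)
      }
    }
  }
}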

The dial interval increases the longer the session lasts. This is the SQL logic:

next_dial_attempt =
  CASE
    WHEN 1.1 * (NOW() - sessions.first_successful_dial) < '30s'::interval THEN
      NOW() + '30s'::interval
    WHEN 1.1 * (NOW() - sessions.first_successful_dial) > '40m'::interval THEN
      NOW() + '40m'::interval
    ELSE
      NOW() + 1.1 * (NOW() - sessions.first_successful_dial)
  END;

(This is not the actual SQL statement, which can be found here - it is just meant to demonstrate the logic.)

So 30s after the crawler connected to a peer for the first time, it tries to dial it again. If that succeeds, it tries again after 33s, then after 36s, and so on…
Writing this down, it sounds quite frequent - what are your thoughts on the interval?

It is capped at 40m because the crawler runs every 30m and the peer should be found in the crawl (which takes a couple of minutes) anyway. If the crawl didn’t find the peer, the monitoring task will try to reach it again. At this point I’m wondering: if a peer is no longer found in the DHT but is still dialable, should it still be counted as online?

Here are some other configuration settings of the crawler that I would love to get feedback on:

  • Dial timeout: 60s
  • Protocols: /ipfs/kad/1.0.0, /ipfs/kad/2.0.0
    • I saw 2.0.0 somewhere but don’t know if it’s necessary
    • (The crawler also works with /fil/kad/testnetnet/kad/1.0.0)
  • I’m using network.WithForceDirectDial to prevent dial backoffs and handle them myself.
  • The crawler itself does not retry connecting to peers; the monitoring process does - three retries with
    sleepDuration := time.Duration(float64(5*(i+1)) * float64(time.Second)), where i is the retry counter
  • It filters out multi-addresses for which manet.IsPrivateAddr(maddr) is true - is this actually necessary? I’ve seen some logic around choosing the best address. (See the sketch below.)
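
Regarding the last point, the filtering boils down to something like the following - a minimal sketch assuming the go-multiaddr and go-multiaddr/net packages; the actual code lives in the repository:

// Keep only public multi-addresses before persisting/dialing them.
// Assumed imports: ma "github.com/multiformats/go-multiaddr",
// manet "github.com/multiformats/go-multiaddr/net"
func filterPublicMaddrs(maddrs []ma.Multiaddr) []ma.Multiaddr {
  var public []ma.Multiaddr
  for _, maddr := range maddrs {
    if manet.IsPrivateAddr(maddr) {
      continue // drop addresses in private IP ranges
    }
    public = append(public, maddr)
  }
  return public
}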

Now some numbers:

  • A crawl takes roughly 5 minutes. I can get the crawl down to 2 minutes if I decrease the connection timeouts to 10s.

Crawled peers 2021-07-07 16:30 CEST

Agent version distribution (all pre-releases + builds combined)

Agent version distribution II (pre-releases + builds shown separately)

Supported Protocols, where more than 200 nodes support each protocol. There are many more protocols where fewer than 200 nodes support each.

Connection errors of the last crawl (the panel title says “Dial”, but these are connection errors):

Questions:

  • I’m using a single libp2p host to crawl many peers in parallel. Similarly, the monitoring process also uses only a single host. Is there an internal limit on parallel dials that slows down the crawl and could be circumvented by instantiating multiple libp2p hosts?
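
    For reference, spreading the dials over several hosts could look roughly like the sketch below. numHosts and peersToDial are hypothetical variables, and the libp2p.New(ctx) call matches the go-libp2p versions current at the time of writing (newer releases drop the context argument):

    // Hedged sketch: distribute dials round-robin over several libp2p hosts.
    // Assumed imports: github.com/libp2p/go-libp2p,
    // github.com/libp2p/go-libp2p-core/host, github.com/libp2p/go-libp2p-core/peer
    hosts := make([]host.Host, 0, numHosts)
    for i := 0; i < numHosts; i++ {
      h, err := libp2p.New(ctx)
      if err != nil {
        return err
      }
      hosts = append(hosts, h)
    }

    for i, pi := range peersToDial {
      h := hosts[i%numHosts]
      go func(h host.Host, pi peer.AddrInfo) {
        _ = h.Connect(ctx, pi) // error handling omitted in this sketch
      }(h, pi)
    }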

  • As can be seen in the complete agent version distribution, 44% of nodes don’t have an agent version. The logic is basically:

    if err := host.Connect(ctx, peer); err != nil {
      return err
    }
    agent, err := host.Peerstore().Get(peer.ID, "AgentVersion")
    ...
    

    Could it be that the identify exchange hasn’t finished by the time the Connect call returns? I saw that there is a channel blocking the return statement until the identity has been determined (<-h.ids.IdentifyWait(c) in basic_host.go:780), but the comment above this line indicates that it’s not guaranteed:

    // TODO: Consider removing this? On one hand, it's nice because we can
    // assume that things like the agent version are usually set when this
    // returns. On the other hand, we don't _really_ need to wait for this.
    //
    // This is mostly here to preserve existing behavior.
    

    I’m not sure because of the word “usually”.
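
    One way to avoid relying on that behaviour would be to wait for the identify-completed event explicitly before reading the agent version. Here is a hedged sketch assuming the host’s event bus and the event.EvtPeerIdentificationCompleted type from go-libp2p-core; pi stands for the peer.AddrInfo being dialed:

    // Hedged sketch: wait for identify to complete instead of relying on Connect
    // having waited for it. Assumed import: github.com/libp2p/go-libp2p-core/event
    sub, err := host.EventBus().Subscribe(new(event.EvtPeerIdentificationCompleted))
    if err != nil {
      return err
    }
    defer sub.Close()

    if err := host.Connect(ctx, pi); err != nil {
      return err
    }

    for {
      select {
      case <-ctx.Done():
        return ctx.Err()
      case evt := <-sub.Out():
        if e := evt.(event.EvtPeerIdentificationCompleted); e.Peer == pi.ID {
          agent, err := host.Peerstore().Get(pi.ID, "AgentVersion")
          ...
        }
      }
    }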

  • I’ve heard that there are 10 hydra-boosters with 100 heads each, so the crawler should find 1000 peers with the hydra-booster/x.y.z agent version - but it’s only finding 654. Do you have an idea why?

  • The monitoring process receives a lot of connection refused errors. This leads to a single node having many session instances in the database, where the end of the previous session is just a minute (or so) before the beginning of the next one. Could this be due to the frequent connection attempts? At which level could the connection be blocked - firewall, OS, libp2p? Here is an example of such a peer that responds to many connection attempts/dials and then the dial fails intermittently (despite retries):

       id   |                       peer_id                        |     first_successful_dial     |     last_successful_dial      | next_dial_attempt |       first_failed_dial       |  min_duration   |  max_duration   | finished | successful_dials |          updated_at           |          created_at
    --------+------------------------------------------------------+-------------------------------+-------------------------------+-------------------+-------------------------------+-----------------+-----------------+----------+------------------+-------------------------------+-------------------------------
     543353 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 15:42:41.910688+02 | 2021-07-07 15:43:02.02431+02  |                   | 2021-07-07 15:44:46.384017+02 | 00:00:20.113622 | 00:02:19.475972 | t        |                3 | 2021-07-07 15:45:01.38666+02  | 2021-07-07 15:42:41.910688+02
     542483 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 15:37:31.67989+02  | 2021-07-07 15:39:10.85505+02  |                   | 2021-07-07 15:42:05.621077+02 | 00:01:39.17516  | 00:04:48.942734 | t        |                5 | 2021-07-07 15:42:20.622624+02 | 2021-07-07 15:37:31.67989+02
     536184 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 15:31:20.973472+02 | 2021-07-07 15:33:25.823083+02 |                   | 2021-07-07 15:36:54.191171+02 | 00:02:04.849611 | 00:05:48.220279 | t        |                4 | 2021-07-07 15:37:09.193751+02 | 2021-07-07 15:31:20.973472+02
     509114 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 14:17:19.863394+02 | 2021-07-07 14:46:55.404861+02 |                   | 2021-07-07 15:02:16.39978+02  | 00:29:35.541467 | 00:44:56.536386 | t        |                9 | 2021-07-07 15:02:16.39978+02  | 2021-07-07 14:17:19.863394+02
     508304 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 14:08:07.162411+02 | 2021-07-07 14:11:38.02465+02  |                   | 2021-07-07 14:16:43.204443+02 | 00:03:30.862239 | 00:08:51.044115 | t        |                6 | 2021-07-07 14:16:58.206526+02 | 2021-07-07 14:08:07.162411+02
     503840 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 14:02:05.212238+02 | 2021-07-07 14:04:07.819932+02 |                   | 2021-07-07 14:07:30.95417+02  | 00:02:02.607694 | 00:05:40.744301 | t        |                4 | 2021-07-07 14:07:45.956539+02 | 2021-07-07 14:02:05.212238+02
     473914 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 12:36:34.043855+02 | 2021-07-07 13:30:22.003058+02 |                   | 2021-07-07 13:32:11.849455+02 | 00:53:47.959203 | 00:55:37.8056   | t        |               10 | 2021-07-07 13:32:11.849455+02 | 2021-07-07 12:36:34.043855+02
     473032 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 12:34:35.546656+02 | 2021-07-07 12:34:35.546656+02 |                   | 2021-07-07 12:35:57.802342+02 | 00:00:00        | 00:01:37.256853 | t        |                1 | 2021-07-07 12:36:12.803509+02 | 2021-07-07 12:34:35.546656+02
     448752 | 12D3KooWBMLEz6H1rvwfUkbAZ8oFKf6Mc9cjXH4HYYouDLQQ5dnE | 2021-07-07 11:32:19.236026+02 | 2021-07-07 12:02:42.760365+02 |                   | 2021-07-07 12:33:49.411125+02 | 00:30:23.524339 | 01:01:30.175099 | t        |                8 | 2021-07-07 12:33:49.411125+02 | 2021-07-07 11:32:19.236026+02
     ...
    
  • The monitoring process also receives a significant number of peer ID mismatch errors. Could this be due to restarts of libp2p hosts that keep the same address but now have a newly generated peer ID?

I hope you find these numbers interesting and can help me out with my questions.

Cheers,
Dennis


Just another update: I’ve also started crawling the Filecoin network. Surprisingly, I needed to use the DHT protocol /fil/kad/testnetnet/kad/1.0.0 - this seems wrong :confused: I used this list of bootstrap nodes, and after connecting to them I checked their supported protocols:
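
For reference, pointing a go-libp2p-kad-dht instance at this protocol would look roughly like the following - a sketch assuming the dht.ProtocolPrefix option, with h being the libp2p host; Nebula’s own configuration may differ:

// Sketch: a DHT client speaking /fil/kad/testnetnet/kad/1.0.0.
// Assumed import: dht "github.com/libp2p/go-libp2p-kad-dht"
fildht, err := dht.New(ctx, h,
  dht.Mode(dht.ModeClient),
  dht.ProtocolPrefix("/fil/kad/testnetnet"), // expands to /fil/kad/testnetnet/kad/1.0.0
)
if err != nil {
  return err
}
// fildht can then be used to issue routing queries against the Filecoin DHT.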

Again some numbers:

And to answer one of the questions:

  • The monitoring process receives a lot of connection refused errors. This leads to a single node having many session instances in the database, where the end of the previous session is just a minute (or so) before the beginning of the next one. Could this be due to the frequent connection attempts? At which level could the connection be blocked - firewall, OS, libp2p?

This was due to a bug in how multi-addresses were tracked: once a peer ID and its associated multi-addresses were saved, they were never updated - in particular not when a subsequent crawl found the same peer ID with different addresses. The monitoring process was therefore using the old addresses for its dial attempts.

Fixing this increased the number of online peers significantly (the second panel from the left):


Great work, Dennis! Thanks for the great contribution. Here are some thoughts and answers to some of your questions.

I don’t think so, but it would certainly be worth checking. I’m wondering whether the results of the crawl would be different in this case, e.g., more peers found - most likely not, but it would be worth double-checking.

That’s correct. PL operates 10 hydra nodes with 100 heads each. I don’t know the exact answer to why you’re finding fewer, but it does happen that some hydra nodes go (temporarily) offline, or that the heads of an online node go down temporarily.

Do you see the same Peer ID with different IP addresses, or different Peer IDs from the same IP address? The latter is a bit weird, as when you restart the IPFS daemon you get the same Peer ID that you had before. The former is possible though, as when you restart the IPFS daemon your machine (especially in home environments) might well have taken a new IP address.

I hope this helps!
Yiannis.

It seems your crawler finds far fewer peers than it should (roughly half as many).

I run https://github.com/wiberlin/ipfs-crawler from time to time and usually see a bit fewer than 11k peers online; this could explain the hydra-booster discrepancy too (and I actually suspect ipfs-crawler misses some peers as well - tests with canary peers yield varying levels of success).

Hi @Jorropo,

I’ve questioned my numbers against the wiberlin/ipfs-crawler as well, and I found that both crawlers find a similar number of peers. Here’s a screenshot of their dashboard with the most recent numbers:


Source

So, at 8am CEST there were ~13700 peers of which ~7650 were found to be online. My numbers for the same time period:

At 8am CEST there were ~13300 peers, of which ~7300 were found to be online. A little less, but not close to half. Since my crawler does not retry connecting to peers during the crawl, I could imagine that the missing peers would be found if it did - it would be worth checking whether the wiberlin/ipfs-crawler retries. I’m open to alternative explanations, though :slight_smile:

Regarding the hydra-boosters: they’re also only finding ~560 hydra-booster nodes. Here’s another screenshot:

My most recent number (not 8am) of hydra-booster nodes is 515.

Hi from the other crawl-team (:

At our Weizenbaum-crawler we use multiple instances of libp2p hosts because it sped up the crawls significantly for the old DHT, which had a lot more undialable nodes. That has been significantly alleviated since go-ipfs v0.5 and subsequent versions with their DHT rework.
We also take about 4 minutes per crawl on average, though >95% of nodes are found within the first 2 minutes – similar to what you observed by decreasing the connection timeout to 10s.

We’ve run into libp2p-internal limits on the number of connections opened, namely the swarm file descriptor limit, which we disable through the respective environment variable.

Regarding your monitoring interval, my opinion would be that once every second (if I understood correctly) is probably overly wasteful of resources for the information it gives. In my crawls in Feb. '21, 75% of sessions were longer than 5 minutes – a huge increase since our paper in Nov. '19. So I’m unsure whether second-level accuracy really gives a lot more insight?


Hi from the other crawl-team (:

Hi there :slight_smile:

At our Weizenbaum-crawler we use multiple instances of libp2p hosts because it sped up the crawls significantly for the old DHT, which had a lot more undialable nodes. That has been significantly alleviated since go-ipfs v0.5 and subsequent versions with their DHT rework.

Do you have instances hosted in different regions? I’ve read/heard somewhere that you have a crawler in the US and in Europe. The “Periodic Measurements of the IPFS Network” page only mentions the one in Europe.

Regarding your monitoring interval, my opinion would be that once every second (if I understood correctly) is probably overly wasteful of resources for the information it gives. In my crawls in Feb. '21, 75% of sessions were longer than 5 minutes – a huge increase since our paper in Nov. '19. So I’m unsure whether second-level accuracy really gives a lot more insight?

The minimum interval is 30s, and from there it’s half the observed uptime, capped at a maximum interval of 15m. I agree that once every second would be overly wasteful!

Thanks for the numbers - now I have something to compare mine against! :slight_smile:

Do you have instances hosted in different regions? I’ve read/heard somewhere that you have a crawler in the US and in Europe. The “Periodic Measurements of the IPFS Network” page only mentions the one in Europe.

Ahh, no - the two monitoring nodes (one in the US, one in Germany) are from our other IPFS measurement setup, where we track BitSwap requests (preprint of the paper). The crawler is currently only running in Germany.