Notes on Crawling Instances on the Known Fediverse

One of my favourite features of the Fediverse is how different instances (whether they run Mastodon, Lemmy or Pixelfed) communicate with each other using the ActivityPub protocol. This is a fundamental feature of decentralised social media.

For instance (pun not intended), users of mastodon.social can communicate with mastodonapp.uk, and mastodonapp.uk can communicate with pixelfed.social. The result is a massive web of interconnected instances.

To get a bigger picture of how this works, I decided to explore further by building my own crawler, which traverses the instances one by one until there is nothing left to explore.

How does this work?

Most instances in the Fediverse expose Mastodon’s /api/v1/instance/peers JSON API endpoint, which returns the list of instances (known as peers) that an instance is aware of. For mastodon.social, that’s https://mastodon.social/api/v1/instance/peers.

In its raw form, the response is simply a JSON array of domain names.
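Here’s a minimal Go sketch of fetching and decoding it – fetchPeers is my own naming for illustration, not code from the gist linked below:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchPeers asks an instance for the list of domains it federates
// with. The endpoint returns a bare JSON array of strings, so it
// decodes straight into a []string.
func fetchPeers(domain string) ([]string, error) {
	resp, err := http.Get("https://" + domain + "/api/v1/instance/peers")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var peers []string
	if err := json.NewDecoder(resp.Body).Decode(&peers); err != nil {
		return nil, err
	}
	return peers, nil
}

func main() {
	peers, err := fetchPeers("mastodon.social")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(peers), "peers found")
	if len(peers) > 0 {
		fmt.Println("first peer:", peers[0])
	}
}
```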

Algorithm

The algorithm used for this crawler makes use of two basic computer-science concepts: a queue and a breadth-first search (BFS).

The basic algorithm to build a network of instances is as follows:

  1. Create an empty graph
  2. Start with a seed instance
  3. Add it to the queue
  4. Pop an instance off the queue
  5. Get the list of that instance’s peers
  6. Add an edge between the instance and each peer to the graph
  7. Add the peers to the queue (if not already processed or queued)
  8. Flag the instance as processed
  9. Repeat from Step 4 until the queue is empty
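Translated into a minimal Go sketch (my own simplification of the approach – the real code is in the gist linked below), with fetch standing in for the peers call above:

```go
package main

import "fmt"

// crawl performs the breadth-first traversal described above. fetch
// stands in for the fetchPeers helper sketched earlier; in a real run
// it would hit /api/v1/instance/peers over the network.
func crawl(seed string, fetch func(string) []string) map[string][]string {
	graph := make(map[string][]string)  // step 1: empty graph (adjacency list)
	seen := map[string]bool{seed: true} // step 2: start with a seed instance
	queue := []string{seed}             // step 3: add it to the queue

	for len(queue) > 0 { // step 9: repeat until the queue is empty
		instance := queue[0] // step 4: pop an instance off the queue
		queue = queue[1:]

		for _, peer := range fetch(instance) { // step 5: get its peers
			graph[instance] = append(graph[instance], peer) // step 6: record the edge
			if !seen[peer] {
				seen[peer] = true           // step 8: flag as processed/queued
				queue = append(queue, peer) // step 7: enqueue unseen peers
			}
		}
	}
	return graph
}

func main() {
	// A tiny in-memory stand-in for the Fediverse, just to show the traversal.
	fake := map[string][]string{
		"a.example": {"b.example", "c.example"},
		"b.example": {"a.example"},
	}
	fmt.Println(crawl("a.example", func(d string) []string { return fake[d] }))
}
```

The seen set is what keeps the traversal from looping forever on such a densely connected graph – each instance is fetched exactly once.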

Implementation

My first thought was to use Python, as it’s my daily driver for data analytics; however, I eventually settled on Go (also known as Golang).

Why Go? Well, being a compiled language it’s much faster, and it supports native concurrency in the form of goroutines – plus, I’ve always wanted to put Go to use for something.

If you want to see the code for yourself, here is a link to my public gist on GitHub. Feel free to critique it as I’m sure there is a lot more that could be done to improve it – I’m still learning after all 🙂

Results

On a single thread it took 690m40.855s (that’s over 11 hours) to process 172,458 instances with 93,127,216 connections!

That’s quite a lot, and almost impossible for Gephi to visualise – even on my MacBook Pro with 64GB of RAM and a 2.3 GHz 8-core Intel Core i9!

So, to make things a little easier to manage, I scaled things down to only include instances that had peers – in other words, removing instances which have no outgoing connections. This way, we are only focusing on the ‘core’ instances, if that makes sense.
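A simple filter over the adjacency list is all this takes – again a sketch with my own naming, not the gist’s code:

```go
package main

import "fmt"

// core drops every instance with no outgoing connections, keeping
// only the nodes that reported at least one peer.
func core(graph map[string][]string) map[string][]string {
	out := make(map[string][]string, len(graph))
	for instance, peers := range graph {
		if len(peers) > 0 {
			out[instance] = peers
		}
	}
	return out
}

func main() {
	g := map[string][]string{
		"a.example": {"b.example"},
		"b.example": {}, // no outgoing connections: dropped
	}
	fmt.Println(core(g)) // map[a.example:[b.example]]
}
```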

This reduced things down to 13,067 instances with 45,052,520 connections.

Unfortunately, this was still too large for Gephi to visualise, so I had to skip this stage completely.

To give you an idea of how interconnected the graph is, it has a diameter (the maximum distance between any pair of vertices) of 4 and an average degree (the number of connections per node) of 4156.87 (s.d. 3116.59). Put another way, the average instance is connected to roughly 4,157 known peers.
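In case it’s useful, here’s how those degree figures could be computed straight from the crawler’s adjacency list – a sketch under the same assumptions as before (the map[string][]string graph is my representation, not necessarily the gist’s):

```go
package main

import (
	"fmt"
	"math"
)

// degreeStats returns the mean and (population) standard deviation
// of the out-degree (number of peers) across all instances.
func degreeStats(graph map[string][]string) (mean, sd float64) {
	n := float64(len(graph))
	for _, peers := range graph {
		mean += float64(len(peers))
	}
	mean /= n

	for _, peers := range graph {
		d := float64(len(peers)) - mean
		sd += d * d
	}
	return mean, math.Sqrt(sd / n)
}

func main() {
	g := map[string][]string{
		"a.example": {"b.example", "c.example"},
		"b.example": {"a.example"},
	}
	mean, sd := degreeStats(g)
	fmt.Printf("average degree %.2f (s.d. %.2f)\n", mean, sd)
}
```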

Conclusions

Given the size of the network and the somewhat limited computational resources I have, I was unable to actually see the results for myself. Gephi simply couldn’t manage loading the graph – even after repeatedly increasing the memory available to it.

Both the diameter and the average degree really put the size and scale of the Fediverse into context – still quite impressive, even if I couldn’t visualise it.

It’s also worth mentioning that this solution does not make use of goroutines. I’m well aware that if it did, things would speed up drastically. That’s something for another day, once I’ve fully got my head around how they work.
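For what it’s worth, here’s a rough sketch – entirely hypothetical, not the gist’s code – of how a worker pool of goroutines could parallelise the crawl, fetching each BFS frontier concurrently while the queue and seen set stay on the main goroutine:

```go
package main

import (
	"fmt"
	"sync"
)

type result struct {
	instance string
	peers    []string
}

// crawlConcurrent fetches one BFS frontier at a time, querying every
// instance in that frontier in parallel with a bounded pool of
// workers. fetch stands in for the fetchPeers helper sketched earlier.
func crawlConcurrent(seed string, fetch func(string) []string, workers int) map[string][]string {
	graph := make(map[string][]string)
	seen := map[string]bool{seed: true}
	frontier := []string{seed}

	for len(frontier) > 0 {
		jobs := make(chan string)
		results := make(chan result)

		var wg sync.WaitGroup
		for i := 0; i < workers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for instance := range jobs {
					results <- result{instance, fetch(instance)}
				}
			}()
		}
		go func() {
			for _, instance := range frontier {
				jobs <- instance
			}
			close(jobs)
			wg.Wait()
			close(results)
		}()

		var next []string
		for r := range results {
			graph[r.instance] = r.peers
			for _, peer := range r.peers {
				if !seen[peer] {
					seen[peer] = true
					next = append(next, peer)
				}
			}
		}
		frontier = next
	}
	return graph
}

func main() {
	fake := map[string][]string{
		"a.example": {"b.example", "c.example"},
		"b.example": {"a.example", "c.example"},
	}
	fmt.Println(crawlConcurrent("a.example", func(d string) []string { return fake[d] }, 4))
}
```

Keeping the bookkeeping single-threaded sidesteps any need for mutexes; only the network fetches run in parallel.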