How to Scrape Mastodon Timelines Using Python and Pandas

Over the past few months, Mastodon, the federated microblogging alternative to Twitter, has gained a lot of traction in light of the events surrounding Elon Musk’s purchase of Twitter. Upon further inspection, it turns out that most instances have a public-facing REST API that allows users to interact with the service through third-party software.

This got me thinking. How easy would it be to scrape a timeline of toots (tweets) if I needed to analyse the data for a project? As it turns out, it is very easy.

As shown in a previous post, interacting with REST APIs in Python is ridiculously easy: it takes very few lines of code and can be used to collect data from publicly available sources.

The code from that previous blog post can easily be adapted to scrape different types of timeline on Mastodon.

Disclaimer: The code in this blog post is for demonstration and research purposes only. Do not use it with malicious intent! Please respect the right to privacy and anonymise any personal data where possible – including usernames!

Basic Algorithm

To get started, you’ll need to import the json, requests and pandas libraries.

import json
import requests
import pandas as pd
...

Once imported, we’ll need to set the URL to access the REST API of a public instance. In my case, I’ll use mastodon.social, but you can use whichever instance you like. In the parameters, set the limit parameter to 40 – the maximum number of toots that can be pulled at once.

...
URL = 'https://mastodon.social/api/v1/timelines/public'
params = {
    'limit': 40
}
...

To ensure that we don’t scrape everything, we’ll need to set a cap on how much we collect. To do this, we will only collect toots from within the most recent hour. We can do this with the help of the pandas Timestamp and DateOffset objects.

We’ll also include a flag, is_end, which will be set to True once we’ve gone past our since date. This will stop us from scraping anything more.

...
since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)
is_end = False
...

We also need a list to store our results in.

...
results = []
...

We can now get onto the main scraping part by creating a loop that works through the toots one chunk at a time. Each iteration makes an API call using the URL and parameters set earlier and parses the JSON response.

...
while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)
    ...

If, for whatever reason, we’ve reached the end of all the toots, we need to make sure that we exit the loop and don’t scrape any more.

    ...
    if len(toots) == 0:
        break
    ...
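One thing worth adding here (it isn’t in the original snippet) is a guard against failed requests – if the instance returns an error, the response body won’t be a list of toots, so we can bail out before trying to loop over it.

    ...
    if r.status_code != 200:
        # the instance returned an error rather than a list of toots
        break
    ...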

Using the toots that we’ve just fetched, we go through them one by one and check whether we’ve gone past our since date. If so, we set the is_end flag to True and exit the loop. Otherwise, each toot is added to the list of results created earlier.

    ...
    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
            
        results.append(t)
    
    if is_end:
        break
    ...

As we can only get 40 toots at a time, we need to page backwards through the API by setting max_id to the ID of the last toot in the current batch. This ensures that on the next iteration of the loop, we get the next chunk of 40 toots.

    ...
    max_id = toots[-1]['id']
    params['max_id'] = max_id
...

As soon as we exit the loop, we can store our results in a pandas DataFrame. What you do from here is entirely up to you.

...
df = pd.DataFrame(results)
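
For a quick sanity check (not something the algorithm itself needs), you can peek at what came back – the columns pandas infers mirror the fields of the Mastodon status JSON, which should include the usual created_at and content fields.

print(df.shape)               # (number of toots collected, number of fields)
print(df.columns.tolist())    # field names taken from the status JSON
print(df[['created_at', 'content']].head())   # first few toots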

That’s it! This algorithm will work back through the toots on a Mastodon instance until the since date has been reached.

Putting this all together, we get the following…

import json
import requests
import pandas as pd

URL = 'https://mastodon.social/api/v1/timelines/public'
params = {
    'limit': 40
}

since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)
is_end = False

results = []

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)

    if len(toots) == 0:
        break
    
    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
            
        results.append(t)
    
    if is_end:
        break
    
    max_id = toots[-1]['id']
    params['max_id'] = max_id
    
df = pd.DataFrame(results)

Hashtag Timeline

If you want to scrape toots that feature a certain hashtag, this can be achieved by changing the URL to the /api/v1/timelines/tag/:hashtag endpoint, substituting in the hashtag you’re after.

For example, if you wanted to search all toots with the hashtag #coffee, you would set the URL to the following…

hashtag = 'coffee'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'

Much like before, putting this all together will give you…

import json
import requests
import pandas as pd

hashtag = 'coffee'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'
params = {
    'limit': 40
}

since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)
is_end = False

results = []

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)

    if len(toots) == 0:
        break
    
    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
            
        results.append(t)
    
    if is_end:
        break
    
    max_id = toots[-1]['id']
    params['max_id'] = max_id
    
df = pd.DataFrame(results)

User Timeline

To scrape a user timeline (all toots published by a specific user), we’ll need to make a few modifications, as this process is a little more involved and requires multiple API calls.

To scrape the toots of a given user, we first need to find their unique user ID. This involves making an API call to /api/v1/accounts/lookup, where the acct parameter identifies the user by their WebFinger account URI – that is, the username and the instance they are active on, formatted as @[USERNAME]@[INSTANCE_URL].

For example, my acct would be @[email protected]

To make this easier, I’ve made a separate function for performing a user lookup and returning the data.

def user_lookup(acct):
    URL = 'https://mastodon.social/api/v1/accounts/lookup'
    params = {
        'acct': acct
    }

    r = requests.get(URL, params=params)
    user = json.loads(r.text)
    
    return user

With that done, we can now access the user ID with the id key. Accessing a user’s toots can be done with the following…

user = user_lookup(acct='@[email protected]')
user_id = user['id']

URL = f'https://mastodon.social/api/v1/accounts/{user_id}/statuses'
params = {
    'limit': 40
}

Same as before, putting this together with the main algorithm (and the user_lookup function from above) gives us the following – except on this occasion, we don’t set a since date, so the loop simply keeps going until it runs out of toots…

import json
import requests
import pandas as pd

def user_lookup(acct):
    URL = 'https://mastodon.social/api/v1/accounts/lookup'
    params = {
        'acct': acct
    }

    r = requests.get(URL, params=params)
    user = json.loads(r.text)

    return user

user = user_lookup(acct='@[email protected]')
user_id = user['id']

URL = f'https://mastodon.social/api/v1/accounts/{user_id}/statuses'
params = {
    'limit': 40
}

results = []

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)

    if len(toots) == 0:
        break
    
    results.extend(toots)
    
    max_id = toots[-1]['id']
    params['max_id'] = max_id
    
df = pd.DataFrame(results)
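
As a side note: if you only want a user’s original posts, the account statuses endpoint also accepts exclude_replies and exclude_reblogs parameters (worth double-checking against the API documentation of your instance). Adding them to the parameters is a minimal change…

params = {
    'limit': 40,
    'exclude_replies': True,    # skip replies to other users
    'exclude_reblogs': True     # skip boosts of other people's toots
}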

Conclusions

Overall, scraping toots from the Mastodon API is really easy to do. As we learned before, making REST API calls from Python takes very few lines of code. This blog post walks you through the basics of scraping toots from the API in chunks, and the code can easily be adapted to suit your needs. With the help of pandas, the data can be manipulated and exported to any format of your choice.
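
For example – keeping in mind the disclaimer above about anonymising usernames – a minimal sketch of trimming the data down to a couple of columns, replacing the account handle with a one-way hash and exporting to CSV could look something like this (created_at, content and account are standard fields of the status JSON)…

import hashlib

# keep only the fields we care about
df_out = df[['created_at', 'content']].copy()

# replace the account handle with a one-way hash to anonymise it
df_out['account'] = df['account'].apply(
    lambda a: hashlib.sha256(a['acct'].encode()).hexdigest()
)

df_out.to_csv('toots.csv', index=False)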

Happy scraping!