How to Scrape Mastodon Timelines Using Python and Pandas
Over the past few months, Mastodon, the federated microblogging alternative to Twitter, has gained a lot of traction in light of the events surrounding Elon Musk’s purchase of Twitter. Upon closer inspection, it turns out that most instances have a public-facing REST API that allows users to interact with the service through third-party software.
This got me thinking. How easy would it be to scrape a timeline of toots (tweets) if I needed to analyse the data for a project? As it turns out, it is very easy.
As shown in a previous post, interacting with REST APIs in Python is ridiculously easy: it takes very few lines of code and can be used to collect data from publicly available sources. We can easily adapt the code from that post to scrape different types of timeline on Mastodon.
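As a quick refresher, here’s a minimal sketch of such a call – it uses Mastodon’s public instance-information endpoint purely as an example, so don’t read it as part of the scraper itself.
import requests

# One call, one parse – that's all a basic GET request takes
r = requests.get('https://mastodon.social/api/v1/instance')
data = r.json()
print(data['title'])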
Disclaimer: The code in this blog post is to be used for demonstration and research purposes only. Do not use this code with malicious intent! Please respect the right to privacy and anonymise any personal data where possible – including usernames!
Basic Algorithm
To get started, you’ll need to import the json, requests and pandas libraries.
import json
import requests
import pandas as pd
...
Once imported, we’ll need to set the URL to access the REST API of any public instance. In my case, I’ll use mastodon.social, but you can use whatever instance you like. In the parameters, set the limit parameter to 40 – the maximum number of toots that can be pulled in a single request.
...
URL = 'https://mastodon.social/api/v1/timelines/public'
params = {
    'limit': 40
}
...
To ensure that we don’t scrape everything, we’ll need to set a cap on how much we collect. To do this, we will only collect toots within the most recent hour. We can do this with the help of the pandas Timestamp and DateOffset functions.
We’ll also include a flag, is_end, which will be set to True once we’ve gone past our since date. This will stop us from scraping anything more.
...
since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)  # one hour ago
is_end = False
...
We also need a list to store our results in.
...
results = []
...
We can now get onto the main scraping part by creating a loop that works through the toots one chunk at a time. Each iteration makes an API call using the URL and parameters set earlier and parses the response as a JSON object.
...
while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)
...
If, for whatever reason, we reach the end of all the toots, we need to make sure that we exit the loop and don’t scrape any more.
...
    if len(toots) == 0:
        break
...
Using the toots that we’ve already got, we need to go through them one by one and check whether we’ve exceeded our since date. If so, we will set the is_end flag to True and exit the loop. Each toot will be added to the results list set earlier.
...
    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
        results.append(t)
    if is_end:
        break
...
As we can only get 40 toots at a time, we need to paginate backwards through the API by setting the max_id parameter to the ID of the last toot in the current set. This ensures that the next iteration of the loop fetches the next chunk of 40 toots.
...
    max_id = toots[-1]['id']
    params['max_id'] = max_id
...
As soon as we exit the loop, we can store our results in a pandas DataFrame. What you do from here is entirely up to you.
...
df = pd.DataFrame(results)
That is it! This algorithm will successfully work through the toots on a Mastodon instance until the since date has been reached.
Putting this all together will look like the following…
import json
import requests
import pandas as pd

URL = 'https://mastodon.social/api/v1/timelines/public'
params = {
    'limit': 40
}
since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)  # one hour ago
is_end = False
results = []

while True:
    # Fetch the next chunk of up to 40 toots
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)

    # Stop if there is nothing left to fetch
    if len(toots) == 0:
        break

    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        # Stop once we reach toots older than our since date
        if timestamp <= since:
            is_end = True
            break
        results.append(t)

    if is_end:
        break

    # Page backwards: ask for toots older than the last one we received
    max_id = toots[-1]['id']
    params['max_id'] = max_id

df = pd.DataFrame(results)
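To give one concrete (and entirely optional) example of what you might do from here – and in keeping with the disclaimer above – the sketch below keeps only a few columns and pseudonymises the author before exporting. The created_at, content and account fields come from the Mastodon status schema; account is a nested object whose acct key holds the user’s handle.
import hashlib

# Keep only the fields needed for analysis
subset = df[['created_at', 'content', 'account']].copy()

# Swap the nested account object for a pseudonymous identifier
subset['account'] = subset['account'].apply(
    lambda a: hashlib.sha256(a['acct'].encode()).hexdigest()[:12]
)

subset.to_csv('toots.csv', index=False)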
Hashtag Timeline
If you want to scrape toots which feature a certain hashtag, this can be achieved by changing the URL to the /api/v1/timelines/tag/:hashtag endpoint, with :hashtag replaced by the hashtag you’re after. For example, if you wanted to collect all toots with the hashtag #coffee, you would set the URL to the following…
hashtag = 'coffee'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'
Much like before, putting this all together will give you…
import json
import requests
import pandas as pd

hashtag = 'coffee'
URL = f'https://mastodon.social/api/v1/timelines/tag/{hashtag}'
params = {
    'limit': 40
}
since = pd.Timestamp('now', tz='utc') - pd.DateOffset(hours=1)  # one hour ago
is_end = False
results = []

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)
    if len(toots) == 0:
        break
    for t in toots:
        timestamp = pd.Timestamp(t['created_at'], tz='utc')
        if timestamp <= since:
            is_end = True
            break
        results.append(t)
    if is_end:
        break
    max_id = toots[-1]['id']
    params['max_id'] = max_id

df = pd.DataFrame(results)
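As a quick sanity check on the result – assuming the toots carry the optional language field from the Mastodon status schema – you could see how many #coffee toots were collected and which languages they were posted in:
print(len(df), 'toots collected')
print(df['language'].value_counts())  # toots without a declared language are skipped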
User Timeline
In order to scrape a user timeline (all toots published by a specific user), we’ll need to make a few modifications, as this process is a little more involved and requires multiple API calls.
In order to scrape the toots of a given user, we first need to find their unique user ID. This involves making an API call to /api/v1/accounts/lookup, where the acct parameter identifies the user by their Webfinger account URI – that is, their username together with the instance they are active on, formatted as @[USERNAME]@[INSTANCE_URL]. For example, my acct would be my username followed by @mastodon.social.
To make this easier, I’ve made a separate function for performing a user lookup and returning the data.
def user_lookup(acct):
    URL = 'https://mastodon.social/api/v1/accounts/lookup'
    params = {
        'acct': acct
    }
    r = requests.get(URL, params=params)
    user = json.loads(r.text)
    return user
With that done, we can now access the user ID with the id key. Accessing a user’s toots can be done with the following…
# Replace the placeholder with the Webfinger handle of the account you want
user = user_lookup(acct='@[USERNAME]@[INSTANCE_URL]')
user_id = user['id']

URL = f'https://mastodon.social/api/v1/accounts/{user_id}/statuses'
params = {
    'limit': 40
}
Same as before, putting this together with the main algorithm (and the user_lookup function from above) gives us the following – except on this occasion, we don’t have to set a since date…
import json
import requests
import pandas as pd

def user_lookup(acct):
    URL = 'https://mastodon.social/api/v1/accounts/lookup'
    params = {
        'acct': acct
    }
    r = requests.get(URL, params=params)
    user = json.loads(r.text)
    return user

# Replace the placeholder with the Webfinger handle of the account you want
user = user_lookup(acct='@[USERNAME]@[INSTANCE_URL]')
user_id = user['id']

URL = f'https://mastodon.social/api/v1/accounts/{user_id}/statuses'
params = {
    'limit': 40
}
results = []

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)
    if len(toots) == 0:
        break
    results.extend(toots)
    max_id = toots[-1]['id']
    params['max_id'] = max_id

df = pd.DataFrame(results)
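One caveat before wrapping up: a user timeline can stretch back years, and the loop above will page through it as fast as the server responds. Mastodon instances typically rate-limit API clients, so if you adapt this for bigger scrapes, a short pause between requests is a courteous safeguard – here’s a sketch with an arbitrary one-second delay:
import time

while True:
    r = requests.get(URL, params=params)
    toots = json.loads(r.text)
    if len(toots) == 0:
        break
    results.extend(toots)
    params['max_id'] = toots[-1]['id']
    time.sleep(1)  # be polite to the instance; tune as needed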
Conclusions
Overall, scraping toots from the Mastodon API is really easy. As we learned before, making REST API calls from Python takes very few lines of code. This blog post has walked you through the basics of scraping toots from the API in chunks, and the approach can easily be adapted to suit your needs. With the help of pandas, the data can be manipulated and exported to any format of your choice.
Happy scraping!