How to Scrape a Twitter Timeline
In a separate blog post, we covered the basics of streaming tweets from Twitter using Python. That approach is ideal if you want to collect tweets in real time as they are created, but not if you want to scrape tweets posted in the past.
In this post, we cover the basics of scraping tweets from a user’s timeline (that is, tweets created by a single user) using tweepy, a Python wrapper for interacting with the Twitter API.
Getting Started
Before starting, you need to generate your API keys from Twitter and to download and install tweepy. Generating Twitter API keys is fairly straightforward and the tutorial can be found here. As for installing tweepy…
pip install tweepy
Setting up the API
Once installed, you will need to load in tweepy and fill in the API keys where appropriate. Note: pandas has been included to help with exporting and modelling the data.
# Load in packages
import tweepy
import pandas as pd
# Set API keys
auth = tweepy.OAuthHandler('[TWITTER-APP-KEY]', '[TWITTER-APP-SECRET]')
auth.set_access_token('[TWITTER-OAUTH-TOKEN]', '[TWITTER-OAUTH-TOKEN-SECRET]')
api = tweepy.API(auth, wait_on_rate_limit=True)
...
As we initialise the API, it is important to set the wait_on_rate_limit parameter to True, as this ensures that we don’t run into errors if we exceed the API’s rate limit. Instead, once we reach the limit, the program will simply wait until it can continue.
Collecting Tweets
Once the API has been configured, we will be able to start scraping tweets from a user’s timeline. Because the API can only retrieve a total of 200 tweets at a time, we need to introduce an infinite loop and process tweets in chunks.
Thankfully, we can keep track of where we are by using the max_id parameter, which tells the API to only retrieve tweets with IDs at or below the given value. This ensures that we scrape the user’s entire timeline and get as much coverage as possible.
For each iteration, all the tweets in the chunk (stored as JSON objects) are appended to a global list. Once we have reached the end of the timeline, the API will return nothing, which in turn breaks us out of the infinite loop thanks to the if len(ts) == 0 line.
...
# Scrape the timeline
username = "[USERNAME]"
tweets = []
last_id = None
while True:
    try:
        ts = api.user_timeline(screen_name=username, count=200, max_id=last_id)
    except (tweepy.errors.Unauthorized, tweepy.errors.NotFound):
        # Protected or non-existent account: keep whatever we have so far
        break
    if len(ts) == 0:
        break
    tweets.extend([t._json for t in ts])
    last_id = ts[-1].id - 1
...
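To see the max_id pagination logic in isolation, here is a minimal sketch that swaps the Twitter API for a hypothetical stand-in (fake_user_timeline is an invention for illustration, not part of tweepy) returning tweet IDs in descending order, just as the real endpoint does:

```python
# Pretend tweet IDs, newest first
ALL_IDS = list(range(1000, 0, -1))

def fake_user_timeline(count=200, max_id=None):
    """Stand-in for api.user_timeline: up to `count` IDs at or below max_id."""
    ids = ALL_IDS if max_id is None else [i for i in ALL_IDS if i <= max_id]
    return ids[:count]

tweets = []
last_id = None
while True:
    ts = fake_user_timeline(count=200, max_id=last_id)
    if len(ts) == 0:
        break
    tweets.extend(ts)
    last_id = ts[-1] - 1  # next chunk starts just below the oldest ID seen

print(len(tweets))  # 1000 – every tweet collected exactly once
```

Because each new chunk starts one below the oldest ID already seen, no tweet is fetched twice and none is skipped.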
Exporting Tweets
Once we have finished scraping all the tweets, we can use the json_normalize function of pandas to convert our list of JSON objects into a tabulated dataframe, which can then be exported to a CSV file using to_csv.
...
df = pd.json_normalize(tweets)
df.to_csv(f'{username}.csv')
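If you want to see what json_normalize does without hitting the API, here is a small sketch using made-up tweet-shaped dicts (the field names mirror the Twitter JSON, but the values are invented):

```python
import pandas as pd

# Two made-up tweet objects with a nested "user" field
tweets = [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice", "followers_count": 10}},
    {"id": 2, "text": "world", "user": {"screen_name": "alice", "followers_count": 10}},
]

df = pd.json_normalize(tweets)
# Nested keys become dotted column names such as 'user.screen_name'
print(sorted(df.columns))
# ['id', 'text', 'user.followers_count', 'user.screen_name']
```

This flattening is what makes the result easy to dump to CSV: every nested field becomes its own column.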
Done!
Final Comments
In this short blog post, we covered the basics of scraping a user’s timeline on Twitter. Thanks to Python and tweepy, tweets can be scraped quickly and easily with very little effort. I’m sure the code featured in this post could be reworked or improved to better suit your needs. For example, as opposed to saving tweets to a simple CSV file, why not save them to a database?