In a separate blog post, we covered the basics of streaming tweets from Twitter using Python. That approach is ideal if you want to collect tweets in real time as they are created, but not if you want to scrape tweets made in the past.
In this post, we cover the basics for learning how to scrape tweets from a user’s timeline (that is, tweets created by a single user) using
tweepy – a Python wrapper for interacting with the Twitter API.
Before starting, you need to generate your API keys from Twitter and have downloaded and installed
tweepy. Generating Twitter API keys is fairly straightforward and a tutorial can be found here. As for installing
tweepy, a single pip command will do:
pip install tweepy
Setting up the API
Once installed, you will need to load in
tweepy and fill in the API keys where appropriate. Note: Pandas has been included to help with exporting and modelling the data.
# Load in packages
import tweepy
import pandas as pd

# Set API keys
auth = tweepy.OAuthHandler('[TWITTER-APP-KEY]', '[TWITTER-APP-SECRET]')
auth.set_access_token('[TWITTER-OAUTH-TOKEN]', '[TWITTER-OAUTH-TOKEN-SECRET]')
api = tweepy.API(auth, wait_on_rate_limit=True)
...
As we initialise the API, it is important to set the
wait_on_rate_limit parameter to
True, as this will ensure that we don't run into any errors if we go over the API's rate limit. Instead, if we reach the limit, the program will simply wait until it resets.
Once the API has been configured, we can start scraping tweets from a user's timeline. Because the API can only retrieve a maximum of 200 tweets per request, we need to introduce an infinite loop and process the tweets in chunks.
Thankfully we can keep track of where we are by using the
max_id parameter which will only retrieve tweets prior to this tweet ID. This will ensure that we are scraping the entire timeline of the user and will give us as much coverage as possible.
For each iteration, all the tweets (stored as JSON objects) in the chunk are appended to a global list. Once we have reached the end of the timeline, the API will return an empty list, which in turn breaks us out of the infinite loop thanks to the
if len(ts) == 0 check.
# Scrape the timeline
username = "[USERNAME]"
tweets = []
last_id = None

while True:
    try:
        # Retrieve the next chunk of (up to) 200 tweets
        ts = api.user_timeline(screen_name=username, count=200, max_id=last_id)
    except tweepy.errors.Unauthorized as e:
        print(f"The user's timeline is protected: {e}")
        break
    except tweepy.errors.NotFound as e:
        print(f"The user could not be found: {e}")
        break

    # An empty chunk means we have reached the end of the timeline
    if len(ts) == 0:
        break

    tweets.extend([t._json for t in ts])

    # Only retrieve tweets older than the last one we have seen
    last_id = ts[-1].id - 1
Once we have finished scraping all the tweets, we can use the
json_normalize function from pandas to convert our list of JSON objects into a tabulated dataframe, which can then be exported to a CSV file using
to_csv:
df = pd.json_normalize(tweets)
df.to_csv('tweets.csv', index=False)
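To illustrate what json_normalize does with nested tweet JSON, here is a quick sketch using two hand-made, tweet-like dictionaries rather than real API output (the field names are illustrative, not the full Twitter schema):

```python
import pandas as pd

# Two minimal dictionaries mimicking the shape of tweet JSON
tweets = [
    {"id": 2, "text": "Hello again", "user": {"screen_name": "alice"}},
    {"id": 1, "text": "Hello world", "user": {"screen_name": "alice"}},
]

# Nested objects are flattened into dotted column names
df = pd.json_normalize(tweets)
print(df.columns.tolist())  # ['id', 'text', 'user.screen_name']
```

The nested user object becomes a flat user.screen_name column, so the resulting dataframe exports cleanly to CSV.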
In this short blog post, we cover the basics for scraping a user’s timeline on Twitter. Thanks to Python and
tweepy, tweets can be scraped quickly and easily with very little effort. I’m sure that the code featured in this post could be reworked or improved to better suit your needs. For example, as opposed to saving tweets to a simple CSV file, why not save them to a database?
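As a rough sketch of the database idea, here is one way it could look using Python's built-in sqlite3 module with hand-made stand-in tweets (the table layout and the choice to store the raw JSON alongside the text are my own assumptions, not anything the code above requires):

```python
import json
import sqlite3

# Hand-made stand-ins for the scraped tweet dictionaries
tweets = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "Hello again"},
]

conn = sqlite3.connect(":memory:")  # swap for a file path such as 'tweets.db'
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT, raw TEXT)"
)
# INSERT OR IGNORE keeps re-runs idempotent: tweets already stored are skipped
conn.executemany(
    "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
    [(t["id"], t["text"], json.dumps(t)) for t in tweets],
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Using the tweet ID as the primary key means you can re-run the scraper periodically without creating duplicate rows.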