How to Hydrate Tweets using Twython
In the previous blog post, we introduced the easy way of collecting tweets using Hydrator. If you don’t know what it means to ‘hydrate’ tweets, go back and read the previous post. 🙂
In this blog post, we will learn how to hydrate tweets the harder way using Twython – a Python wrapper for the Twitter API.
If I’m honest, I don’t like to say the “harder way” – I prefer to say the more challenging way. 🙂
Before we continue, you may wish to familiarise yourself with the theory behind hydrating tweets in the previous blog post, as this will help to clear up any questions you may have.
In this blog post, we will go through the process of hydrating tweets programmatically, step-by-step.
Installing and Importing Twython
To get things going, simply run the following command to install Twython using Python’s pip package manager:
$ pip3 install twython
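If you want to double-check that the installation worked, a quick one-liner will do (this assumes the package exposes its version string, which recent releases of Twython do):
# Quick sanity check that Twython is importable
import twython
print(twython.__version__)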
As usual, we’ll need to import a few things into our environment…
import json
import time
import pandas as pd
from twython import Twython
...
Some of these imports may not make much sense at this point, but you’ll see why they’re needed later on.
Loading in the IDs
Once those have been imported, we need a way to load in the IDs. In my case, I prefer to use pandas as my data-modelling package of choice; with it, we can read the data in from a csv file. Please note that the data you’re using may be saved in a different structure, so you’ll need to factor that in. In my case, I’m using the IEEE COVID-19 Tweets dataset, where the data is stored in two columns (tweet ID and sentiment).
Obviously, you’re free to import whatever file type you want. That’s the beauty of using pandas.
...
df = pd.read_csv('[MY CSV FILE].csv', names=['ID', 'Sentiment'])
...
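One gotcha worth flagging: tweet IDs are 18–19 digit integers, and if they ever pass through a floating-point conversion (a spreadsheet round-trip, or a column containing blanks), the final digits can be silently corrupted. A cautious sketch, reading the IDs in as strings – the column names here are assumptions matching the example above:
# Sketch: read tweet IDs as strings so they can never be mangled by
# a float conversion. Column names match the example above.
df = pd.read_csv('[MY CSV FILE].csv', names=['ID', 'Sentiment'],
                 dtype={'ID': str})
print(df.head())       # sanity-check the first few rows
print(len(df), 'IDs')  # how many tweets we're about to hydrate
Since the hydration loop later converts each ID to a string anyway, reading them as strings up front changes nothing downstream.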
Connecting to the Twitter API
This is where Twython comes in. To use the API, we need to generate a set of API keys from Twitter. Head on over to apps.twitter.com and create an app, which will allow you to obtain your API keys.
I’ll leave you to go ahead and get that information yourself, but once you’ve generated your tokens, insert them into the relevant parts below.
...
TWITTER_AUTH = {
    'app_key': '[YOUR APP KEY]',
    'app_secret': '[YOUR APP SECRET]',
    'oauth_token': '[YOUR OAUTH TOKEN]',
    'oauth_token_secret': '[YOUR OAUTH TOKEN SECRET]'
}

twitter = Twython(app_key=TWITTER_AUTH['app_key'],
                  app_secret=TWITTER_AUTH['app_secret'],
                  oauth_token=TWITTER_AUTH['oauth_token'],
                  oauth_token_secret=TWITTER_AUTH['oauth_token_secret'])
...
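Before firing off hundreds of requests, it’s worth a quick check that the keys actually work. A minimal sketch, assuming the user-context (OAuth 1) tokens set up above – Twython’s verify_credentials call wraps Twitter’s account/verify_credentials endpoint:
from twython import TwythonError

# Sketch: confirm the credentials are valid before doing any real work
try:
    me = twitter.verify_credentials()
    print('Authenticated as:', me['screen_name'])
except TwythonError as e:
    print('Authentication failed:', e)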
Dividing IDs into Chunks
If you recall from the previous blog post, we mentioned that the Twitter API only allows us to make 900 requests within a 15-minute window, with a maximum of 100 IDs per request – so each window can hydrate up to 90,000 tweets. To make the most of our API calls, we need a mechanism for subdividing our long list of IDs into sub-lists of at most 100, and for making sure our requests stay within the 15-minute window.
We can do this with the following code, where we iterate over the list in steps of n, slicing it into sub-lists and returning them as a generator with the yield keyword. In our case, n is our maximum chunk length (100).
...
def divide_chunks(ids, n):
    for i in range(0, len(ids), n):
        yield ids[i:i + n]
...
Dividing our IDs into chunks is as simple as …
...
chunks = list(divide_chunks(df['ID'].values, 100))
total_chunks = len(chunks)
...
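To see what the generator actually produces, here’s a toy run with seven made-up IDs and a chunk size of three – note how the final chunk simply holds the leftovers:
# Toy example with hypothetical IDs
list(divide_chunks([1, 2, 3, 4, 5, 6, 7], 3))
# -> [[1, 2, 3], [4, 5, 6], [7]]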
Getting and Saving Tweets
This is where the bulk of the work is done. To begin, we need to create a file to store our results using the open function, with the mode set to a+, which creates the file if it doesn’t already exist and appends the results row by row.
...
i = 0
with open('[YOUR OUTPUT FILE].json', 'a+', encoding='utf-8') as out_f:
...
For each of the chunks, we use the counter i to check whether we’ve hit the maximum number of requests; if so, we sleep for 15 minutes. Note that the code below sleeps after every 175 requests, staying comfortably inside the 900-per-window limit mentioned earlier.
...
for chunk in chunks:
    # Convert each ID from integer to string
    chunk = [str(id) for id in chunk]
    i += 1
    # Sleep for 15 minutes after every 175 requests
    if i % 175 == 0:
        time.sleep(60 * 15)
...
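A fixed counter works, but the API also reports your remaining budget in its response headers. As an alternative, here’s a hedged sketch using Twython’s get_lastfunction_header to read the rate-limit headers – it is only valid after at least one request has been made:
# Alternative sketch: ask the API how many requests remain in the
# current window instead of counting locally (requires a prior call).
remaining = twitter.get_lastfunction_header('x-rate-limit-remaining')
reset_at = twitter.get_lastfunction_header('x-rate-limit-reset')
if remaining is not None and int(remaining) == 0:
    # Sleep until the window resets, plus a small safety margin
    time.sleep(max(int(reset_at) - int(time.time()), 0) + 5)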
After this, go ahead and fetch the tweets with the twitter.lookup_status function, passing the chunk of up to 100 IDs as a single comma-separated string. If there is an error, wait and retry the request until it succeeds.
...
while True:
    try:
        # Get the tweets
        search_results = twitter.lookup_status(id=','.join(chunk),
                                               map="false",
                                               trim_user="false",
                                               include_entities="true",
                                               tweet_mode="extended")
    except Exception as e:
        # Check for error and attempt again
        print(e)
        sec = 60 * 10
        print(f'Waiting {sec} second(s)...')
        print()
        time.sleep(sec)
        continue
    break
...
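Catching a blanket Exception gets the job done, but Twython also raises a dedicated TwythonRateLimitError when you hit the limit, which carries a retry_after hint taken from Twitter’s Retry-After header. A sketch of a more targeted handler – treating retry_after as optional is an assumption on my part:
from twython import TwythonRateLimitError, TwythonError

# Sketch: handle rate limiting separately from other failures
try:
    search_results = twitter.lookup_status(id=','.join(chunk),
                                           tweet_mode="extended")
except TwythonRateLimitError as e:
    # Fall back to 10 minutes if no Retry-After hint was provided
    time.sleep(int(e.retry_after) if e.retry_after else 60 * 10)
except TwythonError as e:
    print('Request failed:', e)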
Then, for each of the tweets sent back from the API, append it to the file as a JSON object followed by a newline.
...
for tweet in search_results:
    out_f.write(json.dumps(tweet))
    out_f.write('\n')
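Because we write one JSON object per line, the output is effectively a JSON Lines file, which makes it painless to load back in later. A small sketch of reading the hydrated tweets for analysis (the file name is the same placeholder as above):
# Sketch: read the hydrated tweets back, one JSON object per line
tweets = []
with open('[YOUR OUTPUT FILE].json', encoding='utf-8') as in_f:
    for line in in_f:
        tweets.append(json.loads(line))

print(len(tweets), 'tweets hydrated')
# The full text lives in 'full_text' because we requested tweet_mode="extended"
print(tweets[0]['full_text'])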
Putting this all together gives you…
import json
import time
import pandas as pd
from twython import Twython

df = pd.read_csv('[MY CSV FILE].csv', names=['ID', 'Sentiment'])

TWITTER_AUTH = {
    'app_key': '[YOUR APP KEY]',
    'app_secret': '[YOUR APP SECRET]',
    'oauth_token': '[YOUR OAUTH TOKEN]',
    'oauth_token_secret': '[YOUR OAUTH TOKEN SECRET]'
}

twitter = Twython(app_key=TWITTER_AUTH['app_key'],
                  app_secret=TWITTER_AUTH['app_secret'],
                  oauth_token=TWITTER_AUTH['oauth_token'],
                  oauth_token_secret=TWITTER_AUTH['oauth_token_secret'])

def divide_chunks(ids, n):
    for i in range(0, len(ids), n):
        yield ids[i:i + n]

chunks = list(divide_chunks(df['ID'].values, 100))
total_chunks = len(chunks)

i = 0
with open('[YOUR OUTPUT FILE].json', 'a+', encoding='utf-8') as out_f:
    for chunk in chunks:
        # Convert each ID from integer to string
        chunk = [str(id) for id in chunk]
        i += 1
        # Sleep for 15 minutes after every 175 requests
        if i % 175 == 0:
            time.sleep(60 * 15)
        while True:
            try:
                # Get the tweets
                search_results = twitter.lookup_status(id=','.join(chunk),
                                                       map="false",
                                                       trim_user="false",
                                                       include_entities="true",
                                                       tweet_mode="extended")
            except Exception as e:
                # On error, wait 10 minutes and try again
                print(e)
                sec = 60 * 10
                print(f'Waiting {sec} second(s)...')
                print()
                time.sleep(sec)
                continue
            break
        # Append tweets to the output file, one JSON object per line
        for tweet in search_results:
            out_f.write(json.dumps(tweet))
            out_f.write('\n')
        print(f'Chunk {i} of {total_chunks} ({(i/total_chunks)*100:.1f}%)')
And that is it!
Final Thoughts
In this blog post, we covered everything we need to hydrate tweets using the Twitter API via Twython. As you can see, it’s a little more complex than what we did previously, but we learnt a lot in the process.
However you decide to hydrate your tweets is entirely up to you – I can see the benefits of both approaches. If you’re collecting tweets on a separate machine, I would write a Python script using Twython and let it run headlessly on a server, without the need for a GUI.