Extracting Interaction Networks from Twitter using TWINT and Python
EDIT (Jan 2023): It looks like TWINT is no longer being updated, as the GitHub repo has now been archived.
Twitter is a pretty cool platform for many reasons. People can post whatever they’re interested in and allow the entire world to know what they think. Aside from this, one of the most important features of Twitter is the way users interact with other users and hashtags. From the perspective of a data analyst, Twitter is amazing! Never has it been so easy to get tons of data on other people’s social traces and opinions of things. Not only can you collect lots of meaningful data, but you can collect data in such a short space of time too. This makes it particularly ideal for people (with a deadline) who want to get a grasp of a particular concept trending on Twitter.
This is where TWINT comes in.
What is TWINT?
TWINT is an amazing piece of software written in Python which scrapes data from Twitter without using their official API. This has the double benefit of letting you collect data anonymously and without being restricted by their API limits, which means that you can search for historical data and get loads of results. Furthermore, it is incredibly easy to use and allows you to export your search results to multiple formats, including databases such as Elasticsearch and SQLite as well as simple file formats such as CSV and TXT.
Data can be collected in one of two ways. Firstly, tweets can be scraped using the CLI (command line interface), where simple queries and lookups can be performed. Secondly, tweets can be collected programmatically by writing a Python script. This allows for greater flexibility and control. More on this will follow…
Why visualise networks?
Network analysis is ideal for situations like this because Twitter, like other social networks, provides a mechanism for users to interact directly with other users. In the case of Twitter, this can be achieved in multiple ways, but mainly through retweets (sharing another person's tweet) or by mentioning other users in a tweet.
Visualising these interactions will allow us to observe conversational dynamics between users, to get a better picture of who is driving the conversation and to understand how particular concepts and topics go viral. The basic concept is that a directed network (a network whose edges point from one node to another in a single direction) is used to model the flow of interactions between users. Think of it like this: A retweets B, or A mentions B…
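To make the idea of direction concrete, here is a minimal sketch in plain Python, with no extra libraries needed. The user names and the `interactions` list are made up for illustration; the only point is that an edge from A to B says nothing about an edge from B to A.

```python
from collections import defaultdict

# Hypothetical interactions: (source, target) means
# "source retweets or mentions target".
interactions = [
    ("@alice", "@bob"),   # alice retweets bob
    ("@carol", "@bob"),   # carol mentions bob
    ("@bob", "@dave"),    # bob retweets dave
]

# A directed network stored as: user -> set of users they point at.
outgoing = defaultdict(set)
for source, target in interactions:
    outgoing[source].add(target)


def has_edge(source, target):
    """Return True if there is an edge pointing from source to target."""
    return target in outgoing.get(source, set())


print(has_edge("@alice", "@bob"))  # True
print(has_edge("@bob", "@alice"))  # False
```

Later in the post, networkx takes over this bookkeeping, but the underlying structure is the same.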
Collecting Tweets
You may be thinking, “this sounds really cool and interesting, but how do I code this up?”. Well, this section of the blog post is dedicated to the practical implementation in Python. To begin, TWINT needs to be installed. Feel free to check out the official GitHub repository for a complete set of instructions.
For me what I did was…
pip3 install twint
As soon as you’ve got things installed, you can start scraping! To start, you can put together a simple search query to start collecting specific tweets. In my case, I’m interested in looking at tweets containing #COVID19 and the words ‘Delta variant’. I don’t think you need me to explain what these mean 😉
This is what the simple Python script looks like. In my case, I’m collecting tweets (using the search terms provided) posted since 2021-06-01 and exporting them to a CSV file.
import twint
seed_terms = ['#COVID19', '"Delta variant"']
search_terms = ' OR '.join(seed_terms)
print(search_terms)
# Configure
c = twint.Config()
c.Search = search_terms
c.Store_csv = True
c.Since = '2021-06-01'
c.Output = 'covid-tweets.csv'
# Run
twint.run.Search(c)
Depending on the popularity of the topic, it might take a while to collect the data and you may end up with quite a large file. Once the data has been collected, we can now begin to extract the relevant features and convert them into social networks.
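Before moving on, it can be worth a quick sanity check of the export. The sketch below streams through a CSV one row at a time using only the standard library, so a large file never has to fit in memory. The sample rows are invented, and the column names `username`, `tweet` and `created_at` are assumed from the ones used later in this post.

```python
import csv
import io

# Invented sample standing in for covid-tweets.csv
sample_csv = io.StringIO(
    "created_at,username,tweet\n"
    "2021-06-01 10:00:00,alice,RT @bob: Delta variant update\n"
    "2021-06-01 10:05:00,carol,Thoughts on #COVID19 @bob?\n"
    "2021-06-02 09:30:00,alice,More on the Delta variant\n"
)


def summarise(handle):
    """Count tweets and distinct authors without loading the whole file."""
    n_tweets = 0
    authors = set()
    for row in csv.DictReader(handle):
        n_tweets += 1
        authors.add(row["username"])
    return n_tweets, len(authors)


n_tweets, n_authors = summarise(sample_csv)
print(n_tweets, n_authors)  # 3 2
```

To run it on real data, you would pass `open('covid-tweets.csv', newline='', encoding='utf-8')` to `summarise` instead of the in-memory sample.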
Extracting Networks
This is where things start to get fun (well at least for me :)). For this, we will be using pandas to import the CSV and regular expressions to extract the parts of each tweet that contain the linked user.
import networkx as nx
import pandas as pd
import numpy as np
import re
# Read in the tweets
file_in = 'covid-tweets.csv'
df = pd.read_csv(file_in)
# replace NaN's with an empty string
df = df.replace(np.nan, '')
Now that the data has been loaded in, we can start to build our networks. In this case, we are generating retweet and mention networks and exporting them to CSV files.
# create a networkx directed graph for each interaction type
G_retweet = nx.DiGraph()
G_mention = nx.DiGraph()
# loop through each row
for _, row in df.iterrows():
    author = f"@{row['username']}"
    text = row['tweet']
    # skip rows whose timestamp cannot be parsed
    try:
        timestamp = pd.to_datetime(row['created_at'])
    except Exception:
        continue
    # use regular expressions to extract retweets and mentions
    retweets = set(re.findall(r"RT @(\w+)", text))
    mentions = set(re.findall(r"@(\w+)", text))
    # remove retweeted users from the mentions so they are not counted twice
    mentions -= retweets
    # add an edge from the author to each user referenced in the text
    for u in retweets:
        G_retweet.add_edge(author, f'@{u}', Timestamp=timestamp)
    for u in mentions:
        G_mention.add_edge(author, f'@{u}', Timestamp=timestamp)
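To see what the two regular expressions pick up, here is a small standalone sketch run on a made-up tweet. The helper name `extract_interactions` is just a name chosen for this illustration and isn't part of TWINT or networkx.

```python
import re


def extract_interactions(text):
    """Split the users referenced in a tweet into retweeted and mentioned sets."""
    retweets = set(re.findall(r"RT @(\w+)", text))
    mentions = set(re.findall(r"@(\w+)", text))
    # "@(\w+)" also matches the retweeted user, so take them out of the mentions
    mentions -= retweets
    return retweets, mentions


# Made-up example tweet
tweet = "RT @bob: great thread on the Delta variant, thanks @carol and @dave"
retweets, mentions = extract_interactions(tweet)
print(sorted(retweets))  # ['bob']
print(sorted(mentions))  # ['carol', 'dave']
```

Using sets means a user who is mentioned several times in the same tweet still only produces one edge from the author.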
Now that the basic logic of constructing networks is complete, we need to export the data. Thankfully, data can be interchanged easily between networkx graphs and pandas data frames. This is just one way we can export the data.
df_retweet = nx.to_pandas_edgelist(G_retweet)
df_retweet.to_csv('retweet.csv', index=False)
df_mention = nx.to_pandas_edgelist(G_mention)
df_mention.to_csv('mention.csv', index=False)
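As a quick check that the export worked, the edge list can be read back and ranked by how often each user is on the receiving end of an edge. This is a sketch using only the standard library; the rows are invented, and the `source` and `target` column names are the ones `nx.to_pandas_edgelist` writes by default.

```python
import csv
import io
from collections import Counter

# Invented rows in the same shape as retweet.csv
sample_edges = io.StringIO(
    "source,target,Timestamp\n"
    "@alice,@bob,2021-06-01 10:00:00\n"
    "@carol,@bob,2021-06-01 10:05:00\n"
    "@bob,@dave,2021-06-02 09:30:00\n"
)

# In-degree: how many edges point at each account
in_degree = Counter(row["target"] for row in csv.DictReader(sample_edges))

# List accounts from most to least retweeted
for user, count in in_degree.most_common():
    print(user, count)
```

On real data, you would swap the sample for `open('retweet.csv', newline='', encoding='utf-8')`. Accounts at the top of the list are the ones most often retweeted, which is a first hint at who is driving the conversation.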
What’s next?
Now that we’ve got the data, we need to think about how we are going to visualise it. The next blog post looks at doing just this to see if we can find anything interesting.