If you’re a nerd like me, you’ll probably be very familiar with Reddit. Reddit describes itself as “the front page of the internet”, which is certainly true for me and many others. I use it pretty much on a daily basis for anything from tech news digests to niche topics and tech support. There’s a subreddit for anything!
I’ve used Reddit as a data source for many of my research projects mainly due to the wealth of data available and the ease of access using the official API.
This brings me to the most important point. How do I collect the data? Conveniently, there’s a Python package for interacting with the Reddit API called PRAW.
Note: If you don’t fancy programming (which is fine btw), there are third-party services like pushshift.io that archive Reddit data, so you don’t need to put together a script.
PRAW (Python Reddit API Wrapper) is perhaps one of the most used Python packages for scraping data off Reddit. I’ve used it for all my research projects and it suits my needs well.
On a side note, PRAW has full access to the API meaning that it has the ability to both read and write data. This is particularly useful if you’re interested in creating bots for automating tasks like posting submissions and leaving replies. For the purposes of this task, we will just be reading data.
Much like any other Python package, PRAW can be installed with a simple pip install. PRAW is only supported on Python 3.6+, so if you’re using an older version, you will need to update to at least 3.6.
pip install praw
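If you’re not sure which interpreter pip will install into, a quick check like the one below can help. It’s only a sketch: the helper name supports_praw is made up for this post, and the 3.6 minimum is simply the figure quoted above, so it’s worth checking PRAW’s own documentation for the release you install.

```python
import sys

# Minimum version quoted in this post; newer PRAW releases may need more.
MIN_VERSION = (3, 6)


def supports_praw(version_info=None):
    """Return True if the given (major, minor, ...) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    status = "OK" if supports_praw() else "too old for PRAW"
    print(sys.version.split()[0], status)
```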
Getting API Keys
If you’ve ever used an API before, you’ll be familiar with the concept of API keys. In short, API keys allow you to access the API on behalf of your account without sending your password. Reddit handles this through OAuth. This also has the added benefit of allowing you to keep track of what projects you have running and how much data is being collected.
A word of warning: most APIs place limits on the number of requests you can make to their servers. This ensures that people aren’t spamming the network with requests, which could slow things down for others.
Getting API keys from Reddit is really easy. All you need is an account to get going.
Once you have logged into your account at reddit.com, head over to https://www.reddit.com/prefs/apps/ and click on “create another app…” under the “developed applications” subheading. Once that has been clicked, you’ll be presented with a form.
For the name and description, you can put whatever you want. Make sure that the redirect URL points to a valid URL (it can be anything you like) and that “script” is selected.
Once this is done, you’ll be able to see your client ID and client secret. Hold on to these as you’ll need them in the next step.
Connecting to the API
With the newly created client ID and secret, we can now start putting together our Python scraper.
To begin, let’s import the package and load in the API keys (ID and secret).
import praw

# Credentials from https://www.reddit.com/prefs/apps/
CLIENT_ID = "[YOUR ID KEY HERE]"
CLIENT_SECRET = "[YOUR SECRET KEY HERE]"
USER_AGENT = "[YOUR USERNAME HERE]"

# Create the (read-only) connection to the Reddit API
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT,
)
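If you plan to share your script, you may prefer not to paste the keys into the file itself. Here is a minimal sketch that reads them from environment variables instead. The variable names (REDDIT_CLIENT_ID and friends) and the load_credentials helper are my own inventions for this example, not something PRAW requires.

```python
import os

# Hypothetical variable names, chosen for this example only.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage, once the variables are set in your shell:
#   reddit = praw.Reddit(**load_credentials())
```

The function fails early with a list of whatever is missing, which tends to be easier to debug than an authentication error later on.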
If you manage to get this far then congrats. This is the most technical bit and the rest is relatively straightforward.
If this worked successfully then we can move on to the next step – getting information about a subreddit.
Getting a subreddit
Using the initialised API, we can begin to collect data. We’ll start with subreddits: communities on Reddit where users post submissions and leave replies relating to a particular topic.
To test this out, I’m going to use the r/worldnews subreddit. This is as simple as …
subreddit = reddit.subreddit('worldnews')
With this code, we can begin to understand more about the subreddit of interest. We can look up certain features of the subreddit such as the name, ID, description and the date it was created.
name = subreddit.display_name
id_ = subreddit.id
public_description = subreddit.public_description
created_at = subreddit.created  # Unix timestamp (seconds)

print((name, id_, public_description, created_at))
# ('worldnews', '2qh13', 'A place for major news from around the world, excluding US-internal news.', 1201231119.0)
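The creation date comes back as a number rather than a readable date. Here is a small sketch for converting it, assuming the value is seconds since the Unix epoch in UTC (PRAW’s documentation lists a created_utc attribute for this purpose). The to_utc_datetime helper is just a name I’ve picked for illustration.

```python
from datetime import datetime, timezone


def to_utc_datetime(timestamp):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


created_at = 1201231119.0  # value printed in the example above
print(to_utc_datetime(created_at).isoformat())
# 2008-01-25T03:18:39+00:00
```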
Getting subreddit posts
As you may or may not know, subreddit posts (also known as submissions) are ranked in different ways. By default, posts are ordered by ‘hot’ (the current most popular submissions based upon upvotes). Posts can also be ordered by new, rising, controversial, top and gilded.
To get posts ranked by ‘hot’, we can use the following…
for submission in subreddit.hot():
    ...
If you want to collect posts sorted by new, rising, controversial or gilded, just replace the method name with the appropriate rank (e.g. new()). We can also limit the number of results using the limit argument, as in hot(limit=5).
Much like subreddits, we can also extract essential attributes associated with a post, including the link, title, author, score (upvotes minus downvotes) and the date the post was created.
for submission in subreddit.hot(limit=5):
    title = submission.title
    link = submission.url
    author = submission.author_fullname
    score = submission.score
    created = submission.created
    print((title, link, author, score, created))

# ('Hana Horka: Czech singer dies after catching Covid intentionally - BBC News', 'https://www.bbc.com/news/world-europe-60050996', 't2_96e98', 2231, 1642595485.0)
# ...
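For a research project you’ll usually want the posts in a list rather than printed to the screen. The sketch below picks the listing by name and gathers the same attributes into dictionaries. Everything here is illustrative: collect_posts and FakeSubreddit are hypothetical names, the fake class only exists so the example runs without an internet connection, and I’m assuming author_fullname can be absent on some posts (for example, when the account has been deleted).

```python
from types import SimpleNamespace

# Listing names mentioned in this post.
SORTS = {"hot", "new", "rising", "controversial", "top", "gilded"}


def collect_posts(subreddit, sort="hot", limit=5):
    """Return one dict per submission from the chosen listing."""
    if sort not in SORTS:
        raise ValueError(f"Unknown sort: {sort!r}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.new
    rows = []
    for submission in listing(limit=limit):
        rows.append({
            "title": submission.title,
            "link": submission.url,
            # Assumption: the attribute can be absent (e.g. deleted accounts).
            "author": getattr(submission, "author_fullname", None),
            "score": submission.score,
            "created": submission.created,
        })
    return rows


class FakeSubreddit:
    """Stand-in for a PRAW subreddit so the sketch runs offline."""

    def __init__(self, posts):
        self._posts = posts

    def new(self, limit=None):
        return iter(self._posts[:limit])


posts = [
    SimpleNamespace(title="First", url="https://example.com/1", score=10, created=1.0),
    SimpleNamespace(title="Second", url="https://example.com/2", score=3,
                    created=2.0, author_fullname="t2_abc"),
]
rows = collect_posts(FakeSubreddit(posts), sort="new", limit=2)
print(rows)
```

With a real connection, you would pass the subreddit object from earlier in place of FakeSubreddit.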
Hopefully, you should be getting the hang of things now. Posts can be collected from subreddits and comments can be collected from posts. They’re all connected.
Before getting to the code, if you’ve ever used Reddit before you’ll know that there are two kinds of comments: top-level comments and replies. To keep things simple, I’ll focus on top-level comments for now and talk about replies later.
Continuing on from before, we can collect the top-level comments using the submission’s comments attribute like this…
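for submission in subreddit.hot(limit=5):
    # Drop the "load more comments" placeholders, which have no body
    submission.comments.replace_more(limit=0)
    for tl_comment in submission.comments:
        body = tl_comment.body
        # author is None when the account has been deleted
        author = tl_comment.author.name if tl_comment.author else None
        created = tl_comment.created
        score = tl_comment.score
        print((body, author, created, score))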
If you want to get the replies to each of these comments, things get a little more complicated, and that is beyond the scope of this post. On Reddit, users can reply to replies, which produces a “nested” structure of comments. This has a recursive effect where comments can exist on multiple layers.
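To give a feel for what “nested” means here, the toy example below walks a small hand-made comment tree with a recursive function. It deliberately doesn’t use PRAW at all: ToyComment and walk are hypothetical names, and the thread is made up, so treat it as an illustration of the structure rather than a recipe for real Reddit data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ToyComment:
    """Hypothetical stand-in for a comment; not a PRAW class."""
    author: str
    body: str
    replies: List["ToyComment"] = field(default_factory=list)


def walk(comments: List[ToyComment], depth: int = 0) -> Iterator[Tuple[int, ToyComment]]:
    """Yield (depth, comment) pairs, visiting each comment before its replies."""
    for comment in comments:
        yield depth, comment
        yield from walk(comment.replies, depth + 1)


thread = [
    ToyComment("alice", "Top-level comment", [
        ToyComment("bob", "Reply to alice", [
            ToyComment("carol", "Reply to bob"),
        ]),
    ]),
    ToyComment("dave", "Another top-level comment"),
]

# Indent each comment by its depth in the thread
for depth, comment in walk(thread):
    print("  " * depth + f"{comment.author}: {comment.body}")
```

Each layer of replies is handled by the same function calling itself one level deeper, which is the recursive effect described above.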
This blog post provides a very basic overview of how to use PRAW for collecting data off Reddit. I think it is fair to say that we have only just scratched the surface of what you can achieve using PRAW. Hopefully, you can begin to see how this can be used for many network science-related projects.
In the next blog post, we will go through the process of using PRAW to process the entire collection of comments on a post to generate reply networks. In doing so, we can find out who replies to whom and more.