Will They Reply? Analysing the Reply Networks of 32 Programming Language Subreddits
Have you ever used Reddit to learn a programming language? There are many subreddits dedicated to specific programming languages. They are great for finding project ideas, learning new topics and getting inspired. Many (if not most) programming subreddits have an active community of users who are willing to support others who post submissions and open up discussions.
I’ve always wondered whether it is possible to find out which subreddits are better than others in terms of user engagement, and by how much. I don’t know about you, but I would like to know if there is a strong chance that someone is going to reply to my query. This matters because it can help moderators and discussion starters alike understand how effective communities are at engaging with others. As it turns out, I think there is a way, and the answer lies in reply networks.
What are reply networks?
In this blog, we introduced the notion of Reddit reply networks, where simple user-to-user interactions are captured based upon replies between users. The concept is simple: a directed edge is formed between two users when one user replies to the other.
Reply networks can help us understand meaningful connections between users. For example, they provide a window into conversational dynamics such as the following:
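To make the idea concrete, here is a minimal sketch of a reply network. It assumes the networkx library (the post does not say which graph library was used), and the usernames are made up for illustration.

```python
import networkx as nx

# Hypothetical (replier, recipient) pairs: each tuple means the first
# user replied to something written by the second user.
replies = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "bob"),
]

# A directed graph keeps the direction of each reply.
reply_network = nx.DiGraph()
reply_network.add_edges_from(replies)

print(reply_network.number_of_nodes())  # 3
print(reply_network.number_of_edges())  # 3
```

Because the edges are directed, carol replying to bob does not imply that bob replied to carol.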
- Who replies the most? (out degree)
- Who receives the most replies? (in degree)
- Who are the most influential commenters? (node centrality)
- How often do users reply to each other? (reciprocity)
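As a rough sketch, the questions above map onto standard graph measures. The example below assumes networkx and a toy network with made-up usernames. It uses in-degree centrality as a simple stand-in for node centrality, which is my choice for illustration and not necessarily the measure used in the study.

```python
import networkx as nx

# Toy reply network: an edge (u, v) means u replied to v.
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "alice"),
    ("carol", "bob"), ("dave", "bob"),
    ("alice", "carol"),
])

out_deg = dict(G.out_degree())  # replies written by each user
in_deg = dict(G.in_degree())    # replies received by each user

top_replier = max(out_deg, key=out_deg.get)
most_replied_to = max(in_deg, key=in_deg.get)

# Share of the other users who replied to each user.
centrality = nx.in_degree_centrality(G)

# Fraction of edges that are matched by an edge in the opposite direction.
reciprocity = nx.reciprocity(G)

print(top_replier, most_replied_to, reciprocity)
```

Here alice writes the most replies, bob receives the most, and only the alice and bob pair reply to each other.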
The Data
To get things going, the 500 most recent submissions were collected from each of 32 programming-related subreddits. These include Go, Python, Java, JavaScript, C++ and many others.
For each subreddit, a reply network was produced by aggregating all of the comments from its 500 most recent submissions. If you’re interested in how these are created, I put together a blog post explaining how this is done using PRAW.
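The aggregation step can be sketched without touching the Reddit API. The example below assumes networkx and uses plain dictionaries as a hypothetical stand-in for the submissions and comments that PRAW would return; the field names (`id`, `author`, `parent_id`, `comments`) and the `build_reply_network` helper are illustrative, not PRAW's actual interface. Skipping deleted accounts and self-replies is also my assumption.

```python
import networkx as nx

def build_reply_network(submissions):
    """Merge the comments of several submissions into one directed graph."""
    G = nx.DiGraph()
    for submission in submissions:
        # Map every item id (the submission and each comment) to its author.
        authors = {submission["id"]: submission["author"]}
        for comment in submission["comments"]:
            authors[comment["id"]] = comment["author"]
        for comment in submission["comments"]:
            replier = comment["author"]
            recipient = authors.get(comment["parent_id"])
            # Assumption: skip deleted accounts (None) and self-replies.
            if replier is None or recipient is None or replier == recipient:
                continue
            G.add_edge(replier, recipient)
    return G

# Hypothetical data for two submissions.
submissions = [
    {"id": "s1", "author": "alice", "comments": [
        {"id": "c1", "author": "bob", "parent_id": "s1"},
        {"id": "c2", "author": "alice", "parent_id": "c1"},
        {"id": "c3", "author": None, "parent_id": "c1"},
    ]},
    {"id": "s2", "author": "carol", "comments": [
        {"id": "c4", "author": "bob", "parent_id": "s2"},
    ]},
]

G = build_reply_network(submissions)
print(sorted(G.edges()))
```

Users who appear in several submissions become a single node, which is what ties the separate discussions together into one network per subreddit.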
The subreddit reply networks have a combined average of 1185 nodes and 3410 edges. The complete data can be found at the end of the post.
The Results
The reply networks used in this study are quite complex, as they feature many interactions. Below is an example of the r/python subreddit, where nodes are coloured according to modularity and sized according to eigenvector centrality.
To compare the subreddits, three network metrics were used:
- Reciprocity: Helpful users are likely to reply to questions, so two-way (reciprocated) connections are important for discussion.
- Density: Used to understand how well connected the discussion is.
- Transitivity: Small groups of users (triads) are helpful for discerning strong conversations among multiple users.
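For reference, here is how the three metrics could be computed on a toy network. This is a sketch assuming networkx. Computing transitivity on the undirected version of the graph is my assumption, since the post does not say how direction was handled for that metric.

```python
import networkx as nx

# Toy reply network: an edge (u, v) means u replied to v.
G = nx.DiGraph([
    ("alice", "bob"), ("bob", "alice"),
    ("bob", "carol"), ("carol", "alice"),
    ("dave", "alice"),
])

# Density: edges present divided by all possible directed edges, n * (n - 1).
density = nx.density(G)

# Reciprocity: fraction of edges that have an edge going the other way.
reciprocity = nx.reciprocity(G)

# Transitivity: 3 * triangles / connected triples, ignoring edge direction.
transitivity = nx.transitivity(G.to_undirected())

print(round(density, 4), round(reciprocity, 4), round(transitivity, 4))
```

In this toy network alice, bob and carol form a triangle, while dave only attaches to alice, so the transitivity is below 1.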
Reciprocity
As mentioned earlier, reciprocity is one of the most important metrics for determining two-way conversations. The results from the study indicate that r/racket, r/matlab, r/visualbasic and r/Rlanguage are the most highly ranked subreddits for reciprocity, each scoring between 0.57 and 0.59.
Density
With respect to density, r/forth, r/Delphi, r/perl and r/d_language had the highest proportion of occupied edges within the network. Bear in mind that these numbers are quite small (which is often expected for density): even r/forth, the densest network, only reaches about 0.01. These communities are also much smaller than popular ones like r/Python, with 417 nodes for r/forth compared to 2370 for r/python.
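The density values in the table at the end of the post are consistent with the standard formula for a directed graph, edges divided by nodes × (nodes − 1). A quick check with the node and edge counts from that table (the `directed_density` helper is just for illustration):

```python
def directed_density(nodes, edges):
    """Share of all possible directed edges that are present."""
    return edges / (nodes * (nodes - 1))

racket = directed_density(452, 1368)    # r/racket
forth = directed_density(417, 1781)     # r/forth
python = directed_density(2370, 4101)   # r/python

print(f"{racket:.6f} {forth:.6f} {python:.6f}")
# prints "0.006711 0.010267 0.000730"
```

Since the number of possible edges grows roughly with the square of the number of users, a large community needs far more replies than a small one to reach the same density.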
Transitivity
Similar to density, r/forth, r/perl and r/Delphi rank quite highly for transitive ties, which suggests that you’re more likely to find triad-like communities in these subreddits than elsewhere. These are big clues for detecting cliques of users.
Overall, I believe that the results speak for themselves. If I’m honest, I wasn’t expecting lesser-known subreddits such as r/forth and r/Delphi to have such strong results compared to popular ones such as r/Python or r/JavaScript. I think this is down to a few reasons, which I’ve summarised below.
Small communities
As mentioned earlier, it is the smaller communities that appear to score the highest. This makes sense: if you reduce the size of the community, there is a strong possibility that you’re going to engage with the same users repeatedly, meaning that reciprocity and transitivity are going to be quite high.
Network design
The design of the network may have an impact on the results, depending on how the discussions are modelled. Considering that we are collapsing hierarchical discussion trees into user-to-user interactions, there is a possibility that we are missing important data which could point to different types of conversation. For example, a reply network doesn’t capture a prolonged debate between two users, whereas a reply tree would show the depth of the discussion.
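A small sketch of what gets lost, assuming networkx and a made-up thread in which two users go back and forth four times. The tuple layout and names are hypothetical.

```python
import networkx as nx

# Hypothetical thread: alice posts submission "s1", then bob and alice
# alternate replies. Each entry is (comment_id, author, parent_id).
authors = {"s1": "alice"}
comments = [
    ("c1", "bob", "s1"),
    ("c2", "alice", "c1"),
    ("c3", "bob", "c2"),
    ("c4", "alice", "c3"),
]

tree = nx.DiGraph()     # reply tree: parent item -> child comment
network = nx.DiGraph()  # reply network: replier -> recipient

for comment_id, author, parent_id in comments:
    authors[comment_id] = author
    tree.add_edge(parent_id, comment_id)
    network.add_edge(author, authors[parent_id])

depth = nx.dag_longest_path_length(tree)
print(depth)                      # 4
print(network.number_of_edges())  # 2
```

The tree records a discussion four replies deep, while the network only keeps the two edges between bob and alice. One possible way to retain some of this information would be to store a reply count as an edge weight, although that is not something this study did.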
Limited engagement
I remember reading somewhere that only a very small subset of users actually engages with content produced on Reddit, and that an even smaller percentage leaves meaningful replies for users with questions. I think this might be a factor to consider when studying these networks.
As an experiment, I think these results are interesting, but it’s important to keep an open mind on how we model these networks going forward. After all, this is how science advances.
Subreddits
If you’re interested in the numbers and the subreddits used, these are as follows…
Subreddit | No. Nodes | No. Edges | Density | Reciprocity | Transitivity |
---|---|---|---|---|---|
r/racket | 452 | 1368 | 0.006711 | 0.589181 | 0.072255 |
r/matlab | 749 | 1627 | 0.002904 | 0.583897 | 0.021305 |
r/visualbasic | 519 | 1761 | 0.006550 | 0.579216 | 0.029903 |
r/Rlanguage | 794 | 2003 | 0.003181 | 0.574139 | 0.018270 |
r/scheme | 501 | 1604 | 0.006403 | 0.566085 | 0.065525 |
r/forth | 417 | 1781 | 0.010267 | 0.563728 | 0.115033 |
r/delphi | 373 | 1144 | 0.008245 | 0.552448 | 0.072368 |
r/ocaml | 564 | 1635 | 0.005149 | 0.539450 | 0.032446 |
r/asm | 734 | 1843 | 0.003426 | 0.538253 | 0.025814 |
r/fortran | 790 | 2532 | 0.004062 | 0.526066 | 0.031309 |
r/d_language | 384 | 1057 | 0.007187 | 0.524125 | 0.066078 |
r/lisp | 903 | 3886 | 0.004771 | 0.513124 | 0.066695 |
r/rstats | 1040 | 2357 | 0.002181 | 0.509122 | 0.022743 |
r/perl | 615 | 3094 | 0.008194 | 0.504848 | 0.101755 |
r/clojure | 730 | 2125 | 0.003993 | 0.491294 | 0.040505 |
r/latex | 963 | 2417 | 0.002609 | 0.489863 | 0.016770 |
r/lua | 810 | 2389 | 0.003646 | 0.489745 | 0.022349 |
r/haskell | 1315 | 4939 | 0.002858 | 0.484714 | 0.041025 |
r/erlang | 483 | 913 | 0.003922 | 0.484118 | 0.025435 |
r/fsharp | 613 | 2087 | 0.005563 | 0.482990 | 0.065214 |
r/Kotlin | 1170 | 2850 | 0.002084 | 0.480702 | 0.028571 |
r/sql | 1417 | 3349 | 0.001669 | 0.472977 | 0.018622 |
r/ruby | 940 | 2322 | 0.002631 | 0.472007 | 0.036358 |
r/scala | 990 | 3359 | 0.003431 | 0.465019 | 0.038037 |
r/c_programming | 1575 | 5069 | 0.002045 | 0.464786 | 0.031794 |
r/swift | 1134 | 2382 | 0.001854 | 0.464316 | 0.015414 |
r/rust | 2348 | 6125 | 0.001111 | 0.463020 | 0.023427 |
r/golang | 1849 | 4245 | 0.001242 | 0.457008 | 0.018304 |
r/python | 2370 | 4101 | 0.000730 | 0.453548 | 0.031596 |
r/php | 2542 | 9350 | 0.001448 | 0.449198 | 0.050137 |
r/csharp | 2137 | 5346 | 0.001171 | 0.436214 | 0.021638 |
r/cpp | 2938 | 10521 | 0.001219 | 0.422393 | 0.028237 |
r/java | 2458 | 9054 | 0.001499 | 0.412194 | 0.047729 |
r/javascript | 2668 | 5311 | 0.000746 | 0.378836 | 0.019625 |