Generating Editor Graphs from Wikipedia Articles
Tags: Wikipedia, research, phd, networks, subgraphs, collective-intelligence, API, graph, revision
There is no doubt that Wikipedia as a service is super useful. I don't know how many times I've used it to learn about a new topic. Wikipedia is perhaps best known as a collaborative platform where users (or editors) of any background can come and contribute their knowledge and ensure that a given Wikipedia article maintains a high quality.
I feel that little appreciation goes towards the editors themselves. After all, they are the ones who are making the contributions!
To understand a little more about how editors interact on Wikipedia, this post outlines a basic method for turning the time series of revisions to an article into a static network graph of editor collaboration. While this approach is used for research purposes, further alterations and suggestions are encouraged to form a more accurate representation of the task at hand.
Problem definition
The data is represented as a network structure, meaning that the underlying principles of graph theory are in use. In this example, a Wikipedia editor is represented as a vertex, and a directed edge is used to pair editors together as "A edits B". The graph is modelled from the order in which users edit a Wikipedia article: the most recent editor is taken to have edited the previous revision.
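As a minimal sketch of this pairing rule (with placeholder editor names, not real accounts): given revisions listed newest-first, each editor is paired with the author of the revision before theirs.

# Placeholder editor names; revisions ordered newest -> oldest
revisions = ["Carol", "Bob", "Alice"]
# Pair each editor with the author of the previous revision: "A edits B"
edges = list(zip(revisions, revisions[1:]))
print(edges)  # [('Carol', 'Bob'), ('Bob', 'Alice')]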
Parameters
Data can be queried from Wikipedia through a single endpoint URL, which can be accessed at https://en.wikipedia.org/w/api.php. Additionally, parameters are needed in the GET request to further refine results; a complete list is available in the MediaWiki API documentation. For example, the URL for accessing the most recent revisions of the article "Coffee" is:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvlimit=max&titles=Coffee
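The same query can also be built programmatically. Below is a short sketch using the requests library; the parameter values simply mirror the URL above.

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",       # query module
    "prop": "revisions",     # ask for the revision history
    "format": "json",        # return JSON rather than XML
    "rvlimit": "max",        # as many revisions as allowed per request
    "titles": "Coffee",      # the article of interest
}
response = requests.get(API_URL, params=params)
data = response.json()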
Data Formats
The revision history of any Wikipedia article can be reproduced as either an XML file or a JSON document, accessed through the single URL endpoint described above. For this exercise, the JSON format is selected as it offers a more convenient option for loading data structures into the Python programming language. Below is a sample revision document returned by the API.
"continue": {
"rvcontinue": "20181017170220|864501192",
"continue": "||"
},
"query": {
"pages": {
"604727": {
"pageid": 604727,
"ns": 0,
"title": "Coffee",
"revisions": [
{
"revid": 864501853,
"parentid": 864501313,
"user": "Heroeswithmetaphors",
"timestamp": "2018-10-17T17:07:50Z",
"comment": "/* Caffeine content */links"
}
]
}
}
}
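To show how this structure maps onto Python, here is a small sketch that walks the parsed document and prints the editor behind each revision. It assumes data holds the parsed JSON response from the query shown earlier.

# Walk the pages -> revisions structure and print each revision's editor
pages = data["query"]["pages"]
for page_id, page in pages.items():
    for revision in page.get("revisions", []):
        print(revision["user"], revision["timestamp"])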
The Algorithm
The Python programming environment conveniently provides the packages needed to model the data as a network graph. The networkx package offers the features used to generate and manipulate network graphs, and the matplotlib library can be used to plot the resulting visualisation. Provided that the user has an article title they wish to analyse, the pseudo-code for the algorithm can be described as follows (a runnable sketch is given after the list):
- Get the article title
- Set the URL parameters and load the JSON document
- For each revision entry in the document:
  - Get the user and store it as a node
  - Get the next user in the list and store it as a node
  - If the user pair does not exist, store an edge between the two users
- Output the edge pairs as CSV
- Plot the user graph
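The following is a runnable sketch of the pseudo-code above, assuming the query parameters shown earlier. The function names (build_editor_graph, export_and_plot) and the edges.csv output path are illustrative choices, and pagination via rvcontinue is omitted for brevity, so only the most recent batch of revisions is processed.

import csv
import requests
import networkx as nx
import matplotlib.pyplot as plt

API_URL = "https://en.wikipedia.org/w/api.php"

def build_editor_graph(title):
    # Set the URL parameters and load the JSON document
    params = {
        "action": "query",
        "prop": "revisions",
        "format": "json",
        "rvlimit": "max",
        "titles": title,
    }
    data = requests.get(API_URL, params=params).json()

    graph = nx.DiGraph()
    for page in data["query"]["pages"].values():
        # Revisions are returned newest-first: each editor "edits" the previous revision's author
        users = [rev["user"] for rev in page.get("revisions", [])]
        for editor, previous_editor in zip(users, users[1:]):
            if not graph.has_edge(editor, previous_editor):
                graph.add_edge(editor, previous_editor)
    return graph

def export_and_plot(graph, csv_path="edges.csv"):
    # Output the edge pairs as CSV
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "target"])
        writer.writerows(graph.edges())
    # Plot the user graph
    nx.draw(graph, with_labels=True, node_size=50, font_size=6)
    plt.show()

if __name__ == "__main__":
    export_and_plot(build_editor_graph("Coffee"))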
Conclusions
While the algorithm provided in this solution is far from complete, it provides the foundations for analysing the basic graph structure of Wikipedia editors by capturing their interactions. Furthermore, the resulting graph only considers interactions that accumulate over time within a single article. Contributors are welcome to modify the algorithm to model interactions in a more meaningful way.