There is no doubt that Wikipedia as a service is super, super useful. I don’t know how many times I’ve used it to learn about a new topic. Wikipedia is perhaps best well-known for its service as a collaborative platform where users (or editors) of any background can come and contribute towards sharing their knowledge and ensure that a given Wikipedia article maintains a high quality.
I feel that little appreciation goes towards the editors themselves. After all, they are the ones who are making the contributions!
To understand a little bit more about how editors interact on Wikipedia, this post lists the basic methods used to derive a network graph to model the collaborative interactions of Wikipedia editors from a time-series format to a static graph. While this approach is used for research purposes, further alterations and suggestions are encouraged to form a more accurate representation of the task at hand.
The data is represented as a network structure meaning that the underlining principles of graph theory are in use. In this example, a Wikipedia editor is represented as a vertex and a directed edge is used to pair editors together as A edits B. In this context, a graph is modelled based from the order users edit a Wikipedia article. In this case, the most recent editor edits the previous revision.
To query data from Wikipedia, data can be accessed from a single endpoint URL. This can be accessed at
https://en.wikipedia.org/w/api.php. Additionally, parameters are needed in the GET request to further refine results. A complete list can be accessed here. For example, the URL for accessing the most recent revision for the article “Coffee” can be located at:
The revision of any Wikipedia article can be reproduced as either as an XML file or JSON document, accessed by a single URL endpoint API. For this exercise, the JSON format will be selected as this offers a more convenient option for loading data structures into the Python programming language. Bellow illustrates a sample revision document from the API.
"comment": "/* Caffeine content */links"
The Python programming environment conveniently provides the packages needed to model the data as a network graph. The package
networkx offers convenient features used to generate and manipulate network graphs. In addition to this, the library
matplotlib can be used to plot the resulting visualisation. Providing that the user has an article title they wish to analyse, the pseudo-code for the algorithm can be described as follows:
- Get article title
- Set URL parameters and load JSON
For each revision entry in document:
- Get user and store as node
- Get next user in list and store as node
If user pair does not exist: store edge between two users
- Output edge pairs as CSV
- Plot user graph
While the algorithm provided in this solution is far from complete, it provides the foundations to analyse the basic graph structure of Wikipedia editors by means of capturing their interactions. Furthermore, the output of the resulting graph only considers interactions that accumulate over time within a single article. Contributors are welcome to modify the algorithm to model interactions in a more meaningful format.