
Generating Editor Graphs from Wikipedia Articles

There is no doubt that Wikipedia as a service is super, super useful. I don’t know how many times I’ve used it to learn about a new topic. Wikipedia is perhaps best known as a collaborative platform where users (or editors) of any background can come and share their knowledge, and help ensure that a given Wikipedia article maintains a high quality.

I feel that little appreciation goes towards the editors themselves. After all, they are the ones who are making the contributions!

To understand a little bit more about how editors interact on Wikipedia, this post outlines a basic method for turning an article’s revision history, which is a time series, into a static network graph that models the collaborative interactions of its editors. While this approach is intended for research purposes, alterations and suggestions that lead to a more accurate representation of the task at hand are encouraged.

Problem definition

The data is represented as a network structure, meaning that the underlying principles of graph theory apply. In this example, each Wikipedia editor is represented as a vertex, and a directed edge from A to B records that editor A edited editor B’s revision. The graph is built from the order in which users edit a Wikipedia article: the editor of each revision is linked to the editor of the revision immediately before it.
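As a small illustration of this definition, the sketch below turns an ordered list of editors into directed edges. The function name `edges_from_editors` and the usernames are made up for this example, and the list is assumed to be ordered newest first.

```python
def edges_from_editors(editors):
    """Pair each editor with the editor of the revision they edited.

    `editors` is assumed to be ordered newest first, so the editor at
    position i edited the revision written by the editor at position i + 1.
    """
    return list(zip(editors, editors[1:]))


# Hypothetical usernames, newest revision first.
editors = ["Carol", "Bob", "Alice"]
edges = edges_from_editors(editors)
print(edges)  # [('Carol', 'Bob'), ('Bob', 'Alice')]
```

Here Carol edited Bob’s revision and Bob edited Alice’s, so the edges point from the newer editor to the older one.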

Parameters

All data is queried from a single endpoint URL, https://en.wikipedia.org/w/api.php. Parameters in the GET request are needed to further refine the results. A complete list can be accessed here. For example, the most recent revisions of the article “Coffee” can be retrieved from:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvlimit=max&titles=Coffee
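Rather than typing the URL by hand, the query string can be assembled from a dictionary of parameters. The following is a minimal sketch using only the standard library. The helper name `build_revisions_url` is hypothetical, and the optional `rvcontinue` argument is assumed to carry the continuation token the API hands back when more revisions are available.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, rvcontinue=None):
    """Return the query URL for the revision history of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "format": "json",
        "rvlimit": "max",
        "titles": title,
    }
    if rvcontinue is not None:
        # Continuation token taken from a previous response.
        params["rvcontinue"] = rvcontinue
    # urlencode escapes spaces and special characters in the values.
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_revisions_url("Coffee"))
```

Building the URL this way also takes care of titles containing spaces or punctuation, which would otherwise need escaping manually.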

Data Formats

The revisions of any Wikipedia article can be returned as either an XML file or a JSON document from the same API endpoint. For this exercise, the JSON format is selected as it offers a more convenient option for loading data structures into the Python programming language. Below is a sample revision document from the API.

    {
        "continue": {
            "rvcontinue": "20181017170220|864501192",
            "continue": "||"
        },
        "query": {
            "pages": {
                "604727": {
                    "pageid": 604727,
                    "ns": 0,
                    "title": "Coffee",
                    "revisions": [
                        {
                            "revid": 864501853,
                            "parentid": 864501313,
                            "user": "Heroeswithmetaphors",
                            "timestamp": "2018-10-17T17:07:50Z",
                            "comment": "/* Caffeine content */links"
                        }
                    ]
                }
            }
        }
    }
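To show how a document like the sample above might be read in Python, here is a rough sketch that pulls out the editor of each revision and the continuation token. The helper `extract_users` is a hypothetical name, and skipping revisions without a `user` field is a defensive assumption rather than something the sample requires.

```python
import json

# Copy of the sample response, wrapped in the outer braces of a full document.
SAMPLE = """
{
  "continue": {"rvcontinue": "20181017170220|864501192", "continue": "||"},
  "query": {
    "pages": {
      "604727": {
        "pageid": 604727,
        "ns": 0,
        "title": "Coffee",
        "revisions": [
          {
            "revid": 864501853,
            "parentid": 864501313,
            "user": "Heroeswithmetaphors",
            "timestamp": "2018-10-17T17:07:50Z",
            "comment": "/* Caffeine content */links"
          }
        ]
      }
    }
  }
}
"""


def extract_users(document):
    """Return the `user` of every revision, in the order the API lists them."""
    users = []
    # Pages are keyed by page id, so iterate over the values.
    for page in document["query"]["pages"].values():
        for revision in page.get("revisions", []):
            if "user" in revision:  # assumption: skip entries with no user
                users.append(revision["user"])
    return users


document = json.loads(SAMPLE)
users = extract_users(document)
next_token = document.get("continue", {}).get("rvcontinue")
print(users)       # ['Heroeswithmetaphors']
print(next_token)  # 20181017170220|864501192
```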

The Algorithm

The Python ecosystem conveniently provides the packages needed to model the data as a network graph. The networkx package offers features for generating and manipulating network graphs, and the matplotlib library can be used to plot the resulting visualisation. Provided that the user has an article title they wish to analyse, the pseudo-code for the algorithm can be described as follows:

  1. Get article title
  2. Set URL parameters and load JSON
  3. For each revision entry in document:
    1. Get user and store as node
    2. Get next user in list and store as node
    3. If user pair does not exist: store edge between two users
  4. Output edge pairs as CSV
  5. Plot user graph
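The steps above could be sketched in Python roughly as follows. This is one possible reading of the pseudo-code, not a definitive implementation: all function names are hypothetical, the User-Agent string is a placeholder, only a single batch of revisions is fetched (continuation is not handled), user pairs are treated as directed, and networkx and matplotlib are assumed to be installed for the plotting step.

```python
import csv
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def fetch_revisions(title):
    """Steps 1-2: request one batch of revisions for `title` and load the JSON."""
    params = {"action": "query", "prop": "revisions", "format": "json",
              "rvlimit": "max", "titles": title}
    request = Request(f"{API_ENDPOINT}?{urlencode(params)}",
                      headers={"User-Agent": "editor-graph-sketch/0.1"})  # placeholder
    with urlopen(request) as response:
        document = json.load(response)
    page = next(iter(document["query"]["pages"].values()))
    return page.get("revisions", [])


def build_edges(revisions):
    """Step 3: link each editor to the editor of the next revision in the list."""
    users = [rev["user"] for rev in revisions if "user" in rev]
    edges, seen = [], set()
    for editor, previous_editor in zip(users, users[1:]):
        if (editor, previous_editor) not in seen:  # store each directed pair once
            seen.add((editor, previous_editor))
            edges.append((editor, previous_editor))
    return edges


def write_edges_csv(edges, handle):
    """Step 4: write one `source,target` row per edge to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])
    writer.writerows(edges)


def plot_graph(edges, path):
    """Step 5: draw the directed graph and save it as an image."""
    import matplotlib.pyplot as plt
    import networkx as nx

    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    nx.draw_networkx(graph, with_labels=True, node_size=300, font_size=8)
    plt.axis("off")
    plt.savefig(path)
    plt.close()


def main(title):
    """Run all five steps for one article title."""
    edges = build_edges(fetch_revisions(title))
    with open("edges.csv", "w", newline="", encoding="utf-8") as handle:
        write_edges_csv(edges, handle)
    plot_graph(edges, "graph.png")


# Offline check with made-up revisions, newest first.
sample = [{"user": "Carol"}, {"user": "Bob"}, {"user": "Carol"},
          {"user": "Bob"}, {"user": "Alice"}]
print(build_edges(sample))
# [('Carol', 'Bob'), ('Bob', 'Carol'), ('Bob', 'Alice')]
```

Calling `main("Coffee")` would run the whole pipeline against the live API. Note that consecutive edits by the same user produce a self-loop in this sketch; whether to keep or drop those is a modelling choice.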

Conclusions

While the algorithm provided in this solution is far from complete, it provides the foundations to analyse the basic graph structure of Wikipedia editors by capturing their interactions. Note that the resulting graph only considers interactions that accumulate over time within a single article. Contributors are welcome to modify the algorithm to model interactions in a more meaningful format.

Let the Research Begin

Those of you who have paid close attention to my life recently will know that I have just started my PhD. I’m very much excited to get things going as I know I’ll just love going all-out nerd on a topic in depth.

Having met a lot of new people recently, I’ve had to repeat a lot of the same information for the purpose of making introductions. So for those I haven’t met or who don’t know me, my PhD research is looking at networks and the role they play in collective intelligence.

In short, I plan on approaching this by looking at simple sub-graphs that belong to a much larger directed graph built from a collection of linked nodes. This falls under what is more broadly known as graph theory, which allows us to solve problems that can be expressed as entities and relations. This is particularly useful for modelling platforms such as social networks and interactions between users.

I certainly hope that I’ll be able to keep these positive and ambitious feelings going without feeling discouraged when things get difficult. I’m also using this time to try out new activities and join new clubs to broaden my mindset outside of the office.