How Self-Organising Maps Work: Explained with Graphics

Artificial Neural Networks (simply known as neural networks) come in all different shapes and sizes and can be designed to solve different types of problems. Classification, regression, supervised and unsupervised. Neural networks are incredibly versatile.

It’s fair to say that neural networks are the go-to solution for any predictive task imaginable. This is especially true for those in the data science, machine learning and academic world(s). For this reason, I thought I would take some time out of my schedule to learn more about how they work and build a few for myself to test out. I’ve never really thought about using them for anything important.

That was until I came across the Self-organising Map (otherwise known as SOM) neural network.

In my opinion, I feel that SOM’s are perhaps one of the most underrepresented types of neural network and, quite possibly, one of the most useful.

In this blog post, I will cover three points:

  • What is a SOM neural network?
  • Why SOM’s are useful
  • How do SOM’s work?

So let’s get started, what exactly is a SOM neural network.

What is a SOM neural network?

A SOM is a type of neural network which is designed to turn a high-dimensional data set into a low-dimension representations whilst preserving the structure of the original data. 

In other words, it attempts to approximate the data in its original form which is then mapped to a graph structure, such as a two-dimensional grid. This means data points which are close together in the original set will be close together in the new “mapped space”.

Much like other neural networks, a SOM attempts to “understand” a data set by going through a training process which involves building a model by repeatedly taking samples of the data and, with the help of some clever mathematics (explained later), the model will eventually align with the data.

The animation below is a very simple example of what a SOM does. Think of it like a blanket which wraps around some data points in an N-dimensional space.

An animation of weights “learning” the data in a scatter plot.

Why SOM’s are useful

As mentioned before, SOM neural networks are designed to make high-dimensional data sets easier to analyse and visualise. They do this by building a map between the real, high-dimensional, data (the input) and the low-dimensional output (the map).

What makes SOM’s useful is that they are versatile, as they can work with data in all different shapes and sizes in an attempt to build an accurate representation of the data. In most cases, a 2D lattice is used (like the one shown in the animation above) although almost any network configuration can be used such as a chain, hexagonal and triangular grid.

How do SOM’s work?

Like all machine learning algorithms, a SOM neural network needs to be trained on a data set. The general training process involves repeatedly presenting random data samples to the neural network and adjust weights (minimising the error) until a given threshold has been reached. This is known as gradient descent and is widely used in machine learning to make the training process as efficient as possible.

This process can be summarised with the following stages:

Initialise Data and Weights

To get started, we need some data. In this simple example, imagine we have a 2-dimensional data set containing \(N=100\) points with four classes. Let’s call it \(X\). It might look something like this when plotted:

Simple example of some mock data.

With the data \(X\) loaded, we can now set up our weights, \(W\) with the same dimensions. For now, this can be just some random numbers. These weights represent the strength of the connections, where each row repents a neural unit – a node in the network. Again, when plotted, it might look like this:

Simple example of some mock data including weights.

Define Leaning Rate and Neighbourhood Restraint

Now that the data and weights have been set up, we need to define a few functions before we start the training process. A learning rate and neighbourhood restraint.

A learning rate, let’s call it \(\alpha\), is used to determine how quickly or slowly the algorithm “learns” updates the weights. Usually, the value of \(\alpha\) starts at just below \(1.0\) (such as \(0.999\)) but slowly decreases per iteration. A high learning rate means that weights are updated too quickly and a low learning rate means that weights are not updated enough.

The rate at which this decreases is determined by a function. In my case, this function is determined by the following, where \(x\) is the current position and \(k\) is some variable. The value of \(k\) can be adjusted to determine the rate at which the learning rate decreases.

\begin{align*} f(x) = e^{xk}\end{align*}

Plotting this with different values of \(k\) gives us this:

Plotting different values of k using the learning rate function.

In addition to the learning rate, a neighbourhood restraint, \(\theta\), is used to determine how much (or little) influence a neuron has on neighbouring neurons. In other words, neurons that are closer have more influence than those that are further away.

Consider the example below.

Example of a neighbourhood in a network.

The target node (shown in red) has greater influence over those in blue as opposed to those in white. This is because there is a shorter path between them.

To determine how much influence neighbouring nodes have, another function is used to calculate a value between \(1.0\) and \(0.0\). The general idea is that the further away the node, the less influence it has. The neighbourhood restraint function is based upon the equation of a bell curve (also known as a normal distribution) where \(x\) represents the distance from the target node and \(s\) is used to control the “spread” of the curve.

This is defined as follows:

\begin{align*} f(x) = e^{-\frac{x}{s^2}}\end{align*}

Plotting for different values of \(s\) gives us.

Plotting different values of s using the neighbourhood restraint function.

Training Algorithm

With the data, weights, learning rate and neighbourhood restraint functions set up, we can now begin the process of training the SOM. The algorithm is fairly straight forward and is defined as follows:

Step 1: Pick random data point

Using the data provided in \(X\), we can take samples by picking a row at random with the variable \(j\) such that a random sample can be represented by \(X_j\).

Step 2: Find closed weight

To find the closest weight to the random data point \(X_j\), we can use the Euclidean distance to find the neuron with the shortest distance using the weights for each neuron. The Euclidean distance is calculated as follows:

$$ d(x, y) = || x – y || = \Sigma_{i=1}^{N} (x_i – y_i)^2 $$

Repeating this for each of the neuron’s weights, we can determine the neuron with the shortest distance. This neuron is called the best matching unit (otherwise known as a BMU).

Step 3: Update weights

Now that we have found the BMU, we can update its weights and the weights of the nodes connected to the BMU. This is determined by how the network has been set up. The weights are updated using the following equation where \(W_i(t)\) represents the weights of the BMU at the current iteration \(t\), \(\theta(t)\) is the neighbourhood restraint,\ (\alpha(t)\) as the learning rate and \(W_i(t+1)\) as the new updated weights of the BMU:

\begin{align*} W_i(t+1) = W_i(t) + \theta(t) \alpha(t) (X_j(t) – W_i(t))\end{align*}

Step 4: Repeat from Step 1

Steps 1 to 3 are repeated over several 1,000’s of iterations (depending on other variables such as the learning rate and the size of the data) to ensure that the SOM will eventually converge and align itself with the data.

Using the same example shown above, the plots below show how the neuron weights align themselves with the data points after 75,000 iterations.

The SOM weights after training.
The SOM weights after training (including grid).

Assign Neurons to Data

Now that the neurons have been trained on the data, we need to actually assign a neuron (a node in the network) to a data point in a one-to-one mapping. After all, that’s the whole point of training the SOM.

There are different strategies to achieve this, but the simple approach is to keep track of how many times each random data point “wins” a neuron (based upon the shortest distance during the training process). The data point with the most number of matches gets assigned to that neural unit.

It’s fair to say that this approach isn’t foolproof, as there may be some data points which don’t get assigned to a neuron at all.

Using the original example above, this is what the final result looks like when trained on a 2D lattice:

An example of the trained SOM using mock data.

I even managed to animate this over time, so you can see how each neuron gets assigned to different data points as it learns the data.

Animation of a SOM getting assigned to different neurons during the training process.

Example

To test this out, I thought I would train a SOM neural network on the well known Iris data set – a data set which is featured in many data science tutorials. It contains \(N=4\) dimensions (Sepal length, Sepal width, Petal length and Petal width) and has \(N=3\) classes.

Subsequently, the SOM produced the following result, where each unique colour represents a class in the set:

Self-organised map using the famous Iris data set.

Conclusions

In this rather long post, I cover 1) the basics for what a SOM neural networks is, 2) why SOM’s are useful for data visualisation and representation, 3) how they work (training e.t.c.) and 4) what they look like using a simple example. While this post only covers the theory of SOM neural networks, I plan to do a separate blog post covering how this can be implemented in Python using off-the-shelf packages such as matplotlib and numpy.