How to Build a Self-Organising Map (SOM) in Python Step-by-Step

In a separate blog post, I covered the basics needed to understand how self-organising maps (SOMs) work. This included what a SOM is, why SOMs are useful and how they work, including the training algorithm.

In this blog post, we’ll go through how to build a simple SOM neural network of our own using off-the-shelf Python packages, such as numpy. If you haven’t already seen my previous post, please read it first, as it covers everything you need to know about the theory of SOMs and will be helpful for understanding the code used in this post.

The full code can be found here.

Import packages

To begin, let’s start with a handful of imports…

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from sklearn import datasets
...

Load dataset

For the purposes of this blog post, we’ll keep things simple and use the famous Iris data set, which is featured in sklearn. This is loaded in along with the corresponding labels.

...
dataset = datasets.load_iris()

X = dataset['data']
labels = dataset['target']
...

Initialise weights and graph

The number of weights used for the SOM needs to be equal to or greater than the number of data points in the original data set. This is to ensure that every data point can be assigned a node. At this stage, the weights can be initialised to anything (such as a regular grid) although, to keep things simple, a randomised matrix will suffice.

Note: In this blog post, I use the terms node, unit and neuron interchangeably; they all refer to the same thing.

...
W = np.random.random(size=X.shape)
...

Along with the weights, we need to define the logic for processing neighbouring nodes. To make this a little easier, networkx is used, as it provides all the necessary functions for computing the neighbourhood and distance between nodes.

Thanks to networkx, practically any graph topology can be used as the basis for the SOM. In my case, I’m using a simple 2D grid created with the grid_2d_graph function.

...
s = int(np.ceil(np.sqrt(W.shape[0])))
G = nx.grid_2d_graph(m=s, n=s)
...
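
Since networkx is doing the heavy lifting here, swapping in a different topology is essentially a one-line change. As a purely illustrative sketch (the rest of this post sticks with the square grid above), a hexagonal lattice could be built with networkx’s hexagonal_lattice_graph generator:

# Sketch only: a hexagonal lattice instead of a square grid. The nodes are
# still coordinate tuples, so the index dictionaries below work unchanged.
G_hex = nx.hexagonal_lattice_graph(m=s, n=s)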

To ensure that the i’th row of the weights maps to the i’th node in the graph, we generate two dictionaries so we can quickly convert between 1D and 2D indices.

...
c2i = {c: i for i, c in enumerate(G.nodes())}
i2c = {v: k for k, v in c2i.items()}
...

We also need to make sure that the number of weight vectors matches the number of nodes in the graph (the grid dimensions were rounded up, so the graph may contain slightly more nodes). This is a quick fix, as we can just concatenate extra random rows to make up the difference.

...
diff = np.abs(len(G.nodes()) - W.shape[0])
if diff > 0:
    W = np.concatenate([W, np.random.random(size=(diff, W.shape[1]))])
...

Define learning rate

As mentioned in the previous blog post, we need to define a learning rate function to ensure that the network converges at an appropriate rate. There are many ways to do this but, in my case, I used a simple exponential decay function.

...
def f1(x, k):
    return np.exp(x * k)

def learning_rate(s, k):
    return f1(x=s, k=k)
...
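
To get a feel for the decay, here’s a quick, purely illustrative check using the k value that appears later in the training loop; with k = -1/1000, the rate starts at 1 and shrinks towards zero as the iteration count s grows.

# Illustration only: learning rate at a few iteration counts with k = -1/1000
for step in [0, 1000, 5000, 20000]:
    print(step, learning_rate(s=step, k=-(1 / 1000)))
# 0 -> 1.0, 1000 -> ~0.37, 5000 -> ~0.0067, 20000 -> ~2e-9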

Define neighbourhood restraint

Similarly, we also need to define a neighbourhood restraint function to determine how much influence the best matching unit (BMU) node has on its neighbours. In my example, I use a simple bell curve taken from a Gaussian distribution.

...
def f2(x, s):
    return np.exp(-(x / (s ** 2)))

def restraint(G, best, n, sigma=4):
    dist = nx.shortest_path_length(G, source=best, target=n)
    return f2(dist, s=sigma)
...
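
Again, just for intuition, here’s how this restraint falls off with graph distance from the BMU when sigma is 4 (illustration only, not part of the main script):

# Illustration only: restraint values for graph distances 0-3 with sigma = 4
for d in range(4):
    print(d, f2(x=d, s=4))
# 0 -> 1.0, 1 -> ~0.94, 2 -> ~0.88, 3 -> ~0.83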

The training algorithm

Prior to training the SOM, we need to keep track of which unit (a node / neuron) won which data point. A simple dictionary will do the trick.

...
winners = {i: [] for i in range(W.shape[0])}
...

We also need to set a limit on how many training iterations are performed to ensure that the algorithm completes. 75,000 iterations should be plenty.

...
max_iter = 75000

# Start training loop
for s in range(max_iter):
    ...

Within the main training loop, we train the model by drawing a random data point from the data set.

    ...
    r_idx = np.random.randint(X.shape[0])
    x = X[r_idx, :]
    ...

With the random data point, the Euclidean distance is measured between it and each weight vector to determine the BMU, i.e. the weight with the smallest distance, marked by best_idx. This is super easy thanks to linear algebra and the np.linalg.norm function.

    ...
    # Find the best matching unit (BMU) using Euclidean distance
    x_stack = np.stack([x]*W.shape[0], axis=0)
    dists = np.linalg.norm(x_stack - W, axis=1)
    best_idx = np.argmin(dists)
    ...
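
As a small aside, numpy broadcasting makes the explicit stacking unnecessary; the snippet below (a drop-in alternative, not the code used above) computes the same distances:

    # Equivalent BMU search using broadcasting (no need to stack copies of x)
    dists = np.linalg.norm(W - x, axis=1)
    best_idx = np.argmin(dists)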

Now that we’ve found the BMU, we need to update its weights using the learning rate function defined earlier.

    ...
    # Set learning rate
    k = -(1/1000)
    a = learning_rate(s, k)
    
    # Update weights
    W[best_idx, :] = W[best_idx, :] + a * (x - W[best_idx, :])
    ...

We also need to add the index of the data point to the “won” list of the BMU.

    ...
    winners[best_idx].append(r_idx)
    ...

Now that we’ve updated the weights of the BMU, we also need to update the weights of its immediate neighbours. This is where the neighbourhood restraint function comes in.

    ...
    immediate_n = list(G[i2c[best_idx]])
    for n in immediate_n:
        W[c2i[n], :] = W[c2i[n], :] + restraint(G, i2c[best_idx], n) * a * (x - W[c2i[n], :])
    ...
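
Here, only the BMU’s immediate neighbours are updated. If you wanted to reach further into the graph, one possible variant (a sketch with a hypothetical radius parameter, not part of the code above) is to use networkx’s single_source_shortest_path_length with a cutoff:

    # Sketch: update every node within `radius` hops of the BMU, not just
    # immediate neighbours, weighting each update by the restraint function
    radius = 2
    nearby = nx.single_source_shortest_path_length(G, i2c[best_idx], cutoff=radius)
    for n, d in nearby.items():
        if n == i2c[best_idx]:
            continue  # the BMU itself was already updated above
        W[c2i[n], :] = W[c2i[n], :] + restraint(G, i2c[best_idx], n) * a * (x - W[c2i[n], :])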

Assign data to nodes

Now that the SOM has been trained, we need to match each node in the graph with a data point. This is why we initialised the winners dictionary: it lets us assign each node the data point it won most often. The data point with the most “wins” is assigned to the node; if a node won nothing at all, we simply return no index.

...
# Match each unit to the data point it "won" most often, else assign no label
def best_match(idxs):
    idx = None
    if len(idxs) > 0:
        idx = max(idxs, key=idxs.count)
    
    return idx

unit_match = {i: best_match(idxs) for i, idxs in winners.items()}
unit_label = {i: labels[idx] for i, idx in unit_match.items() if idx is not None}
...

Now that we have assigned data to nodes, we can visualise the results. In my case, I’m using a scatter plot; however, you can use whatever data visualisation technique you think is most appropriate (such as a heatmap). The resulting output can be found below.

...
# Plot the results as a scatter plot (optional, try heat map)
results = np.array([list(i2c[k]) for k in unit_label.keys()])
pred = list(unit_label.values())

plt.figure(figsize=(5, 5))
plt.scatter(results[:, 0], results[:, 1], c=pred, s=100)
plt.axis('off')
plt.show()
Resulting self-organised map using the Iris data set.
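
If you’d rather try the heat map mentioned above, here’s a minimal sketch (assuming the 2D grid topology used earlier) that colours each node by how many data points it won during training:

# Sketch: heat map of how many data points each node "won" during training
counts = np.zeros((s, s))
for i, idxs in winners.items():
    row, col = i2c[i]
    counts[row, col] = len(idxs)

plt.figure(figsize=(5, 5))
plt.imshow(counts)
plt.axis('off')
plt.show()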

Conclusions

Overall, this blog post covered the basics of building a very simple SOM neural network using go-to Python packages such as numpy and matplotlib. I must say, I really enjoyed building the code and tweaking the algorithm parameters until it worked well. That being said, I’m sure there are many ways in which this code could be optimised and improved for larger datasets.