Finding Peaks in a Dataset and Why It Is Not Straightforward

In my job as a data analyst, I come across many different types of problems. Some are relatively easy to solve, others not so much. Recently, I came across one I had never given much thought to before.

What is the problem? Finding multiple peaks in a dataset.

You might think this sounds incredibly simple: just take the max value in the set. And for a single peak, you would be right. But while this is ideal for finding the global maximum, it doesn’t cover all peaks. Note that I’m using the plural, implying multiple peaks.

The Problem

Let me explain this further with some dummy data. Consider the simple function below.

A dummy 1-D dataset featuring three distinct peaks.

It’s quite clear from the graphic that there are three distinct peaks in the dataset, some higher than others.

If we were to take the argmax (finding the max value’s index position), we would end up finding the global maximum.
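As a minimal sketch of this, assuming NumPy and a hypothetical stand-in for the dummy data (three Gaussian bumps whose centres, widths and heights I picked purely for illustration):

```python
import numpy as np

# Hypothetical stand-in for the dummy data: three Gaussian bumps
# centred at x = 2, 5 and 8 with heights 1.0, 2.0 and 1.5.
x = np.linspace(0, 10, 1001)
centres = np.array([2.0, 5.0, 8.0])
widths = np.array([0.5, 0.4, 0.6])
heights = np.array([1.0, 2.0, 1.5])
y = (heights * np.exp(-0.5 * ((x[:, None] - centres) / widths) ** 2)).sum(axis=1)

# argmax returns the index of the single largest sample.
top = int(np.argmax(y))
print(top, x[top], y[top])
```

Only one index comes back, and it sits on the tallest bump near x = 5.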

However, as you can see, there are multiple peaks in this set. What if I wanted to find the locations for each of these three peaks?

This is where we encounter our first issue with the argmax function.

Problem 1: argmax can only find the highest peak.

So with this in mind, our second option might be to extend the argmax idea: take the indices of not just the highest value but also the second and third-highest values in the set, by sorting the values and keeping the top N (N being some fixed number).

This is what it would look like if the top three values were plotted on the same dataset.

As you can see, this reveals another issue with argmax: the three values selected are piled around the largest peak. This is because the second and third-highest values in the set are still found within the highest peak.

It only finds the samples closest to the highest peak, even though it’s clear that there are other peaks.
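The clustering is easy to reproduce. Here is a rough sketch, assuming NumPy, with np.argsort standing in for a "top N argmax" and the same kind of made-up three-bump data (the centres, widths and heights are my own illustrative choices):

```python
import numpy as np

# Hypothetical stand-in for the dummy data: three Gaussian bumps
# centred at x = 2, 5 and 8 with heights 1.0, 2.0 and 1.5.
x = np.linspace(0, 10, 1001)
centres = np.array([2.0, 5.0, 8.0])
widths = np.array([0.5, 0.4, 0.6])
heights = np.array([1.0, 2.0, 1.5])
y = (heights * np.exp(-0.5 * ((x[:, None] - centres) / widths) ** 2)).sum(axis=1)

# Indices of the three largest samples, largest first.
top3 = np.argsort(y)[-3:][::-1]
print(x[top3])  # all three land next to x = 5, none near x = 2 or x = 8
```

The samples right next to the tallest summit are still larger than the summits of the other two bumps, so sorting by value alone never reaches them.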

Problem 2: the top N values cluster around the highest peak, so they can’t detect multiple peaks.

So this raises an important question.

How can we adequately detect multiple peaks in a dataset?

To answer this, we need to consider things such as distance between peaks, peak height, thresholds and neighbouring values.

The Solution

Thankfully, if you live in the world of Python like me, you are in luck. The awesome scipy package has a function called find_peaks (in scipy.signal) which does the job for us.

I won’t go into the maths in detail but, according to the scipy documentation, “a peak or local maximum is defined as any sample whose two direct neighbours have a smaller amplitude”.
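To make that definition concrete, here is a small sketch of the neighbour comparison in plain NumPy. The local_maxima helper is hypothetical and is not how scipy implements find_peaks; in particular it ignores flat-topped peaks, which find_peaks handles separately.

```python
import numpy as np

def local_maxima(y):
    """Indices of samples strictly greater than both direct neighbours."""
    y = np.asarray(y)
    # Compare every interior sample with its left and right neighbour.
    interior = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    # +1 converts positions in the interior slice back to indices in y.
    return np.flatnonzero(interior) + 1

print(local_maxima([0, 1, 0, 2, 3, 2, 0, 1, 1, 0]))  # [1 4]
```

The flat pair of 1s near the end is not reported, because neither sample is strictly larger than both of its neighbours.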

Revisiting our earlier example, find_peaks correctly finds all three peaks, as marked below.

Here is an example snippet of how this can be implemented in Python. In my case, I built my own fancy peak creator function, which essentially combines three normal distributions using random values for location and scale.
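A minimal sketch of that idea, assuming NumPy and scipy: make_three_peaks is a hypothetical helper written for illustration, not the original peak creator. The seed, the windows the locations are drawn from, and the height and distance settings are all assumptions on my part.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import norm

def make_three_peaks(x, seed=0):
    """Hypothetical helper: sum of three normal pdfs with random loc and scale."""
    rng = np.random.default_rng(seed)
    # Assumption: each location is drawn from its own window so the bumps stay apart.
    locs = rng.uniform([1.5, 4.5, 7.5], [2.5, 5.5, 8.5])
    scales = rng.uniform(0.2, 0.4, size=3)
    y = norm.pdf(x[:, None], loc=locs, scale=scales).sum(axis=1)
    return y, locs

x = np.linspace(0, 10, 1001)
y, locs = make_three_peaks(x)

# height: ignore anything below 10% of the tallest sample.
# distance: require at least 50 samples (0.5 in x units) between peaks.
peaks, props = find_peaks(y, height=0.1 * y.max(), distance=50)

print(x[peaks])               # x positions of the detected peaks
print(props["peak_heights"])  # their heights, returned because height was set
```

The height and distance arguments cover two of the considerations mentioned earlier (peak height and the spacing between peaks). On clean data like this they are not strictly needed, but they become useful once the signal is noisy.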