4 Ways of Calculating Autocorrelation in Python

Autocorrelation is a function that gives the correlation of a data set with itself at different delays (lags). It has many applications in statistics and signal processing. The subject was also touched on in our previous post on how to write a pitch detection algorithm in Python using autocorrelation.

What is autocorrelation?

Autocorrelation, in simple terms, is the correlation of a signal with itself at different delays. We can write this for real-valued discrete signals as \[R_{ff}(l) = \sum_{n=0}^{N} f(n)f(n - l).\] Definitions for continuous and random signals can be found, e.g., on Wikipedia.

https://en.wikipedia.org/wiki/Autocorrelation

Obviously, the maximum is at lag \(l = 0\). If the peaks of the autocorrelation function occur at even intervals, we can assume that the signal has a periodic component at that interval.
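
As a quick illustration (this is just a sketch of the definition above, not one of the four methods below), the sum translates almost directly into plain Python:

''' Direct translation of the definition (illustration only) '''

def raw_autocorr(f, l):
    # R_ff(l) = sum over n of f(n) * f(n - l), using only valid indices
    return sum(f[n] * f[n - l] for n in range(l, len(f)))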

Let's consider a 10Hz sine wave, and sample this wave with a 1000Hz sampling rate.  

With the 1000Hz sampling rate, we will have 100 samples per full period of the wave. Now, look at the autocorrelation function of the sine wave.

It looks almost the same. Notice how we have a maximum at \(l = 0\). Since our signal is perfectly periodic, we will also have a maximum at every full period. That's every 100 samples, or \(l = 100\).
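
To make this concrete without the plots, here is a minimal sketch (assuming NumPy is available) that samples the 10Hz sine wave at 1000Hz and checks the normalized autocorrelation at a few lags:

''' Sine wave autocorrelation sketch (illustration only) '''

import numpy

fs = 1000                                  # Sampling rate in Hz
f0 = 10                                    # Sine wave frequency in Hz
t = numpy.arange(0, 1, 1 / fs)             # One second of samples
wave = numpy.sin(2 * numpy.pi * f0 * t)

# Normalized autocorrelation via numpy.correlate
w = wave - numpy.mean(wave)
acorr = numpy.correlate(w, w, 'full')[len(w) - 1:]
acorr = acorr / (numpy.var(wave) * len(wave))

print(acorr[0])    # 1.0 at lag 0
print(acorr[100])  # Close to 1: one full period (100 samples) later
print(acorr[50])   # Close to -1: half a period later the wave is inverted

The magnitudes at lags 50 and 100 come out slightly below 1 because the signal is finite.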

In the following implementations, we consider the normalized autocorrelation function, whose values fall within \([-1, 1]\). This makes comparing the outputs easier. There are different normalization conventions depending on the application and, to some extent, on the definition used.
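
Concretely, the quantity all four implementations below compute is equivalent to \[\hat{R}(l) = \frac{1}{N\,\sigma^2} \sum_{n=l}^{N-1} \bigl(f(n) - \mu\bigr)\bigl(f(n-l) - \mu\bigr),\] where \(\mu\) and \(\sigma^2\) are the mean and variance of the data and \(N\) is its length, so that \(\hat{R}(0) = 1\).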

Data set and number of lags to calculate

Before going into the methods of calculating autocorrelation, we need some data. You can find below the data set that we use in our examples. The data consists of a list of random integers. It could be anything really, but here we did not want to give the data any specific properties. Thus, we made it random.

In addition to a data set, we need to know how many lag points we are interested in calculating. In this case, we've chosen 10 points. We could have calculated the autocorrelation for all possible lags (as many as there are data points). However, for some of the methods shown here, the computational cost grows with the number of lags. We want to highlight this by choosing only a subset of lags to consider.

# Our data set 
data = [3, 16, 156, 47, 246, 176, 233, 140, 130, 
        101, 166, 201, 200, 116, 118, 247, 
        209, 52, 153, 232, 128, 27, 192, 168, 208, 
        187, 228, 86, 30, 151, 18, 254, 
        76, 112, 67, 244, 179, 150, 89, 49, 83, 147, 90, 
        33, 6, 158, 80, 35, 186, 127]

# Delay (lag) range that we are interested in
lags = range(10)

4 Ways of Calculating Autocorrelation

Now, to the point. Here are four different ways of implementing the autocorrelation function. One is a vanilla Python implementation without any external dependencies. The other three use common statistics/mathematics libraries.

Python only

This is a Python-only implementation for calculating the autocorrelation, without any external dependencies.

''' Python only implementation '''

# Pre-allocate autocorrelation table
acorr = len(lags) * [0]

# Mean
mean = sum(data) / len(data) 

# Variance
var = sum([(x - mean)**2 for x in data]) / len(data) 

# Normalized data
ndata = [x - mean for x in data]


# Go through lag components one-by-one
for l in lags:
    c = 1 # Self correlation
    
    if (l > 0):
        tmp = [ndata[l:][i] * ndata[:-l][i] 
               for i in range(len(data) - l)]
        
        c = sum(tmp) / len(data) / var
        
    acorr[l] = c

Output with our test data set

[ 1. , 0.07326561, 0.01341434, -0.03866088, 0.13064865,
  -0.05907283, -0.00449197,  0.08829021, -0.05690311,  
  0.03172606]

Statsmodels library

Statsmodels is a great library for statistics, and it provides a simple interface for computing the autocorrelation. This is the simplest method to use: you just provide the data and the number of lags you need. Note that the lag-0 value is included in the returned array, which is why we pass nlags = len(lags) - 1 below.

''' Statsmodels '''

import statsmodels.api as sm

acorr = sm.tsa.acf(data, nlags = len(lags)-1)

Output with our test data set

[ 1. , 0.07326561, 0.01341434, -0.03866088, 0.13064865,
  -0.05907283, -0.00449197, 0.08829021, -0.05690311, 
  0.03172606]

Numpy.correlate

NumPy provides a simple function for calculating correlation. Since autocorrelation is basically the correlation of a data set with itself, it is no surprise that numpy.correlate can be used to calculate the autocorrelation.

''' numpy.correlate '''

import numpy

x = numpy.array(data)

# Mean
mean = numpy.mean(x)

# Variance
var = numpy.var(x)

# Normalized data
ndata = x - mean

# Full correlation; keep only the non-negative lags
acorr = numpy.correlate(ndata, ndata, 'full')[len(ndata)-1:]
acorr = acorr / var / len(ndata)

# Keep only the lags we are interested in
acorr = acorr[:len(lags)]

Output with our test data set

[ 1. , 0.07326561, 0.01341434, -0.03866088, 0.13064865,
  -0.05907283, -0.00449197,  0.08829021, -0.05690311,  
  0.03172606]

Fourier Transform

It turns out that the autocorrelation is the real part of the inverse Fourier transform of the power spectrum. This follows from the Wiener–Khinchin theorem. We can exploit this and write the following simple algorithm:

  1. Take the Fourier transform of our data set.
  2. Calculate the corresponding power spectrum.  
  3. Take the inverse Fourier transform of the power spectrum, and you get the autocorrelation.

''' Fourier transform implementation '''

import numpy

# Nearest size with power of 2
size = 2 ** numpy.ceil(numpy.log2(2*len(data) - 1)).astype('int')

# Variance
var = numpy.var(data)

# Normalized data
ndata = numpy.asarray(data) - numpy.mean(data)

# Compute the FFT
fft = numpy.fft.fft(ndata, size)

# Get the power spectrum
pwr = numpy.abs(fft) ** 2

# Calculate the autocorrelation from inverse FFT of the power spectrum
acorr = numpy.fft.ifft(pwr).real / var / len(data)

# Keep only the lags we are interested in
acorr = acorr[:len(lags)]

Output with our test data set

[ 1. , 0.07326561, 0.01341434, -0.03866088, 0.13064865,
  -0.05907283, -0.00449197, 0.08829021, -0.05690311, 
  0.03172606]

Summary

As expected, all four methods produce the same output. Which approach is the most convenient is up to you. A detailed performance comparison is out of the scope of this post, but note that the methods scale differently as the data set grows: computing all lags directly costs roughly \(O(N^2)\) operations, whereas the FFT-based approach costs \(O(N \log N)\).

Further reading