How to Do Cross-Correlation in Python: 4 Different Methods

Update 22.6.2022: Fixed data set generation.

Cross-correlation is a basic signal processing method used to analyze the similarity between two signals at different lags. Not only do you get an idea of how well the two signals match, but you also get the point in time, or index, where they are most similar.

Whenever you need to find similarities between two signals, datasets, or functions, cross-correlation is one of the tools that you should try.

Below you can see an illustration of the cross-correlation between sine and cosine functions. Unsurprisingly, the maximum occurs when the phase difference (lag) between the functions is \(\frac{3\pi}{2}\), which is the delay that makes the two signals overlap.

Cross-correlation between sine & cosine function.

Cross-correlation has tons of applications. Investors use it to check how two stocks or assets perform against each other.  In time series analysis, it can be used to find the time delays between two series.  
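
As a rough illustration of the time-delay use case, the sketch below shifts a random signal by a known number of samples and recovers that shift from the peak of the cross-correlation. The signals, the random seed, and the names `x`, `y`, and `true_delay` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical test data: white noise and a delayed copy of it
rng = np.random.default_rng(0)
x = rng.standard_normal(200)

true_delay = 7
# y[n] = x[n - true_delay], with zeros filling the first samples
y = np.concatenate([np.zeros(true_delay), x[:-true_delay]])

# Full cross-correlation covers every possible overlap of the two signals
corr = np.correlate(y, x, mode="full")

# Lag that belongs to each element of corr in "full" mode
lags = np.arange(-(len(x) - 1), len(y))

# The lag with the largest correlation is the delay estimate
estimated_delay = int(lags[np.argmax(corr)])
print(estimated_delay)
```

With noisy or periodic real-world data the peak can be less distinct, so treat the estimate as a starting point rather than a guaranteed answer.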

Cross-correlation definition

Cross-correlation is the correlation between two signals at different delays (lags).

The definition is quite simple: you overlap the two signals with a given delay, multiply them element by element, and sum the products. We can write this for real-valued discrete signals as

\[R_{fg}(l) = \sum_{n=0}^N  f(n)g(n + l)\]

It is also called the sliding inner product, because, for a given delay, it is basically an inner product of the two signals.
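
To make the sliding inner product concrete, here is a minimal sketch that evaluates the sum above one lag at a time. The two short arrays and the helper name `xcorr_at_lag` are made up for illustration, and only lags where the shorter signal fits entirely inside the longer one are considered.

```python
import numpy as np

f = np.array([0.0, 1.0, 0.5])        # shorter signal
g = np.array([1.0, 2.0, 3.0, 4.0])   # longer signal

def xcorr_at_lag(f, g, lag):
    """Inner product of f with the part of g that starts at index `lag`."""
    return float(np.dot(f, g[lag:lag + len(f)]))

# Slide f along g and collect one inner product per lag
lags = range(len(g) - len(f) + 1)
sliding = [xcorr_at_lag(f, g, lag) for lag in lags]

print(sliding)  # [3.5, 5.0]
```

For lag 0 the products are 0·1 + 1·2 + 0.5·3 = 3.5, and for lag 1 they are 0·2 + 1·3 + 0.5·4 = 5.0.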

Definitions for complex, continuous, and random signals can be found, e.g., in Wikipedia.

https://en.wikipedia.org/wiki/cross-correlation

Note that autocorrelation can be viewed as a special case of cross-correlation, where the cross-correlation is taken with respect to the signal itself.
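
As a small sketch of this special case, the snippet below correlates a signal with itself using NumPy. The three-sample array is an arbitrary example; the result is symmetric around zero lag, where it equals the sum of the squared samples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Autocorrelation: cross-correlation of x with itself over all lags
auto = np.correlate(x, x, mode="full")

print(auto)  # [ 3.  8. 14.  8.  3.]
```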

If you are looking for efficient packages to compute autocorrelation, check our post for 4 Ways of Calculating Autocorrelation in Python.

Data set and number of lags to calculate

Before going into the methods of calculating cross-correlation, we need to have some data. You can find below the data set that we are considering in our examples. The data set consists of two sinusoidal signals with a \(\frac{\pi}{4}\) phase difference, where the second signal is half the length of the first.

import numpy as np

# First signal 
sig1 = np.sin(np.r_[-1:1:0.1])

# Second signal with pi/4 phase shift. Half the size of sig1
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi/4)

Cross-correlation: 3 essential packages + pure Python implementation

Our brief introduction to cross-correlation is done and we are ready for the code. Here are three essential packages from math, signal processing, and statistics disciplines to calculate cross-correlation. As a bonus, we've thrown in a pure Python implementation without any external dependencies.  

Python only

This is a Python-only method without any external dependencies for calculating the cross-correlation.  

''' Python only implementation '''

# Pre-allocate correlation array
corr = (len(sig1) - len(sig2) + 1) * [0]

# Go through lag components one-by-one
for l in range(len(corr)):
    corr[l] = sum([sig1[i+l] * sig2[i] for i in range(len(sig2))])

print(corr)

Output with our test data set

[-0.471998494510103, -0.24686753498102817, -0.019269956645980538, 0.20852016072607304, 0.4342268135797527, 0.6555948156484444, 0.8704123310300105, 1.076532974119988, 1.271897255587048, 1.4545531601096169, 1.62267565026772]
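
The loop above still uses the NumPy arrays from the data set section as input. If you want no third-party imports at all, one possible variant is sketched below: it rebuilds the same test signals with the standard `math` module and wraps the loop in a function. The function name `cross_correlate` and its argument names are our own choices for this example.

```python
import math

def cross_correlate(long_sig, short_sig):
    """Cross-correlation for all lags where short_sig fits inside long_sig."""
    n_lags = len(long_sig) - len(short_sig) + 1
    return [
        sum(long_sig[i + lag] * short_sig[i] for i in range(len(short_sig)))
        for lag in range(n_lags)
    ]

# Same test signals as before, generated with plain Python lists
sig1 = [math.sin(-1 + 0.1 * k) for k in range(20)]
sig2 = [math.sin(-1 + 0.1 * k + math.pi / 4) for k in range(10)]

corr = cross_correlate(sig1, sig2)
print(corr)
```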

NumPy

NumPy is the de facto numerical computation package for Python. It comes as no surprise that NumPy comes with a built-in method for cross-correlation.

''' Numpy implementation '''

import numpy as np

corr = np.correlate(a=sig1, v=sig2)

print(corr)

Output with our test data set

[-0.47199849 -0.24686753 -0.01926996  0.20852016  0.43422681  0.65559482
  0.87041233  1.07653297  1.27189726  1.45455316  1.62267565]
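
The call above relies on the default `mode="valid"` of `np.correlate`, which only returns lags where the shorter signal fully overlaps the longer one. The sketch below compares the output lengths of the three available modes on our test signals; the dictionary name `lengths` is just for this example.

```python
import numpy as np

sig1 = np.sin(np.r_[-1:1:0.1])
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi/4)

# Output length of np.correlate for each mode
lengths = {
    mode: len(np.correlate(sig1, sig2, mode=mode))
    for mode in ("valid", "same", "full")
}

# valid: len(sig1) - len(sig2) + 1
# same:  max(len(sig1), len(sig2))
# full:  len(sig1) + len(sig2) - 1
print(lengths)
```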

SciPy

When NumPy falls short, SciPy is most often the package to look at. It contains helpful methods for various fields of science and engineering. For cross-correlation, we need to import the signal processing package. By default (mode="full"), SciPy's cross-correlation zero-pads the signal at the beginning and end, which is why it returns a longer response than our pure Python implementation and the NumPy version. In our test case, we remove these padded components to make the results comparable.

''' SciPy implementation '''

import scipy.signal  

corr = scipy.signal.correlate(sig1, sig2)

# Keep only the lags where sig2 fully overlaps sig1 (drops the padded parts)
corr = corr[len(sig2) - 1:len(sig1)]

print(corr)

Output with our test data set

[-0.47199849 -0.24686753 -0.01926996  0.20852016  0.43422681  0.65559482
  0.87041233  1.07653297  1.27189726  1.45455316  1.62267565]
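
As an alternative to slicing by hand, SciPy can be asked for the non-padded part directly with `mode="valid"`, and `scipy.signal.correlation_lags` (available in SciPy 1.6 and later) returns the lag that belongs to each output element. This is a sketch of that approach on the same test signals, not a replacement for the example above.

```python
import numpy as np
import scipy.signal

sig1 = np.sin(np.r_[-1:1:0.1])
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi/4)

# Only lags where sig2 fits entirely inside sig1, so no padding to remove
corr_valid = scipy.signal.correlate(sig1, sig2, mode="valid")

# Lag index for each element of corr_valid
lags = scipy.signal.correlation_lags(len(sig1), len(sig2), mode="valid")

# Lag at which the two signals are most similar
best_lag = int(lags[np.argmax(corr_valid)])

print(corr_valid)
print(best_lag)
```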

Statsmodels

Statsmodels is a really helpful package for those working with statistics. Here, it must be kept in mind that in statistics cross-correlation always includes normalization, which ensures that the correlation is within \([-1,1]\).  

To this end, we first show you how to do the normalization for the NumPy example and then compare the results.

Basically, the normalization involves moving the mean of each signal to 0 and dividing the result by the standard deviations of both signals and the length of the shorter signal.

''' Numpy implementation (normalized)'''

import numpy as np

nsig1 = sig1 - np.mean(sig1) # Demean
nsig2 = sig2 - np.mean(sig2) # Demean

corr = np.correlate(a=nsig1, v=nsig2)
corr /= (len(sig2) * np.std(sig1) * np.std(sig2)) # Normalization

print(corr)

Output with our test data set

[0.45714893 0.48309157 0.50420731 0.52028518 0.53116453 0.53673667
 0.5369459  0.53179015 0.52132093 0.50564285 0.48491254] 
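
One way to sanity-check this normalization is to apply it to two signals of equal length, where only a single lag is returned and the result should coincide with the Pearson correlation coefficient. The sketch below does that with `np.corrcoef`; the helper name `normalized_xcorr_zero_lag` and the test signals `a` and `b` are assumptions for this example.

```python
import numpy as np

def normalized_xcorr_zero_lag(a, b):
    """Normalized cross-correlation of two equal-length signals at lag 0."""
    a0 = a - np.mean(a)  # Demean
    b0 = b - np.mean(b)  # Demean
    # Equal lengths in the default "valid" mode give a single value
    return float(np.correlate(a0, b0)[0] / (len(a) * np.std(a) * np.std(b)))

t = np.r_[0:2*np.pi:0.1]
a = np.sin(t)
b = np.sin(t + np.pi/4)

r = normalized_xcorr_zero_lag(a, b)
pearson = float(np.corrcoef(a, b)[0, 1])

print(r, pearson)
```

Both numbers should agree up to floating-point rounding and lie within \([-1,1]\).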

Now, let's have a look at what the Statsmodels package provides.

''' Statsmodels implementation '''

import statsmodels.api as sm

corr = sm.tsa.stattools.ccf(sig2, sig1, adjusted=False)

# Remove padding and reverse the order
corr = corr[0:(len(sig1) - len(sig2) + 1)][::-1]

print(corr)

Output with our test data set

[0.45714893 0.48309157 0.50420731 0.52028518 0.53116453 0.53673667
 0.5369459  0.53179015 0.52132093 0.50564285 0.48491254]

Notice that, similarly to the SciPy implementation, we needed to remove the padding. Also, Statsmodels provides the cross-correlation response in reversed order with respect to the other schemes, which is why we needed to flip the result.

Summary

As usual, it is up to you to decide which implementation suits you best. If performance is not an issue, go with the package that you are already using. For performance, NumPy is usually a safe bet, although we have not done any performance comparison here. For the fewest dependencies, go with NumPy or the Python-only implementation.
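
If performance matters for your use case, you can measure it on your own data. The following is a minimal sketch using the standard `timeit` module to compare the Python-only loop with NumPy on our small test signals. The function names, the `timings` dictionary, and the repetition count are arbitrary choices, and the numbers will differ between machines and signal lengths.

```python
import timeit
import numpy as np

sig1 = np.sin(np.r_[-1:1:0.1])
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi/4)

def pure_python():
    # Same sliding sum as the Python-only implementation
    n = len(sig2)
    return [
        sum(sig1[i + lag] * sig2[i] for i in range(n))
        for lag in range(len(sig1) - n + 1)
    ]

def with_numpy():
    return np.correlate(sig1, sig2)

repeats = 1000
timings = {}
for name, fn in (("pure Python", pure_python), ("NumPy", with_numpy)):
    # Total time for `repeats` calls, converted to microseconds per call
    timings[name] = timeit.timeit(fn, number=repeats) / repeats * 1e6
    print(f"{name}: {timings[name]:.1f} microseconds per call")
```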

Further reading