How to Calculate Correlation using NumPy


Correlation is a statistical measure of how strongly two variables are related. In data science and machine learning, correlation analysis is a common first step toward understanding the relationships between the variables in a dataset. This article walks through calculating correlation with NumPy, a widely used numerical library for Python.

Understanding Correlation

Correlation measures the degree to which two variables move in relation to each other. It is expressed as a value between -1 and 1, known as the correlation coefficient. A coefficient close to 1 signifies a strong positive correlation, while a coefficient close to -1 signifies a strong negative correlation. A value near 0 indicates little or no linear relationship.
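To make these three cases concrete, here is a minimal sketch using np.corrcoef(), which is covered later in this article. The arrays are made-up illustrative values, chosen so that the relationships are exact.

```python
import numpy as np

# Illustrative data only; these values are assumptions, not from a real dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises in lockstep with x: perfect positive correlation
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls as x rises: perfect negative correlation
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# y is symmetric around the mean of x, so there is no linear trend
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```

The last case shows why "near 0" only rules out a linear relationship: y depends entirely on x there, yet the coefficient is zero.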

The correlation coefficient \( r \) for two variables \( \mathbf{x} \) and \( \mathbf{y} \) can be calculated using the formula:

\[
r = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[n\sum x_i^2 - (\sum x_i)^2\right]\left[n\sum y_i^2 - (\sum y_i)^2\right]}},
\]

where

  • \( n \) is the number of pairs of scores,
  • \( \sum x_iy_i \) is the sum of the products of paired scores,
  • \( \sum x_i \) and \( \sum y_i \) are the sums of the \( \mathbf{x} \) scores and \( \mathbf{y} \) scores respectively,
  • \( \sum x_i^2 \) and \( \sum y_i^2 \) are the sums of the squared \( \mathbf{x} \) scores and \( \mathbf{y} \) scores respectively.
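The formula can be translated into NumPy almost term by term. The sketch below is one way to do it, with illustrative data and variable names (sum_xy, r_manual, and so on) that are assumptions made for this example. It then compares the result with np.corrcoef() as a sanity check.

```python
import numpy as np

# Illustrative data; the values are assumptions chosen for this example
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

n = len(x)                       # number of pairs
sum_x, sum_y = x.sum(), y.sum()  # sums of the x and y scores
sum_xy = (x * y).sum()           # sum of the products of paired scores
sum_x2 = (x ** 2).sum()          # sum of the squared x scores
sum_y2 = (y ** 2).sum()          # sum of the squared y scores

numerator = n * sum_xy - sum_x * sum_y
denominator = np.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_manual = numerator / denominator

print(r_manual)                  # hand-computed coefficient
print(np.corrcoef(x, y)[0, 1])   # NumPy's result for comparison
```

For this data the numerator is 30 and the denominator is the square root of 1500, so both lines should print approximately 0.7746.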

Ways of Calculating Correlation using NumPy

NumPy provides several ways to calculate the correlation between two or more variables in a dataset:

np.corrcoef()

This function returns the correlation matrix of the variables. The correlation matrix is a two-dimensional array with correlation coefficients. The diagonal of the matrix always consists of 1s as these represent the correlation of a variable with itself. The rest of the elements in the matrix represent the correlation coefficients of each pair of variables. Example:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
correlation_matrix = np.corrcoef(x, y)
print(correlation_matrix)
Calculating correlation coefficients with NumPy
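np.corrcoef() also accepts a single two-dimensional array, which is convenient when there are more than two variables. The sketch below assumes three made-up variables stored as rows. By default (rowvar=True) each row is treated as a variable and each column as an observation; pass rowvar=False if your variables are in columns instead.

```python
import numpy as np

# Illustrative data: each row is one variable, each column one observation
data = np.array([
    [1, 2, 3, 4, 5],    # variable a
    [2, 4, 5, 4, 5],    # variable b
    [10, 8, 6, 4, 2],   # variable c, decreasing linearly as a increases
])

corr = np.corrcoef(data)
print(corr.shape)   # one row and one column per variable
print(corr)

# The same data with variables in columns needs rowvar=False
corr_cols = np.corrcoef(data.T, rowvar=False)
print(np.allclose(corr, corr_cols))
```

The resulting matrix is symmetric, so corr[i, j] and corr[j, i] hold the same coefficient for variables i and j.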

np.cov()

This function computes the covariance matrix of the variables. Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. Covariance gives you the direction of the relationship between two variables, but its magnitude depends on the units of the data, so it does not tell you the strength of the relationship the way correlation does. However, the correlation coefficient can be calculated from the covariance by dividing it by the product of the two standard deviations. Example:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
covariance_matrix = np.cov(x, y)
print(covariance_matrix)
Calculating covariance with NumPy
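As a rough sketch of how the correlation coefficient can be recovered from the covariance matrix: the off-diagonal entry is the covariance of x and y, and the diagonal entries are their variances, so dividing the covariance by the square root of the product of the variances gives r. The data and the name r_from_cov are illustrative assumptions.

```python
import numpy as np

# Illustrative data; the values are assumptions chosen for this example
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov = np.cov(x, y)

# cov[0, 1] is cov(x, y); cov[0, 0] and cov[1, 1] are var(x) and var(y)
r_from_cov = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

print(r_from_cov)
print(np.corrcoef(x, y)[0, 1])  # should agree up to floating-point rounding
```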

In addition to these, Python’s scientific computing library SciPy provides the function scipy.stats.pearsonr() that calculates the Pearson correlation coefficient and the p-value for testing non-correlation. It’s a more advanced function that you might use if you need more than what np.corrcoef() provides.
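A minimal sketch of scipy.stats.pearsonr() follows, assuming SciPy is installed alongside NumPy. The data is made up for illustration. The result can be unpacked into the coefficient and the p-value.

```python
import numpy as np
from scipy import stats

# Illustrative data; the values are assumptions chosen for this example
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 3, 6, 5, 8, 7])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(x, y)

print(r)        # should match np.corrcoef(x, y)[0, 1]
print(p_value)  # small values suggest the correlation is unlikely to be chance
```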

NumPy itself is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

Breakdown of Calculating Correlation with NumPy

In Python, we use the NumPy library to make the process of calculating correlation simpler. Let's consider two arrays x and y.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
Test dataset

We can compute the correlation using the np.corrcoef() function. This function returns a correlation matrix, which is a two-dimensional array with correlation coefficients.

correlation_matrix = np.corrcoef(x, y)
print(correlation_matrix)
Calculate and print the correlation matrix

The diagonal elements of the matrix are always 1, as they represent the correlation of a variable with itself. The off-diagonal elements hold the correlation coefficients between pairs of variables. In this case, correlation_matrix[0, 1] and correlation_matrix[1, 0] both give the correlation coefficient between x and y. Because y here is simply x + 1, the two arrays are perfectly linearly related and the coefficient is 1, up to floating-point rounding.

correlation_coefficient = correlation_matrix[0, 1]
print(correlation_coefficient)
Calculate and print correlation between variables x and y

Conclusion

Understanding how the variables in a dataset correlate can provide valuable insight, especially when the dataset is complex. NumPy's np.corrcoef() function makes calculating correlation coefficients straightforward, and np.cov() and scipy.stats.pearsonr() cover the cases where you need the covariance or a p-value as well. This functionality, combined with the rest of the Python ecosystem, is one of the reasons data science has become more accessible in recent years.