The Pearson correlation coefficient, also known as Pearson’s r, is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to +1: +1 signifies a perfect positive linear correlation, -1 a perfect negative linear correlation, and 0 no linear correlation.

The coefficient is a normalized measure of covariance: it is calculated by dividing the covariance of the two variables by the product of their standard deviations. It captures only linear association, so two variables can be strongly related in a nonlinear way and still have a coefficient near 0.
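Written out for n paired observations, the definition above takes the following form. The symbols are introduced here for illustration: x_i and y_i are the paired values, and x̄ and ȳ are the sample means.

```latex
r = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The 1/n factors in the covariance and in the two standard deviations cancel, which is why only sums appear in the second form.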

The Pearson correlation coefficient is widely used in statistics to assess the relationship between two quantitative variables. Despite its limitations, it remains a crucial tool in statistical analysis because it provides a clear, quantifiable measure of linear relationships.
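To make the "linear only" limitation concrete, here is a minimal sketch using made-up integer data. The helper name `pearson_r` is an assumption for this example. One series is an exact linear function of x, the other an exact quadratic function of x, so both are fully determined by x, yet only the linear one yields a coefficient of 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / sqrt(x_var * y_var)


xs = list(range(-5, 6))            # -5, -4, ..., 5 (symmetric around zero)
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
parabola = [x * x for x in xs]     # fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)

print(f"linear:   {r_linear:.3f}")
print(f"parabola: {r_parabola:.3f}")
```

The parabola case comes out at zero because the x values are symmetric around zero, so the positive and negative halves cancel. A coefficient near 0 therefore rules out a linear relationship, not a relationship in general.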

Intuition for correlation

Correlation tells us to what extent two variables move together.

One way correlation is usually presented is as a scatter plot. You take two variables, and for each data point, you plot the value of one variable on the x-axis and the value of the other variable on the y-axis.

If there is some correlation, it is usually visible to the naked eye.

As an example, let’s look at the relationship between the prices of two cryptocurrencies, Bitcoin and Ethereum. We’ll use the daily closing price for each currency over 30 days.

[Figure: scatter plot of Bitcoin and Ethereum daily closing prices]

Hopefully, you can see that there is a positive correlation between the two currencies. When the price of Bitcoin goes up, the price of Ethereum tends to go up as well. When the price of Bitcoin goes down, the price of Ethereum tends to go down as well.

These visualizations often include a trend line drawn through the data points. This line is called the line of best fit. It represents the best linear approximation of the relationship between the two variables.

Here’s what that line looks like for our data.

[Figure: the same scatter plot with the line of best fit drawn through it]

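The slope of this line is closely tied to the correlation between the two variables. It always has the same sign as r, and if both variables are standardized (zero mean, unit standard deviation), the slope equals r exactly. On the raw data, the slope also depends on the units of each variable: it is r multiplied by the ratio of the standard deviation of y to that of x.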

If there is a positive correlation, the line will have a positive slope and it will be going up from left to right. If there is a negative correlation, the line will have a negative slope and it will be going down from left to right.

If there is no correlation, the line will be more or less flat.
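To see how the slope of the best-fit line relates to the correlation coefficient, the sketch below computes both for a small set of made-up numbers (not real prices). The helper names `slope_and_r` and `standardize` are assumptions for this example. It then repeats the calculation after rescaling y, and again after standardizing both variables.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return (least-squares slope, Pearson r) for paired samples."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]


# Made-up illustrative numbers, not real prices.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [12.0, 19.0, 33.0, 38.0, 52.0, 58.0]

slope, r = slope_and_r(xs, ys)
scaled_slope, scaled_r = slope_and_r(xs, [y * 100 for y in ys])
std_slope, std_r = slope_and_r(standardize(xs), standardize(ys))

print(f"raw:          slope={slope:.3f}  r={r:.3f}")
print(f"y * 100:      slope={scaled_slope:.3f}  r={scaled_r:.3f}")
print(f"standardized: slope={std_slope:.3f}  r={std_r:.3f}")
```

Multiplying y by 100 multiplies the slope by 100 but leaves r unchanged, and after standardizing both variables the slope and r coincide. The sign of the slope is what carries over directly to the raw data.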

Visualizing the lack of correlation

For most related variables, you will find that there is some correlation. So you might be used to seeing scatter plots that have a clear trend line.

In order to see clearly what a lack of correlation looks like, let’s look at some data that is not correlated.

Here’s the price of Ethereum against random numbers. There should be no meaningful correlation between these two variables, because the random numbers are generated independently of the price.

[Figure: scatter plot of Ethereum price against random numbers, with trend line]

As you can see, the trend line is pretty much flat. A slope this close to zero indicates that there is little or no linear correlation between the two variables.

This is a stark contrast to the previous example, where the trend line was basically a diagonal line going up from left to right.
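The same contrast can be reproduced numerically. The sketch below uses purely synthetic data and a fixed seed so that it is repeatable. The `pearson_r` helper and the variable names are assumptions for this example. It compares a series against independent noise, and against a noisy copy of itself.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / sqrt(x_var * y_var)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 2000

signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]             # independent of signal
related = [s + 0.5 * rng.gauss(0, 1) for s in signal]   # signal plus some noise

r_noise = pearson_r(signal, noise)
r_related = pearson_r(signal, related)

print(f"signal vs independent noise: {r_noise:.3f}")
print(f"signal vs noisy copy:        {r_related:.3f}")
```

With independent inputs the coefficient will rarely be exactly zero. It lands close to zero, and it gets closer as the sample grows.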

Calculating the Pearson Correlation

In [7]:
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearsons_r(x, y):
    x_mean = mean(x)
    y_mean = mean(y)

    # Deviation of every value from its own mean
    x_diff = [i - x_mean for i in x]
    y_diff = [i - y_mean for i in y]

    # Numerator: sum of the products of paired deviations
    diff_products = [a * b for a, b in zip(x_diff, y_diff)]
    diff_products_sum = sum(diff_products)

    # Denominator: product of the square roots of the summed squared deviations
    x_diff_squared_sum = sum(i * i for i in x_diff)
    y_diff_squared_sum = sum(i * i for i in y_diff)

    x_diff_squared_sum_sqrt = sqrt(x_diff_squared_sum)
    y_diff_squared_sum_sqrt = sqrt(y_diff_squared_sum)

    return diff_products_sum / (x_diff_squared_sum_sqrt * y_diff_squared_sum_sqrt)
In [8]:
r = pearsons_r(X, Y)
print(f"r: {r}")
Out:
r: 0.9303664907264924
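One edge case is worth keeping in mind. If either input is constant, all of its deviations from the mean are zero, so the denominator is zero and the division in `pearsons_r` fails. Below is a hedged sketch of a variant that reports the coefficient as undefined in that case. The name `safe_pearsons_r` and the choice to return `None` are assumptions for illustration, not part of the original function. The same sketch doubles as a sanity check on inputs whose correlation is known in advance.

```python
from math import sqrt


def safe_pearsons_r(x, y):
    """Pearson's r, or None when it is undefined (hypothetical variant)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        return None  # a single point has no spread to correlate

    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    x_diff = [v - x_mean for v in x]
    y_diff = [v - y_mean for v in y]

    denominator = sqrt(sum(d * d for d in x_diff)) * sqrt(sum(d * d for d in y_diff))
    if denominator == 0:
        return None  # at least one series is constant, so r is undefined

    return sum(a * b for a, b in zip(x_diff, y_diff)) / denominator


rising = [1.0, 2.0, 3.0, 4.0]

r_aligned = safe_pearsons_r(rising, [10.0, 20.0, 30.0, 40.0])  # moves with x
r_opposed = safe_pearsons_r(rising, [4.0, 3.0, 2.0, 1.0])      # moves against x
r_constant = safe_pearsons_r(rising, [5.0, 5.0, 5.0, 5.0])     # does not move

print(r_aligned, r_opposed, r_constant)
```

Returning `None` forces the caller to decide how to treat an undefined value. Raising an exception or returning `float("nan")` would be equally reasonable choices.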

Rolling Correlation

In [9]:
import requests
import datetime

URL = "https://api.binance.com/api/v3/klines"

def symbol_historical(symbol: str) -> list[float]:
    today = datetime.datetime.today()
    start_date = today - datetime.timedelta(days=1000)
    start_date = int(start_date.timestamp() * 1000)

    params = {
        "symbol": symbol,
        "interval": "1d",
        "startTime": start_date,
        "limit": 1000
    }

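    r = requests.get(URL, params=params, timeout=10)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    data = r.json()

    # Index 4 of each kline row is the closing price
    return [float(i[4]) for i in data]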
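The index `4` above relies on the layout of each kline row. Assuming each row starts with the open time in milliseconds, followed by open, high, low, close and volume as strings, the following sketch parses rows without touching the network. The rows are placeholders with invented prices, and `parse_klines` is a hypothetical helper name. Keeping the timestamp alongside the closing price means the date of each row does not have to be reconstructed later.

```python
import datetime

# Assumed positions within one kline row.
OPEN_TIME_INDEX = 0
CLOSE_INDEX = 4


def parse_klines(rows):
    """Turn raw kline rows into (UTC date, closing price) pairs."""
    parsed = []
    for row in rows:
        opened = datetime.datetime.fromtimestamp(
            row[OPEN_TIME_INDEX] / 1000, tz=datetime.timezone.utc
        )
        parsed.append((opened.date(), float(row[CLOSE_INDEX])))
    return parsed


# Placeholder rows with invented prices, truncated to the first six fields.
sample_rows = [
    [1672531200000, "100.0", "110.0", "95.0", "105.5", "1234.0"],
    [1672617600000, "105.5", "120.0", "101.0", "118.25", "2345.0"],
]

for day, close in parse_klines(sample_rows):
    print(day, close)
```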
In [10]:
ETH = symbol_historical("ETHUSDT")
BTC = symbol_historical("BTCUSDT")
In [11]:
from collections import deque

import pandas as pd

WINDOW_SIZE = 50
x_window = deque(maxlen=WINDOW_SIZE)
y_window = deque(maxlen=WINDOW_SIZE)

dates = []
correlations = []

for i, (x, y) in enumerate(zip(ETH, BTC)):
    x_window.append(x)
    y_window.append(y)

    if len(x_window) < WINDOW_SIZE:
        continue

    # Approximate the date of row i by counting days from the request's start date
    date = datetime.datetime.today() - datetime.timedelta(days=1000) + datetime.timedelta(days=i)

    r = pearsons_r(x_window, y_window)

    dates.append(date)
    correlations.append(r)

df = pd.DataFrame({"date": dates, "correlation": correlations})
df = df.set_index("date")

df.plot(title="ETHUSDT vs BTCUSDT correlation");
[Figure: line plot titled "ETHUSDT vs BTCUSDT correlation", showing the 50-day rolling correlation over time]

What are those dips? They might correspond to large events that affected the two markets differently. They could be worth investigating as an input to a trading strategy, although a dip on its own does not establish a usable signal.
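As a starting point for that kind of investigation, here is a minimal sketch that wraps the rolling calculation in a reusable generator and lists the positions where the rolling coefficient falls below a threshold. The function names, the threshold of 0.3 and the window of 5 are assumptions for this example, and the data is synthetic: the second series tracks the first except for a short stretch where it moves the opposite way.

```python
from collections import deque
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / sqrt(x_var * y_var)


def rolling_correlation(xs, ys, window):
    """Yield (index, r) for every full window ending at that index."""
    x_win = deque(maxlen=window)
    y_win = deque(maxlen=window)
    for i, (x, y) in enumerate(zip(xs, ys)):
        x_win.append(x)
        y_win.append(y)
        if len(x_win) == window:
            yield i, pearson_r(x_win, y_win)


def find_dips(rolling, threshold):
    """Return the indices whose rolling correlation is below the threshold."""
    return [i for i, r in rolling if r < threshold]


# Synthetic series: b follows a, except for indices 20..24 where it falls while a rises.
a = [float(i) for i in range(40)]
b = [2.0 * v for v in a]
for i in range(20, 25):
    b[i] = b[19] - 2.0 * (i - 19)

dips = find_dips(rolling_correlation(a, b, window=5), threshold=0.3)
print(dips)
```

The flagged positions lag the start of the divergence slightly, because a window has to fill with diverging values before its coefficient drops. A larger window smooths the curve but reacts more slowly.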