© Copyright Quantopian Inc.

© Modifications Copyright QuantRocket LLC

Licensed under the Creative Commons Attribution 4.0.

Disclaimer

By Delaney Mackenzie and Maxwell Margenot

Pairs trading is a classic example of a strategy based on mathematical analysis. The principle is as follows. Let's say you have a pair of securities X and Y that have some underlying economic link. An example might be two companies that manufacture the same product, or two companies in one supply chain. If we can model this economic link with a mathematical model, we can make trades on it. We'll start by constructing a toy example.

Before we proceed, note that the content in this lecture depends heavily on the Integration, Cointegration and Stationarity lecture in order to properly understand the mathematical basis for the methodology that we employ here. It is recommended that you go through that lecture before continuing this one.

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint
# just set the seed for the random number generator
np.random.seed(107)
import matplotlib.pyplot as plt
```

We model X's daily returns by drawing from a normal distribution. Then we perform a cumulative sum to get the value of X on each day.

In [2]:

```
X_returns = np.random.normal(0, 1, 100) # Generate the daily returns
# sum them and shift all the prices up into a reasonable range
X = pd.Series(np.cumsum(X_returns), name='X') + 50
X.plot();
```

Now we generate Y. Remember that Y is supposed to have a deep economic link to X, so the price of Y should vary pretty similarly. We model this by taking X, shifting it up and adding some random noise drawn from a normal distribution.

In [3]:

```
some_noise = np.random.normal(0, 1, 100)
Y = X + 5 + some_noise
Y.name = 'Y'
pd.concat([X, Y], axis=1).plot();
```

We've constructed an example of two cointegrated series. Cointegration is a more subtle relationship than correlation. If two time series are cointegrated, there is some linear combination between them that will vary around a mean. At all points in time, the combination between them is related to the same probability distribution.

For more details on how we formally define cointegration and how to understand it, please see the Integration, Cointegration, and Stationarity lecture.

We'll plot the difference between the two now so we can see how this looks.

In [4]:

```
(Y - X).plot() # Plot the spread
plt.axhline((Y - X).mean(), color='red', linestyle='--') # Add the mean
plt.xlabel('Time')
plt.legend(['Price Spread', 'Mean']);
```

That's an intuitive definition, but how do we test for this statistically? There is a convenient cointegration test that lives in `statsmodels.tsa.stattools`

. Let's say that our confidence level is $0.05$. We should see a p-value below our cutoff, as we've artifically created two series that are the textbook definition of cointegration.

In [5]:

```
# compute the p-value of the cointegration test
# will inform us as to whether the spread between the 2 timeseries is stationary
# around its mean
score, pvalue, _ = coint(X,Y, maxlag=1)
print(pvalue)
```

2.0503418653408677e-16

Correlation and cointegration, while theoretically similar, are not the same. To demonstrate this, we'll show examples of series that are correlated, but not cointegrated, and vice versa. To start let's check the correlation of the series we just generated.

In [6]:

```
X.corr(Y)
```

Out[6]:

0.9497090646385932

That's very high, as we would expect. But how would two series that are correlated but not cointegrated look?

A simple example is two series that just diverge.

In [7]:

```
X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)
X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')
pd.concat([X_diverging, Y_diverging], axis=1).plot();
```

In [8]:

```
print('Correlation: ' + str(X_diverging.corr(Y_diverging)))
score, pvalue, _ = coint(X_diverging,Y_diverging, maxlag=1)
print('Cointegration test p-value: ' + str(pvalue))
```

Correlation: 0.9931343801275687 Cointegration test p-value: 0.8815557674695217

A simple example of this case is a normally distributed series and a square wave.

In [9]:

```
Y2 = pd.Series(np.random.normal(0, 1, 1000), name='Y2') + 20
Y3 = Y2.copy()
```

In [10]:

```
# Y2 = Y2 + 10
Y3[0:100] = 30
Y3[100:200] = 10
Y3[200:300] = 30
Y3[300:400] = 10
Y3[400:500] = 30
Y3[500:600] = 10
Y3[600:700] = 30
Y3[700:800] = 10
Y3[800:900] = 30
Y3[900:1000] = 10
```

In [11]:

```
Y2.plot()
Y3.plot()
plt.ylim([0, 40]);
```

In [12]:

```
# correlation is nearly zero
print('Correlation: ' + str(Y2.corr(Y3)))
score, pvalue, _ = coint(Y2,Y3, maxlag=1)
print('Cointegration test p-value: ' + str(pvalue))
```

Correlation: -0.04130406958091662 Cointegration test p-value: 0.0

Sure enough, the correlation is incredibly low, but the p-value shows that these are cointegrated.

Because you'd like to protect yourself from bad markets, often times short sales will be used to hedge long investments. Because a short sale makes money if the security sold loses value, and a long purchase will make money if a security gains value, one can long parts of the market and short others. That way if the entire market falls off a cliff, we'll still make money on the shorted securities and hopefully break even. In the case of two securities we'll call it a hedged position when we are long on one security and short on the other.

Because the securities drift towards and apart from each other, there will be times when the distance is high and times when the distance is low. The trick of pairs trading comes from maintaining a hedged position across X and Y. If both securities go down, we neither make nor lose money, and likewise if both go up. We make money on the spread of the two reverting to the mean. In order to do this we'll watch for when X and Y are far apart, then short Y and long X. Similarly we'll watch for when they're close together, and long Y and short X.

This is when the spread is small and we expect it to become larger. We place a bet on this by longing Y and shorting X.

This is when the spread is large and we expect it to become smaller. We place a bet on this by shorting Y and longing X.

One important concept here is that we are placing a bet on one specific thing, and trying to reduce our bet's dependency on other factors such as the market.

The best way to do this is to start with securities you suspect may be cointegrated and perform a statistical test. If you just run statistical tests over all pairs, you'll fall prey to multiple comparison bias.

Here's a method to look through a list of securities and test for cointegration between all pairs. It returns a cointegration test score matrix, a p-value matrix, and any pairs for which the p-value was less than $0.05$.

The methods for finding viable pairs all live on a spectrum. At one end there is the formation of an economic hypothesis for an individual pair. You have some extra knowledge about an economic link that leads you to believe that the pair is cointegrated, so you go out and test for the presence of cointegration. In this case you will incur no multiple comparisons bias. At the other end of the spectrum, you perform a search through hundreds of different securities for any viable pairs according to your test. In this case you will incur a very large amount of multiple comparisons bias.

Multiple comparisons bias is the increased chance to incorrectly generate a significant p-value when many tests are run. If 100 tests are run on random data, we should expect to see 5 p-values below $0.05$ on expectation. Because we will perform $n(n-1)/2$ comparisons, we should expect to see many incorrectly significant p-values. For the sake of this example we will ignore this and continue. In practice a second verification step would be needed if looking for pairs this way. Another approach is to pick a small number of pairs you have reason to suspect might be cointegrated and test each individually. This will result in less exposure to multiple comparisons bias. You can read more about multiple comparisons bias here.

In [13]:

```
def find_cointegrated_pairs(data):
n = data.shape[1]
score_matrix = np.zeros((n, n))
pvalue_matrix = np.ones((n, n))
keys = data.keys()
pairs = []
for i in range(n):
for j in range(i+1, n):
S1 = data[keys[i]]
S2 = data[keys[j]]
result = coint(S1, S2, maxlag=1)
score = result[0]
pvalue = result[1]
score_matrix[i, j] = score
pvalue_matrix[i, j] = pvalue
if pvalue < 0.05:
pairs.append((keys[i], keys[j]))
return score_matrix, pvalue_matrix, pairs
```

We are looking through a set of solar company stocks to see if any of them are cointegrated. We'll start by defining the list of securities we want to look through. Then we'll get the pricing data for each security for the year of 2007.

Our approach here is somewhere in the middle of the spectrum that we mentioned before. We have formulated an economic hypothesis that there is some sort of link between a subset of securities within the energy sector and we want to test whether there are any cointegrated pairs. This incurs significantly less multiple comparisons bias than searching through hundreds of securities and slightly more than forming a hypothesis for an individual test.

NOTE: We include the market in our data. This is because the market drives the movement of so many securities that you often times might find two seemingly cointegrated securities, but in reality they are not cointegrated and just both conintegrated with the market. This is known as a confounding variable and it is important to check for market involvement in any relationship you find.

In [14]:

```
from quantrocket.master import get_securities
from quantrocket import get_prices
securities = get_securities(symbols=['CSIQ', 'ASTI', 'FCEL', 'FSLR', 'SPY'], vendors='usstock')
prices_df = get_prices(
'usstock-learn-1d',
data_frequency='daily',
sids=securities.index.tolist(),
fields=['Close'],
start_date='2007-01-03',
end_date='2008-01-01').loc['Close']
sids_to_symbols = securities.Symbol.to_dict()
prices_df = prices_df.rename(columns=sids_to_symbols)
```

Now we'll run our method on the list and see if any pairs are cointegrated.

In [15]:

```
# Heatmap to show the p-values of the cointegration test between each pair of
# stocks. Only show the value in the upper-diagonal of the heatmap
scores, pvalues, pairs = find_cointegrated_pairs(prices_df)
import seaborn
seaborn.heatmap(
pvalues,
xticklabels=prices_df.columns,
yticklabels=prices_df.columns,
cmap='RdYlGn_r',
mask = (pvalues >= 0.05))
print(pairs)
```

[('FCEL', 'ASTI')]

Looks like 'ASTI' and 'FCEL' are cointegrated. Let's take a look at the prices to make sure there's nothing weird going on.

In [16]:

```
S1 = prices_df['ASTI']
S2 = prices_df['FCEL']
```

In [17]:

```
score, pvalue, _ = coint(S1, S2, maxlag=1)
pvalue
```

Out[17]:

0.09396414782525292

Now we will plot the spread of the two series. In order to actually calculate the spread, we use a linear regression to get the coefficient for the linear combination to construct between our two securities, as shown in the Integration, Cointegration, and Stationarity lecture. Using a linear regression to estimate the coefficient is known as the Engle-Granger method.

In [18]:

```
S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['ASTI']
b = results.params['ASTI']
spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread']);
```

Alternatively, we could examine the ratio betwen the two series.

In [19]:

```
ratio = S1/S2
ratio.plot()
plt.axhline(ratio.mean(), color='black')
plt.legend(['Price Ratio']);
```

Examining the price ratio of a trading pair is a traditional way to handle pairs trading. Part of why this works as a signal is based in our assumptions of how stock prices move, specifically because stock prices are typically assumed to be log-normally distributed. What this implies is that by taking a ratio of the prices, we are taking a linear combination of the returns associated with them (since prices are just the exponentiated returns).

This can be a little irritating to deal with for our purposes as purchasing the precisely correct ratio of a trading pair may not be practical. We choose instead to move forward with simply calculating the spread between the cointegrated stocks using linear regression. This is a very simple way to handle the relationship, however, and is likely not feasible for non-toy examples. There are other potential methods for estimating the spread listed at the bottom of this lecture. If you want to get more into the theory of why having cointegrated stocks matters for pairs trading, again, please see the Integration, Cointegration, and Stationarity Lecture.

So, back to our example. The absolute spread isn't very useful in statistical terms. It is more helpful to normalize our signal by treating it as a z-score.

In practice this is usually done to try to give some scale to the data, but this assumes some underlying distribution, usually a normal distribution. Under a normal distribution, we would know that approximately 84% of all spread values will be smaller. However, much financial data is not normally distributed, and one must be very careful not to assume normality, nor any specific distribution when generating statistics. It could be the case that the true distribution of spreads was very fat-tailed and prone to extreme values. This could mess up our model and result in large losses.

In [20]:

```
def zscore(series):
return (series - series.mean()) / np.std(series)
```

In [21]:

```
zscore(spread).plot()
plt.axhline(zscore(spread).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Spread z-score', 'Mean', '+1', '-1']);
```

- Go "Long" the spread whenever the z-score is below -1.0
- Go "Short" the spread when the z-score is above 1.0
- Exit positions when the z-score approaches zero

This is just the tip of the iceberg, and only a very simplistic example to illustrate the concepts. In practice you would want to compute a more optimal weighting for how many shares to hold for S1 and S2. Some additional resources on pair trading are listed at the end of this notebook

In general taking a statistic over your whole sample size can be bad. For example, if the market is moving up, and both securities with it, then your average price over the last 3 years may not be representative of today. For this reason traders often use statistics that rely on rolling windows of the most recent data.

A moving average is just an average over the last $n$ datapoints for each given time. It will be undefined for the first $n$ datapoints in our series. Shorter moving averages will be more jumpy and less reliable, but respond to new information quickly. Longer moving averages will be smoother, but take more time to incorporate new information.

We also need to use a rolling beta, a rolling estimate of how our spread should be calculated, in order to keep all of our parameters up to date.

In [22]:

```
# Get the spread between the 2 stocks
# Calculate rolling beta coefficient
window = 30
rolling_beta = [np.nan] * window
for n in range(window, len(S1)):
y = S1[(n - window):n]
X = S2[(n - window):n]
rolling_beta.append(sm.OLS(y, X).fit().params.iloc[0])
rolling_beta = pd.Series(rolling_beta, index=S2.index)
spread = S2 - rolling_beta * S1
spread.name = 'spread'
# Get the 1 day moving average of the price spread
spread_mavg1 = spread.rolling(window=1).mean()
spread_mavg1.name = 'spread 1d mavg'
# Get the 30 day moving average
spread_mavg30 = spread.rolling(30).mean()
spread_mavg30.name = 'spread 30d mavg'
plt.plot(spread_mavg1.index, spread_mavg1.values)
plt.plot(spread_mavg30.index, spread_mavg30.values)
plt.legend(['1 Day Spread MAVG', '30 Day Spread MAVG'])
plt.ylabel('Spread');
```