© Copyright Quantopian Inc.

© Modifications Copyright QuantRocket LLC

Licensed under the Creative Commons Attribution 4.0.

Disclaimer

by Maxwell Margenot

pandas is a Python library that provides a collection of powerful data structures to better help you manage data. In this lecture, we will cover how to use the `Series` and `DataFrame` objects to handle data. These objects have a strong integration with NumPy, allowing us to easily perform the statistical and mathematical calculations that we need for finance.

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

With pandas, it is easy to store, visualize, and perform calculations on your data. With only a few lines of code we can modify our data and present it in an easily-understandable way. Here we simulate some returns in NumPy, put them into a pandas `DataFrame`, and perform calculations to turn them into prices and plot them, all using only a few lines of code.

In [2]:

```
returns = pd.DataFrame(np.random.normal(1.0, 0.03, (100, 10)))
prices = returns.cumprod()
prices.plot()
plt.title('Randomly-generated Prices')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend(loc=0);
```

So let's have a look at how we actually build up to this point!

In [3]:

```
s = pd.Series([1, 2, np.nan, 4, 5])
print(s)
```

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

Every `Series` has a name. We can give the series a name as a parameter or we can define it afterwards by directly accessing the name attribute. In this case, we have given our time series no name, so the attribute should be empty.

In [4]:

```
print(s.name)
```

None

This name can be directly modified with no repercussions.

In [5]:

```
s.name = "Toy Series"
print(s.name)
```

Toy Series
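The name can also be given at construction time rather than set afterwards. A minimal sketch, separate from our toy series:

```python
import pandas as pd

# Passing the name as a parameter when the Series is created
named = pd.Series([1, 2, 3], name="Constructed Series")
print(named.name)
```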

We call the collected axis labels of a `Series` its index. An index can either be passed to a `Series` as a parameter or added later, similarly to its name. In the absence of an index, a `Series` will simply contain an index composed of integers, starting at $0$, as in the case of our "Toy Series".

In [6]:

```
print(s.index)
```

RangeIndex(start=0, stop=5, step=1)
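An index can likewise be supplied as a parameter at construction time. A quick standalone sketch with string labels (not the date index we build next):

```python
import pandas as pd

# Passing an explicit index when the Series is created
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled.index)
```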

pandas has a built-in function specifically for creating date indices, `date_range()`. We use the function here to create a new index for `s`.

In [7]:

```
new_index = pd.date_range("2016-01-01", periods=len(s), freq="D")
print(new_index)
```

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05'], dtype='datetime64[ns]', freq='D')

An index must be exactly the same length as the `Series` itself. Each index label must match one-to-one with an element of the `Series`. Once this is satisfied, we can directly modify the `Series` index, as with the name, to use our new and more informative index (relatively speaking).

In [8]:

```
s.index = new_index
print(s.index)
```

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05'], dtype='datetime64[ns]', freq='D')

The index of the `Series` is crucial for handling time series, which we will get into a little later.

`Series` Elements

`Series` are typically accessed using the `iloc[]` and `loc[]` indexers. We use `iloc[]` to access elements by integer position and we use `loc[]` to access elements by index label.

In [9]:

```
print("First element of the series:", s.iloc[0])
print("Last element of the series:", s.iloc[len(s)-1])
```

First element of the series: 1.0
Last element of the series: 5.0

We can slice a `Series` similarly to our favorite collections, Python lists and NumPy arrays. We use the colon operator to indicate the slice.

In [10]:

```
s.iloc[:2]
```

Out[10]:

2016-01-01    1.0
2016-01-02    2.0
Freq: D, Name: Toy Series, dtype: float64

When creating a slice, we have the option of specifying a beginning, an end, and a step. The slice will begin at the start index and take steps of size `step` until it passes the end index, not including the end.

In [11]:

```
start = 0
end = len(s) - 1
step = 1
s.iloc[start:end:step]
```

Out[11]:

2016-01-01    1.0
2016-01-02    2.0
2016-01-03    NaN
2016-01-04    4.0
Freq: D, Name: Toy Series, dtype: float64

We can even reverse a `Series` by specifying a negative step size. Similarly, we can index the start and end with a negative integer value.

In [12]:

```
s.iloc[::-1]
```

Out[12]:

2016-01-05    5.0
2016-01-04    4.0
2016-01-03    NaN
2016-01-02    2.0
2016-01-01    1.0
Freq: -1D, Name: Toy Series, dtype: float64

This returns a slice of the series that starts from the second-to-last element and ends at the third-to-last element (the fourth-to-last is excluded, since we take backward steps of size $1$).

In [13]:

```
s.iloc[-2:-4:-1]
```

Out[13]:

2016-01-04    4.0
2016-01-03    NaN
Freq: -1D, Name: Toy Series, dtype: float64

We can also access a series by using the values of its index. Since we indexed `s` with a collection of dates (`Timestamp` objects) we can look at the value contained in `s` for a particular date.

In [14]:

```
s.loc['2016-01-01']
```

Out[14]:

1.0

Or even for a range of dates!

In [15]:

```
s.loc['2016-01-02':'2016-01-04']
```

Out[15]:

2016-01-02    2.0
2016-01-03    NaN
2016-01-04    4.0
Freq: D, Name: Toy Series, dtype: float64

With `Series`, we *can* just use the brackets (`[]`) to access elements, but this is not best practice. The brackets are ambiguous because they can be used to access `Series` (and `DataFrames`) using both index and integer values, and the results will change based on context (especially with `DataFrames`).
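To see why the bare brackets are ambiguous, consider a small sketch (unrelated to our toy series) where the integer index labels don't match the integer positions; `loc[]` and `iloc[]` make the intent explicit:

```python
import pandas as pd

# Integer labels deliberately in reverse order of the positions
ambiguous = pd.Series([100, 200, 300], index=[2, 1, 0])

print(ambiguous.loc[0])   # by label: 300
print(ambiguous.iloc[0])  # by position: 100
```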

In addition to the above-mentioned access methods, you can filter `Series` using boolean arrays. `Series` are compatible with your standard comparators. Once compared with whatever condition you like, you get back yet another `Series`, this time filled with boolean values.

In [16]:

```
print(s < 3)
```

2016-01-01     True
2016-01-02     True
2016-01-03    False
2016-01-04    False
2016-01-05    False
Freq: D, Name: Toy Series, dtype: bool

We can pass *this* `Series` back into the original `Series` to filter out only the elements for which our condition is `True`.

In [17]:

```
print(s.loc[s < 3])
```

2016-01-01    1.0
2016-01-02    2.0
Freq: D, Name: Toy Series, dtype: float64

If we so desire, we can group multiple conditions together using the logical operators `&`, `|`, and `~` (and, or, and not, respectively).

In [18]:

```
print(s.loc[(s < 3) & (s > 1)])
```

2016-01-02    2.0
Freq: D, Name: Toy Series, dtype: float64
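The `|` and `~` operators work the same way. One caveat worth noting in this standalone sketch (same values as our toy series, but a default integer index): comparisons against `NaN` evaluate to `False`, so negating a condition with `~` can pick up `NaN` elements.

```python
import numpy as np
import pandas as pd

vals = pd.Series([1, 2, np.nan, 4, 5])

# Union of two conditions with |
print(vals.loc[(vals < 2) | (vals > 4)])

# Negation with ~; the NaN element passes, since NaN > 2 is False
print(vals.loc[~(vals > 2)])
```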

This is very convenient for getting only the elements of a `Series` that fulfill specific criteria that we need. It gets even more convenient when we are handling `DataFrames`.

Since we use `Series` for handling time series, it's worth covering a little bit of how we handle the time component. For our purposes, we use pandas `Timestamp` objects. Let's pull a full time series, complete with all the appropriate labels, by using our `get_prices()` function. All data pulled with `get_prices()` will be in `DataFrame` format. We can modify this index however we like.

In [19]:

```
from quantrocket.master import get_securities
securities = get_securities(symbols='XOM', fields=['Sid','Symbol','Exchange'], vendors='usstock')
securities
```

Out[19]:

| Sid | Symbol | Exchange |
| --- | --- | --- |
| FIBBG000GZQ728 | XOM | XNYS |

In [20]:

```
from quantrocket import get_prices
XOM = securities.index[0]
start = "2012-01-01"
end = "2016-01-01"
prices = get_prices("usstock-free-1min", data_frequency="daily", sids=XOM, start_date=start, end_date=end, fields="Close")
prices = prices.loc["Close"][XOM]
```

We can display the first few elements of our series by using the `head()` method and specifying the number of elements that we want. The analogous method for the last few elements is `tail()`.

In [21]:

```
print(type(prices))
prices.head(5)
```

<class 'pandas.core.series.Series'>

Out[21]:

Date
2012-01-03    76.760
2012-01-04    76.778
2012-01-05    76.546
2012-01-06    75.975
2012-01-09    76.314
Name: FIBBG000GZQ728, dtype: float64

As with our toy example, we can specify a name for our time series, if only to clarify the name that `get_prices()` provides us.

In [22]:

```
print('Old name:', prices.name)
prices.name = "XOM"
print('New name:', prices.name)
```

Old name: FIBBG000GZQ728
New name: XOM

Let's take a closer look at the `DatetimeIndex` of our `prices` time series.

In [23]:

```
print(prices.index)
print("tz:", prices.index.tz)
```

DatetimeIndex(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06',
               '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12',
               '2012-01-13', '2012-01-17',
               ...
               '2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
               '2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
               '2015-12-30', '2015-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)
tz: None

Notice that this `DatetimeIndex` has a collection of associated information. In particular, it has an associated frequency (`freq`) and an associated timezone (`tz`). The frequency indicates whether the data is daily, monthly, or some other period, while the timezone indicates what locale this index is relative to. We can modify all of this extra information!

If we resample our `Series`, we can adjust the frequency of our data. We currently have daily data (excluding weekends). Let's downsample from this daily data to monthly data using the `resample()` method.

In [24]:

```
monthly_prices = prices.resample('M').last()
monthly_prices.head(10)
```

Out[24]:

Date
2012-01-31    74.743
2012-02-29    77.629
2012-03-31    77.835
2012-04-30    77.485
2012-05-31    71.052
2012-06-30    77.323
2012-07-31    78.480
2012-08-31    79.399
2012-09-30    83.174
2012-10-31    82.919
Freq: M, Name: XOM, dtype: float64

In the above example we use the last value of the lower-level data to create the higher-level data. We can specify other ways to calculate the downsampled values, for example using the median.

In [25]:

```
monthly_prices_med = prices.resample('M').median()
monthly_prices_med.head(10)
```

Out[25]:

Date
2012-01-31    76.5770
2012-02-29    76.6105
2012-03-31    77.2565
2012-04-30    76.6195
2012-05-31    74.2285
2012-06-30    74.1970
2012-07-31    77.2330
2012-08-31    79.7990
2012-09-30    82.9740
2012-10-31    83.4190
Freq: M, Name: XOM, dtype: float64

We can even specify how we want the calculation of the new period to be done. Here we create a `custom_resampler()` function that will return the first value of the period. In our specific case, this will return a `Series` where the monthly value is the first value of that month.

In [26]:

```
def custom_resampler(array_like):
    """ Returns the first value of the period """
    return array_like.iloc[0]

first_of_month_prices = prices.resample('M').apply(custom_resampler)
first_of_month_prices.head(10)
```

Out[26]:

Date
2012-01-31    76.760
2012-02-29    74.948
2012-03-31    77.925
2012-04-30    78.140
2012-05-31    78.114
2012-06-30    70.411
2012-07-31    77.115
2012-08-31    78.534
2012-09-30    79.236
2012-10-31    83.492
Freq: M, Name: XOM, dtype: float64

We can also adjust the timezone of a `Series` to adapt to the time of real-world data. In our case, our time series isn't localized to a timezone, but let's say that we want to localize the time to 'America/New_York'. In this case we use the `tz_localize()` method, since the time isn't already localized.

In [27]:

```
eastern_prices = prices.tz_localize('America/New_York')
eastern_prices.head(10)
```

Out[27]:

Date
2012-01-03 00:00:00-05:00    76.760
2012-01-04 00:00:00-05:00    76.778
2012-01-05 00:00:00-05:00    76.546
2012-01-06 00:00:00-05:00    75.975
2012-01-09 00:00:00-05:00    76.314
2012-01-10 00:00:00-05:00    76.510
2012-01-11 00:00:00-05:00    75.939
2012-01-12 00:00:00-05:00    75.635
2012-01-13 00:00:00-05:00    75.760
2012-01-17 00:00:00-05:00    76.483
Name: XOM, dtype: float64
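Once a series has been localized, the related `tz_convert()` method translates it into another timezone. A small sketch on synthetic data (not our price series):

```python
import pandas as pd

idx = pd.date_range("2016-01-01", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=idx)

# Localize the naive index first, then convert it to UTC
eastern = ts.tz_localize("America/New_York")
utc = eastern.tz_convert("UTC")
print(utc.index.tz)
```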

In addition to the capacity for timezone and frequency management, each time series has a built-in `reindex()` method that we can use to realign the existing data according to a new set of index labels. If data does not exist for a particular label, the data will be filled with a placeholder value. This is typically `np.nan`, though we can provide a fill method.

The data that we get from `get_prices()` only includes market days. But what if we want prices for every single calendar day? This will include holidays and weekends, times when you normally cannot trade equities. First let's create a new `DatetimeIndex` that contains all that we want.

In [28]:

```
calendar_dates = pd.date_range(start=start, end=end, freq='D')
print(calendar_dates)
```

DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
               '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
               '2012-01-09', '2012-01-10',
               ...
               '2015-12-23', '2015-12-24', '2015-12-25', '2015-12-26',
               '2015-12-27', '2015-12-28', '2015-12-29', '2015-12-30',
               '2015-12-31', '2016-01-01'],
              dtype='datetime64[ns]', length=1462, freq='D')

Now let's use this new set of dates to reindex our time series. We tell the function that the fill method that we want is `ffill`. This denotes "forward fill". Any `NaN` values will be filled by the *last value* listed. So the price on the weekend or on a holiday will be listed as the price on the last market day that we know about.

In [29]:

```
calendar_prices = prices.reindex(calendar_dates, method='ffill')
calendar_prices.head(15)
```

Out[29]:

2012-01-01       NaN
2012-01-02       NaN
2012-01-03    76.760
2012-01-04    76.778
2012-01-05    76.546
2012-01-06    75.975
2012-01-07    75.975
2012-01-08    75.975
2012-01-09    76.314
2012-01-10    76.510
2012-01-11    75.939
2012-01-12    75.635
2012-01-13    75.760
2012-01-14    75.760
2012-01-15    75.760
Freq: D, Name: XOM, dtype: float64

You'll notice that we still have a couple of `NaN` values right at the beginning of our time series. This is because the first of January in 2012 was a Sunday and the second was a market holiday! Because these are the earliest data points and we don't have any information from before them, they cannot be forward-filled. We will take care of these `NaN` values in the next section, when we deal with missing data.

Whenever we deal with real data, there is a very real possibility of encountering missing values. Real data is riddled with holes, and sometimes resampling or reindexing can create `NaN` values as well. Fortunately, pandas provides us with ways to handle them. We have two primary means of coping with missing data. The first of these is filling in the missing data with `fillna()`. For example, say that we want to fill in the missing days with the mean price of all days.

In [30]:

```
meanfilled_prices = calendar_prices.fillna(calendar_prices.mean())
meanfilled_prices.head(10)
```

Out[30]:

2012-01-01    84.152682
2012-01-02    84.152682
2012-01-03    76.760000
2012-01-04    76.778000
2012-01-05    76.546000
2012-01-06    75.975000
2012-01-07    75.975000
2012-01-08    75.975000
2012-01-09    76.314000
2012-01-10    76.510000
Freq: D, Name: XOM, dtype: float64

Using `fillna()` is fairly easy. It is just a matter of indicating the value that you want to fill the spaces with. Unfortunately, this particular case doesn't make a whole lot of sense, for reasons discussed in the lecture on stationarity in the Lecture series. We could simply fill them with $0$, but that's similarly uninformative.

Rather than filling in specific values with `fillna()`, we can use the `bfill()` method to "backward fill", where `NaN`s are filled with the *next* filled value (instead of forward fill's *last* filled value) like so:

In [31]:

```
bfilled_prices = calendar_prices.bfill()
bfilled_prices.head(10)
```

Out[31]:

2012-01-01    76.760
2012-01-02    76.760
2012-01-03    76.760
2012-01-04    76.778
2012-01-05    76.546
2012-01-06    75.975
2012-01-07    75.975
2012-01-08    75.975
2012-01-09    76.314
2012-01-10    76.510
Freq: D, Name: XOM, dtype: float64

But again, this is a bad idea for the same reasons as the previous option. Both of these so-called solutions take into account *future data* that was not available at the time of the data points that we are trying to fill. In the case of using the mean or the median, these summary statistics are calculated by taking into account the entire time series. Backward filling is equivalent to saying that the price of a particular security today, right now, is tomorrow's price. This also makes no sense. These two options are both examples of look-ahead bias, using data that would be unknown or unavailable at the desired time, and should be avoided.
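If a fill is truly needed, one way to avoid look-ahead bias (not used in this lecture) is to fill each gap with a statistic computed only from data available up to that point, such as an expanding mean. A sketch on toy data:

```python
import numpy as np
import pandas as pd

gappy = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])

# The expanding mean at each point uses only current and past data,
# so filling with it does not peek into the future
filled = gappy.fillna(gappy.expanding().mean())
print(filled)
```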

Our next option is significantly more appealing. We could simply drop the missing data using the `dropna()` method. This is a much better alternative than filling `NaN` values in with arbitrary numbers.

In [32]:

```
dropped_prices = calendar_prices.dropna()
dropped_prices.head(10)
```

Out[32]:

2012-01-03    76.760
2012-01-04    76.778
2012-01-05    76.546
2012-01-06    75.975
2012-01-07    75.975
2012-01-08    75.975
2012-01-09    76.314
2012-01-10    76.510
2012-01-11    75.939
2012-01-12    75.635
Freq: D, Name: XOM, dtype: float64

Now our time series is cleaned for the calendar year, with all of our `NaN` values properly handled. It is time to talk about how to actually do time series analysis with pandas data structures.

Let's do some basic time series analysis on our original prices. Each pandas `Series` has a built-in plotting method.

In [33]:

```
prices.plot();
# We still need to add the axis labels and title ourselves
plt.title("XOM Prices")
plt.ylabel("Price")
plt.xlabel("Date");
```

As well as some built-in descriptive statistics. We can either calculate these individually or use the `describe()` method.

In [34]:

```
print("Mean:", prices.mean())
print("Standard deviation:", prices.std())
```

Mean: 84.12831908548708
Standard deviation: 6.59102021142679

In [35]:

```
print("Summary Statistics")
print(prices.describe())
```

Summary Statistics
count    1006.000000
mean       84.128319
std         6.591020
min        68.116000
25%        79.783250
50%        82.990000
75%        88.993250
max        99.502000
Name: XOM, dtype: float64

We can easily modify `Series` with scalars using our basic mathematical operators.

In [36]:

```
modified_prices = prices * 2 - 10
modified_prices.head(5)
```

Out[36]:

Date
2012-01-03    143.520
2012-01-04    143.556
2012-01-05    143.092
2012-01-06    141.950
2012-01-09    142.628
Name: XOM, dtype: float64

And we can create linear combinations of `Series` themselves using the basic mathematical operators. pandas will group up matching indices and perform the calculations elementwise to produce a new `Series`.

In [37]:

```
noisy_prices = prices + 5 * pd.Series(np.random.normal(0, 5, len(prices)), index=prices.index) + 20
noisy_prices.head(5)
```

Out[37]:

Date
2012-01-03     67.168340
2012-01-04    101.895873
2012-01-05     96.558114
2012-01-06    140.748484
2012-01-09     89.780228
dtype: float64

If there are no matching indices, however, we may get a `Series` full of `NaN` values in return.

In [38]:

```
empty_series = prices + pd.Series(np.random.normal(0, 1, len(prices)))
empty_series.head(5)
```

Out[38]:

2012-01-03 00:00:00    NaN
2012-01-04 00:00:00    NaN
2012-01-05 00:00:00    NaN
2012-01-06 00:00:00    NaN
2012-01-09 00:00:00    NaN
dtype: float64

Rather than looking at a time series itself, we may want to look at its first-order differences or percent change (in order to get additive or multiplicative returns, in our particular case). Both of these are built-in methods.

In [39]:

```
add_returns = prices.diff()[1:]
mult_returns = prices.pct_change()[1:]
```
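As a sanity check, the two are related: the multiplicative return is the additive return divided by the previous price. A sketch on synthetic prices (not our XOM series):

```python
import numpy as np
import pandas as pd

p = pd.Series([100.0, 110.0, 99.0])

add_returns = p.diff()[1:]         # price differences
mult_returns = p.pct_change()[1:]  # percent changes

# diff divided by the prior price matches pct_change (up to float error)
print(np.allclose(add_returns / p.shift(1)[1:], mult_returns))
```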

In [40]:

```
plt.title("Multiplicative returns of XOM")
plt.xlabel("Date")
plt.ylabel("Percent Returns")
mult_returns.plot();
```

pandas has convenient functions for calculating rolling means and standard deviations, as well!

In [41]:

```
rolling_mean = prices.rolling(30).mean()
rolling_mean.name = "30-day rolling mean"
```

In [42]:

```
prices.plot()
rolling_mean.plot()
plt.title("XOM Price")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend();
```

In [43]:

```
rolling_std = prices.rolling(30).std()
rolling_std.name = "30-day rolling volatility"
```

In [44]:

```
rolling_std.plot()
plt.title(rolling_std.name);
plt.xlabel("Date")
plt.ylabel("Standard Deviation");
```