Pandas for time series data
In this notebook, we're going to introduce some of the basic Pandas functionality for working with time series data.
We will cover how to:
set a datetime index on a DataFrame
filter data using dates
temporally aggregate time series from higher to lower frequencies
apply some common transformations (log, percent change, difference)
apply rolling window calculations
import pandas as pd
import numpy as np
generate some random data using numpy (see numpy doc)
# old way using the np.random.normal function (type `?np.random.normal` for more info)
n_periods = 100
data = np.random.normal(loc=0, scale=1, size=n_periods)
# new (recommended) way
n_periods = 100
from numpy.random import default_rng
rng = default_rng()
data = rng.normal(loc=0, scale=1, size=n_periods)
Note
loc=0, scale=1 are the default values, so they could be omitted. We could also use the rng.standard_normal(size=n_periods) method.
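For example, these two calls draw from the same distribution (a quick sketch; the seed is chosen arbitrarily to make the example reproducible):
rng = default_rng(seed=42)       # seeding makes the draws reproducible
a = rng.normal(size=5)           # loc=0, scale=1 are the defaults
b = rng.standard_normal(size=5)  # dedicated method for N(0, 1) draws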
The datetime index
time_index = pd.date_range(start='2022-01-01', periods = n_periods, freq='D')
freq='D' sets the frequency to daily. Some other options that are useful for us:
freq='M' - month end frequency
freq='MS' - month start frequency
freq='Q' - quarter end frequency
freq='QS' - quarter start frequency
freq='Y' or freq='A' - year end frequency
freq='YS' or freq='AS' - year start frequency
see pandas doc for the full list of options
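A quick sketch of a couple of these frequencies (expected output shown as comments):
pd.date_range(start='2022-01-01', periods=3, freq='MS')
# DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01'], dtype='datetime64[ns]', freq='MS')
pd.date_range(start='2022-01-01', periods=3, freq='QS')
# DatetimeIndex(['2022-01-01', '2022-04-01', '2022-07-01'], dtype='datetime64[ns]', freq='QS')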
df = pd.DataFrame(data = data, index=time_index, columns=['data',])
df.head()
| data |
---|---|
2022-01-01 | -0.693954 |
2022-01-02 | -0.209926 |
2022-01-03 | 0.044877 |
2022-01-04 | 0.527820 |
2022-01-05 | 0.923959 |
type(df.index), type(df['data'])
(pandas.core.indexes.datetimes.DatetimeIndex, pandas.core.series.Series)
Having a datetime index lets us perform time-aware selection and manipulation very easily. For instance, we can get the observations from a given month and compute some statistics about them.
df.loc['2022-04']
| data |
---|---|
2022-04-01 | 0.072442 |
2022-04-02 | 2.313431 |
2022-04-03 | -0.774668 |
2022-04-04 | 0.132488 |
2022-04-05 | 1.755500 |
2022-04-06 | -2.213172 |
2022-04-07 | 1.618603 |
2022-04-08 | 1.433903 |
2022-04-09 | -1.297705 |
2022-04-10 | 1.274354 |
print(f"the mean is {df.loc['2022-04'].mean()[0]}")
print(f"the min and max values are {df.loc['2022-04'].min()[0]} and {df.loc['2022-04'].max()[0]}")
the mean is 0.43151774216027733
the min and max values are -2.2131721696816187 and 2.3134305670436035
An aside:
These are examples of using f-strings. We can apply formatting to get a more readable output. The next cell shows how to print only two decimal places.
See f-string doc for more info
print(f"the mean is {df.loc['2022-04'].mean()[0]:.2f}")
print(f"the min and max values are {df.loc['2022-04'].min()[0]:.2f} and {df.loc['2022-04'].max()[0]:.2f}")
the mean is 0.43
the min and max values are -2.21 and 2.31
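A couple more format specs that often come in handy (a quick sketch, not exhaustive):
x = 0.43151774216027733
print(f"{x:.2f}")    # 0.43 -> two decimal places
print(f"{x:10.2f}")  # right-aligned in a field of width 10
print(f"{x:+.2%}")   # +43.15% -> signed percentage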
all data starting at a given date (note that it is inclusive)
df.loc['2022-03-31':]
| data |
---|---|
2022-03-31 | 0.168551 |
2022-04-01 | 0.072442 |
2022-04-02 | 2.313431 |
2022-04-03 | -0.774668 |
2022-04-04 | 0.132488 |
2022-04-05 | 1.755500 |
2022-04-06 | -2.213172 |
2022-04-07 | 1.618603 |
2022-04-08 | 1.433903 |
2022-04-09 | -1.297705 |
2022-04-10 | 1.274354 |
all data before a given date (note that it is inclusive)
df.loc[:'2022-01-12']
| data |
---|---|
2022-01-01 | -0.693954 |
2022-01-02 | -0.209926 |
2022-01-03 | 0.044877 |
2022-01-04 | 0.527820 |
2022-01-05 | 0.923959 |
2022-01-06 | 0.115014 |
2022-01-07 | -0.206611 |
2022-01-08 | 0.791693 |
2022-01-09 | -0.271810 |
2022-01-10 | -0.125983 |
2022-01-11 | -1.379321 |
2022-01-12 | -0.972250 |
all data between two dates (note that it is inclusive on both ends)
df.loc['2022-01-12':'2022-02-12']
| data |
---|---|
2022-01-12 | -0.972250 |
2022-01-13 | -1.053409 |
2022-01-14 | -1.650255 |
2022-01-15 | -1.251144 |
2022-01-16 | 0.725652 |
2022-01-17 | 2.123418 |
2022-01-18 | -0.367741 |
2022-01-19 | 1.032479 |
2022-01-20 | 0.449727 |
2022-01-21 | -1.204295 |
2022-01-22 | -0.008893 |
2022-01-23 | 1.236906 |
2022-01-24 | -1.444002 |
2022-01-25 | 0.987860 |
2022-01-26 | 0.121889 |
2022-01-27 | 1.140643 |
2022-01-28 | 1.076552 |
2022-01-29 | 0.184321 |
2022-01-30 | -0.050446 |
2022-01-31 | 0.589317 |
2022-02-01 | 0.466329 |
2022-02-02 | -1.049473 |
2022-02-03 | 1.206759 |
2022-02-04 | -0.840185 |
2022-02-05 | 1.142660 |
2022-02-06 | -1.486865 |
2022-02-07 | 1.243011 |
2022-02-08 | 0.489471 |
2022-02-09 | -0.080931 |
2022-02-10 | 0.098374 |
2022-02-11 | -0.368505 |
2022-02-12 | -2.364454 |
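The same selection can also be written with an explicit boolean mask on the index, which is handy when the condition is more complex than a plain date range (a sketch of equivalent code):
# equivalent to df.loc['2022-01-12':'2022-02-12']
mask = (df.index >= '2022-01-12') & (df.index <= '2022-02-12')
df.loc[mask]
# e.g. only the Mondays in that range
df.loc[mask & (df.index.dayofweek == 0)]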
Temporal aggregation
This is when we move from a shorter to a longer period. For example, daily data can be aggregated to weekly, monthly, quarterly, or annual frequency. Typically, aggregation is done by taking the sum or the average.
See pandas doc for more info and examples of using the resample method
df.resample('M').sum()
| data |
---|---|
2022-01-31 | 1.182088 |
2022-02-28 | -9.062631 |
2022-03-31 | 7.715243 |
2022-04-30 | 4.315177 |
df.resample('M').mean()
| data |
---|---|
2022-01-31 | 0.038132 |
2022-02-28 | -0.323665 |
2022-03-31 | 0.248879 |
2022-04-30 | 0.431518 |
aggregation can also be done by taking the first/last observation of the higher-frequency (daily here) data
df.resample('M').first()
| data |
---|---|
2022-01-31 | -0.693954 |
2022-02-28 | 0.466329 |
2022-03-31 | -0.360594 |
2022-04-30 | 0.072442 |
df.resample('M').last()
| data |
---|---|
2022-01-31 | 0.589317 |
2022-02-28 | -1.773561 |
2022-03-31 | 0.168551 |
2022-04-30 | 1.274354 |
we can apply multiple functions at the same time using the agg method
df.resample('M').agg(['mean', 'sum', 'min', 'max', 'first', 'last'])
| mean | sum | min | max | first | last |
---|---|---|---|---|---|---|
2022-01-31 | 0.038132 | 1.182088 | -1.650255 | 2.123418 | -0.693954 | 0.589317 |
2022-02-28 | -0.323665 | -9.062631 | -2.364454 | 1.305562 | 0.466329 | -1.773561 |
2022-03-31 | 0.248879 | 7.715243 | -2.146517 | 2.109419 | -0.360594 | 0.168551 |
2022-04-30 | 0.431518 | 4.315177 | -2.213172 | 2.313431 | 0.072442 | 1.274354 |
Note
we could have the monthly index fall on the first day of each month by using df.resample('MS')
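For instance (a quick sketch):
df.resample('MS').mean().head(2)  # index falls on 2022-01-01, 2022-02-01, ...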
Pivot Tables
create a new monthly time series for 2 years
monthly_index = pd.date_range(start='2021-01-01', end='2022-12-31', freq='M')
len(monthly_index)
24
df_monthly = pd.DataFrame(data=np.random.normal(loc=0, scale=1, size=len(monthly_index)),
                          columns=['data',],
                          index=monthly_index)
df_monthly.head()
| data |
---|---|
2021-01-31 | 0.484735 |
2021-02-28 | 0.687497 |
2021-03-31 | 0.147638 |
2021-04-30 | 0.404760 |
2021-05-31 | -0.883069 |
df_monthly['month'] = df_monthly.index.month
df_monthly['quarter'] = df_monthly.index.quarter
df_monthly['year'] = df_monthly.index.year
df_monthly
| data | month | quarter | year |
---|---|---|---|---|
2021-01-31 | 0.484735 | 1 | 1 | 2021 |
2021-02-28 | 0.687497 | 2 | 1 | 2021 |
2021-03-31 | 0.147638 | 3 | 1 | 2021 |
2021-04-30 | 0.404760 | 4 | 2 | 2021 |
2021-05-31 | -0.883069 | 5 | 2 | 2021 |
2021-06-30 | -0.854109 | 6 | 2 | 2021 |
2021-07-31 | -1.276728 | 7 | 3 | 2021 |
2021-08-31 | 0.254640 | 8 | 3 | 2021 |
2021-09-30 | 0.577472 | 9 | 3 | 2021 |
2021-10-31 | -1.186852 | 10 | 4 | 2021 |
2021-11-30 | -0.664026 | 11 | 4 | 2021 |
2021-12-31 | -0.321033 | 12 | 4 | 2021 |
2022-01-31 | 0.832603 | 1 | 1 | 2022 |
2022-02-28 | -0.773677 | 2 | 1 | 2022 |
2022-03-31 | 0.319106 | 3 | 1 | 2022 |
2022-04-30 | -0.745475 | 4 | 2 | 2022 |
2022-05-31 | 0.898935 | 5 | 2 | 2022 |
2022-06-30 | -0.431810 | 6 | 2 | 2022 |
2022-07-31 | -1.345803 | 7 | 3 | 2022 |
2022-08-31 | 0.143264 | 8 | 3 | 2022 |
2022-09-30 | -0.004844 | 9 | 3 | 2022 |
2022-10-31 | 1.102417 | 10 | 4 | 2022 |
2022-11-30 | 0.713127 | 11 | 4 | 2022 |
2022-12-31 | 0.347538 | 12 | 4 | 2022 |
we can use the pivot_table method to see the data for each month by year
pd.pivot_table(df_monthly,
               index='month',
               columns="year",
               values='data')
month | 2021 | 2022 |
---|---|---|
1 | 0.484735 | 0.832603 |
2 | 0.687497 | -0.773677 |
3 | 0.147638 | 0.319106 |
4 | 0.404760 | -0.745475 |
5 | -0.883069 | 0.898935 |
6 | -0.854109 | -0.431810 |
7 | -1.276728 | -1.345803 |
8 | 0.254640 | 0.143264 |
9 | 0.577472 | -0.004844 |
10 | -1.186852 | 1.102417 |
11 | -0.664026 | 0.713127 |
12 | -0.321033 | 0.347538 |
a more useful example of pivot_table is to compute a summary statistic at some level of aggregation, for example the average value by quarter
pd.pivot_table(df_monthly,
               index='quarter',
               columns="year",
               values='data',
               aggfunc=np.mean)
quarter | 2021 | 2022 |
---|---|---|
1 | 0.439957 | 0.126010 |
2 | -0.444139 | -0.092783 |
3 | -0.148205 | -0.402461 |
4 | -0.723971 | 0.721027 |
compare with using the groupby method
df_monthly[['year', 'quarter', 'data']].groupby(['year', 'quarter']).mean()
year | quarter | data |
---|---|---|
2021 | 1 | 0.439957 |
2 | -0.444139 | |
3 | -0.148205 | |
4 | -0.723971 | |
2022 | 1 | 0.126010 |
2 | -0.092783 | |
3 | -0.402461 | |
4 | 0.721027 |
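As a sketch of how the two relate, unstacking the year level of the groupby result reproduces the pivot table layout:
# move 'year' from the row index to the columns; same numbers as the pivot table above
df_monthly.groupby(['year', 'quarter'])['data'].mean().unstack('year')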
Transformations
log-transformation
the log is defined for positive values only. To ensure that all values are positive, below I add a constant to the values in the data column. I also create a new dataframe and rename the column to y.
min_value = np.abs(df.min())
new_df = df + min_value + 0.1 # this adds min_value + 0.1 to all elements of the data column of df
new_df.columns = ['y']
new_df.head()
| y |
---|---|
2022-01-01 | 1.770501 |
2022-01-02 | 2.254528 |
2022-01-03 | 2.509331 |
2022-01-04 | 2.992274 |
2022-01-05 | 3.388413 |
## log of y
np.log(new_df['y'])
2022-01-01 0.571262
2022-01-02 0.812941
2022-01-03 0.920016
2022-01-04 1.096034
2022-01-05 1.220362
...
2022-04-06 -1.381180
2022-04-07 1.406846
2022-04-08 1.360555
2022-04-09 0.154221
2022-04-10 1.318767
Freq: D, Name: y, Length: 100, dtype: float64
percent change from last period
new_df.pct_change(periods=1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 0.273384 |
2022-01-03 | 0.113019 |
2022-01-04 | 0.192459 |
2022-01-05 | 0.132387 |
... | ... |
2022-04-06 | -0.940454 |
2022-04-07 | 15.248907 |
2022-04-08 | -0.045236 |
2022-04-09 | -0.700708 |
2022-04-10 | 2.204467 |
100 rows × 1 columns
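A connection to the log transform above (a quick sketch): the first difference of the log series approximates the percent change when period-to-period changes are small, since log(y(t)/y(t-1)) ≈ y(t)/y(t-1) - 1. With this random series some changes are large, so the two can differ noticeably:
# log difference vs percent change; close only where changes are small
pd.DataFrame({'pct_change': new_df['y'].pct_change(),
              'log_diff': np.log(new_df['y']).diff()}).head()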
difference relative to last period
new_df.diff(periods=1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 0.484027 |
2022-01-03 | 0.254804 |
2022-01-04 | 0.482943 |
2022-01-05 | 0.396138 |
... | ... |
2022-04-06 | -3.968672 |
2022-04-07 | 3.831776 |
2022-04-08 | -0.184700 |
2022-04-09 | -2.731608 |
2022-04-10 | 2.572059 |
100 rows × 1 columns
lagged values \(y(t-1)\)
shift moves the values forward by a given number of periods (one here), so each row shows the previous period's value
new_df.shift(1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 1.770501 |
2022-01-03 | 2.254528 |
2022-01-04 | 2.509331 |
2022-01-05 | 2.992274 |
... | ... |
2022-04-06 | 4.219954 |
2022-04-07 | 0.251282 |
2022-04-08 | 4.083058 |
2022-04-09 | 3.898357 |
2022-04-10 | 1.166749 |
100 rows × 1 columns
lead values \(y(t+1)\)
a negative shift moves the values backward, so each row shows the next period's value
new_df.shift(-1)
| y |
---|---|
2022-01-01 | 2.254528 |
2022-01-02 | 2.509331 |
2022-01-03 | 2.992274 |
2022-01-04 | 3.388413 |
2022-01-05 | 2.579468 |
... | ... |
2022-04-06 | 4.083058 |
2022-04-07 | 3.898357 |
2022-04-08 | 1.166749 |
2022-04-09 | 3.738808 |
2022-04-10 | NaN |
100 rows × 1 columns
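As a sanity check (a quick sketch), both pct_change and diff can be reproduced from shift:
s = new_df['y']
# percent change is a shift in disguise: y(t) / y(t-1) - 1
assert np.allclose(s.pct_change(), s / s.shift(1) - 1, equal_nan=True)
# and so is the first difference: y(t) - y(t-1)
assert np.allclose(s.diff(), s - s.shift(1), equal_nan=True)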
add the transformed y as new columns of the dataframe
new_df['log_y'] = np.log(new_df['y'])
new_df['pct_ch_y'] = new_df['y'].pct_change(periods=1)
new_df['diff_y'] = new_df['y'].diff(periods=1)
new_df['y(t-1)'] = new_df['y'].shift(periods=1)
new_df['y(t+1)'] = new_df['y'].shift(periods=-1)
new_df.head()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-01-01 | 1.770501 | 0.571262 | NaN | NaN | NaN | 2.254528 |
2022-01-02 | 2.254528 | 0.812941 | 0.273384 | 0.484027 | 1.770501 | 2.509331 |
2022-01-03 | 2.509331 | 0.920016 | 0.113019 | 0.254804 | 2.254528 | 2.992274 |
2022-01-04 | 2.992274 | 1.096034 | 0.192459 | 0.482943 | 2.509331 | 3.388413 |
2022-01-05 | 3.388413 | 1.220362 | 0.132387 | 0.396138 | 2.992274 | 2.579468 |
new_df.tail()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-04-06 | 0.251282 | -1.381180 | -0.940454 | -3.968672 | 4.219954 | 4.083058 |
2022-04-07 | 4.083058 | 1.406846 | 15.248907 | 3.831776 | 0.251282 | 3.898357 |
2022-04-08 | 3.898357 | 1.360555 | -0.045236 | -0.184700 | 4.083058 | 1.166749 |
2022-04-09 | 1.166749 | 0.154221 | -0.700708 | -2.731608 | 3.898357 | 3.738808 |
2022-04-10 | 3.738808 | 1.318767 | 2.204467 | 2.572059 | 1.166749 | NaN |
drop missing values (NaN)
new_df.dropna(axis=0, how='any', inplace=True)
new_df.head()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-01-02 | 2.254528 | 0.812941 | 0.273384 | 0.484027 | 1.770501 | 2.509331 |
2022-01-03 | 2.509331 | 0.920016 | 0.113019 | 0.254804 | 2.254528 | 2.992274 |
2022-01-04 | 2.992274 | 1.096034 | 0.192459 | 0.482943 | 2.509331 | 3.388413 |
2022-01-05 | 3.388413 | 1.220362 | 0.132387 | 0.396138 | 2.992274 | 2.579468 |
2022-01-06 | 2.579468 | 0.947583 | -0.238739 | -0.808945 | 3.388413 | 2.257843 |
new_df.tail()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-04-05 | 4.219954 | 1.439824 | 0.624970 | 1.623012 | 2.596942 | 0.251282 |
2022-04-06 | 0.251282 | -1.381180 | -0.940454 | -3.968672 | 4.219954 | 4.083058 |
2022-04-07 | 4.083058 | 1.406846 | 15.248907 | 3.831776 | 0.251282 | 3.898357 |
2022-04-08 | 3.898357 | 1.360555 | -0.045236 | -0.184700 | 4.083058 | 1.166749 |
2022-04-09 | 1.166749 | 0.154221 | -0.700708 | -2.731608 | 3.898357 | 3.738808 |
Rolling Windows
Another common transformation of time series data is computing rolling window statistics, such as the rolling average, rolling sum, and rolling standard deviation.
rolling average for a window of size 7 (weekly average of daily data)
window_size = 7
new_df['y'].rolling(window=window_size).mean().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 3.106036
2022-01-08 3.224076
2022-01-09 3.271001
Freq: D, Name: y, dtype: float64
note that the first 6 observations are missing and the 7th one is the average of the first 7 values. Look at the following cell to make sure you understand how the rolling window values are computed
print(new_df.iloc[:7].loc[:, 'y'].mean())
print(new_df.iloc[1:8].loc[:, 'y'].mean())
print(new_df.iloc[2:9].loc[:, 'y'].mean())
3.106035602883829
3.2240756001112376
3.271000970304381
and if we use the center=True option
new_df['y'].rolling(window=window_size, center=True).mean().head(6)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 3.106036
2022-01-05 3.224076
2022-01-06 3.271001
Freq: D, Name: y, dtype: float64
rolling sum for a window of size 7 (weekly sum of daily data)
new_df['y'].rolling(window=window_size).sum().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 21.742249
2022-01-08 22.568529
2022-01-09 22.897007
Freq: D, Name: y, dtype: float64
rolling standard deviation for a window of size 7 (weekly standard deviation of daily data)
new_df['y'].rolling(window=window_size).std().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 0.447562
2022-01-08 0.643618
2022-01-09 0.648062
Freq: D, Name: y, dtype: float64
See pandas doc for more info and examples of using the rolling method
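One more option worth knowing about (a quick sketch): min_periods sets how many observations a window needs before it produces a value, which removes the leading NaNs:
# each window uses however many observations are available so far
new_df['y'].rolling(window=window_size, min_periods=1).mean().head(3)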