Pandas for time series data

In this notebook, we’re going to introduce some of the basic Pandas functionality for working with time series data.

We will cover how to:

  • set a datetime index on a dataframe

  • filter data using dates

  • temporally aggregate time series from higher to lower frequencies

  • apply some common transformations (log, percent change, difference)

  • apply rolling window calculations

import pandas as pd
import numpy as np

Generate some random data using numpy (see the numpy doc)

# old way using the legacy np.random interface (type `?np.random.normal` for more info)
n_periods = 100
data = np.random.normal(loc=0, scale=1, size=n_periods)
# new (recommended) way
from numpy.random import default_rng

rng = default_rng()  # pass an integer seed, e.g. default_rng(42), for reproducible results
data = rng.normal(loc=0, scale=1, size=n_periods)

Note

loc=0, scale=1 are the default values, so they could be omitted. We could also use the rng.standard_normal(size=n_periods) method
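
For reference, a minimal sketch of the two equivalent calls (using the rng object created above):

# equivalent ways to draw standard-normal data with the Generator API
data_a = rng.normal(size=n_periods)           # loc=0, scale=1 are the defaults
data_b = rng.standard_normal(size=n_periods)  # dedicated standard-normal method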

The datetime index

time_index = pd.date_range(start='2022-01-01', periods=n_periods, freq='D')

freq='D' sets the frequency to daily. Some other options that will be useful for us are:

  • freq='M' - month end frequency

  • freq='MS' - month start frequency

  • freq='Q' - quarter end frequency

  • freq='QS' - quarter start frequency

  • freq='Y' or freq='A' - year end frequency

  • freq='YS' or freq='AS' - year start frequency

see the pandas doc for the full list of options (note that in recent pandas versions, 2.2 and later, the end-of-period aliases have been renamed: 'M' → 'ME', 'Q' → 'QE', 'Y' → 'YE')
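
For instance, a quick sketch contrasting month-start and month-end stamps (the printed dates are the point of interest):

pd.date_range(start='2022-01-01', periods=3, freq='MS')  # 2022-01-01, 2022-02-01, 2022-03-01
pd.date_range(start='2022-01-01', periods=3, freq='M')   # 2022-01-31, 2022-02-28, 2022-03-31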

df = pd.DataFrame(data=data, index=time_index, columns=['data'])
df.head()
data
2022-01-01 -0.693954
2022-01-02 -0.209926
2022-01-03 0.044877
2022-01-04 0.527820
2022-01-05 0.923959
type(df.index), type(df['data'])
(pandas.core.indexes.datetimes.DatetimeIndex, pandas.core.series.Series)

Having a datetime index lets us perform time-aware selection and manipulations very easily

  • for instance, we can get the observations from a given month

  • and compute some statistics about them

df.loc['2022-04']
data
2022-04-01 0.072442
2022-04-02 2.313431
2022-04-03 -0.774668
2022-04-04 0.132488
2022-04-05 1.755500
2022-04-06 -2.213172
2022-04-07 1.618603
2022-04-08 1.433903
2022-04-09 -1.297705
2022-04-10 1.274354
print(f"the mean is {df.loc['2022-04', 'data'].mean()}")
print(f"the min and max values are {df.loc['2022-04', 'data'].min()} and {df.loc['2022-04', 'data'].max()}")
the mean is 0.43151774216027733
the min and max values are -2.2131721696816187 and 2.3134305670436035

An aside:

These are examples of using f-strings. We can apply formatting to get a more readable output. The next cell shows how to print only two decimal places.

See f-string doc for more info

print(f"the mean is {df.loc['2022-04', 'data'].mean():.2f}")
print(f"the min and max values are {df.loc['2022-04', 'data'].min():.2f} and {df.loc['2022-04', 'data'].max():.2f}")
the mean is 0.43
the min and max values are -2.21 and 2.31
  • all data starting at a given date (note that it is inclusive)

df.loc['2022-03-31':]
data
2022-03-31 0.168551
2022-04-01 0.072442
2022-04-02 2.313431
2022-04-03 -0.774668
2022-04-04 0.132488
2022-04-05 1.755500
2022-04-06 -2.213172
2022-04-07 1.618603
2022-04-08 1.433903
2022-04-09 -1.297705
2022-04-10 1.274354
  • all data before a given date (note that it is inclusive)

df.loc[:'2022-01-12']
data
2022-01-01 -0.693954
2022-01-02 -0.209926
2022-01-03 0.044877
2022-01-04 0.527820
2022-01-05 0.923959
2022-01-06 0.115014
2022-01-07 -0.206611
2022-01-08 0.791693
2022-01-09 -0.271810
2022-01-10 -0.125983
2022-01-11 -1.379321
2022-01-12 -0.972250
  • all data between two dates (note that it is inclusive on both ends)

df.loc['2022-01-12':'2022-02-12']
data
2022-01-12 -0.972250
2022-01-13 -1.053409
2022-01-14 -1.650255
2022-01-15 -1.251144
2022-01-16 0.725652
2022-01-17 2.123418
2022-01-18 -0.367741
2022-01-19 1.032479
2022-01-20 0.449727
2022-01-21 -1.204295
2022-01-22 -0.008893
2022-01-23 1.236906
2022-01-24 -1.444002
2022-01-25 0.987860
2022-01-26 0.121889
2022-01-27 1.140643
2022-01-28 1.076552
2022-01-29 0.184321
2022-01-30 -0.050446
2022-01-31 0.589317
2022-02-01 0.466329
2022-02-02 -1.049473
2022-02-03 1.206759
2022-02-04 -0.840185
2022-02-05 1.142660
2022-02-06 -1.486865
2022-02-07 1.243011
2022-02-08 0.489471
2022-02-09 -0.080931
2022-02-10 0.098374
2022-02-11 -0.368505
2022-02-12 -2.364454

Temporal aggregation

This is when we move from a shorter to a longer period. For example, daily data can be aggregated to weekly, monthly, quarterly, or annual frequency. Typically, aggregation is done by taking the sum or the average.

See pandas doc for more info and examples of using the resample method

df.resample('M').sum()
data
2022-01-31 1.182088
2022-02-28 -9.062631
2022-03-31 7.715243
2022-04-30 4.315177
df.resample('M').mean()
data
2022-01-31 0.038132
2022-02-28 -0.323665
2022-03-31 0.248879
2022-04-30 0.431518
  • aggregation can also be done by taking the first/last observation of the higher-frequency (daily here) data

df.resample('M').first()
data
2022-01-31 -0.693954
2022-02-28 0.466329
2022-03-31 -0.360594
2022-04-30 0.072442
df.resample('M').last()
data
2022-01-31 0.589317
2022-02-28 -1.773561
2022-03-31 0.168551
2022-04-30 1.274354
  • we can apply multiple functions at the same time using the agg method

df.resample('M').agg(['mean', 'sum', 'min', 'max', 'first', 'last'])
data
mean sum min max first last
2022-01-31 0.038132 1.182088 -1.650255 2.123418 -0.693954 0.589317
2022-02-28 -0.323665 -9.062631 -2.364454 1.305562 0.466329 -1.773561
2022-03-31 0.248879 7.715243 -2.146517 2.109419 -0.360594 0.168551
2022-04-30 0.431518 4.315177 -2.213172 2.313431 0.072442 1.274354

Note

we could have the monthly index fall on the first day of each month by using df.resample('MS'), as sketched below
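
A minimal sketch (same monthly aggregation, month-start stamps):

# same monthly means as above, but each index entry is the first day of the month
df.resample('MS').mean().head()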

Pivot Tables

  • create a new monthly time series for 2 years

monthly_index = pd.date_range(start='2021-01-01', end='2022-12-31', freq='M')
len(monthly_index)
24
df_monthly = pd.DataFrame(data=np.random.normal(loc=0, scale=1, size=len(monthly_index)),
                          columns=['data',],
                          index=monthly_index)
df_monthly.head()
data
2021-01-31 0.484735
2021-02-28 0.687497
2021-03-31 0.147638
2021-04-30 0.404760
2021-05-31 -0.883069
df_monthly['month'] = df_monthly.index.month
df_monthly['quarter'] = df_monthly.index.quarter
df_monthly['year'] = df_monthly.index.year
df_monthly
data month quarter year
2021-01-31 0.484735 1 1 2021
2021-02-28 0.687497 2 1 2021
2021-03-31 0.147638 3 1 2021
2021-04-30 0.404760 4 2 2021
2021-05-31 -0.883069 5 2 2021
2021-06-30 -0.854109 6 2 2021
2021-07-31 -1.276728 7 3 2021
2021-08-31 0.254640 8 3 2021
2021-09-30 0.577472 9 3 2021
2021-10-31 -1.186852 10 4 2021
2021-11-30 -0.664026 11 4 2021
2021-12-31 -0.321033 12 4 2021
2022-01-31 0.832603 1 1 2022
2022-02-28 -0.773677 2 1 2022
2022-03-31 0.319106 3 1 2022
2022-04-30 -0.745475 4 2 2022
2022-05-31 0.898935 5 2 2022
2022-06-30 -0.431810 6 2 2022
2022-07-31 -1.345803 7 3 2022
2022-08-31 0.143264 8 3 2022
2022-09-30 -0.004844 9 3 2022
2022-10-31 1.102417 10 4 2022
2022-11-30 0.713127 11 4 2022
2022-12-31 0.347538 12 4 2022
  • we can use the pivot_table method to see the data for each month by year

pd.pivot_table(df_monthly, 
               index='month', 
               columns="year", 
               values='data')
year 2021 2022
month
1 0.484735 0.832603
2 0.687497 -0.773677
3 0.147638 0.319106
4 0.404760 -0.745475
5 -0.883069 0.898935
6 -0.854109 -0.431810
7 -1.276728 -1.345803
8 0.254640 0.143264
9 0.577472 -0.004844
10 -1.186852 1.102417
11 -0.664026 0.713127
12 -0.321033 0.347538
  • a more useful example of pivot_table is to compute a summary statistic at some level of aggregation

  • for example average value by quarter

pd.pivot_table(df_monthly, 
               index='quarter', 
               columns="year", 
               values='data',
               aggfunc='mean')  # the string 'mean' is preferred over np.mean in recent pandas versions
year 2021 2022
quarter
1 0.439957 0.126010
2 -0.444139 -0.092783
3 -0.148205 -0.402461
4 -0.723971 0.721027
  • compare with using the groupby method

df_monthly[['year', 'quarter', 'data']].groupby(['year', 'quarter']).mean()
data
year quarter
2021 1 0.439957
2 -0.444139
3 -0.148205
4 -0.723971
2022 1 0.126010
2 -0.092783
3 -0.402461
4 0.721027
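
As an aside, the same table as the pivot above can be produced from the groupby result by unstacking the year level (a quick sketch):

# move the 'year' level of the group index into the columns,
# reproducing the quarter-by-year table above
df_monthly.groupby(['year', 'quarter'])['data'].mean().unstack('year')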

Transformations

log-transformation

The log is defined for positive values only. To ensure that all values are positive, below I add a constant to the values in the data column. I also create a new dataframe and rename the column to y.

min_value = np.abs(df.min())
new_df = df + min_value + 0.1  # shift every value up so the minimum of the data column becomes 0.1
new_df.columns = ['y']
new_df.head()
y
2022-01-01 1.770501
2022-01-02 2.254528
2022-01-03 2.509331
2022-01-04 2.992274
2022-01-05 3.388413
# log of y
np.log(new_df['y'])
2022-01-01    0.571262
2022-01-02    0.812941
2022-01-03    0.920016
2022-01-04    1.096034
2022-01-05    1.220362
                ...   
2022-04-06   -1.381180
2022-04-07    1.406846
2022-04-08    1.360555
2022-04-09    0.154221
2022-04-10    1.318767
Freq: D, Name: y, Length: 100, dtype: float64

percent change from last period

new_df.pct_change(periods=1)
y
2022-01-01 NaN
2022-01-02 0.273384
2022-01-03 0.113019
2022-01-04 0.192459
2022-01-05 0.132387
... ...
2022-04-06 -0.940454
2022-04-07 15.248907
2022-04-08 -0.045236
2022-04-09 -0.700708
2022-04-10 2.204467

100 rows × 1 columns
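
Under the hood, pct_change(periods=1) is just the one-period ratio minus one; a quick check (using the new_df defined above):

# pct_change(1) is equivalent to y(t) / y(t-1) - 1
manual = new_df['y'] / new_df['y'].shift(1) - 1
print(np.allclose(manual, new_df['y'].pct_change(periods=1), equal_nan=True))  # True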

difference relative to last period

new_df.diff(periods=1)
y
2022-01-01 NaN
2022-01-02 0.484027
2022-01-03 0.254804
2022-01-04 0.482943
2022-01-05 0.396138
... ...
2022-04-06 -3.968672
2022-04-07 3.831776
2022-04-08 -0.184700
2022-04-09 -2.731608
2022-04-10 2.572059

100 rows × 1 columns

lagged values \(y(t-1)\)

shifts the values forward by a number of periods (one here), so each date shows the previous period's value

new_df.shift(1)
y
2022-01-01 NaN
2022-01-02 1.770501
2022-01-03 2.254528
2022-01-04 2.509331
2022-01-05 2.992274
... ...
2022-04-06 4.219954
2022-04-07 0.251282
2022-04-08 4.083058
2022-04-09 3.898357
2022-04-10 1.166749

100 rows × 1 columns

lead values \(y(t+1)\)

shifts the values backward, so each date shows the next period's value

new_df.shift(-1)
y
2022-01-01 2.254528
2022-01-02 2.509331
2022-01-03 2.992274
2022-01-04 3.388413
2022-01-05 2.579468
... ...
2022-04-06 4.083058
2022-04-07 3.898357
2022-04-08 1.166749
2022-04-09 3.738808
2022-04-10 NaN

100 rows × 1 columns
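
Note that shift moves the values against a fixed index. To move the timestamps themselves, keeping each value attached to its row, pass the freq argument; a brief sketch:

# every timestamp moves one day later and no NaNs are introduced
new_df.shift(periods=1, freq='D').head()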

  • add the transformed y as new columns of the dataframe

new_df['log_y'] = np.log(new_df['y'])
new_df['pct_ch_y'] = new_df['y'].pct_change(periods=1)
new_df['diff_y'] = new_df['y'].diff(periods=1)

new_df['y(t-1)'] = new_df['y'].shift(periods=1)
new_df['y(t+1)'] = new_df['y'].shift(periods=-1)
new_df.head()
y log_y pct_ch_y diff_y y(t-1) y(t+1)
2022-01-01 1.770501 0.571262 NaN NaN NaN 2.254528
2022-01-02 2.254528 0.812941 0.273384 0.484027 1.770501 2.509331
2022-01-03 2.509331 0.920016 0.113019 0.254804 2.254528 2.992274
2022-01-04 2.992274 1.096034 0.192459 0.482943 2.509331 3.388413
2022-01-05 3.388413 1.220362 0.132387 0.396138 2.992274 2.579468
new_df.tail()
y log_y pct_ch_y diff_y y(t-1) y(t+1)
2022-04-06 0.251282 -1.381180 -0.940454 -3.968672 4.219954 4.083058
2022-04-07 4.083058 1.406846 15.248907 3.831776 0.251282 3.898357
2022-04-08 3.898357 1.360555 -0.045236 -0.184700 4.083058 1.166749
2022-04-09 1.166749 0.154221 -0.700708 -2.731608 3.898357 3.738808
2022-04-10 3.738808 1.318767 2.204467 2.572059 1.166749 NaN
  • drop missing values (NaN)

new_df.dropna(axis=0, how='any', inplace=True)
new_df.head()
y log_y pct_ch_y diff_y y(t-1) y(t+1)
2022-01-02 2.254528 0.812941 0.273384 0.484027 1.770501 2.509331
2022-01-03 2.509331 0.920016 0.113019 0.254804 2.254528 2.992274
2022-01-04 2.992274 1.096034 0.192459 0.482943 2.509331 3.388413
2022-01-05 3.388413 1.220362 0.132387 0.396138 2.992274 2.579468
2022-01-06 2.579468 0.947583 -0.238739 -0.808945 3.388413 2.257843
new_df.tail()
y log_y pct_ch_y diff_y y(t-1) y(t+1)
2022-04-05 4.219954 1.439824 0.624970 1.623012 2.596942 0.251282
2022-04-06 0.251282 -1.381180 -0.940454 -3.968672 4.219954 4.083058
2022-04-07 4.083058 1.406846 15.248907 3.831776 0.251282 3.898357
2022-04-08 3.898357 1.360555 -0.045236 -0.184700 4.083058 1.166749
2022-04-09 1.166749 0.154221 -0.700708 -2.731608 3.898357 3.738808

Rolling Windows

Another common transformation of time series data is computing rolling window statistics, such as the rolling average, rolling sum, and rolling standard deviation

  • rolling average for a window of size 7 (weekly average of daily data)

window_size = 7
new_df['y'].rolling(window=window_size).mean().head(9)
2022-01-01         NaN
2022-01-02         NaN
2022-01-03         NaN
2022-01-04         NaN
2022-01-05         NaN
2022-01-06         NaN
2022-01-07    3.106036
2022-01-08    3.224076
2022-01-09    3.271001
Freq: D, Name: y, dtype: float64
  • note that the first 6 observations are missing and the 7th one is the average of the first 7 values. Look at the following cell to make sure you understand how the rolling window values are computed

print(new_df.iloc[:7].loc[:, 'y'].mean())
print(new_df.iloc[1:8].loc[:, 'y'].mean())
print(new_df.iloc[2:9].loc[:, 'y'].mean())
3.106035602883829
3.2240756001112376
3.271000970304381
  • and if we use the center=True option, the window is centered on each date (3 observations before, the date itself, and 3 after), so the first non-missing value appears at the 4th observation

new_df['y'].rolling(window=window_size, center=True).mean().head(6)
2022-01-01         NaN
2022-01-02         NaN
2022-01-03         NaN
2022-01-04    3.106036
2022-01-05    3.224076
2022-01-06    3.271001
Freq: D, Name: y, dtype: float64
  • rolling sum for a window of size 7 (weekly sum of daily data)

new_df['y'].rolling(window=window_size).sum().head(9)
2022-01-01          NaN
2022-01-02          NaN
2022-01-03          NaN
2022-01-04          NaN
2022-01-05          NaN
2022-01-06          NaN
2022-01-07    21.742249
2022-01-08    22.568529
2022-01-09    22.897007
Freq: D, Name: y, dtype: float64
  • rolling standard deviation for a window of size 7 (weekly standard deviation of daily data)

new_df['y'].rolling(window=window_size).std().head(9)
2022-01-01         NaN
2022-01-02         NaN
2022-01-03         NaN
2022-01-04         NaN
2022-01-05         NaN
2022-01-06         NaN
2022-01-07    0.447562
2022-01-08    0.643618
2022-01-09    0.648062
Freq: D, Name: y, dtype: float64
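
As with resample, several rolling statistics can be computed in one call using the agg method:

# rolling mean, sum, and standard deviation in one pass
new_df['y'].rolling(window=window_size).agg(['mean', 'sum', 'std']).head(9)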

See the pandas doc for more info and examples of using the rolling method