Pandas for time series data
In this notebook, we're going to introduce some of the basic Pandas functionality for working with time series data.
We will cover how to:
set a datetime index on a DataFrame
filter data using dates
temporally aggregate time series from higher to lower frequencies
apply some common transformations (log, percent change, difference)
apply rolling window calculations
import pandas as pd
import numpy as np
generate some random data using numpy (see numpy doc)
# old way using the np.random.normal function (type `?np.random.normal` for more info)
n_periods = 100
data = np.random.normal(loc=0, scale=1, size=n_periods)
# new (recommended) way
n_periods = 100
from numpy.random import default_rng
rng = default_rng()
data = rng.normal(loc=0, scale=1, size=n_periods)
Note
loc=0, scale=1 are the default values, so they could be omitted. We could also use the rng.standard_normal(size=n_periods) method.
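For example, these two calls draw from the same distribution (a quick sketch; the seed is chosen arbitrarily to make the example reproducible):
rng = default_rng(seed=42)       # seeding makes the draws reproducible
a = rng.normal(size=5)           # loc=0, scale=1 are the defaults
b = rng.standard_normal(size=5)  # dedicated method for N(0, 1) draws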
The datetime index
time_index = pd.date_range(start='2022-01-01', periods = n_periods, freq='D')
freq='D' sets the frequency to daily. Some other options that are useful for us:
freq='M' - month end frequency
freq='MS' - month start frequency
freq='Q' - quarter end frequency
freq='QS' - quarter start frequency
freq='Y' or freq='A' - year end frequency
freq='YS' or freq='AS' - year start frequency
see pandas doc for the full list of options
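A quick sketch of a couple of these frequencies (expected output shown as comments):
pd.date_range(start='2022-01-01', periods=3, freq='MS')
# DatetimeIndex(['2022-01-01', '2022-02-01', '2022-03-01'], dtype='datetime64[ns]', freq='MS')
pd.date_range(start='2022-01-01', periods=3, freq='QS')
# DatetimeIndex(['2022-01-01', '2022-04-01', '2022-07-01'], dtype='datetime64[ns]', freq='QS')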
df = pd.DataFrame(data = data, index=time_index, columns=['data',])
df.head()
| data |
---|---|
2022-01-01 | -0.693954 |
2022-01-02 | -0.209926 |
2022-01-03 | 0.044877 |
2022-01-04 | 0.527820 |
2022-01-05 | 0.923959 |
type(df.index), type(df['data'])
(pandas.core.indexes.datetimes.DatetimeIndex, pandas.core.series.Series)
Having a datetime index lets us perform time-aware selection and manipulation very easily. For instance, we can get the observations from a given month and compute some statistics about them.
df.loc['2022-04']
| data |
---|---|
2022-04-01 | 0.072442 |
2022-04-02 | 2.313431 |
2022-04-03 | -0.774668 |
2022-04-04 | 0.132488 |
2022-04-05 | 1.755500 |
2022-04-06 | -2.213172 |
2022-04-07 | 1.618603 |
2022-04-08 | 1.433903 |
2022-04-09 | -1.297705 |
2022-04-10 | 1.274354 |
print(f"the mean is {df.loc['2022-04'].mean()[0]}")
print(f"the min and max values are {df.loc['2022-04'].min()[0]} and {df.loc['2022-04'].max()[0]}")
the mean is 0.43151774216027733
the min and max values are -2.2131721696816187 and 2.3134305670436035
An aside:
These are examples of using f-strings. We can apply formatting to get a more readable output. The next cell shows how to print only two decimal places.
See f-string doc for more info
print(f"the mean is {df.loc['2022-04'].mean()[0]:.2f}")
print(f"the min and max values are {df.loc['2022-04'].min()[0]:.2f} and {df.loc['2022-04'].max()[0]:.2f}")
the mean is 0.43
the min and max values are -2.21 and 2.31
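A couple more format specs that often come in handy (a quick sketch, not exhaustive):
x = 0.43151774216027733
print(f"{x:.2f}")    # 0.43 -> two decimal places
print(f"{x:10.2f}")  # right-aligned in a field of width 10
print(f"{x:+.2%}")   # +43.15% -> signed percentage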
all data starting at a given date (note that it is inclusive)
df.loc['2022-03-31':]
| data |
---|---|
2022-03-31 | 0.168551 |
2022-04-01 | 0.072442 |
2022-04-02 | 2.313431 |
2022-04-03 | -0.774668 |
2022-04-04 | 0.132488 |
2022-04-05 | 1.755500 |
2022-04-06 | -2.213172 |
2022-04-07 | 1.618603 |
2022-04-08 | 1.433903 |
2022-04-09 | -1.297705 |
2022-04-10 | 1.274354 |
all data before a given date (note that it is inclusive)
df.loc[:'2022-01-12']
| data |
---|---|
2022-01-01 | -0.693954 |
2022-01-02 | -0.209926 |
2022-01-03 | 0.044877 |
2022-01-04 | 0.527820 |
2022-01-05 | 0.923959 |
2022-01-06 | 0.115014 |
2022-01-07 | -0.206611 |
2022-01-08 | 0.791693 |
2022-01-09 | -0.271810 |
2022-01-10 | -0.125983 |
2022-01-11 | -1.379321 |
2022-01-12 | -0.972250 |
all data between two dates (note that it is inclusive on both ends)
df.loc['2022-01-12':'2022-02-12']
| data |
---|---|
2022-01-12 | -0.972250 |
2022-01-13 | -1.053409 |
2022-01-14 | -1.650255 |
2022-01-15 | -1.251144 |
2022-01-16 | 0.725652 |
2022-01-17 | 2.123418 |
2022-01-18 | -0.367741 |
2022-01-19 | 1.032479 |
2022-01-20 | 0.449727 |
2022-01-21 | -1.204295 |
2022-01-22 | -0.008893 |
2022-01-23 | 1.236906 |
2022-01-24 | -1.444002 |
2022-01-25 | 0.987860 |
2022-01-26 | 0.121889 |
2022-01-27 | 1.140643 |
2022-01-28 | 1.076552 |
2022-01-29 | 0.184321 |
2022-01-30 | -0.050446 |
2022-01-31 | 0.589317 |
2022-02-01 | 0.466329 |
2022-02-02 | -1.049473 |
2022-02-03 | 1.206759 |
2022-02-04 | -0.840185 |
2022-02-05 | 1.142660 |
2022-02-06 | -1.486865 |
2022-02-07 | 1.243011 |
2022-02-08 | 0.489471 |
2022-02-09 | -0.080931 |
2022-02-10 | 0.098374 |
2022-02-11 | -0.368505 |
2022-02-12 | -2.364454 |
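The same selection can also be written with an explicit boolean mask on the index, which is handy when the condition is more complex than a plain date range (a sketch of equivalent code):
# equivalent to df.loc['2022-01-12':'2022-02-12']
mask = (df.index >= '2022-01-12') & (df.index <= '2022-02-12')
df.loc[mask]
# e.g. only the Mondays in that range
df.loc[mask & (df.index.dayofweek == 0)]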
Temporal aggregation
This is when we move from a shorter to a longer period. For example, daily data can be aggregated to weekly, monthly, quarterly, or annual frequency. Typically, aggregation is done by taking the sum or the average.
See pandas doc for more info and examples of using the resample method
df.resample('M').sum()
| data |
---|---|
2022-01-31 | 1.182088 |
2022-02-28 | -9.062631 |
2022-03-31 | 7.715243 |
2022-04-30 | 4.315177 |
df.resample('M').mean()
| data |
---|---|
2022-01-31 | 0.038132 |
2022-02-28 | -0.323665 |
2022-03-31 | 0.248879 |
2022-04-30 | 0.431518 |
aggregation can also be done by taking the first/last observation of the higher-frequency (daily here) data
df.resample('M').first()
| data |
---|---|
2022-01-31 | -0.693954 |
2022-02-28 | 0.466329 |
2022-03-31 | -0.360594 |
2022-04-30 | 0.072442 |
df.resample('M').last()
| data |
---|---|
2022-01-31 | 0.589317 |
2022-02-28 | -1.773561 |
2022-03-31 | 0.168551 |
2022-04-30 | 1.274354 |
we can apply multiple functions at the same time using the agg method
df.resample('M').agg(['mean', 'sum', 'min', 'max', 'first', 'last'])
| mean | sum | min | max | first | last |
---|---|---|---|---|---|---|
2022-01-31 | 0.038132 | 1.182088 | -1.650255 | 2.123418 | -0.693954 | 0.589317 |
2022-02-28 | -0.323665 | -9.062631 | -2.364454 | 1.305562 | 0.466329 | -1.773561 |
2022-03-31 | 0.248879 | 7.715243 | -2.146517 | 2.109419 | -0.360594 | 0.168551 |
2022-04-30 | 0.431518 | 4.315177 | -2.213172 | 2.313431 | 0.072442 | 1.274354 |
Note
we could have the monthly index fall on the first day of each month by using df.resample('MS')
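For instance (a quick sketch):
df.resample('MS').mean().head(2)  # index falls on 2022-01-01, 2022-02-01, ...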
Pivot Tables
create a new monthly time series for 2 years
monthly_index = pd.date_range(start='2021-01-01', end='2022-12-31', freq='M')
len(monthly_index)
24
df_monthly = pd.DataFrame(data=np.random.normal(loc=0, scale=1, size=len(monthly_index)),
                          columns=['data',],
                          index=monthly_index)
df_monthly.head()
| data |
---|---|
2021-01-31 | 0.484735 |
2021-02-28 | 0.687497 |
2021-03-31 | 0.147638 |
2021-04-30 | 0.404760 |
2021-05-31 | -0.883069 |
df_monthly['month'] = df_monthly.index.month
df_monthly['quarter'] = df_monthly.index.quarter
df_monthly['year'] = df_monthly.index.year
df_monthly
| data | month | quarter | year |
---|---|---|---|---|
2021-01-31 | 0.484735 | 1 | 1 | 2021 |
2021-02-28 | 0.687497 | 2 | 1 | 2021 |
2021-03-31 | 0.147638 | 3 | 1 | 2021 |
2021-04-30 | 0.404760 | 4 | 2 | 2021 |
2021-05-31 | -0.883069 | 5 | 2 | 2021 |
2021-06-30 | -0.854109 | 6 | 2 | 2021 |
2021-07-31 | -1.276728 | 7 | 3 | 2021 |
2021-08-31 | 0.254640 | 8 | 3 | 2021 |
2021-09-30 | 0.577472 | 9 | 3 | 2021 |
2021-10-31 | -1.186852 | 10 | 4 | 2021 |
2021-11-30 | -0.664026 | 11 | 4 | 2021 |
2021-12-31 | -0.321033 | 12 | 4 | 2021 |
2022-01-31 | 0.832603 | 1 | 1 | 2022 |
2022-02-28 | -0.773677 | 2 | 1 | 2022 |
2022-03-31 | 0.319106 | 3 | 1 | 2022 |
2022-04-30 | -0.745475 | 4 | 2 | 2022 |
2022-05-31 | 0.898935 | 5 | 2 | 2022 |
2022-06-30 | -0.431810 | 6 | 2 | 2022 |
2022-07-31 | -1.345803 | 7 | 3 | 2022 |
2022-08-31 | 0.143264 | 8 | 3 | 2022 |
2022-09-30 | -0.004844 | 9 | 3 | 2022 |
2022-10-31 | 1.102417 | 10 | 4 | 2022 |
2022-11-30 | 0.713127 | 11 | 4 | 2022 |
2022-12-31 | 0.347538 | 12 | 4 | 2022 |
we can use the pivot_table method to see the data for each month by year
pd.pivot_table(df_monthly,
               index='month',
               columns="year",
               values='data')
month | 2021 | 2022 |
---|---|---|
1 | 0.484735 | 0.832603 |
2 | 0.687497 | -0.773677 |
3 | 0.147638 | 0.319106 |
4 | 0.404760 | -0.745475 |
5 | -0.883069 | 0.898935 |
6 | -0.854109 | -0.431810 |
7 | -1.276728 | -1.345803 |
8 | 0.254640 | 0.143264 |
9 | 0.577472 | -0.004844 |
10 | -1.186852 | 1.102417 |
11 | -0.664026 | 0.713127 |
12 | -0.321033 | 0.347538 |
a more useful example of pivot_table is to compute a summary statistic at some level of aggregation, for example the average value by quarter
pd.pivot_table(df_monthly,
               index='quarter',
               columns="year",
               values='data',
               aggfunc=np.mean)
quarter | 2021 | 2022 |
---|---|---|
1 | 0.439957 | 0.126010 |
2 | -0.444139 | -0.092783 |
3 | -0.148205 | -0.402461 |
4 | -0.723971 | 0.721027 |
compare with using the groupby method
df_monthly[['year', 'quarter', 'data']].groupby(['year', 'quarter']).mean()
year | quarter | data |
---|---|---|
2021 | 1 | 0.439957 |
2 | -0.444139 | |
3 | -0.148205 | |
4 | -0.723971 | |
2022 | 1 | 0.126010 |
2 | -0.092783 | |
3 | -0.402461 | |
4 | 0.721027 |
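As a sketch of how the two relate, unstacking the year level of the groupby result reproduces the pivot table layout:
# move 'year' from the row index to the columns; same numbers as the pivot table above
df_monthly.groupby(['year', 'quarter'])['data'].mean().unstack('year')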
Transformations
log-transformation
the log is defined for positive values only. To ensure that all values are positive, below I add a constant to the values in the data column. I also create a new dataframe and rename the column to y.
min_value = np.abs(df.min())
new_df = df + min_value + 0.1 # this adds min_value + 0.1 to all elements of the data column of df
new_df.columns = ['y']
new_df.head()
| y |
---|---|
2022-01-01 | 1.770501 |
2022-01-02 | 2.254528 |
2022-01-03 | 2.509331 |
2022-01-04 | 2.992274 |
2022-01-05 | 3.388413 |
## log of y
np.log(new_df['y'])
2022-01-01 0.571262
2022-01-02 0.812941
2022-01-03 0.920016
2022-01-04 1.096034
2022-01-05 1.220362
...
2022-04-06 -1.381180
2022-04-07 1.406846
2022-04-08 1.360555
2022-04-09 0.154221
2022-04-10 1.318767
Freq: D, Name: y, Length: 100, dtype: float64
percent change from last period
new_df.pct_change(periods=1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 0.273384 |
2022-01-03 | 0.113019 |
2022-01-04 | 0.192459 |
2022-01-05 | 0.132387 |
... | ... |
2022-04-06 | -0.940454 |
2022-04-07 | 15.248907 |
2022-04-08 | -0.045236 |
2022-04-09 | -0.700708 |
2022-04-10 | 2.204467 |
100 rows × 1 columns
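A connection to the log transform above (a quick sketch): the first difference of the log series approximates the percent change when period-to-period changes are small, since log(y(t)/y(t-1)) ≈ y(t)/y(t-1) - 1. With this random series some changes are large, so the two can differ noticeably:
# log difference vs percent change; close only where changes are small
pd.DataFrame({'pct_change': new_df['y'].pct_change(),
              'log_diff': np.log(new_df['y']).diff()}).head()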
difference relative to last period
new_df.diff(periods=1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 0.484027 |
2022-01-03 | 0.254804 |
2022-01-04 | 0.482943 |
2022-01-05 | 0.396138 |
... | ... |
2022-04-06 | -3.968672 |
2022-04-07 | 3.831776 |
2022-04-08 | -0.184700 |
2022-04-09 | -2.731608 |
2022-04-10 | 2.572059 |
100 rows × 1 columns
lagged values \(y(t-1)\)
shift moves the values forward by a given number of periods (one here), so each row shows the previous period's value
new_df.shift(1)
| y |
---|---|
2022-01-01 | NaN |
2022-01-02 | 1.770501 |
2022-01-03 | 2.254528 |
2022-01-04 | 2.509331 |
2022-01-05 | 2.992274 |
... | ... |
2022-04-06 | 4.219954 |
2022-04-07 | 0.251282 |
2022-04-08 | 4.083058 |
2022-04-09 | 3.898357 |
2022-04-10 | 1.166749 |
100 rows × 1 columns
lead values \(y(t+1)\)
a negative shift moves the values backward, so each row shows the next period's value
new_df.shift(-1)
| y |
---|---|
2022-01-01 | 2.254528 |
2022-01-02 | 2.509331 |
2022-01-03 | 2.992274 |
2022-01-04 | 3.388413 |
2022-01-05 | 2.579468 |
... | ... |
2022-04-06 | 4.083058 |
2022-04-07 | 3.898357 |
2022-04-08 | 1.166749 |
2022-04-09 | 3.738808 |
2022-04-10 | NaN |
100 rows × 1 columns
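As a sanity check (a quick sketch), both pct_change and diff can be reproduced from shift:
s = new_df['y']
# percent change is a shift in disguise: y(t) / y(t-1) - 1
assert np.allclose(s.pct_change(), s / s.shift(1) - 1, equal_nan=True)
# and so is the first difference: y(t) - y(t-1)
assert np.allclose(s.diff(), s - s.shift(1), equal_nan=True)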
add the transformed y as new columns of the dataframe
new_df['log_y'] = np.log(new_df['y'])
new_df['pct_ch_y'] = new_df['y'].pct_change(periods=1)
new_df['diff_y'] = new_df['y'].diff(periods=1)
new_df['y(t-1)'] = new_df['y'].shift(periods=1)
new_df['y(t+1)'] = new_df['y'].shift(periods=-1)
new_df.head()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-01-01 | 1.770501 | 0.571262 | NaN | NaN | NaN | 2.254528 |
2022-01-02 | 2.254528 | 0.812941 | 0.273384 | 0.484027 | 1.770501 | 2.509331 |
2022-01-03 | 2.509331 | 0.920016 | 0.113019 | 0.254804 | 2.254528 | 2.992274 |
2022-01-04 | 2.992274 | 1.096034 | 0.192459 | 0.482943 | 2.509331 | 3.388413 |
2022-01-05 | 3.388413 | 1.220362 | 0.132387 | 0.396138 | 2.992274 | 2.579468 |
new_df.tail()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-04-06 | 0.251282 | -1.381180 | -0.940454 | -3.968672 | 4.219954 | 4.083058 |
2022-04-07 | 4.083058 | 1.406846 | 15.248907 | 3.831776 | 0.251282 | 3.898357 |
2022-04-08 | 3.898357 | 1.360555 | -0.045236 | -0.184700 | 4.083058 | 1.166749 |
2022-04-09 | 1.166749 | 0.154221 | -0.700708 | -2.731608 | 3.898357 | 3.738808 |
2022-04-10 | 3.738808 | 1.318767 | 2.204467 | 2.572059 | 1.166749 | NaN |
drop missing values (NaN)
new_df.dropna(axis=0, how='any', inplace=True)
new_df.head()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-01-02 | 2.254528 | 0.812941 | 0.273384 | 0.484027 | 1.770501 | 2.509331 |
2022-01-03 | 2.509331 | 0.920016 | 0.113019 | 0.254804 | 2.254528 | 2.992274 |
2022-01-04 | 2.992274 | 1.096034 | 0.192459 | 0.482943 | 2.509331 | 3.388413 |
2022-01-05 | 3.388413 | 1.220362 | 0.132387 | 0.396138 | 2.992274 | 2.579468 |
2022-01-06 | 2.579468 | 0.947583 | -0.238739 | -0.808945 | 3.388413 | 2.257843 |
new_df.tail()
| y | log_y | pct_ch_y | diff_y | y(t-1) | y(t+1) |
---|---|---|---|---|---|---|
2022-04-05 | 4.219954 | 1.439824 | 0.624970 | 1.623012 | 2.596942 | 0.251282 |
2022-04-06 | 0.251282 | -1.381180 | -0.940454 | -3.968672 | 4.219954 | 4.083058 |
2022-04-07 | 4.083058 | 1.406846 | 15.248907 | 3.831776 | 0.251282 | 3.898357 |
2022-04-08 | 3.898357 | 1.360555 | -0.045236 | -0.184700 | 4.083058 | 1.166749 |
2022-04-09 | 1.166749 | 0.154221 | -0.700708 | -2.731608 | 3.898357 | 3.738808 |
Rolling Windows
Another common transformation of time series data is computing rolling window statistics, such as the rolling average, rolling sum, and rolling standard deviation.
rolling average for a window of size 7 (weekly average of daily data)
window_size = 7
new_df['y'].rolling(window=window_size).mean().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 3.106036
2022-01-08 3.224076
2022-01-09 3.271001
Freq: D, Name: y, dtype: float64
note that the first 6 observations are missing and the 7th one is the average of the first 7 values. Look at the following cell to make sure you understand how the rolling window values are computed
print(new_df.iloc[:7].loc[:, 'y'].mean())
print(new_df.iloc[1:8].loc[:, 'y'].mean())
print(new_df.iloc[2:9].loc[:, 'y'].mean())
3.106035602883829
3.2240756001112376
3.271000970304381
and if we use the center=True option
new_df['y'].rolling(window=window_size, center=True).mean().head(6)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 3.106036
2022-01-05 3.224076
2022-01-06 3.271001
Freq: D, Name: y, dtype: float64
rolling sum for a window of size 7 (weekly sum of daily data)
new_df['y'].rolling(window=window_size).sum().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 21.742249
2022-01-08 22.568529
2022-01-09 22.897007
Freq: D, Name: y, dtype: float64
rolling standard deviation for a window of size 7 (weekly standard deviation of daily data)
new_df['y'].rolling(window=window_size).std().head(9)
2022-01-01 NaN
2022-01-02 NaN
2022-01-03 NaN
2022-01-04 NaN
2022-01-05 NaN
2022-01-06 NaN
2022-01-07 0.447562
2022-01-08 0.643618
2022-01-09 0.648062
Freq: D, Name: y, dtype: float64
See pandas doc for more info and examples of using the rolling method
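One more option worth knowing about (a quick sketch): min_periods sets how many observations a window needs before it produces a value, which removes the leading NaNs:
# each window uses however many observations are available so far
new_df['y'].rolling(window=window_size, min_periods=1).mean().head(3)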