{ "cells": [ { "cell_type": "markdown", "id": "a08d0de1-f10d-46bb-9c33-bac4110a663e", "metadata": {}, "source": [ "## Pandas for time series data\n", "\n", "In this notebook, we’re going to introduce some of the basic Pandas functionality for working with time series\n", "\n", "We will cover how to:\n", "\n", "* set datetime index to a dataframe\n", "* filter data using dates\n", "* temporally aggregate time series from higher to lower frequencies\n", "* apply some common transformation (log, percent change, difference)\n", "* apply rolling window calculations" ] }, { "cell_type": "code", "execution_count": 1, "id": "1efe7086-d4f3-4c3e-a5f7-ad4e4a0034b0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "290740d0-a392-437b-a1e6-a4402a6d147d", "metadata": {}, "source": [ "generate some random data using numpy (see [numpy doc](https://numpy.org/doc/stable/reference/random/index.html#random-quick-start))" ] }, { "cell_type": "code", "execution_count": 2, "id": "7ed0e11d-28e0-45bb-b599-0091a616f2a4", "metadata": {}, "outputs": [], "source": [ "# old way using the randn method (type `?np.random.normal` for more info) \n", "\n", "n_periods = 100\n", "data = np.random.normal(loc=0, scale=1, size=n_periods)" ] }, { "cell_type": "code", "execution_count": 3, "id": "7d64b004-e549-43c3-99c1-18b2a81acf6e", "metadata": {}, "outputs": [], "source": [ "# new (recommended) way\n", "\n", "n_periods = 100\n", "from numpy.random import default_rng\n", "rng = default_rng()\n", "data = rng.normal(loc=0, scale=1, size=n_periods)" ] }, { "cell_type": "markdown", "id": "679218f8-a0c3-4aa0-8725-5aaffd7de263", "metadata": {}, "source": [ "```{note}\n", "`loc=0`, `scale=1` are default values so they could be ommitted. We could also use the `rng.standard_normal(size=n_periods)` method\n", "```" ] }, { "cell_type": "markdown", "id": "59262141-21a9-4a3c-8f09-61714d985772", "metadata": {}, "source": [ "## The datetime index" ] }, { "cell_type": "code", "execution_count": 4, "id": "a2c65a71-368e-49de-9276-07276070e16e", "metadata": {}, "outputs": [], "source": [ "time_index = pd.date_range(start='2022-01-01', periods = n_periods, freq='D')" ] }, { "cell_type": "markdown", "id": "be05fd29-168e-4a8d-9aac-12f1bf312979", "metadata": {}, "source": [ "`freq='D'` sets the frequency to daily. Some other useful for us options are:\n", "\n", "* `freq='M'` - month end frequency\n", "* `freq='MS'` - month start frequency\n", "* `freq='Q'` - quarter end frequency\n", "* `freq='QS'` - quarter start frequency\n", "* `freq='Y'` or `freq='A'` - year end frequency\n", "* `freq='YS'` or `freq='AS'` - year start frequency\n", "\n", "see [pandas doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases) for the full list of options" ] }, { "cell_type": "code", "execution_count": 5, "id": "ba47e4af-8a5e-4dd6-b936-745a4531e41f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-01-0.693954
2022-01-02-0.209926
2022-01-030.044877
2022-01-040.527820
2022-01-050.923959
\n", "
" ], "text/plain": [ " data\n", "2022-01-01 -0.693954\n", "2022-01-02 -0.209926\n", "2022-01-03 0.044877\n", "2022-01-04 0.527820\n", "2022-01-05 0.923959" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(data = data, index=time_index, columns=['data',])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "b326eab4-9a18-41e3-ad13-3b575730df64", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(pandas.core.indexes.datetimes.DatetimeIndex, pandas.core.series.Series)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df.index), type(df['data'])" ] }, { "cell_type": "markdown", "id": "7ae771f3-b15d-41c3-a45c-9288f960dce5", "metadata": {}, "source": [ "Having a datetime index lets us perform time-aware selection and manipulations very easily\n", "\n", "* for instance, we can get the observations from a given month\n", "* and compute some statistics about them" ] }, { "cell_type": "code", "execution_count": 7, "id": "2c3d81dd-0b8d-45a1-ad68-1280d3179852", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-04-010.072442
2022-04-022.313431
2022-04-03-0.774668
2022-04-040.132488
2022-04-051.755500
2022-04-06-2.213172
2022-04-071.618603
2022-04-081.433903
2022-04-09-1.297705
2022-04-101.274354
\n", "
" ], "text/plain": [ " data\n", "2022-04-01 0.072442\n", "2022-04-02 2.313431\n", "2022-04-03 -0.774668\n", "2022-04-04 0.132488\n", "2022-04-05 1.755500\n", "2022-04-06 -2.213172\n", "2022-04-07 1.618603\n", "2022-04-08 1.433903\n", "2022-04-09 -1.297705\n", "2022-04-10 1.274354" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2022-04']" ] }, { "cell_type": "code", "execution_count": 8, "id": "f6eb55e3-1eae-42d3-9586-8041838a67e6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the mean is 0.43151774216027733\n", "the min and max values are -2.2131721696816187 and 2.3134305670436035\n" ] } ], "source": [ "print(f\"the mean is {df.loc['2022-04'].mean()[0]}\")\n", "print(f\"the min and max values are {df.loc['2022-04'].min()[0]} and {df.loc['2022-04'].max()[0]}\")" ] }, { "cell_type": "markdown", "id": "440267c6-180d-4ba7-8f67-352a48cb1b4d", "metadata": {}, "source": [ "```{admonition} **An aside**: \n", "\n", "These are examples of using __f-strings__. \n", "We can apply formatting to get a more readable output.\n", "The next cell shows how to print only two decimal places.\n", "\n", "See [f-string doc](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) for more info\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "id": "997a76c1-b35e-42bb-8438-14ecb09082ba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the mean is 0.43\n", "the min and max values are -2.21 and 2.31\n" ] } ], "source": [ "print(f\"the mean is {df.loc['2022-04'].mean()[0]:.2f}\")\n", "print(f\"the min and max values are {df.loc['2022-04'].min()[0]:.2f} and {df.loc['2022-04'].max()[0]:.2f}\")" ] }, { "cell_type": "markdown", "id": "532067b7-f44c-4dce-8deb-e1cc59f0bde9", "metadata": {}, "source": [ "* all data **starting** at given date (note that it is inclusive)" ] }, { "cell_type": "code", "execution_count": 10, "id": "45ac2677-e42f-4d80-888f-dbe1d776af58", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-03-310.168551
2022-04-010.072442
2022-04-022.313431
2022-04-03-0.774668
2022-04-040.132488
2022-04-051.755500
2022-04-06-2.213172
2022-04-071.618603
2022-04-081.433903
2022-04-09-1.297705
2022-04-101.274354
\n", "
" ], "text/plain": [ " data\n", "2022-03-31 0.168551\n", "2022-04-01 0.072442\n", "2022-04-02 2.313431\n", "2022-04-03 -0.774668\n", "2022-04-04 0.132488\n", "2022-04-05 1.755500\n", "2022-04-06 -2.213172\n", "2022-04-07 1.618603\n", "2022-04-08 1.433903\n", "2022-04-09 -1.297705\n", "2022-04-10 1.274354" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2022-03-31':]" ] }, { "cell_type": "markdown", "id": "28dd114d-df90-4e62-a7a8-ab8ca59fdb8a", "metadata": {}, "source": [ "* all data **before** a given date (note that is it inclusive)" ] }, { "cell_type": "code", "execution_count": 11, "id": "b1b8c437-e596-4af3-95c3-75dd303d9c71", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-01-0.693954
2022-01-02-0.209926
2022-01-030.044877
2022-01-040.527820
2022-01-050.923959
2022-01-060.115014
2022-01-07-0.206611
2022-01-080.791693
2022-01-09-0.271810
2022-01-10-0.125983
2022-01-11-1.379321
2022-01-12-0.972250
\n", "
" ], "text/plain": [ " data\n", "2022-01-01 -0.693954\n", "2022-01-02 -0.209926\n", "2022-01-03 0.044877\n", "2022-01-04 0.527820\n", "2022-01-05 0.923959\n", "2022-01-06 0.115014\n", "2022-01-07 -0.206611\n", "2022-01-08 0.791693\n", "2022-01-09 -0.271810\n", "2022-01-10 -0.125983\n", "2022-01-11 -1.379321\n", "2022-01-12 -0.972250" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:'2022-01-12']" ] }, { "cell_type": "markdown", "id": "5feb9b85-7d8c-438d-b8f6-64e2d5579896", "metadata": {}, "source": [ "* all data **between** two dates (note that is it inclusive on both ends)" ] }, { "cell_type": "code", "execution_count": 12, "id": "29203e7e-502b-49b0-9506-416387407fd7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-12-0.972250
2022-01-13-1.053409
2022-01-14-1.650255
2022-01-15-1.251144
2022-01-160.725652
2022-01-172.123418
2022-01-18-0.367741
2022-01-191.032479
2022-01-200.449727
2022-01-21-1.204295
2022-01-22-0.008893
2022-01-231.236906
2022-01-24-1.444002
2022-01-250.987860
2022-01-260.121889
2022-01-271.140643
2022-01-281.076552
2022-01-290.184321
2022-01-30-0.050446
2022-01-310.589317
2022-02-010.466329
2022-02-02-1.049473
2022-02-031.206759
2022-02-04-0.840185
2022-02-051.142660
2022-02-06-1.486865
2022-02-071.243011
2022-02-080.489471
2022-02-09-0.080931
2022-02-100.098374
2022-02-11-0.368505
2022-02-12-2.364454
\n", "
" ], "text/plain": [ " data\n", "2022-01-12 -0.972250\n", "2022-01-13 -1.053409\n", "2022-01-14 -1.650255\n", "2022-01-15 -1.251144\n", "2022-01-16 0.725652\n", "2022-01-17 2.123418\n", "2022-01-18 -0.367741\n", "2022-01-19 1.032479\n", "2022-01-20 0.449727\n", "2022-01-21 -1.204295\n", "2022-01-22 -0.008893\n", "2022-01-23 1.236906\n", "2022-01-24 -1.444002\n", "2022-01-25 0.987860\n", "2022-01-26 0.121889\n", "2022-01-27 1.140643\n", "2022-01-28 1.076552\n", "2022-01-29 0.184321\n", "2022-01-30 -0.050446\n", "2022-01-31 0.589317\n", "2022-02-01 0.466329\n", "2022-02-02 -1.049473\n", "2022-02-03 1.206759\n", "2022-02-04 -0.840185\n", "2022-02-05 1.142660\n", "2022-02-06 -1.486865\n", "2022-02-07 1.243011\n", "2022-02-08 0.489471\n", "2022-02-09 -0.080931\n", "2022-02-10 0.098374\n", "2022-02-11 -0.368505\n", "2022-02-12 -2.364454" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['2022-01-12':'2022-02-12']" ] }, { "cell_type": "markdown", "id": "50fb8bb3-589f-4073-9f9c-2e9ef1e998b0", "metadata": {}, "source": [ "## Temporal aggregation\n", "\n", "This is when we move from a shorter to a longer period. For example, daily data can be aggregated to weekly, monthly, quarterly, annual. Typically, aggregation is done by taking the sum or the average\n", "\n", "\n", "See [pandas doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample) for more info and examples of using the `resample` method" ] }, { "cell_type": "code", "execution_count": 13, "id": "b4e96e55-910a-40a9-b032-2f9af2498582", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-311.182088
2022-02-28-9.062631
2022-03-317.715243
2022-04-304.315177
\n", "
" ], "text/plain": [ " data\n", "2022-01-31 1.182088\n", "2022-02-28 -9.062631\n", "2022-03-31 7.715243\n", "2022-04-30 4.315177" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.resample('M').sum()" ] }, { "cell_type": "code", "execution_count": 14, "id": "51ba7f5a-8c82-49e1-959c-955647e13585", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-310.038132
2022-02-28-0.323665
2022-03-310.248879
2022-04-300.431518
\n", "
" ], "text/plain": [ " data\n", "2022-01-31 0.038132\n", "2022-02-28 -0.323665\n", "2022-03-31 0.248879\n", "2022-04-30 0.431518" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.resample('M').mean()" ] }, { "cell_type": "markdown", "id": "3e865563-8249-454c-bd7c-9178e33a8458", "metadata": {}, "source": [ "* aggregation can be done also by taking the first/last observation of the higher frequency (daily here) data" ] }, { "cell_type": "code", "execution_count": 17, "id": "88be638c-6d68-4d1d-996a-5a4825c1feda", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-31-0.693954
2022-02-280.466329
2022-03-31-0.360594
2022-04-300.072442
\n", "
" ], "text/plain": [ " data\n", "2022-01-31 -0.693954\n", "2022-02-28 0.466329\n", "2022-03-31 -0.360594\n", "2022-04-30 0.072442" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.resample('M').first()" ] }, { "cell_type": "code", "execution_count": 18, "id": "37c2d841-b454-4e99-a82e-09e493db371a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2022-01-310.589317
2022-02-28-1.773561
2022-03-310.168551
2022-04-301.274354
\n", "
" ], "text/plain": [ " data\n", "2022-01-31 0.589317\n", "2022-02-28 -1.773561\n", "2022-03-31 0.168551\n", "2022-04-30 1.274354" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.resample('M').last()" ] }, { "cell_type": "markdown", "id": "e010e74a-5d3e-4c2c-8cc3-e56c9230ac4a", "metadata": {}, "source": [ "* we can apply multiple functions at the same time using the `agg` method" ] }, { "cell_type": "code", "execution_count": 20, "id": "71a13dc0-9bbc-4bf9-86ea-d06ec08a3adc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
meansumminmaxfirstlast
2022-01-310.0381321.182088-1.6502552.123418-0.6939540.589317
2022-02-28-0.323665-9.062631-2.3644541.3055620.466329-1.773561
2022-03-310.2488797.715243-2.1465172.109419-0.3605940.168551
2022-04-300.4315184.315177-2.2131722.3134310.0724421.274354
\n", "
" ], "text/plain": [ " data \n", " mean sum min max first last\n", "2022-01-31 0.038132 1.182088 -1.650255 2.123418 -0.693954 0.589317\n", "2022-02-28 -0.323665 -9.062631 -2.364454 1.305562 0.466329 -1.773561\n", "2022-03-31 0.248879 7.715243 -2.146517 2.109419 -0.360594 0.168551\n", "2022-04-30 0.431518 4.315177 -2.213172 2.313431 0.072442 1.274354" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.resample('M').agg(['mean', 'sum', 'min', 'max', 'first', 'last'])" ] }, { "cell_type": "markdown", "id": "86071b24-09e7-41e5-a377-9ea1c8ff58fa", "metadata": {}, "source": [ "```{note}\n", "we could have the monthly frequency index for the **first day** of the month using `df.resample('MS')`\n", "```" ] }, { "cell_type": "markdown", "id": "5870ce5c-55a3-4af2-92aa-d933d1868922", "metadata": {}, "source": [ "## Pivot Tables" ] }, { "cell_type": "markdown", "id": "fe9e0b95-5556-42f9-8802-ecb5bd1af70c", "metadata": {}, "source": [ "* create a new monthly time series for 2 years " ] }, { "cell_type": "code", "execution_count": 57, "id": "d4be0370-f99c-4fa5-bed2-e40ae2d5ccfc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "24" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "monthly_index = pd.date_range(start='2021-01-01', end='2022-12-31', freq='M')\n", "len(monthly_index)" ] }, { "cell_type": "code", "execution_count": 63, "id": "2a5779e3-4acd-415f-abd2-c8d17ae27b3d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
2021-01-310.484735
2021-02-280.687497
2021-03-310.147638
2021-04-300.404760
2021-05-31-0.883069
\n", "
" ], "text/plain": [ " data\n", "2021-01-31 0.484735\n", "2021-02-28 0.687497\n", "2021-03-31 0.147638\n", "2021-04-30 0.404760\n", "2021-05-31 -0.883069" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_monthly = pd.DataFrame(data=np.random.normal(loc=0, scale=1, size=len(monthly_index)),\n", " columns=['data',],\n", " index=monthly_index)\n", "df_monthly.head()" ] }, { "cell_type": "code", "execution_count": 64, "id": "a7feb731-06df-4219-be3a-15f75da01906", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datamonthquarteryear
2021-01-310.484735112021
2021-02-280.687497212021
2021-03-310.147638312021
2021-04-300.404760422021
2021-05-31-0.883069522021
2021-06-30-0.854109622021
2021-07-31-1.276728732021
2021-08-310.254640832021
2021-09-300.577472932021
2021-10-31-1.1868521042021
2021-11-30-0.6640261142021
2021-12-31-0.3210331242021
2022-01-310.832603112022
2022-02-28-0.773677212022
2022-03-310.319106312022
2022-04-30-0.745475422022
2022-05-310.898935522022
2022-06-30-0.431810622022
2022-07-31-1.345803732022
2022-08-310.143264832022
2022-09-30-0.004844932022
2022-10-311.1024171042022
2022-11-300.7131271142022
2022-12-310.3475381242022
\n", "
" ], "text/plain": [ " data month quarter year\n", "2021-01-31 0.484735 1 1 2021\n", "2021-02-28 0.687497 2 1 2021\n", "2021-03-31 0.147638 3 1 2021\n", "2021-04-30 0.404760 4 2 2021\n", "2021-05-31 -0.883069 5 2 2021\n", "2021-06-30 -0.854109 6 2 2021\n", "2021-07-31 -1.276728 7 3 2021\n", "2021-08-31 0.254640 8 3 2021\n", "2021-09-30 0.577472 9 3 2021\n", "2021-10-31 -1.186852 10 4 2021\n", "2021-11-30 -0.664026 11 4 2021\n", "2021-12-31 -0.321033 12 4 2021\n", "2022-01-31 0.832603 1 1 2022\n", "2022-02-28 -0.773677 2 1 2022\n", "2022-03-31 0.319106 3 1 2022\n", "2022-04-30 -0.745475 4 2 2022\n", "2022-05-31 0.898935 5 2 2022\n", "2022-06-30 -0.431810 6 2 2022\n", "2022-07-31 -1.345803 7 3 2022\n", "2022-08-31 0.143264 8 3 2022\n", "2022-09-30 -0.004844 9 3 2022\n", "2022-10-31 1.102417 10 4 2022\n", "2022-11-30 0.713127 11 4 2022\n", "2022-12-31 0.347538 12 4 2022" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_monthly['month'] = df_monthly.index.month\n", "df_monthly['quarter'] = df_monthly.index.quarter\n", "df_monthly['year'] = df_monthly.index.year\n", "df_monthly" ] }, { "cell_type": "markdown", "id": "f8908df2-37a1-4c11-b3f6-da28b9b3dbe0", "metadata": {}, "source": [ "* we can use the `pivot_table` method to see the data for each month by year" ] }, { "cell_type": "code", "execution_count": 69, "id": "440b6a9f-3f37-4a49-a517-2d5f4289a460", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year20212022
month
10.4847350.832603
20.687497-0.773677
30.1476380.319106
40.404760-0.745475
5-0.8830690.898935
6-0.854109-0.431810
7-1.276728-1.345803
80.2546400.143264
90.577472-0.004844
10-1.1868521.102417
11-0.6640260.713127
12-0.3210330.347538
\n", "
" ], "text/plain": [ "year 2021 2022\n", "month \n", "1 0.484735 0.832603\n", "2 0.687497 -0.773677\n", "3 0.147638 0.319106\n", "4 0.404760 -0.745475\n", "5 -0.883069 0.898935\n", "6 -0.854109 -0.431810\n", "7 -1.276728 -1.345803\n", "8 0.254640 0.143264\n", "9 0.577472 -0.004844\n", "10 -1.186852 1.102417\n", "11 -0.664026 0.713127\n", "12 -0.321033 0.347538" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df_monthly, \n", " index='month', \n", " columns=\"year\", \n", " values='data')" ] }, { "cell_type": "markdown", "id": "98628334-9c9c-41e6-a4eb-05218f5f98a3", "metadata": {}, "source": [ "* a more useful exmaple of `pivot_table` to to compute a summary statistic for some level of aggregattion\n", "* for example average value by quarter" ] }, { "cell_type": "code", "execution_count": 73, "id": "044c6f47-2daf-46a9-aed2-870b557d61f4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year20212022
quarter
10.4399570.126010
2-0.444139-0.092783
3-0.148205-0.402461
4-0.7239710.721027
\n", "
" ], "text/plain": [ "year 2021 2022\n", "quarter \n", "1 0.439957 0.126010\n", "2 -0.444139 -0.092783\n", "3 -0.148205 -0.402461\n", "4 -0.723971 0.721027" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df_monthly, \n", " index='quarter', \n", " columns=\"year\", \n", " values='data',\n", " aggfunc=np.mean)" ] }, { "cell_type": "markdown", "id": "3894d2b6-c2dd-45b3-ab3f-9aac9c8b1f78", "metadata": {}, "source": [ "* compare with using `groupby` method" ] }, { "cell_type": "code", "execution_count": 79, "id": "01de58f4-8f93-4025-9167-ceaa3aacc3f6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data
yearquarter
202110.439957
2-0.444139
3-0.148205
4-0.723971
202210.126010
2-0.092783
3-0.402461
40.721027
\n", "
" ], "text/plain": [ " data\n", "year quarter \n", "2021 1 0.439957\n", " 2 -0.444139\n", " 3 -0.148205\n", " 4 -0.723971\n", "2022 1 0.126010\n", " 2 -0.092783\n", " 3 -0.402461\n", " 4 0.721027" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_monthly[['year', 'quarter', 'data']].groupby(['year', 'quarter']).mean()" ] }, { "cell_type": "markdown", "id": "cf4bc365-8445-4766-88a9-8eef9a9b9730", "metadata": {}, "source": [ "## Transformations" ] }, { "cell_type": "markdown", "id": "63f8104c-79cf-4acb-b828-53e15e615a3d", "metadata": {}, "source": [ "### log-transformation \n", "\n", "the log is defined for positive values only. To assure that all values are positive, below I add a constant to the values in the data column. I also create a new dataframe and rename the column as **y**." ] }, { "cell_type": "code", "execution_count": 61, "id": "f6d3b358-3102-457a-9552-ebfe4b4ad87e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
2022-01-011.770501
2022-01-022.254528
2022-01-032.509331
2022-01-042.992274
2022-01-053.388413
\n", "
" ], "text/plain": [ " y\n", "2022-01-01 1.770501\n", "2022-01-02 2.254528\n", "2022-01-03 2.509331\n", "2022-01-04 2.992274\n", "2022-01-05 3.388413" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "min_value = np.abs(df.min())\n", "new_df = df + min_value + 0.1 # this adds min_value + 0.1 to all elements of the data column of df\n", "new_df.columns = ['y']\n", "new_df.head()" ] }, { "cell_type": "code", "execution_count": 62, "id": "2fd09629-709c-4e7c-aeab-4e3bddf7e689", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-01-01 0.571262\n", "2022-01-02 0.812941\n", "2022-01-03 0.920016\n", "2022-01-04 1.096034\n", "2022-01-05 1.220362\n", " ... \n", "2022-04-06 -1.381180\n", "2022-04-07 1.406846\n", "2022-04-08 1.360555\n", "2022-04-09 0.154221\n", "2022-04-10 1.318767\n", "Freq: D, Name: y, Length: 100, dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## log of y\n", "np.log(new_df['y'])" ] }, { "cell_type": "markdown", "id": "391afbdc-1ad0-4e2f-b820-d841e2674760", "metadata": {}, "source": [ "### percent change from last period" ] }, { "cell_type": "code", "execution_count": 63, "id": "1a3af2c1-8234-41d4-8bd4-a980f95201d6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
2022-01-01NaN
2022-01-020.273384
2022-01-030.113019
2022-01-040.192459
2022-01-050.132387
......
2022-04-06-0.940454
2022-04-0715.248907
2022-04-08-0.045236
2022-04-09-0.700708
2022-04-102.204467
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " y\n", "2022-01-01 NaN\n", "2022-01-02 0.273384\n", "2022-01-03 0.113019\n", "2022-01-04 0.192459\n", "2022-01-05 0.132387\n", "... ...\n", "2022-04-06 -0.940454\n", "2022-04-07 15.248907\n", "2022-04-08 -0.045236\n", "2022-04-09 -0.700708\n", "2022-04-10 2.204467\n", "\n", "[100 rows x 1 columns]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.pct_change(periods=1)" ] }, { "cell_type": "markdown", "id": "7e8d90ff-3361-4521-8e59-2f36242db632", "metadata": {}, "source": [ "### difference relative to last period" ] }, { "cell_type": "code", "execution_count": 64, "id": "0b15b3da-1ad0-4f7a-b8bc-b401890deb40", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
2022-01-01NaN
2022-01-020.484027
2022-01-030.254804
2022-01-040.482943
2022-01-050.396138
......
2022-04-06-3.968672
2022-04-073.831776
2022-04-08-0.184700
2022-04-09-2.731608
2022-04-102.572059
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " y\n", "2022-01-01 NaN\n", "2022-01-02 0.484027\n", "2022-01-03 0.254804\n", "2022-01-04 0.482943\n", "2022-01-05 0.396138\n", "... ...\n", "2022-04-06 -3.968672\n", "2022-04-07 3.831776\n", "2022-04-08 -0.184700\n", "2022-04-09 -2.731608\n", "2022-04-10 2.572059\n", "\n", "[100 rows x 1 columns]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.diff(periods=1)" ] }, { "cell_type": "markdown", "id": "4d990c9f-b9c1-445e-9525-10f79bf6822b", "metadata": {}, "source": [ "### lagged values $y(t-1)$\n", "\n", "shifts the index forward by a number of periods, one here" ] }, { "cell_type": "code", "execution_count": 65, "id": "15b2ff57-5725-4cbf-bfda-7cae6ad7cec7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
2022-01-01NaN
2022-01-021.770501
2022-01-032.254528
2022-01-042.509331
2022-01-052.992274
......
2022-04-064.219954
2022-04-070.251282
2022-04-084.083058
2022-04-093.898357
2022-04-101.166749
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " y\n", "2022-01-01 NaN\n", "2022-01-02 1.770501\n", "2022-01-03 2.254528\n", "2022-01-04 2.509331\n", "2022-01-05 2.992274\n", "... ...\n", "2022-04-06 4.219954\n", "2022-04-07 0.251282\n", "2022-04-08 4.083058\n", "2022-04-09 3.898357\n", "2022-04-10 1.166749\n", "\n", "[100 rows x 1 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.shift(1)" ] }, { "cell_type": "markdown", "id": "300dd084-b44f-4ae9-91cb-c0dd3115fafd", "metadata": {}, "source": [ "### lead values $y(t+1)$\n", "shifts the index backward" ] }, { "cell_type": "code", "execution_count": 66, "id": "ce9c09e6-6030-42ac-9dab-196de2acc52e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
2022-01-012.254528
2022-01-022.509331
2022-01-032.992274
2022-01-043.388413
2022-01-052.579468
......
2022-04-064.083058
2022-04-073.898357
2022-04-081.166749
2022-04-093.738808
2022-04-10NaN
\n", "

100 rows × 1 columns

\n", "
" ], "text/plain": [ " y\n", "2022-01-01 2.254528\n", "2022-01-02 2.509331\n", "2022-01-03 2.992274\n", "2022-01-04 3.388413\n", "2022-01-05 2.579468\n", "... ...\n", "2022-04-06 4.083058\n", "2022-04-07 3.898357\n", "2022-04-08 1.166749\n", "2022-04-09 3.738808\n", "2022-04-10 NaN\n", "\n", "[100 rows x 1 columns]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.shift(-1)" ] }, { "cell_type": "markdown", "id": "42c2d294-85a4-4769-8bd7-904fc300d3d8", "metadata": {}, "source": [ "* add the transformed **y** as new columns of the dataframe" ] }, { "cell_type": "code", "execution_count": 69, "id": "65105cfa-99d9-436b-a131-b2eb38d73dd3", "metadata": {}, "outputs": [], "source": [ "new_df['log_y'] = np.log(new_df['y'])\n", "new_df['pct_ch_y'] = new_df['y'].pct_change(periods=1)\n", "new_df['diff_y'] = new_df['y'].diff(periods=1)\n", "\n", "new_df['y(t-1)'] = new_df['y'].shift(periods=1)\n", "new_df['y(t+1)'] = new_df['y'].shift(periods=-1)" ] }, { "cell_type": "code", "execution_count": 70, "id": "13762cc3-90ce-4587-980a-a24ca25cb874", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ylog_ypct_ch_ydiff_yy(t-1)y(t+1)
2022-01-011.7705010.571262NaNNaNNaN2.254528
2022-01-022.2545280.8129410.2733840.4840271.7705012.509331
2022-01-032.5093310.9200160.1130190.2548042.2545282.992274
2022-01-042.9922741.0960340.1924590.4829432.5093313.388413
2022-01-053.3884131.2203620.1323870.3961382.9922742.579468
\n", "
" ], "text/plain": [ " y log_y pct_ch_y diff_y y(t-1) y(t+1)\n", "2022-01-01 1.770501 0.571262 NaN NaN NaN 2.254528\n", "2022-01-02 2.254528 0.812941 0.273384 0.484027 1.770501 2.509331\n", "2022-01-03 2.509331 0.920016 0.113019 0.254804 2.254528 2.992274\n", "2022-01-04 2.992274 1.096034 0.192459 0.482943 2.509331 3.388413\n", "2022-01-05 3.388413 1.220362 0.132387 0.396138 2.992274 2.579468" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.head()" ] }, { "cell_type": "code", "execution_count": 71, "id": "8faf1866-2246-41d1-8122-4972b0701f0f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ylog_ypct_ch_ydiff_yy(t-1)y(t+1)
2022-04-060.251282-1.381180-0.940454-3.9686724.2199544.083058
2022-04-074.0830581.40684615.2489073.8317760.2512823.898357
2022-04-083.8983571.360555-0.045236-0.1847004.0830581.166749
2022-04-091.1667490.154221-0.700708-2.7316083.8983573.738808
2022-04-103.7388081.3187672.2044672.5720591.166749NaN
\n", "
" ], "text/plain": [ " y log_y pct_ch_y diff_y y(t-1) y(t+1)\n", "2022-04-06 0.251282 -1.381180 -0.940454 -3.968672 4.219954 4.083058\n", "2022-04-07 4.083058 1.406846 15.248907 3.831776 0.251282 3.898357\n", "2022-04-08 3.898357 1.360555 -0.045236 -0.184700 4.083058 1.166749\n", "2022-04-09 1.166749 0.154221 -0.700708 -2.731608 3.898357 3.738808\n", "2022-04-10 3.738808 1.318767 2.204467 2.572059 1.166749 NaN" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.tail()" ] }, { "cell_type": "markdown", "id": "da877bc9-ddb9-4c06-b83b-344baa8f447c", "metadata": {}, "source": [ "* drop mising values (NaN)" ] }, { "cell_type": "code", "execution_count": 72, "id": "9688ec5b-1713-4cf4-84d2-ab2c9c9b0441", "metadata": {}, "outputs": [], "source": [ "new_df.dropna(axis=0, how='any', inplace=True)" ] }, { "cell_type": "code", "execution_count": 73, "id": "84ff93fb-542b-4229-ac46-1d0eeb7b6bcc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ylog_ypct_ch_ydiff_yy(t-1)y(t+1)
2022-01-022.2545280.8129410.2733840.4840271.7705012.509331
2022-01-032.5093310.9200160.1130190.2548042.2545282.992274
2022-01-042.9922741.0960340.1924590.4829432.5093313.388413
2022-01-053.3884131.2203620.1323870.3961382.9922742.579468
2022-01-062.5794680.947583-0.238739-0.8089453.3884132.257843
\n", "
" ], "text/plain": [ " y log_y pct_ch_y diff_y y(t-1) y(t+1)\n", "2022-01-02 2.254528 0.812941 0.273384 0.484027 1.770501 2.509331\n", "2022-01-03 2.509331 0.920016 0.113019 0.254804 2.254528 2.992274\n", "2022-01-04 2.992274 1.096034 0.192459 0.482943 2.509331 3.388413\n", "2022-01-05 3.388413 1.220362 0.132387 0.396138 2.992274 2.579468\n", "2022-01-06 2.579468 0.947583 -0.238739 -0.808945 3.388413 2.257843" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.head()" ] }, { "cell_type": "code", "execution_count": 74, "id": "520a1ecd-4e64-40b2-bb6f-31eb036e54de", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ylog_ypct_ch_ydiff_yy(t-1)y(t+1)
2022-04-054.2199541.4398240.6249701.6230122.5969420.251282
2022-04-060.251282-1.381180-0.940454-3.9686724.2199544.083058
2022-04-074.0830581.40684615.2489073.8317760.2512823.898357
2022-04-083.8983571.360555-0.045236-0.1847004.0830581.166749
2022-04-091.1667490.154221-0.700708-2.7316083.8983573.738808
\n", "
" ], "text/plain": [ " y log_y pct_ch_y diff_y y(t-1) y(t+1)\n", "2022-04-05 4.219954 1.439824 0.624970 1.623012 2.596942 0.251282\n", "2022-04-06 0.251282 -1.381180 -0.940454 -3.968672 4.219954 4.083058\n", "2022-04-07 4.083058 1.406846 15.248907 3.831776 0.251282 3.898357\n", "2022-04-08 3.898357 1.360555 -0.045236 -0.184700 4.083058 1.166749\n", "2022-04-09 1.166749 0.154221 -0.700708 -2.731608 3.898357 3.738808" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df.tail()" ] }, { "cell_type": "markdown", "id": "3c569e62-8be7-4a31-8e94-315f37d2a6da", "metadata": {}, "source": [ "### Rolling Windows\n", "Another common transformation of time series data is computing rolling windows statistics, such as rolling average, rolling sum, and rolling standard deviation" ] }, { "cell_type": "markdown", "id": "02372791-3350-43bf-959c-90a2c6e86ec9", "metadata": {}, "source": [ "* rolling average for window with size 7 (weakly average of daily data)" ] }, { "cell_type": "code", "execution_count": 200, "id": "13448095-1e38-438f-9f51-d3c7eba55ccb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-01-01 NaN\n", "2022-01-02 NaN\n", "2022-01-03 NaN\n", "2022-01-04 NaN\n", "2022-01-05 NaN\n", "2022-01-06 NaN\n", "2022-01-07 3.106036\n", "2022-01-08 3.224076\n", "2022-01-09 3.271001\n", "Freq: D, Name: y, dtype: float64" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "window_size = 7\n", "new_df['y'].rolling(window=window_size).mean().head(9)" ] }, { "cell_type": "markdown", "id": "5ee19e26-a67a-480f-a636-82b6daaaf767", "metadata": {}, "source": [ "* note that the first 6 observations are missing and the 7th one is the average of the first 7 values. Look at the following cell to make sure you undestand how the rolloing window values are computed" ] }, { "cell_type": "code", "execution_count": 201, "id": "9e892a6e-ee06-4c91-80c2-06d34e690dcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.106035602883829\n", "3.2240756001112376\n", "3.271000970304381\n" ] } ], "source": [ "print(new_df.iloc[:7].loc[:, 'y'].mean())\n", "print(new_df.iloc[1:8].loc[:, 'y'].mean())\n", "print(new_df.iloc[2:9].loc[:, 'y'].mean())" ] }, { "cell_type": "markdown", "id": "e27e2fe7-74bd-48f2-80b8-f047658057ca", "metadata": {}, "source": [ "* and if we use the `center=True` option" ] }, { "cell_type": "code", "execution_count": 202, "id": "feab6380-c38a-4357-a684-fe960601f9d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-01-01 NaN\n", "2022-01-02 NaN\n", "2022-01-03 NaN\n", "2022-01-04 3.106036\n", "2022-01-05 3.224076\n", "2022-01-06 3.271001\n", "Freq: D, Name: y, dtype: float64" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df['y'].rolling(window=window_size, center=True).mean().head(6)" ] }, { "cell_type": "markdown", "id": "9b0723bd-4b8f-4cf1-b1db-201834d173d3", "metadata": {}, "source": [ "* rolling sum for window with size 7 (weakly average of daily data)" ] }, { "cell_type": "code", "execution_count": 178, "id": "2ec941bb-3509-4c4b-a772-7061b137b483", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-01-01 NaN\n", "2022-01-02 NaN\n", "2022-01-03 NaN\n", "2022-01-04 NaN\n", "2022-01-05 NaN\n", "2022-01-06 NaN\n", "2022-01-07 21.742249\n", "2022-01-08 22.568529\n", "2022-01-09 22.897007\n", "Freq: D, Name: y, dtype: float64" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df['y'].rolling(window=window_size).sum().head(9)" ] }, { "cell_type": "markdown", "id": "fedd1aa8-9ba3-4ea0-bb61-6009d82becc7", "metadata": {}, "source": [ "* rolling standard deviation for window with size 7 (weakly average of daily data)" ] }, { "cell_type": "code", "execution_count": 179, "id": "9a27a5be-1e8b-4d1d-8285-5f28bfa4512f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2022-01-01 NaN\n", "2022-01-02 NaN\n", "2022-01-03 NaN\n", "2022-01-04 NaN\n", "2022-01-05 NaN\n", "2022-01-06 NaN\n", "2022-01-07 0.447562\n", "2022-01-08 0.643618\n", "2022-01-09 0.648062\n", "Freq: D, Name: y, dtype: float64" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df['y'].rolling(window=window_size).std().head(9)" ] }, { "cell_type": "markdown", "id": "fff7c93f-4ea0-4746-818e-fd07c8305f9b", "metadata": {}, "source": [ "See [pandas_doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling) for more info and examples of using the `rolling` method" ] }, { "cell_type": "code", "execution_count": null, "id": "6d21fff5-d3c6-4534-af47-7d63a4403a43", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 5 }