Pandas

In this notebook, we’re going to show how to:

create data frames
inspect data frames

import pandas as pd
import numpy as np

Create a dataframe with random data

from numpy.random import default_rng
rng = default_rng(42)

index_list = list('ABCDEFG')
column_list = list('0123')

n_rows, n_cols = len(index_list), len(column_list)
index_list, column_list

(['A', 'B', 'C', 'D', 'E', 'F', 'G'], ['0', '1', '2', '3'])

data = rng.normal(size=(n_rows, n_cols))

df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df

	0	1	2	3
A	0.304717	-1.039984	0.750451	0.940565
B	-1.951035	-1.302180	0.127840	-0.316243
C	-0.016801	-0.853044	0.879398	0.777792
D	0.066031	1.127241	0.467509	-0.859292
E	0.368751	-0.958883	0.878450	-0.049926
F	-0.184862	-0.680930	1.222541	-0.154529
G	-0.428328	-0.352134	0.532309	0.365444
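Because the generator is seeded, the values above are reproducible from run to run. The following is a minimal sketch of that property, assuming a hypothetical helper named `make_frame` that is not part of the original notebook; it rebuilds the same 7x4 frame from a given seed and compares the results.

```python
import pandas as pd
from numpy.random import default_rng


def make_frame(seed):
    """Hypothetical helper: build a 7x4 frame of standard-normal draws."""
    rng = default_rng(seed)
    index_list = list('ABCDEFG')
    column_list = list('0123')
    data = rng.normal(size=(len(index_list), len(column_list)))
    return pd.DataFrame(data=data, index=index_list, columns=column_list)


df_a = make_frame(42)
df_b = make_frame(42)
df_c = make_frame(43)

same_seed_equal = df_a.equals(df_b)   # same seed, so the draws match
diff_seed_equal = df_a.equals(df_c)   # different seed, so the draws differ
print(df_a.shape, same_seed_equal, diff_seed_equal)
```

Note that reproducibility depends on the generator state: drawing a second array from the same `rng` object, as the next section does, continues the stream rather than restarting it.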

Another way to get a list of letters and digits to use as index/columns

import string

index_list = list(string.ascii_uppercase)
column_list = list(string.digits)

n_rows, n_cols = len(index_list), len(column_list)

See the Python standard library documentation for more on the string module.
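As a quick check of what those two constants contain, the sketch below prints them and confirms the resulting list lengths. It uses only the standard library.

```python
import string

# Both constants are plain strings, so list() splits them into characters.
print(string.ascii_uppercase)  # the 26 letters A through Z as one string
print(string.digits)           # the ten characters 0 through 9 as one string

index_list = list(string.ascii_uppercase)
column_list = list(string.digits)
print(len(index_list), len(column_list))
```

The column labels produced this way are the strings '0' to '9', not integers, so a column is selected with `df['0']` rather than `df[0]`.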

data = rng.normal(size=(n_rows, n_cols))

df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df

	0	1	2	3	4	5	6	7	8	9
A	0.412733	0.430821	2.141648	-0.406415	-0.512243	-0.813773	0.615979	1.128972	-0.113947	-0.840156
B	-0.824481	0.650593	0.743254	0.543154	-0.665510	0.232161	0.116686	0.218689	0.871429	0.223596
C	0.678914	0.067579	0.289119	0.631288	-1.457156	-0.319671	-0.470373	-0.638878	-0.275142	1.494941
D	-0.865831	0.968278	-1.682870	-0.334885	0.162753	0.586222	0.711227	0.793347	-0.348725	-0.462352
E	0.857976	-0.191304	-1.275686	-1.133287	-0.919452	0.497161	0.142426	0.690485	-0.427253	0.158540
F	0.625590	-0.309347	0.456775	-0.661926	-0.363054	-0.381738	-1.195840	0.486972	-0.469402	0.012494
G	0.480747	0.446531	0.665385	-0.098485	-0.423298	-0.079718	-1.687334	-1.447112	-1.322700	-0.997247
H	0.399774	-0.905479	-0.378163	1.299228	-0.356264	0.737516	-0.933618	-0.205438	-0.950022	-0.339033
I	0.840308	-1.727320	0.434424	0.237736	-0.594150	-1.446058	0.072130	-0.529493	0.232676	0.021852
J	1.601779	-0.239356	-1.023497	0.179276	0.219997	1.359188	0.835111	0.356871	1.463303	-1.188763
K	-0.639752	-0.926576	-0.389810	-1.376686	0.635151	-0.222223	-1.470806	-1.015579	0.313514	0.838127
L	1.996731	2.913862	0.414409	-0.989538	-2.132046	0.267711	-0.812941	-0.415357	-0.612097	-0.140791
M	1.065980	0.157049	-0.158635	-1.035654	-1.674683	-0.486308	-0.053783	1.767930	0.130275	0.982740
N	-0.499296	-1.184944	-0.965117	-0.725226	2.128470	-0.821387	0.838489	-0.902927	0.931573	0.384951
O	-0.156638	-0.040763	-0.654788	0.446072	-0.454983	-1.225606	-1.277938	0.172588	1.579091	0.159992
P	-0.118638	0.285826	1.306002	0.219383	-0.410927	1.106289	0.428756	1.535756	0.183234	-1.224469
Q	-1.368159	1.650928	1.723666	-0.179519	-0.383187	1.461444	-1.107046	-0.894727	0.643327	-0.394605
R	-0.005122	-0.163443	0.337575	1.407482	0.090585	0.643939	-2.050172	-0.048718	-0.843230	-1.218813
S	-0.878152	-0.334123	0.915903	-1.326393	0.030631	-0.484169	-0.327673	1.002758	0.538115	1.337398
T	-0.154506	-0.695943	-0.223859	0.242497	0.176573	-1.084388	0.090490	0.228228	2.517474	1.876845
U	-0.853243	-0.287383	-1.463442	-0.590707	0.315605	1.205854	-0.729084	-0.654146	-2.147289	-0.162666
V	-1.062414	-0.529439	-0.876861	-0.094263	-1.757728	-1.467045	2.129247	-1.287423	-1.096786	1.836914
W	2.905067	-1.171567	-0.368249	0.341556	1.728698	-0.986857	-0.245278	0.777338	0.434766	-0.376156
X	-0.133823	-1.374896	-0.238174	-0.266387	0.232170	-0.555327	0.471539	1.012716	0.155429	0.351756
Y	0.053155	0.000084	-0.721558	0.316494	-0.097287	2.093168	1.573355	0.385847	-0.763057	-1.112411
Z	1.191143	0.262749	0.480143	-1.744586	0.927438	0.454420	-1.110431	-0.471525	0.263717	0.052467

Inspect the dataframe

see the first/last 5 rows

df.head()

	0	1	2	3	4	5	6	7	8	9
A	0.412733	0.430821	2.141648	-0.406415	-0.512243	-0.813773	0.615979	1.128972	-0.113947	-0.840156
B	-0.824481	0.650593	0.743254	0.543154	-0.665510	0.232161	0.116686	0.218689	0.871429	0.223596
C	0.678914	0.067579	0.289119	0.631288	-1.457156	-0.319671	-0.470373	-0.638878	-0.275142	1.494941
D	-0.865831	0.968278	-1.682870	-0.334885	0.162753	0.586222	0.711227	0.793347	-0.348725	-0.462352
E	0.857976	-0.191304	-1.275686	-1.133287	-0.919452	0.497161	0.142426	0.690485	-0.427253	0.158540

df.tail()

	0	1	2	3	4	5	6	7	8	9
V	-1.062414	-0.529439	-0.876861	-0.094263	-1.757728	-1.467045	2.129247	-1.287423	-1.096786	1.836914
W	2.905067	-1.171567	-0.368249	0.341556	1.728698	-0.986857	-0.245278	0.777338	0.434766	-0.376156
X	-0.133823	-1.374896	-0.238174	-0.266387	0.232170	-0.555327	0.471539	1.012716	0.155429	0.351756
Y	0.053155	0.000084	-0.721558	0.316494	-0.097287	2.093168	1.573355	0.385847	-0.763057	-1.112411
Z	1.191143	0.262749	0.480143	-1.744586	0.927438	0.454420	-1.110431	-0.471525	0.263717	0.052467

see a different number of top/bottom rows

df.head(2)

	0	1	2	3	4	5	6	7	8	9
A	0.412733	0.430821	2.141648	-0.406415	-0.512243	-0.813773	0.615979	1.128972	-0.113947	-0.840156
B	-0.824481	0.650593	0.743254	0.543154	-0.665510	0.232161	0.116686	0.218689	0.871429	0.223596
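The row count passed to `head` and `tail` can be any integer. The sketch below rebuilds a 26x10 frame (the seed is an arbitrary choice for this example, so the values differ from those shown above) and compares `head`/`tail` with the equivalent positional slices, including a negative count.

```python
import string

import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, assumed only for this sketch
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

first_two = df.head(2)           # same rows as df.iloc[:2]
last_three = df.tail(3)          # same rows as df.iloc[-3:]
all_but_last_two = df.head(-2)   # a negative n drops rows from the end

print(list(first_two.index), list(last_three.index), len(all_but_last_two))
```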

see a random sample of rows

df.sample(5)

	0	1	2	3	4	5	6	7	8	9
G	0.480747	0.446531	0.665385	-0.098485	-0.423298	-0.079718	-1.687334	-1.447112	-1.322700	-0.997247
I	0.840308	-1.727320	0.434424	0.237736	-0.594150	-1.446058	0.072130	-0.529493	0.232676	0.021852
N	-0.499296	-1.184944	-0.965117	-0.725226	2.128470	-0.821387	0.838489	-0.902927	0.931573	0.384951
R	-0.005122	-0.163443	0.337575	1.407482	0.090585	0.643939	-2.050172	-0.048718	-0.843230	-1.218813
F	0.625590	-0.309347	0.456775	-0.661926	-0.363054	-0.381738	-1.195840	0.486972	-0.469402	0.012494
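Each call to `df.sample(5)` returns a different set of rows. If a repeatable sample is needed, `sample` accepts a `random_state` argument, and `frac` can be used in place of a row count. The following sketch assumes an arbitrary seed and a freshly built frame, so its values are not those of the notebook.

```python
import string

import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, assumed only for this sketch
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

# The same random_state selects the same rows on every call.
sample_a = df.sample(5, random_state=1)
sample_b = df.sample(5, random_state=1)

# frac asks for a fraction of the rows instead of a fixed count.
half = df.sample(frac=0.5, random_state=1)

print(list(sample_a.index), len(half))
```

Sampling is without replacement by default, so no row label appears twice in the result.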

get basic information about the dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26 entries, A to Z
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       26 non-null     float64
 1   1       26 non-null     float64
 2   2       26 non-null     float64
 3   3       26 non-null     float64
 4   4       26 non-null     float64
 5   5       26 non-null     float64
 6   6       26 non-null     float64
 7   7       26 non-null     float64
 8   8       26 non-null     float64
 9   9       26 non-null     float64
dtypes: float64(10)
memory usage: 2.2+ KB
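The individual facts reported by `info()` are also available as separate attributes and methods, which is convenient when they are needed in code rather than as printed text. A minimal sketch on a rebuilt 26x10 frame (arbitrary seed, assumed for this example):

```python
import string

import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, assumed only for this sketch
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

n_rows, n_cols = df.shape                 # (rows, columns)
missing_per_column = df.isna().sum()      # count of missing values per column
# Each column holds 26 float64 values of 8 bytes each; the index is excluded.
bytes_per_column = df.memory_usage(index=False)

print(n_rows, n_cols)
print(missing_per_column.sum())
print(bytes_per_column.iloc[0])
```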

get summary statistics

df.describe()

	0	1	2	3	4	5	6	7	8	9
count	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000
mean	0.213455	-0.086445	-0.019708	-0.196146	-0.213611	0.010416	-0.209495	0.078737	0.034164	0.049044
std	1.017632	0.968577	0.958027	0.792970	0.965434	0.964080	1.015842	0.870329	0.990060	0.914550
min	-1.368159	-1.727320	-1.682870	-1.744586	-2.132046	-1.467045	-2.050172	-1.447112	-2.147289	-1.224469
25%	-0.604638	-0.654317	-0.704865	-0.709401	-0.573673	-0.749161	-1.063689	-0.611532	-0.576423	-0.445415
50%	0.024017	-0.177374	-0.191247	-0.139002	-0.359659	-0.150970	-0.149530	0.195638	0.142852	0.017173
75%	0.799959	0.280057	0.474301	0.297995	0.209141	0.629510	0.460843	0.755625	0.512278	0.376652
max	2.905067	2.913862	2.141648	1.407482	2.128470	2.093168	2.129247	1.767930	2.517474	1.876845
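Each row of the `describe()` output corresponds to an ordinary column aggregation. The sketch below checks a few of those correspondences and shows the `percentiles` parameter; the frame is rebuilt with an arbitrary seed, so the numbers differ from the table above.

```python
import string

import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, assumed only for this sketch
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

summary = df.describe()

# Rows of the summary match the corresponding column-wise aggregations.
mean_matches = np.allclose(summary.loc['mean'], df.mean())
std_matches = np.allclose(summary.loc['std'], df.std())      # sample std, ddof=1
median_matches = np.allclose(summary.loc['50%'], df.median())

# Request other percentiles than the default quartiles.
tails = df.describe(percentiles=[0.1, 0.9])

print(mean_matches, std_matches, median_matches)
print(list(tails.index))
```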

see only the datatypes by column

dict(zip(df.columns, df.dtypes))

{'0': dtype('float64'),
 '1': dtype('float64'),
 '2': dtype('float64'),
 '3': dtype('float64'),
 '4': dtype('float64'),
 '5': dtype('float64'),
 '6': dtype('float64'),
 '7': dtype('float64'),
 '8': dtype('float64'),
 '9': dtype('float64')}
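The same mapping can be obtained directly from the `dtypes` attribute, which is a Series indexed by column label. A short sketch, again on a rebuilt frame with an arbitrary seed:

```python
import string

import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, assumed only for this sketch
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

dtypes = df.dtypes            # Series: column label -> dtype
as_dict = dtypes.to_dict()    # same content as dict(zip(df.columns, df.dtypes))

print(dtypes.iloc[0], len(as_dict))
```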
