Pandas

In this notebook, we’re going to show how to:

  • create data frames

  • inspect data frames

import pandas as pd
import numpy as np

Create a dataframe with random data

from numpy.random import default_rng
rng = default_rng(42)
index_list = list('ABCDEFG')
column_list = list('0123')

n_rows, n_cols = len(index_list), len(column_list)
index_list, column_list
(['A', 'B', 'C', 'D', 'E', 'F', 'G'], ['0', '1', '2', '3'])
data = rng.normal(size=(n_rows, n_cols))
df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df
0 1 2 3
A 0.304717 -1.039984 0.750451 0.940565
B -1.951035 -1.302180 0.127840 -0.316243
C -0.016801 -0.853044 0.879398 0.777792
D 0.066031 1.127241 0.467509 -0.859292
E 0.368751 -0.958883 0.878450 -0.049926
F -0.184862 -0.680930 1.222541 -0.154529
G -0.428328 -0.352134 0.532309 0.365444
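
The numbers above come from a generator seeded with 42, so re-running the cell from a fresh generator should reproduce the same table. A minimal sketch of that property follows; the names data_a, data_b, df_a and df_b are illustrative and not part of the notebook.

```python
import pandas as pd
from numpy.random import default_rng

index_list = list('ABCDEFG')
column_list = list('0123')
shape = (len(index_list), len(column_list))

# Two generators created with the same seed yield the same draws.
data_a = default_rng(42).normal(size=shape)
data_b = default_rng(42).normal(size=shape)

df_a = pd.DataFrame(data=data_a, index=index_list, columns=column_list)
df_b = pd.DataFrame(data=data_b, index=index_list, columns=column_list)

print(df_a.equals(df_b))  # True
print(df_a.shape)         # (7, 4)
```

Note that the notebook reuses the same rng object for the second dataframe, so that one continues the random stream rather than restarting it.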

Another way to get a list of letters and digits to use as index/columns

import string

index_list = list(string.ascii_uppercase)
column_list = list(string.digits)

n_rows, n_cols = len(index_list), len(column_list)

see the Python documentation for the string module for more information
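
As a quick check of what these constants contain, the sketch below prints them and the resulting list lengths. It uses only the standard-library string module.

```python
import string

# Both constants are plain strings, so list() splits them into characters.
print(string.ascii_uppercase)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.digits)           # 0123456789

index_list = list(string.ascii_uppercase)
column_list = list(string.digits)
print(len(index_list), len(column_list))  # 26 10
```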

data = rng.normal(size=(n_rows, n_cols))
df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df
0 1 2 3 4 5 6 7 8 9
A 0.412733 0.430821 2.141648 -0.406415 -0.512243 -0.813773 0.615979 1.128972 -0.113947 -0.840156
B -0.824481 0.650593 0.743254 0.543154 -0.665510 0.232161 0.116686 0.218689 0.871429 0.223596
C 0.678914 0.067579 0.289119 0.631288 -1.457156 -0.319671 -0.470373 -0.638878 -0.275142 1.494941
D -0.865831 0.968278 -1.682870 -0.334885 0.162753 0.586222 0.711227 0.793347 -0.348725 -0.462352
E 0.857976 -0.191304 -1.275686 -1.133287 -0.919452 0.497161 0.142426 0.690485 -0.427253 0.158540
F 0.625590 -0.309347 0.456775 -0.661926 -0.363054 -0.381738 -1.195840 0.486972 -0.469402 0.012494
G 0.480747 0.446531 0.665385 -0.098485 -0.423298 -0.079718 -1.687334 -1.447112 -1.322700 -0.997247
H 0.399774 -0.905479 -0.378163 1.299228 -0.356264 0.737516 -0.933618 -0.205438 -0.950022 -0.339033
I 0.840308 -1.727320 0.434424 0.237736 -0.594150 -1.446058 0.072130 -0.529493 0.232676 0.021852
J 1.601779 -0.239356 -1.023497 0.179276 0.219997 1.359188 0.835111 0.356871 1.463303 -1.188763
K -0.639752 -0.926576 -0.389810 -1.376686 0.635151 -0.222223 -1.470806 -1.015579 0.313514 0.838127
L 1.996731 2.913862 0.414409 -0.989538 -2.132046 0.267711 -0.812941 -0.415357 -0.612097 -0.140791
M 1.065980 0.157049 -0.158635 -1.035654 -1.674683 -0.486308 -0.053783 1.767930 0.130275 0.982740
N -0.499296 -1.184944 -0.965117 -0.725226 2.128470 -0.821387 0.838489 -0.902927 0.931573 0.384951
O -0.156638 -0.040763 -0.654788 0.446072 -0.454983 -1.225606 -1.277938 0.172588 1.579091 0.159992
P -0.118638 0.285826 1.306002 0.219383 -0.410927 1.106289 0.428756 1.535756 0.183234 -1.224469
Q -1.368159 1.650928 1.723666 -0.179519 -0.383187 1.461444 -1.107046 -0.894727 0.643327 -0.394605
R -0.005122 -0.163443 0.337575 1.407482 0.090585 0.643939 -2.050172 -0.048718 -0.843230 -1.218813
S -0.878152 -0.334123 0.915903 -1.326393 0.030631 -0.484169 -0.327673 1.002758 0.538115 1.337398
T -0.154506 -0.695943 -0.223859 0.242497 0.176573 -1.084388 0.090490 0.228228 2.517474 1.876845
U -0.853243 -0.287383 -1.463442 -0.590707 0.315605 1.205854 -0.729084 -0.654146 -2.147289 -0.162666
V -1.062414 -0.529439 -0.876861 -0.094263 -1.757728 -1.467045 2.129247 -1.287423 -1.096786 1.836914
W 2.905067 -1.171567 -0.368249 0.341556 1.728698 -0.986857 -0.245278 0.777338 0.434766 -0.376156
X -0.133823 -1.374896 -0.238174 -0.266387 0.232170 -0.555327 0.471539 1.012716 0.155429 0.351756
Y 0.053155 0.000084 -0.721558 0.316494 -0.097287 2.093168 1.573355 0.385847 -0.763057 -1.112411
Z 1.191143 0.262749 0.480143 -1.744586 0.927438 0.454420 -1.110431 -0.471525 0.263717 0.052467

Inspect the dataframe

see the first/last 5 rows

df.head()
0 1 2 3 4 5 6 7 8 9
A 0.412733 0.430821 2.141648 -0.406415 -0.512243 -0.813773 0.615979 1.128972 -0.113947 -0.840156
B -0.824481 0.650593 0.743254 0.543154 -0.665510 0.232161 0.116686 0.218689 0.871429 0.223596
C 0.678914 0.067579 0.289119 0.631288 -1.457156 -0.319671 -0.470373 -0.638878 -0.275142 1.494941
D -0.865831 0.968278 -1.682870 -0.334885 0.162753 0.586222 0.711227 0.793347 -0.348725 -0.462352
E 0.857976 -0.191304 -1.275686 -1.133287 -0.919452 0.497161 0.142426 0.690485 -0.427253 0.158540
df.tail()
0 1 2 3 4 5 6 7 8 9
V -1.062414 -0.529439 -0.876861 -0.094263 -1.757728 -1.467045 2.129247 -1.287423 -1.096786 1.836914
W 2.905067 -1.171567 -0.368249 0.341556 1.728698 -0.986857 -0.245278 0.777338 0.434766 -0.376156
X -0.133823 -1.374896 -0.238174 -0.266387 0.232170 -0.555327 0.471539 1.012716 0.155429 0.351756
Y 0.053155 0.000084 -0.721558 0.316494 -0.097287 2.093168 1.573355 0.385847 -0.763057 -1.112411
Z 1.191143 0.262749 0.480143 -1.744586 0.927438 0.454420 -1.110431 -0.471525 0.263717 0.052467

see a different number of top/bottom rows

df.head(2)
0 1 2 3 4 5 6 7 8 9
A 0.412733 0.430821 2.141648 -0.406415 -0.512243 -0.813773 0.615979 1.128972 -0.113947 -0.840156
B -0.824481 0.650593 0.743254 0.543154 -0.665510 0.232161 0.116686 0.218689 0.871429 0.223596
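
tail accepts a row count in the same way, and a negative count passed to head returns everything except the last rows. A small sketch, assuming a stand-in dataframe called demo_df (a hypothetical name, built with a different seed than the notebook's df):

```python
import string
import pandas as pd
from numpy.random import default_rng

# Stand-in for the notebook's df: 26 rows (A-Z) by 10 columns ('0'-'9').
demo_df = pd.DataFrame(
    default_rng(0).normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

last_three = demo_df.tail(3)         # last 3 rows
all_but_last_two = demo_df.head(-2)  # every row except the last 2

print(list(last_three.index))  # ['X', 'Y', 'Z']
print(len(all_but_last_two))   # 24
```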

see a random sample of rows

df.sample(5)
0 1 2 3 4 5 6 7 8 9
G 0.480747 0.446531 0.665385 -0.098485 -0.423298 -0.079718 -1.687334 -1.447112 -1.322700 -0.997247
I 0.840308 -1.727320 0.434424 0.237736 -0.594150 -1.446058 0.072130 -0.529493 0.232676 0.021852
N -0.499296 -1.184944 -0.965117 -0.725226 2.128470 -0.821387 0.838489 -0.902927 0.931573 0.384951
R -0.005122 -0.163443 0.337575 1.407482 0.090585 0.643939 -2.050172 -0.048718 -0.843230 -1.218813
F 0.625590 -0.309347 0.456775 -0.661926 -0.363054 -0.381738 -1.195840 0.486972 -0.469402 0.012494
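
df.sample(5) draws different rows on each run. If a repeatable sample is needed, the random_state parameter can be set, as in this sketch. demo_df is a hypothetical stand-in for the notebook's df, and the seed values are arbitrary.

```python
import string
import pandas as pd
from numpy.random import default_rng

# Stand-in for the notebook's df.
demo_df = pd.DataFrame(
    default_rng(0).normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

# The same random_state selects the same rows each time.
sample_a = demo_df.sample(5, random_state=0)
sample_b = demo_df.sample(5, random_state=0)

print(sample_a.equals(sample_b))  # True
print(len(sample_a))              # 5
```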

get basic information about the dataframe

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 26 entries, A to Z
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       26 non-null     float64
 1   1       26 non-null     float64
 2   2       26 non-null     float64
 3   3       26 non-null     float64
 4   4       26 non-null     float64
 5   5       26 non-null     float64
 6   6       26 non-null     float64
 7   7       26 non-null     float64
 8   8       26 non-null     float64
 9   9       26 non-null     float64
dtypes: float64(10)
memory usage: 2.2+ KB
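
The individual pieces reported by info() are also available as attributes and methods, which is convenient when the values are needed in code rather than printed. A sketch, again with a hypothetical stand-in dataframe named demo_df:

```python
import string
import pandas as pd
from numpy.random import default_rng

# Stand-in for the notebook's df.
demo_df = pd.DataFrame(
    default_rng(0).normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

print(demo_df.shape)                 # (26, 10)
print(demo_df.index[:3].tolist())    # ['A', 'B', 'C']
print(demo_df.columns.tolist())

# Bytes used by each column's data, excluding the index.
per_column_bytes = demo_df.memory_usage(index=False)
print(per_column_bytes.sum())
```

For float64 data the column total is rows x columns x 8 bytes; the index adds the remainder of the figure that info() reports.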

get summary statistics

df.describe()
0 1 2 3 4 5 6 7 8 9
count 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000
mean 0.213455 -0.086445 -0.019708 -0.196146 -0.213611 0.010416 -0.209495 0.078737 0.034164 0.049044
std 1.017632 0.968577 0.958027 0.792970 0.965434 0.964080 1.015842 0.870329 0.990060 0.914550
min -1.368159 -1.727320 -1.682870 -1.744586 -2.132046 -1.467045 -2.050172 -1.447112 -2.147289 -1.224469
25% -0.604638 -0.654317 -0.704865 -0.709401 -0.573673 -0.749161 -1.063689 -0.611532 -0.576423 -0.445415
50% 0.024017 -0.177374 -0.191247 -0.139002 -0.359659 -0.150970 -0.149530 0.195638 0.142852 0.017173
75% 0.799959 0.280057 0.474301 0.297995 0.209141 0.629510 0.460843 0.755625 0.512278 0.376652
max 2.905067 2.913862 2.141648 1.407482 2.128470 2.093168 2.129247 1.767930 2.517474 1.876845
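
Each row of the describe() output corresponds to a regular DataFrame method, and the reported percentiles can be changed with the percentiles parameter. The sketch below checks a few rows against those methods. demo_df is a hypothetical stand-in for the notebook's df, and the 10%/90% choice is arbitrary.

```python
import string
import numpy as np
import pandas as pd
from numpy.random import default_rng

# Stand-in for the notebook's df.
demo_df = pd.DataFrame(
    default_rng(0).normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

summary = demo_df.describe()

# The summary rows match the per-column methods.
print(np.allclose(summary.loc['mean'], demo_df.mean()))   # True
print(np.allclose(summary.loc['std'], demo_df.std()))     # True
print(np.allclose(summary.loc['50%'], demo_df.median()))  # True

# Request the 10th and 90th percentiles instead of the quartiles.
custom = demo_df.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```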

see only the datatypes by column

dict(zip(df.columns, df.dtypes))
{'0': dtype('float64'),
 '1': dtype('float64'),
 '2': dtype('float64'),
 '3': dtype('float64'),
 '4': dtype('float64'),
 '5': dtype('float64'),
 '6': dtype('float64'),
 '7': dtype('float64'),
 '8': dtype('float64'),
 '9': dtype('float64')}
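
The dtypes attribute is itself a Series indexed by column name, so the same dictionary can also be obtained with to_dict(). A sketch with a hypothetical stand-in dataframe named demo_df:

```python
import string
import numpy as np
import pandas as pd
from numpy.random import default_rng

# Stand-in for the notebook's df.
demo_df = pd.DataFrame(
    default_rng(0).normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

dtype_series = demo_df.dtypes      # Series: column name -> dtype
as_dict = dtype_series.to_dict()

# Same mapping as the zip-based version used above.
print(as_dict == dict(zip(demo_df.columns, demo_df.dtypes)))  # True
print(as_dict['0'])                                           # float64
```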