Pandas
This notebook shows how to create and inspect pandas dataframes.
import pandas as pd
import numpy as np
Create a dataframe with random data
from numpy.random import default_rng
rng = default_rng(42)
index_list = list('ABCDEFG')
column_list = list('0123')
n_rows, n_cols = len(index_list), len(column_list)
index_list, column_list
(['A', 'B', 'C', 'D', 'E', 'F', 'G'], ['0', '1', '2', '3'])
data = rng.normal(size=(n_rows, n_cols))
df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
A | 0.304717 | -1.039984 | 0.750451 | 0.940565 |
B | -1.951035 | -1.302180 | 0.127840 | -0.316243 |
C | -0.016801 | -0.853044 | 0.879398 | 0.777792 |
D | 0.066031 | 1.127241 | 0.467509 | -0.859292 |
E | 0.368751 | -0.958883 | 0.878450 | -0.049926 |
F | -0.184862 | -0.680930 | 1.222541 | -0.154529 |
G | -0.428328 | -0.352134 | 0.532309 | 0.365444 |
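Because the generator is seeded, rerunning the cells from the top should reproduce the same values. The following is a minimal sketch of that idea, assuming the same seed (42) and shape as above; the names `df_a`, `df_b`, and `same` are introduced here for illustration only.

```python
import pandas as pd
from numpy.random import default_rng

index_list = list("ABCDEFG")
column_list = list("0123")
shape = (len(index_list), len(column_list))

# Two generators created with the same seed produce the same draws,
# so the two dataframes built from them hold identical values.
df_a = pd.DataFrame(default_rng(42).normal(size=shape),
                    index=index_list, columns=column_list)
df_b = pd.DataFrame(default_rng(42).normal(size=shape),
                    index=index_list, columns=column_list)

same = df_a.equals(df_b)
print(df_a.shape)
print(same)
```

Note that reusing a single `rng` object for a second call, as the next section does, continues the random stream rather than restarting it, so the second dataframe gets different numbers.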
Another way to get lists of letters and digits to use as index/columns
import string
index_list = list(string.ascii_uppercase)
column_list = list(string.digits)
n_rows, n_cols = len(index_list), len(column_list)
See the documentation for Python's string module for details on these constants.
data = rng.normal(size=(n_rows, n_cols))
df = pd.DataFrame(data=data, index=index_list, columns=column_list)
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
A | 0.412733 | 0.430821 | 2.141648 | -0.406415 | -0.512243 | -0.813773 | 0.615979 | 1.128972 | -0.113947 | -0.840156 |
B | -0.824481 | 0.650593 | 0.743254 | 0.543154 | -0.665510 | 0.232161 | 0.116686 | 0.218689 | 0.871429 | 0.223596 |
C | 0.678914 | 0.067579 | 0.289119 | 0.631288 | -1.457156 | -0.319671 | -0.470373 | -0.638878 | -0.275142 | 1.494941 |
D | -0.865831 | 0.968278 | -1.682870 | -0.334885 | 0.162753 | 0.586222 | 0.711227 | 0.793347 | -0.348725 | -0.462352 |
E | 0.857976 | -0.191304 | -1.275686 | -1.133287 | -0.919452 | 0.497161 | 0.142426 | 0.690485 | -0.427253 | 0.158540 |
F | 0.625590 | -0.309347 | 0.456775 | -0.661926 | -0.363054 | -0.381738 | -1.195840 | 0.486972 | -0.469402 | 0.012494 |
G | 0.480747 | 0.446531 | 0.665385 | -0.098485 | -0.423298 | -0.079718 | -1.687334 | -1.447112 | -1.322700 | -0.997247 |
H | 0.399774 | -0.905479 | -0.378163 | 1.299228 | -0.356264 | 0.737516 | -0.933618 | -0.205438 | -0.950022 | -0.339033 |
I | 0.840308 | -1.727320 | 0.434424 | 0.237736 | -0.594150 | -1.446058 | 0.072130 | -0.529493 | 0.232676 | 0.021852 |
J | 1.601779 | -0.239356 | -1.023497 | 0.179276 | 0.219997 | 1.359188 | 0.835111 | 0.356871 | 1.463303 | -1.188763 |
K | -0.639752 | -0.926576 | -0.389810 | -1.376686 | 0.635151 | -0.222223 | -1.470806 | -1.015579 | 0.313514 | 0.838127 |
L | 1.996731 | 2.913862 | 0.414409 | -0.989538 | -2.132046 | 0.267711 | -0.812941 | -0.415357 | -0.612097 | -0.140791 |
M | 1.065980 | 0.157049 | -0.158635 | -1.035654 | -1.674683 | -0.486308 | -0.053783 | 1.767930 | 0.130275 | 0.982740 |
N | -0.499296 | -1.184944 | -0.965117 | -0.725226 | 2.128470 | -0.821387 | 0.838489 | -0.902927 | 0.931573 | 0.384951 |
O | -0.156638 | -0.040763 | -0.654788 | 0.446072 | -0.454983 | -1.225606 | -1.277938 | 0.172588 | 1.579091 | 0.159992 |
P | -0.118638 | 0.285826 | 1.306002 | 0.219383 | -0.410927 | 1.106289 | 0.428756 | 1.535756 | 0.183234 | -1.224469 |
Q | -1.368159 | 1.650928 | 1.723666 | -0.179519 | -0.383187 | 1.461444 | -1.107046 | -0.894727 | 0.643327 | -0.394605 |
R | -0.005122 | -0.163443 | 0.337575 | 1.407482 | 0.090585 | 0.643939 | -2.050172 | -0.048718 | -0.843230 | -1.218813 |
S | -0.878152 | -0.334123 | 0.915903 | -1.326393 | 0.030631 | -0.484169 | -0.327673 | 1.002758 | 0.538115 | 1.337398 |
T | -0.154506 | -0.695943 | -0.223859 | 0.242497 | 0.176573 | -1.084388 | 0.090490 | 0.228228 | 2.517474 | 1.876845 |
U | -0.853243 | -0.287383 | -1.463442 | -0.590707 | 0.315605 | 1.205854 | -0.729084 | -0.654146 | -2.147289 | -0.162666 |
V | -1.062414 | -0.529439 | -0.876861 | -0.094263 | -1.757728 | -1.467045 | 2.129247 | -1.287423 | -1.096786 | 1.836914 |
W | 2.905067 | -1.171567 | -0.368249 | 0.341556 | 1.728698 | -0.986857 | -0.245278 | 0.777338 | 0.434766 | -0.376156 |
X | -0.133823 | -1.374896 | -0.238174 | -0.266387 | 0.232170 | -0.555327 | 0.471539 | 1.012716 | 0.155429 | 0.351756 |
Y | 0.053155 | 0.000084 | -0.721558 | 0.316494 | -0.097287 | 2.093168 | 1.573355 | 0.385847 | -0.763057 | -1.112411 |
Z | 1.191143 | 0.262749 | 0.480143 | -1.744586 | 0.927438 | 0.454420 | -1.110431 | -0.471525 | 0.263717 | 0.052467 |
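One detail worth keeping in mind: `string.digits` is a string, so the column labels are the characters '0' to '9', not integers. The sketch below rebuilds a dataframe of the same shape (the values are not the point here) and shows that selecting a column needs the string label; `int_lookup_failed` is a name introduced for illustration only.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),  # 'A' .. 'Z'
    columns=list(string.digits),         # '0' .. '9' (strings, not ints)
)

# Selecting with the string label works.
col = df["0"]

# Selecting with the integer 0 raises KeyError, because no column has that label.
int_lookup_failed = False
try:
    df[0]
except KeyError:
    int_lookup_failed = True

print(col.shape, int_lookup_failed)
```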
Inspect the dataframe
See the first/last 5 rows
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
A | 0.412733 | 0.430821 | 2.141648 | -0.406415 | -0.512243 | -0.813773 | 0.615979 | 1.128972 | -0.113947 | -0.840156 |
B | -0.824481 | 0.650593 | 0.743254 | 0.543154 | -0.665510 | 0.232161 | 0.116686 | 0.218689 | 0.871429 | 0.223596 |
C | 0.678914 | 0.067579 | 0.289119 | 0.631288 | -1.457156 | -0.319671 | -0.470373 | -0.638878 | -0.275142 | 1.494941 |
D | -0.865831 | 0.968278 | -1.682870 | -0.334885 | 0.162753 | 0.586222 | 0.711227 | 0.793347 | -0.348725 | -0.462352 |
E | 0.857976 | -0.191304 | -1.275686 | -1.133287 | -0.919452 | 0.497161 | 0.142426 | 0.690485 | -0.427253 | 0.158540 |
df.tail()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
V | -1.062414 | -0.529439 | -0.876861 | -0.094263 | -1.757728 | -1.467045 | 2.129247 | -1.287423 | -1.096786 | 1.836914 |
W | 2.905067 | -1.171567 | -0.368249 | 0.341556 | 1.728698 | -0.986857 | -0.245278 | 0.777338 | 0.434766 | -0.376156 |
X | -0.133823 | -1.374896 | -0.238174 | -0.266387 | 0.232170 | -0.555327 | 0.471539 | 1.012716 | 0.155429 | 0.351756 |
Y | 0.053155 | 0.000084 | -0.721558 | 0.316494 | -0.097287 | 2.093168 | 1.573355 | 0.385847 | -0.763057 | -1.112411 |
Z | 1.191143 | 0.262749 | 0.480143 | -1.744586 | 0.927438 | 0.454420 | -1.110431 | -0.471525 | 0.263717 | 0.052467 |
See a different number of top/bottom rows
df.head(2)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
A | 0.412733 | 0.430821 | 2.141648 | -0.406415 | -0.512243 | -0.813773 | 0.615979 | 1.128972 | -0.113947 | -0.840156 |
B | -0.824481 | 0.650593 | 0.743254 | 0.543154 | -0.665510 | 0.232161 | 0.116686 | 0.218689 | 0.871429 | 0.223596 |
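The same argument works for `tail`, and both methods also accept a negative count, which means "everything except that many rows". A short sketch, assuming a dataframe shaped like the one above (rebuilt here so the block runs on its own; the variable names are illustrative):

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

last_three = df.tail(3)          # rows X, Y, Z
all_but_last_two = df.head(-2)   # rows A .. X
all_but_first_two = df.tail(-2)  # rows C .. Z

print(list(last_three.index))
print(len(all_but_last_two), len(all_but_first_two))
```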
See a random sample of rows
df.sample(5)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
G | 0.480747 | 0.446531 | 0.665385 | -0.098485 | -0.423298 | -0.079718 | -1.687334 | -1.447112 | -1.322700 | -0.997247 |
I | 0.840308 | -1.727320 | 0.434424 | 0.237736 | -0.594150 | -1.446058 | 0.072130 | -0.529493 | 0.232676 | 0.021852 |
N | -0.499296 | -1.184944 | -0.965117 | -0.725226 | 2.128470 | -0.821387 | 0.838489 | -0.902927 | 0.931573 | 0.384951 |
R | -0.005122 | -0.163443 | 0.337575 | 1.407482 | 0.090585 | 0.643939 | -2.050172 | -0.048718 | -0.843230 | -1.218813 |
F | 0.625590 | -0.309347 | 0.456775 | -0.661926 | -0.363054 | -0.381738 | -1.195840 | 0.486972 | -0.469402 | 0.012494 |
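Called this way, `df.sample(5)` will usually return different rows on each run. If a repeatable sample is wanted, `sample` accepts a `random_state` argument, and `frac` can be used in place of a row count. A minimal sketch, with the dataframe rebuilt so the block is self-contained and the seed value 0 chosen arbitrarily:

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

# The same random_state gives the same rows each time.
sample_a = df.sample(5, random_state=0)
sample_b = df.sample(5, random_state=0)

# frac selects a fraction of the rows instead of a fixed count.
half = df.sample(frac=0.5, random_state=0)

print(sample_a.equals(sample_b))
print(len(half))
```

By default the sampling is without replacement, so no row appears twice.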
Get basic information about the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 26 entries, A to Z
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 26 non-null float64
1 1 26 non-null float64
2 2 26 non-null float64
3 3 26 non-null float64
4 4 26 non-null float64
5 5 26 non-null float64
6 6 26 non-null float64
7 7 26 non-null float64
8 8 26 non-null float64
9 9 26 non-null float64
dtypes: float64(10)
memory usage: 2.2+ KB
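`df.info()` prints its report and returns `None`, so it is not convenient when the numbers are needed in code. The sketch below shows some attributes and methods that expose similar information as values; the dataframe is rebuilt so the block runs on its own, and the variable names are illustrative.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

info_result = df.info()        # prints the report; the return value is None

n_rows, n_cols = df.shape      # number of rows and columns
null_counts = df.isna().sum()  # missing values per column
mem = df.memory_usage()        # bytes used by the index and by each column

print(n_rows, n_cols)
print(int(null_counts.sum()))
print(mem["0"])                # 26 float64 values at 8 bytes each
```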
Get summary statistics
df.describe()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
count | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 | 26.000000 |
mean | 0.213455 | -0.086445 | -0.019708 | -0.196146 | -0.213611 | 0.010416 | -0.209495 | 0.078737 | 0.034164 | 0.049044 |
std | 1.017632 | 0.968577 | 0.958027 | 0.792970 | 0.965434 | 0.964080 | 1.015842 | 0.870329 | 0.990060 | 0.914550 |
min | -1.368159 | -1.727320 | -1.682870 | -1.744586 | -2.132046 | -1.467045 | -2.050172 | -1.447112 | -2.147289 | -1.224469 |
25% | -0.604638 | -0.654317 | -0.704865 | -0.709401 | -0.573673 | -0.749161 | -1.063689 | -0.611532 | -0.576423 | -0.445415 |
50% | 0.024017 | -0.177374 | -0.191247 | -0.139002 | -0.359659 | -0.150970 | -0.149530 | 0.195638 | 0.142852 | 0.017173 |
75% | 0.799959 | 0.280057 | 0.474301 | 0.297995 | 0.209141 | 0.629510 | 0.460843 | 0.755625 | 0.512278 | 0.376652 |
max | 2.905067 | 2.913862 | 2.141648 | 1.407482 | 2.128470 | 2.093168 | 2.129247 | 1.767930 | 2.517474 | 1.876845 |
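The quartiles shown are the defaults; `describe` takes a `percentiles` argument if other cut points are more useful, and the result is an ordinary dataframe that can be transposed or indexed. A short sketch, assuming the same shape of data as above (rebuilt here) and an arbitrary choice of percentiles:

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# Transposing puts one column of the original data on each row,
# which is often easier to read when there are many columns.
transposed = summary.T

print(list(summary.index))
print(transposed.shape)
```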
See only the datatypes by column
df.dtypes.to_dict()
{'0': dtype('float64'),
'1': dtype('float64'),
'2': dtype('float64'),
'3': dtype('float64'),
'4': dtype('float64'),
'5': dtype('float64'),
'6': dtype('float64'),
'7': dtype('float64'),
'8': dtype('float64'),
'9': dtype('float64')}
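Once the dtypes are known, a common next step is to select or convert columns by dtype. The sketch below is one way to do that, not part of the original notebook; the `label` column is a hypothetical addition used only to give the dataframe a non-numeric column.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    rng.normal(size=(26, 10)),
    index=list(string.ascii_uppercase),
    columns=list(string.digits),
)

# Hypothetical extra column of strings, added to a copy for illustration.
mixed = df.assign(label=list(string.ascii_lowercase))

# Keep only the numeric columns.
numeric = mixed.select_dtypes(include="number")

# Convert every column of the original dataframe to a smaller float type.
converted = df.astype("float32")

print(numeric.shape)
print(converted.dtypes.unique())
```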