Pandas Exercises Part 2

Welcome back, guys! We will continue with part 2 of this series of Pandas exercises. I am very excited about this post because we will be introducing the DataFrame, the most used Pandas data structure. I hope you guys will enjoy this post.

Without further ado, let’s get started.

We will start by importing Pandas and NumPy:

import pandas as pd
import numpy as np

Ex 26: How to get the mean of a series grouped by another series?

Q: Compute the mean of weights of each fruit.

fruits = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

Desired output

# Keep in mind that your values will be different from mine and you might randomly select only 2 fruits instead of 3.

Pandas_ex26

Solution

fruits_weights = pd.concat({"fruits":fruits,"weights":weights},axis=1)
fruits_weights.groupby(by="fruits").mean()
weights
fruits
apple 5.4
banana 6.5
carrot 2.0

We concatenate the two series horizontally (by setting axis=1) into a dataframe using the concat function, then group that dataframe by the fruit name. After grouping, we get the mean weight of each fruit using the mean function.
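
Since groupby also accepts another Series as the grouping key, the same result can be obtained without building the intermediate dataframe; a minimal sketch:

weights.groupby(fruits).mean()  # group the weights Series directly by the fruits Series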

Ex 27: How to compute the euclidean distance between two series?

Q: Compute the Euclidean distance between series (points) p and q, using a packaged formula and another solution without.

Euclidean distance formula:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

Desired output

# 18.165

Solution

1st Method using a built-in function

np.linalg.norm(p-q)
18.16590212458495

We can get the Euclidean distance by calling NumPy’s linalg.norm function and passing in the difference of the two series.

2nd Method without using a built-in function

sum((p - q)**2)**.5
18.16590212458495

Using the Euclidean formula above, we can compute the distance with basic operators. We subtract the corresponding elements of the two series, square the differences (exponent 2), sum them up, and finally take the square root (exponent 0.5).
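
SciPy also ships a packaged helper for this distance; a sketch, assuming SciPy is installed:

from scipy.spatial.distance import euclidean

euclidean(p, q)
# 18.16590212458495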

Ex 28: How to find all the local maxima (or peaks) in a numeric series?

Q: Get the positions of peaks (values surrounded by smaller values on both sides) in ser.

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

Desired output

# array([1, 5, 7])

Solution

from scipy.signal import argrelextrema

argrelextrema(ser.values, np.greater)
(array([1, 5, 7]),)

To calculate the relative extrema of the series, we use the argrelextrema function from SciPy (Scientific Python), a library built on top of NumPy for mathematics, science, and engineering.

In that function, we pass in the series and the comparator. Since we are looking for the maxima in this case, the comparator will be np.greater.
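
If SciPy is not available, the peaks can be recovered with NumPy alone; a sketch based on the sign of the first difference:

diff_sign = np.sign(np.diff(ser.values))   # +1 where the series rises, -1 where it falls
np.where(np.diff(diff_sign) == -2)[0] + 1  # a rise followed by a fall is a peak
# array([1, 5, 7])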

Ex 29: How to replace spaces in a string with the least frequent character?

Q: Replace the spaces in ser with the least frequent character.

ser = pd.Series(list('dbc deb abed gagde'))

Desired output

# least frequent element is c

# ['d',
#  'b',
#  'c',
#  'c',
#  'd',
#  'e',
#  'b',
#  'c',
#  'a',
#  'b',
#  'e',
#  'd',
#  'c',
#  'g',
#  'a',
#  'g',
#  'd',
#  'e']

Solution

from collections import Counter

least_common_char = Counter(ser.replace(" ","")).most_common()[-1][0]
Counter(ser.replace(" ","")).most_common()
[('d', 4), ('b', 3), ('', 3), ('e', 3), ('a', 2), ('g', 2), ('c', 1)]
least_common_char
'c'
ser.replace(" ",least_common_char)
['d',
 'b',
 'c',
 'c',
 'd',
 'e',
 'b',
 'c',
 'a',
 'b',
 'e',
 'd',
 'c',
 'g',
 'a',
 'g',
 'd',
 'e']

To replace the white space with the least common character in the series, we first need to find that character.

To find it, we use the Counter class from the collections module. We pass in the series with each space replaced by an empty string (replacing “ “ with “”) and call most_common on the resulting Counter. We get back a list of tuples with all characters and their counts in decreasing order, so we use index -1 to target the last tuple and 0 to get the character inside it.

Now that we have the least common character, we can replace all the instances of white space by the least common character.
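
The same result can be reached with pandas alone; a sketch using value_counts (ties among least frequent characters are broken arbitrarily):

freq = ser[ser != " "].value_counts()  # counts of the non-space characters
ser.replace(" ", freq.index[-1])       # the last label is the least frequent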

Ex 30: How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays) and have random numbers as values?

Q: Create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays) and have random numbers as values?

Desired output

# values will be different due to randomness

# 2000-01-01    4
# 2000-01-08    1
# 2000-01-15    8
# 2000-01-22    4
# 2000-01-29    4
# 2000-02-05    2
# 2000-02-12    4
# 2000-02-19    9
# 2000-02-26    6
# 2000-03-04    6

Solution

pd.Series(np.random.randint(1,high=10,size=10),pd.date_range(start="2000-01-01",periods=10,freq="W-SAT"))
2000-01-01    1
2000-01-08    9
2000-01-15    1
2000-01-22    3
2000-01-29    8
2000-02-05    7
2000-02-12    3
2000-02-19    8
2000-02-26    4
2000-03-04    7
Freq: W-SAT, dtype: int64

As the values of the series, we create ten random integers from 1 to 10; as the index, we call the date_range function, which returns a date range starting from 2000-01-01, with the number of periods set to 10 and the frequency set to Saturday weekly (W-SAT). To get the list of all the frequencies, visit this link.
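
For reference, date_range accepts many other frequency aliases; a small sketch (alias spellings can vary slightly across pandas versions):

pd.date_range(start="2000-01-01", periods=3, freq="D")      # calendar days
pd.date_range(start="2000-01-01", periods=3, freq="W-SUN")  # weekly on Sundays
pd.date_range(start="2000-01-01", periods=3, freq="M")      # month ends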

Ex 31: How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

Q: ser has missing dates and values. Make all missing dates appear and fill up with value from previous date.

import pandas as pd 
import numpy as np
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))

Desired output

# 2000-01-01     1.0
# 2000-01-02     1.0
# 2000-01-03    10.0
# 2000-01-04    10.0
# 2000-01-05    10.0
# 2000-01-06     3.0
# 2000-01-07     3.0
# 2000-01-08     NaN

Solution

ser.resample("D").ffill()
2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64

We use the resample function to fill up all the days that are missing between 2000-01-01 and 2000-01-08, using the character D to specify that we want days as the interval. We then use the ffill function to fill the missing values from the previous row, et voila!
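
An equivalent one-liner uses asfreq, whose method argument fills only the holes created by the new daily index and, like resample, leaves the pre-existing NaN on 2000-01-08 untouched; a sketch:

ser.asfreq("D", method="ffill")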

Ex 32: How to compute the autocorrelations of a numeric series?

Q: Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

Desired output

# values will change due to randomness

# [-0.462232351922819,
#  0.24702149262453904,
#  -0.3667824631718427,
#  0.09378057953432406,
#  0.3382941938771548,
#  -0.04450324725676436,
#  0.16361925861505003,
#  -0.5351035019540977,
#  0.26359968436232056,
#  0.03944833988252732]
# the lag with the highest correlation is 8

Solution

autocorr = [ser.autocorr(lag=i) for i in range(11)][1:]
autocorr
[-0.462232351922819,
 0.24702149262453904,
 -0.3667824631718427,
 0.09378057953432406,
 0.3382941938771548,
 -0.04450324725676436,
 0.16361925861505003,
 -0.5351035019540977,
 0.26359968436232056,
 0.03944833988252732]
print("the lag with the highest correlation is {}".format(np.argmax(np.abs(autocorr))+1))
the lag with the highest correlation is 8

We first calculate the autocorrelation for each lag, looping through the lags with range and a list comprehension. We then slice off the first element, since the lag-0 autocorrelation of a series with itself is always 1.

After finding all the correlations, it is time to find the position of the largest one. To do this, we use the NumPy argmax function on the absolute values (turning negative correlations positive) and add 1, since lags start at 1 while list positions start at 0.
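
As a side note, autocorr at a given lag is simply the Pearson correlation of the series with a shifted copy of itself, which the following sketch verifies:

ser.autocorr(lag=1) == ser.corr(ser.shift(1))
# True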

Ex 33: How to import only every nth row from a csv file to create a dataframe?

Q: Import every 50th row of BostonHousing dataset as a dataframe.

Desired output

Pandas_ex33

Solution

boston_housing_dataset = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
boston_housing_dataset.head()
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
boston_housing_dataset[::50]
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
50 0.08873 21.0 5.64 0 0.439 5.963 45.7 6.8147 4 243 16.8 395.56 13.45 19.7
100 0.14866 0.0 8.56 0 0.520 6.727 79.9 2.7778 5 384 20.9 394.76 9.42 27.5
150 1.65660 0.0 19.58 0 0.871 6.122 97.3 1.6180 5 403 14.7 372.80 14.10 21.5
200 0.01778 95.0 1.47 0 0.403 7.135 13.9 7.6534 3 402 17.0 384.30 4.45 32.9
250 0.14030 22.0 5.86 0 0.431 6.487 13.0 7.3967 7 330 19.1 396.28 5.90 24.4
300 0.04417 70.0 2.24 0 0.400 6.871 47.4 7.8278 5 358 14.8 390.86 6.07 24.8
350 0.06211 40.0 1.25 0 0.429 6.490 44.4 8.7921 1 335 19.7 396.90 5.98 22.9
400 25.04610 0.0 18.10 0 0.693 5.987 100.0 1.5888 24 666 20.2 396.90 26.77 5.6
450 6.71772 0.0 18.10 0 0.713 6.749 92.6 2.3236 24 666 20.2 0.32 17.44 13.4
500 0.22438 0.0 9.69 0 0.585 6.027 79.7 2.4982 6 391 19.2 396.90 14.33 16.8

To import a CSV (comma-separated values) dataset, we use the read_csv function and pass in the link to the CSV file. Now we have the dataset imported and stored in boston_housing_dataset.

To get every 50th row in the dataset, we use indexing with a step of 50.
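
Plain indexing requires reading the whole file into memory first. The selection can also happen at read time; a sketch using the chunksize parameter of read_csv:

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
chunks = pd.read_csv(url, chunksize=50)                        # an iterator of 50-row dataframes
every_50th = pd.concat([chunk.iloc[[0]] for chunk in chunks])  # keep the first row of each chunk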

Ex 34: How to change column values when importing csv to a dataframe?

Q: Import the BostonHousing dataset, but while importing change the ‘medv’ (median house value) column so that values < 25 become ‘Low’ and values > 25 become ‘High’.

Desired output

Pandas_ex34

Solution

boston_housing_dataset = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv",converters={"medv": lambda x: "High" if float(x) > 25 else "Low"})
boston_housing_dataset.head()
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 Low
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 Low
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 High
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 High
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 High

To change the values of a column while importing the dataset, we use the converters parameter of the read_csv function. We pass in a dictionary whose key is the name of the column we want to change and whose value is a lambda expression returning “High” for values greater than 25 and “Low” otherwise.
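
The same recoding can also be done after a plain import; a sketch using np.where instead of converters:

df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
df["medv"] = np.where(df["medv"] > 25, "High", "Low")  # vectorized High/Low recoding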

Ex 35: How to create a dataframe with rows as strides from a given series?

Q: Create a dataframe with rows as strides from a given series with 2 as the stride length and 4 as length of each row.

the_series = pd.Series(range(15))

Desired output

# array([[ 0,  1,  2,  3],
#        [ 2,  3,  4,  5],
#        [ 4,  5,  6,  7],
#        [ 6,  7,  8,  9],
#        [ 8,  9, 10, 11],
#        [10, 11, 12, 13]])

Solution

def gen_strides(the_series, stride_len=2, window_len=4):
    n_strides = ((the_series.size - window_len)//stride_len) + 1
    return np.array([the_series[i:(i+window_len)] for i in np.arange(0,the_series.size,stride_len)[:n_strides]])

gen_strides(the_series)

array([[ 0,  1,  2,  3],
       [ 2,  3,  4,  5],
       [ 4,  5,  6,  7],
       [ 6,  7,  8,  9],
       [ 8,  9, 10, 11],
       [10, 11, 12, 13]])

Strides are used in CNNs (convolutional neural networks), which will be covered in a future post. To get the strides, we first create a function that takes in a stride length of 2 (which means the last two elements of a row are the same as the first two elements of the following row) and a window length of 4 (the number of elements in each row, i.e. the number of columns).

We first get the number of rows by subtracting the window length from the size of the series, floor-dividing by the stride length, and finally adding 1.

We proceed with a list comprehension, looping through an array that starts at 0, steps by stride_len, and stops at the_series.size. We slice with n_strides to keep only the first six starting positions, because we only have six rows.

Now it is time to populate the rows, for each row we use the original series and start from index i to index i + window_len to get the strides.
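
On NumPy 1.20 or later (an assumption), the same windows can be produced without a hand-written loop; a sketch using sliding_window_view:

from numpy.lib.stride_tricks import sliding_window_view

sliding_window_view(the_series.to_numpy(), 4)[::2]  # every 2nd window of length 4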

Ex 36: How to import only specified columns from a csv file?

Q: Import crim and medv columns of the BostonHousing dataset as a dataframe.

Desired output

Pandas_ex36

Solution

pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv",usecols=["crim","medv"]).head()
crim medv
0 0.00632 24.0
1 0.02731 21.6
2 0.02729 34.7
3 0.03237 33.4
4 0.06905 36.2

To import only the crim and medv columns, we pass a list with the names of those two columns to the usecols parameter.

Ex 37: How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also, get the array and list equivalent.

Q: Get the number of rows, columns, datatype, columns for each datatype and statistical summary of each column of the Cars93 dataset. Also, get the numpy array and list equivalent of the dataframe.

Desired output

# Rows and columns

#(93, 27)

# Datatype

# Manufacturer           object
# Model                  object
# Type                   object
# Min.Price             float64
# Price                 float64
# Max.Price             float64
# MPG.city              float64
# MPG.highway           float64
# AirBags                object
# DriveTrain             object
# Cylinders              object
# EngineSize            float64
# Horsepower            float64
# RPM                   float64
# Rev.per.mile          float64
# Man.trans.avail        object
# Fuel.tank.capacity    float64
# Passengers            float64
# Length                float64
# Wheelbase             float64
# Width                 float64
# Turn.circle           float64
# Rear.seat.room        float64
# Luggage.room          float64
# Weight                float64
# Origin                 object
# Make                   object
# dtype: object

# Columns for each datatype

# float64    18
# object      9
# dtype: int64

# Statistical summary

# Min.Price	Price	Max.Price	MPG.city	MPG.highway	EngineSize	Horsepower	RPM	Rev.per.mile	Fuel.tank.capacity	Passengers	Length	Wheelbase	Width	Turn.circle	Rear.seat.room	Luggage.room	Weight
# count	86.000000	91.000000	88.000000	84.000000	91.000000	91.000000	86.000000	90.000000	87.000000	85.000000	91.000000	89.000000	92.000000	87.000000	88.000000	89.000000	74.000000	86.000000
# mean	17.118605	19.616484	21.459091	22.404762	29.065934	2.658242	144.000000	5276.666667	2355.000000	16.683529	5.076923	182.865169	103.956522	69.448276	38.954545	27.853933	13.986486	3104.593023
# std	8.828290	9.724280	10.696563	5.841520	5.370293	1.045845	53.455204	605.554811	486.916616	3.375748	1.045953	14.792651	6.856317	3.778023	3.304157	3.018129	3.120824	600.129993
# min	6.700000	7.400000	7.900000	15.000000	20.000000	1.000000	55.000000	3800.000000	1320.000000	9.200000	2.000000	141.000000	90.000000	60.000000	32.000000	19.000000	6.000000	1695.000000
# 25%	10.825000	12.350000	14.575000	18.000000	26.000000	1.800000	100.750000	4800.000000	2017.500000	14.500000	4.000000	174.000000	98.000000	67.000000	36.000000	26.000000	12.000000	2647.500000
# 50%	14.600000	17.700000	19.150000	21.000000	28.000000	2.300000	140.000000	5200.000000	2360.000000	16.500000	5.000000	181.000000	103.000000	69.000000	39.000000	27.500000	14.000000	3085.000000
# 75%	20.250000	23.500000	24.825000	25.000000	31.000000	3.250000	170.000000	5787.500000	2565.000000	19.000000	6.000000	192.000000	110.000000	72.000000	42.000000	30.000000	16.000000	3567.500000
# max	45.400000	61.900000	80.000000	46.000000	50.000000	5.700000	300.000000	6500.000000	3755.000000	27.000000	8.000000	219.000000	119.000000	78.000000	45.000000	36.000000	22.000000	4105.000000

# NumPy array

# array(['Acura', 'Integra', 'Small', 12.9, 15.9, 18.8, 25.0, 31.0, 'None',
#        'Front', '4', 1.8, 140.0, 6300.0, 2890.0, 'Yes', 13.2, 5.0, 177.0,
#        102.0, 68.0, 37.0, 26.5, nan, 2705.0, 'non-USA', 'Acura Integra'],
#       dtype=object)

# List

# ['Acura',
#   'Integra',
#   'Small',
#   12.9,
#   15.9,
#   18.8,
#   25.0,
#   31.0,
#   'None',
#   'Front',
#   '4',
#   1.8,
#   140.0,
#   6300.0,
#   2890.0,
#   'Yes',
#   13.2,
#   5.0,
#   177.0,
#   102.0,
#   68.0,
#   37.0,
#   26.5,
#   nan,
#   2705.0,
#   'non-USA',
#   'Acura Integra']

Solution

cars_dataset = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv")

Shape

cars_dataset.shape
(93, 27)

We access the shape attribute of the dataframe and get back a tuple whose first element is the number of rows and whose second element is the number of columns.

Datatype

cars_dataset.dtypes
Manufacturer           object
Model                  object
Type                   object
Min.Price             float64
Price                 float64
Max.Price             float64
MPG.city              float64
MPG.highway           float64
AirBags                object
DriveTrain             object
Cylinders              object
EngineSize            float64
Horsepower            float64
RPM                   float64
Rev.per.mile          float64
Man.trans.avail        object
Fuel.tank.capacity    float64
Passengers            float64
Length                float64
Wheelbase             float64
Width                 float64
Turn.circle           float64
Rear.seat.room        float64
Luggage.room          float64
Weight                float64
Origin                 object
Make                   object
dtype: object

To get the datatype of each column, we access the dtypes attribute of the dataframe.

Columns for each datatype

cars_dataset.dtypes.value_counts()
float64    18
object      9
dtype: int64

To get the column count for each datatype, we call value_counts on the dtypes attribute.

Statistical summary

cars_dataset.describe()
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight
count 86.000000 91.000000 88.000000 84.000000 91.000000 91.000000 86.000000 90.000000 87.000000 85.000000 91.000000 89.000000 92.000000 87.000000 88.000000 89.000000 74.000000 86.000000
mean 17.118605 19.616484 21.459091 22.404762 29.065934 2.658242 144.000000 5276.666667 2355.000000 16.683529 5.076923 182.865169 103.956522 69.448276 38.954545 27.853933 13.986486 3104.593023
std 8.828290 9.724280 10.696563 5.841520 5.370293 1.045845 53.455204 605.554811 486.916616 3.375748 1.045953 14.792651 6.856317 3.778023 3.304157 3.018129 3.120824 600.129993
min 6.700000 7.400000 7.900000 15.000000 20.000000 1.000000 55.000000 3800.000000 1320.000000 9.200000 2.000000 141.000000 90.000000 60.000000 32.000000 19.000000 6.000000 1695.000000
25% 10.825000 12.350000 14.575000 18.000000 26.000000 1.800000 100.750000 4800.000000 2017.500000 14.500000 4.000000 174.000000 98.000000 67.000000 36.000000 26.000000 12.000000 2647.500000
50% 14.600000 17.700000 19.150000 21.000000 28.000000 2.300000 140.000000 5200.000000 2360.000000 16.500000 5.000000 181.000000 103.000000 69.000000 39.000000 27.500000 14.000000 3085.000000
75% 20.250000 23.500000 24.825000 25.000000 31.000000 3.250000 170.000000 5787.500000 2565.000000 19.000000 6.000000 192.000000 110.000000 72.000000 42.000000 30.000000 16.000000 3567.500000
max 45.400000 61.900000 80.000000 46.000000 50.000000 5.700000 300.000000 6500.000000 3755.000000 27.000000 8.000000 219.000000 119.000000 78.000000 45.000000 36.000000 22.000000 4105.000000

To get the statistical summary (mean, std, percentile, min, max and count), we call the describe function on the dataset.

Dataframe to NumPy

cars_dataset.iloc[0].to_numpy()
array(['Acura', 'Integra', 'Small', 12.9, 15.9, 18.8, 25.0, 31.0, 'None',
       'Front', '4', 1.8, 140.0, 6300.0, 2890.0, 'Yes', 13.2, 5.0, 177.0,
       102.0, 68.0, 37.0, 26.5, nan, 2705.0, 'non-USA', 'Acura Integra'],
      dtype=object)

We extract the first row and call the to_numpy function to cast the row to a NumPy array. It is also possible to cast the whole dataframe to an array.

Dataframe to list

cars_dataset.iloc[0].values.tolist()
['Acura',
 'Integra',
 'Small',
 12.9,
 15.9,
 18.8,
 25.0,
 31.0,
 'None',
 'Front',
 '4',
 1.8,
 140.0,
 6300.0,
 2890.0,
 'Yes',
 13.2,
 5.0,
 177.0,
 102.0,
 68.0,
 37.0,
 26.5,
 nan,
 2705.0,
 'non-USA',
 'Acura Integra']

We extract the first row and call the tolist function on the values attribute to cast the row to a list. It is also possible to cast the whole dataframe to a list.

Ex 38: How to extract the row and column number of a particular cell with a given criterion?

Q: Which manufacturer, model and type have the highest Price? What are the row and column number of the cell with the highest Price value?

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

# Manufacturer with the highest price

# 'Mercedes-Benz'

# Model with the highest price

# '300E'

# Type with the highest price

# 'Midsize'

# row and column number of the cell with the highest Price value

# (array([58]), array([4]))

Solution

Manufacturer, model and type with the highest price

cars_dataset.iloc[np.argmax(cars_dataset["Price"])]["Manufacturer"]
'Mercedes-Benz'
cars_dataset.iloc[np.argmax(cars_dataset["Price"])]["Model"]
'300E'
cars_dataset.iloc[np.argmax(cars_dataset["Price"])]["Type"]
'Midsize'

We first find the row with the highest price using the NumPy argmax function by passing in the price column as an argument. With the index number of the row with the highest price, we use iloc to get all the columns of that row as Series and then use indexing on that Series to get the manufacturer, model and type.
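
Pandas’ own idxmax offers a slightly more direct version of the same lookup, returning the row label of the maximum and letting loc pull several columns at once; a sketch:

cars_dataset.loc[cars_dataset["Price"].idxmax(), ["Manufacturer", "Model", "Type"]]
# Manufacturer 'Mercedes-Benz', Model '300E', Type 'Midsize'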

Row and column with the highest price

np.where(cars_dataset.values == np.max(cars_dataset["Price"]))
(array([58]), array([4]))

We use the NumPy where function to compare all the values in the dataset with the highest price; it returns a tuple with the row and column indices.

Ex 39: How to rename a specific column in a dataframe?

Q: Rename the column Type as CarType in df and replace the ‘.’ in column names with ‘_’.

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
cars_dataset.columns
Index(['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price',
       'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders',
       'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail',
       'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
       'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin',
       'Make'],
      dtype='object')

Desired output

# Index(['Manufacturer', 'Model', 'Type', 'Min_Price', 'Price', 'Max_Price',
#        'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
#        'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
#        'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
#        'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
#        'Make'],
#       dtype='object')

Solution

cars_dataset.columns.str.replace(".", "_", regex=False)
Index(['Manufacturer', 'Model', 'Type', 'Min_Price', 'Price', 'Max_Price',
       'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
       'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
       'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
       'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
       'Make'],
      dtype='object')

To replace all the occurrences of . with _, we use the str accessor, which exposes string methods on the columns Index, and call replace on it. We pass regex=False so the dot is treated as a literal character rather than a regular expression, which would otherwise match every character in some pandas versions.
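
The question also asks to rename the column Type to CarType, which the rename function handles; a sketch combining both steps:

cars_dataset = cars_dataset.rename(columns={"Type": "CarType"})
cars_dataset.columns = cars_dataset.columns.str.replace(".", "_", regex=False)
# 'Type' is now 'CarType' and every '.' in the column names is now '_'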

Ex 40: How to check if a dataframe has any missing values?

Q: Check if cars_dataset has any missing values.

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

# There are nan values in the dataset, so true has to be returned

# True

Solution

cars_dataset.isnull().values.any()
True

To check if there is any nan value in the whole dataset, we first use the isnull function, which returns a boolean for each cell of the dataframe: True for a cell with a nan value and False otherwise. We then call any on the underlying values; if there is at least one nan value, True is returned, otherwise False.
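
For a per-column view of the same check, any can be applied column-wise first; a short sketch:

cars_dataset.isnull().any()        # one boolean per column
cars_dataset.isnull().any().any()  # a single boolean for the whole dataframe
# True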

Ex 41: How to count the number of missing values in each column?

Q: Count the number of missing values in each column of df. Which column has the maximum number of missing values?

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

# Count of missing values in each column

# Manufacturer           4
# Model                  1
# Type                   3
# Min.Price              7
# Price                  2
# Max.Price              5
# MPG.city               9
# MPG.highway            2
# AirBags                6
# DriveTrain             7
# Cylinders              5
# EngineSize             2
# Horsepower             7
# RPM                    3
# Rev.per.mile           6
# Man.trans.avail        5
# Fuel.tank.capacity     8
# Passengers             2
# Length                 4
# Wheelbase              1
# Width                  6
# Turn.circle            5
# Rear.seat.room         4
# Luggage.room          19
# Weight                 7
# Origin                 5
# Make                   3
# dtype: int64

# column with the maximum number of missing values

# 'Luggage.room'

Solution

nan_per_column = cars_dataset.isnull().sum()
nan_per_column
Manufacturer           4
Model                  1
Type                   3
Min.Price              7
Price                  2
Max.Price              5
MPG.city               9
MPG.highway            2
AirBags                6
DriveTrain             7
Cylinders              5
EngineSize             2
Horsepower             7
RPM                    3
Rev.per.mile           6
Man.trans.avail        5
Fuel.tank.capacity     8
Passengers             2
Length                 4
Wheelbase              1
Width                  6
Turn.circle            5
Rear.seat.room         4
Luggage.room          19
Weight                 7
Origin                 5
Make                   3
dtype: int64
cars_dataset.columns[nan_per_column.argmax()]
'Luggage.room'

To get the count of nan values in each column, we use the isnull function, which returns True in each cell holding a nan value and False elsewhere. We then apply the sum function to get back a series whose index holds the column names and whose values are the count of nan values in each column.

To know which column has the highest number of nans, we use the argmax function, which returns the position of the highest count, and use that position to index into the columns.
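
Alternatively, idxmax on the counts series returns the column label directly, with no need to index into the columns; a one-line sketch:

nan_per_column.idxmax()
# 'Luggage.room'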

Ex 42: How to replace missing values of multiple numeric columns with the mean?

Q: Replace missing values in Min.Price and Max.Price columns with their respective mean.

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

Pandas_ex42

Solution

1st Method

min_price_mean = round(cars_dataset["Min.Price"].mean(),2)
max_price_mean = round(cars_dataset["Max.Price"].mean(),2)
cars_dataset["Min.Price"].replace(np.nan,min_price_mean,inplace=True)
cars_dataset["Max.Price"].replace(np.nan,max_price_mean,inplace=True)
cars_dataset.head()
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain ... Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
0 Acura Integra Small 12.90 15.9 18.80 25.0 31.0 None Front ... 5.0 177.0 102.0 68.0 37.0 26.5 NaN 2705.0 non-USA Acura Integra
1 NaN Legend Midsize 29.20 33.9 38.70 18.0 25.0 Driver & Passenger Front ... 5.0 195.0 115.0 71.0 38.0 30.0 15.0 3560.0 non-USA Acura Legend
2 Audi 90 Compact 25.90 29.1 32.30 20.0 26.0 Driver only Front ... 5.0 180.0 102.0 67.0 37.0 28.0 14.0 3375.0 non-USA Audi 90
3 Audi 100 Midsize 17.12 37.7 44.60 19.0 26.0 Driver & Passenger NaN ... 6.0 193.0 106.0 NaN 37.0 31.0 17.0 3405.0 non-USA Audi 100
4 BMW 535i Midsize 17.12 30.0 21.46 22.0 30.0 NaN Rear ... 4.0 186.0 109.0 69.0 39.0 27.0 13.0 3640.0 non-USA BMW 535i

5 rows × 27 columns

The first way to solve this problem is to use the replace function. First, we find the mean of the Min.Price and Max.Price columns and round it to two decimal places using round.

After that, we select the columns and use the replace function to replace all the occurrences of the nan value with the means previously found, and finally set the inplace argument to True to apply the changes to the original dataset.

2nd Method

cars_dataset[["Min.Price","Max.Price"]] = cars_dataset[["Min.Price","Max.Price"]].apply(lambda x: x.fillna(round(x.mean(),2)))
cars_dataset.head()
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain ... Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
0 Acura Integra Small 12.90 15.9 18.80 25.0 31.0 None Front ... 5.0 177.0 102.0 68.0 37.0 26.5 NaN 2705.0 non-USA Acura Integra
1 NaN Legend Midsize 29.20 33.9 38.70 18.0 25.0 Driver & Passenger Front ... 5.0 195.0 115.0 71.0 38.0 30.0 15.0 3560.0 non-USA Acura Legend
2 Audi 90 Compact 25.90 29.1 32.30 20.0 26.0 Driver only Front ... 5.0 180.0 102.0 67.0 37.0 28.0 14.0 3375.0 non-USA Audi 90
3 Audi 100 Midsize 17.12 37.7 44.60 19.0 26.0 Driver & Passenger NaN ... 6.0 193.0 106.0 NaN 37.0 31.0 17.0 3405.0 non-USA Audi 100
4 BMW 535i Midsize 17.12 30.0 21.46 22.0 30.0 NaN Rear ... 4.0 186.0 109.0 69.0 39.0 27.0 13.0 3640.0 non-USA BMW 535i

5 rows × 27 columns

The alternative is to use the fillna function. We select the two columns, then use the apply function with a lambda expression that fills each column’s nan values with that column’s rounded mean (x is the column being processed). We assign the result back to the two columns to make the change stick.
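
fillna also accepts a Series of per-column fill values aligned on column names, which removes the need for apply and the lambda entirely; a sketch under the same rounding convention:

price_cols = ["Min.Price", "Max.Price"]
cars_dataset[price_cols] = cars_dataset[price_cols].fillna(cars_dataset[price_cols].mean().round(2))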

Ex 43: How to use the apply function on existing columns with global variables as additional arguments?

Q: In dataframe, use the apply method to replace the missing values in Min.Price with the column’s mean and those in Max.Price with the column’s median.

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

Pandas_ex43

Solution

dict_fun = {"Min.Price":np.nanmean,"Max.Price":np.nanmedian}

cars_dataset[["Min.Price","Max.Price"]].apply(lambda x, dict_fun: x.fillna(dict_fun[x.name](x)),args=(dict_fun,))
Min.Price Max.Price
0 12.900000 18.80
1 29.200000 38.70
2 25.900000 32.30
3 17.118605 44.60
4 17.118605 19.15
... ... ...
88 16.600000 22.70
89 17.600000 22.40
90 22.900000 23.70
91 21.800000 23.50
92 24.800000 28.50

93 rows × 2 columns

We first create a dictionary mapping the Min.Price column to the NumPy function that computes the mean while ignoring nans (np.nanmean) and the Max.Price column to np.nanmedian.

We select the Min.Price and Max.Price columns and apply a lambda expression that replaces each instance of nan in x (the column being processed) with the mean or median of that column, using the fillna function together with the matching function from the dictionary.

Note: Refer to this StackOverflow question to learn more.

Ex 44: How to select a specific column from a dataframe as a dataframe instead of a series?

Q: Get the first column (a) of the_dataframe as a dataframe (rather than as a Series).

the_dataframe = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

Desired output

Pandas_ex44

# Data type

# pandas.core.frame.DataFrame

Solution

1st Method

type(the_dataframe["a"].to_frame())
pandas.core.frame.DataFrame

The first method is straightforward: we get column a as a series and then cast it to a dataframe with the to_frame function.

2nd Method

type(the_dataframe.loc[:,["a"]])
pandas.core.frame.DataFrame

We can get a column directly as a dataframe by using the loc function and passing the column name inside a list.

3rd Method

type(the_dataframe[['a']])
pandas.core.frame.DataFrame

Something similar can be achieved using plain indexing with a list containing the column name.

4th Method

type(the_dataframe.iloc[:, [0]])
pandas.core.frame.DataFrame

The same works with iloc, passing the column’s position in a list.

Note: If you want to understand the difference between loc and iloc, read this page from StackOverflow

Ex 45: How to change the order of columns of a Dataframe?

Q:

  1. In the_dataframe, interchange columns a and c.

  2. Create a generic function to interchange two columns, without hardcoding column names.

  3. Sort the columns in reverse alphabetical order, starting from column e first through column a last.

import pandas as pd
import numpy as np
the_dataframe = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

Desired output

Pandas_ex45

Solution

Q1

the_dataframe
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
the_columns = list(the_dataframe.columns)
the_columns
['a', 'b', 'c', 'd', 'e']
a, c = the_columns.index("a"), the_columns.index("c")
the_columns[a], the_columns[c] = the_columns[c], the_columns[a]
the_dataframe = the_dataframe[the_columns]
the_dataframe
c b a d e
0 2 1 0 3 4
1 7 6 5 8 9
2 12 11 10 13 14
3 17 16 15 18 19

To interchange column a with c, we first get all the column names as a list and store them in the the_columns variable. We then extract the indexes of a and c (0 and 2, respectively) and swap them in the_columns using those indexes.

We finally index the original dataframe with the reordered list, which swaps the names and values of columns a and c.

Q2

def swap_col(col_1,col_2,df):
    all_col = list(df.columns)
    col_1_idx, col_2_idx = all_col.index(col_1), all_col.index(col_2)
    all_col[col_1_idx],all_col[col_2_idx] = all_col[col_2_idx], all_col[col_1_idx]
    return df[all_col]
    
    
print(swap_col("d","b",the_dataframe))
    c   d   a   b   e
0   2   3   0   1   4
1   7   8   5   6   9
2  12  13  10  11  14
3  17  18  15  16  19

This function follows the same steps as Q1. Instead of hardcoding the column names, we use the generic parameters col_1 and col_2 and pass in the original dataframe as df. The rest is the same as Q1.

Q3

the_columns_reversed = sorted(the_dataframe.columns,reverse=True)
the_dataframe[the_columns_reversed]
e d c b a
0 4 3 2 1 0
1 9 8 7 6 5
2 14 13 12 11 10
3 19 18 17 16 15

For this subquestion, we first get the list of columns sorted in reverse order, from e to a, and store it in the the_columns_reversed variable. We then use indexing on the original dataframe to get the columns in reverse alphabetical order.

Ex 46: How to set the number of rows and columns displayed in the output?

Q: Change the pandas display settings so that, on printing, the dataframe df shows a maximum of 10 rows and 10 columns.

the_dataframe = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

Pandas_ex46

Solution

1st Method

pd.options.display.max_columns = 10
pd.options.display.max_rows = 10
the_dataframe
Manufacturer Model Type Min.Price Price ... Rear.seat.room Luggage.room Weight Origin Make
0 Acura Integra Small 12.9 15.9 ... 26.5 NaN 2705.0 non-USA Acura Integra
1 NaN Legend Midsize 29.2 33.9 ... 30.0 15.0 3560.0 non-USA Acura Legend
2 Audi 90 Compact 25.9 29.1 ... 28.0 14.0 3375.0 non-USA Audi 90
3 Audi 100 Midsize NaN 37.7 ... 31.0 17.0 3405.0 non-USA Audi 100
4 BMW 535i Midsize NaN 30.0 ... 27.0 13.0 3640.0 non-USA BMW 535i
... ... ... ... ... ... ... ... ... ... ... ...
88 Volkswagen Eurovan Van 16.6 19.7 ... 34.0 NaN 3960.0 NaN Volkswagen Eurovan
89 Volkswagen Passat Compact 17.6 20.0 ... 31.5 14.0 2985.0 non-USA Volkswagen Passat
90 Volkswagen Corrado Sporty 22.9 23.3 ... 26.0 15.0 2810.0 non-USA Volkswagen Corrado
91 Volvo 240 Compact 21.8 22.7 ... 29.5 14.0 2985.0 non-USA Volvo 240
92 NaN 850 Midsize 24.8 26.7 ... 30.0 15.0 3245.0 non-USA Volvo 850

93 rows × 27 columns

2nd Method

pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
the_dataframe
Manufacturer Model Type Min.Price Price ... Rear.seat.room Luggage.room Weight Origin Make
0 Acura Integra Small 12.9 15.9 ... 26.5 NaN 2705.0 non-USA Acura Integra
1 NaN Legend Midsize 29.2 33.9 ... 30.0 15.0 3560.0 non-USA Acura Legend
2 Audi 90 Compact 25.9 29.1 ... 28.0 14.0 3375.0 non-USA Audi 90
3 Audi 100 Midsize NaN 37.7 ... 31.0 17.0 3405.0 non-USA Audi 100
4 BMW 535i Midsize NaN 30.0 ... 27.0 13.0 3640.0 non-USA BMW 535i
... ... ... ... ... ... ... ... ... ... ... ...
88 Volkswagen Eurovan Van 16.6 19.7 ... 34.0 NaN 3960.0 NaN Volkswagen Eurovan
89 Volkswagen Passat Compact 17.6 20.0 ... 31.5 14.0 2985.0 non-USA Volkswagen Passat
90 Volkswagen Corrado Sporty 22.9 23.3 ... 26.0 15.0 2810.0 non-USA Volkswagen Corrado
91 Volvo 240 Compact 21.8 22.7 ... 29.5 14.0 2985.0 non-USA Volvo 240
92 NaN 850 Midsize 24.8 26.7 ... 30.0 15.0 3245.0 non-USA Volvo 850

93 rows × 27 columns

By changing max_columns and max_rows in pandas’ display options, either through the dot notation or by calling set_option, we print only the first and last five rows/columns. All the rows/columns in between are abstracted away by three dots (...).

Ex 47: How to format or suppress scientific notations in a pandas dataframe?

Q: Suppress scientific notations like e-03 in the_dataframe and print up to 4 numbers after the decimal.

the_dataframe = pd.DataFrame(np.random.random(4)**10, columns=['random'])
the_dataframe
random
0 7.694142e-18
1 2.132125e-02
2 1.673025e-02
3 3.668673e-01

Desired output

# Your result will be different from mine as the numbers are randomly generated

Pandas_ex47

Solution

1st Method

pd.options.display.float_format = '{:.4f}'.format
the_dataframe
random
0 0.0000
1 0.0213
2 0.0167
3 0.3669

2nd Method

pd.set_option("display.float_format", '{:.4f}'.format)
the_dataframe
random
0 0.0000
1 0.0213
2 0.0167
3 0.3669

To suppress the scientific notation, we use display.float_format or the set_option function just like in the previous exercise, this time with '{:.4f}'.format to tell Pandas that we want to display numbers with four decimal places.

Ex 48: How to format all the values in a Dataframe as percentages?

Q: Format the values in column random of the_dataframe as percentages.

the_dataframe = pd.DataFrame(np.random.random(4), columns=['random'])
the_dataframe
random
0 0.5170
1 0.0474
2 0.4878
3 0.9220

Desired output

# Your result will be different from mine as the numbers are randomly generated

Pandas_ex48

Solution

1st Method

pd.set_option("display.float_format", '{:.2%}'.format)
the_dataframe
random
0 51.70%
1 4.74%
2 48.78%
3 92.20%

2nd Method

pd.options.display.float_format = '{:.2%}'.format
the_dataframe
random
0 51.70%
1 4.74%
2 48.78%
3 92.20%

Just as we did in the previous exercise, we use display.float_format or the set_option function but this time to display the values as percentages, we use '{:.2%}'.format.
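
These float_format settings are global and stay in effect for every dataframe printed afterwards, so it is worth knowing how to restore the default; a one-line sketch:

pd.reset_option("display.float_format")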

Ex 49: How to filter every nth row in a Dataframe?

Q: From cars_dataset, filter the Manufacturer, Model and Type for every 20th row starting from 1st (row 0).

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired output

Pandas_ex49

Solution

cars_dataset[["Manufacturer", "Model", "Type"]][::20]
Manufacturer Model Type
0 Acura Integra Small
20 Chrysler LeBaron Compact
40 Honda Prelude Sporty
60 Mercury Cougar Midsize
80 Subaru Loyale Small

We select the 3 columns, then take every 20th row using slice indexing with a step of 20.

Ex 50: How to create a primary key index by combining relevant columns?

Q: In cars_dataset, replace NaNs with ‘missing’ in the columns Manufacturer, Model and Type, create an index as a combination of these three columns, and check if the index is a primary key.

cars_dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])

Desired output

Pandas_ex50

Solution

replaced_nan = cars_dataset[["Manufacturer", "Model", "Type"]].fillna(value="missing")
replaced_nan
Manufacturer Model Type
0 Acura Integra Small
1 missing Legend Midsize
2 Audi 90 Compact
3 Audi 100 Midsize
4 BMW 535i Midsize
... ... ... ...
88 Volkswagen Eurovan Van
89 Volkswagen Passat Compact
90 Volkswagen Corrado Sporty
91 Volvo 240 Compact
92 missing 850 Midsize

93 rows × 3 columns

replaced_nan.set_index(replaced_nan["Manufacturer"]+"_"+replaced_nan["Model"]+"_"+replaced_nan["Type"])
Manufacturer Model Type
Acura_Integra_Small Acura Integra Small
missing_Legend_Midsize missing Legend Midsize
Audi_90_Compact Audi 90 Compact
Audi_100_Midsize Audi 100 Midsize
BMW_535i_Midsize BMW 535i Midsize
... ... ... ...
Volkswagen_Eurovan_Van Volkswagen Eurovan Van
Volkswagen_Passat_Compact Volkswagen Passat Compact
Volkswagen_Corrado_Sporty Volkswagen Corrado Sporty
Volvo_240_Compact Volvo 240 Compact
missing_850_Midsize missing 850 Midsize

93 rows × 3 columns

We first select the 3 columns and use the fillna function to replace all the occurrences of nan with the string missing, then store the result in the replaced_nan variable.

We call the set_index function on that dataframe to give it a new index, where each index value is the concatenation of the values from Manufacturer, Model and Type, joined by underscores.
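
The question also asks to check whether the new index is a primary key, that is, whether every index value is unique; a sketch using the index’s is_unique attribute:

combined = replaced_nan["Manufacturer"] + "_" + replaced_nan["Model"] + "_" + replaced_nan["Type"]
replaced_nan.set_index(combined).index.is_unique
# True means the three columns together uniquely identify every row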

Conclusion

In this part 2 of the Pandas series, we have introduced the dataframe, the main data structure of Pandas. We have discovered how to import data from a CSV file, different ways to manipulate the data, and many more techniques that you will use on a daily basis if you work as a machine learning engineer or data scientist.

Remember that around 60% of the time spent on an end-to-end machine learning project is dedicated to data cleaning and visualization. So Pandas, NumPy and Matplotlib (and Seaborn) are fantastic tools to learn and master. Practice makes perfect.

In the next post, we will explore more advanced Pandas exercises that I am sure you will enjoy. Find the jupyter notebook version of this post on my GitHub profile here.

Thank you again for doing these exercises with me. I hope you have learned one or two things. If you like this post, please subscribe to stay updated with new posts, and if you have a thought or a question, I would love to hear it by commenting below. Cheers, and keep learning!