Pandas Exercises Part 3

Great to see you again here! In this last post of the Pandas series, we will continue exploring advanced DataFrame exercises. Pandas is easier to learn than NumPy, in my opinion. Its documentation is well written, so don't be shy! Read through it whenever you get stuck here.

Let’s get started by importing NumPy and Pandas.

import numpy as np
import pandas as pd 

Ex 51: How to get the row number of the nth largest value in a column?

Q: Find the row position of the 5th largest value of column a of the_dataframe.

from numpy.random import default_rng
rng = default_rng(42)
the_dataframe = pd.DataFrame(rng.choice(30, size=30, replace=False).reshape(10,-1), columns=list('abc'))
the_dataframe
a b c
0 16 11 9
1 25 22 15
2 1 18 8
3 27 12 3
4 19 17 20
5 0 24 28
6 23 13 21
7 29 5 2
8 7 10 26
9 6 14 4

Note: We use default_rng with replace=False so the DataFrame contains no duplicate values, and we seed the generator (default_rng(42)) so the same numbers are generated on every run, even on a different computer.

Desired output

# The row with the 5th largest number is 6

1st Method

row_fifth_largest_num = the_dataframe["a"].sort_values(reversed)[::-1].index[4]
print("The row with the 5th largest number is {}".format(row_fifth_largest_num))
The row with the 5th largest number is 6

We first sort the values in column a and then reverse them with [::-1]. To get the index (the row number) of the 5th largest value, we take index[4] (indexes start from 0).

2nd Method

row_fifth_largest_num = the_dataframe["a"].argsort()[::-1][5]
print("The row with the 5th largest number is {}".format(row_fifth_largest_num))
The row with the 5th largest number is 6

Another way is to use argsort on the underlying values, which returns the positions that would sort them in ascending order. We then reverse those positions into descending order using [::-1] and take the fifth element using [4].
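
As a side note, Pandas has an nlargest function that avoids the manual sort-and-reverse entirely; a small sketch of this alternative (not part of the original solution):

# nlargest(5) returns the 5 largest values sorted in descending order,
# so the last index label is the row of the 5th largest value
the_dataframe["a"].nlargest(5).index[-1]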

Ex 52: How to find the position of the nth largest value greater than a given value?

Q: In the_serie find the position of the 2nd largest value greater than the mean.

np.random.seed(42)
the_serie = pd.Series(np.random.randint(1, 100, 15))
the_serie
0     52
1     93
2     15
3     72
4     61
5     21
6     83
7     87
8     75
9     75
10    88
11    24
12     3
13    22
14    53
dtype: int64

Desired output

# The mean is 55.0 and the row of the second largest number is 3

Solution

the_mean = np.mean(the_serie.values).round()
the_mean
55.0
greater_than_mean_arr = the_serie.where(the_serie > the_mean).dropna().sort_values()
row_second_largest_num = greater_than_mean_arr.index[1]
row_second_largest_num
3
print("The mean is {} and the row of the second largest number is {}".format(the_mean, row_second_largest_num))
The mean is 55.0 and the row of the second largest number is 3

We start by calculating the mean of the values in the series using np.mean and rounding it. Then we use where to keep only the rows whose values are superior to the mean.

We drop the NaN values (the values inferior to the mean) and sort the remaining values in ascending order, to finally get the row label of the second element of the sorted series using .index[1].
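
As a more concise sketch of the same filtering, boolean indexing replaces the where + dropna combination in one step (same the_serie and the_mean assumed):

# keep only the values above the mean, sort them ascending,
# and take the row label of the second element
the_serie[the_serie > the_mean].sort_values().index[1]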

Ex 53: How to get the last n rows of a DataFrame with row sum > 100?

Q: Get the last two rows of the_dataframe whose row sum is superior to 100.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
the_dataframe
0 1 2 3
0 16 29 38 24
1 20 17 38 30
2 16 35 28 32
3 20 20 33 30
4 13 17 33 12
5 31 30 11 33
6 21 39 15 11
7 37 30 10 21
8 35 31 38 21
9 34 26 36 36
10 19 37 37 25
11 24 39 39 24
12 39 28 21 32
13 29 34 12 14
14 28 16 30 18

Desired output

Pandas_ex53

Solution

rows_sum = the_dataframe.sum(axis=1).sort_values()
rows_sum
4      75
6      86
13     89
14     92
7      98
3     103
1     105
5     105
0     107
2     111
10    118
12    120
8     125
11    126
9     132
dtype: int64
rows_sum_greater_100 = rows_sum.where(rows_sum > 100).dropna()
rows_sum_greater_100
3     103.0
1     105.0
5     105.0
0     107.0
2     111.0
10    118.0
12    120.0
8     125.0
11    126.0
9     132.0
dtype: float64
the_dataframe.iloc[rows_sum_greater_100[::-1][:2].index]
0 1 2 3
9 34 26 36 36
11 24 39 39 24

We calculate a series of the row-wise sums and sort it. We then use the where function to keep only the sums greater than 100 and drop the rest using the dropna function.

Finally, we reverse that series using [::-1], keep the first two entries (the two largest sums) using indexing from 0 to 2 (exclusive), and pass their indexes to iloc to pull the corresponding rows out of the original dataframe.
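
The same logic can be sketched more compactly with nlargest instead of sorting and reversing (same the_dataframe assumed):

# keep the row sums above 100, take the two largest, and look the rows up by label
sums = the_dataframe.sum(axis=1)
the_dataframe.loc[sums[sums > 100].nlargest(2).index]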

Ex 54: How to find and cap outliers from a series or DataFrame column?

Q: Replace all values of the_serie lower than the 5th percentile and greater than the 95th percentile with the 5th and the 95th percentile values respectively.

the_serie = pd.Series(np.logspace(-2, 2, 30))

Desired output

# 0      0.016049
# 1      0.016049
# 2      0.018874
# 3      0.025929
# 4      0.035622
# 5      0.048939
# 6      0.067234
# 7      0.092367
# 8      0.126896
# 9      0.174333
# 10     0.239503
# 11     0.329034
# 12     0.452035
# 13     0.621017
# 14     0.853168
# 15     1.172102
# 16     1.610262
# 17     2.212216
# 18     3.039195
# 19     4.175319
# 20     5.736153
# 21     7.880463
# 22    10.826367
# 23    14.873521
# 24    20.433597
# 25    28.072162
# 26    38.566204
# 27    52.983169
# 28    63.876672
# 29    63.876672
# dtype: float64

Solution

low_perc = the_serie.quantile(q=0.05)
low_perc
0.016049294076965887
high_perc = the_serie.quantile(q=0.95)
high_perc
63.876672220183934
the_serie.where(the_serie > low_perc, other=low_perc, inplace=True) 
the_serie.where(the_serie < high_perc, other=high_perc, inplace=True) 
the_serie
0      0.016049
1      0.016049
2      0.018874
3      0.025929
4      0.035622
5      0.048939
6      0.067234
7      0.092367
8      0.126896
9      0.174333
10     0.239503
11     0.329034
12     0.452035
13     0.621017
14     0.853168
15     1.172102
16     1.610262
17     2.212216
18     3.039195
19     4.175319
20     5.736153
21     7.880463
22    10.826367
23    14.873521
24    20.433597
25    28.072162
26    38.566204
27    52.983169
28    63.876672
29    63.876672
dtype: float64

We first calculate the 5th and the 95th percentiles using the quantile function, passing in q=0.05 and q=0.95 respectively. Then we call the where function on the original series with the condition the_serie > low_perc.

This condition keeps the elements superior to the 5th percentile and replaces the others (the elements inferior to the 5th percentile) with other, which we set to the 5th percentile value. In effect, every value lower than the 5th percentile is replaced by the 5th percentile value. Finally, we set inplace to True so the series is modified directly.

We do the same for the 95th percentile, except that this time the condition keeps the elements inferior to the 95th percentile and other is set to the 95th percentile value.
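
Note that Pandas also offers a clip function that does both cappings in a single call; a minimal sketch, assuming the same low_perc and high_perc:

# values below low_perc become low_perc, values above high_perc become high_perc
capped = the_serie.clip(lower=low_perc, upper=high_perc)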

Ex 55: How to reshape a DataFrame to the largest possible square after removing the negative values?

Q: Reshape the_dataframe to the largest possible square with negative values removed. Drop the smallest values if need be. The order of the positive numbers in the result should remain the same as the original.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))
the_dataframe
0 1 2 3 4 5 6 7 8 9
0 31 -6 40 0 3 -18 1 32 -19 9
1 17 -19 43 39 0 12 37 1 28 38
2 21 39 -6 41 41 26 41 30 34 43
3 -18 30 -14 0 18 -3 -17 39 -7 -12
4 32 -19 39 23 -13 26 14 15 29 -17
5 -19 -15 33 -17 33 42 -3 23 13 41
6 -7 27 -6 41 19 32 3 5 39 20
7 8 -6 24 44 -12 -20 -13 42 -10 -13
8 14 14 12 -16 20 7 -14 -9 13 12
9 27 2 41 16 23 14 44 26 -18 -20

Desired output

# array([[31., 40.,  3., 32.,  9., 17., 43., 39.],
#        [12., 37., 28., 38., 21., 39., 41., 41.],
#        [26., 41., 30., 34., 43., 30., 18., 39.],
#        [32., 39., 23., 26., 14., 15., 29., 33.],
#        [33., 42., 23., 13., 41., 27., 41., 19.],
#        [32.,  3.,  5., 39., 20.,  8., 24., 44.],
#        [42., 14., 14., 12., 20.,  7., 13., 12.],
#        [27.,  2., 41., 16., 23., 14., 44., 26.]])

Solution

Step 1: Remove the negatives

the_arr = the_dataframe[the_dataframe > 0].values.flatten()
the_arr
array([31., nan, 40., nan,  3., nan,  1., 32., nan,  9., 17., nan, 43.,
       39., nan, 12., 37.,  1., 28., 38., 21., 39., nan, 41., 41., 26.,
       41., 30., 34., 43., nan, 30., nan, nan, 18., nan, nan, 39., nan,
       nan, 32., nan, 39., 23., nan, 26., 14., 15., 29., nan, nan, nan,
       33., nan, 33., 42., nan, 23., 13., 41., nan, 27., nan, 41., 19.,
       32.,  3.,  5., 39., 20.,  8., nan, 24., 44., nan, nan, nan, 42.,
       nan, nan, 14., 14., 12., nan, 20.,  7., nan, nan, 13., 12., 27.,
        2., 41., 16., 23., 14., 44., 26., nan, nan])

We use boolean indexing with [] to keep only the positive elements of the_dataframe (everything else becomes nan) and reshape the result into a 1D array using the flatten() function, storing it in the_arr.

pos_arr = the_arr[~np.isnan(the_arr)]
np.isnan(the_arr)
array([False,  True, False,  True, False,  True, False, False,  True,
       False, False,  True, False, False,  True, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False,  True,  True, False,  True,
        True, False,  True,  True, False,  True, False, False,  True,
       False, False, False, False,  True,  True,  True, False,  True,
       False, False,  True, False, False, False,  True, False,  True,
       False, False, False, False, False, False, False, False,  True,
       False, False,  True,  True,  True, False,  True,  True, False,
       False, False,  True, False, False,  True,  True, False, False,
       False, False, False, False, False, False, False, False,  True,
        True])
pos_arr
array([31., 40.,  3.,  1., 32.,  9., 17., 43., 39., 12., 37.,  1., 28.,
       38., 21., 39., 41., 41., 26., 41., 30., 34., 43., 30., 18., 39.,
       32., 39., 23., 26., 14., 15., 29., 33., 33., 42., 23., 13., 41.,
       27., 41., 19., 32.,  3.,  5., 39., 20.,  8., 24., 44., 42., 14.,
       14., 12., 20.,  7., 13., 12., 27.,  2., 41., 16., 23., 14., 44.,
       26.])

To drop the nan values in the array, we use the isnan function, which returns a boolean array where True marks a nan value and False a non-nan value. We then use the ~ sign to invert the booleans so that True marks the non-nan values, and index with the result to get a new array with no nan values.

Step 2: Find side-length of largest possible square

len(pos_arr)
66
n = int(np.floor(np.sqrt(pos_arr.shape[0])))
n
8

To find the largest possible square, we first get the length of the array using the shape function (we could also use the len() function). We then take the square root of the number of elements, remove the decimals using the floor function, and cast the result to an integer.

Step 3: Take top n^2 items without changing positions

top_indexes = np.argsort(pos_arr)[::-1]
top_indexes
array([64, 49, 22,  7, 50, 35, 16, 40, 19, 17, 60, 38,  1, 15, 27,  8, 25,
       45, 13, 10, 21, 33, 34, 42, 26,  4,  0, 23, 20, 32, 12, 39, 58, 65,
       29, 18, 48, 62, 36, 28, 14, 54, 46, 41, 24,  6, 61, 31, 63, 51, 52,
       30, 37, 56,  9, 57, 53,  5, 47, 55, 44, 43,  2, 59, 11,  3])
np.take(pos_arr, sorted(top_indexes[:n**2])).reshape(n,-1)
array([[31., 40.,  3., 32.,  9., 17., 43., 39.],
       [12., 37., 28., 38., 21., 39., 41., 41.],
       [26., 41., 30., 34., 43., 30., 18., 39.],
       [32., 39., 23., 26., 14., 15., 29., 33.],
       [33., 42., 23., 13., 41., 27., 41., 19.],
       [32.,  3.,  5., 39., 20.,  8., 24., 44.],
       [42., 14., 14., 12., 20.,  7., 13., 12.],
       [27.,  2., 41., 16., 23., 14., 44., 26.]])

We sort the element positions by value using argsort, reverse the order into descending order using slicing [::-1], and store the result in top_indexes.

Finally, we use the take NumPy function, passing in pos_arr and, as indices, the sorted first n**2 entries of top_indexes (n raised to the power of 2). Then we reshape the array using (n, -1) to let NumPy figure out the second dimension from the value of n.
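
For reference, here is a condensed sketch of the whole pipeline in plain NumPy (my own recap, under the same assumptions as the steps above):

# drop the negatives, find the side length, keep the n^2 largest values in order
vals = the_dataframe.values.flatten()
pos = vals[vals > 0].astype(float)             # positives only, original order kept
n = int(np.floor(np.sqrt(len(pos))))           # side length of the largest square
keep = np.sort(np.argsort(pos)[::-1][:n**2])   # positions of the n^2 largest, re-sorted to keep order
result = pos[keep].reshape(n, n)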

Ex 56: How to swap two rows of a DataFrame?

Q: Swap the first two rows (index 0 and 1) of the_dataframe

the_dataframe = pd.DataFrame(np.arange(25).reshape(5, -1))
the_dataframe
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

Desired output

Pandas_ex56

Solution

def row_swap(the_dataframe,row_index_1,row_index_2):
    row_1, row_2 = the_dataframe.iloc[row_index_1,:].copy(), the_dataframe.iloc[row_index_2,:].copy()
    the_dataframe.iloc[row_index_1,:], the_dataframe.iloc[row_index_2,:] = row_2, row_1
    return the_dataframe
row_swap(the_dataframe,0,1)
0 1 2 3 4
0 5 6 7 8 9
1 0 1 2 3 4
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

We create a function that performs the swap; it takes in the dataframe and the two indexes of the rows that need to be swapped. We copy the two rows using iloc and store them in row_1 and row_2.

To do the swap, we assign the copies back in the opposite order: row_2 is assigned to the position row_index_1 and row_1 to the position row_index_2. Finally, we return the_dataframe.
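
A shorter sketch of the same swap (an alternative, not the original solution) assigns the reordered values directly; .values is needed so Pandas does not realign the rows on their index:

# without .values, Pandas would align rows 0 and 1 back to their own labels
the_dataframe.iloc[[0, 1]] = the_dataframe.iloc[[1, 0]].values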

Ex 57: How to reverse the rows of a DataFrame?

Q: Reverse all the rows of a DataFrame.

the_dataframe = pd.DataFrame(np.arange(25).reshape(5, -1))
the_dataframe
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

Desired output

Pandas_ex57

Solution

1st method

the_dataframe.iloc[::-1]
0 1 2 3 4
4 20 21 22 23 24
3 15 16 17 18 19
2 10 11 12 13 14
1 5 6 7 8 9
0 0 1 2 3 4

2nd method

the_dataframe[::-1]
0 1 2 3 4
4 20 21 22 23 24
3 15 16 17 18 19
2 10 11 12 13 14
1 5 6 7 8 9
0 0 1 2 3 4

To reverse the dataframe row-wise, we use indexing with [::-1]. The second method is the short form of the first, where we used iloc.
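
If you also want the row labels renumbered after the reversal, a small optional sketch with reset_index does it:

# drop=True discards the old reversed index instead of keeping it as a column
the_dataframe.iloc[::-1].reset_index(drop=True)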

Note: In Exercise 75, we will see how to achieve the same result column-wise.

Ex 58: How to create one-hot encodings of a categorical variable (dummy variables)?

Q: Get one-hot encodings for column a in the dataframe the_dataframe and append it as columns.

the_dataframe = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))
the_dataframe
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24

Desired output

Pandas_ex58

Solution

one_hot = pd.get_dummies(the_dataframe["a"])
one_hot
0 5 10 15 20
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
pd.concat([one_hot, the_dataframe[["b","c","d","e"]]],axis=1)
0 5 10 15 20 b c d e
0 1 0 0 0 0 1 2 3 4
1 0 1 0 0 0 6 7 8 9
2 0 0 1 0 0 11 12 13 14
3 0 0 0 1 0 16 17 18 19
4 0 0 0 0 1 21 22 23 24

We get the one-hot encoding of column a using the get_dummies function, passing in the column a.

To append the newly created one_hot, we use concat and pass in one_hot together with the remaining columns of the dataframe (every column except a). Finally, we set the axis to 1 since we want to concatenate column-wise.
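
Both steps can also be sketched as a single line, assuming we simply drop column a instead of listing the remaining columns:

# concatenate the dummies with every column except a
pd.concat([pd.get_dummies(the_dataframe["a"]), the_dataframe.drop(columns="a")], axis=1)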

Ex 59: Which column contains the highest number of row-wise maximum values?

Q: Obtain the name of the column with the highest number of row-wise maximum values in the_dataframe.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))

Desired output

# The column with the highest number of row-wise maximums is 0

Solution

row_high_num = the_dataframe.idxmax(axis=1).value_counts().idxmax()
print("The column with the highest number of row-wise maximums is {}".format(row_high_num))
The column with the highest number of row-wise maximums is 0

We use idxmax with the axis argument set to 1 to get, for every row, the name of the column that holds that row's maximum value. We then count how many times each column appears using value_counts, and a final idxmax returns the column with the highest count of row-wise maximums.

Ex 60: How to know the maximum possible correlation value of each column against other columns?

Q: Compute the maximum possible absolute correlation value of each column against other columns in the_dataframe.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh'))

Desired output

# array([0.81941445, 0.84466639, 0.44944264, 0.44872809, 0.81941445,
#        0.80618428, 0.44944264, 0.5434561 , 0.84466639, 0.80618428])

Solution

abs_corr = np.abs(the_dataframe.corr())
abs_corr
p q r s t u v w x y
p 1.000000 0.019359 0.088207 0.087366 0.819414 0.736955 0.070727 0.338139 0.163112 0.665627
q 0.019359 1.000000 0.280799 0.121217 0.172215 0.262234 0.304781 0.042748 0.844666 0.243317
r 0.088207 0.280799 1.000000 0.184988 0.223515 0.017763 0.449443 0.127091 0.228455 0.018350
s 0.087366 0.121217 0.184988 1.000000 0.210879 0.096695 0.422290 0.306141 0.100174 0.448728
t 0.819414 0.172215 0.223515 0.210879 1.000000 0.576720 0.334690 0.543456 0.047136 0.273478
u 0.736955 0.262234 0.017763 0.096695 0.576720 1.000000 0.137836 0.352145 0.363597 0.806184
v 0.070727 0.304781 0.449443 0.422290 0.334690 0.137836 1.000000 0.158152 0.188482 0.033227
w 0.338139 0.042748 0.127091 0.306141 0.543456 0.352145 0.158152 1.000000 0.325939 0.071704
x 0.163112 0.844666 0.228455 0.100174 0.047136 0.363597 0.188482 0.325939 1.000000 0.338705
y 0.665627 0.243317 0.018350 0.448728 0.273478 0.806184 0.033227 0.071704 0.338705 1.000000
max_abs_corr = abs_corr.apply(lambda x: sorted(x)[-2]).values
max_abs_corr
array([0.81941445, 0.84466639, 0.44944264, 0.44872809, 0.81941445,
       0.80618428, 0.44944264, 0.5434561 , 0.84466639, 0.80618428])

We first calculate the absolute correlation matrix of the whole dataframe: we call the corr() function and pass the result to the NumPy abs function to get the absolute (non-negative) values.

Now that we have the absolute correlations, we use a lambda expression with the apply function to find the highest correlation value in each column (the matrix is symmetric, so rows and columns are equivalent here).

The lambda sorts each column and takes the second-to-last element using indexing [-2]. The reason we take the second-highest value instead of the highest is that the highest is always 1 (the correlation of a column with itself). We end up with an array of the highest absolute correlation of each column against the others.
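
An equivalent NumPy sketch zeroes out the diagonal instead of sorting (same the_dataframe assumed):

# remove the self-correlations of 1.0, then take each column's maximum
corr_vals = np.abs(the_dataframe.corr()).values
np.fill_diagonal(corr_vals, 0)
max_abs_corr = corr_vals.max(axis=0)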

Ex 61: How to create a column containing the minimum by the maximum of each row?

Q: Compute the minimum-by-maximum for every row of the_dataframe.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
the_dataframe
0 1 2 3 4 5 6 7 8 9
0 52 93 15 72 61 21 83 87 75 75
1 88 24 3 22 53 2 88 30 38 2
2 64 60 21 33 76 58 22 89 49 91
3 59 42 92 60 80 15 62 62 47 62
4 51 55 64 3 51 7 21 73 39 18
5 4 89 60 14 9 90 53 2 84 92
6 60 71 44 8 47 35 78 81 36 50
7 4 2 6 54 4 54 93 63 18 90

Desired output

# 0    0.161290
# 1    0.022727
# 2    0.230769
# 3    0.163043
# 4    0.041096
# 5    0.021739
# 6    0.098765
# 7    0.021505
# dtype: float64

Solution

1st Method

the_min = the_dataframe.min(axis=1)
the_max = the_dataframe.max(axis=1)
min_by_max = the_min/the_max
min_by_max
0    0.161290
1    0.022727
2    0.230769
3    0.163043
4    0.041096
5    0.021739
6    0.098765
7    0.021505
dtype: float64

The easiest way to solve this problem is to find the minimum value of each row using the min function with the axis set to 1. We do the same for the maximum, and finally divide the row minimums by the row maximums.

2nd Method

the_dataframe.apply(lambda x: np.min(x)/np.max(x), axis=1)
0    0.161290
1    0.022727
2    0.230769
3    0.163043
4    0.041096
5    0.021739
6    0.098765
7    0.021505
dtype: float64

The previous method takes three lines of code that we can write in one. We use a lambda expression to divide the minimum of x by its maximum, and set axis to 1 so that x represents a row.

Ex 62: How to create a column that contains the penultimate value in each row?

Q: Create a new column penultimate which has the second-largest value of each row of the_dataframe.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

Desired output

Pandas_ex62

Solution

the_dataframe['penultimate'] = the_dataframe.apply(lambda x: x.sort_values().unique()[-2], axis=1)

As previously seen, to solve this type of challenge we use a lambda expression with the apply function. We set axis to 1 since we are calculating the values row-wise, and in the lambda expression we sort x, drop the duplicates with the unique function, and return the second-largest value using indexing [-2].

Ex 63: How to normalize all columns in a dataframe?

Q1: Normalize all columns of the_dataframe by subtracting the column mean and dividing by the standard deviation.

Q2: Rescale all column values of the_dataframe such that the minimum value in each column is 0 and the maximum is 1.

Note: Don’t use external packages like sklearn.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

Desired output

Q1

Pandas_ex63_q1

Q2

Pandas_ex63_q2

Solution

Q1

the_dataframe.apply(lambda x: (x - x.mean()) / x.std(),axis=0)
0 1 2 3 4 5 6 7 8 9
0 0.144954 1.233620 -0.722109 1.495848 0.478557 -0.471883 0.718777 0.860589 1.241169 0.434515
1 1.372799 -0.977283 -1.096825 -0.434278 0.192317 -1.101061 0.894088 -1.017060 -0.475588 -1.680126
2 0.554235 0.176231 -0.534751 -0.009651 1.015256 0.753357 -1.420023 0.926472 0.034799 0.897999
3 0.383701 -0.400526 1.682318 1.032617 1.158376 -0.670571 -0.017531 0.037059 -0.057999 0.057935
4 0.110847 0.016021 0.807981 -1.167726 0.120757 -0.935488 -1.455085 0.399412 -0.429189 -1.216643
5 -1.492172 1.105451 0.683076 -0.743098 -1.382001 1.813025 -0.333092 -1.939414 1.658759 0.926966
6 0.417808 0.528694 0.183455 -0.974714 -0.022362 -0.008279 0.543466 0.662942 -0.568386 -0.289677
7 -1.492172 -1.682209 -1.003146 0.801002 -1.560900 0.620899 1.069400 0.070000 -1.403565 0.869031

To solve this exercise, we need to know the normalization (standardization) formula:

z = (x - mean(x)) / std(x)

So now we can proceed with the apply function and a lambda expression. We set the axis to 0 since we are normalizing column-wise, and in the lambda expression we apply the formula above to each column x.

Q2

the_dataframe.apply(lambda x: (x.max() - x)/(x.max()-x.min()),axis=0)
0 1 2 3 4 5 6 7 8 9
0 0.428571 0.000000 0.865169 0.000000 0.250000 0.784091 0.138889 0.022989 0.136364 0.188889
1 0.000000 0.758242 1.000000 0.724638 0.355263 1.000000 0.069444 0.678161 0.696970 1.000000
2 0.285714 0.362637 0.797753 0.565217 0.052632 0.363636 0.986111 0.000000 0.530303 0.011111
3 0.345238 0.560440 0.000000 0.173913 0.000000 0.852273 0.430556 0.310345 0.560606 0.333333
4 0.440476 0.417582 0.314607 1.000000 0.381579 0.943182 1.000000 0.183908 0.681818 0.822222
5 1.000000 0.043956 0.359551 0.840580 0.934211 0.000000 0.555556 1.000000 0.000000 0.000000
6 0.333333 0.241758 0.539326 0.927536 0.434211 0.625000 0.208333 0.091954 0.727273 0.466667
7 1.000000 1.000000 0.966292 0.260870 1.000000 0.409091 0.000000 0.298851 1.000000 0.022222

To rescale the values we use the formula (max(x) - x) / (max(x) - min(x)) and proceed exactly as in Q1. Note that, as the desired output shows, this variant maps each column's maximum to 0 and its minimum to 1; if you want the minimum at 0 and the maximum at 1 instead, use (x - min(x)) / (max(x) - min(x)).

Ex 64: How to compute the correlation of each row with the succeeding row?

Q: Compute the correlation of each row with its succeeding row, rounding the result to 2 decimal places.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

Desired output

# [0.31, -0.14, -0.15, 0.47, -0.32, -0.07, 0.12]

Solution

[the_dataframe.iloc[i].corr(the_dataframe.iloc[i+1]).round(2) for i in range(the_dataframe.shape[0]-1)]
[0.31, -0.14, -0.15, 0.47, -0.32, -0.07, 0.12]

We loop over the range of the number of rows in the dataframe, obtained with the shape function, excluding the last row because we compare pairs of consecutive rows. On each iteration we select row i using iloc, call its corr function, and pass in row i+1. Finally, we round each result to two decimal places and collect everything in a list comprehension.
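
A vectorized sketch of the same computation uses corrwith on two row-shifted views of the dataframe (my own alternative, same the_dataframe assumed):

# align rows 0..6 with rows 1..7, then correlate them pairwise row-wise
first = the_dataframe.iloc[:-1].reset_index(drop=True)
second = the_dataframe.iloc[1:].reset_index(drop=True)
list(first.corrwith(second, axis=1).round(2))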

Ex 65: How to replace both the diagonals of dataframe with 0?

Q: Replace both values in both diagonals of the_dataframe with 0.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

Desired output

Pandas_ex65

Solution

for i in range(the_dataframe.shape[0]):
    the_dataframe.iat[i,i] = 0
    the_dataframe.iat[the_dataframe.shape[0]-i-1,i] = 0
the_dataframe
0 1 2 3 4 5 6 7 8 9
0 0 93 15 72 61 21 83 87 75 0
1 88 0 3 22 53 2 88 30 0 2
2 64 60 0 33 76 58 22 0 49 91
3 59 42 92 0 80 15 0 62 47 62
4 51 55 64 3 0 0 21 73 39 18
5 4 89 60 14 0 0 53 2 84 92
6 60 71 44 0 47 35 0 81 36 50
7 4 2 0 54 4 54 93 0 18 90
8 44 0 74 62 14 95 48 15 0 78
9 0 62 40 85 80 82 53 24 26 0

We are going to fill both diagonals (left-to-right and right-to-left) with 0. NumPy has a fill_diagonal function that replaces the values on the left-to-right diagonal, but on its own it does not touch the right-to-left diagonal, so here we write the loop ourselves (a fill_diagonal-based workaround is sketched below).

To solve this challenge, we loop through the rows of the dataframe and, on each iteration, replace two elements with 0. We use the iat function, which takes a row position and a column position. For the left-to-right diagonal, we use i for both positions. For the right-to-left diagonal, the row position is the total number of rows minus i minus 1 (minus 1 because indexes start from 0) and the column position is i.
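
For completeness, here is the fill_diagonal workaround sketched in NumPy, assuming a single-dtype dataframe so that .values is a view on the underlying array:

# fill the main diagonal, then fill the main diagonal of a left-right
# flipped view, which is the anti-diagonal of the original array
arr = the_dataframe.values
np.fill_diagonal(arr, 0)
np.fill_diagonal(np.fliplr(arr), 0)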

Ex 66: How to get the particular group of a groupby dataframe by key?

Q: This is a question related to the understanding of grouped dataframe. From df_grouped, get the group belonging to apple as a dataframe.

np.random.seed(42)
the_dataframe = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
                   'col2': np.random.rand(9),
                   'col3': np.random.randint(0, 15, 9)})

df_grouped = the_dataframe.groupby(['col1'])
the_dataframe
col1 col2 col3
0 apple 0.374540 7
1 banana 0.950714 2
2 orange 0.731994 5
3 apple 0.598658 4
4 banana 0.156019 1
5 orange 0.155995 7
6 apple 0.058084 11
7 banana 0.866176 13
8 orange 0.601115 5
df_grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd121fc30d0>

Desired output

#     col1      col2  col3
# 0  apple  0.374540     7
# 3  apple  0.598658     4
# 6  apple  0.058084    11

Solution

1st Method

df_grouped.get_group("apple")
col1 col2 col3
0 apple 0.374540 7
3 apple 0.598658 4
6 apple 0.058084 11

To get the group belonging to apple we call get_group on the df_grouped.

2nd Method

for i, grp_val in df_grouped:
    if i == "apple":
        print(grp_val)
    col1      col2  col3
0  apple  0.374540     7
3  apple  0.598658     4
6  apple  0.058084    11

Alternatively, we can loop through the groups in df_grouped, where i is the group key and grp_val the corresponding rows, and use an if statement to print the apple group.

Ex 67: How to get the nth largest value of a column when grouped by another column?

Q: In the_dataframe, find the second largest value of taste for banana.

np.random.seed(42)
the_dataframe = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

Desired output

# 0.8661761457749352

Solution

the_dataframe
fruit taste price
0 apple 0.374540 7
1 banana 0.950714 2
2 orange 0.731994 5
3 apple 0.598658 4
4 banana 0.156019 1
5 orange 0.155995 7
6 apple 0.058084 11
7 banana 0.866176 13
8 orange 0.601115 5
df_grouped = the_dataframe.groupby(by="fruit")
sorted(df_grouped.get_group("banana")["taste"])[-2]
0.8661761457749352

Just like in the previous exercise, we first group the dataframe by the values in the fruit column and store the result in df_grouped. Then we get the taste column of the banana group using the get_group function and sort it. Finally, to get the second-largest element, we use indexing [-2].
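
A sketch with nlargest gives the same value and reads slightly more directly:

# the two largest taste values of the banana group, then the second of them
df_grouped.get_group("banana")["taste"].nlargest(2).iloc[-1]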

Ex 68: How to compute grouped mean on pandas DataFrame and keep the grouped column as another column (not index)?

Q: In the_dataframe, Compute the mean price of every fruit, while keeping the fruit as another column instead of an index.

np.random.seed(42)
the_dataframe = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

Desired output

Pandas_ex69

Solution

1st Method

the_dataframe.groupby(by="fruit").mean()["price"].reset_index()
fruit price
0 apple 7.333333
1 banana 5.333333
2 orange 5.666667

The most straightforward way to go about this exercise is to group the dataframe by fruit, take the mean of the price column, and reset the index using reset_index, which turns fruit from the index back into a regular column with a default ascending numerical index.

2nd Method

the_dataframe.groupby(by="fruit",as_index=False)["price"].mean()
fruit price
0 apple 7.333333
1 banana 5.333333
2 orange 5.666667

Alternatively, we can reset the index using the as_index parameter from the groupby function.

Ex 69: How to join two DataFrames by 2 columns so that they have only the common rows?

Q: Join dataframes the_dataframe_1 and the_dataframe_2 by fruit-pazham and weight-kilo.

the_dataframe_1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

the_dataframe_2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

Desired output

Pandas_ex69

Solution

the_dataframe_1
fruit weight price
0 apple high 1
1 banana medium 11
2 orange low 4
3 apple high 0
4 banana medium 11
5 orange low 9
6 apple high 5
7 banana medium 12
8 orange low 11
the_dataframe_2
pazham kilo price
0 apple high 8
1 orange low 0
2 pine high 10
3 apple low 10
4 orange high 14
5 pine low 9
pd.merge(the_dataframe_1, the_dataframe_2, how="inner", left_on=["fruit","weight"], right_on=["pazham","kilo"], suffixes=["_left","_right"])
fruit weight price_left pazham kilo price_right
0 apple high 1 apple high 8
1 apple high 0 apple high 8
2 apple high 5 apple high 8
3 orange low 4 orange low 0
4 orange low 9 orange low 0
5 orange low 11 orange low 0

We use the merge function to combine the two dataframes and set the how parameter to inner, meaning we keep only the rows where the values of fruit and weight on the left match pazham and kilo on the right. Finally, the suffixes "_left" and "_right" are appended to the overlapping price columns.

Ex 70: How to remove rows from a DataFrame that are present in another DataFrame?

Q: From the_dataframe_1, remove the rows present in the_dataframe_2. All three columns values must be the same for the row to be drop.

np.random.seed(42)
the_dataframe_1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

the_dataframe_2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

Desired output

Pandas_ex70

Solution

the_dataframe_1
fruit weight price
0 apple high 6
1 banana medium 3
2 orange low 12
3 apple high 14
4 banana medium 10
5 orange low 7
6 apple high 12
7 banana medium 4
8 orange low 6
the_dataframe_2
pazham kilo price
0 apple high 9
1 orange low 2
2 pine high 6
3 apple low 10
4 orange high 10
5 pine low 7
the_dataframe_1[~the_dataframe_1.isin(the_dataframe_2).all(axis=1)]
fruit weight price
0 apple high 6
1 banana medium 3
2 orange low 12
3 apple high 14
4 banana medium 10
5 orange low 7
6 apple high 12
7 banana medium 4
8 orange low 6

We first mark the elements of the_dataframe_1 that are present in the_dataframe_2 using the isin function. It returns a dataframe of boolean values where True represents a value shared between the_dataframe_1 and the_dataframe_2 and False a different value. Note that isin aligns on column labels, so only the shared price column can ever match here; fruit and weight are always False against pazham and kilo. We use all with the axis set to 1 as an AND operator across each row. For example, in row 4 we'll have "False" AND "False" AND "True" = "False".

Finally, we use ~ to invert all the boolean values ("False" becomes "True" and "True" becomes "False") and index into the_dataframe_1. We find that we keep all the rows of the_dataframe_1, meaning no row is identical between the_dataframe_1 and the_dataframe_2.
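
If you need a comparison that survives the differing column names, a merge-based sketch (my own workaround, assuming pazham maps to fruit and kilo to weight, and that the right side has no duplicate matching rows) flags the rows of the_dataframe_1 that have a full match:

# rename so the columns align, then keep only the rows without a full match
renamed = the_dataframe_2.rename(columns={"pazham": "fruit", "kilo": "weight"})
merged = the_dataframe_1.merge(renamed, how="left", indicator=True)
the_dataframe_1[merged["_merge"] == "left_only"]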

Ex 71: How to get the positions where values of two columns match?

Q: Get the positions where values of two columns match

np.random.seed(42)
the_dataframe = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
                    'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})
the_dataframe
fruit1 fruit2
0 banana banana
1 apple banana
2 banana apple
3 banana banana
4 apple orange
5 apple apple
6 banana orange
7 orange orange
8 banana orange
9 banana orange

Desired output

# [0, 3, 5, 7]

Solution

1st Method

the_dataframe.where(the_dataframe["fruit1"] == the_dataframe["fruit2"])
fruit1 fruit2
0 banana banana
1 NaN NaN
2 NaN NaN
3 banana banana
4 NaN NaN
5 apple apple
6 NaN NaN
7 orange orange
8 NaN NaN
9 NaN NaN
list(the_dataframe.where(the_dataframe["fruit1"] == the_dataframe["fruit2"]).dropna().index)
[0, 3, 5, 7]

We first call the where function on the_dataframe with the condition that the columns fruit1 and fruit2 contain the same fruit. We get back a dataframe that keeps the rows where the values match and holds NaN where they differ. We drop the NaN rows using the dropna function, extract the indexes, and place them into a list, et voila!

2nd Method

list(np.where(the_dataframe["fruit1"] == the_dataframe["fruit2"])[0])
[0, 3, 5, 7]
np.where(the_dataframe["fruit1"] == the_dataframe["fruit2"])
(array([0, 3, 5, 7]),)

Alternatively, we can use NumPy's where function, which returns a tuple whose first element is an array of the indexes where the condition is satisfied. We extract that array and cast it into a list. I prefer this second method as it is more concise.

Ex 72: How to create lags and leads of a column in a DataFrame?

Q: Create two new columns in the_dataframe: lag1, which shifts column a down by 1 row, and lead1, which shifts column a up by 1 row.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

Desired output

Pandas_ex72

Solution

the_dataframe
a b c d
0 52 93 15 72
1 61 21 83 87
2 75 75 88 24
3 3 22 53 2
4 88 30 38 2
the_dataframe["lag1"] = the_dataframe["a"].shift(periods=1)
the_dataframe["lead1"] = the_dataframe["a"].shift(periods=-1)
the_dataframe
a b c d lag1 lead1
0 52 93 15 72 NaN 61.0
1 61 21 83 87 52.0 75.0
2 75 75 88 24 61.0 3.0
3 3 22 53 2 75.0 88.0
4 88 30 38 2 3.0 NaN

To shift the values of a column down or up, we use the shift function on the desired column and pass in periods=1 to shift the values down by one row (a lag) or periods=-1 to shift them up by one row (a lead).

Ex 73: How to get the frequency of unique values in the entire DataFrame?

Q: Get the frequency of unique values in the entire DataFrame.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))

Desired output

# 8    5
# 5    4
# 7    3
# 6    2
# 4    2
# 3    2
# 2    2
# dtype: int64

Solution

the_dataframe
a b c d
0 7 4 8 5
1 7 3 7 8
2 5 4 8 8
3 3 6 5 2
4 8 6 2 5
pd.value_counts(the_dataframe.values.flatten())
8    5
5    4
7    3
6    2
4    2
3    2
2    2
dtype: int64

To get the frequency of unique values, that is, how many times each value is repeated in the dataframe, we use value_counts and pass in the values of the_dataframe flattened, which transforms them from a 2D array into a 1D array.
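
Note that the top-level pd.value_counts call is deprecated in newer Pandas versions; an equivalent sketch goes through an explicit Series:

# same frequencies, built from a Series of the flattened values
pd.Series(the_dataframe.values.flatten()).value_counts()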

Ex 74: How to split a text column into two separate columns?

Q: Split the string column in the_dataframe to form a dataframe with 3 columns as shown.

the_dataframe = pd.DataFrame(["Temperature, City    Province",
                              "33, Bujumbura    Bujumbura",
                              "30, Buganda    Cibitoke",
                              "25, Ncendajuru    Cankuzo",
                              "35, Giheta    Gitega"], 
                              columns=['row']
                            )

Desired output

Pandas_ex74

Solution

Step 1: split the string data

the_dataframe
row
0 Temperature, City Province
1 33, Bujumbura Bujumbura
2 30, Buganda Cibitoke
3 25, Ncendajuru Cankuzo
4 35, Giheta Gitega
df_splitted = the_dataframe.row.str.split(pat=",|\t|    ", expand=True)
df_splitted
0 1 2
0 Temperature City Province
1 33 Bujumbura Bujumbura
2 30 Buganda Cibitoke
3 25 Ncendajuru Cankuzo
4 35 Giheta Gitega

In this first step, we split the strings of the single column into three different columns. We call the split function on the str accessor of the row column, pass in as the pattern a regular expression that matches , or \t (tab) or four spaces, and set expand to True, which expands the split strings into separate columns.

Step 2: Rename the columns

new_header = df_splitted.iloc[0]
new_header
0    Temperature
1           City
2       Province
Name: 0, dtype: object
df_splitted.columns = new_header
df_splitted
Temperature City Province
0 Temperature City Province
1 33 Bujumbura Bujumbura
2 30 Buganda Cibitoke
3 25 Ncendajuru Cankuzo
4 35 Giheta Gitega

In step 2, we use the strings of the first row as the column names. We first extract that row using iloc and store it in new_header, then assign it to the columns of the dataframe.

Step 3: Drop the first row

df_splitted.drop(labels=0,axis="index",inplace=True)
df_splitted
Temperature City Province
1 33 Bujumbura Bujumbura
2 30 Buganda Cibitoke
3 25 Ncendajuru Cankuzo
4 35 Giheta Gitega

Now that the column names are all set, we no longer need the first row, so we drop it. We call the drop function on the dataframe, pass in labels set to 0 to target the first row, set axis to "index" (we could also have used 0) to tell Pandas we are dropping from the index, and set inplace to True so the change happens in the original dataframe rather than in a copy.
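
All three steps can also be sketched as one short chain (same pattern and dataframe as above):

# split, promote the first row to header, then drop that row
df_splitted = the_dataframe.row.str.split(pat=",|\t|    ", expand=True)
df_splitted.columns = df_splitted.iloc[0]
df_splitted = df_splitted.drop(index=0)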

Ex 75: How to reverse the columns of a DataFrame?

Q: Reverse all the columns of a DataFrame.

np.random.seed(42)
the_dataframe = pd.DataFrame(np.arange(25).reshape(5, -1))

Desired output

Pandas_ex75

Solution

the_dataframe[the_dataframe.columns[::-1]]
4 3 2 1 0
0 4 3 2 1 0
1 9 8 7 6 5
2 14 13 12 11 10
3 19 18 17 16 15
4 24 23 22 21 20

This exercise is similar to exercise 57; the difference is that this time we reverse the columns instead of the rows. To do the reversal, we extract the columns, reverse them using indexing [::-1], and pass the reversed list back into the original dataframe, again using indexing.

Conclusion

Yaaayy! We made it finally. In the last posts, we have explored more than 150 exercises on NumPy and Pandas. I am very confident that after going through all these exercises, you are ready to tackle the next step: Machine Learning with Scikit-learn. We will continue using NumPy in end-to-end Machine Learning projects coming in the next blog posts.

In the next post, we will introduce the common jargon used in Machine Learning, code our first Machine Learning algorithm, and after that finally start working on Machine Learning projects. I am super duper excited for the upcoming posts. Remember, practice makes perfect! Find the Jupyter notebook version of this post on my GitHub profile here.

Thank you again for doing these exercises with me. I hope you have learned one or two things. If you like this post, please subscribe to stay updated with new posts, and if you have a thought or a question, I would love to hear it by commenting below. Cheers, and keep learning!