@core6620398233 @carpen-them-diems You are not alone! This machine learning course has taken a serious dive post Logistic Regression II; it’s really not good. Below I’ll try my best to explain what’s going on after working on this for a while. I ended up rewriting a lot of the code to make more sense to me - something a student really shouldn’t have to do on a paid course, but this is where we’re at right now, I suppose… Sorry if I over-explain, but I’m doing this partly to help with my own understanding, as well as for people who may see this in the future.
The Problem
First of all, I think the whole point of the lesson was to find which feature to perform the initial split on. This is where the confusion first sets in, because in the exercises we are calculating the information gain for the second split, i.e. after already splitting the data on safety_low. Going forward I am assuming this to be wrong and an oversight by the person who wrote it.
Splitting the Data:
Like previous ML algorithms, we first split the data using train_test_split:
from sklearn.model_selection import train_test_split  # if not already imported by the exercise

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
We will be using the training data (x_train and y_train) for the rest of the exercise.
Calculating the Gini Impurity of the Root Node:
The first step is to calculate the Gini impurity of the root node. What is explained poorly in this lesson is that the Gini impurity of the root node does not change, regardless of which feature you decide to split on first. This is because the impurity of the root node is calculated purely from the ratio of True and False values in our target variable - in this case, y_train.
We can verify this and calculate the impurity by doing the following:
print(y_train.value_counts())
False 970
True 412
p = 412 / (970 + 412)  # p is the proportion of True samples in the training data
gi_root = 1 - (p**2 + (1 - p)**2)  # Gini formula: g = 1 - sum(p_i^2)
print(gi_root)
0.41848785606128835
Note that this is the same result as using the provided gini() function:
gi_root = gini(y_train)
print(gi_root)
0.41848785606128835
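For what it’s worth, the lesson never shows the body of the provided gini() function, but based on the formula above it presumably does something along these lines (a minimal sketch of my own; it assumes y is a pandas Series of labels):

def gini(y):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # value_counts(normalize=True) gives the proportion of each class.
    proportions = y.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()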
Calculating the Gini Impurity of the Left and Right Nodes:
Now that we have the Gini impurity of the root node, we can calculate the impurity of the left and right nodes given a split on a certain feature. To do this we are going to apply a Boolean mask to y_train, our outcome variable.
We will do this first for the feature safety_low. We want all the rows in our training data where safety_low == 0 (the left node) and all the rows where safety_low == 1 (the right node). To create the masks:
left_mask = x_train['safety_low'] == 0
right_mask = x_train['safety_low'] == 1
or
right_mask = ~left_mask  # ~ inverts the mask, i.e. True -> False, False -> True
We can then apply the boolean masks to our outcome variable, y_train:
left = y_train[left_mask]
right = y_train[right_mask]
left contains all of the rows where safety_low == 0 and right contains all the rows where safety_low == 1.
Now to calculate the GI for each node:
gi_left = gini(left)
gi_right = gini(right)
print(f'Gini impurity of left split: {gi_left}')
print(f'Gini impurity of right split : {gi_right}')
Gini impurity of left split: 0.49534472145275465
Gini impurity of right split : 0.0
Calculating the Information Gain:
Using gi_left, gi_right and gi_root (or simply the provided info_gain() function), we can now calculate the information gain for splitting on the feature safety_low:
gain = info_gain(left, right, gi_root)
print(f'Info gain when splitting on safety_low: {gain}')
Info gain when splitting on safety_low: 0.09160335102155442
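Similarly, the body of info_gain() isn’t shown, but the standard definition is the parent’s impurity minus the size-weighted average impurity of the two child nodes. My guess at a sketch (using the gini() sketch above):

def info_gain(left, right, parent_gini):
    # Information gain = parent impurity minus the size-weighted
    # average impurity of the left and right child nodes.
    n = len(left) + len(right)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return parent_gini - weighted_children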
Repeating the Calculation for persons_2:
Now we repeat the above steps with a different feature, persons_2, as our initial split. This is where, in my opinion, the exercise went wrong. I am assuming we’re trying a new split on the root node rather than on the already-split data.
left_mask = x_train['persons_2'] == 0
right_mask = ~left_mask # invert mask for right node
left = y_train[left_mask]
right = y_train[right_mask]
gain = info_gain(left, right, gi_root)
print(f'Info gain when splitting on persons_2: {gain}')
Info gain when splitting on persons_2: 0.09013468781461476
Now, as expected, splitting the data initially on persons_2 shows a lower information gain than with safety_low. Next we shall do the same calculation for all features in the training data, again assuming that this is for an initial split.
Calculating the Information Gain For All Features in the Training Data:
We begin by initialising a list which will contain the name of each feature and the information gain associated with splitting the root node on that feature. We will then loop over the features in x_train and perform the same calculation as above. You could make this into a function which will be useful for future exercises (I imagine - I haven’t done it yet); a sketch of one is included after the results table below.
info_gain_list = []
for feature in x_train.columns:
    left_mask = x_train[feature] == 0
    right_mask = ~left_mask
    left = y_train[left_mask]
    right = y_train[right_mask]
    gain = info_gain(left, right, gi_root)
    info_gain_list.append([feature, gain])
We can then turn the info_gain_list into a Pandas DataFrame and sort it in descending order to see which features have the highest information gain:
info_gain_table = pd.DataFrame(info_gain_list, columns=['feature', 'information_gain'])\
    .sort_values('information_gain', ascending=False)
print(info_gain_table)
feature information_gain
19 safety_low 0.091603
12 persons_2 0.090135
18 safety_high 0.045116
14 persons_more 0.025261
13 persons_4 0.020254
7 maint_vhigh 0.013622
3 buying_vhigh 0.011001
20 safety_med 0.008480
17 lug_boot_small 0.006758
1 buying_low 0.006519
5 maint_low 0.005343
6 maint_med 0.004197
15 lug_boot_big 0.003913
2 buying_med 0.003338
8 doors_2 0.002021
0 buying_high 0.001094
4 maint_high 0.000530
10 doors_4 0.000423
16 lug_boot_med 0.000386
11 doors_5more 0.000325
9 doors_3 0.000036
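As mentioned above, the loop-and-table step can be wrapped into a reusable function for later exercises. A sketch of what that might look like (the name rank_splits is mine, not from the lesson; it assumes the features are all 0/1 dummy columns and that gini() and info_gain() exist as in the exercise):

def rank_splits(x, y, parent_gini):
    # Rank every feature by the information gain of splitting (x, y) on it.
    gains = []
    for feature in x.columns:
        left_mask = x[feature] == 0
        gains.append([feature, info_gain(y[left_mask], y[~left_mask], parent_gini)])
    return (pd.DataFrame(gains, columns=['feature', 'information_gain'])
              .sort_values('information_gain', ascending=False))

info_gain_table = rank_splits(x_train, y_train, gi_root)  # reproduces the table above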
Conclusion
As expected, we can now see that splitting on safety_low gives us the highest information gain for the initial split and should be used first, with persons_2 coming in second - unlike the solution code provided to us, which shows persons_2 at the top. Once you have split into left and right nodes, the next step would be to split each of them again (if they’re not already pure, i.e. their Gini impurity is 0) and repeat.
Please note: if you copy this code you will not get the ‘correct’ answer and Codecademy will flag an error and prevent you from progressing. The reason our answers differ is that they are performing the calculation on the second split, using a parent impurity equal to gi_left in my calculation. Their results show what would be the best split for the left node once we’ve already split on safety_low.
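If my reading is right, you should be able to reproduce the answer the exercise expects by redoing the calculation inside the left node of the safety_low split, i.e. on the already-split data, using that node’s own impurity as the parent value. A sketch (the names x_left, y_left and second_gains are mine, not from the exercise):

left_mask = x_train['safety_low'] == 0
x_left = x_train[left_mask]    # training rows in the left node
y_left = y_train[left_mask]
gi_parent = gini(y_left)       # same value as gi_left earlier

second_gains = []
for feature in x_left.columns:
    mask = x_left[feature] == 0
    second_gains.append([feature, info_gain(y_left[mask], y_left[~mask], gi_parent)])

second_split_table = (pd.DataFrame(second_gains, columns=['feature', 'information_gain'])
                        .sort_values('information_gain', ascending=False))
print(second_split_table)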
Thanks for reading if you’ve gotten this far and I hope Codecademy makes some serious improvements to this part of the course in the near future. You are not stupid, you are great, this was just taught incredibly poorly!