FAQ: Decision Trees - How a Decision Tree is Built

This community-built FAQ covers the “How a Decision Tree is Built” exercise from the lesson “Decision Trees”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Machine Learning Fundamentals

FAQs on the exercise How a Decision Tree is Built

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!
You can also find further discussion and get answers to your questions over in #get-help.

Agree with a comment or answer? Like it to up-vote the contribution!

Need broader help or resources? Head to #get-help and #community:tips-and-resources. If you want feedback or inspiration for a project, check out #project.

Looking for motivation to keep learning? Join our wider discussions in #community

Learn more about how to use this guide.

Found a bug? Report it online, or post in #community:Codecademy-Bug-Reporting

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

Hello.
In the lesson you say:
" Now, let’s compare that with a different feature we could have split on first, persons_2. In this case, the left branch will have a Gini impurity of 1 - (505/917)^2 - (412/917)^2 = 0.4949 "
Where do 505, 412 and 917 come from?
And why isn’t it 1 - ((505/917)^2 + (412/917)^2), as per the Gini impurity formula 1 - (P1^2 + P2^2)?

Further in the exercises, after placing everything as written in the hints into the preloaded functions gini and info_gain, I get different answers for all three parts.

1. Calculate gini and info gain for a root node split at safety_low<=0.5

y_train_sub = y_train[x_train['safety_low']==0]
x_train_sub = x_train[x_train['safety_low']==0]
gi = gini(y_train_sub)
print(f'Gini impurity at root: {gi}')

2. Information gain when using feature persons_2

left = y_train[x_train['persons_2']==0]
right = y_train[x_train['persons_2']==1]
print(f'Information gain for persons_2: {info_gain(left, right, gi)}')

3. Which feature split maximizes information gain?

info_gain_list = []
for i in x_train.columns:
    left = y_train_sub[x_train_sub[i]==0]
    right = y_train_sub[x_train_sub[i]==1]
    info_gain_list.append([i, info_gain(left, right, gi)])
info_gain_table = pd.DataFrame(info_gain_list).sort_values(1, ascending=False)
print(f'Greatest impurity gain at:{info_gain_table.iloc[0,:]}')
print(info_gain_table)

Output:

Gini impurity at root: 0.49534472145275465
Information gain for persons_2: 0.16699155320608106
Greatest impurity gain at:0 persons_2
1 0.208137

  • 0.495 doesn’t match the expected 0.418
  • safety_low doesn’t give the largest information gain; persons_2 does
2 Likes

I was literally coming here to say the same thing; I am so lost. I don’t know if it actually doesn’t make sense, if it’s just worded badly, or if I am just dumb lol. But I usually get things fairly easily, and I have gone over it a dozen times and I still do not get it. @core6620398233 Let me know if you got an answer/figured it out. @system Can we get an explanation on this? I really have no idea how to interpret the results, and I don’t want my future learning to be impeded by a misunderstanding here.

1 Like

For the record, you get different impurities if you do ‘safety_low’ and ‘persons_2’; I am still not sure how to make sense of it, though. And the exercise saying it should be 0.418 but then giving 0.495 still makes no sense. My theory as of now is that it is computing the Gini impurity as if that node were the starting point of the tree? But that still doesn’t explain the discrepancies x.x

1 Like

@core6620398233 @carpen-them-diems You are not alone! This machine learning course has taken a serious dive post Logistic Regression II; it’s really not good. Below I shall try my best to explain what’s going on after working on this for a while. I ended up rewriting a lot of the code to make more sense to me - something a student really shouldn’t have to do on a paid course, but this is where we’re at right now I suppose… Sorry if I over-explain, but I am doing this partly to help my own understanding as well as anyone who may see this in the future.

The Problem

First of all, I think the whole point of the lesson was to find which feature to perform the initial split on. This is where the confusion first sets in, because in the exercises we are calculating the information gain for the second split, i.e. after already splitting the data on safety_low. Going forward I am assuming this to be wrong and an oversight by the person who wrote it.

Splitting the Data:

Like previous ML algorithms, we first split the data using train_test_split:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

We will be using the training data (x_train and y_train) for the rest of the exercise.

Calculating the Gini Impurity of the Root Node:

The first step is to calculate the Gini impurity of the root node. Now, what is explained poorly in this lesson is that the Gini impurity of the root node does not change, regardless of which feature you decide to split on first. This is because the impurity of the root node is calculated purely from the ratio of True and False values in our target variable - in this case, y_train.

We can verify this and calculate the impurity by doing the following:

print(y_train.value_counts())
False    970
True     412

p = 412 / (970 + 412)              # p is the proportion of True samples in the training data
gi_root = 1 - (p**2 + (1 - p)**2)  # Gini formula: g = 1 - sum(p^2)
print(gi_root)
0.41848785606128835

Note that this is the same result as using the provided gini() function:

gi_root = gini(y_train)
print(gi_root)
0.41848785606128835
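
For anyone curious, here is roughly what a gini() helper like this might look like, assuming it simply implements g = 1 - sum(p^2) over the class proportions (that’s an assumption on my part - the preloaded version in the exercise may differ in its details):

def gini_sketch(y):
    # class proportions (True/False) within the node; y is a pandas Series
    proportions = y.value_counts(normalize=True)
    # Gini impurity: 1 minus the sum of the squared class proportions
    return 1 - (proportions ** 2).sum()

Calling gini_sketch(y_train) should give the same 0.4185 value as above.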

Calculating the Gini Impurity of the Left and Right Nodes:

Now that we have the Gini impurity of the root node, we can calculate the impurity of the left and right nodes given a split on a certain feature. To do this we are going to apply a boolean mask to y_train, our outcome variable.

We will do this first for the feature safety_low. We want one mask selecting all the rows in our training data where safety_low == 0 and another selecting all the rows where safety_low == 1. To create the masks:

left_mask = x_train['safety_low'] == 0
right_mask = x_train['safety_low'] == 1

or

right_mask = ~left_mask  # ~ inverts the mask, i.e. True -> False, False -> True

We can then apply the boolean masks to our outcome variable, y_train:

left = y_train[left_mask]
right = y_train[right_mask]

left contains all of the rows where safety_low == 0 and right contains all the rows where safety_low == 1.
Now to calculate the GI for each node:

gi_left = gini(left)
gi_right = gini(right)

print(f'Gini impurity of left split: {gi_left}')
print(f'Gini impurity of right split : {gi_right}')
Gini impurity of left split: 0.49534472145275465
Gini impurity of right split : 0.0

Calculating the Information Gain:

Either using the values of gi_left, gi_right and gi_root directly, or using the provided info_gain() function, we can now calculate the information gain for splitting on the feature safety_low:

gain = info_gain(left, right, gi_root)
print(f'Info gain when splitting on safety_low: {gain}')
Info gain when splitting on safety_low: 0.09160335102155442
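
Side note: the information gain here is just the parent node’s impurity minus the size-weighted average of the child impurities. Here is a rough sketch of what the provided info_gain() presumably computes (again, an assumption on my part - the exercise’s preloaded version may differ):

def info_gain_sketch(left, right, parent_gini):
    # size-weighted average impurity of the two child nodes
    n = len(left) + len(right)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    # information gain = parent impurity minus weighted child impurity
    return parent_gini - weighted_children

Calling info_gain_sketch(left, right, gi_root) should return the same 0.0916 value as above (up to floating point).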

Repeating the Calculation for persons_2:

Repeat the above steps for a different feature, persons_2, as our initial split. This is where, in my opinion, the exercise went wrong. I am assuming we’re trying a new split on the root node rather than on the already-split data.

left_mask = x_train['persons_2'] == 0
right_mask = ~left_mask # invert mask for right node

left = y_train[left_mask]
right = y_train[right_mask]

gain = info_gain(left, right, gi_root)
print(f'Info gain when splitting on persons_2: {gain}')
Info gain when splitting on persons_2: 0.09013468781461476

Now, as expected, splitting the data initially on persons_2 shows a lower information gain than with safety_low. Next we shall do the same calculation for all features in the training data, again assuming that this is for an initial split.

Calculating the Information Gain For All Features in the Training Data:

We begin by initialising a list which will contain the name of each feature and the information gain associated with splitting the root node on that feature. We will then loop over the features in x_train and perform the same calculation as above. You could wrap this in a function, which I imagine will be useful for future exercises (I haven’t done it yet); a rough sketch of that is shown after the table below.

info_gain_list = []

for feature in x_train.columns:
  left_mask = x_train[feature] == 0
  right_mask = ~left_mask
  
  left = y_train[left_mask]
  right = y_train[right_mask]
  
  gain = info_gain(left, right, gi_root)
  info_gain_list.append([feature, gain])

We can then turn the info_gain_list into a Pandas DataFrame and sort it in descending order to see which features have the highest information gain:

info_gain_table = pd.DataFrame(info_gain_list, columns=['feature', 'information_gain'])\
                  .sort_values('information_gain', ascending=False)

print(info_gain_table)
           feature  information_gain
19      safety_low          0.091603
12       persons_2          0.090135
18     safety_high          0.045116
14    persons_more          0.025261
13       persons_4          0.020254
7      maint_vhigh          0.013622
3     buying_vhigh          0.011001
20      safety_med          0.008480
17  lug_boot_small          0.006758
1       buying_low          0.006519
5        maint_low          0.005343
6        maint_med          0.004197
15    lug_boot_big          0.003913
2       buying_med          0.003338
8          doors_2          0.002021
0      buying_high          0.001094
4       maint_high          0.000530
10         doors_4          0.000423
16    lug_boot_med          0.000386
11     doors_5more          0.000325
9          doors_3          0.000036
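
As mentioned above, you could wrap that loop into a reusable helper. A rough sketch (the function name and signature are my own, not from the exercise):

def best_splits(X, y, parent_gini):
    # information gain of splitting the node (X, y) on each binary (0/1) feature
    gains = []
    for feature in X.columns:
        left = y[X[feature] == 0]
        right = y[X[feature] == 1]
        gains.append([feature, info_gain(left, right, parent_gini)])
    return pd.DataFrame(gains, columns=['feature', 'information_gain'])\
              .sort_values('information_gain', ascending=False)

best_splits(x_train, y_train, gini(y_train)) should reproduce the table above.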

Conclusion

As expected, we can now see that splitting on safety_low gives the highest information gain for the initial split and should be used first, with persons_2 coming in second - unlike the solution code provided to us, which shows persons_2 at the top. Once you have split into left and right nodes, the next step would be to split each of them again (if they’re not already pure, i.e. gi = 0) and repeat.

Please note: if you copy this code you will not get the ‘correct’ answer, and Codecademy will flag an error and prevent you from progressing. The reason their numbers differ is that they are performing the calculation on the second split, using a gi equal to my gi_left. Their results show what would be the best split for the left node once we’ve already split on safety_low.
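
You can check this theory by rerunning the same search on just the left node, i.e. the rows where safety_low == 0. This reproduces the numbers from the first post above (an impurity of roughly 0.495 and persons_2 as the best split, with a gain of about 0.208):

left_rows = x_train['safety_low'] == 0
x_left = x_train[left_rows]
y_left = y_train[left_rows]

gi_left_node = gini(y_left)   # ~0.4953, the "Gini impurity at root" from the first post

gains = []
for feature in x_left.columns:
    l = y_left[x_left[feature] == 0]
    r = y_left[x_left[feature] == 1]
    gains.append([feature, info_gain(l, r, gi_left_node)])

print(pd.DataFrame(gains, columns=['feature', 'information_gain'])
        .sort_values('information_gain', ascending=False)
        .head(1))             # persons_2, gain ~0.208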

Thanks for reading if you’ve gotten this far and I hope Codecademy makes some serious improvements to this part of the course in the near future. You are not stupid, you are great, this was just taught incredibly poorly!

3 Likes

This could also go under course suggestions if you haven’t already done so.

Thank you for taking a look into it.

Hi Lisa can you send a link quickly so I can do so? I can’t recall how to get onto it.

It’s under Community >

1 Like

Apropos of nothing, I found this article which provides a decent explanation:
https://data36.com/coding-a-decision-tree-in-python-classification-tree-gini-impurity/

This is an error, and your output is correct. The numbers should read:
1 - (500/912)**2 - (412/912)**2 = 0.4953, not 1 - (505/917)**2 - (412/917)**2 = 0.4949
(i.e. 1 - (value[0]/samples)**2 - (value[1]/samples)**2 - refer to the “persons_2 <= 0.5” node in the tree plot)
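
A quick way to check the arithmetic of both expressions:

print(1 - (500/912)**2 - (412/912)**2)   # ~0.4953, matching the computed output above
print(1 - (505/917)**2 - (412/917)**2)   # ~0.4949, the lesson's printed numbers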

Codecademy needs to get their act together or I won’t be renewing my subscription.

I figured that out quite a while ago; I just want to make the comment that as I got further into the course I realized a lot. One, no, it wasn’t an oversight: these are very introductory lessons, so a lot is left out, because the lessons need to be accessible for everyone. Two, there is an expected level of prior knowledge, and an expectation that the student will research the concepts themselves to get a better understanding. This is not a bad thing; it is both good and necessary, as not every student will make sense of nuanced concepts the same way, and learning to use resources to learn and to understand the code is also a very big and important part of the learning process.

I just felt the need to clarify that now that I have more understanding and context! I felt the same way back then, and although there is some level of truth to it, as the course is still a work in progress (or at least it was), most of the big issues I thought it had did not actually exist; at most there were a few issues, like occasionally unclear or overly technical explanations that expected knowledge/understanding the students might not necessarily have.

Overall, though, the content makes sense and is good; it has made me fairly competent in a short time, and to be honest I think the bigger issue is the user expectation that they can go into this with little previous experience/knowledge and no off-platform research/sources. Researching in this manner is a standard programming skill/practice and something that should be expected of us, even if we didn’t realize it at the time.