FAQ: Logistic Regression - Log Loss II

This community-built FAQ covers the “Log Loss II” exercise from the lesson “Logistic Regression”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Machine Learning

FAQs on the exercise Log Loss II

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.


import numpy as np

def log_loss(probabilities, actual_class):
  # actual_class.shape[0] is the number of samples, so this is the mean per-sample loss
  return np.sum(-(1 / actual_class.shape[0]) * (actual_class * np.log(probabilities) + (1 - actual_class) * np.log(1 - probabilities)))
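For reference, a quick usage sketch with made-up numbers (four predicted probabilities and their true labels):

import numpy as np

probabilities = np.array([0.9, 0.2, 0.7, 0.4])  # predicted P(passed_exam = 1)
actual_class = np.array([1, 0, 1, 0])           # true labels
print(log_loss(probabilities, actual_class))    # prints ~0.299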

What does .shape[0] do?

What is the idea behind:

Now that we have calculated the loss for our best coefficients, let’s compare this loss to the loss we begin with when we initialize our coefficients and intercept to 0. probabilities_2 contains the calculated probabilities of the students passing the exam with the coefficient for hours_studied set to 0.

Why do we initialize the coefficients and intercept to 0?

numpy.ndarray.shape — NumPy v1.21.dev0 Manual
The shape attribute contains the dimensions of an array in the form of a tuple:

>>> print(some_array.shape)
(rows, columns, depth, dim4, dim5, ...)

.shape[0] fetches the first element of the tuple, the number of rows.
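A concrete example (using a hypothetical 5×3 array):

import numpy as np

a = np.zeros((5, 3))   # 5 rows, 3 columns
print(a.shape)         # (5, 3)
print(a.shape[0])      # 5, the number of rows (i.e. the number of samples)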

That quote is from later in the lesson. You start at 0 because we don’t know how close the “best” coefficients are to 0. The log loss is computed for those all-zero coefficients, and then gradient descent adjusts the coefficients step by step, recomputing the log loss after each update, until it reaches the coefficients that give minimal log loss.
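A minimal sketch of that loop, with hypothetical one-feature data rather than the lesson's dataset:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up scaled hours_studied values and pass/fail labels
X = np.array([-1.5, -0.5, 0.5, 1.5])
y = np.array([0, 0, 1, 1])

b0 = b1 = 0.0            # initialize intercept and coefficient to 0
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(b0 + b1 * X)                     # current predicted probabilities
    b0 -= learning_rate * np.mean(p - y)         # gradient of the log loss w.r.t. the intercept
    b1 -= learning_rate * np.mean((p - y) * X)   # gradient of the log loss w.r.t. the coefficient

print(b0, b1)            # the coefficients that (approximately) minimize the log loss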

Why is the content of probabilities_2 a list of 0.5, 0.5, 0.5, etc.? Shouldn’t it be zero, since the chances of passing without studying are closer to zero? Or shouldn’t it be zero because we have reinitialized the coefficients to zero? For either reason, shouldn’t the list contents be zeros? What is my misunderstanding here?

thanks!
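With the coefficient and the intercept both set to 0, the log-odds is 0 for every student no matter how many hours they studied, and the sigmoid of 0 is exactly 0.5, which is why probabilities_2 is all 0.5s rather than all zeros. A quick check (sigmoid here is just the standard logistic function):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

hours_studied = np.array([0, 5, 10, 19])   # any values at all
log_odds = 0 * hours_studied + 0           # coefficient = 0, intercept = 0
print(sigmoid(log_odds))                   # [0.5 0.5 0.5 0.5]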

In the context of Predictions in sklearn, I don’t quite get the relationship between coefficients of features, in this case ‘hours_studied’ and ‘practice_test’, and the probabilities calculated through predict_proba().

In the code at the bottom of the page, feature 1, ‘hours_studied’, has almost 12x more impact on “passed_exam” than the other feature, ‘practice_test’, as shown by the line print('Coefficients: ', cc_lr.coef_), which prints Coefficients: [[1.5100409 0.12002228]]. Keep that in mind.
Now, let’s go further with the code. The following print statements

print(cc_lr.predict(X_test))
print(cc_lr.predict_proba(X_test))

are yielding

[0 1 0 1 1]
[[0.67934073 0.32065927]
 [0.2068119  0.7931881 ]
 [0.94452517 0.05547483]
 [0.42252072 0.57747928]
 [0.12929566 0.87070434]]

Looking at the results of these print statements, I find that wherever the second element is higher than 0.5, the prediction is 1 (e.g. 0.7931881 is classified as 1). I assumed that 0.7931881 is the probability for feature 2, ‘practice_test’, which contradicts the finding at the top of my note that the impact of feature 1, ‘hours_studied’, is higher.

Could you please elaborate on how the coefficient of a feature is related to the prediction probability? Isn’t the coefficient the weight of the feature? Then why is the feature with less weight more important in this case than the feature with more weight?

# Import pandas and the data
import pandas as pd
codecademyU = pd.read_csv('codecademyU_2.csv')

# Separate out X and y
X = codecademyU[['hours_studied', 'practice_test']]
y = codecademyU.passed_exam
# print('X.info():')
# print(X.info())

# Transform X
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

print(codecademyU)
print("\n\n")
# print(X)
# print(y_train)

# Create and fit the logistic regression model here:
from sklearn.linear_model import LogisticRegression
cc_lr = LogisticRegression()
cc_lr.fit(X_train, y_train)

# Print the intercept and coefficients here:
print('Coefficients: ', cc_lr.coef_)
print()
print('Intercept: ', cc_lr.intercept_)
print('Coefficient for hours_studied [feature 1]: ', cc_lr.coef_[0][0])
print('Coefficient for practice_test [feature 2]: ', cc_lr.coef_[0][1])
# coef = cc_lr.coef_
# intercept = cc_lr.intercept_

print('\nAs it could be seen from the list of coefficients above, feature 1 has more impact, almost 12x, on "passed_exam" than the other feature.')
print()
print()

# Print out the predicted outcomes for the test data
print(cc_lr.predict(X_test))

# Print out the predicted probabilities for the test data
print(cc_lr.predict_proba(X_test))
print()
print()
print(cc_lr.predict_proba(X_test)[:, 1])
print()

# Print out the true outcomes for the test data
print(y_test)

Console:

    hours_studied  practice_test  passed_exam
0               0             55            0
1               1             75            0
2               2             32            0
3               3             80            0
4               4             75            0
5               5             95            0
6               6             83            0
7               7             87            0
8               8             78            0
9               9             85            1
10             10             77            1
11             11             89            0
12             12             96            0
13             13             83            1
14             14             98            1
15             15             87            1
16             16             90            1
17             17             92            1
18             18             92            1
19             19            100            1



Coefficients:  [[1.5100409  0.12002228]]

Intercept:  [-0.13173123]
Coefficient for hours_studied [feature 1]:  1.5100409021782504
Coefficient for practice_test [feature 2]:  0.12002227802788998

As it could be seen from the list of coefficients above, feature 1 has more impact, almost 12x, on "passed_exam" than the other feature.


[0 1 0 1 1]
[[0.67934073 0.32065927]
 [0.2068119  0.7931881 ]
 [0.94452517 0.05547483]
 [0.42252072 0.57747928]
 [0.12929566 0.87070434]]



[0.32065927 0.7931881  0.05547483 0.57747928 0.87070434]

7     0
15    1
0     0
11    0
17    1
Name: passed_exam, dtype: int64

I got the answer to my issue.
My mistake was assuming that the array returned by the predict_proba() method has one column per feature, with the first column representing “hours_studied” and the second representing “practice_test”. In fact, the first column is the probability of the negative class (0) and the second column is the probability of the positive class (1). Therefore, it’s only the second column that should be compared against 0.5, and it represents the probability of the positive class given all the features together.
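A quick way to confirm that reading, reusing cc_lr and X_test from the code above:

proba = cc_lr.predict_proba(X_test)
print(cc_lr.classes_)      # [0 1], the column order follows classes_
print(proba[:, 0])         # P(passed_exam = 0) for each test sample
print(proba[:, 1])         # P(passed_exam = 1) for each test sample
print(proba.sum(axis=1))   # each row sums to 1.0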

Now, the issue for me is how the probabilities are calculated.
Following is the list of standardized (scaled) values of X (‘hours_studied’ & ‘practice_test’).
Considering the equation aX + b, multiplying the coefficients by the corresponding features, summing them, and adding the intercept, as shown in the example below, gives a value that not only is not equal to the probability, but whose absolute value is even larger than 1.

[[-1.64750894 -1.79313169]
 [-1.47408695 -0.48666051]
 [-1.30066495 -3.29557355]
 [-1.12724296 -0.16004272]
 [-0.95382097 -0.48666051]
 [-0.78039897  0.81981066]
 [-0.60697698  0.03592796]
 [-0.43355498  0.29722219]
 [-0.26013299 -0.29068984]
 [-0.086711    0.16657508]
 [ 0.086711   -0.3560134 ]
 [ 0.26013299  0.42786931]
 [ 0.43355498  0.88513422]
 [ 0.60697698  0.03592796]
 [ 0.78039897  1.01578134]
 [ 0.95382097  0.29722219]
 [ 1.12724296  0.49319287]
 [ 1.30066495  0.62383999]
 [ 1.47408695  0.62383999]
 [ 1.64750894  1.14642846]]

Intercept: [-0.13173123]
Coefficient for hours_studied [feature 1]: 1.5100409021782504
Coefficient for practice_test [feature 2]: 0.12002227802788998

Example for the first row of the list above:
(coef_of_feature 1 * scaled_feature_1) + (coef_of_feature 2 * scaled_feature_2) + intercept
(1.5100409021782504 * (-1.64750894) ) + (0.12002227802788998 * (-1.79313169) ) + -0.13173123
is equal to -2.8347528663421332,

while the first probability printed for the test data is 0.32065927.

I would like to ask: how does -2.8347528663421332 transform into the probability 0.32065927? What kind of calculation does it go through in the process? Thank you.
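Two things resolve this. The linear combination aX + b is the log-odds, not a probability, which is why its magnitude can exceed 1; LogisticRegression converts it into a probability with the sigmoid function, 1 / (1 + e^(-z)). Also, predict_proba was called on X_test, and the y_test indices printed above show that the first test sample is row 7 of the scaled X, while row 0 appears as the third test sample. A quick check, reusing the coefficients, intercept, and scaled values above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

coef = np.array([1.5100409021782504, 0.12002227802788998])
intercept = -0.13173123

# Row 0 of the scaled X (the row used in the example above):
z0 = coef @ np.array([-1.64750894, -1.79313169]) + intercept
print(sigmoid(z0))   # ~0.0555, the third row of predict_proba(X_test)

# Row 7 of the scaled X, the first sample in X_test:
z7 = coef @ np.array([-0.43355498, 0.29722219]) + intercept
print(sigmoid(z7))   # ~0.3207, matching predict_proba(X_test)[0, 1]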