In the context of Predictions in sklearn, I don’t quite get the relationship between the coefficients of the features, in this case ‘hours_studied’ and ‘practice_test’, and the probabilities calculated through predict_proba().
In the code at the bottom of this post, feature 1, ‘hours_studied’, has over 12x the impact of the other feature, ‘practice_test’, on “passed_exam”, as shown by this print statement:
print('Coefficients: ', cc_lr.coef_)  # prints: Coefficients: [[1.5100409 0.12002228]]
Keep that in mind.
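A side note on how I read this (my assumption, not something the lesson states): each coefficient is the weight of its feature in the model’s log-odds, so the weighted sum for each sample should be inspectable directly:

# My own check: for a binary model, decision_function returns
# intercept + coef . x for each sample, i.e. the weighted sum
# that the coefficients above feed into
print(cc_lr.decision_function(X_test))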
Now, let’s go further with the code. The following print statements
print(cc_lr.predict(X_test))
print(cc_lr.predict_proba(X_test))
yield
[0 1 0 1 1]
[[0.67934073 0.32065927]
[0.2068119 0.7931881 ]
[0.94452517 0.05547483]
[0.42252072 0.57747928]
[0.12929566 0.87070434]]
Looking at the results of these print statements, I find that wherever the second element is higher than 0.5, the prediction is 1 (e.g. 0.7931881 is classified as 1). I assume that 0.7931881 is the probability of feature 2, ‘practice_test’, but that contradicts the finding at the top of my note, where the impact of feature 1, ‘hours_studied’, is higher.
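One more thing I noticed while poking at this (my own check, in case it’s relevant): the two numbers in each row of predict_proba() sum to 1, and thresholding the second column at 0.5 reproduces the predictions exactly:

import numpy as np
probs = cc_lr.predict_proba(X_test)
print(np.allclose(probs.sum(axis=1), 1.0))  # True: each row sums to 1
print((probs[:, 1] > 0.5).astype(int))      # [0 1 0 1 1], same as cc_lr.predict(X_test)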
Could you please elaborate on how the coefficient of a feature relates to the prediction probability? Isn’t the coefficient the weight of the feature? If so, why does the feature with the smaller weight seem to matter more here than the feature with the larger weight?
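For completeness, here is what I tried in order to connect the coefficients to the probabilities myself. This is a minimal sketch assuming sklearn uses the standard logistic formulation P(y=1) = 1 / (1 + exp(-(intercept + X @ coef))); manual_proba is my own helper, not part of sklearn:

import numpy as np

def manual_proba(model, X):
    # Weighted sum of BOTH features plus the intercept (the log-odds)
    log_odds = model.intercept_[0] + X @ model.coef_[0]
    # The sigmoid squashes the log-odds into a probability in (0, 1)
    return 1 / (1 + np.exp(-log_odds))

print(manual_proba(cc_lr, X_test))  # appears to reproduce predict_proba(X_test)[:, 1]

And here is the full code: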
# Import pandas and the data
import pandas as pd
codecademyU = pd.read_csv('codecademyU_2.csv')
# Separate out X and y
X = codecademyU[['hours_studied', 'practice_test']]
y = codecademyU.passed_exam
# Transform X
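# (standardizing puts both features on the same scale: mean 0, std 1)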
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27)
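# (20 rows * 0.25 = 5 test samples, matching the 5 predictions printed below)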
print(codecademyU)
print("\n\n")
# Create and fit the logistic regression model here:
from sklearn.linear_model import LogisticRegression
cc_lr = LogisticRegression()
cc_lr.fit(X_train, y_train)
# Print the intercept and coefficients here:
print('Coefficients: ',cc_lr.coef_)
print()
print('Intercept: ',cc_lr.intercept_)
print('Coefficient for hours_studied [feature 1]: ',cc_lr.coef_[0][0])
print('Coefficient for practice_test [feature 2]: ',cc_lr.coef_[0][1])
print('\nAs can be seen from the list of coefficients above, feature 1 has over 12x the impact of the other feature on "passed_exam".')
print()
print()
# Print out the predicted outcomes for the test data
print(cc_lr.predict(X_test))
# Print out the predicted probabilities for the test data
print(cc_lr.predict_proba(X_test))
print()
print()
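# Print out just the second column of the predicted probabilities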
print(cc_lr.predict_proba(X_test)[:,1])
print()
# Print out the true outcomes for the test data
print(y_test)
Console:
hours_studied practice_test passed_exam
0 0 55 0
1 1 75 0
2 2 32 0
3 3 80 0
4 4 75 0
5 5 95 0
6 6 83 0
7 7 87 0
8 8 78 0
9 9 85 1
10 10 77 1
11 11 89 0
12 12 96 0
13 13 83 1
14 14 98 1
15 15 87 1
16 16 90 1
17 17 92 1
18 18 92 1
19 19 100 1
Coefficients: [[1.5100409 0.12002228]]
Intercept: [-0.13173123]
Coefficient for hours_studied [feature 1]: 1.5100409021782504
Coefficient for practice_test [feature 2]: 0.12002227802788998
As can be seen from the list of coefficients above, feature 1 has over 12x the impact of the other feature on "passed_exam".
[0 1 0 1 1]
[[0.67934073 0.32065927]
[0.2068119 0.7931881 ]
[0.94452517 0.05547483]
[0.42252072 0.57747928]
[0.12929566 0.87070434]]
[0.32065927 0.7931881 0.05547483 0.57747928 0.87070434]
7 0
15 1
0 0
11 0
17 1
Name: passed_exam, dtype: int64