Scatter Plot Supervised Machine Learning - Multiple Linear Regression

Hi Codecademy team,

I have a question regarding this exercise. Why do we plot y_test against y_predict as shown below, and what is the goal of this plot? I don’t follow the logic. I thought the purpose of the scatter plot in this exercise was to see how the features/columns in x_test influence the predicted rent prices (y_predict).

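import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# streeteasy is the rental dataset loaded earlier in the exercise
df = pd.DataFrame(streeteasy)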

x = df[['bedrooms', 'bathrooms', 'size_sqft','floor', 'building_age_yrs', 'has_roofdeck', 'has_washer_dryer', 'has_elevator', 'has_gym']]

y = df[['rent']]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state=6)

lm = LinearRegression()

model = lm.fit(x_train, y_train)

y_predict= lm.predict(x_test)

print("Train score:")
print(lm.score(x_train, y_train))

print("Test score:")
print(lm.score(x_test, y_test))

#Why do we plot y_test against y_predict ? 
#Why not x_test against y_predict? 
plt.scatter(y_test, y_predict)
plt.plot(range(20000), range(20000))

plt.xlabel("Prices: $Y_i$")
# Raw string so that \hat is passed to matplotlib's math renderer unchanged
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title("Actual Rent vs Predicted Rent")

plt.show()

The scatter plot generated from this code is as follows:

I would appreciate any easy explanation that I can get.

Thank you as always,
Jimmy Wijaya

I’m afraid I don’t have full access to this lesson (so I may have missed certain steps), but isn’t the purpose the same as the title on the graph: to view the outcome of your ‘best’ predictive model compared against the actual data? I believe this is just checking to see your best model and its accuracy for predicting y from a given x value since y is dependent upon x (x would be the independent variable).

I suppose you could plot it more than once, but ideally the model with the highest R^2 value would already use the theoretical ‘best’ features/columns, so a graph made before that point would be less useful (you could use one, but a numerical measure of the variance explained is likely more robust than judging by eye). If the lesson has no step that removes unhelpful data columns then I apologise, and multiple plots could be useful. Even so, you could just minimise the unexplained variance to find the best fit automatically, and then only need to plot once.
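For reference, the number that `lm.score(...)` prints for a linear regression is the R^2 value (the coefficient of determination). Here is a minimal sketch of that calculation in plain Python. The rent values and the `r_squared` helper are made up for illustration and are not part of the exercise:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared distance between each actual value and its prediction
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared distance between each actual value and the mean of the actual values
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical rents, for illustration only
y_actual = [2000, 3000, 4000, 5000]
y_close = [2100, 2900, 4100, 4900]  # predictions with small errors
y_rough = [3000, 3000, 4500, 3500]  # predictions with larger errors

print(r_squared(y_actual, y_close))
print(r_squared(y_actual, y_rough))
```

The closer the predictions sit to the actual values, the closer the result gets to 1, which is the same thing the scatter plot shows visually.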

Hi Tim,

Thanks for replying. As you said:

I believe this is just checking to see your best model and its accuracy for predicting y from a given x value since y is dependent upon x ( x would be the independent variable).

But as you can see from the graph plotted in the exercise, it plots the dependent variable (y_test) against the predicted dependent variable (y_predict). This is where I get confused! What’s the logic behind this? I don’t know how to interpret this particular graph.
I thought we were supposed to plot the independent variable on the x-axis and the dependent variable on the y-axis of a scatter plot, not dependent variable vs dependent variable.

Kind Regards,
Jimmy

I’m afraid I still can’t access the full lesson so I may be wrong. The way I interpret it is that you have training data, which you used to train your model. You then used this model to predict the y values for your testing data (that is, to predict the y points for the given x_test values).

You are then plotting the predicted y_values against the y_test data to see how your model (based on training data) performed compared to testing data.

Since this plots point by point, it loosely shows how well the model fitted each data point, because each actual value and its prediction come from the same row of x_test. If you wanted a different view, you could plot two scatter graphs in different colours against one x_test column at a time, though that would only give you a different interpretation of the same data. Plotting y_predict against y_test seems to be used as a straightforward way to see how your model performs compared with the testing data.
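To make the two views concrete, here is a rough sketch that draws both side by side. It does not use the StreetEasy data: the numbers are synthetic, only two features are simulated, and the fit uses NumPy's least-squares solver in place of the lesson's `LinearRegression`, so treat it as an illustration rather than the exercise's solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the exercise data (made-up numbers, not StreetEasy)
rng = np.random.default_rng(6)
n = 200
size_sqft = rng.uniform(300, 2000, n)
bedrooms = rng.integers(0, 5, n)
rent = 1000 + 3.0 * size_sqft + 400 * bedrooms + rng.normal(0, 500, n)

# 80/20 split (rows are already random), then ordinary least squares
# with an explicit intercept column
X = np.column_stack([np.ones(n), size_sqft, bedrooms])
X_train, X_test = X[:160], X[160:]
y_train, y_test = rent[:160], rent[160:]
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
y_predict = X_test @ coef

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# View 1: actual vs predicted. A point on the diagonal is a perfect prediction.
ax1.scatter(y_test, y_predict, alpha=0.6)
lims = [min(y_test.min(), y_predict.min()), max(y_test.max(), y_predict.max())]
ax1.plot(lims, lims, color="black")
ax1.set_xlabel("Actual rent")
ax1.set_ylabel("Predicted rent")
ax1.set_title("Actual vs predicted")

# View 2: one feature on the x-axis, actual and predicted rent in two colours
ax2.scatter(X_test[:, 1], y_test, alpha=0.6, label="actual")
ax2.scatter(X_test[:, 1], y_predict, alpha=0.6, label="predicted")
ax2.set_xlabel("size_sqft")
ax2.set_ylabel("Rent")
ax2.set_title("Rent vs one feature")
ax2.legend()

fig.tight_layout()
plt.show()
```

The first view needs only one plot no matter how many features the model uses, whereas the second would need a separate plot for each column of x_test.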

Hi Tim,

Thanks for clarifying this, I think I get your point now… Let me restate it: since the y_test values are driven by the x_test values too, just like the y_predict values, we can plot the y_predict values against the y_test values to determine how our model is performing. Correct?
And in the end, both plots (y_predict vs x_test) and (y_predict vs y_test) will give the same insight into how our model performs?

Kind Regards,

Jimmy

That sounds about right to me, but I will reiterate that I’ve not actually done this lesson, so take it with a grain of salt and make sure it all makes sense to you. As far as I can tell, you’re simply comparing the testing dataset (rent based on your x_test) against what your model predicts for the same rows (predicted rent based on x_test). This should give you a fair idea of how well your model matches real data.
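As a small illustration of that comparison, each dot on the scatter plot is one (actual, predicted) pair, and its distance from the y = x line is the prediction error for that listing. The values below are hypothetical and only meant to show the idea:

```python
# Hypothetical values, for illustration only (not from the exercise data)
y_test = [2500, 4000, 7000, 12000]
y_predict = [2700, 3900, 6200, 9500]

# Error for each listing: positive means the model predicted too high
errors = [predicted - actual for actual, predicted in zip(y_test, y_predict)]

for actual, predicted, error in zip(y_test, y_predict, errors):
    side = "above" if error > 0 else "below" if error < 0 else "on"
    print(f"actual={actual:>6}  predicted={predicted:>6}  "
          f"error={error:>+6}  ({side} the y = x line)")
```

Dots below the line are listings the model under-predicted, and dots above it are listings it over-predicted.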

I think they’d be likely to generate a similar insight, but I wouldn’t go so far as to say the same. You might be able to pick up on certain trends more easily with one than the other, but you’d probably have to actually plot both to be sure.
