FAQ: Multiple Linear Regression - Rebuild the Model

I also just deleted the columns with lower correlation and kept only:
x = df[['bedrooms', 'bathrooms', 'size_sqft', 'floor', 'building_age_yrs', 'no_fee', 'has_washer_dryer', 'has_elevator', 'has_dishwasher']]
This gave a test score of 0.8083584281733305.
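
For anyone who wants to reproduce that number, here's a minimal sketch of fitting and scoring that column subset. The column names come from the post above; the CSV filename, the 'rent' target, and the split settings are assumptions about the usual setup for this exercise, and a different random_state will give a slightly different score:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: a local copy of the exercise's rentals data, with a 'rent' column as the target.
df = pd.read_csv('manhattan.csv')

x = df[['bedrooms', 'bathrooms', 'size_sqft', 'floor', 'building_age_yrs', 'no_fee', 'has_washer_dryer', 'has_elevator', 'has_dishwasher']]
y = df['rent']

# 80/20 train/test split.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=6)

mlr = LinearRegression()
mlr.fit(x_train, y_train)

# .score() returns R^2 on the test set -- the "test score" quoted above.
print(mlr.score(x_test, y_test))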

I don't know why they didn't tell us that this code will help check the correlation between rent and the other variables:

import seaborn as sns

# Make the figure big enough to read the annotations.
sns.set(rc={"figure.figsize": (15, 10)})

# Annotated correlation heatmap of all numeric columns, on a -1 to 1 scale.
sns.heatmap(df.corr(), cmap="seismic", annot=True, vmin=-1, vmax=1)

After checking the heatmap from this code, you can simply drop the columns with the lowest correlation to rent and get a slightly higher test score.
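
If you want to pick the columns programmatically instead of eyeballing the heatmap, something like this rough sketch works (it assumes the same df with a 'rent' column, and the 0.2 cutoff is an arbitrary choice):

# Correlation of every numeric column with rent, sorted by absolute strength.
corr_with_rent = df.corr()['rent'].drop('rent')
print(corr_with_rent.sort_values(key=abs, ascending=False))

# Keep only the features whose absolute correlation with rent clears the cutoff.
selected = corr_with_rent[corr_with_rent.abs() > 0.2].index.tolist()
x = df[selected]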

Hi,

I’ve just finished this exercise, except for the part where it asks us to: “Post your best model in the Slack channel!”

Could someone here please direct me to the relevant link for this?

Thank you!

Yeah, I think this lesson was really unclear. I don’t think a larger or smaller coefficient for slope necessarily has to do with the strength of the correlation.

As far as I understand R^2 values, they tell you how much of the variation in y (i.e. the distribution of rent prices) can be attributed to the given variables (sq ft, # of bedrooms, etc.). If you take some of those variables away, you have fewer things to explain the variation and will therefore get a lower R^2.
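
To see this concretely, here's a sketch that reuses x_train and y_train from the snippet earlier in the thread (the two-column subset is just an example):

# Training-set R^2 with the full feature set vs. a reduced one.
full = LinearRegression().fit(x_train, y_train)
reduced = LinearRegression().fit(x_train[['bedrooms', 'size_sqft']], y_train)

print(full.score(x_train, y_train))     # R^2 using every feature
print(reduced.score(x_train, y_train))  # R^2 using only two features -- never higher on the training data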


Agree with others - this lesson really falls below the normally high standards of Codecademy. The solution provided makes no sense, the hints are useless, and the language is ambiguous. They definitely need to rewrite this one. I am walking away frustrated with a suboptimal understanding of multiple regression.

I don’t think a larger or smaller coefficient for slope necessarily has to do with the strength of the correlation.

Amen.
Thanks for confirming this.

Typically, the R-squared value does not decrease when more variables are included in a multiple linear regression model. R-squared is a measure of the proportion of the variance in the dependent variable that is predictable from the independent variables, and it generally either stays the same or increases when additional variables are added. This is because R-squared will increase as long as a new variable has any predictive power, even if it’s minimal.

However, there are important considerations to keep in mind:

  1. Increase in R-squared Might be Misleading: Adding more variables to a model can artificially inflate the R-squared value. This doesn’t necessarily mean the model is better. An increase in R-squared does not always translate to an improvement in model performance, especially if the added variables do not have a meaningful relationship with the dependent variable.
  2. Adjusted R-squared: To address the issue of R-squared potentially increasing with the addition of more variables, it’s common to look at the adjusted R-squared. The adjusted R-squared compensates for the number of predictors in the model and can decrease if the added variables do not contribute enough to the model’s predictive power. It is a more robust measure when comparing models with different numbers of independent variables.
  3. Overfitting: Including too many variables, especially those not relevant to the prediction, can lead to overfitting. An overfitted model may have a high R-squared on the training data but performs poorly on unseen data.
  4. Model Complexity: A more complex model (with more variables) is not always a better model. It’s often beneficial to find the simplest model that adequately explains the data, following the principle of Occam’s razor.

In summary, while R-squared itself may not decrease with the addition of more variables, this increase might not always indicate an improvement in the model. It’s essential to consider other metrics, like adjusted R-squared, and to be mindful of overfitting and the relevance of the included variables.
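
scikit-learn doesn't expose adjusted R-squared directly, but it follows from the usual formula: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. Here's a quick sketch; the helper name is made up, and it assumes the mlr model and training split from the snippet near the top of the thread:

def adjusted_r2(r2, n_samples, n_features):
    # Penalizes R^2 for the number of predictors, so adding weak features can lower it.
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

r2 = mlr.score(x_train, y_train)
print(adjusted_r2(r2, n_samples=len(x_train), n_features=x_train.shape[1]))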
