FAQ: Multiple Linear Regression - Rebuild the Model

This community-built FAQ covers the “Rebuild the Model” exercise from the lesson “Multiple Linear Regression”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Machine Learning

FAQs on the exercise Rebuild the Model

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Working on the Manhattan dataset - I can’t seem to get an R^2 value better than 0.8, no matter which variables I eliminate. Does anyone have a better solution for this?

1 Like

No. I’m having the same issue. Either something isn’t right with this one, or I really, really don’t understand what they’re trying to teach me.

2 Likes

I have the same issue: when I delete any columns from df, the scores stay the same.

Which command is used for removing features?

You can drop features by doing:

x = x.drop(['size_sqft', 'building_age_yrs'], axis=1)

(Note that drop returns a new DataFrame, so reassign the result or pass inplace=True.)

But I also cannot get lower than 0.805

I find the whole concept of removing explanatory variables to increase overall accuracy baffling. If there is even the slightest correlation between a variable and an outcome, shouldn’t the inclusion of this variable by definition improve the accuracy of the overall fit?

Is this maybe the real lesson they are trying to teach here?

3 Likes

I checked the step where we ran the regression on the different columns… one thing that I was unsure about was the “binary” choices, i.e. has_dishwasher, yes or no… these were not as clear.

Now there were two that seemed to have a negative correlation to me, “min_to_subway” and “building_age_yrs”.

I just ran them all, using manhattan, and found that only min_to_subway (R^2 = 0.8085711291628321) really made a difference.

OK, seeing the question above… lower was the goal? I got the impression that higher meant a better correlation. No worries, continuing!

@tom-code: Not necessarily.
There may be independent variables/predictors that don’t help the model predict rent but instead introduce unnecessary, confusing information (i.e., noise); taking one of these variables out can improve the model’s predictions. More independent variables is not necessarily better. Also, if some of the predictors are correlated with each other, taking one of them out can help the model.
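
You can see this with a quick experiment (a sketch, assuming manhattan.csv is available as in the exercise; the column names come from that dataset). Adding a column of pure random noise can only raise the training R^2, since ordinary least squares never fits the training data worse with an extra column, but it often leaves the test R^2 flat or slightly lower:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('manhattan.csv')
y = df['rent']
x = df[['bedrooms', 'bathrooms', 'size_sqft']]

# Add a feature that has no real relationship to rent
rng = np.random.default_rng(0)
x_noisy = x.assign(noise=rng.normal(size=len(x)))

for features, name in [(x, 'without noise'), (x_noisy, 'with noise')]:
    # Same random_state so both runs use the same train/test rows
    x_train, x_test, y_train, y_test = train_test_split(
        features, y, train_size=0.8, random_state=0)
    mlr = LinearRegression().fit(x_train, y_train)
    print(name, 'train:', mlr.score(x_train, y_train), 'test:', mlr.score(x_test, y_test))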

1 Like

Yes, I’m a bit confused too. For example, if you set has_gym and has_patio to 0, the predicted prices increase… so something is not working well with the model, in my opinion.

And by the way, I was trying to test with other real condos from StreetEasy, but most of the information (square footage, etc.) was not included, so I couldn’t test it.

Hi,
Does anyone know how to link the model.coef_ results to the variables?
Basically, from what I understand, model.coef_ indicates how important each variable is.
As far as I understand, the closer a coef_ value is to 0, the less important that variable is for the model’s accuracy.
So I tried to create a table:
Bathrooms: -xxx.xxx
Bedrooms: xx.xxx

But it doesn’t seem to make much sense… :thinking:
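
For reference, one way to attach each coefficient to its column name is to build a pandas Series indexed by the feature names (a minimal sketch, assuming the df/x/y setup from the exercise):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('manhattan.csv')
x = df[['bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor']]
y = df['rent']

mlr = LinearRegression().fit(x, y)

# Pair each column name with its coefficient, largest magnitude first
coef_table = pd.Series(mlr.coef_, index=x.columns).sort_values(key=abs, ascending=False)
print(coef_table)

One caveat: the raw coefficients depend on each feature’s scale (see the reply about size_sqft further down), so “close to 0” alone doesn’t prove a variable is unimportant.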

Hello.
The exercise mentions a Slack channel and says I can paste my best model there.
But I can’t find anything about Codecademy’s Slack channels.
Where can I post my model, as described in the lesson?

I tried reducing the parameters to the five largest coefs (I went ahead and used the abs of the coefs since I don’t know the relationship between +/- coefs). I then made a new dataframe of x values with just those five columns. Ran linear regression and the R^2 dropped to 0.61. My guess was wrong.

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 16 20:20:50 2020

@author: 12253
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
import numpy as np

df = pd.read_csv('manhattan.csv')

#list all parameters to compare to y
x = df[['bedrooms',
        'bathrooms',
        'size_sqft',
        'min_to_subway',
        'floor',
        'building_age_yrs',
        'no_fee',
        'has_roofdeck',
        'has_washer_dryer',
        'has_doorman',
        'has_elevator',
        'has_dishwasher',
        'has_patio',
        'has_gym']]

#main comparison
y = df['rent']

#splitting data 80% to build model and 20% to test model
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)

#Build regression model and predict y values with the x_test values
mlr = LinearRegression()
mlr.fit(x_train, y_train)
y_predict = mlr.predict(x_test)

#Scatter plot of actual rents vs predicted rents
plt.scatter(y_test, y_predict)
plt.xlabel('Actual Price ($)')
plt.ylabel('Predicted Price ($)')
plt.title('Predicted Prices vs Actual Prices')
plt.show()

labels = ['bedrooms',
          'bathrooms',
          'size_sqft',
          'min_to_subway',
          'floor',
          'building_age_yrs',
          'no_fee',
          'has_roofdeck',
          'has_washer_dryer',
          'has_doorman',
          'has_elevator',
          'has_dishwasher',
          'has_patio',
          'has_gym']

#Plot each feature against rent (use a new loop name so we don't overwrite the x DataFrame)
for label in labels:
    plt.scatter(df[label], df['rent'], alpha=0.4)
    plt.title(label)
    plt.show()
    
print(mlr.score(x_train, y_train))
print(mlr.score(x_test, y_test))
print(mlr.coef_)

coefs = mlr.coef_
coefs_abs = [abs(c) for c in coefs]


#I tried only using the 5 largest abs(coef) values to see if that would fit better. R^2 dropped, so my idea of only using high-coef variables was incorrect
#made a list of [coef, label] pairs, sorted by coef, and separated out the top 5 parameters
combined = [[coefs_abs[i], labels[i]] for i in range(len(labels))]
sorted_combined = sorted(combined)
top_parameters = sorted_combined[-5:]
top_parameters_labels = [label for _, label in top_parameters]

#Create df of top five parameters and fit using linear regression 
X = df[top_parameters_labels]
x_train2, x_test2, y_train2, y_test2 = train_test_split(X, y, train_size=0.8, test_size=0.2)
mlr2 = LinearRegression()
mlr2.fit(x_train2, y_train2)
y_predict2 = mlr2.predict(x_test2)
print(mlr2.score(x_train2, y_train2))
print(mlr2.score(x_test2, y_test2))
print(mlr2.coef_)

1 Like

The highest score (R^2) I could achieve for this lesson was 0.8089311686508096 using:

x = df[[
  'bedrooms', 
  'bathrooms', 
  'size_sqft', 
  'floor', 
  'building_age_yrs', 
  'no_fee', 
  'has_doorman', 
  'has_elevator'
  ]]

I did look at the coefficients but found it easier to simply delete a column and re-run the code, comparing the old and new score.
I have no idea whether or not this is the best case for the test data, but it’s the best I could achieve with my lazy method!
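
In case it helps anyone, that delete-a-column-and-re-run loop can be automated. A minimal sketch (assuming manhattan.csv and the feature list from the earlier posts; the fixed random_state just keeps the train/test split identical across runs so the scores are comparable):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('manhattan.csv')
features = ['bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor',
            'building_age_yrs', 'no_fee', 'has_roofdeck', 'has_washer_dryer',
            'has_doorman', 'has_elevator', 'has_dishwasher', 'has_patio', 'has_gym']
y = df['rent']

def test_score(cols):
    x_train, x_test, y_train, y_test = train_test_split(
        df[cols], y, train_size=0.8, random_state=6)
    return LinearRegression().fit(x_train, y_train).score(x_test, y_test)

print('all features:', test_score(features))

# Drop each column in turn and see how the test R^2 changes
for col in features:
    remaining = [c for c in features if c != col]
    print(f'without {col}: {test_score(remaining):.4f}')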

2 Likes

Hello, after the final task, when it tells you to play with the coefficients to see which ones are more important and so on…
How can I visualize the p-value for each coefficient, so I can see which coefficients are significant and which ones are not? I tried doing

from scipy import stats

print(model.summary())

but it didn’t work. After finishing the exercise, what else can I do to visualize which coefficients are the most important using their p-values?

1 Like
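
In case anyone else hits this: summary() is not a scikit-learn method, it comes from statsmodels, so it will fail on a LinearRegression object no matter what you import from scipy. One way to see per-coefficient p-values is to refit the same data with statsmodels (a sketch, assuming statsmodels is installed and using a few columns from the exercise dataset):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('manhattan.csv')
x = df[['bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor']]
y = df['rent']

# statsmodels does not add an intercept automatically
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())   # full table: coefficients, std errors, t stats, p-values
print(model.pvalues)     # just the p-values, indexed by feature name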

A simpler way would be to directly remove the features you don’t want from X…

No, I think your R2 value is on point. I’ve been experimenting with adding/subtracting variables one by one. Square footage is by far the most significant factor. Number of bedrooms, number of bathrooms, floor, and building age increase the accuracy of both the train and test values, but all together only by a few points - like from 77% to 80% or so.

Some of the binary values (roof deck, no fee) as well as min to subway improve the train R2 but negatively affect the test R2. This is marginal though (we’re talking tenths of percentage points here).

The model seems to hold fairly well up to $7500 rent or slightly above, and gets a bit unglued at $10,000 and slightly before. A lot of these factors seem to hold much more weight at the lower price points, but become less significant as the rates get really high. Also some factors seem to set more of a range or cap than follow a straight line per se (for example, 0 and 1 bedrooms seemed to cap off at a lower price point than 2 or 3, but still a wide range of price, so you won’t get a great line from it).

Also, I would think neighborhood would hold some weight, which is not accounted for. So I think the 80-ish% you are getting is about as good as can be expected.
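
If you want to test that hunch, and assuming your CSV includes a neighborhood column (the full StreetEasy data does; the exercise file may differ), you can one-hot encode it and add the indicators to the model:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('manhattan.csv')
numeric = ['bedrooms', 'bathrooms', 'size_sqft', 'floor', 'building_age_yrs']

# One-hot encode the neighborhood column into 0/1 indicator columns
x = pd.get_dummies(df[numeric + ['neighborhood']], columns=['neighborhood'])
y = df['rent']

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)
mlr = LinearRegression().fit(x_train, y_train)
print(mlr.score(x_test, y_test))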

2 Likes

It’s mainly because the values of some variables are large compared to the others. In this data, the coefficient of ‘size_sqft’ is small, but if you look at the feature’s values, they are around 600 or above, so it has a greater impact on the result. It is important to check a variable’s range of values relative to the other variables: if the values are large, the variable will have more impact even if its coefficient is smaller.
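
One way to put the coefficients on a comparable footing is to standardize the features first, so every column has mean 0 and standard deviation 1 (a sketch using scikit-learn’s StandardScaler; columns are from the exercise dataset):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('manhattan.csv')
x = df[['bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor']]
y = df['rent']

# Rescale every feature to mean 0, standard deviation 1
x_scaled = StandardScaler().fit_transform(x)

mlr = LinearRegression().fit(x_scaled, y)
# Coefficients are now in "rent dollars per standard deviation" and comparable
print(pd.Series(mlr.coef_, index=x.columns).sort_values(key=abs, ascending=False))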

2 Likes

A good way to check the degree of correlation all at once is using a correlation matrix:

import seaborn as sns
from matplotlib import pyplot as plt

corrMatrix = df.corr(numeric_only=True)  # skip non-numeric columns

sns.heatmap(corrMatrix, annot=False)

plt.show()

And before you ask: no, I also couldn’t get above 0.802 or so for R^2.
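
To turn the heatmap into a concrete list of removal candidates, you can also pull out the most strongly correlated feature pairs (a sketch; the 0.7 cutoff is arbitrary):

import pandas as pd

df = pd.read_csv('manhattan.csv')
corr = df.corr(numeric_only=True).abs()

# Flatten the matrix into (feature, feature) -> correlation pairs
pairs = corr.unstack().sort_values(ascending=False)
# Drop self-correlations (exactly 1.0) and keep only strong ones;
# note each pair appears twice, once in each order
pairs = pairs[(pairs < 1.0) & (pairs > 0.7)]
print(pairs)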

1 Like