FAQs on the exercise Multiple Linear Regression Equation
I’m not understanding how the following line in the lesson is accurate:
“Coefficients are most helpful in determining which independent variable carries more weight. For example, a coefficient of -1.345 will impact the rent more than a coefficient of 0.238, with the former impacting prices negatively and latter positively.”
I’ve played around with removing individual variables from the equation and measuring the change in overall accuracy, and found little correlation between the coefficient values and each variable’s overall effect on the model’s prediction score. (Using LinearRegression from sklearn.)
In fact, some of the calculations seem to contradict this completely. Take ‘bathrooms’ (coef = 1199.3859951) and ‘size_sqft’ (coef = 4.79976742), for example. Removing ‘bathrooms’ changed the score by 0.016441092844961758 (train) and 0.012947852168074925 (test), while removing ‘size_sqft’ changed it by 0.14465327659439065 (train) and 0.1612098585758206 (test). This is despite ‘bathrooms’ having a much larger coefficient value.
Never mind, I figured out my misunderstanding later. If the coefficients are the various values of m (the slopes), then their relative sizes are of course affected by the scale of each feature: ‘size_sqft’ ranges over hundreds of units while ‘bathrooms’ only ranges over a few. Either way, I think the wording in that lesson should be changed, as “a coefficient of -1.345 will impact the rent more than a coefficient of 0.238” is only true when talking about the same feature.
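To make the scale point concrete, here is a minimal sketch on synthetic data (the numbers are invented for illustration and are not the lesson's dataset). It uses np.linalg.lstsq, which fits the same ordinary least squares model as sklearn's LinearRegression, and the helper name fit_ols is my own. The idea is that putting every feature on the same scale (mean 0, standard deviation 1) before fitting gives coefficients that can be compared with each other more fairly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data, made up for illustration (not the lesson's dataset).
size_sqft = rng.normal(900, 300, n)               # varies by hundreds of sq ft
bathrooms = rng.integers(1, 4, n).astype(float)   # takes the values 1, 2 or 3
rent = 5.0 * size_sqft + 1000.0 * bathrooms + rng.normal(0, 200, n)

def fit_ols(X, y):
    """Return (intercept, coefficients) of an ordinary least squares fit."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

X = np.column_stack([size_sqft, bathrooms])
_, raw_coefs = fit_ols(X, rent)

# Standardize each feature to mean 0 and standard deviation 1, then refit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, std_coefs = fit_ols(X_std, rent)

print("raw coefficients (size_sqft, bathrooms):         ", raw_coefs)
print("standardized coefficients (size_sqft, bathrooms):", std_coefs)
```

In this made-up data the raw ‘bathrooms’ coefficient is far larger than the ‘size_sqft’ one, but after standardizing the order flips, because each standardized coefficient is just the raw coefficient multiplied by that feature's standard deviation.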
Thank you for this, I have been struggling today to find some meaning in that statement. So from what you are saying, this only applies when comparing the same variable. For example, if one size_sqft coefficient is 4.79976742 and another size_sqft coefficient is 7.00000001, the second would influence the rent more than the first. So if the first size_sqft coefficient came from Brooklyn and the second from Manhattan, you could say that square footage affects the rent price in Manhattan more than in Brooklyn.
Am I way off here?
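That reading can be sketched in code. The example below uses invented data with an assumed rent per square foot for each borough (3.0 and 5.0 are hypothetical values, as is the helper name sqft_slope), fits a separate one-feature model for each, and compares the two size_sqft slopes. Because both slopes are in the same units (dollars of rent per square foot), comparing them directly is meaningful.

```python
import numpy as np

rng = np.random.default_rng(2)

def sqft_slope(price_per_sqft, n=400):
    """Simulate one borough and return the fitted size_sqft coefficient.

    price_per_sqft is a made-up 'true' slope used only to generate data.
    """
    size_sqft = rng.uniform(400, 1600, n)
    rent = 1000 + price_per_sqft * size_sqft + rng.normal(0, 150, n)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(size_sqft, rent, deg=1)
    return slope

brooklyn_slope = sqft_slope(price_per_sqft=3.0)
manhattan_slope = sqft_slope(price_per_sqft=5.0)

print("Brooklyn size_sqft coefficient: ", brooklyn_slope)
print("Manhattan size_sqft coefficient:", manhattan_slope)
```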
After that, I tried googling a little more, and apparently correlation between features can be a problem. I found a book, C. Molnar, “Interpretable Machine Learning”. Section 4.1 discusses a problem similar to our question.
In the last paragraph (in 4.1.9), the author says:
The interpretation of a weight can be unintuitive because it depends on all other features. A feature with high positive correlation with the outcome y and another feature might get a negative weight in the linear model, because, given the other correlated feature, it is negatively correlated with y in the high-dimensional space. Completely correlated features make it even impossible to find a unique solution for the linear equation. An example: You have a model to predict the value of a house and have features like number of rooms and size of the house. House size and number of rooms are highly correlated: the bigger a house is, the more rooms it has. If you take both features into a linear model, it might happen, that the size of the house is the better predictor and gets a large positive weight. The number of rooms might end up getting a negative weight, because, given that a house has the same size, increasing the number of rooms could make it less valuable or the linear equation becomes less stable, when the correlation is too strong.
bedrooms is highly correlated with bathrooms and size_sqft. This high correlation might be the reason why the coefficient is negative.
This makes a lot of sense. So, all else being equal, adding bedrooms can be slightly negative, as it means less space per room. The trick is that all else is rarely equal in the real world.
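Here is a small sketch of that effect on synthetic data. Everything in it is an assumption made for illustration: the data is invented, the "true" relationship is chosen so that an extra bedroom at a fixed size lowers rent, and ols_coefs is my own helper around np.linalg.lstsq. It shows that ‘bedrooms’ can be positively correlated with rent on its own and still get a negative coefficient once ‘size_sqft’ is in the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic data (made up): bigger apartments tend to have more bedrooms.
size_sqft = rng.normal(1000, 300, n)
bedrooms = np.clip(np.round(size_sqft / 400 + rng.normal(0, 0.5, n)), 1, None)

# Assumed "true" relationship: rent rises with size, and at a fixed size
# each extra bedroom lowers rent a little.
rent = 6.0 * size_sqft - 200.0 * bedrooms + rng.normal(0, 200, n)

def ols_coefs(X, y):
    """Return the ordinary least squares coefficients (intercept dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

# On its own, bedrooms looks like a positive predictor of rent...
corr_bed_rent = np.corrcoef(bedrooms, rent)[0, 1]
bed_only = ols_coefs(bedrooms, rent)[0]

# ...but with size_sqft held constant, its coefficient turns negative.
size_coef, bed_coef = ols_coefs(np.column_stack([size_sqft, bedrooms]), rent)

print("correlation(bedrooms, rent):      ", corr_bed_rent)
print("bedrooms coef, single-feature fit:", bed_only)
print("bedrooms coef, with size_sqft:    ", bed_coef)
```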
This is where you have to take off the “mathematician” hat and understand the core business - in this case, it’s real estate.
Picture two Manhattan apartments:
Apartment A:
3 bedrooms
2 bathrooms
1200 sq ft
Apartment B:
1 bedroom
1 bathroom
1200 sq ft
Think carefully about what these two apartments look like on the inside. Apartment A, the 3bd 2ba, would probably look close to a typical apartment. It will probably be furnished like a typical apartment. Statistically, it won’t spend much time unoccupied between tenants because it’s something that several working adults could split rent on and therefore afford.
Apartment B has fewer bedrooms and fewer bathrooms, and it rents for more! Why? 1200 sq ft is HUGE for a 1 bd 1 ba apartment anywhere in NYC. For one person or a couple to be able to rent all that space by themselves, they are likely to be well-off financially and can afford nicer things. The amount of space is a luxury and the bedroom is going to be huge as well. The furnishings in that massive apartment are likely to be high-end, like a luxury kitchen, nice bathroom, all that good stuff. There are fewer people that can afford something like this than the more “normal” Apartment A, so Apartment B is probably going to spend more time unoccupied (and not generating rent) between tenants. This means that for it to be worth it for a landlord, it has to cost more during the time it is occupied in order to make up for that deficit.
In the example, the ‘bedrooms’ variable has a negative coefficient in the multiple linear regression model, even though we generally expect that more bedrooms would increase the rent price. This situation is often due to a phenomenon known as “multicollinearity” or the relationships between independent variables:
Multicollinearity: In your regression model, ‘bedrooms’ might be highly correlated with other independent variables like ‘size_sqft’. This can lead to unexpected signs for coefficients. For example, larger apartments (more square footage) usually have more bedrooms. If ‘size_sqft’ is a stronger predictor of rent, it might absorb most of the positive effect associated with more bedrooms, resulting in a negative coefficient for ‘bedrooms’.
Interpretation of Coefficients: In multiple regression, each coefficient represents the relationship between that particular independent variable and the dependent variable, assuming all other variables are held constant. The negative coefficient for ‘bedrooms’ suggests that, all else equal (including apartment size and building age), adding a bedroom is associated with a decrease in rent. This could reflect market preferences for larger, more open spaces, or other factors not captured in the model.
Relevance of the Coefficient: Understanding this coefficient is relevant as it highlights the complex interplay between different features of an apartment. It can indicate that simply adding bedrooms, without increasing the overall size or improving other aspects of the apartment, might not increase the value as expected.
Further Investigation: Such results often warrant further investigation. It could be useful to look at interaction terms (e.g., between ‘bedrooms’ and ‘size_sqft’) or to segment the data (e.g., by apartment type or location) to better understand the underlying dynamics.
Model Evaluation and Refinement: Unexpected signs for coefficients may signal the need to reevaluate the model. This could involve checking for omitted variable bias (important variables that are missing), considering non-linear relationships, or using different types of regression models.
In summary, while the negative coefficient for ‘bedrooms’ might seem counterintuitive, it provides valuable insights into the relative influence of different variables and the potential complexity of the relationships in your data. It’s a prompt for deeper analysis and careful consideration of the model’s explanatory variables and their interrelationships.
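One common way to check for the multicollinearity described above is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R²). The sketch below does this with plain NumPy on invented data. The feature building_age and the helper names r_squared and vif are my own assumptions for illustration, not part of the lesson. A VIF near 1 means a feature is nearly independent of the others, while larger values mean it overlaps with them.

```python
import numpy as np

def r_squared(X, y):
    """R² of an ordinary least squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

def vif(X):
    """Variance inflation factor for each column of X."""
    return np.array([
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ])

rng = np.random.default_rng(3)
n = 1000

# Synthetic features (made up): bedrooms tracks size, building age does not.
size_sqft = rng.normal(1000, 300, n)
bedrooms = np.clip(np.round(size_sqft / 400 + rng.normal(0, 0.5, n)), 1, None)
building_age = rng.uniform(0, 100, n)

X = np.column_stack([size_sqft, bedrooms, building_age])
vifs = vif(X)

for name, value in zip(["size_sqft", "bedrooms", "building_age"], vifs):
    print(name, value)
```

On the real dataset, the same function could be applied to the feature columns used in the lesson to see which ones overlap most.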