Hi guys,
In this project, I used a dataset downloaded from Kaggle. I aimed to build several models using RandomForestRegressor, LinearRegression, SVR, and Ridge to predict house prices. I also use some model selection tools, including GridSearchCV, to streamline the process.
Here is the Kaggle link to my project: https://www.kaggle.com/breyenguyen/house-prices-model-selection-with-gridsearchcv
All feedback is welcome. Much appreciated.
Hi Breyen, a little late, but let me take a stab at some advice on your project!
First of all, great stuff; this is a very thorough assignment. Only two minor conceptual things come to mind:
- I would challenge the heuristic that a column should be deleted if more than 15% of its data is missing. Unless it can be shown conclusively that the data is missing completely at random (realistically very rare), you could be throwing away some very potent information. For instance, if a survey question about religiosity is left blank, that could reflect the respondent’s sensitivity to the question, which could then be treated as a feature in and of itself. I’m not sure how your dataset was collected, but it’s worth keeping in mind.
- For right-skewed data, square root and cube root transformations also work, and might be even better depending on how skewed the data is. I find that the choice of transformation can actually make quite a big difference, so it’d be worth checking out next time.
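To make the transformation point concrete, here is a rough sketch of how one might compare a few candidates by their resulting skewness. It uses synthetic lognormal data as a stand-in for a real column (an assumption on my part, not your dataset), and the names `values`, `candidates`, and `skews` are just illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a real feature (assumption).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Candidate transformations for non-negative, right-skewed data.
candidates = {
    "raw": values,
    "sqrt": np.sqrt(values),
    "cbrt": np.cbrt(values),
    "log1p": np.log1p(values),
}

# Sample skewness of each version; values near 0 are closer to symmetric.
skews = {name: float(series.skew()) for name, series in candidates.items()}
best = min(skews, key=lambda name: abs(skews[name]))

for name, skew in skews.items():
    print(f"{name:>5}: skew = {skew:.3f}")
print("closest to symmetric:", best)
```

Which transformation wins depends entirely on the column, so it is probably worth running something like this per feature rather than picking one transform for everything.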
Other than those minor points, nice job!
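P.S. On the missing-data point, one possible way to keep the signal instead of dropping the column is to record a "was missing" flag before imputing. This is only a minimal sketch on a tiny made-up frame; the column name `lot_frontage` and the choice of median imputation are my assumptions, not something taken from your notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; "lot_frontage" is a hypothetical column name.
df = pd.DataFrame({
    "lot_frontage": [65.0, np.nan, 80.0, np.nan, 70.0],
    "price": [200, 150, 250, 140, 210],
})

# Flag the missing entries first, so the model can still see the
# "was left blank" signal after the gaps are filled.
df["lot_frontage_missing"] = df["lot_frontage"].isna().astype(int)

# Fill the gaps with the median of the observed values
# (one simple choice among many).
median = df["lot_frontage"].median()
df["lot_frontage"] = df["lot_frontage"].fillna(median)

print(df)
```

If you would rather keep this inside a scikit-learn pipeline, `SimpleImputer` has an `add_indicator` option that appends a similar flag column.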