While running a linear regression on this OkCupid project, my train score and test score are wildly different. The train score looks plausible, but the test score is negative with an enormous magnitude (around -4e17 when predicting height), not a small value near 0. The code and results are below:
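import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('profiles.csv')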
df_subset = df[['age', 'height', 'location', 'orientation', 'sex', 'speaks', 'status']]
df_dummies = pd.get_dummies(df_subset, columns = ['location', 'orientation', 'sex', 'speaks', 'status'])
outlier_stripped = df_dummies[(df_dummies['height'] >= 57) & (df_dummies['height'] <= 80) & (df_dummies['age'] <= 69)]
for_training = outlier_stripped.dropna()
target = for_training['height']
features = for_training.drop(columns = ['height'])
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)
model = LinearRegression()
model.fit(X_train, y_train)
print('Train score: ' + str(model.score(X_train, y_train)))
print('Test score: ' + str(model.score(X_test, y_test)))
y_predicted = model.predict(X_test)
sns.regplot(x = y_test, y = y_predicted)
plt.xlabel('actual')
plt.ylabel('predicted')
plt.title('actual vs predicted')
Result when predicting height (inches):
Train score: 0.5556998695680855
Test score: -4.198283571774815e+17
Result for trying to predict age:
Train score: 0.1848695561525383
Test score: -7.801184386540202e+19
I don’t know how to rule out:
- If the test is set up incorrectly
- If the data is formatted incorrectly
- If the above two are correct and only the feature selection is incorrect
How can I isolate these possibilities?
I know I can try another ML model to rule out:
- If it is because Linear Regression is not the appropriate model here
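
One way to start separating these possibilities is sketched below. It is only a rough sketch on synthetic data, not the real profiles.csv: the toy columns, the diagnose helper, and the Ridge alpha value are all assumptions made for illustration. The idea is to compare plain LinearRegression against a mean-only baseline and a regularized Ridge model on the same split, list the dummy columns that are constant in the training rows but vary in the test rows (which rare 'location' or 'speaks' values could produce), and look at the largest coefficient magnitude.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the profiles data (hypothetical, NOT the real CSV).
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "sex": rng.choice(["m", "f"], size=n),
    # Many categories that occur only once or twice, like 'location' or 'speaks'.
    "location": rng.choice([f"city_{i}" for i in range(1500)], size=n),
})
toy["height"] = 64 + 5 * (toy["sex"] == "m") + rng.normal(0, 2.5, size=n)

X = pd.get_dummies(toy.drop(columns=["height"]),
                   columns=["sex", "location"], dtype=float)
y = toy["height"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


def diagnose(X_train, X_test, y_train, y_test):
    """Return (train, test) R^2 per model, suspicious columns, largest |coef|."""
    scores = {}
    models = [
        ("baseline_mean", DummyRegressor(strategy="mean")),
        ("linear", LinearRegression()),
        ("ridge", Ridge(alpha=1.0)),  # alpha chosen arbitrarily for the sketch
    ]
    for name, est in models:
        est.fit(X_train, y_train)
        scores[name] = (est.score(X_train, y_train),
                        est.score(X_test, y_test))
    # Columns that never vary in training but do vary in the test rows:
    # the model cannot learn anything reliable about them.
    unseen = [c for c in X_train.columns
              if X_train[c].nunique() == 1 and X_test[c].nunique() > 1]
    linear = LinearRegression().fit(X_train, y_train)
    max_coef = float(np.abs(linear.coef_).max())
    return scores, unseen, max_coef


scores, unseen, max_coef = diagnose(X_train, X_test, y_train, y_test)
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train={train_r2:.3f} test={test_r2:.3f}")
print("columns constant in train but varying in test:", len(unseen))
print("largest |coefficient| in LinearRegression:", max_coef)
```

Reading the output: if the Ridge test score is sensible while the LinearRegression test score is hugely negative, that points toward unstable coefficients on the dummy columns rather than a broken split or a wrong model family. If even the mean-only baseline and Ridge look wrong, the split or the data formatting deserves a closer look first.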