# Linear Regression: Train score and test score differ wildly. Asking for next steps

While running linear regression on the OkCupid project, my train score and test score are very different. The test score is negative and enormous in magnitude (on the order of -1e17 to -1e19), not a small value close to 0. The code and results are below:

``````
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('profiles.csv')
df_subset = df[['age', 'height', 'location', 'orientation', 'sex', 'speaks', 'status']]
# One-hot encode every categorical column (one new 0/1 column per distinct value)
df_dummies = pd.get_dummies(df_subset, columns = ['location', 'orientation', 'sex', 'speaks', 'status'])
# Keep heights between 57 and 80 inches and ages up to 69
outlier_stripped = df_dummies[(df_dummies['height'] >= 57) & (df_dummies['height'] <= 80) & (df_dummies['age'] <= 69)]

for_training = outlier_stripped.dropna()
target = for_training['height']
features = for_training.drop(columns = ['height'])

# Random 80/20 split (no random_state set, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)
model = LinearRegression()
model.fit(X_train, y_train)

# model.score returns R^2
print('Train score: ' + str(model.score(X_train, y_train)))
print('Test score: ' + str(model.score(X_test, y_test)))
y_predicted = model.predict(X_test)
sns.regplot(x = y_test, y = y_predicted)
plt.xlabel('actual')
plt.ylabel('predicted')
plt.title('actual vs predicted')
plt.show()
``````

Result when predicting height (inches):

``````
Train score: 0.5556998695680855
Test score: -4.198283571774815e+17
``````

Result when predicting age:

``````
Train score: 0.1848695561525383
Test score: -7.801184386540202e+19
``````

I don't know how to rule out:

• If the test is set up incorrectly
• If the data is formatted incorrectly
• If the above two are correct and only the feature selection is incorrect

How can I isolate these possibilities?
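One quick check on the data-formatting possibility might be to count how many one-hot columns `get_dummies` produced and how many of them are active in only a handful of rows, since the training rows barely constrain the coefficients of such columns. The sketch below uses a tiny made-up frame in place of `profiles.csv`; the helper `rare_dummy_columns` and its `min_count` threshold are hypothetical names for illustration, not part of pandas.

```python
import pandas as pd


def rare_dummy_columns(dummies, prefixes, min_count=5):
    """Return one-hot columns (matched by prefix) that are 1 in fewer than min_count rows."""
    rare = []
    for col in dummies.columns:
        is_dummy = any(col.startswith(prefix + "_") for prefix in prefixes)
        if is_dummy and dummies[col].sum() < min_count:
            rare.append(col)
    return rare


# Hypothetical toy data standing in for the real profiles.csv
toy = pd.DataFrame({
    "age": [25, 31, 40, 22, 35, 29],
    "location": ["sf", "sf", "sf", "oakland", "sf", "berkeley"],
})
toy_dummies = pd.get_dummies(toy, columns=["location"])

rare = rare_dummy_columns(toy_dummies, prefixes=["location"], min_count=2)
print("total columns:", toy_dummies.shape[1])
print("rare dummy columns:", rare)
```

On the real data, the same call with `prefixes=['location', 'speaks']` would show whether most of the feature matrix consists of columns that only a few profiles ever set.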

I know I can try another ML model to rule out:

• If it is because Linear Regression is not the appropriate model here
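A minimal sketch of that comparison, assuming synthetic data rather than the real profiles: it fits a mean-only baseline, plain `LinearRegression`, and `Ridge` on the same split and prints train and test R^2 for each. The `x` and `city` columns, the 150 category levels, and `alpha=1.0` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one informative numeric feature plus a
# high-cardinality categorical that carries no signal at all.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
city = rng.integers(0, 150, size=n).astype(str)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

frame = pd.DataFrame({"x": x, "city": city})
X = pd.get_dummies(frame, columns=["city"]).astype(float)

# Fixed random_state so every model sees the same split on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "baseline_mean": DummyRegressor(strategy="mean"),  # always predicts the train mean
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # L2 penalty shrinks poorly determined coefficients
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

The mean baseline can score at most 0 on held-out data, so it gives a floor for what "no better than guessing" looks like. If the ridge and baseline test scores look sane while only the plain linear model's test score explodes, that would point at the feature encoding rather than at the split or the choice of a linear model.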