```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 'stars' is the value I am trying to predict; everything else is a feature
features = ['alcohol?', 'good_for_kids', 'has_bike_parking', 'has_wifi',
            'price_range', 'review_count', 'take_reservations',
            'takes_credit_cards', 'average_review_age_x',
            'average_review_length_x', 'average_review_sentiment_x',
            'number_funny_votes_x', 'number_cool_votes_x',
            'number_useful_votes_x', 'average_review_age_y',
            'average_review_length_y', 'average_review_sentiment_y',
            'number_funny_votes_y', 'number_cool_votes_y',
            'number_useful_votes_y', 'average_number_friends',
            'average_days_on_yelp', 'average_number_fans',
            'average_review_count', 'average_number_years_elite',
            'weekday_checkins', 'weekend_checkins', 'average_tip_length',
            'number_tips', 'average_caption_length', 'number_pics']

x = df[features]
y = df['stars']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

ols = LinearRegression()
model = ols.fit(x_train, y_train)
print(model.coef_)

df.corr()
```
I have created a dataframe called df from Yelp data.
My question is: what is the difference between printing model.coef_ and printing df.corr()? I don't really understand what each of these is telling me.
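To make the comparison concrete, here is a tiny self-contained version of the two calls, on toy data rather than my actual Yelp dataframe (the column names here are just stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataframe standing in for the Yelp data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'review_count': rng.integers(1, 500, 100).astype(float),
    'price_range': rng.integers(1, 5, 100).astype(float),
})
toy['stars'] = (3 + 0.002 * toy['review_count']
                  + 0.1 * toy['price_range']
                  + rng.normal(0, 0.2, 100))

features = toy[['review_count', 'price_range']]
target = toy['stars']

model = LinearRegression().fit(features, target)
print(model.coef_)  # one slope per feature, in that feature's own units
print(toy.corr())   # pairwise correlations, always between -1 and 1
```

So coef_ gives one fitted slope per feature (how much the prediction changes per unit of that feature, holding the others fixed), while corr() gives unit-free pairwise correlations between every pair of columns.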
Also, if I look at model.coef_ or df.corr() and pick features with high values, how come the R-squared is sometimes still not high?
The R-squared is sometimes higher for independent variables with smaller coefficients than for independent variables with larger coefficients.
Isn't that counterintuitive? Someone told me it could be because the data is not normalized, but I don't think the linear regression lesson ever covered normalizing data, so I don't know when to do it or even how to do it.
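From what I've read, the point about normalization seems to be that a coefficient's size depends on the feature's units, while R-squared does not. Here is a minimal sketch I put together to check that (toy data, and I'm assuming StandardScaler is the usual way to standardize):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)

# Same feature expressed in two different units (think metres vs millimetres)
model_raw = LinearRegression().fit(X, y)
model_rescaled = LinearRegression().fit(X * 1000, y)

print(model_raw.coef_)       # slope in the original units
print(model_rescaled.coef_)  # 1000x smaller slope, yet the same fit
print(model_raw.score(X, y), model_rescaled.score(X * 1000, y))  # identical R^2

# Standardizing puts every feature on the same scale (mean 0, std 1),
# which makes coefficient magnitudes comparable across features
X_std = StandardScaler().fit_transform(X)
model_std = LinearRegression().fit(X_std, y)
print(model_std.coef_)
```

If that's right, a big coefficient can just mean a feature measured in small units, which would explain why coefficient size and R-squared don't have to agree.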