Machine Learning Capstone Project - OKCupid Date A Scientist

Hey guys!

I hope you’re doing well. I would be very grateful for your feedback.
I have uploaded both my code and my presentation to GitHub:

My code

My presentation

Hello. I have a question if you don’t mind. In your code, you scale the data (as below) using MinMax scaler after you split the data using train_test_split. Is there a reason you split the data before you scale it?

I am inclined to scale the data prior to splitting it, but you have me second-guessing that decision. Does it matter whether you scale before or after you split it? If so, why does it matter?

I have a related second question. I am assuming numeric data (e.g., age, height) does not need to be normalized, so I am inserting it into the scaled DataFrame after scaling the categorical data. I’m not sure that’s the proper way to handle numeric features alongside categorical ones, and I could not find much about it online.

Does numeric data need to be scaled with the categorical data?

I assume the answer is no, but I’m not so sure.

Thank you for your consideration.

Robert Pfaff

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# y is the target column, X has the features
X = df_copy.iloc[:, 1:]
y = df_copy['sign_cleaned']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Flatten the target Series into 1-D ndarrays
y_train = y_train.to_numpy().ravel()
y_test = y_test.to_numpy().ravel()

# Use MinMaxScaler to put all feature values in the same range
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training-set min/max on the test set instead of refitting
X_test_scaled = scaler.transform(X_test)

Hello Robert

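In practice it rarely makes a big difference whether you do the feature scaling before or after the split, but splitting first and fitting the scaler on the training set only is the safer habit, because it keeps information about the test set out of training.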
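One way to check the order question yourself is to compare a scaler fitted on the training rows only with one fitted on every row. This is a minimal sketch on made-up single-feature data (the values 0 to 9 and the variable names are assumptions, not taken from the project), using `shuffle=False` so the result is easy to follow:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data, made up for illustration: the values 0..9
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10) % 2

# shuffle=False keeps the row order, so the two largest values land in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Option A: split first, then fit the scaler on the training rows only
scaler_train = MinMaxScaler().fit(X_train)
X_test_a = scaler_train.transform(X_test)

# Option B: fit the scaler on every row, test rows included
scaler_all = MinMaxScaler().fit(X)
X_test_b = scaler_all.transform(X_test)

print("max seen by scaler A:", scaler_train.data_max_)
print("max seen by scaler B:", scaler_all.data_max_)
print("test rows under A:", X_test_a.ravel())
print("test rows under B:", X_test_b.ravel())
```

Under option A the test rows can fall outside the 0-1 range, because the scaler never saw them. Under option B the scaler's min and max already include the test rows, which is the information leak people usually try to avoid.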

As a rule, you should scale all the features before you apply your machine learning algorithm, especially a distance-based one. For example, suppose you would like to classify movies into different categories, and one of your features is the year of release and another is revenue. The difference between the revenues of two movies could be hundreds of millions of dollars, but the difference between their years of release would be around 125 at the most. If we don’t normalize the data, a movie’s revenue will get a much higher weight in determining its class. We scale because we would like to give equal importance to all features, whether they are numerical or categorical.
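The movie example can be made concrete with a few lines of plain Python. The three movies below and their numbers are invented for illustration, and `min_max` is a hypothetical helper that applies the usual (value - min) / (max - min) formula by hand:

```python
import math

# Three made-up movies as (year of release, revenue in dollars)
movie_a = (1995, 300_000_000)
movie_b = (2020, 310_000_000)
movie_c = (1996, 320_000_000)
movies = [movie_a, movie_b, movie_c]

def min_max(value, lo, hi):
    """Rescale value to the 0-1 range given the feature's min and max."""
    return (value - lo) / (hi - lo)

# Raw Euclidean distances: the revenue gap swamps the year gap
raw_ab = math.dist(movie_a, movie_b)
raw_ac = math.dist(movie_a, movie_c)

# Min-max scale each feature separately, then measure the distances again
years = [m[0] for m in movies]
revenues = [m[1] for m in movies]
scaled = [
    (min_max(y, min(years), max(years)), min_max(r, min(revenues), max(revenues)))
    for y, r in movies
]
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

On the raw numbers, movie A looks closer to B simply because their revenues are closer, even though they were released 25 years apart. After scaling, the year gap counts too, and A ends up closer to C.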

All categorical and numerical features have to be scaled. There are two common ways to encode categorical data: ordinal encoding and one-hot encoding. One-hot encoded features (dummy variables) are always either 0 or 1, so we don’t need to normalize them. Ordinal categorical variables, however, should be normalized.
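Here is a small sketch of the difference between the two encodings. The sample rows, the category labels, and the 0-2 mapping are assumptions made up for illustration, and the min-max step is done by hand rather than with a scaler object:

```python
import pandas as pd

# Hypothetical sample; the category labels are made up for illustration
df = pd.DataFrame({
    "body_type": ["fit", "average", "thin", "average"],
    "drugs": ["never", "sometimes", "often", "never"],
})

# One-hot encoding: every resulting column holds only 0 or 1
one_hot = pd.get_dummies(df["body_type"], dtype=int)

# Ordinal encoding: integer codes 0..2, then min-max scaled by hand to 0..1
drug_codes = df["drugs"].map({"never": 0, "sometimes": 1, "often": 2})
drugs_scaled = (drug_codes - drug_codes.min()) / (drug_codes.max() - drug_codes.min())

print(one_hot)
print(drugs_scaled.tolist())
```

The dummy columns need no further work, while the ordinal codes only reach the 0-1 range after the extra scaling step.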

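In the end, all features will be between 0 and 1 if we use MinMaxScaler, or centered at 0 with unit variance if we use StandardScaler (most values then fall within a few units of 0, though they are not strictly limited to (-1, 1)).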
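To see what range each scaler actually produces, you can run both on the same column. This is a quick sketch with made-up values, including one deliberately large number:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with one large value
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(values)
standard_scaled = StandardScaler().fit_transform(values)

# MinMaxScaler maps the fitted data onto the 0-1 range
print(minmax_scaled.min(), minmax_scaled.max())

# StandardScaler gives mean 0 and standard deviation 1, with no fixed bounds
print(standard_scaled.mean(), standard_scaled.std(), standard_scaled.max())
```

The standardized maximum here comes out well above 1, which shows that StandardScaler does not confine values to a fixed interval.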

So, the answer to your question is yes!

Thank you for such a prompt and clear response. It’s just a case of my unfortunate tendency to overthink things.

But I understand now. Though the order of scaling and splitting does not matter much, we do need to scale numeric data along with categorical data, so variables like age need to be scaled with body type. That way, the breadth of ages (18-69) doesn’t overwhelm and distort the scale for drugs, which has only three options (0-2).
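For anyone following along, scaling age together with drugs could look roughly like this. The four rows are invented, and only the column names and value ranges come from the discussion above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented rows: age spans 18-69, drugs is an ordinal code from 0 to 2
df = pd.DataFrame({
    "age": [18, 25, 40, 69],
    "drugs": [0, 1, 2, 0],
})

# Fit one scaler on both columns; each column is rescaled independently
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```

After this step both columns run from 0 to 1, so a 51-year age gap and a 2-step drugs gap carry the same maximum weight.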

Much appreciated.
