OKCupid Date-A-Scientist Portfolio Project

Dear All,

For this project I used GitHub pages, link below:
My OKCupid Project

It took me a couple of weeks to finish it, and I'm quite satisfied with the result. Now that I know more, I'll probably create another version with the same project objective.

Thanks for constructive feedback.

Hi Tamaricki.

Really cool and well-structured project: well done on every component of it.

I had a flick through and the only thing that really stands out to me as a red flag is how you encode the variables into numbers for the ML component of the project. How much this matters will vary from project to project, but what follows should be considered standard practice for achieving the best results in most circumstances.

So, ML models try to find associations in the data presented to them. Because they (usually) can't interpret strings, you rightly encode the strings as numbers. Consider the following from your code:

map_ethnicity = {'unknown': -1, 'native american': 0, 'middle eastern': 1, 'pacific islander': 2, 'indian': 3, 'other': 4, 'black': 5, 'latin': 6, 'asian': 7, 'multiethnic': 8, 'white': 9}

Now what is the numerical difference between 'native american' and 'pacific islander'? There is none of course, but according to how the model will interpret the codes, there is an exact number for the difference: 2 in fact, which apparently is exactly one third of the distance between 'native american' and 'latin'. You've created an ordinal relationship where none exists.

Your best solution to this is one-hot encoding, which was covered earlier in the course. It creates one binary column per category, each indicating whether or not the individual belongs to that ethnic group.
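
As a rough sketch of what that could look like with pandas (the tiny DataFrame below is made up for illustration and is not your actual data; pd.get_dummies is only one of several ways to do this):

```python
import pandas as pd

# Hypothetical toy data standing in for the real ethnicity column
df = pd.DataFrame({"ethnicity": ["white", "asian", "latin", "asian"]})

# One binary column per category; dtype=int gives 0/1 rather than booleans
encoded = pd.get_dummies(df, columns=["ethnicity"], dtype=int)
print(encoded)
```

Each row ends up with a 1 in exactly one of the new ethnicity_* columns, so the model no longer sees any ordering or distance between the categories.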


map_drinks = {'not at all': 1, 'rarely': 2, 'socially': 3, 'often': 4, 'very often': 5, 'desperately': 6}

…DOES make sense in this instance because it IS ordinal (you can argue whether the distance between "socially" and "often" is equivalent to the distance between "not at all" and "rarely", but it's not a big deal).

Also, if you do want to encode variables the way you have done, it's probably quicker to use from sklearn.preprocessing import LabelEncoder, or cat.codes if you want to explicitly declare an order to the sequence:

df_for_age_ML['drugs'] = pd.Categorical(df_for_age_ML['drugs'], ['never', 'Not reported', 'sometimes', 'often'], ordered=True)

df_for_age_ML['drugs'] = df_for_age_ML['drugs'].cat.codes
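
To see what those two lines produce, here is a small self-contained sketch (the Series values are invented for illustration, including a deliberately unlisted 'rarely'; the category order is the one from the snippet above):

```python
import pandas as pd

# Hypothetical sample values; 'rarely' is intentionally not in the category list
drugs = pd.Series(["never", "often", "sometimes", "Not reported", "rarely"])

# Codes follow the position of each value in this explicitly ordered list
order = ["never", "Not reported", "sometimes", "often"]
drugs_cat = pd.Series(pd.Categorical(drugs, categories=order, ordered=True))

codes = drugs_cat.cat.codes
print(codes.tolist())
```

Note that any value missing from the category list becomes NaN in the categorical and gets the code -1, so it's worth checking for -1 before feeding the column to a model.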

Thanks net3879063533,

You are absolutely right: I mapped the ethnicity values as if they were ordinal, but they're not. That is another lesson learned for me: I need to become more 'comfortable' using the available modules instead of doing the work manually.