I’m currently working on the capstone project, Date-A-Scientist, of the "Build a Machine Learning Model with Python" skill path. You can find the link to the project here.
My problem is that if I drop all rows with missing values from the OKCupid dataset before fitting my machine learning models, I am left with only 1424 of the original 59946 rows. I fear this small sample might no longer be representative of the entire population.
Is this the right approach, or would you recommend a different way of handling the missing values?
Thank you for your help and advice in advance.
This is the code I use to drop the missing values:
```python
import pandas as pd
import numpy as np

# Create the DataFrame
df = pd.read_csv('profiles.csv')

# Denote -1 values for income as missing values
df.loc[df["income"] == -1, "income"] = np.nan

# Drop every row that has at least one missing value
df_no_nan = df.dropna()

print(len(df.index))         # 59946
print(len(df_no_nan.index))  # 1424
```
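To see why so few rows survive, I also counted the missing values per column. This is a minimal sketch with a tiny synthetic DataFrame standing in for profiles.csv (the column names are just illustrative); it shows how a couple of sparse columns can wipe out most rows under a blanket `dropna()`, and how `dropna(subset=...)` keeps more data:

```python
import pandas as pd
import numpy as np

# Tiny synthetic stand-in for profiles.csv: a few sparse columns
# (here "income" and an essay field) dominate the row loss.
df = pd.DataFrame({
    "age":    [22, 35, 28, 41, 30, 25],
    "income": [-1, 50000, -1, 80000, -1, -1],
    "essay0": ["hi", None, "hello", "ok", None, "hey"],
})

# Denote -1 values for income as missing, as in my code above
df.loc[df["income"] == -1, "income"] = np.nan

# Per-column missing counts show which columns drive the row loss
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows surviving a blanket dropna vs. dropping only on the columns
# a given model actually needs
print(len(df.dropna()))                   # all columns required
print(len(df.dropna(subset=["income"])))  # only income required
```

In this toy example the blanket `dropna()` keeps only 1 of the 6 rows, while requiring just `income` keeps 2, so restricting `subset` to the features each model uses (or imputing the rest) seems like it could preserve far more of the 59946 rows.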