I was working through the last section of
HOW TO CLEAN DATA WITH PYTHON, and the Missing Values section stated:
"Most of the time, we use one of two methods to deal with missing values."
"Method 1: drop all of the rows with a missing value"
"Method 2: fill the missing values with the mean of the column, or with some other aggregate value."
I am most curious about the second suggested method, since the lesson doesn't go into more detail about the why. Of course, I can understand the implications of the first suggestion of dropping rows with missing values.
However, wouldn't replacing all missing values, or those in a given column, with an average or other aggregate throw the dataset off? Why would this be considered best practice, and in what situations?
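For reference, the two quoted methods map directly onto two pandas calls. This is a minimal sketch with made-up toy data and column names, not the lesson's actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration; column names are assumptions.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0],
                   "score": [90.0, 80.0, np.nan, 70.0]})

# Method 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Method 2: fill each column's missing values with that column's mean.
filled = df.fillna(df.mean())
```

Note that `dropped` keeps only the two fully-complete rows, while `filled` keeps all four rows but invents values (e.g. the missing `age` becomes the mean of the three known ages).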
I think the aggregate value would be the mean, median, or mode of the values that you have for that column.
(Yes, it could throw the results off, but maybe less than dropping the rows would in some situations.)
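For example (toy data, assumed values), each of those aggregates is a one-liner in pandas, and the outlier in this series shows why the choice matters:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value and one outlier (10.0).
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

mean_filled = s.fillna(s.mean())      # mean is pulled up by the outlier
median_filled = s.fillna(s.median())  # median ignores the outlier
mode_filled = s.fillna(s.mode()[0])   # mode is the most frequent value
```

Here the mean fill inserts 3.75, while the median and mode fills both insert 2.0, so a skewed column can make the mean a poor choice.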
I think that an ideal situation for that would be when you have relatively few rows/values in your data set.
Or when many of your data points (rows) are missing one or more variables (columns).
I guess you could go further and try to predict the missing values from the remaining columns. But that's more complicated to do.
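As a rough sketch of that idea, you can fit a simple model on the complete rows and use it to predict the missing ones. This example uses a toy two-column frame and a plain least-squares line via `numpy.polyfit`; the data and column names are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "y" is missing in one row where "x" is known.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.0, 4.0, np.nan, 8.0, 10.0]})

# Fit y ≈ slope * x + intercept on the rows where y is known.
known = df.dropna(subset=["y"])
slope, intercept = np.polyfit(known["x"], known["y"], deg=1)

# Predict y only where it is missing.
missing = df["y"].isna()
df.loc[missing, "y"] = slope * df.loc[missing, "x"] + intercept
```

Libraries like scikit-learn generalize this to many predictor columns, but the principle is the same: use the columns you do have to estimate the one you don't.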
Dealing with missing values in a dataset is very common. How you deal with them depends on how large your dataset is. I would recommend reading the pandas documentation and googling examples of how people deal with missing values.
In machine learning (and really, in data analysis too), you wouldn't want to fill NaNs with the mean of a column, because that would introduce bias and could skew any models you're building or any statistics you run for hypothesis or significance testing.
Generally, you'd fill NaNs with 0 or an empty string, depending on the datatype of the column.
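A dtype-dependent fill like that can be sketched as follows. The frame and its columns are invented for illustration; `pd.api.types.is_numeric_dtype` is used to branch on the column type:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one numeric column, one string column.
df = pd.DataFrame({"count": [1.0, np.nan, 3.0],
                   "label": ["a", None, "c"]})

# Fill numeric columns with 0 and everything else with an empty string.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(0)
    else:
        df[col] = df[col].fillna("")
```

Whether 0 or "" is actually a safe sentinel still depends on the column's meaning (0 can be a real value for a count), so this is a convention, not a universal rule.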