The tutorial suggests that we can safely drop rows that have missing data in the “Employment” and “Country” columns. It shows that data in these columns is missing at approximately the same rate across countries, but I don’t think that alone means we can delete these rows. We first need to check whether there is any pattern in the missing data in the two columns. While playing with the dataset, I noticed that rows that say “Student” in the “Employment” field are always missing the “DevType” field. The missing data is clearly MNAR, so by dropping those rows we would completely remove the whole category of students from our analysis.
Has anyone else noticed this? Let me know your thoughts!
I didn’t do this project, so I’ve not looked at the data… How much data (%) is missing from those two columns?
I don’t think you should delete them, because aren’t they necessary for the EDA?
What does the data look like? (Column names)
Also, just for reference, have you ever looked at the annual Stack Overflow Developer Survey?
So the tutorial suggests that we can safely remove rows with missing Employment and DevType (sorry, I mixed up the column names in my original post), because data in these columns is missing at the same rate across countries, and also because the percentage of missing data is small.
However, my point was that DevType is always missing when Employment = “Student”, so we shouldn’t remove rows with missing DevType. It’s structurally missing data, and dropping those rows would remove the whole category of students from our EDA. Let me know your thoughts!
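Here is a minimal sketch of how one could check that pattern, assuming the data is loaded in a pandas DataFrame called df. The toy rows below are invented for illustration and are not the real survey data; only the column names Employment and DevType come from the dataset.

```python
import pandas as pd

# Toy stand-in for the survey data; the rows are made up for illustration.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Student", "Student",
                   "Employed part-time", None, "Retired"],
    "DevType": ["Developer, back-end", None, None,
                "Developer, front-end", None, None],
})

# Share of rows with a missing DevType within each Employment category.
# dropna=False keeps rows where Employment itself is missing as their own group.
missing_rate = (
    df["DevType"].isna()
    .groupby(df["Employment"], dropna=False)
    .mean()
)
print(missing_rate)
```

If a category shows a rate of 1.0, every row in that category is missing DevType, which suggests the gap is structural rather than random.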
Yes, you make a good point. But, I think the project idea is to analyze actual, employed developers and not students. So, perhaps that’s what they’re going for in the instructions(?) (I don’t know as I’m not in their head).
I’d remove the columns with the highest percentage of missing data:
NEWJobHunt 82.800852
NEWJobHuntResearch 83.200101
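Percentages like these can be computed roughly as follows. This is only a sketch: the toy DataFrame is invented, and the 50% cutoff is a hypothetical threshold, not something from the tutorial.

```python
import pandas as pd

# Toy frame standing in for the survey data; the values are invented.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Student", None, "Retired"],
    "NEWJobHunt": [None, None, None, "Curious about other opportunities"],
})

# Percentage of missing values per column, highest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Hypothetical cutoff: drop columns that are more than 50% missing.
threshold = 50
cols_to_drop = missing_pct[missing_pct > threshold].index
trimmed = df.drop(columns=cols_to_drop)
```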
If you do df['Employment'].value_counts(), you get:
count
Employment
Employed full-time 84707
Independent contractor, freelancer, or self-employed 8597
Employed part-time 5114
Not employed, but looking for work 4248
Not employed, and not looking for work 3442
Student 2789
Retired 528
There are only 2789 students in the survey. You could always separate out the students into their own df and analyze them separately…
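A rough sketch of that split, again on invented toy rows and assuming the DataFrame is called df (the names students and others are just illustrative):

```python
import pandas as pd

# Toy stand-in for the survey data; the rows are made up for illustration.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Student", "Student",
                   "Employed part-time", "Employed full-time"],
    "DevType": ["Developer, back-end", None, None,
                "Developer, front-end", None],
})

# Boolean mask for students; rows with a missing Employment count as False.
is_student = df["Employment"].eq("Student")

# Keep students in their own frame, even though DevType is missing for them.
students = df[is_student].copy()

# For everyone else, drop only the rows where DevType is missing.
others = df[~is_student].dropna(subset=["DevType"])

print(len(students), len(others))
```

This way the students stay available for their own analysis, and rows are only dropped where a missing DevType is not explained by the Employment value.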