Stack Overflow Survey Trends

Hi all, I’m working on this project.

It is suggested that we can safely drop rows that have missing data in the “Employment” and “Country” columns. It was shown that data in these columns is missing at approximately the same rate across countries; however, I don’t think that means we can delete these rows. We first need to check whether there is any pattern in the missing data in the two columns. While playing with the dataset, I noticed that rows that say “Student” in the “Employment” field are always missing the “DevType” field. This data is clearly not missing completely at random: it is tied to the value of “Employment”, so by dropping those rows we would completely remove the whole category of students from our analysis.

Has anyone else noticed this? Let me know your thoughts!

I didn’t do this project, so I’ve not looked at the data… How much data (%) is missing from those two columns?
I don’t think you should delete them b/c aren’t they necessary for the EDA?

What does the data look like? (Column names)

Also, just for reference, have you ever looked at the annual StackOverflow Dev survey?

See:


Thank you @lisalisaj for your reply!

Here is a list of the columns and their respective percentage of missing data:

% Missing Data:
RespondentID 0.000000
Year 0.000000
Country 0.000000
Employment 1.604187
UndergradMajor 11.470295
DevType 9.689863
LanguageWorkedWith 8.264619
LanguageDesireNextYear 13.636486
DatabaseWorkedWith 22.794918
DatabaseDesireNextYear 33.248208
PlatformWorkedWith 17.624473
PlatformDesireNextYear 23.229235
Hobbyist 38.537349
OrgSize 50.719816
YearsCodePro 14.761395
JobSeek 45.547573
ConvertedComp 17.872654
WorkWeekHrs 54.060373
NEWJobHunt 82.800852
NEWJobHuntResearch 83.200101
NEWLearn 78.215792
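
For reference, a table like the one above can be produced in pandas roughly as follows. This is only a minimal sketch on a made-up toy DataFrame (the rows are hypothetical, not the survey data); only the column names are taken from the list above.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
df = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Employment": ["Employed full-time", "Student", None, "Student"],
    "DevType": ["Developer, back-end", None, "Designer", None],
})

# isna() flags missing cells; the mean of those booleans per column
# is the fraction missing, so multiplying by 100 gives a percentage.
pct_missing = df.isna().mean() * 100
print(pct_missing)
```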

So the tutorial suggests that we can safely remove rows with missing Employment and DevType (sorry, I mixed up the column names in my original post), because data in these columns is missing at the same rate across countries and the percentage of missing data is small.
However, my point was that DevType will always be missing if Employment = “Student”, so we shouldn’t remove rows with missing DevType: the data is structurally missing, and dropping those rows would remove the whole category of students from our EDA. Let me know your thoughts!
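
One way to check this kind of pattern is to compute the share of missing DevType values within each Employment group. The sketch below uses a small made-up DataFrame (hypothetical rows, with the real column names), so it only illustrates the check and does not reproduce the actual survey numbers.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Employed full-time",
                   "Student", "Student", "Retired"],
    "DevType": ["Developer, back-end", None, None, None, "Designer"],
})

# Flag rows where DevType is missing, then average the flag per
# Employment category to get the percentage missing in each group.
devtype_missing_by_employment = (
    df.assign(devtype_missing=df["DevType"].isna())
      .groupby("Employment")["devtype_missing"]
      .mean()
      .mul(100)
)
print(devtype_missing_by_employment)
```

If one category shows 100% while the others are much lower, the missingness is tied to that category rather than spread evenly.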

And thanks for sharing the link to the survey!

Yes, you make a good point. But, I think the project idea is to analyze actual, employed developers and not students. So, perhaps that’s what they’re going for in the instructions(?) (I don’t know as I’m not in their head).

I’d remove the columns with the highest percentage of missing data:

  • NEWJobHunt 82.800852
  • NEWJobHuntResearch 83.200101
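
Dropping columns like these could look roughly like the sketch below. The 80% cutoff is just an assumed threshold for illustration, and the DataFrame is a made-up toy example, not the real survey data.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Student", "Retired",
                   "Student", "Employed part-time", "Employed full-time"],
    "NEWJobHunt": [None, None, None, None, None, "Curious about other opportunities"],
})

threshold = 80  # assumed cutoff (percent missing); pick whatever suits your analysis

# Percentage of missing values per column.
pct_missing = df.isna().mean() * 100

# Keep only the columns at or below the cutoff.
cols_to_drop = pct_missing[pct_missing > threshold].index
df_reduced = df.drop(columns=cols_to_drop)
print(list(df_reduced.columns))
```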

If you do a df['Employment'].value_counts():

Employment                                              count
Employed full-time                                      84707
Independent contractor, freelancer, or self-employed     8597
Employed part-time                                       5114
Not employed, but looking for work                       4248
Not employed, and not looking for work                   3442
Student                                                  2789
Retired                                                   528

There are only 2789 students in the survey. You could always separate out the students into their own df and analyze them separately…
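
Splitting the students out might look something like this. It is a minimal sketch on a made-up toy DataFrame; the names `students` and `non_students` are just illustrative.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
df = pd.DataFrame({
    "Employment": ["Employed full-time", "Student", None, "Student", "Retired"],
    "DevType": ["Developer, back-end", None, "Designer", None, None],
})

# Boolean mask for student rows. Rows with a missing Employment value
# compare as False here, so they end up in non_students.
is_student = df["Employment"] == "Student"

# .copy() gives two independent DataFrames that can be cleaned separately.
students = df[is_student].copy()
non_students = df[~is_student].copy()

print(len(students), len(non_students))
```

That way rows with missing DevType can be dropped from `non_students` without losing the students.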