Handling Missing Data: Stack Overflow Survey Trends

Hello! I am working through this off-platform project as part of the Business Intelligence Learning Path and have a question about the following in the “Analyze developers by country” section:

Determine what kind of missing data you have for employment and developer type. One way to do that is check, at a country level, where the data is missing for each field:

import seaborn as sns
import matplotlib.pyplot as plt
df[['RespondentID','Country']].groupby('Country').count()
missingData = df[['Employment','DevType']].isnull().groupby(df['Country']).sum().reset_index()

I was wondering why we used sum() instead of count() for missingData so I subbed in count() for sum() and it ended up returning the total count of all rows for each country and I can’t figure out why. Can someone explain why it is returning the same number as the country count and not just a count of the missing rows OR 0 (because my understanding is that count() only returns non-null data)?

Thanks so much in advance!

It should be count. Corroborate by checking documentation.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html

In the first bit of code, yes, you’re using .count() for the missing data for those columns, grouped by country. In the second line, .isnull() returns True (missing) or False (not missing) for every cell, and .sum() then adds up the True values to give the total number of missing (NaN) values in each of the two columns, grouped by country.
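To make the .isnull() and .sum() steps concrete, here is a minimal sketch on a made-up five-row frame. The rows are hypothetical and only the column names come from the project, so treat it as an illustration rather than the survey's real numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame.
df = pd.DataFrame({
    "Country": ["US", "US", "US", "CA", "CA"],
    "Employment": ["Full-time", np.nan, np.nan, "Part-time", np.nan],
    "DevType": ["Backend", "Frontend", np.nan, np.nan, np.nan],
})

# Step 1: isnull() turns every cell into True (missing) or False (present).
mask = df[["Employment", "DevType"]].isnull()

# Step 2: summing booleans treats True as 1, so the per-country sum
# is the number of missing values in each column.
missing = mask.groupby(df["Country"]).sum().reset_index()
print(missing)
```

With this toy data, CA ends up with 1 missing Employment and 2 missing DevType values, and US with 2 and 1.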

https://www.statology.org/pandas-count-missing-values/

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html

Thanks so much to you both for the replies. I still don’t fully understand why, if I use count() instead of sum() in the second line, it counts all the values in the dataframe and not just the null (True) ones, but I’ll keep digging.

I’m sorry, I misspoke above because I misread the question. (I also can’t see the output of what you’re referring to.)

In the first line, you’re counting how many IDs there are for each country. In the second, you’re summing the null values for Employment and DevType and grouping that by country.
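As for why swapping in count() gives the full row total: count() skips NaN, but after .isnull() every cell is True or False, and neither of those is NaN, so every row gets counted. A small sketch of the difference, assuming a made-up five-row frame (hypothetical data, with only the column names taken from the project):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame.
df = pd.DataFrame({
    "Country": ["US", "US", "US", "CA", "CA"],
    "Employment": ["Full-time", np.nan, np.nan, "Part-time", np.nan],
    "DevType": ["Backend", "Frontend", np.nan, np.nan, np.nan],
})

mask = df[["Employment", "DevType"]].isnull()

# sum() adds up the True values: the number of missing cells per country.
by_sum = mask.groupby(df["Country"]).sum()

# count() counts non-NaN cells, but the boolean mask has no NaN left,
# so this is simply the number of rows per country.
by_count = mask.groupby(df["Country"]).count()
rows_per_country = df.groupby("Country").size()

# count() on the original columns (no isnull) counts the answered cells.
non_null = df.groupby("Country")[["Employment", "DevType"]].count()

print(by_sum, by_count, rows_per_country, non_null, sep="\n\n")
```

So count() only tells you something about missing data when it runs on the original columns: rows per country minus that non-null count gives the same totals as the .isnull().sum() version.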

See: python - pandas isnull sum with column headers - Stack Overflow