Basic Data Analysis Question (Data Science: NLP)

Hey everyone!

I hope you’re all doing well. I have a question regarding the “Data Science: NLP” path, specifically about the concepts of “missing at random” and “missing completely at random.” I encountered a question during a quiz that confused me, and I wanted to get your thoughts on it.

Here’s a quick rundown: I was presented with a table containing missing values, and I had to determine whether the missing data was “missing at random” or “missing completely at random.” I’ve attached a screenshot of the question for reference. Although I answered it correctly this time (after getting it wrong before), I’m still struggling to grasp why this particular example would fall under “missing at random” instead of “missing completely at random.”

From what I understand, “missing completely at random” refers to data where the missing values occur randomly across participants, without any systematic relationship to other variables. In this table, the missing values do appear to be randomly distributed among the participants, without any obvious pattern related to specific variables. To me, this aligns with the definition of “missing completely at random.” However, the learning module says it’s an example of “missing at random,” which usually implies that the missingness follows a consistent pattern tied to another variable (e.g., all height data missing for Redwood trees due to equipment limitations).
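One way to look for the kind of pattern described above is to compare the share of missing values across the groups of an observed variable. Below is a minimal sketch of that check. The `trees` table, its column names, and the helper `missing_rate_by_group` are all made up for illustration; they are not taken from the quiz.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows in that group whose value_key is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical data: every Redwood height is missing (say, an equipment limitation),
# while heights for the other species were all recorded.
trees = [
    {"species": "Redwood", "height_m": None},
    {"species": "Redwood", "height_m": None},
    {"species": "Oak", "height_m": 21.0},
    {"species": "Oak", "height_m": 18.5},
    {"species": "Pine", "height_m": 30.2},
]

rates = missing_rate_by_group(trees, "species", "height_m")
print(rates)
```

Here the missing rate differs sharply by species, which is the signature of “missing at random” rather than “missing completely at random.” With a table as small as a quiz example, though, a check like this can only hint at a pattern; it cannot prove that none exists.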

If I’ve misunderstood something, I would greatly appreciate it if you could shed some light on the difference between “missing at random” and “missing completely at random.” I’m eager to deepen my understanding, and any explanation would be incredibly helpful.

Thank you all in advance for your assistance. I truly value your input!

I also could not understand the difference.

Do you have a link to this quiz question in the NLP lesson? I vaguely recall seeing this discussed but don’t remember how it was explained or taught at any length. It appeared in my Codecademy Go questions one day and I was thrown off.

It is a confusing topic for sure.

Yes, as I understand it too, this is correct:

This is the way I (try to) understand it:

  • MAR: the reason a value is missing is independent of the missing value itself, but it can be explained by the observed/non-missing data.
  • MCAR: the reason a value is missing has nothing to do with (is independent of) any of the other data, whether observed or missing.

I’m not sure if I’ve cleared anything up here or not. :thinking: :joy:
Perhaps the link I provided above would be of some help.
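For anyone who finds it easier to see this in code, here is a rough simulation of the two definitions above. It is only a sketch under stated assumptions: the groups "A" and "B", the probabilities 0.2, 0.4 and 0.05, and the helper names `simulate` and `rate` are all invented for illustration and have nothing to do with the quiz data.

```python
import random

def simulate(n=10_000, seed=0):
    """Build rows of (group, mcar_missing, mar_missing) for n hypothetical participants."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["A", "B"])            # a fully observed variable
        mcar_missing = rng.random() < 0.2         # MCAR: same chance for everyone
        p_mar = 0.4 if group == "A" else 0.05     # MAR: chance depends on the observed group
        mar_missing = rng.random() < p_mar
        rows.append((group, mcar_missing, mar_missing))
    return rows

def rate(rows, group, column):
    """Fraction of rows in `group` whose flag in `column` is True."""
    flags = [row[column] for row in rows if row[0] == group]
    return sum(flags) / len(flags)

rows = simulate()
mcar_gap = abs(rate(rows, "A", 1) - rate(rows, "B", 1))
mar_gap = abs(rate(rows, "A", 2) - rate(rows, "B", 2))
print(f"MCAR: gap in missing rate between groups = {mcar_gap:.3f}")
print(f"MAR:  gap in missing rate between groups = {mar_gap:.3f}")
```

Under MCAR the two groups should end up with roughly the same missing rate, while under MAR the rate shifts with the observed group. In neither case does the chance of being missing depend on the value that is missing.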

Hey @lisalisaj, thanks for the response!

I wasn’t able to locate the URL to that particular question in the quiz, but I do have a link to the 8-question quiz itself here.

But yeah, I cross-checked this question against multiple sources, and from what I can tell, the quiz question itself might actually be incorrect.

I tried reporting it to customer service, but they sent me right back here. Any way to bring this to their attention to get them to fix it, you think?

But, wouldn’t it be MAR data? (b/c it could be explained by the non-missing data.) Maybe the participants had a reason for not supplying their weight or…? :thinking:

If you think that it is indeed a mistake, you can report it here.