FAQ: Data Types and Quality - Working with Missing Data

This community-built FAQ covers the “Working with Missing Data” exercise from the lesson “Data Types and Quality”.

I don’t quite understand the difference between Missing Completely at Random and Missing at Random.

So the former just means data wasn’t entered properly. What about the latter? How is it different from the former?

Thanks!

1 Like

I found more information in the cheatsheet that answered my question: https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsf-data-literacy/modules/introduction-to-data-38e13b33-2ba6-4515-bfbf-4a785c9194a9/cheatsheet

3 Likes

The difference is that Missing Completely at Random is a random error not linked to any variable (e.g., a bad entry due to fatigue or sloppiness), whereas with Missing at Random the missingness is related to some variable (e.g., a tree bigger than the tape measure).

Make sure you do the “Handling Missing Data” course that is linked to get deeper into it. The cheat sheet doesn’t have enough detail but the course is good.
Handling Missing Data | Codecademy

4 Likes
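
To make the distinction above concrete, here is a small simulation sketch. Everything in it is made up for illustration (the species, heights, and drop probabilities are assumptions, not the exercise's data). Under MCAR every tree has the same chance of losing its height; under MAR the chance depends on another column that is still recorded, here the species.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 500 tall oaks and 500 short dogwoods, heights in feet.
trees = [("oak", random.uniform(40, 80)) for _ in range(500)]
trees += [("dogwood", random.uniform(10, 30)) for _ in range(500)]


def drop_heights(rows, drop_prob):
    """Return (species, height-or-None) rows; drop_prob maps species -> chance of a lost height."""
    return [
        (species, None if random.random() < drop_prob[species] else height)
        for species, height in rows
    ]


def missing_rate(rows, species):
    """Fraction of rows for one species whose height is missing."""
    values = [height for s, height in rows if s == species]
    return sum(v is None for v in values) / len(values)


# MCAR: every tree has the same 20% chance of a lost measurement.
mcar = drop_heights(trees, {"oak": 0.20, "dogwood": 0.20})

# MAR: the chance depends on species, which is itself recorded for every tree.
mar = drop_heights(trees, {"oak": 0.50, "dogwood": 0.05})

for label, rows in [("MCAR", mcar), ("MAR", mar)]:
    print(label, {s: round(missing_rate(rows, s), 2) for s in ("oak", "dogwood")})
```

In the MCAR case the two species should show roughly the same missing rate, while in the MAR case the rates differ sharply by species. That difference between groups is the kind of pattern to look for in a real table.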

Hello there! I didn’t really get what exactly we can do about missing data. What are the steps? Imagine I have such a situation at work: what exactly am I supposed to do? Thank you so much for the answer!

That’s why I recommended the Handling Missing Data course. It teaches you how to fix the different types of problems arising in categorical or numeric data types. I had no problems with it until the last page, where it gets heavily into pandas. I also had to learn Jupyter Notebooks and am working on Git and GitHub to get ready for the first project. There’s a lot to learn.
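
For a rough idea of what those steps can look like in pandas, here is a minimal sketch. The column names and values are made up, and mean imputation by group is just one common option, not necessarily what the course teaches. Which option is appropriate depends on why the data is missing.

```python
import numpy as np
import pandas as pd

# Made-up census rows, not the table from the exercise.
df = pd.DataFrame({
    "species": ["oak", "oak", "oak", "maple", "maple"],
    "height_ft": [40.0, np.nan, 50.0, 22.0, np.nan],
})

# Step 1: count how much is missing in each column.
missing_counts = df.isna().sum()

# Step 2 (option A): drop the rows that lack a height.
dropped = df.dropna(subset=["height_ft"])

# Step 2 (option B): fill each missing height with the mean height of its species.
species_mean = df.groupby("species")["height_ft"].transform("mean")
filled = df.assign(height_ft=df["height_ft"].fillna(species_mean))

print(missing_counts)
print(dropped)
print(filled)
```

Dropping rows is simplest but throws away data, and it can bias results if the data is not Missing Completely at Random. Filling values keeps every row but adds numbers that were never measured, so it is worth noting which values were filled.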

I have a question about the tree census table included in the course curriculum under DATA TYPES AND QUALITY, specifically regarding the handling of missing data. The instruction section explains that some data is structurally missing, while other data is Missing at Random or Missing Completely at Random.

However, I am confused about the data in the table itself. The “Single” variable shows 0 values numerous times, indicating that the trees are in groups. However, the “Distance (ft)” variable shows NaN values, indicating missing data. The “Single” variable represents trees that are alone, with a value of 1 meaning “True,” while a value of 0 means “False” and indicates that the tree is in a group. Based on this, I expected to see some kind of value in the “Distance (ft)” variable for trees in groups.

Since every 0 in the “Single” variable suggests that trees are in groups, there should not be any structurally missing data points. However, the NaN values in the “Distance (ft)” variable seem to contradict this. I would appreciate some clarification on this issue.

4 Likes

My best guess is that the “person” recording the data chose not to record any distances for trees that were grouped together. It seems like an example of Missing at Random data because every tree that is not alone lacks a distance value. I think it’s an exercise showing us what a mistake or poor organization of a data set looks like.

2 Likes
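
One way to check a guess like this is to cross-tabulate the flag against whether the distance is missing. The sketch below reuses the exercise's column names, but the rows are made up to mirror the pattern described above; they are not the real table.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same column names as the exercise table, not its real values.
trees = pd.DataFrame({
    "Single": [1, 0, 0, 1, 0, 1],
    "Distance (ft)": [12.0, np.nan, np.nan, 8.5, np.nan, 20.0],
})

# True where the distance was not recorded.
distance_missing = trees["Distance (ft)"].isna()

# Cross-tabulate the Single flag against whether the distance is missing.
table = pd.crosstab(trees["Single"], distance_missing)
print(table)

# Share of missing distances within each Single group.
missing_share = distance_missing.groupby(trees["Single"]).mean()
print(missing_share)
```

If one group is missing nearly all of its distances and the other almost none, the missingness is tied to the Single column rather than being completely random. Whether that counts as structurally missing or Missing at Random then depends on what the column is supposed to mean.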

@newbyallen It says on the page of the Handling Missing Data course there are three prerequisites. Do you know if the sections in this career path are going to include those prerequisites? I feel like if I attempt this course I would be in over my head because I haven’t done any of the work they’re requiring but I’ve bookmarked the course to revisit at some point. What do you think?

No. When the person who was supposed to collect the data didn’t record anything for these fields, they are of course recorded as NaN. But we must distinguish that these missing values are structurally missing, so we should do nothing with them. The value is missing, but that is expected and it is OK.

Would anyone care to provide feedback for me? In the chart, they had us look at which of the data was structurally missing, MAR, or MCAR.

At first I didn’t see any structurally missing data. But if a single tree would not have a distance to another tree (because it’s alone), then we shouldn’t expect that data, and it could be structural. However, if a 0 for Single translates to no, or false (it is not single), then that wouldn’t make sense.

The rest of the data seems MCAR because I’m not seeing any patterns to connect the missing data, other than maybe the back-to-back serial numbers 13820 and 13821, which could be one person on one day and need to be redone.
What did other people get for MAR and MCAR?

Is “Handling Missing Data” part of the Data Scientist Learning Path? I took a quick look at the course and it seems a bit advanced for my current level. I want to ensure I keep tabs on it.

I was thinking along the same route. IDs 12139 and 21110 have values entered for distance and are both True (1) for Single, which leads me to believe ID 12139 should also be 1 vs NaN.

Species name missing: Missing Completely at Random. It doesn’t make sense that they would not know the species they are looking for.
ID 13280 and 12381 have height values missing, but there are two other entries of the same species with values, which leads me to believe this was Missing Completely at Random.
For the Distance (ft) column, I would say it is for the most part in fact Missing at Random/true NaN.

The prettiness… I am not understanding this subjective data… I assume Missing Completely at Random? Any help on this?

I think that, based on the distance inputs, it makes sense that the recorder put 0s in for all the trees that were alone and 1s in for trees that were grouped together. If you look at it that way, the entire column makes sense.