FAQ: The Data Science Process - Cleaning Data

This community-built FAQ covers the “Cleaning Data” exercise from the lesson “The Data Science Process”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Code Foundations

FAQs on the exercise Cleaning Data

In the Cleaning Data part of the data science process, why are the first 15 rows of user_data.csv a bit different from the first 15 rows of the new_df DataFrame? And how is the population_proper of Brooklyn looked up in pop_data?

Hi, welcome to the forums.

Which step are you referring to, 2 or 3?
Your new_df is the two CSVs merged: user_data and pop_data.

Then in the 3rd step you’re creating a new column called location, based on whether the population is < 100000 or >= 100000, which gives “rural” or “urban” respectively. You’re able to check the condition because you’re using new_df.loc: the condition on the “population_proper” column selects the matching rows, and “.loc” sets location for those rows.
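To make steps 2 and 3 concrete, here is a minimal sketch. The two small tables are made-up stand-ins for user_data.csv and pop_data.csv (the values are not from the exercise), and it assumes the exercise merges with a plain pd.merge call.

```python
import pandas as pd

# Hypothetical stand-ins for user_data.csv and pop_data.csv (made-up values).
user_data = pd.DataFrame({
    "city": ["Brooklyn", "Smallville"],
    "education": ["college", "high school"],
    "age": [29, 41],
})
pop_data = pd.DataFrame({
    "city": ["Brooklyn", "Smallville"],
    "population_proper": [2600000, 45000],
})

# Step 2: merge the two tables, so each user row gains the population of its city.
new_df = pd.merge(user_data, pop_data)

# Step 3: the boolean condition picks the rows; "location" names the column to fill.
new_df.loc[new_df.population_proper < 100000, "location"] = "rural"
new_df.loc[new_df.population_proper >= 100000, "location"] = "urban"

print(new_df)
```

This is also why new_df looks a bit different from user_data: it carries the extra population_proper column that was looked up by city, plus the new location column.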

Wow, I am so proud of myself that I managed to do everything correctly after several attempts :slight_smile: Thank you for this exercise!

I’m having issues getting the column “location” to appear. Instead I still have the “population proper” column appearing… Any pointers?

Did you try to print the new_df at the end, e.g. print(new_df.head(15))? If so, the location column should appear. If you still don’t see that, try to resize the right-most window.

Yeah, I was missing the extra print command. I felt silly. Luckily I spotted the “get unstuck” button not too long after posting this. What I was originally trying to do was replace the previous tables with just one new and improved table. But I guess I need all the tables in there?

Glad to hear that you managed to display the new table. Maybe I misunderstood your question, but if you want to display the very last table, you can simply remove the previous print statements. If what you want to do is to show only certain columns then you can do, e.g.,

new_new_df=new_df[['city','education','age','location']]
print(new_new_df.head(15))

In this case, the new table will contain only “city”, “education”, “age”, and “location”, i.e. no “population_proper” is showing up.

Hi, I also was not able to view the Location field even after I only had the print statement on the last line. I tried to resize and still it did not show up. Finally, I only selected certain columns including Location and then was able to see it!
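One possible explanation (an assumption, not something confirmed in this thread) is that pandas shortens or wraps wide tables when printing, so a column at the far right can be hard to spot. A minimal sketch of widening the printed output, using a made-up wide table as a stand-in for new_df:

```python
import pandas as pd

# Hypothetical wide table standing in for new_df (made-up values).
new_df = pd.DataFrame({f"col_{i}": [i] for i in range(30)})
new_df["location"] = "urban"

# Ask pandas to print every column and to allow a much wider line
# before it truncates or wraps the table.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)

print(new_df.head(15))
```

Whether this helps depends on how the exercise's output pane renders text; selecting only the columns you care about, as described above, is the simpler workaround.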

Did you have a problem with step one? It says to copy and paste the line, but I’m trying to figure out where, because it doesn’t work when I copy and paste it in at line 205, and the solution doesn’t have the copied line in it. Can you help me figure this out? I don’t want to just use the solution without truly understanding what I’m doing or looking at.

Hi.
On all exercises, when I am required to copy and paste, I can copy but I cannot paste. The paste option does not appear when I click in the correct location.
Please advise.
JP Greeff. CPT SA

I prefer to use this syntax:

new_df.loc["population proper" < 100000, "location"] = "rural"
new_df.loc["population proper" >= 100000, "location"] = "urban"

But the exercise advises us to use this syntax instead:

new_df.loc[new_df.population_proper < 100000, "location"] = "rural"
new_df.loc[new_df.population_proper >= 100000, "location"] = "urban"

Is there a reason for this? To me it looks like the df.column_name syntax could end up getting confusing (imagine if your dataframe has a column labelled “loc” for example), but with the “column name” syntax it’s obvious you’re referring to a column/row label. It’s also more consistent because the “location” column is referred to with this syntax as well.
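A note on the first form: a bare string inside .loc is not looked up as a column, so Python tries to compare the text itself with the number and raises a TypeError. The string-based equivalent of new_df.population_proper is bracket indexing, new_df["population_proper"] (note the underscore in the column name). A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for new_df (made-up values).
new_df = pd.DataFrame({
    "city": ["Brooklyn", "Smallville"],
    "population_proper": [2600000, 45000],
})

# A bare string is compared as text, not as a column of new_df.
try:
    "population_proper" < 100000
except TypeError as err:
    print("TypeError:", err)

# Bracket indexing builds the same boolean condition as new_df.population_proper.
new_df.loc[new_df["population_proper"] < 100000, "location"] = "rural"
new_df.loc[new_df["population_proper"] >= 100000, "location"] = "urban"

print(new_df)
```

The bracket form also avoids the clash raised above: for a column labelled “loc”, new_df.loc would give the indexer rather than the column, while new_df["loc"] always means the column.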

Hi everyone,
I’m new to data science but I have some background in bioinformatics and R. I know nothing of Python though.

I found this exercise curious. Whenever I merged datasets in R using the “merge” function, I always had to give the instruction regarding what variable to merge by in both datasets. Here however the two datasets were merged apparently without those instructions. I thought it was going to be a “paste” (the two dataframes side by side), instead of a merge, but the datasets were merged by “city” without me giving instructions. How does this happen? And what if there were more than one column that could match in both datasets but they had dissimilar data but the same heading?

Thanks!
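On the merge question: when no key is given, pd.merge joins on every column name the two tables have in common, which is why “city” was picked up automatically. If several columns share a name, all of them become keys, so rows must match on every one. A minimal sketch with made-up tables (the “year” column is hypothetical, added only to show the clash):

```python
import pandas as pd

# Hypothetical tables that share two column names: "city" and "year".
left = pd.DataFrame({
    "city": ["A", "B"],
    "year": [2020, 2021],
    "age": [30, 40],
})
right = pd.DataFrame({
    "city": ["A", "B"],
    "year": [2020, 1999],
    "population_proper": [500, 900],
})

# No "on" argument: pandas joins on all shared column names (city AND year),
# so only rows that agree on both survive the default inner join.
default_merge = pd.merge(left, right)
print(default_merge)

# Explicit key, like "by" in R's merge: join on city only.
# The clashing "year" columns are kept side by side with suffixes.
by_city = pd.merge(left, right, on="city", suffixes=("_user", "_pop"))
print(by_city)
```

Passing on= explicitly is usually the safer habit, since it makes the key visible and keeps an unrelated shared heading from silently dropping rows.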