FAQ: Data Cleaning with Pandas - String Parsing

This community-built FAQ covers the “String Parsing” exercise from the lesson “Data Cleaning with Pandas”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Practical Data Cleaning

Why is the regex in this example '[$,]'? That is, why is the comma necessary? And in the solution, why is there a backslash before the %, and why is the comma necessary there as well?

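The backslash lets you “escape” special characters. The $ is a special character in regular expressions (it anchors the end of a string), although inside square brackets it is treated as a literal character. The % is not special in a regex, so escaping it is harmless but unnecessary. In the exercise, the code runs with or without the backslash and the comma, at least in my browser. The comma is a convention we sometimes see employed, though it is often optional.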

https://docs.python.org/3.5/library/re.html
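
To make the escaping point concrete, here is a minimal sketch using Python's built-in re module rather than pandas. The sample strings "$15" and "88%" are made up for illustration and are not taken from the exercise data.

```python
import re

price = "$15"  # hypothetical sample value

# Inside square brackets, $ is an ordinary character, so the backslash is optional.
in_class_plain = re.sub(r"[$,]", "", price)
in_class_escaped = re.sub(r"[\$,]", "", price)

# Outside brackets, an unescaped $ means "end of string", so nothing visible is removed.
outside_unescaped = re.sub(r"$", "", price)
outside_escaped = re.sub(r"\$", "", price)

score = "88%"  # hypothetical sample value

# % has no special meaning in a regex, so escaping it changes nothing.
percent_plain = re.sub(r"[%,]", "", score)
percent_escaped = re.sub(r"[\%,]", "", score)

print(in_class_plain, in_class_escaped, outside_unescaped, outside_escaped)
print(percent_plain, percent_escaped)
```

So the backslash only matters for $ when it sits outside a character class.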

I tried to replace ‘fract’ with ‘act’ in the exam column and got actactactactactions as a result. I then used the regex ^ so that it would not match ‘fract’, and got fractactactactact instead. Is there a way to replace ‘fract’ in ‘fractions’ with ‘act’ so that the result is ‘actions’?

students.exam = students['exam'].replace('[fract,]', 'act', regex=True)
print(students)

I got the same result without the comma after fract. So is the comma optional?

I am using the following line of code to try to convert the column, but it says that I must return a string instead of a float. When did I change the type to float in this code? It seems to me that it should have been left as a string, unless I am missing something.

students.score = students['score'].replace('[%,]', '', regex=True)

I’m not sure about your second example, where you use ^, without seeing that specific line of code. For the example where you do include the code, the result comes from the brackets: they tell Python that any one of the individual characters inside is acceptable to replace with ‘act’. So it looks at the first character of ‘fractions’, finds that ‘f’ is in ‘fract’, and updates the string to actractions. It then looks at the next character of the original string, finds that ‘r’ is in ‘fract’, and updates the string to actactactions. It repeats this for the entire string, so you end up with five ‘act’ strings. Try something like this instead, without the brackets. There are multiple right answers, but this one best fits your description:

students.exam = students['exam'].replace('fract', 'act', regex=True)
print(students)
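
Here is a small sketch comparing the patterns side by side on a one-element Series standing in for the exam column. The negated class [^fract] is only my guess at what the ^ attempt looked like, since that line was not posted; it does reproduce the fractactactactact output.

```python
import pandas as pd

# Stand-in for the exam column (hypothetical single value).
exam = pd.Series(["fractions"])

# Character class: every single f, r, a, c, t or comma becomes "act".
char_class = exam.replace("[fract,]", "act", regex=True)

# Assumed ^ attempt: a negated class replaces every character NOT in the set.
negated = exam.replace("[^fract]", "act", regex=True)

# No brackets: the five letters are matched together as one substring.
literal = exam.replace("fract", "act", regex=True)

# Outside brackets, ^ anchors the match to the start of the string.
anchored = exam.replace("^fract", "act", regex=True)

print(char_class[0], negated[0], literal[0], anchored[0])
```

The position of ^ matters: inside brackets it negates the set, outside brackets it means "start of string".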

I found nothing wrong in this code.

Is there a glitch of some kind? My code was exactly the same as the solution. I even copied and pasted the solution code to run step 1, and it still said “must be str, not float”.
Thus, is the code taught in this section correct?

The code:

students.score = students['score'].replace('[\%,]', '', regex=True)

You are not alone, but I also noticed that the data in the column was cleaned (the %’s were removed).
(Note that I did not escape the %. I don’t think it needs escaping, though doing so is harmless.)

Oh shoot, no, it is obvious…I just still needed to add the to_numeric call on the result.
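
For anyone else who hits this, a minimal sketch of the two steps together. The DataFrame below is made-up sample data, not the exercise's students table.

```python
import pandas as pd

# Hypothetical sample data standing in for the exercise's students DataFrame.
students = pd.DataFrame({"score": ["69%", "53%", "88%"]})

# Step 1: strip the % sign. The values are still strings at this point.
students["score"] = students["score"].replace("%", "", regex=True)
still_strings = all(isinstance(v, str) for v in students["score"])

# Step 2: convert the cleaned strings to numbers.
students["score"] = pd.to_numeric(students["score"])
is_numeric = pd.api.types.is_numeric_dtype(students["score"])

print(still_strings, is_numeric)
print(students["score"].tolist())
```

The replace call only edits the text; the numeric conversion is a separate step.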

@harrjt

I already understood the need for the \ to escape the special characters, and your explanation for that makes sense. However, you also reference the use of the comma as a convention (even if optional), but I can’t find any information or reference for it in either the Python docs for string.replace(…), Python regular expressions, or pandas DataFrame.replace(…).

Isn’t the regex ‘[\$,]’ just providing a character set that will match either $ or , in a string? Why is the comma necessary at all, for use in a data series that doesn’t include any commas?

Can you provide a more detailed explanation for the use of the comma in the regex and/or please cite a source for that convention? I appreciate any help you can give.

You’re right. The comma , in this case is not necessary, but in my opinion, they include the comma for more general cases. For example, if there is an item whose price has more than 3 digits (e.g. $4,025), in this case, you should not only remove the $ but also the comma , as there shouldn’t be a comma in a number in Python.
Only after removing the comma can you convert a string to a numerical datatype.
Hope this helps.
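
A quick sketch of that situation, using made-up prices (the values are not from the exercise). Removing only the $ still leaves a string that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical prices, one with a thousands separator.
prices = pd.Series(["$4,025", "$12", "$1,300"])

# Removing only the dollar sign leaves "4,025", which is not a valid number.
dollar_only = prices.replace("[$]", "", regex=True)
try:
    pd.to_numeric(dollar_only)
    comma_blocked_conversion = False
except ValueError:
    comma_blocked_conversion = True

# Removing both the dollar sign and the comma makes the conversion work.
numeric = pd.to_numeric(prices.replace("[$,]", "", regex=True))

print(comma_blocked_conversion)
print(numeric.tolist())
```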

Hi coders!

I understood everything, but my problem (maybe a bit basic for most of you) is that instead of “69” I’m getting “69.0”, with decimals. Does anybody know how to get plain integers instead, e.g. 69, 53, and so on?

Thanks a lot

I think one way is to change the dtype from float64 to Int64 with .astype after applying pd.to_numeric to the score column.

students.score = pd.to_numeric(students.score).astype('Int64')

The return type of pd.to_numeric is float64 probably because the score column contains NaN. The code above therefore casts it to Int64, one of pandas’ nullable-integer extension dtypes. If Int64 in the code is replaced with a NumPy integer type such as int64, which cannot hold NaN, the cast raises an error. The following two articles in the User Guide may be helpful:
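
A minimal sketch of the difference, assuming a score column that has already had its % signs removed and contains one missing value. The sample values are invented.

```python
import pandas as pd

# Hypothetical cleaned scores with one missing entry.
raw = pd.Series(["69", float("nan"), "53"], dtype=object)

# NaN forces a floating-point result, hence 69.0 rather than 69.
as_float = pd.to_numeric(raw)

# The nullable Int64 dtype (capital I) can hold integers alongside missing values.
as_nullable = as_float.astype("Int64")

# NumPy's int64 (lowercase i) has no way to represent a missing value.
try:
    as_float.astype("int64")
    numpy_cast_failed = False
except ValueError:
    numpy_cast_failed = True

print(as_float.dtype, as_nullable.dtype, numpy_cast_failed)
```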

Thanks a lot cocoder!!

I have a question. Since pd.to_numeric(df[‘column’]) returns a Series object, how is it that the resulting column turns out not to be an object?

Question: both of the following lines work on the ‘score’ column. Is there a reason to use one or the other? Is there a difference?
students['score'] = students['score'].replace('%', '', regex=True)
students.score = students.score.replace('%', '', regex=True)

The students[‘score’] option is useful if the column name contains whitespace. For example, if the column were called ‘student score’, you would use students[‘student score’], as the other version would throw an error. It’s also useful if you are altering multiple columns at once. Otherwise, there is no real difference as far as I am aware.
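
A short sketch of those cases, using a made-up DataFrame in which ‘student score’ is a hypothetical second column added only to show a name with whitespace.

```python
import pandas as pd

# Hypothetical data; "student score" exists only to illustrate a name with a space.
students = pd.DataFrame({
    "score": ["69%", "53%"],
    "student score": ["70%", "54%"],
})

# For a simple column name, both styles return the same Series.
same_column = students["score"].equals(students.score)

# A name with whitespace can only be reached with brackets.
spaced = students["student score"]

# Brackets also accept a list of names, so several columns can be cleaned at once.
both = students[["score", "student score"]].replace("%", "", regex=True)

print(same_column)
print(spaced.tolist())
print(both.values.tolist())
```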

Is there any particular reason we’re using pd.to_numeric() rather than adding .astype() to the end of the line where we replace the % or is it just a matter of preference?

Is it possible that you still have the code for the second part of the exercise in your editor?

The values in the DataFrame are first supplied as objects, so Codecademy expects the first part of the exercise to return them as objects.