This community-built FAQ covers the “Review” exercise from the lesson “Web Scraping with Beautiful Soup”.
Paths and Courses
This exercise can be found in the following Codecademy content:
FAQs on the exercise Review
There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.
If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.
Here is my attempt at cleaning the dataframe. I was struggling at first because the columns and rows were swapped. I thought I had to use pivot(), but that wasn’t the right method. Then I discovered transpose() and it was smooth sailing from there.
import requests
from bs4 import BeautifulSoup
import pandas as pd

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get(prefix + "shellter.html")
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them
for a in turtle_links:
    links.append(prefix + a["href"])

# Define turtle_data:
turtle_data = {}
# follow each link:
for link in links:
    webpage = requests.get(link)
    turtle = BeautifulSoup(webpage.content, "html.parser")
    turtle_name = turtle.select(".name")[0].get_text()
    stats = turtle.find("ul")
    stats_text = stats.get_text("|")
    turtle_data[turtle_name] = stats_text.split("|")

pd.set_option('display.max_columns', None)
# RAW DATA, the cleaning starts here
df = pd.DataFrame(turtle_data)
# Removing all \n
df = df.replace('\\n', '', regex=True)
# Dropping all empty rows
df = df.drop([0, 2, 4, 6, 8, 10]).reset_index()
# Swapping columns and rows by using transpose()
df = df.transpose().reset_index()
# Dropping the leftover index row created by reset_index()
df = df.drop(0).reset_index(drop=True)
# Naming the columns
df.columns = ['name', 'age', 'weight', 'sex', 'breed', 'source']
# age column: keep only digits and convert to numeric
df.age = pd.to_numeric(df.age.replace('[^0-9]', '', regex=True))
# weight column: keep digits plus '.' in the character class so decimals survive
df.weight = pd.to_numeric(df.weight.replace('[^0-9.]', '', regex=True))
# The rest is easy
df.sex = df.sex.replace('SEX:', '', regex=True)
df.breed = df.breed.replace('BREED:', '', regex=True)
df.source = df.source.replace('SOURCE:', '', regex=True)
df
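The per-column regex cleanup above can also be done with pandas’ str.extract, which pulls the numeric part out in one step and keeps a decimal age like Sparrow’s 1.5 intact. A minimal sketch with made-up sample rows (the names and values here are hypothetical stand-ins for the scraped data):

```python
import pandas as pd

# hypothetical sample rows, shaped like df after the columns were named
df = pd.DataFrame({
    "name": ["Aesop", "Sparrow"],
    "age": ["AGE: 7 Years Old", "AGE: 1.5 Years Old"],
    "weight": ["WEIGHT: 6 lbs", "WEIGHT: 4.5 lbs"],
})
# extract the first number (with optional decimal part) and convert it
for col in ["age", "weight"]:
    df[col] = pd.to_numeric(df[col].str.extract(r"(\d+(?:\.\d+)?)", expand=False))
```

With expand=False, str.extract returns a Series rather than a one-column DataFrame, so it can be assigned straight back to the column.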
Thank you for sharing, it was very helpful. The problem of swapped rows and columns can also be solved by using pandas’ .from_dict() method when making the DataFrame, as shown in the hint for Instruction 1. And there’s a subtle issue: Sparrow’s age is not an integer but 1.5.
I’ve also used the posts of @mahak_gupta and @cdrowley as references (thanks to them too!). Here is my answer on cleaning the DataFrame:
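For reference, here is a minimal sketch of the .from_dict() approach with a made-up stand-in for the scraped turtle_data dict — orient="index" turns each dict key into a row, so no transpose() is needed afterwards:

```python
import pandas as pd

# hypothetical stand-in for the scraped turtle_data dict
turtle_data = {
    "Aesop": ["AGE: 7 Years Old", "WEIGHT: 6 lbs"],
    "Sparrow": ["AGE: 1.5 Years Old", "WEIGHT: 4.5 lbs"],
}
# orient="index" makes each dict key a row instead of a column
df = pd.DataFrame.from_dict(turtle_data, orient="index")
```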
According to the pandas User Guide, the default dtype for storing text data in pandas is object, kept for backwards compatibility because object was the only option prior to pandas 1.0.
To use the string dtype, we need to specify the dtype explicitly when we create a DataFrame or a Series, or cast it with .astype.
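Both ways of getting the string dtype can be sketched in a few lines (requires pandas 1.0 or later; the breed values here are just sample data):

```python
import pandas as pd

breeds = ["African Aquatic Sideneck Turtle", "Eastern Box Turtle"]
s_obj = pd.Series(breeds)                  # default: object dtype
s_str = pd.Series(breeds, dtype="string")  # explicit string dtype at creation
s_cast = s_obj.astype("string")            # or cast an existing Series
```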
It would be good if you told us that requests, bs4, and pandas are not default modules and that we need to install them in order to complete this work offline or in a Jupyter notebook. It would take one sentence and save a lot of swearing…
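For anyone working offline: all three third-party packages can be installed with pip. Note that the bs4 module is published on PyPI under the name beautifulsoup4:

```shell
pip install requests beautifulsoup4 pandas
```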
# raw data (assumes turtle_data from the scraping step above)
import pandas as pd

turtle_df = pd.DataFrame(turtle_data)
rows = []
# put the values for each turtle's column into a list
for turtle_name in turtle_df.columns:
    turtle_row = [i for i in list(turtle_df[turtle_name].values) if i != '\n']
    # example turtle_row: ['AGE: 7 Years Old', 'WEIGHT: 6 lbs', 'SEX: Female', 'BREED: African Aquatic Sideneck Turtle', 'SOURCE: found in Lake Erie']
    new_turtle_row = []
    # split each element by ': ' and get the column names
    column_names = [i.split(': ')[0].title() for i in turtle_row]
    column_names.insert(0, 'Name')
    for i in turtle_row:
        # for every element in each turtle_row this gets the info after ':'
        info = i.split(': ')[-1]
        # example info: '7 Years Old' (string)
        # split info into a list for the numerical values (age and weight)
        info2 = info.split()
        # example info2: ['7', 'Years', 'Old']
        try:
            # try making a float out of the first element, which throws an error for words
            i = float(info2[0])
            new_turtle_row.append(i)
        except Exception:
            new_turtle_row.append(info)
    new_turtle_row.insert(0, turtle_name)  # insert the name of each turtle at the beginning of its row
    rows.append(new_turtle_row)  # rows is a list of lists
turtle_df = pd.DataFrame(data=rows, columns=column_names)  # remaking the df here
# print(turtle_df)
Hi, everyone.
I’m doing the “Data Scientist” career path, and this Web Scraping Review seems pretty advanced for where I am at in the learning process. I have very little experience with Pandas because it is covered later in the sequence, so being asked to clean up this dataframe looks daunting. The instructions also mention using “Regex,” which has not been covered yet.
I have been looking up things when I don’t know them, but I would rather come back to this if I am going to learn most of the skills later.
Is this activity something I should be able to do after I go through the tutorial on Pandas?
Hello fellow Codecademy Learners!
This exercise seemed quite challenging, but not so much once I slowed down and followed what was happening, step by step. I wish I had more info on using regex and the proper syntax, as well as Pandas and manipulating the DataFrame. However, I was able to find some simple solutions and I am pleased with the result. I will probably go back and tabulate the DataFrame so it prints in a better format.
NOTE: I have NO IDEA how that extra B got in there in the Breed column. If you spot the error, please let me know! It drove me crazy for over an hour and I finally just let it go.
Thanks! Happy coding!
No idea what’s going on with the pandas stuff at the end. Copy pasted other people’s solutions to get through this stupid lesson. Web scraping looks interesting, but this was an exercise in discouragement.