FAQ: Web Scraping with Beautiful Soup - Review

This community-built FAQ covers the “Review” exercise from the lesson “Web Scraping with Beautiful Soup”.

Paths and Courses
This exercise can be found in the following Codecademy content:

FAQs on the exercise Review

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!

Agree with a comment or answer? Like it to up-vote the contribution!

Need broader help or resources? Head here.

Looking for motivation to keep learning? Join our wider discussions.

Learn more about how to use this guide.

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

1 Like

Who made this? It’s all sorts of buggy.

4 Likes

Please post a link and we can ask the curriculum team to look into it.

1 Like

When we’re cleaning the dataframe at the end, how do we fix the column names?

import requests
from bs4 import BeautifulSoup
import pandas as pd

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them
for a in turtle_links:
    links.append(prefix+a["href"])
    
#Define turtle_data:
turtle_data = {}
#follow each link:
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  turtle_name = turtle.select(".name")[0].get_text()
  
  stats = turtle.find("ul")
  stats_text = stats.get_text("|")
  turtle_data[turtle_name] = stats_text.split("|")
  

turtle_df = pd.DataFrame.from_dict(turtle_data, orient="index")
print(turtle_df.head())
print(turtle_df.columns)
print(turtle_df.dtypes)
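# the even-numbered columns hold only the "\n" separators left over from splitting on "|", so the drops below remove them one at a time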
turtle_df = turtle_df.drop(turtle_df.columns[-1], axis=1)

print(turtle_df.head())

turtle_df2 = turtle_df.drop(turtle_df.columns[8],axis=1)

print(turtle_df2.head())
print(turtle_df2.dtypes)

turtle_df3 = turtle_df2.drop(turtle_df2.columns[0],axis=1)
pd.set_option('display.max_columns', None)

print(turtle_df3.head())
turtle_df4 = turtle_df3.drop(turtle_df3.columns[1],axis=1).reset_index()

print(turtle_df4.head())

turtle_df5 = turtle_df4.drop(turtle_df4.columns[3],axis=1)
print(turtle_df5.head())

turtle_df6 = turtle_df5.drop(turtle_df5.columns[4],axis=1)
print(turtle_df6.head())

final_df = turtle_df6
print(final_df)

final_df.columns = ['name', 'age', 'weight', 'sex', 'breed', 'source']
print(final_df)
age_split_df = final_df['age'].str.split(':', expand=True)
print(age_split_df)
age_split_2 = age_split_df.get(1).str.split(r'(\s)', expand=True)
age_split_2 = pd.to_numeric(age_split_2.get(2))
print(age_split_2)
final_df['age'] = age_split_2
print(final_df)

wt_split_df = final_df['weight'].str.split(':', expand=True)
print(wt_split_df)

wt_split_2 = wt_split_df.get(1).str.split(r'(\s)', expand=True)
print(wt_split_2)


final_df['weight'] = pd.to_numeric(wt_split_2.get(2))
print(final_df)
sex_split = final_df['sex'].str.split(':', expand=True)
breed_split = final_df['breed'].str.split(':', expand=True)
source_split = final_df['source'].str.split(':', expand=True)

print(sex_split)
print(breed_split)
print(source_split)
final_df['sex'] = sex_split.get(1)
final_df['breed'] = breed_split.get(1)
final_df['source'] = source_split.get(1)

print(final_df)



4 Likes

I hope this helps. This took a while, but it really improved my data cleaning skills. All suggestions/corrections are welcome.

Building on @mahak_gupta’s answer I’ve shortened the cleaning process:

final_df = turtle_df.drop([0,2,4,6,8,10], axis = 1).reset_index()
final_df.columns = ['name','age','weight_lbs','sex','breed','source']

# (\d+\.?\d*) also captures decimal values such as Sparrow's age of 1.5
final_df['age'] = final_df['age'].str.extract(r'(\d+\.?\d*)').apply(pd.to_numeric)
final_df['weight_lbs'] = final_df['weight_lbs'].str.extract(r'(\d+\.?\d*)').apply(pd.to_numeric)

final_df['sex'] = final_df['sex'].str.split().str[-1]
final_df['breed'] = final_df['breed'].str.split(':').str[-1]
final_df['source'] = final_df['source'].str.split(':').str[-1]

final_df.head()
4 Likes

Could you please explain why we are using the axis argument in the drop function?

Here is my attempt at cleaning the dataframe. I was struggling at first because the columns and rows were swapped. I thought I had to use pivot, but that wasn’t the right method. Then I discovered transpose() and it was smooth sailing from there.

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
# go through all of the a tags and get the links associated with them
for a in turtle_links:
  links.append(prefix+a["href"])
    
#Define turtle_data:
turtle_data = {}

#follow each link:
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  turtle_name = turtle.select(".name")[0].get_text()
  
  stats = turtle.find("ul")
  stats_text = stats.get_text("|")
  turtle_data[turtle_name] = stats_text.split("|")
pd.set_option('display.max_columns', None)


# RAW DATA, the cleaning starts here
df = pd.DataFrame(turtle_data)

# Removing all \n
df = df.replace('\\n','',regex=True)

# Dropping all empty rows
df = df.drop([0,2,4,6,8,10]).reset_index()

# Swapping columns and rows by using transpose()
df = df.transpose().reset_index()

# dropping additional useless row
df = df.drop(0).reset_index(drop=True)

# Naming the columns
df.columns = ['name', 'age','weight','sex','breed','source']

# age column: keep only digits and the decimal point (Sparrow's age is 1.5), then turn into numeric
df.age = pd.to_numeric(df.age.replace('[^0-9.]', '', regex=True))

# weight column: same idea (the . in the character class keeps decimals from being removed)
df.weight = pd.to_numeric(df.weight.replace('[^0-9.]', '', regex=True))

# The rest is easy
df.sex = df.sex.replace('SEX:','',regex=True)
df.breed = df.breed.replace('BREED:','',regex=True)
df.source = df.source.replace('SOURCE:','',regex=True)

df

2 Likes

Thank you for sharing, it was very helpful. The problem of rows and columns being swapped can also be solved by using pandas’ .from_dict() method when making the DataFrame, as in the hint for Instruction 1. And there’s a subtle issue: Sparrow’s age is not an integer but 1.5.

I’ve also used the posts of @mahak_gupta and @cdrowley as references (thanks to them too!). Here is my answer on cleaning the DataFrame:

# Raw data
turtle_df = pd.DataFrame.from_dict(turtle_data, orient='index')

# Dropping unnecessary columns
df = turtle_df.drop([0, 2, 4, 6, 8, 10], axis=1).reset_index()

# Renaming the columns
df.columns = ['name', 'age', 'weight', 'sex', 'breed', 'source']

# Strip unnecessary parts from each column
df['age'] = pd.to_numeric(df['age'].apply(lambda x: x.split(' ')[1]))
df['weight'] = pd.to_numeric(df['weight'].apply(lambda x: x.split(' ')[1]))
df['sex'] = df['sex'].apply(lambda x: x.split(' ')[1])
df['breed'] = df['breed'].apply(lambda x: x.split(':')[1].strip())
df['source'] = df['source'].apply(lambda x: x.split(':')[1].strip())
5 Likes

Just finished cleaning the dataframe up. It was good practice. If you have any feedback, let me know.

import requests
from bs4 import BeautifulSoup
import pandas as pd

prefix = "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/"
webpage_response = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
for a in turtle_links:
    links.append(prefix+a["href"])
    
turtle_data = {}
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  turtle_name = turtle.select(".name")[0].get_text()
  
  stats = turtle.find("ul")
  stats_text = stats.get_text("|")
  turtle_data[turtle_name] = stats_text.split("|")
  

turtle_df = pd.DataFrame.from_dict(turtle_data, orient="index").reset_index()

#Loop backwards through columns to delete null columns
for x in range(len(turtle_df.columns))[::-1]:
    if (x-1 in [0,2,4,6,8,10]):
        turtle_df = turtle_df.drop(turtle_df.columns[x], axis=1)  
#Renaming columns
old_columns = ['name', 'age_old', 'weight_old', 'sex_old', 'breed_old','source_old']
turtle_df.columns = old_columns

#Cleaning cells of unnecessary characters
split_age = turtle_df['age_old'].str.split(' ')
turtle_df['age'] = split_age.str.get(1)
split_weight = turtle_df['weight_old'].str.split(' ')
turtle_df['weight'] = split_weight.str.get(1)
split_sex = turtle_df['sex_old'].str.split(' ')
turtle_df['sex'] = split_sex.str.get(1)
split_breed = turtle_df['breed_old'].str.split(': ')
turtle_df['breed'] = split_breed.str.get(1)
split_source = turtle_df['source_old'].str.split(': ')
turtle_df['source'] = split_source.str.get(1)

#Removing uncleaned columns
unclean_columns = turtle_df.columns
for x in range(len(unclean_columns)):
    if (unclean_columns[x] == 'name'):
        pass
    elif (unclean_columns[x] in old_columns):
        turtle_df = turtle_df.drop(unclean_columns[x], axis=1)

print(turtle_df)
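For comparison, each of those loops could be collapsed into a single drop call. A minimal sketch, assuming the same turtle_df at each point:

# instead of the backwards loop: drop the even-numbered raw columns by position
turtle_df = turtle_df.drop(turtle_df.columns[[1, 3, 5, 7, 9, 11]], axis=1)
# instead of the second loop: drop every column still carrying the _old suffix
turtle_df = turtle_df.drop([c for c in turtle_df.columns if c.endswith('_old')], axis=1)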
1 Like

Thanks for this. I just uploaded mine and your code is much simpler to understand and follow.

1 Like

This is all very helpful! I am curious about one thing: when I print the .dtypes of the cleaned table, I still see Name, Sex, Breed, etc. as objects:

Name       object
Age       float64
Weight    float64
Sex        object
Breed      object
Source     object

Why are they not converted to strings? When I use .to_string it does not change either. Does it matter if they are strings?

1 Like

According to the pandas User Guide, the default dtype for storing text data in pandas is object, kept for backwards compatibility because object was the only option prior to pandas 1.0.

To use the string dtype, we need to specify the dtype explicitly when we create a DataFrame or a Series, or cast it with .astype.
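For example, a minimal sketch assuming a final_df like the ones above:

# cast the text columns from object to the dedicated string dtype
final_df = final_df.astype({'name': 'string', 'sex': 'string', 'breed': 'string', 'source': 'string'})
print(final_df.dtypes)  # name, sex, breed and source now show as string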

It is used to tell pandas whether what you want to remove is a row (axis=0, the default) or a column (axis=1).
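For example, assuming a df with an 'age' column:

df.drop(0, axis=0)       # drops the row with index label 0
df.drop('age', axis=1)   # drops the 'age' column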

It would be good if you told us that requests, bs4 and pandas are not default modules and that we need to install them in order to complete this work offline or in a Jupyter notebook. It would take one sentence and save a lot of swearing…
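For anyone stuck on that, the usual install command is (note that bs4 is published on PyPI as beautifulsoup4):

pip install requests beautifulsoup4 pandas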

# raw data
turtle_df = pd.DataFrame(turtle_data)

rows = []
# puts the values from each turtle's column into a list
for turtle_name in turtle_df.columns:
  turtle_row = [i for i in list(turtle_df[turtle_name].values) if i != '\n']
  # example turtle_row: ['AGE: 7 Years Old', 'WEIGHT: 6 lbs', 'SEX: Female', 'BREED: African Aquatic Sideneck Turtle', 'SOURCE: found in Lake Erie']
  new_turtle_row = []

  # splits each element on ': ' and takes the first part as the column names
  column_names = [i.split(': ')[0].title() for i in turtle_row]
  column_names.insert(0, 'Name')

  for i in turtle_row:
    # for every element in turtle_row, this gets the info after the ': '
    info = i.split(': ')[-1]
    # example info: '7 Years Old' (string)
    # splits info into a list so the numerical values (age and weight) can be converted
    info2 = info.split()
    # example info2: ['7', 'Years', 'Old']
    try:
      # tries making a float out of the first element, which raises ValueError for words
      i = float(info2[0])
      new_turtle_row.append(i)
    except ValueError:
      new_turtle_row.append(info)
  new_turtle_row.insert(0, turtle_name)  # inserts the name of each turtle at the beginning of its row
  rows.append(new_turtle_row)  # rows is a list of lists

turtle_df = pd.DataFrame(data=rows, columns=column_names)  # remaking the df here

# print(turtle_df)

This was hard!

Hi, everyone.
I’m doing the “Data Scientist” career path, and this Web Scraping Review seems pretty advanced for where I am in the learning process. I have very little experience with Pandas because it is covered later in the sequence, so being asked to clean up this dataframe looks daunting. The instructions also mention using “Regex,” which has not been covered yet.
I have been looking up things when I don’t know them, but I would rather come back to this if I am going to learn most of the skills later.

Is this activity something I should be able to do after I go through the tutorial on Pandas?

Thanks for the help.

3 Likes

Hello fellow Codecademy Learners!
This exercise seemed quite challenging, but much less so once I slowed down and followed what was happening step by step. I wish I had more info on using regex and the proper syntax, as well as Pandas and manipulating the DataFrame. However, I was able to find some simple solutions and I am pleased with the result. I will probably go back and tabulate the DataFrame so it prints in a better format.
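For anyone else looking for a regex starting point, here is a minimal sketch, assuming a final_df with the raw 'age' strings seen earlier in the thread:

# pull the first number out of each 'age' string; \d+\.?\d* also matches decimals like 1.5
final_df['age'] = pd.to_numeric(final_df['age'].str.extract(r'(\d+\.?\d*)', expand=False))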

NOTE: I have NO IDEA how that extra B got into the Breed column. If you spot the error, please let me know! It drove me crazy for over an hour and I finally just let it go.
Thanks! Happy coding!

No idea what’s going on with the pandas stuff at the end. I copy-pasted other people’s solutions to get through this stupid lesson. Web scraping looks interesting, but this was an exercise in discouragement.