Solution: Chocolate Scraping with Beautiful Soup (Project)

method7564419324 · April 19, 2020, 4:57am

I was doing this project and just noticed that unlike all projects there is no video walkthrough available for this one. So thought it might help someone who is having trouble completing it.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

webpage = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html')
soup = BeautifulSoup(webpage.content, "html.parser")

ratings_data = soup.find_all(attrs={'class': 'Rating'})
ratings = []
for rating in ratings_data[1:]:  # [1:] skips the first match
    ratings.append(float(rating.string))
print(ratings)

plt.hist(ratings)
plt.show()

company_data = soup.select('.Company')
companies = []
for company in company_data[1:]:  # [1:] skips the first match
    companies.append(company.string)
print(companies)

# Named "data" rather than "dict" so the built-in dict type is not shadowed
data = {
    'Company': companies,
    'Rating': ratings
}
df = pd.DataFrame.from_dict(data)
df.head()

avg_ratings = df.groupby('Company').Rating.mean()
top_ten = avg_ratings.nlargest(10)
print(top_ten)

cocoa_data = soup.select('.CocoaPercent')
cocoa_pcts = []
for cocoa_pct in cocoa_data[1:]:
    # Drop the trailing "%" before converting to a number
    cocoa_pcts.append(int(float(cocoa_pct.string[:-1])))
print(cocoa_pcts)

df['CocoaPercentage'] = cocoa_pcts
df.head()

plt.cla()
plt.scatter(df.CocoaPercentage, df.Rating)
z = np.polyfit(df.CocoaPercentage, df.Rating, 1)
line_function = np.poly1d(z)
plt.plot(df.CocoaPercentage, line_function(df.CocoaPercentage), "r--")
plt.show()

angie_sheng · May 13, 2020, 9:56am

Definitely love that! Thanks!

midds · May 25, 2020, 9:40am

Thank you - much appreciated. This wasn’t the best-written of modules.

mbf2234 · June 2, 2020, 8:45pm

Add my thanks! When you do that, it helps give us a reference point to check our own efforts and promotes better learning.

chip5575035774 · June 9, 2020, 9:38am

Does this run super slow for anyone else? It like breaks my browser at the histogram part. I really wish this module had a video solution : (

method7564419324 · June 9, 2020, 1:19pm

I ran into the same issue; it was super slow…
I don’t know, maybe it has something to do with the backend or the database the data is being fetched from.

myungsubkim734749217 · June 19, 2020, 2:28am

hey all,
If you guys run this module on jupyter notebook, it’s quick and you don’t need to go around to look at your graphs!

868guy · July 29, 2020, 3:29am

Life saving is what this is

jamesth44 · July 29, 2020, 1:53pm

It doesn’t run. LOL
Returns an error for the line:
ratings_data = soup.find_all(attrs={'class': 'Rating'})

chloe_mcc · December 8, 2020, 3:37am

Excellent job. Did you do it in a Jupyter notebook? It seems you did not "import codecademylib3_seaborn".

I’m having trouble doing the project in Jupyter. Do I need to import more ‘modules’?

chloe_mcc · December 8, 2020, 3:39am

hi, do you import extra stuff to get it working on jupyter? Thank you.
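
On the Jupyter question: as far as I can tell, codecademylib3_seaborn is a helper that only exists inside Codecademy's learning environment, so in a local notebook that import can be dropped or guarded. The other imports (bs4, requests, pandas, matplotlib, numpy) are ordinary packages that have to be installed locally. A minimal sketch, where the ON_CODECADEMY flag is just a name made up for illustration:

```python
# Guard the Codecademy-only import so the same script runs both on the
# platform and in a local Jupyter notebook.
try:
    import codecademylib3_seaborn  # assumed to be available only on Codecademy
    ON_CODECADEMY = True
except ImportError:
    ON_CODECADEMY = False

print("Running on Codecademy:", ON_CODECADEMY)
```

If the guard prints False, the rest of either solution in this thread should run unchanged, provided the other packages are installed.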

ddiran · December 28, 2020, 5:16pm

Here’s my “as few lines of code as possible” solution:

import codecademylib3_seaborn
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Access data

web = requests.get('https://content.codecademy.com/courses/beautifulsoup/cacao/index.html')
soup = BeautifulSoup(web.content, 'html.parser')

# 2. Grab relevant data

ratings = [float(rating.get_text()) for rating in soup.find_all(attrs={"class": 'Rating'})[1:]]
companies = [company.get_text() for company in soup.find_all(attrs={"class": "Company"})[1:]]
cocoa = [float(cocoa.get_text()[:-1]) for cocoa in soup.find_all(attrs={"class": "CocoaPercent"})[1:]]

# 3. Create df

df = pd.DataFrame.from_dict({"Company": companies, "Rating": ratings, "CocoaPercentage": cocoa})

# 4. Ratings Histogram

plt.hist(ratings)
plt.show()
plt.clf()

# 5. Top 10 chocolate by Rating

mean_ratings = df.groupby("Company").Rating.mean()
top_ten = mean_ratings.nlargest(10)
print(top_ten)

# 6. Cocoa / Rating relationship

plt.scatter(df.CocoaPercentage, df.Rating)
z = np.polyfit(df.CocoaPercentage, df.Rating, 1)
line_function = np.poly1d(z)
plt.plot(df.CocoaPercentage, line_function(df.CocoaPercentage), "r--")

plt.show()

dunst4n · January 21, 2021, 5:50pm

This was super helpful, thank you. I spent 30 minutes trying to work out why my scatter graph didn’t look right until I noticed that it had the same y-axis as the histogram. Turns out that’s what plt.clf() is for!
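
For anyone hitting the same thing, here is a small sketch of what plt.clf() does between two plots. It counts the axes on the current figure instead of showing it, so it runs without a display. The sample values are made up, and the Agg backend line is only there for headless runs (not needed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

ratings = [2.75, 3.0, 3.5, 3.75, 4.0]  # made-up sample values
cocoa = [63, 70, 70, 75, 85]           # made-up sample values

plt.hist(ratings)
axes_before_clear = len(plt.gcf().axes)   # the histogram's axes exist

plt.clf()                                 # clear the whole current figure
axes_after_clear = len(plt.gcf().axes)    # nothing left to draw on top of

plt.scatter(cocoa, ratings)               # scatter gets fresh axes
axes_after_scatter = len(plt.gcf().axes)

print(axes_before_clear, axes_after_clear, axes_after_scatter)
```

Without the plt.clf() call, the scatter would be drawn on the histogram's existing axes, which matches the shared y-axis described above.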

lemonstu · May 2, 2021, 3:46pm

Thank you so much. This was extremely difficult.

henrylin03 · August 17, 2021, 10:23am

Thanks OP! However, I’m getting this weird issue:

Traceback (most recent call last):
  File "script.py", line 53, in <module>
    plt.scatter(df.CocoaPercentage, df.Rating)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'CocoaPercentage'

If anyone knows what I can do to fix, please let me know!!
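
One possible cause, assuming the scatter plot runs before (or without) the line that adds the column: df.CocoaPercentage only works once a column with exactly that name exists on the DataFrame. A different spelling, such as 'CocoaPercent', would raise the same error. Checking df.columns first narrows it down. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up sample rows standing in for the scraped data
df = pd.DataFrame({"Company": ["Maker A", "Maker B", "Maker C"],
                   "Rating": [3.75, 2.75, 3.0]})

print(df.columns.tolist())
has_column_before = "CocoaPercentage" in df.columns  # column not added yet

# The column must be assigned before any df.CocoaPercentage access
cocoa_pcts = [63, 70, 70]  # made-up sample values
df["CocoaPercentage"] = cocoa_pcts
has_column_after = "CocoaPercentage" in df.columns

print(df.CocoaPercentage.tolist())
```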

alyciajenkins6675869 · August 28, 2021, 7:05pm

Not sure if anyone actually tried to print the top ten list, but it only prints one company and rating…

courserockstar38612 · September 11, 2021, 12:55am

did you use find_all?

array3249503141 · October 28, 2021, 3:30pm

To create the lists ratings, companies, and cocoa, why do you end your code with [1:]? I saw Codecademy use [0] before, but it was never explained why or what it is.
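
On the [1:] question: find_all returns a list of every tag with that class, and on this page the first match appears to be the column header rather than a data value (float("Rating") would fail). [1:] is a slice meaning "everything from index 1 onward", so it drops that first element, while [0] picks only the first element. A sketch on a made-up table, not the real page:

```python
from bs4 import BeautifulSoup

# Made-up table: the header cell shares its class with the data cells
html = """
<table>
  <tr><td class="Rating">Rating</td></tr>
  <tr><td class="Rating">3.75</td></tr>
  <tr><td class="Rating">2.5</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all(attrs={"class": "Rating"})

header = cells[0].get_text()                               # first match only
ratings = [float(cell.get_text()) for cell in cells[1:]]   # all but the first

print(header)
print(ratings)
```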

code5028064402 · November 11, 2021, 8:13am

Hello. While we’re comparing notes on web scraping, I could use a little help/pointers here. So I have this code and I’m trying to capture the duration of a movie file but I’ve hit a roadblock

import requests
from bs4 import BeautifulSoup

url = "https://yts.mx/movies/the-thirteenth-tale-2013"

page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
for tag in soup.find_all('span', class_ = "icon-clock"):
    print(tag)

And these are the results

<span class="icon-clock" title="Runtime"></span>
<span class="icon-clock" title="Runtime"></span>

Somehow the scraper is failing to capture the time shown here

<span class="icon-clock" title="Runtime"></span>
 1 hr 29 min 
<div class="visible-xs"></div>

What am I doing wrong?
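
A possible explanation, assuming the page looks like the snippet pasted above: the span itself is empty, and the runtime is a bare text node that comes right after it, so printing the tag shows nothing. In that case the text is reachable through the tag's next_sibling. The sketch below parses a copy of that snippet instead of the live site. The wrapping div is hypothetical, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Copy of the pasted snippet; the outer <div> is a hypothetical wrapper
html = """
<div>
<span class="icon-clock" title="Runtime"></span>
 1 hr 29 min 
<div class="visible-xs"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

runtimes = []
for tag in soup.find_all("span", class_="icon-clock"):
    sibling = tag.next_sibling  # the node immediately after the span
    # Keep it only if it is plain text and not just whitespace
    if isinstance(sibling, NavigableString) and sibling.strip():
        runtimes.append(sibling.strip())

print(runtimes)
```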

miguelgarca88 · January 4, 2022, 2:06pm

Same issue, although mine was because printing the ratings and converting them to floats took forever; other data types (sans conversion) worked much faster.