Solution: Chocolate Scraping with Beautiful Soup (Project)

I was working through this project and noticed that, unlike the other projects, it has no video walkthrough. So I thought my solution might help anyone who is having trouble completing it. :blush:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

webpage = requests.get('https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html')
soup = BeautifulSoup(webpage.content, "html.parser")

ratings_data = soup.find_all(attrs={'class': 'Rating'})
ratings = []
# [1:] skips the first match, which is the column header rather than a rating
for rating in ratings_data[1:]:
    ratings.append(float(rating.string))
print(ratings)

plt.hist(ratings)
plt.show()

company_data = soup.select('.Company')
companies = []
for company in company_data[1:]:
    companies.append(company.string)
print(companies)

# Named "data" so the built-in dict type is not shadowed
data = {
    'Company': companies,
    'Rating': ratings
}
df = pd.DataFrame.from_dict(data)
print(df.head())

avg_ratings = df.groupby('Company').Rating.mean()
top_ten = avg_ratings.nlargest(10)
print(top_ten)

cocoa_data = soup.select('.CocoaPercent')
cocoa_pcts = []
for cocoa_pct in cocoa_data[1:]:
    # Drop the trailing "%" before converting to a number
    cocoa_pcts.append(int(float(cocoa_pct.string[:-1])))
print(cocoa_pcts)

df['CocoaPercentage'] = cocoa_pcts
print(df.head())

plt.cla()
plt.scatter(df.CocoaPercentage, df.Rating)
z = np.polyfit(df.CocoaPercentage, df.Rating, 1)
line_function = np.poly1d(z)
plt.plot(df.CocoaPercentage, line_function(df.CocoaPercentage), "r--")
plt.show()
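
If you want to check the scraping steps without hitting the live page, here is a minimal sketch that runs the same pattern against a tiny inline table. The HTML, the company names, and the numbers are all made up for illustration; the only thing borrowed from the real page is the assumption that the header cells share the same class names as the data cells, which is why the first match is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a header row plus two data rows.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="Company">Company</td>
    <td class="Rating">Rating</td>
    <td class="CocoaPercent">Cocoa Percent</td>
  </tr>
  <tr>
    <td class="Company">Maker A</td>
    <td class="Rating">3.75</td>
    <td class="CocoaPercent">63%</td>
  </tr>
  <tr>
    <td class="Company">Maker B</td>
    <td class="Rating">2.5</td>
    <td class="CocoaPercent">70%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [1:] drops the header cell in each column.
companies = [tag.get_text() for tag in soup.select(".Company")[1:]]
ratings = [float(tag.get_text()) for tag in soup.select(".Rating")[1:]]
# Strip the trailing "%" before converting.
cocoa = [float(tag.get_text()[:-1]) for tag in soup.select(".CocoaPercent")[1:]]

print(companies, ratings, cocoa)
```

Once the three lists look right on the sample, swapping SAMPLE_HTML for webpage.content should give the full data set.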


Definitely love that! Thanks!


Thank you - much appreciated. This wasn’t the best-written of modules.


Add my thanks! When you do that, it helps give us a reference point to check our own efforts and promotes better learning.


Does this run super slow for anyone else? It practically breaks my browser at the histogram part. I really wish this module had a video solution :(

I ran into the same issue; it was super slow…
I don’t know, maybe it has something to do with the backend or the database the data is fetched from.

Hey all,
If you run this module in a Jupyter notebook, it’s quick and you don’t need to go anywhere else to look at your graphs!

Life saving is what this is

It doesn’t run for me. LOL
It returns an error on this line:
ratings_data = soup.find_all(attrs={'class': 'Rating'})
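
One possible cause, assuming the code was copied from a page that turns straight quotes into curly ones: Python rejects typographic quotes as string delimiters, so the line fails with a SyntaxError before it ever runs. The sketch below only uses the standard library; normalize_quotes and compiles are hypothetical helper names, not part of any library.

```python
# The same line with curly quotes, written with escapes so this file stays ASCII.
pasted = "ratings_data = soup.find_all(attrs={\u2018class\u2019: \u2018Rating\u2019})"


def normalize_quotes(code: str) -> str:
    """Replace typographic quotes with their plain ASCII equivalents."""
    table = str.maketrans({
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
    })
    return code.translate(table)


def compiles(code: str) -> bool:
    """Return True if the code parses; it is compiled, never executed."""
    try:
        compile(code, "<pasted>", "exec")
        return True
    except SyntaxError:
        return False


fixed = normalize_quotes(pasted)
print(compiles(pasted), compiles(fixed))
```

If the quotes are already straight, another thing to check is that ratings was initialised as an empty list before the loop.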


Excellent job. Did you do it in a Jupyter notebook? It seems you did not import codecademylib3_seaborn.

I’m having trouble doing the project in Jupyter. Do I need to import more modules?

Hi, did you import anything extra to get it working in Jupyter? Thank you.
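
On the Jupyter question: as far as I can tell, codecademylib3_seaborn only exists inside Codecademy's own learning environment, so a local notebook should not need it. What a local setup does need is the usual packages installed (bs4, requests, pandas, matplotlib, numpy). A minimal sketch of an import that works in both places, where ON_CODECADEMY is just an illustrative flag name:

```python
# Assumption: this module is only importable inside Codecademy's environment.
try:
    import codecademylib3_seaborn  # noqa: F401
    ON_CODECADEMY = True
except ImportError:
    # Running locally (for example in Jupyter): carry on without it.
    ON_CODECADEMY = False

print("Codecademy environment:", ON_CODECADEMY)
```

In a notebook you can also simply delete the import line.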

Here’s my “as few lines of code as possible” solution:

import codecademylib3_seaborn
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Access data

web = requests.get('https://content.codecademy.com/courses/beautifulsoup/cacao/index.html')
soup = BeautifulSoup(web.content, 'html.parser')

# 2. Grab relevant data

ratings = [float(rating.get_text()) for rating in soup.find_all(attrs={"class": 'Rating'})[1:]]
companies = [company.get_text() for company in soup.find_all(attrs={"class": "Company"})[1:]]
cocoa = [float(cocoa.get_text()[:-1]) for cocoa in soup.find_all(attrs={"class": "CocoaPercent"})[1:]]

# 3. Create df

df = pd.DataFrame.from_dict({"Company": companies, "Rating": ratings, "CocoaPercentage": cocoa})

# 4. Ratings Histogram

plt.hist(ratings)
plt.show()
plt.clf()

# 5. Top 10 companies by average rating

mean_ratings = df.groupby("Company").Rating.mean()
top_ten = mean_ratings.nlargest(10)
print(top_ten)

# 6. Cocoa / Rating relationship

plt.scatter(df.CocoaPercentage, df.Rating)
z = np.polyfit(df.CocoaPercentage, df.Rating, 1)
line_function = np.poly1d(z)
plt.plot(df.CocoaPercentage, line_function(df.CocoaPercentage), "r--")

plt.show()

This was super helpful, thank you. I spent 30 minutes trying to work out why my scatter graph didn’t look right until I noticed that it had the same y-axis as the histogram. Turns out that’s what plt.clf() is for!
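
For anyone hitting the same thing, here is a small sketch of what plt.clf() does, using made-up numbers and the non-interactive Agg backend (an assumption, chosen so it runs without a display). The histogram creates an axes on the current figure; clf() removes it, so the scatter plot that follows starts from an empty figure instead of inheriting the histogram's axis.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt

# First plot: a histogram of some made-up ratings.
plt.hist([1, 2, 2, 3, 3, 3])
axes_before = len(plt.gcf().axes)

# Clear the current figure so the next plot does not reuse the old axes.
plt.clf()
axes_after = len(plt.gcf().axes)

# Second plot: starts on a clean figure with its own axis limits.
plt.scatter([60, 70, 80], [3.0, 3.5, 2.75])

print(axes_before, axes_after)
```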