Chocolate Scraping with Beautiful Soup Project

Hello all,
I had a lot of fun working on this project. For reference: https://www.codecademy.com/paths/data-analyst/tracks/dacp-data-acquisition/modules/dscp-web-scraping/projects/chocolate-scraping-with-beautiful-soup

My code is posted below. The only issues I have are on steps 16 and 17: I can't get rid of the histogram from my scatter plot, and ultimately I cannot implement the code provided in step 17.

I get a TypeError:
File "script.py", line 77, in <module>
z = np.polyfit(choco_df.CocoaPercentage, choco_df.Ratings, 1)
TypeError: must be str, not float

Can anyone help me solve this problem? It appears to me that Ratings is a float and must be a str. However, if I change that, I get other errors.

import codecademylib3_seaborn
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# MAKE SOME CHOCOLATE SOUP
# Creating the request
webpage = requests.get("https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")

# Making the soup
webpage_data = webpage.content
soup = BeautifulSoup(webpage_data, "html.parser")
# print(soup)

# Getting all class ratings using find_all
rating_tags = soup.find_all(attrs={"class": "Rating"})
# print(rating_tags)
ratings = []

# Looping through rating_tags and appending rating numbers to ratings
for number in rating_tags[1:]: 
  ratings.append(float(number.get_text())) 
# print(ratings)

plt.hist(ratings)
plt.show()


# WHICH CHOCOLATIER MAKES THE BEST CHOCOLATE?

# Getting all company names using find_all
company_name_tags = soup.find_all(attrs=({'class':'Company'}))
company_name = []

# Appending results to the list
for name in company_name_tags:
  company_name.append(name.get_text())
del company_name[0] #Dropping Headers

# Cleaning List if we need to "Removing list duplicates"
# company_name_cleaned = []
# for name in company_name:
#   if name not in company_name_cleaned:
#     company_name_cleaned.append(name)
# del company_name_cleaned[0]

#Using pandas to create dataframe with company names and ratings
company_data = {'Company': company_name, 'Ratings': ratings}  # avoid shadowing the built-in dict
company_df = pd.DataFrame.from_dict(company_data)
print(company_df)

# Find average ratings by company and listing 10 highest rated
groupby_avg = company_df.groupby('Company').Ratings.mean()
print(groupby_avg)
top_10 = groupby_avg.nlargest(10)
print(top_10)


# IS MORE CACAO BETTER
cocoa_percent_tags = soup.find_all(attrs=({'class':'CocoaPercent'}))
cocoa_percent = []


for percentage in cocoa_percent_tags:
  cocoa_percent.append(percentage.get_text().strip('%'))
del cocoa_percent[0] #dropping header
print(cocoa_percent)

# Plotting graphs with matplotlib and finding correlations with numpy
choco_data = {'Company': company_name, 'Ratings': ratings, 'CocoaPercentage': cocoa_percent}  # avoid shadowing the built-in dict
choco_df = pd.DataFrame.from_dict(choco_data)
print(choco_df)

plt.scatter(choco_df.CocoaPercentage, choco_df.Ratings)
z = np.polyfit(choco_df.CocoaPercentage, choco_df.Ratings, 1)
line_function = np.poly1d(z)
plt.plot(choco_df.CocoaPercentage, line_function(choco_df.CocoaPercentage), "r--")
plt.show()
plt.clf()

plt.clf() will clear the current figure, so call it after plt.show() for the histogram and before you build the scatter plot. Or, you could comment out that histogram code and see if the scatter plot prints.
One more thing, here in your code:

for percentage in cocoa_percent_tags:
  cocoa_percent.append(percentage.get_text().strip('%'))
del cocoa_percent[0] #dropping header
print(cocoa_percent)

The values are still strings after you strip the % off, so you need to convert each one to a number (an int or a float). Something like, cocoa_percent = ...
first and then append the converted percents to the list. Fix that and see if it works.
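
Here is a minimal sketch of what I mean. The sample values are made up (the real ones come from the scraped page), and I am assuming every entry looks like "70%". Stripping the sign only shortens the string; it is the float() call that turns it into something you can do math on.

```python
# Made-up sample of the scraped text, standing in for percentage.get_text().
raw_percents = ["63%", "70%", "70%", "75%"]

# Stripping the sign still leaves strings.
as_strings = [p.strip("%") for p in raw_percents]

# Arithmetic that mixes a str with a float raises a TypeError,
# which is the kind of failure np.polyfit runs into internally.
try:
    as_strings[0] + 1.0
except TypeError as err:
    print("TypeError:", err)

# Convert right after stripping so the list holds floats.
cocoa_percent = [float(p.strip("%")) for p in raw_percents]
print(cocoa_percent)
```

With floats in the list, the CocoaPercentage column of the DataFrame becomes numeric, and the polyfit call from step 17 has numbers to work with.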

https://www.activestate.com/resources/quick-reads/how-to-clear-a-plot-in-python/
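
As a rough sketch of the clearing step, assuming the same made-up sample values as placeholders for your lists and a non-interactive backend so it runs outside the Codecademy environment:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

ratings = [3.75, 2.75, 3.0, 3.5]          # made-up sample values
cocoa_percent = [63.0, 70.0, 70.0, 75.0]  # made-up sample values

# First plot: the histogram of ratings.
plt.hist(ratings)
plt.show()
plt.clf()  # clear the current figure so the bars do not carry over

# After clf() the figure has no histogram bars left on it.
bars_left = len(plt.gca().patches)

# Second plot: the scatter plot now starts on an empty figure.
plt.scatter(cocoa_percent, ratings)
plt.show()
```

The point is just the order: show the histogram, clear, then build the scatter plot.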

Thank you Lisa! That seems to work well. I managed to find the error, and it was indeed in the lines that you pointed out. I actually had to convert the str into a float, and the program now seems to run fine without any errors. Thank you for your help!
