Chocolate Scraping with Beautiful Soup

Use .groupby to group your DataFrame by Company and take the average of the grouped ratings.
Then, use the .nlargest command to get the 10 highest rated chocolate companies. Print them out.

Not sure if this error occurred prior to this step, but I am on this step. When I print I get this error. I’m unsure how to resolve this error and need help.

File "c:\Users\scott\Documents\Coding\Tutorials\Codecademy\Learn Web Scraping with Beautiful Soup\00 - LEARN WEB SCRAPING WITH BEAUTIFUL SOUP\01 - Chocolate Scraping with Beautiful Soup\main.py", line 60, in <module>
ten_best_rated = average_group_rating.nlargest(10, "Rating")
File "C:\Users\scott\miniconda3\lib\site-packages\pandas\core\series.py", line 3309, in nlargest
return algorithms.SelectNSeries(self, n=n, keep=keep).nlargest()
File "C:\Users\scott\miniconda3\lib\site-packages\pandas\core\algorithms.py", line 1080, in __init__
raise ValueError('keep must be either "first", "last" or "all"')
ValueError: keep must be either "first", "last" or "all"

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Let’s make a request to this site to get the raw HTML,
# which we can later turn into a BeautifulSoup object.
# You can pass this into the .get() method of the requests module to get the HTML.
url = requests.get(
    "https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html")

# Create a BeautifulSoup object called soup to traverse this HTML.
# Use "html.parser" as the parser, and the content of the response you got from your request as the document.
soup = BeautifulSoup(url.content, "html.parser")

# If you want, print out the soup object to explore the HTML.
# print(soup)

# How many terrible chocolate bars are out there? And how many earned a perfect 5? Let’s make a histogram of this data.
# The first thing to do is to put all of the ratings into a list.
# Use a command on the soup object to get all of the tags that contain the ratings.
ratings_tags = soup.find_all(attrs={"class": "Rating"})

# Create an empty list called ratings to store all the ratings in.
ratings = []

# Loop through the ratings tags and get the text contained in each one. Add it to the ratings list.
# As you do this, convert the rating to a float, so that the ratings list will be numerical.
# This should help with calculations later.
for text in ratings_tags[1:]:
    ratings.append(float(text.get_text()))

# Using Matplotlib, create a histogram of the ratings values:
plt.hist(ratings)
plt.show()

# We want to now find the 10 most highly rated chocolatiers.
# One way to do this is to make a DataFrame that has the chocolate companies in one column, and the ratings in another.
# Then, we can do a groupby to find the ones with the highest average rating.
# First, let’s find all the tags on the webpage that contain the company names.
company_names_tags = soup.find_all(attrs={"class": "Company"})

# Just like we did with ratings, we now want to make an empty list to hold company names.
company = []

# Loop through the tags containing company names, and add the text from each tag to the list you just created.
for name in company_names_tags[1:]:
    company.append(name.get_text())

# Create a DataFrame with a column "Company" corresponding to your companies list,
# and a column "Rating" corresponding to your ratings list.
df = pd.DataFrame.from_dict({"Company": company, "Rating": ratings})

# Use .groupby to group your DataFrame by Company and take the average of the grouped ratings.
average_group_rating = df.groupby("Company").Rating.mean()

# Then, use the .nlargest command to get the 10 highest rated chocolate companies. Print them out.
ten_best_rated = average_group_rating.nlargest(10, "Rating")

print(ten_best_rated)

@psmilliorn,

You’re getting that error because you passed in two arguments for .nlargest(). All you needed to pass in was your integer for n.

If you look at the documentation, you’ll see that average_group_rating is a Series, and the second parameter of Series.nlargest() is keep, which only accepts "first", "last", or "all". You passed in "Rating" in that position, which is what caused the error.
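For illustration, here is a minimal sketch of the difference between the Series and DataFrame versions of .nlargest(). The sample companies and ratings are made up for this example and are not the real cacao data.

```python
import pandas as pd

# Hypothetical sample data, not the real cacao ratings.
df = pd.DataFrame({
    "Company": ["A", "A", "B", "C", "C"],
    "Rating": [4.0, 3.0, 2.5, 3.75, 4.25],
})

# Selecting .Rating before .mean() yields a Series indexed by Company.
average_group_rating = df.groupby("Company").Rating.mean()

# Series.nlargest takes n (and optionally keep), but no column name.
top_two = average_group_rating.nlargest(2)
print(top_two)

# DataFrame.nlargest is the variant that takes a column name as its second argument.
top_two_df = df.groupby("Company").mean().nlargest(2, "Rating")
print(top_two_df)
```

Both calls pick the same two companies here. The second argument only makes sense in the DataFrame form, where pandas needs to know which column to sort by.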

Also, your error message shows you where in pandas things went wrong. You can look at the source code to get a better idea of what’s supposed to happen there.

File "C:\Users\scott\miniconda3\lib\site-packages\pandas\core\series.py", line 3309, in nlargest
return algorithms.SelectNSeries(self, n=n, keep=keep).nlargest()

$ vim C:\\Users\\scott\\miniconda3\\lib\\site-packages\\pandas\\core\\series.py

Go to line 3309 and look at that function: what it does and what it expects. There’s documentation there as well, and it will tell you a lot. You can see that "keep" is something you supplied.
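If you would rather not open the source file, a quick sketch using the standard library's inspect module shows the same information from a Python session. This is just one way to check, and the exact printed signature may vary between pandas versions.

```python
import inspect

import pandas as pd

# Look up the parameters that Series.nlargest accepts.
signature = inspect.signature(pd.Series.nlargest)
print(signature)

# The parameter names show that the second user-supplied argument is keep.
param_names = list(signature.parameters)
print(param_names)
```

Calling help(pd.Series.nlargest) prints the full docstring, including the allowed values for keep.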

Thank you for pointing that out. I can see how I got confused on that.

That was the issue. Thank you for your help!

Is it possible to use pd.DataFrame.rank() to get the highest 10?

Not sure, but I think it already does? My memory is fuzzy from this lesson.

Yes and no. Yes, it is possible, but not without a few additional steps (unless you don’t care about the actual ratings of the chocolate).

DataFrame.rank() and Series.rank() don’t rearrange your table. Those methods return the ranking of the values in your rows or columns. (See the pandas documentation for .rank().)

So, simply calling .rank() at the end will return the name of each company and its ranking, but not the Rating column.
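A small sketch of that behavior, using made-up mean ratings rather than the real data:

```python
import pandas as pd

# Hypothetical mean ratings per company, for illustration only.
company_ratings = pd.Series(
    [3.5, 4.0, 2.5],
    index=pd.Index(["A", "B", "C"], name="Company"),
    name="Rating",
)

# rank() keeps the original order and index; the values become ranks.
ranks = company_ratings.rank(ascending=False)
print(ranks)
```

The output still lists A, B, C in their original order, and the 3.5, 4.0, 2.5 ratings have been replaced by the ranks 2.0, 1.0, 3.0.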

If you want to use .rank() and still keep the Rating column, you have to go through the following steps:

# Although df is a DataFrame, selecting `Rating` from the groupby
# and taking the mean returns a pandas Series here.
company_ratings = df.groupby('Company').Rating.mean()

# So, we must first convert back to a DataFrame to add another column
company_ratings_df = pd.DataFrame(company_ratings)

# Then, we add a column, `rank`, that ranks the highest mean as rank 1
company_ratings_df['rank'] = company_ratings_df['Rating'].rank(ascending=False)

# Then we still have to sort it by the rank and get the top 10 ranks
top_ten_ranked = company_ratings_df.sort_values('rank').head(10)

print(top_ten_ranked)

As you can see, this is a lot more work than just using .nlargest() on the original company_ratings Series.

Even if you wanted to avoid using .nlargest() for whatever reason, using .rank() is still not the fastest way to go about it. Consider the following two print() statements that will print the top ten chocolate ratings from the company_ratings_df DataFrame and the company_ratings Series, respectively:

# For DataFrames, we specify the column name
print(company_ratings_df.sort_values('Rating', ascending=False).head(10))

# Series only have 1 column by definition, so no column name needed
print(company_ratings.sort_values(ascending=False).head(10))
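As a quick check, this sketch compares sort_values().head() with .nlargest() on a made-up Series with no tied ratings. With ties, the two approaches may order the tied rows differently, so treat the equality below as holding for this sample only.

```python
import pandas as pd

# Hypothetical mean ratings with no ties, for illustration only.
company_ratings = pd.Series(
    [3.5, 4.0, 2.5, 3.75],
    index=pd.Index(["A", "B", "C", "D"], name="Company"),
    name="Rating",
)

# Sort descending and keep the first two rows.
via_sort = company_ratings.sort_values(ascending=False).head(2)

# Ask for the two largest values directly.
via_nlargest = company_ratings.nlargest(2)

print(via_sort.equals(via_nlargest))
```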

Hope this helps — happy coding!
