FAQ: Aggregates in Pandas - Calculating Aggregate Functions IV

This community-built FAQ covers the “Calculating Aggregate Functions IV” exercise from the lesson “Aggregates in Pandas”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

Data Analysis with Pandas

FAQs on the exercise Calculating Aggregate Functions IV

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply (reply) below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply (reply) below!

Agree with a comment or answer? Like (like) to up-vote the contribution!

Need broader help or resources? Head here.

Looking for motivation to keep learning? Join our wider discussions.

Learn more about how to use this guide.

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

The example text is: df.groupby([‘Location’, ‘Day of Week’])[‘Total Sales’].mean().reset_index() which is missing a . between the groupby series and [‘Total Sales’]. It’s odd, but I tested this input in the console with and without the period, and both seemed to work. It worked in both groupby(column1).[‘column’].function() format and groupby().column.function() format. Does this period matter at any point?

1 Like

I am VERY frustrated by this.

The example is inconsistent with the rest of the lessons.

Codecademy example:
df.groupby([‘column1’], [‘column2’])[‘Total Sales’].mean().reset_index()

Proceeding into the lesson, we are asked to create a solution using multiple columns as the example.
My solution was:
shoe_counts = orders.groupby([‘shoe_type’],[‘shoe_color’])[‘id’].count().reset.index()

This threw errors and wouldn’t let me pass. I followed the expressed syntax.

So I frustratedly click the ‘show solution’ button, and the solution states:
shoe_counts = orders.groupby([‘shoe_type’, ‘shoe_color’]).id.count().reset_index()

I would VERY much appreciate this to be clarified.
I will utilize the knowledge from the example, as it APPEARS to be more syntactically proper.

Please validate if the example is a data entry error.

@array3000191634, @analytics-nerd

Your questions are related to one another, so I will attempt to address them both with this response. There are two possible functionally correct syntaxes which provide a solution to the code problem in this lesson. They are:

shoe_counts = orders.groupby([‘shoe_type’, ‘shoe_color’]).id.count().reset_index()

and

shoe_counts = orders.groupby([‘shoe_type’, ‘shoe_color’])[‘id’].count().reset_index()

The reason either of these options will work was covered earlier in the module, in the lesson ‘Creating, Loading, and Selecting Data with Pandas’, exercise '7. Select Columns’. An excerpt from that lesson’s text reads:

There are two possible syntaxes for selecting all values from a column:

  1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type customers[‘age’] to select the ages.
  2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: df.MySecondColumn . In our example, we would type customers.age .

Either one will work, but typically you would see the df[‘column’] syntax used when the name of the column includes spaces or other special characters that cannot be parsed using the df.column syntax or when you need to select more than one column. As another contributor pointed out here – Pandas does not impose a strict requirement on how to case your column names, but it is recommended to utilize “snake_case” column names, because it gives you the freedom to select columns from the dataframe using either syntax.

Note that in this lesson’s example, the column that is the subject of the calculation function is [‘Total Sales’]. The column name includes a space, which requires using the df[‘column’] format.

Now, @array3000191634, as for why the code appears to work with or without periods, this is an issue with the Codecademy script that fails to update the ‘mini-browser’ window after certain lesson points are reached. I noticed this occurring in my browser on both this and a previous exercise – sometimes, once you have printed a df table to the mini-browser, even if you comment out all of your code and hit ‘Run’ the mini-browser does not refresh properly and will continue to display the old data. If you completely reset the exercise and attempt to enter the code for the first time using the incorrect syntax (note use of both . and [‘’] or neither): …orders.groupby([‘shoe_type’, ‘shoe_color’]).[‘id’]… or …orders.groupby([‘shoe_type’, ‘shoe_color’])id…, it will throw a syntax error.

Hope that helps!

3 Likes

shoe_counts = orders.groupby([‘shoe_type’, ‘shoe_color’]).id.count().reset_index()

Question: for the code above, what is the reasoning to include “id” ?

1 Like

Hi! I’ll try to explain:

With groupby you are just grouping the data in the categories you desire (this being shoe_type AND shoe_color for this exercise), but once you group by those criteria, then what? There are many calculations that can be done with those groups, one of them being to count the shoes that fall into the different groups created for every type of shoe in every color.

Count performs a count of every row in the dataframe that falls under the criteria of every group, but you still have to provide count with a column to work with.

ID is used to instruct count where to count, but you could also use shoe_type or last_name (or any other column with data in them) as it will make no difference as long as there is something to count.

Hope it helps to resolve your question, cheers!

2 Likes

there is a df.columnName.mean(), .max(), etc, but no .sum()?

In the example:

df = pd.DataFrame([
  ["Amy","Assignment 1",75],
  ["Amy","Assignment 2",35],
  ["Bob","Assignment 1",99],
  ["Bob","Assignment 2",35]
  ], columns=["Name", "Assignment", "Grade"])

df.groupby('Name').Grade.mean()

What if I wasn’t interested in each student’s mean score, but rather the total points? How would I determine the total points per student?

There is also a .sum() method. See the documentation.

thank you – I had looked in a couple of places and saw a subset of methods but no sum. And i tried it on my own, and initially didn’t get it to work. The codecademy cheatsheet does not mention it.

In the exercise immediately before this one, the syntax used only parenthesis when selecting the column to group by.

orders.groupby('shoe_color').price.etc......

But in this exercise, it uses parenthesis and brackets…

orders.groupby(['shoe_type','shoe_color']).id.etc....

Why isn’t it just groupby('shoe_type','shoe_color').id.etc...
Why the sudden need for brackets that weren’t needed with only one column, is it reading it like a list?

I would like to make a suggestion to this exercise to add renaming the final column in the created database as review.
We learned how to rename columns but it was ages ago and having ID as the column name makes NO sense for your resulting table. It is not “id” it should be “count” or something similar This suggestion would also be good for the excercise “Calculating Aggregate Functions II” where the price is not really true it is “max_price” it is misleading to the table.
Here is the code to

shoe_counts.rename(columns={"id": 'count'}, inplace = True) print(shoe_counts)
2 Likes

Does the order of columns you want to group by matter?

how do I get the proportion of certain element in a column? say I have a DataFrame with 3 columns : shape, color, area. I want to have the the proportion of red objects within the shape.
my thought process is: get the count of each color within the shape first, so I have this (correct code):

df1 = df.groupby([‘shape’,‘color’]).side.count().reset_index()
df1= df1.rename(columns = {“side” :“count”})

then I am thinking of using apply to get a formula or something. please give me ideas lol

This problem is intriguing enough that I did extra practice to see if I can solve it. It requires a bit more than what has been taught in the course so far (merging dataframes). I am sure there are ways to do it only based on what has been taught but I thought the syntax is simple enough to make a fairly clean code here.

The following code expand on the exercise from this particular lesson and adds two new columns, one for the total based on the first group and second that shows the percentage.

import codecademylib3
import numpy as np
import pandas as pd

orders = pd.read_csv('orders.csv')
#Creating shoe counts by type and colour as per exercise:
shoe_counts = orders.groupby(["shoe_type", "shoe_color"]).id.count().reset_index()
shoe_counts.rename(columns={"id": "counts"}, inplace=True)
#Creating total counts grouped by the shoe type:
shoe_total_counts = orders.groupby(["shoe_type"]).id.count().reset_index()
shoe_total_counts.rename(columns={"id": "total_counts"}, inplace=True)
print(shoe_total_counts)
#Merge the total counts with the table created by the exercise so they are in the same table
shoe_counts = shoe_counts.merge(shoe_total_counts, how="left", on="shoe_type")
#Use Lambda function to calculate percentage, with percentage formatting and output it into its own column:
shoe_counts["percentage"] = shoe_counts.apply(lambda row: "{:.0%}".format(row["counts"] / row["total_counts"]), axis=1)
print(shoe_counts)

Resulting table:

shoe_type shoe_color counts total_counts percentage
0 ballet flats black 2 50 4%
1 ballet flats brown 11 50 22%
2 ballet flats navy 17 50 34%
3 ballet flats red 13 50 26%
4 ballet flats white 7 50 14%
5 sandles black 3 50 6%
6 sandles brown 10 50 20%
7 sandles navy 13 50 26%
8 sandles red 14 50 28%
9 sandles white 10 50 20%
10 stilettos black 8 50 16%
11 stilettos brown 14 50 28%
12 stilettos navy 7 50 14%
13 stilettos red 16 50 32%
14 stilettos white 5 50 10%
15 wedges brown 13 50 26%
16 wedges navy 16 50 32%
17 wedges red 4 50 8%
18 wedges white 17 50 34%