FAQ: Aggregates in Pandas - Calculating Aggregate Functions III

This community-built FAQ covers the “Calculating Aggregate Functions III” exercise from the lesson “Aggregates in Pandas”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

Data Analysis with Pandas

FAQs on the exercise Calculating Aggregate Functions III

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

What should I do if a column name has a period (dot) in it?
i.e. “price.item”, not price?

cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()

We calculate percentile this way:

high_earners = df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()

Why can’t we use syntax without a lambda function, as we do when calculating other statistics such as mean, median, etc.:

high_earners = df.groupby('category').wage.np.percentile(x, 75).reset_index()

I think that if percentile were available as a method or attribute of the Series object, it could be applied directly, like max, min, etc. I can only conclude that percentile is a function in the numpy module that is called on a Series object x passed to it. The lambda function is therefore used to pass the wage object to it.
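For what it is worth, pandas grouped Series do have a built-in quantile method, which takes a fraction between 0 and 1 rather than a percentage. Below is a minimal sketch comparing the two approaches. The DataFrame is hypothetical (made-up categories and wages, not the lesson's data), and the names with_lambda and with_quantile are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the lesson's DataFrame.
df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'wage': [10, 20, 30, 40, 100, 200, 300, 400],
})

# The lesson's approach: hand each group's wages to np.percentile.
with_lambda = (
    df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
)

# Built-in alternative: quantile expects 0.75 rather than 75.
with_quantile = df.groupby('category').wage.quantile(0.75).reset_index()

print(with_lambda)
print(with_quantile)
```

Both calls use linear interpolation by default, so on this sample they should agree.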

Where is np defined? Can you please explain?

Sorry about the first message, I found where np is defined.

I ran the code to calculate the 25th percentiles and also ran the calculations separately in Excel.

cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()

Results as follows:
   shoe_color  percentile(x, 25)  Excel
0  black       130                130
1  brown       248                248
2  navy        200                200
3  red         157                149
4  white       188                181
The results match for 3 colors, but are different for two colors. Furthermore, I checked the CSV data and there is no data point (red, 157) or (white, 188)… Any insights?

The syntax of lambda functions wasn’t addressed. Anyone able to explain it?
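As a rough illustration of the syntax: a lambda is a one-line anonymous function of the form lambda arguments: expression, and the value of the expression is what it returns. The sketch below shows a regular def next to the equivalent lambda. The function names and the prices list are hypothetical.

```python
import numpy as np

# A regular named function...
def percentile_25(x):
    return np.percentile(x, 25)

# ...and the equivalent lambda.
# Form: lambda <arguments>: <expression whose value is returned>
percentile_25_lambda = lambda x: np.percentile(x, 25)

prices = [100, 200, 300, 400, 500]  # hypothetical sample values

print(percentile_25(prices))
print(percentile_25_lambda(prices))
```

In the exercise the lambda is never given a name. It is passed straight to .apply(), which calls it once per group.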

I tried calculating percentile(x, 25) in Excel for the red shoes only, to check what is going on.

I noticed that Excel has two functions for calculating percentiles, PERCENTILE.INC and PERCENTILE.EXC. The difference is that the first includes the minimum and maximum data points, while the second doesn’t.
Using the first function (INC), the result matches Codecademy’s solution, i.e. 157. Using the second (EXC), it is 149. It seems that you used the PERCENTILE.EXC function. It also appears that Python’s method for calculating percentiles here includes the minimum and maximum.

As for your second comment, that the result is 157 even though no red shoe in the DataFrame is priced at 157:
Recall that we calculate percentiles, quartiles, etc. to find the values that divide our dataset into the groups we want. We then check which of our data falls into each group. Those dividing values may or may not coincide with actual data points in the DataFrame.

In the red-shoe example, the 25th percentile lies between ids 11 and 12, which correspond to prices of 149 and 165. The percentile, in terms of price, is therefore the mean of those two values (157).
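To make the INC versus EXC difference concrete, here is a small sketch of both position rules, written from my understanding of Excel's documented behaviour (INC interpolates over positions 0 to n-1, EXC uses rank p(n+1)). The helper names and the seven prices are hypothetical, chosen only so that the 25th percentile lands between 149 and 165. They are not the exercise's data.

```python
import numpy as np

def interpolate_at(data, pos):
    """Linearly interpolate sorted data at a zero-based fractional position."""
    lower = int(np.floor(pos))
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (pos - lower) * (data[upper] - data[lower])

def percentile_inc(values, p):
    """Inclusive rule: position = (n - 1) * p / 100 (assumed PERCENTILE.INC behaviour)."""
    data = np.sort(np.asarray(values, dtype=float))
    return interpolate_at(data, (len(data) - 1) * p / 100)

def percentile_exc(values, p):
    """Exclusive rule: rank = (n + 1) * p / 100, one-based (assumed PERCENTILE.EXC behaviour)."""
    data = np.sort(np.asarray(values, dtype=float))
    pos = (len(data) + 1) * p / 100 - 1  # convert to a zero-based position
    if pos < 0 or pos > len(data) - 1:
        raise ValueError("percentile out of range for the exclusive rule")
    return interpolate_at(data, pos)

# Hypothetical prices, not the exercise's data.
prices = [130, 149, 165, 170, 180, 190, 200]

inc_result = percentile_inc(prices, 25)
exc_result = percentile_exc(prices, 25)
numpy_result = np.percentile(prices, 25)

print(inc_result, exc_result, numpy_result)
```

On this sample the inclusive rule and np.percentile (with its default linear interpolation) agree, while the exclusive rule lands on a lower value.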

Have a look at the following link; the first graph is quite helpful:
https://www.codecademy.com/paths/data-science/tracks/learn-statistics-with-python/modules/quartiles-quantiles-and-interquartile-range/lessons/quartiles/exercises/quartiles

Okay, the instructions here are misleading in a couple of places.

  1. Prior to this, the student is not introduced to the numpy library in any formal way, nor is it explained how and why we import it. This needs to be explained, especially considering we are using it in this exercise. At the very least, we need some sort of lesson on python libraries and how and why we use them.

  2. We are given the following code as instructive in the lesson write-up:

high_earners = df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()

Technically, this syntax wouldn’t work, if a student tried to emulate it. For example, in the exercise, when you write the following, it will generate an error:

cheap_shoes = orders.groupby('shoe_color').price
      .apply(lambda x: np.percentile(x, 25))
      .reset_index()

You’ll actually need to write the following:

cheap_shoes = orders.groupby('shoe_color').price \
      .apply(lambda x: np.percentile(x, 25)) \
      .reset_index()

Note the backslashes in the above. Again, this sort of thing needs to be explained in the lesson write-up, since previous lesson solutions have this in the code without any explanation of what it does.
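An alternative to backslashes is to wrap the whole expression in parentheses, since Python lets an expression continue across lines while a bracket is open. A minimal sketch, using a hypothetical orders DataFrame in place of orders.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exercise's orders.csv.
orders = pd.DataFrame({
    'shoe_color': ['black', 'black', 'black', 'red', 'red', 'red'],
    'price': [100, 200, 300, 150, 250, 350],
})

# The open parenthesis lets the chain span several lines without backslashes.
cheap_shoes = (
    orders.groupby('shoe_color').price
    .apply(lambda x: np.percentile(x, 25))
    .reset_index()
)

print(cheap_shoes)
```

With parentheses there is also no risk of a stray space after a backslash, which is itself a syntax error.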

orders = pd.read_csv('orders.csv')
print(orders)
cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()

Can I add more columns to the printed result here?
For example, I want to show three columns: shoe_color, price (25th percentile), and max price.

You can use brackets instead of dot notation.

cheap_shoes = orders.groupby('shoe_color')['price.item'].apply(lambda x: np.percentile(x, 25)).reset_index()

I came up with a way to add columns later.

cheap_shoes = orders.groupby('shoe_color').price.apply(lambda p: np.percentile(p, 25)).reset_index()

cheap_shoes['max_price'] = orders.groupby('shoe_color').price.max().reset_index(drop=True)

print(cheap_shoes)

Perhaps I’ll find a better way in the exercises that follow, but I haven’t learned one yet.
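One possible alternative is to compute both columns in a single pass with .agg() and named aggregations, which I believe requires pandas 0.25 or later. This is only a sketch: the orders DataFrame below is hypothetical, and the column names percentile_25 and max_price are my own choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exercise's orders.csv.
orders = pd.DataFrame({
    'shoe_color': ['black', 'black', 'black', 'red', 'red', 'red'],
    'price': [100, 200, 300, 150, 250, 350],
})

# Each keyword becomes an output column; the value is the aggregation to run.
cheap_shoes = (
    orders.groupby('shoe_color').price
    .agg(
        percentile_25=lambda x: np.percentile(x, 25),
        max_price='max',
    )
    .reset_index()
)

print(cheap_shoes)
```

This avoids relying on the two groupby results having the same row order.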

Shouldn’t there be a window for command output for this exercise? I can finish this exercise but I can’t see my results. Perhaps a small bug.

Hey!
When I type in the command:
cheap_shoes = orders.groupby('shoe_color').price.apply(lambda x: np.percentile(x, 25)).reset_index()
the dot after price is a different color. It’s brown, whereas all the others are white.

Does anyone understand why this is?
Just wondering.

Hi,
I can see that in this lesson we are using

“import numpy as np”

Is there any specific reason to use that? I couldn’t find any explanation for using it in the lesson.
Thanks in advance

Hi,
Well, if you keep typing and let the editor wrap the code onto the next line, rather than starting a new line by pressing Enter, it works. Otherwise, you have to insert backslashes. It worked for me.

That is because percentile is a function provided by numpy. See the numpy documentation for numpy.percentile.

Hey, I have a question about how the percentile operates with groupby.

In a previous section, groupby is explained as enabling us to loop through subsets of values. But np.percentile takes an array-like input, not a single value.

So when we refer to x in the lambda function we are referring to the array output by the groupby function?
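One way to check this is to have the lambda report what it receives. In the sketch below (hypothetical orders data, illustrative variable names), x turns out to be a pandas Series holding all of one group's prices, which is array-like and so acceptable to np.percentile.

```python
import pandas as pd

# Hypothetical stand-in for the exercise's orders.csv.
orders = pd.DataFrame({
    'shoe_color': ['black', 'black', 'black', 'red', 'red', 'red'],
    'price': [100, 200, 300, 150, 250, 350],
})

grouped_prices = orders.groupby('shoe_color').price

# Report the type of object the lambda is called with for each group.
received = grouped_prices.apply(lambda x: type(x).__name__)

# Report how many prices that object holds for each group.
sizes = grouped_prices.apply(lambda x: len(x))

print(received)
print(sizes)
```

So the lambda is called once per group, not once per row.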

When should I add axis=1 to the end of a lambda function?