For the provided dataframe, what was the purpose of the "label" value?

Question

In the context of this exercise, for the provided dataframe, what was the purpose of the “label” value?

Answer

In this exercise, we are provided with the following dataframe:

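import numpy as np
import pandas as pd

# set_one ... set_four are the four arrays of values supplied by the exercise
n = 500
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})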

The purpose of the “label” value is to provide an x value for every y value of the plot: the x values come from “label”, and the y values come from “value”.

Each of the datasets set_one, set_two, set_three, and set_four has 500 values, which is why the variable n has been set to 500.

This gives us a total of 500 + 500 + 500 + 500 = 2000 values in the concatenated dataset. As a result, we need 500 x values for each dataset. The x values will be set as the strings "set_one", "set_two", "set_three", "set_four". So,

["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n
= 
["set_one"] * 500 + ["set_two"] * 500 + ["set_three"] * 500 + ["set_four"] * 500

# gives us 500 of each label string, 
# for a total of 2000 x values in a single list:
["set_one", ..., "set_two", ..., "set_three", ..., "set_four", ...]

We thus have 2000 x values, each paired one-to-one with one of the 2000 y values.
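To see the list arithmetic above on a scale that fits on screen, here is a minimal sketch that uses n = 3 as a stand-in for 500 (the small n is my assumption, purely for readability):

```python
# Small stand-in for n = 500 so the result is easy to read.
n = 3

# Multiplying a one-element list by n repeats that element n times;
# "+" then joins the four repeated lists end to end.
labels = ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n

print(len(labels))  # 4 * n entries in total
print(labels[:4])   # the first n entries are "set_one", then "set_two" begins
```

With n = 500 the same expression yields the 2000 labels described above.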


Hi,

Sorry, I don’t understand: if every set already consists of 500 items, why do we need to multiply each dataset by 500?


I had the same question. Then I refreshed my memory on what makes up a dataframe. This article really helped me. I figured it had something to do with building a second column, and that’s precisely what’s needed. The multiplication does not touch the data itself; it simply repeats each label string. Since each group holds 500 values, each label is repeated 500 times, one label per value.
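A rough sketch of why the repetition is needed: both columns of a dataframe must have the same length, and pandas raises a ValueError when they do not. The random set_one below is a hypothetical stand-in for the exercise's data, not the real values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the exercise's datasets (500 numbers).
set_one = np.random.normal(loc=70, scale=10, size=500)

# The multiplication acts on the one-element list ["set_one"], not on the data.
labels = ["set_one"] * 500
print(len(labels), len(set_one))  # both columns have 500 entries

# With matching lengths the dataframe builds fine.
df_ok = pd.DataFrame({"label": labels, "value": set_one})

# With too few labels, pandas refuses to build the dataframe.
try:
    pd.DataFrame({"label": ["set_one"] * 10, "value": set_one})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

So the `* n` is there only to produce one label per data point.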


Expanding a little on this, you can also print df to inspect the dataframe. The result is nothing complicated: just a new column of labels, created so that the sets can be told apart in the plot.

Just type:
print(df.head(10))

The resulting dataframe is something like this:

   label    value
0  set_one  94.60563097311541
1  set_one  54.772845125360085
2  set_one  73.6373383113399
3  set_one  55.015799487801196
4  set_one  83.60937763169507
5  set_one  80.4478701408596
6  set_one  73.76404453256899
7  set_one  64.59549487329862
8  set_one  92.71874024465632
9  set_one  75.7731226321676
etc.
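Since head(10) only shows rows labelled set_one, a couple of extra checks can confirm the other labels are there too. This is a sketch that rebuilds df from randomly generated stand-in data (an assumption on my part; the exercise supplies the real set_one to set_four):

```python
import numpy as np
import pandas as pd

n = 500
rng = np.random.default_rng(0)
# Hypothetical stand-ins for the exercise's four datasets.
set_one, set_two, set_three, set_four = (rng.normal(70, 10, n) for _ in range(4))

df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

print(df.shape)                    # (rows, columns)
print(df["label"].value_counts())  # how many rows carry each label
print(df.tail(3))                  # the last rows belong to set_four
```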

Hope it helps, cheers! :beer:


Previously, we plotted bar charts using the same dataset. In this exercise, we are combining the data from four datasets (set_one, set_two, set_three, set_four) and then graphing them to display the mean value of each dataset.

Bar charts plot bivariate data, with one column for x and one for y (in this case, ‘label’ and ‘value’). To plot all four datasets on the same figure and compare them, we have to combine them into a single dataframe (df). Since each dataset contains 500 values, n = 500 is used to repeat each label string ("set_one", "set_two", "set_three", and "set_four") 500 times, giving a total of 2000 labels. The values themselves are joined into a single array of 2000 values using np.concatenate, so each value lines up with the label of the set it came from.
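As a rough illustration of what the labels make possible, the sketch below groups the combined dataframe by ‘label’ and takes the mean of ‘value’, which gives the four numbers a bar chart of means would display. The datasets here are randomly generated stand-ins with centres I picked arbitrarily, not the exercise's data.

```python
import numpy as np
import pandas as pd

n = 500
rng = np.random.default_rng(1)
# Hypothetical stand-ins for the exercise's datasets, each with a different centre.
set_one = rng.normal(60, 10, n)
set_two = rng.normal(70, 10, n)
set_three = rng.normal(80, 10, n)
set_four = rng.normal(90, 10, n)

df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

# One mean per label: these are the four bar heights.
# sort=False keeps the labels in the order they first appear.
means = df.groupby("label", sort=False)["value"].mean()
print(means)
```

Without the ‘label’ column there would be no way to tell which of the 2000 values belongs to which set.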

You can print out the first ten items of the created dataframe to get a better understanding:

Just type
print(df.head(10))

and it prints:

   label    value
0  set_one  94.60563097311541
1  set_one  54.772845125360085
2  set_one  73.6373383113399
3  set_one  55.015799487801196
4  set_one  83.60937763169507
5  set_one  80.4478701408596
6  set_one  73.76404453256899
7  set_one  64.59549487329862
8  set_one  92.71874024465632
9  set_one  75.7731226321676
