Referring to a column in a pivot


#1

In the “A/B Testing for Shoefly.com” project for the Aggregates in Pandas lesson in the course Data Analysis With Pandas, the following piece of code is presented as a solution to question 6:

clicks_pivot['percent_clicked'] = (clicks_pivot[True] * 100)/(clicks_pivot[True] + clicks_pivot[False])

Here is a link to the project:

https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/dspath-agg-pandas/projects/pandas-shoefly-ab-test

Throughout the course it was reiterated that columns in a DataFrame should be referred to as either df.column_name or df['column_name']. However, neither format worked when I tried to refer to the columns in the pivot table as clicks_pivot.True or clicks_pivot['True']; both gave me errors. Neither the course nor the explanatory video at the end of the project explains why clicks_pivot[True] works here.

Could anybody please explain why referring to a column as df[column_name], without quotation marks, works for a pivot table?

Thank you


#2

Hi @ivannaydenov92804819,

In addition to str objects, int or bool objects can serve as the names of columns. In order to offer flexibility, Pandas was designed so that dot notation can be used to refer to columns that have names of str type, so long as those names are valid Python identifiers. However, in cases where column names contain characters such as spaces, or are of int or bool type, we need to use brackets that contain expressions that evaluate to the column names. True, False, 7, 'largest city', or variables with appropriate values are some examples of such expressions.
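Here is a minimal sketch of bracket access with names that are not plain identifiers. The DataFrame and its column names are made up for illustration and are not from the project.

```python
import pandas as pd

# Hypothetical data; the column names deliberately mix str and int.
df = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "largest city": [True, False],
    7: [1.5, 2.5],
})

# Brackets accept any expression that evaluates to the column name.
by_space_name = df["largest city"]
by_int_name = df[7]

# A variable holding the name works as well.
key = 7
by_variable = df[key]

# Dot notation is available only for str names that are valid identifiers.
by_dot = df.city
```

Note that df[7] looks up the column named 7, not the eighth column by position.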


#3

Hi appylpye,

Thank you for the quick reply. I understand the purpose of using df['column name'] in cases where there is a space in the name of the variable. However, I do not understand why, if the column’s name is "True", we cannot use df.True or df['True'], both of which I have tried unsuccessfully.

Kind regards


#4

Pandas adds some conveniences that may, at first, seem to go beyond conventional Python behavior. Dot notation is one of them. When you write df.column_name, Pandas looks that attribute name up among the column labels, so it works only when the label is a str that is also a valid Python identifier. No separate variables are created for the columns. Where a label is not a valid identifier, for example because it contains a space or is an int or bool, only bracket notation is available. It’s a design decision by the creators of Pandas.
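As a rough illustration of where dot notation stops working, the sketch below uses a made-up DataFrame and Python's own keyword and compile helpers. It only checks whether an expression would parse; it does not show how Pandas is implemented internally.

```python
import keyword
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "largest city": [True, False]})

# Dot and bracket notation return the same column for an identifier-like name.
same = df.city.equals(df["city"])

# A name with a space is not a valid identifier, so it cannot follow a dot.
space_name_ok = "largest city".isidentifier()

# True is a reserved keyword, so Python rejects `df.True` while parsing,
# before Pandas is involved at all.
is_keyword = keyword.iskeyword("True")
try:
    compile("df.True", "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False
```

So the error from clicks_pivot.True comes from Python's parser rather than from the pivot table.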

This paragraph was added on October 22, 2018: Note that the columns here are named True and False, as bool objects, rather than "True" and "False" as str objects. That is why clicks_pivot['True'] fails: the str 'True' does not match the bool label True. Dot notation fails for a different reason: True is a reserved word in Python, so clicks_pivot.True is a syntax error. You could experiment with creating such columns with str names, in which case clicks_pivot['True'] would work, though dot notation still would not. It is probably best not to use such names in an actual production environment, as they may cause confusion.
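If you want to try this yourself, here is a small sketch. The data and the column names utm_source, is_click and user_id are stand-ins I am assuming for illustration, not the project's actual dataset.

```python
import pandas as pd

# Hypothetical stand-in data: a count of users per source who did (True)
# or did not (False) click.
clicks = pd.DataFrame({
    "utm_source": ["email", "email", "google", "google"],
    "is_click": [True, False, True, False],
    "user_id": [20, 80, 30, 70],
})

# The values of is_click become the column labels, so the labels are bools.
clicks_pivot = clicks.pivot(index="utm_source", columns="is_click", values="user_id")

# Bool labels are looked up with bool keys.
percent_clicked = clicks_pivot[True] * 100 / (clicks_pivot[True] + clicks_pivot[False])

# The str "True" is a different label and is not found.
try:
    clicks_pivot["True"]
    str_label_found = True
except KeyError:
    str_label_found = False

# Converting the labels to str makes the quoted form work.
with_str_names = clicks_pivot.rename(columns=str)
clicked_by_str = with_str_names["True"]
```

Renaming to something descriptive, such as "clicked" and "not_clicked", would probably be clearer than keeping "True" and "False" as strings.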

You may notice other behaviors in Pandas that seem “magical”, such as ones associated with applying conditions to objects within DataFrames. These can be attributed to the fact that the functioning of operators can be redefined via Python’s “magic methods”. See the Data model section of the Python documentation. Accordingly, Pandas defines __eq__ and other such methods for some of its types, and introduces its own behaviors in those definitions. These alter the behavior of ==, <, >, and other familiar operators for those types of objects, for example by comparing element by element instead of returning a single True or False.
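To make the idea of redefined operators concrete, here is a minimal sketch. The Celsius class is hypothetical and exists only to show __eq__; the Series part shows the element-by-element comparison you see when applying conditions.

```python
import pandas as pd


class Celsius:
    """Hypothetical class that redefines == via the __eq__ magic method."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __eq__(self, other):
        # Treat readings within half a degree of each other as equal.
        return abs(self.degrees - other.degrees) < 0.5


# == now calls Celsius.__eq__ instead of comparing object identity.
custom_equal = Celsius(20.0) == Celsius(20.3)

# On a Series, > compares each element and returns a Series of bools.
prices = pd.Series([5, 12, 8])
mask = prices > 7

# That boolean Series can then be used inside brackets to filter rows.
expensive = prices[mask]
```

The same mechanism is what lets a condition such as df[df.price > 7] select rows.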