What is cross tabulation?


In the context of this exercise, what is cross tabulation?


In general, cross tabulation is a way to summarize the relationship between two (or more) categorical variables by counting how often each combination of their values occurs.

This exercise gives an example: the table below shows, for each cluster label, the number of data points that actually belong to each species.

labels    setosa    versicolor    virginica
0              0             2           36
1             50             0            0
2              0            48           14

Based on this cross tabulation, we can see how accurate the results were, and far more easily and clearly than if we had gone through every row in the dataset to work out the accuracy.
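As a rough sketch of how a table like this is produced, the snippet below rebuilds the (label, species) pairs from the counts shown above and passes them to pd.crosstab(). The DataFrame and its column names are assumptions for illustration; in the exercise the labels would come from the clustering model and the species from the dataset.

```python
import pandas as pd

# Counts copied from the table above: (cluster label, species) -> number of points.
counts = {
    (0, "versicolor"): 2,
    (0, "virginica"): 36,
    (1, "setosa"): 50,
    (2, "versicolor"): 48,
    (2, "virginica"): 14,
}

# Expand the counts into one row per data point.
rows = [pair for pair, n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["labels", "species"])

# Rows are the cluster labels, columns are the species, cells are frequencies.
ct = pd.crosstab(df["labels"], df["species"])
print(ct)
```

Combinations that never occur, such as label 0 with setosa, show up as 0 in the result.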

Using the pd.crosstab() function in Pandas, we can perform cross tabulation on our data, and by default it produces a frequency table. However, we can also apply an aggregate function to analyze information other than the frequencies. For example, we can get the average of a column of values by passing that column to the values parameter and the np.average function to the aggfunc parameter (the two must be used together), like so,

pd.crosstab(..., values=..., aggfunc=np.average)
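Here is a minimal sketch of a non-frequency cross tabulation. The petal_length numbers are made up for illustration, and the string "mean" is used as the aggregate, which pandas accepts in place of a function.

```python
import pandas as pd

# Hypothetical data: the petal_length values are invented for this example.
df = pd.DataFrame({
    "labels": [0, 0, 1, 1, 2, 2],
    "species": ["virginica", "virginica", "setosa", "setosa",
                "versicolor", "versicolor"],
    "petal_length": [5.0, 6.0, 1.0, 2.0, 4.0, 5.0],
})

# values and aggfunc go together: each cell holds the mean petal_length
# of the rows with that (label, species) combination.
means = pd.crosstab(
    df["labels"],
    df["species"],
    values=df["petal_length"],
    aggfunc="mean",
)
print(means)
```

Cells for combinations with no rows come out as NaN rather than 0, because there is nothing to average.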

I still don’t know how to read the results of a cross tabulation. For example, are the results shown here bad because the 50 samples with label 1 are categorized as “setosa”, which has label 0?

The important thing is how many data points of each species are put into the same group. The numbers 0, 1, 2 in this case are just arbitrary names for the groups. For setosa, all data points were put into one group, which means the result is very exact for that species.


Does that mean that our method was very exact, but wrong?

Like, high accuracy and low precision?

It should be noted that 0, 1, 2 in target and 0, 1, 2 in labels do not necessarily have the same meaning. In this example, 1 in labels means setosa (which is 0 in target).
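One simple way to read the table, sketched below, is to treat each label as standing for whichever species is most common in its row. This is an assumption made for illustration, not the exercise's own method; the counts are copied from the table above.

```python
import pandas as pd

# The cross tabulation from the exercise, typed in by hand.
ct = pd.DataFrame(
    {
        "setosa": [0, 50, 0],
        "versicolor": [2, 0, 48],
        "virginica": [36, 0, 14],
    },
    index=pd.Index([0, 1, 2], name="labels"),
)

# For each label (row), pick the species with the largest count.
label_to_species = ct.idxmax(axis=1).to_dict()
print(label_to_species)

# Points that fall in their label's majority species, as a share of all points.
correct = ct.max(axis=1).sum()
agreement = correct / ct.values.sum()
print(correct, round(agreement, 3))
```

Under that assumption, label 1 maps to setosa even though setosa is 0 in target, and 134 of the 150 points agree with the mapping.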

Cross tabulation does the following:

For every value you have in field x, you get an output cell for every unique value of field y, and each cell holds (by default) the number of rows with that combination. So, if field x has 3 shirts (plaid, solid, and flower print) and field y has blue jeans and cargo pants, your output will have a cell for each of these combinations:

plaid shirt | blue jeans
plaid shirt | cargo pants
solid shirt | blue jeans
solid shirt | cargo pants
flower print shirt | blue jeans
flower print shirt | cargo pants
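To make the shirts and pants example concrete, here is a small sketch with a made-up list of outfits (the rows are hypothetical). Each of the six combinations above becomes one cell of the table, holding the number of times that combination appears.

```python
import pandas as pd

# Hypothetical outfits, one row per observation.
outfits = pd.DataFrame({
    "shirt": ["plaid", "plaid", "solid", "flower print", "solid", "plaid"],
    "pants": ["blue jeans", "cargo pants", "blue jeans", "blue jeans",
              "blue jeans", "blue jeans"],
})

# 3 shirt values x 2 pants values -> a 3 x 2 table of counts.
ct = pd.crosstab(outfits["shirt"], outfits["pants"])
print(ct)
```

Combinations that never appear in the data, such as a solid shirt with cargo pants here, get a count of 0.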

Since every “setosa” sample is grouped under label 1, we can conclude that “setosa” has been accurately classified.
This differs from the lesson, which defined “setosa” as label 0 in advance.