Cleaning Data with Python: Tidying Data by Reshaping Comprehension Problem

Tidy Data: Each Row as a Separate Observation

In the course How to Clean Data with Python the concept of “tidy data” is introduced.

For data to be tidy, it must have:
- Each variable as a separate column
- Each row as a separate observation (Diagnose the Data).

I Don’t Understand the Second Requirement

I don’t understand what “each row as a separate observation” means, or rather how this would look in practice. I will give two examples and pose my question below.

Example From the Course

For having “each row as a separate observation” it is recommended to reshape a table like this:

Account Checkings Savings
“12456543” 8500 8900
“12283942” 6410 8020
“12839485” 78000 92000

To a table of this form:

Account Account Type Amount
“12456543” “Checking” 8500
“12456543” “Savings” 8900
“12283942” “Checking” 6410
“12283942” “Savings” 8020
“12839485” “Checking” 78000
“12839485” “Savings” 920000

The Problem With My Understanding: Creating Huge and Unwieldy Tables

If I understand this correctly, this would mean that a table like this (taken from the final project of the course):

State TotalPop Hispanic White Black Native Asian Pacific Income GenderPop Men Woman
Colorado 5278906 20.78 69.9 3.55 0.57 2.66 NaN 6465780 2648667M_2630239F 2648667 2630239
Delaware 926454 8.82 64.63 20.74 0.26 3.27 NaN 6182798 448413M_478041F 448413 478041
Kansas 2892987 11.64 75.96 6.57 0.73 2.33 NaN 5388561 1439862M_1453125F 1439862 1453125
North Dakota 721640 2.83 87.45 1.28 5.65 0.96 NaN 5818811 367963M_353677F 367963 353677

Would have to be reshaped to this form:

State Variable Name Variable Amount
Colorado “Total Population” 5278906
Colorado “Hispanic Population Percentage” 20.87
Colorado “White Population Percentage” 69.9

… and so on.

Now, this is obviously not the case. To the best of my knowledge, this isn’t done. The solution to the final project doesn’t do this. This just seems wrong.

This leads me to believe I fundamentally don’t understand neither the base requirement: “each row as a separate observation”, nor the given example.

So my question is: What does “each row as a separate observation” mean? I would appreciate it a lot if someone could explain it to me and maybe give another example!

Yikes. Their examples are not very helpful so I see why you’re confused! Why would they list each account twice? The account number should be unique and not repeated. Right? :thinking: The way they have it–one would end up doing more work and using groupby['Account'] in pandas. I really don’t see how the 2nd version is “cleaner” b/c it’s not. (you’re correct).

Did you say that df1 was clean in that lesson? df2 is not clean; it’s confusing and the items repeat. The one thing I would add to df1 is a “total” column for each item.

df['Total'] = df.apply(lambda row: row['Checking'] + row['Savings'], axis =1)

            Checking  Savings  Total
12456543      8500     8900  17400
12283942      6410     8020  14430
12839485      7800    92000  99800

As far as your states & demographic data example is concerned, no, you wouldn’t reshape it to the 2nd version. That is unwieldy and confusing as you stated.

"each row is a separate observation’:

Account	Checkings	Savings
12456543	8500	8900
12283942	6410	8020
12839485	78000	92000

Each account number is a separate observation, each column is a variable.


state   total pop,   Hispanic,   White,   Black,   Native American, Asian

Same again: each row is a separate observation, or, state and each column is a variable.
You can group by states, calculate the means, etc. It is organized in a way that allows one to do so.

I hope that makes sense.?

1 Like

Hello @lisalisaj! Thank you very much for your reply!

Regarding your question about the example: The first dataframe is framed as untidy and the second one is framed as tidy.

For example, we would want to reshape a table like: [FIRST TABLE] [i]nto a table that looks more like: [SECOND TABLE] (Diagnose the Data).

Your example/explanation makes sense to me. Every row is a separate case/observation (e.g. a state, a person, a bank account) which can be associated with different variables. Thank you for clarifying. I can only assume the example in the above-mentioned course is wrong.

I appreciate your help!

1 Like