Page Visits Funnel: Why do the numbers not add up?

Hi there
I was wondering about the relationship between the numbers.

I understand that num of visits = num of cart + num of no_cart.
However after I calculate the num of visits (after the merge) as 2052, num of no_cart as 1652, I got 162 when calculating the num of the people putting stuff in the cart (in step 6). Shouldn’t it be 400 (2052-1652)?

When I went back to check len(visits), it came out as 2000, which is even more confusing. When calculating the num of visits, I used left merge, which includes all the data in visits. How can the numbers be different?

This is my code. I would really appreiciate the help!

https://gist.github.com/fd31e9269f4d1eebbb9014e6e73fd9be

2 Likes

Hi!

I had the same problem. I decided to calculate the percentage of visitors who don’t click thru to the carts by using the length of the visits table itself. It made sense to me that this would be the same length as the left join table of visits and cart. But that is not the case! I checked out the data in both the visits table and the merged table and found that a unique user might have several entries in the merged table. I believe this results from a user putting more than one item in their cart in a single visit. This explains the difference in size between the visits table and the merged table. So not anything that you’ve done wrong! And your code looks fine to me. It does find the percentages that they found in the help video.

The problem that I have is the difference in the percentage calculation itself. Isn’t it more accurate to only count the UNIQUE users for the percentage? But taking that into account means also only counting the cart click thrus for the first item placed into the cart for each user. In my mind, the funnel would be incorrect if it allowed counting multiple items from single users (as it does in the walkthru code example).

If anyone knows why one way is better than the other, I would love to hear it!
Thanks!

4 Likes

There seem to be 4 cases that can come out.

id visit cart check buy
1 O X X X
1 O O X X
1 O O O X
1 O O O O
There is a case where one user is duplicated, so the number of rows seems to increase as they are merged.

Here, my question is, is it correct to eliminate duplicate cases?
Can you make code while removing duplicates?
Or am I wrong?

3 Likes

This has been driving me crazy! I thought I WAS crazy. Haha.

I agree with @lesvrolyk that it seems that the numbers obtained counting duplicate entries would not be an accurate analysis of the funnel!

For the last percentage (users who checked out but did not purchase), I tried working with the all_data merged table rather than creating an additional merged table of checkout and purchase, by using this code:

#Number of users who checked out (not null values for checkout_time)
users_checkout = all_data[~all_data.checkout_time.isnull()])

#Number of users who checked out but did not purchase (not null values for checkout_time and null values for purchase_time)
users_checkout_but_not_purchase = all_data[(all_data.purchase_time.isnull()) & (~all_data.checkout_time.isnull())]

I then used len() to get the numbers to calculate the percentage and the numbers are different than in the walkthrough by the developer.

The problem is, I am not sure which is more accurate and why.

I would love a Code Academy developer to weigh in on this.

3 Likes

I was confused by the same problem. I tried to find out how much duplicates in each DataFrames, and it seems that there are considerable amount of duplicates compared to the overall size:

print(len(visits))  # 2000
print(visits.user_id.nunique())  # 2000
# There is no duplicate user id at least in visits.

print(len(cart))  # 400
print(cart.user_id.nunique())  # 348
# There are 52 duplicates.

print(len(checkout))  # 360
print(checkout.user_id.nunique())  # 226
# There are 134 duplicates.

print(len(purchase))  # 252
print(purchase.user_id.nunique())  # 144
# There are 108 duplicates.

I agree with @lesvrolyk too. Duplicate rows should not be counted. I found another topic discussing the same issue and a way to eliminate duplication is mentioned there:

The way to do it is to add .drop_duplicates(subset='user_id') to the DataFrames. As described in Pandas document, this method removes duplicate rows.

2 Likes

.drop_duplicates! Nice and easy solution to this problem. Thanks for the post!

1 Like

looks like someone did some clean up on this exercise… but there are still duplicates in the purchase csv
still… they led me here and now i know how to drop_duplicates :smiley:

visits: 2000, dedupped: 2000
cart: 348, dedupped: 348
checkout: 226, dedupped: 226
purchase: 252, dedupped: 144