Page Visits Funnel Project - note same users listed multiple times within the tables

This is regarding the data science path, module 9, Page Visits Funnel Project - https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/dspath-multiple-tables-pandas/projects/multi-tables-proj

I was reminded of the importance of always getting familiar with the data you are working with when I noticed that the merged visits_cart DataFrame created in task #2 has 2052 rows, which exceeds the 2000 rows in the original visits table. This didn’t make sense to me at first, since we did a left join on visits: visits_cart = pd.merge(visits, cart, how='left')

After looking into it further, I noticed that some user IDs appear more than once in the cart table, and confirmed this is because some users visited the cart page more than once within the same visit - a fairly normal occurrence.

You can test this by comparing print(len(cart)), which gives 400, to print(cart.user_id.nunique()), which gives 348.
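A minimal check of the duplication, assuming the tables are loaded from CSVs as in the project (the file names here are my assumption; in the project the DataFrames may already be provided):

import pandas as pd

# Assumed file names for the two tables used below
visits = pd.read_csv('visits.csv', parse_dates=['visit_time'])
cart = pd.read_csv('cart.csv', parse_dates=['cart_time'])

# Rows vs. distinct users: a gap means some users reached the cart more than once
print(len(cart))               # 400 rows in the project data
print(cart.user_id.nunique())  # 348 distinct users

# Show the user_ids that appear more than once in the cart table
repeat_cart_users = cart[cart.user_id.duplicated(keep=False)].sort_values('user_id')
print(repeat_cart_users.head())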

There are also users that started the checkout process more than once within the same visit.

The instructional video does not take this into account. Technically, we can’t accurately answer the questions about the percentage of users who put a t-shirt in their cart, moved on to checkout, and then made a purchase without accounting for the fact that some users visited their cart more than once and others entered checkout more than once.

For instance, when answering question #5 in the project, "What percent of users who visited Cool T-Shirts Inc. ended up not placing a t-shirt in their cart?", we need to divide the total number of null cart times by the length of the original visits table instead of the merged visits_cart table, since the merged table contains duplicate user_id values.

print(null_cart_times / float(len(visits)))
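Putting that together, here is a sketch of question #5 with the denominator taken from the original visits table (null_cart_times is just my name for the count of rows with no cart_time):

# Rows with a null cart_time are visitors who never reached the cart page
null_cart_times = len(visits_cart[visits_cart.cart_time.isnull()])

# Divide by the original visits table, not the merged one, so the duplicate
# user_id rows introduced by the merge don't distort the percentage
print(null_cart_times / float(len(visits)))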

Based on this use case, I looked up how to remove duplicate rows from a dataframe based on the values in just one column. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

So, if you wanted to remove duplicate user IDs from the checkout table, you would enter checkout_no_duplicate_ids = checkout.drop_duplicates(subset='user_id', keep='first').reset_index(drop=True).


I believe I have just run into this same concern. I completed the exercise alone (my first one without looking at the video - I was very pleased with myself!) but then watched the solution video after I’d finished to check my results and my thought process.

I was concerned when my result for the percentage of users who proceeded to checkout but did not purchase a t-shirt was different from the one shown in the video solution - I got 6%, vs the 16% shown in the video.

My approach was to first subtract a count of all purchase events from a count of all checkout events (i.e. rows with a timestamp present), and then divide that difference by the number of unique users in the user_id column. My code was:

checkout_no_purchase = (all_data.checkout_time.count() - all_data.purchase_time.count()) / float(all_data.user_id.nunique())

I would be grateful to hear anyone else’s thoughts on my approach here, and on if/where I went wrong.

Thank you!


Hello, remember Python operator precedence… PEMDAS. Your code will first divide and then subtract the result from all_data.checkout_time.count().
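For what it’s worth, here is a toy illustration of how the parentheses change the order of evaluation (the counts are made up):

checkout_count, purchase_count, unique_users = 350, 250, 1000

with_parens = (checkout_count - purchase_count) / float(unique_users)   # subtract first: 0.1
without_parens = checkout_count - purchase_count / float(unique_users)  # divide first: 349.75

print(with_parens, without_parens)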

I agree with anna5185 that there are duplicates. We need to clean the data before taking ratios or calculating percentages.

I took a different approach: I count the user_id in each merged table… but in my case the result shows the progression of each client, and I will probably lose some data if one user buys several times.

not_cart = visits_cart[visits_cart.cart_time.isnull()].user_id.count()
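As a sketch, one way to "clean first, then take the ratio" would be to drop duplicate user_ids before counting (this assumes the visits_cart merge from earlier in the thread):

# Keep one row per user so repeat cart visits don't inflate any counts
visits_cart_unique = visits_cart.drop_duplicates(subset='user_id')

# Users with a null cart_time never reached the cart page
not_cart = visits_cart_unique[visits_cart_unique.cart_time.isnull()].user_id.count()
print(float(not_cart) / len(visits_cart_unique))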


Is it correct to make these computations without using the merge method at all?

visits_total = visits.user_id.nunique()
cart_total = cart.user_id.nunique()
checkout_total = checkout.user_id.nunique()
purchase_total = purchase.user_id.nunique()

print("Percent of users who visited but didn't placed in their cart: " + str(float(visits_total-cart_total)/visits_total))
print("Percent of users who placed in their cart but didn't proceed to checkout: " + str(float(cart_total-checkout_total)/cart_total))
print("Percent of users who proceeded to checkout but didn't purchase: " + str(float(checkout_total-purchase_total)/checkout_total))

My answers:

Percent of users who visited but didn’t place a t-shirt in their cart: 0.826
Percent of users who placed a t-shirt in their cart but didn’t proceed to checkout: 0.350574712644
Percent of users who proceeded to checkout but didn’t purchase: 0.362831858407


Please correct me if I’m wrong. Based on what you all said, when two dataframes are left merged and the right dataframe has duplicated matching values, will the resulting dataframe contain two rows even though the left one has the matching value only once?

This is really a good observation.
When I tried to use all_data to calculate the percentages, things went wrong.
It seems that after merging, all_data has 2373 records, while visits has only 2000 records.
This can be explained by the duplication.
If the right DataFrame has multiple rows with the same id that can be matched to the left, those rows pile up. So I guess there are 373 duplicated user records in total.
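A toy example of that piling-up behavior, plus a quick way to check the extra-row count (this assumes all_data was built with the chained left merges from the project):

import pandas as pd

# Toy data: the right table has two rows for user 'a'
left = pd.DataFrame({'user_id': ['a', 'b']})
right = pd.DataFrame({'user_id': ['a', 'a'], 'cart_time': ['t1', 't2']})
print(pd.merge(left, right, how='left'))
#   user_id cart_time
# 0       a        t1
# 1       a        t2
# 2       b       NaN

# Back in the project, the extra rows can be counted directly:
# print(len(all_data) - len(visits))                 # rows added by the merges
# print(len(all_data) - all_data.user_id.nunique())  # rows beyond one per user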

I’m glad so many others have spotted this issue. It needs to be fixed, no? The average time calculation at the end is skewed as a result.


Thanks so much for discovering the duplicate users in the cart data frame!

I was looking for the reason why the percentages calculated with all_data were different from calculating them separately. Finally found the answer here.

Thanks so much for discussing this issue. I encountered the same thing while working with the data and was wondering what went wrong with my code. Yet when I calculated the number of rows of visit_cart, it was exactly 2000 rows. The number only became different when I combined all the tables into all_data.

At first, I thought there were some problems with my calculations because I did them differently from the video. So I revised my code and followed the same approach the instructor did. Both ways yielded the same results: 82.6%, 25.31% and 16.8% respectively. Does anyone have the same results?

Also, my other question here is whether we should drop the rows where user_id is duplicated, since one user can leave and come back, and those should still be considered valid visits.

My code looks a bit different though. Could anyone review these lines for me? Much appreciated.

Hey man. Interestingly, after cleaning the data using .drop_duplicates(), I got the same results as you.
Yet our solutions are probably still problematic, since we don’t know whether we should drop all the duplicates or not.

all_data = visits.merge(cart, how='left').merge(checkout, how='left').merge(purchase, how='left').reset_index(drop=True)
all_data.drop_duplicates(subset='user_id', inplace=True)
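As a quick sanity check after the dedup (my addition), the row count should now match the number of distinct users:

print(len(all_data), all_data.user_id.nunique())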

Here is my code, btw:

I got the same answers as you. I’m still learning, but your solution using nunique was more efficient than what I did, which was to groupby and then count.