Page Visits Funnel Project - note same users listed multiple times within the tables

This is regarding the data science path, module 9, Page Visits Funnel Project -

I was reminded of the importance of always making sure you are familiar with the data you are working with when I noticed that the merged visits_cart table created in task #2 has 2052 rows, which exceeds the 2000 rows in the original visits table. This didn't make sense to me at first, since we did a left join on visits: visits_cart = pd.merge(visits, cart, how='left')

After looking into it further, I noticed that some users appear more than once in the cart table, and confirmed this is because they visited the cart page more than once within the same visit - a fairly normal occurrence.

You can test this by comparing print(len(cart)) (400) with print(cart.user_id.nunique()) (348).
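As a quick way to see which user_ids repeat, here is a sketch on a made-up cart table (the real one from the project has 400 rows and 348 unique users):

```python
import pandas as pd

# Hypothetical stand-in for the project's cart table: user 'a' visited the cart twice.
cart = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'c'],
    'cart_time': pd.to_datetime(
        ['2017-01-01 10:00', '2017-01-01 10:05',
         '2017-01-02 11:00', '2017-01-03 12:00'])
})

print(len(cart))               # total rows, counting repeat visits -> 4
print(cart.user_id.nunique())  # distinct users -> 3

# List the user_ids that appear more than once:
repeats = cart.user_id.value_counts()
print(repeats[repeats > 1])
```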

There are also users that started the checkout process more than once within the same visit.

The instructional video does not take this into account. Technically, we can't accurately answer the question about the percentage of users who put a t-shirt in their cart, then moved on to checkout, and then made a purchase without accounting for the fact that some users visited their cart more than once and others entered checkout more than once.

For instance, when answering question #5 in the project, "What percent of users who visited Cool T-Shirts Inc. ended up not placing a t-shirt in their cart?", we need to divide the total number of null cart times by the length of the original visits table instead of the merged visits_cart table, since the merged table contains duplicate user_id values.

print(null_cart_times / float(len(visits)))
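Here is that fix sketched end to end on tiny made-up tables (the real visits and cart come from the project's CSV files):

```python
import pandas as pd

# Hypothetical stand-ins for the project's visits and cart tables.
visits = pd.DataFrame({
    'user_id': ['a', 'b', 'c', 'd'],
    'visit_time': pd.to_datetime(['2017-01-01'] * 4)
})
cart = pd.DataFrame({
    'user_id': ['a', 'a'],  # user 'a' opened the cart page twice
    'cart_time': pd.to_datetime(['2017-01-01 10:00', '2017-01-01 10:05'])
})

visits_cart = pd.merge(visits, cart, how='left')
print(len(visits_cart))  # 5 rows, even though visits has only 4

null_cart_times = visits_cart.cart_time.isnull().sum()
# Divide by the original visits table, not the inflated merge:
print(null_cart_times / float(len(visits)))  # 3 / 4 = 0.75
```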

Based on this use case, I looked up how to remove duplicate rows from a dataframe based on the values in just one column.

So, if you wanted to remove duplicate user IDs from the checkout table, you would enter checkout_no_duplicate_ids = checkout.drop_duplicates(subset='user_id', keep='first').reset_index(drop=True).
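For example, on a hypothetical checkout table in which one user started checkout twice:

```python
import pandas as pd

# Hypothetical checkout table: user 'a' started checkout twice.
checkout = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'checkout_time': pd.to_datetime(
        ['2017-01-01 10:10', '2017-01-01 10:20', '2017-01-02 11:10'])
})

# keep='first' keeps the first row in table order for each user_id
# (that is the earliest checkout only if the table is sorted by time):
checkout_no_duplicate_ids = checkout.drop_duplicates(
    subset='user_id', keep='first').reset_index(drop=True)

print(len(checkout))                   # 3
print(len(checkout_no_duplicate_ids))  # 2 - one row per user
```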


I believe I have just run into this same concern. I completed the exercise alone (my first one without looking at the video, was very pleased with myself!) but then watched the solution video after I’d finished to check my results and my thought process.

I was concerned when my result for the percentage of users that proceeded to checkout but did not purchase a t-shirt was different from the one shown in the video solution - I got 6%, vs. 16% shown in the video.

My approach was to first subtract a count of all purchase events from a count of all checkout events (i.e., rows with a timestamp present), and then divide that difference by the number of unique users in the user_id column. My code was:

checkout_no_purchase = (all_data.checkout_time.count() - all_data.purchase_time.count()) / float(all_data.user_id.nunique())

I would be grateful to hear anyone else’s thoughts on my approach here and if / where I went wrong?

Thank you!


Hello, remember Python operator precedence (PEMDAS). Your code will first divide and then subtract the result from all_data.checkout_time.count().
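A minimal illustration of the precedence point, using made-up counts (note that with the parentheses shown in the post above, the subtraction does run first; the problem only appears if they are omitted):

```python
# Hypothetical counts standing in for the project's real data.
checkout_count = 600   # non-null checkout_time rows
purchase_count = 300   # non-null purchase_time rows
unique_users = 400     # all_data.user_id.nunique()

# Without parentheses, / binds tighter than -, so the division runs first:
no_parens = checkout_count - purchase_count / float(unique_users)
print(no_parens)    # 600 - 0.75 = 599.25

# With parentheses, the subtraction runs first:
with_parens = (checkout_count - purchase_count) / float(unique_users)
print(with_parens)  # 300 / 400 = 0.75
```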

I agree with anna5185 there are duplicates. We need to clean the data before making ratios or calculating percentages.

I took a different approach: I counted the user_id values in each merged table… but in my case the result reflects the progression of each client, and I will probably lose some data if one user buys several times.



Is it correct to make these computations without using the merge method at all?

visits_total = visits.user_id.nunique()
cart_total = cart.user_id.nunique()
checkout_total = checkout.user_id.nunique()
purchase_total = purchase.user_id.nunique()

print("Percent of users who visited but didn't place a t-shirt in their cart: " + str(float(visits_total - cart_total) / visits_total))
print("Percent of users who placed a t-shirt in their cart but didn't proceed to checkout: " + str(float(cart_total - checkout_total) / cart_total))
print("Percent of users who proceeded to checkout but didn't purchase: " + str(float(checkout_total - purchase_total) / checkout_total))

My answers:

Percent of users who visited but didn't place a t-shirt in their cart: 0.826
Percent of users who placed a t-shirt in their cart but didn't proceed to checkout: 0.350574712644
Percent of users who proceeded to checkout but didn't purchase: 0.362831858407

Correct me please if I'm wrong. Based on what you all said, when two dataframes are left-merged, if the right dataframe has matching duplicated values, the resulting dataframe will contain one row per matching right-hand row, even though the left dataframe has the matching value only once?
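Yes - a tiny sketch with made-up tables shows exactly that behaviour:

```python
import pandas as pd

# Toy tables: user 'a' has two cart rows, user 'b' has none.
left = pd.DataFrame({'user_id': ['a', 'b']})
right = pd.DataFrame({'user_id': ['a', 'a'],
                      'cart_time': ['10:00', '10:05']})

merged = pd.merge(left, right, how='left')
print(merged)
# 'a' appears twice (one row per matching right-hand row);
# 'b' appears once, with NaN cart_time.
print(len(merged))  # 3, even though left has only 2 rows
```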

This is really a good observation.
When I tried to use all_data to calculate the percentage, things went wrong.
It seems that after merging, all_data has 2373 records, while the visits has only 2000 records.
This can be explained by the duplication.
If the right DataFrame has multiple rows with the same id that match a row in the left, each of those rows produces its own row in the result. So there are 373 duplicate user records in total (2373 - 2000).

I'm glad so many others have spotted this issue - it needs to be fixed, no? The average-time calculation at the end is skewed as well.