Page Visits Funnel Project - note same users listed multiple times within the tables

This is regarding the data science path, module 9, Page Visits Funnel Project - https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/dspath-multiple-tables-pandas/projects/multi-tables-proj

I was reminded of the importance of always being familiar with the data you are working with when I noticed that the merged visits_cart table created in task #2 has 2052 rows, which exceeds the 2000 rows in the original visits table. This didn't make sense to me at first, since we did a left join on visits: visits_cart = pd.merge(visits, cart, how='left')

After looking into it further, I noticed that some user IDs appear more than once in the cart table, and confirmed this is because those users visited the cart page more than once within the same visit - a fairly normal occurrence.

You can test this by comparing print(len(cart)) (400) with print(cart.user_id.nunique()) (348).
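To see why the merged table grows, here is a minimal sketch with made-up data (not the project's actual tables): a left merge emits one output row for every matching row in the right table, so repeated user IDs in cart inflate the result past the length of visits.

```python
import pandas as pd

# Synthetic illustration: three unique visitors, but user 1
# appears twice in the cart table (two cart visits).
visits = pd.DataFrame({'user_id': [1, 2, 3]})
cart = pd.DataFrame({'user_id': [1, 1, 2],
                     'cart_time': ['09:00', '09:05', '10:00']})

# Left merge on visits, as in task #2 of the project.
visits_cart = pd.merge(visits, cart, how='left')

print(len(visits))       # 3
print(len(visits_cart))  # 4 -- an extra row for user 1's second cart visit
```

The same mechanism explains 2052 rows from a 2000-row visits table: each duplicated user_id in cart contributes one extra merged row.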

There are also users that started the checkout process more than once within the same visit.

The instructional video does not take this into account. Technically, we can’t accurately answer the question about the percentage of users that put a t-shirt in their cart and then moved on to checkout and then made a purchase without accounting for the fact that some users visited their cart more than once and others entered checkout more than once.

For instance, when answering question #5 in the project, "What percent of users who visited Cool T-Shirts Inc. ended up not placing a t-shirt in their cart?", we need to divide the total number of null cart_time values by the length of the original visits table instead of the merged visits_cart table, since the merged table contains duplicate user_id values.

print(null_cart_times / float(len(visits)))
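Putting that together, here is a hedged sketch of the question #5 calculation on toy data (the variable names mirror the project's, but the numbers are made up). Note that rows with a null cart_time can't be duplicated by the merge - an unmatched left row appears exactly once - so the numerator is safe; it's the denominator that must come from the original visits table.

```python
import pandas as pd

# Toy data: users 3 and 4 visited the site but never reached the cart.
visits = pd.DataFrame({'user_id': [1, 2, 3, 4]})
cart = pd.DataFrame({'user_id': [1, 1, 2],
                     'cart_time': ['09:00', '09:05', '10:00']})

visits_cart = pd.merge(visits, cart, how='left')

# Users with no cart entry show up as a single row with NaN cart_time.
null_cart_times = len(visits_cart[visits_cart.cart_time.isnull()])

# Divide by the original visits table, not the inflated merged one.
pct_no_cart = null_cart_times / float(len(visits))
print(pct_no_cart)  # 0.5 -- 2 of 4 visitors never placed anything in the cart
```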

Based on this use case, I looked up how to remove duplicate rows from a dataframe based on the values in just one column. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

So, if you wanted to remove duplicate user Ids from the checkout table, you would enter checkout_no_duplicate_ids = checkout.drop_duplicates(subset='user_id', keep='first').reset_index(drop=True).
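As a quick sanity check of that approach, here is the same call run on a small made-up checkout table, showing the duplicate user_id rows being dropped:

```python
import pandas as pd

# Made-up checkout table: user 10 started checkout twice.
checkout = pd.DataFrame({'user_id': [10, 10, 11],
                         'checkout_time': ['12:00', '12:03', '13:00']})

# Keep only the first checkout event per user.
checkout_no_duplicate_ids = checkout.drop_duplicates(
    subset='user_id', keep='first').reset_index(drop=True)

print(len(checkout))                    # 3
print(len(checkout_no_duplicate_ids))   # 2 -- one row per user remains
```

With duplicates removed, merging this table back into visits will no longer inflate the row count.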


I believe I have just run into this same concern. I completed the exercise alone (my first one without looking at the video, was very pleased with myself!) but then watched the solution video after I’d finished to check my results and my thought process.

I was concerned when my result of % of users that proceeded to checkout but did not purchase a t-shirt was different to the one shown in the video solution - I got 6%, vs 16% shown in the video.

My approach was to first subtract a count of all purchase events from a count of all checkout events (i.e. rows with a timestamp present), and then divide that difference by the number of unique users in the user_id column. My code was:

checkout_no_purchase = (all_data.checkout_time.count() - all_data.purchase_time.count()) / float(all_data.user_id.nunique())
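One likely source of the gap: the two approaches count different things. Here is a toy example (made-up data, not the project's) contrasting the event-count formula above with a per-row calculation that, if I understand the video correctly, is closer to the solution's approach - dividing checkout rows with no purchase by all checkout rows:

```python
import pandas as pd

# Toy data: user 1 checked out twice but purchased once;
# user 2 checked out once and never purchased.
all_data = pd.DataFrame({
    'user_id':       [1, 1, 2],
    'checkout_time': ['12:00', '12:30', '13:00'],
    'purchase_time': ['12:40', None, None],
})

# Event-count approach from the post above:
by_events = ((all_data.checkout_time.count() - all_data.purchase_time.count())
             / float(all_data.user_id.nunique()))

# Per-row approach: fraction of checkout rows that have no purchase.
reached_checkout = all_data[~all_data.checkout_time.isnull()]
by_rows = (len(reached_checkout[reached_checkout.purchase_time.isnull()])
           / float(len(reached_checkout)))

print(by_events)  # 1.0 -- (3 checkouts - 1 purchase) / 2 unique users
print(by_rows)    # 0.666... -- 2 of 3 checkout rows have no purchase
```

The mismatch between denominators (unique users vs. checkout rows) and the repeated checkout events are enough to produce quite different percentages, which may explain the 6% vs. 16% discrepancy.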

I would be grateful to hear anyone else’s thoughts on my approach here and if / where I went wrong?

Thank you!