This is regarding the data science path, module 9, Page Visits Funnel Project - https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/dspath-multiple-tables-pandas/projects/multi-tables-proj
I was reminded of the importance of always making sure you are familiar with the data you are working with when I noticed that the number of rows in the merged visits_cart DataFrame created in task #2 was 2052, which exceeds the number of rows (2000) in the original visits table. This didn't make sense to me at first, since we did a left join on visits.
visits_cart = pd.merge(visits, cart, how = 'left')
After looking into it further, I noticed that some user IDs appear more than once in the cart table, and confirmed this is because those users visited the cart page more than once within the same visit - a fairly normal occurrence. A left join keeps every row of visits, but each duplicated user_id in cart matches the same visits row multiple times, which is what inflates the row count.
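This fan-out is easy to reproduce at a small scale. Here is a minimal sketch with made-up data (not the project's actual tables) showing how a left join produces more rows than the left table when the right table has duplicate keys:

```python
import pandas as pd

# Toy data (made up for illustration): one visit per user,
# but user 1 opened the cart page twice in the same visit.
visits = pd.DataFrame({'user_id': [1, 2, 3],
                       'visit_time': ['9:00', '9:05', '9:10']})
cart = pd.DataFrame({'user_id': [1, 1],
                     'cart_time': ['9:01', '9:02']})

visits_cart = pd.merge(visits, cart, how='left')

# The left join matches user 1's visit row against both cart rows,
# so the result has 4 rows even though visits only has 3.
print(len(visits))       # 3
print(len(visits_cart))  # 4
```

The same mechanism scales up to the project's 2000-row visits table becoming a 2052-row merge result.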
You can test this by comparing print(len(cart)) (400 rows) to the number of distinct users in the table, print(cart.user_id.nunique()).
There are also users that started the checkout process more than once within the same visit.
The instructional video does not take this into account. Technically, we can’t accurately answer the question about the percentage of users that put a t-shirt in their cart and then moved on to checkout and then made a purchase without accounting for the fact that some users visited their cart more than once and others entered checkout more than once.
For instance, when answering question #5 in the project, "What percent of users who visited Cool T-Shirts Inc. ended up not placing a t-shirt in their cart?", we need to divide the total number of null cart times by the length of the original visits table rather than the merged visits_cart table, since the merged table contains duplicate user_id values.
print(null_cart_times / float(len(visits)))
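Put together, the calculation might look like this (a sketch with toy data; null_cart_times counts the merged rows whose cart_time is null, i.e. visits that never reached the cart):

```python
import pandas as pd

# Toy data (made up): four visits, but only user 1 reached the cart.
visits = pd.DataFrame({'user_id': [1, 2, 3, 4],
                       'visit_time': ['9:00', '9:05', '9:10', '9:15']})
cart = pd.DataFrame({'user_id': [1, 1],
                     'cart_time': ['9:01', '9:02']})

visits_cart = pd.merge(visits, cart, how='left')

# Users 2, 3, and 4 never reached the cart, so their cart_time is null.
null_cart_times = len(visits_cart[visits_cart.cart_time.isnull()])

# Divide by the original visits table, not the inflated merge result.
pct_no_cart = null_cart_times / float(len(visits))
print(pct_no_cart)  # 0.75
```

Dividing by len(visits_cart) instead (5 rows here) would understate the percentage, because user 1's duplicate rows pad the denominator.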
Based on this use case, I looked up how to remove duplicate rows from a dataframe based on the values in just one column. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
So, if you wanted to remove duplicate user IDs from the checkout table, you could enter
checkout_no_duplicate_ids = checkout.drop_duplicates(subset='user_id', keep='first').reset_index(drop=True)
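Here is a quick demonstration of that call on toy data (made up for illustration), showing that keep='first' retains each user's earliest checkout row:

```python
import pandas as pd

# Toy checkout table: user 1 entered checkout twice in one visit.
checkout = pd.DataFrame({'user_id': [1, 1, 2],
                         'checkout_time': ['9:05', '9:07', '9:40']})

# Keep only the first checkout row per user and renumber the index.
checkout_no_duplicate_ids = checkout.drop_duplicates(
    subset='user_id', keep='first').reset_index(drop=True)

print(len(checkout_no_duplicate_ids))  # 2
print(checkout_no_duplicate_ids['checkout_time'].tolist())  # ['9:05', '9:40']
```

Whether dropping duplicates is the right call depends on the question: for "what percent of users reached checkout" it is, but if you ever wanted to count checkout events rather than users, you would keep the duplicates.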