I’m working on the funneling project where I have to determine at what steps of the purchasing process customers “drop out”.
There are four tables: visits, cart, checkouts, and purchases.
Printing visits.count() tells me that the visits table contains 2000 entries. However, when I left join the rest of the tables, using
all_data = visits.merge(cart, how = ‘left’).merge(checkout, how = ‘left’).merge(purchase, how = ‘left’)
all_data.count() gives me 2557 entries. If a left merge only adds the rows from the right table that can be matched to rows from the left table, how do I end up with more entries in the final table than I have in my leftmost table?
I tried adding .drop_duplicates() to the end of my definition but that didn’t change anything.
Same here. If you look at the video tutorial, they have the same thing at the very beginning:
print(len(visits_cart)) #returns a result of 2052 #visits_cart = pd.merge(visits, cart, how=“left”)
print(len(visits) #returns a result of 2000
…and I’ve just worked out the answer. Basically in the cart DataFrame there are duplicate values of user_id. If you run the following code:
cart_unique = cart.drop_duplicates(“user_id”)
visits_cart = visits.merge(cart_unique, how=“left”)
visits_cart ends up having 2000 rows, just like visit