(pandas) increased dataframe size when using left merge

I’m working on the funneling project where I have to determine at what steps of the purchasing process customers “drop out”.

There are four tables: visits, cart, checkouts, and purchases.

Printing visits.count() tells me that the visits table contains 2000 entries. However, when I left join the rest of the tables, using

all_data = visits.merge(cart, how = ‘left’).merge(checkout, how = ‘left’).merge(purchase, how = ‘left’)

all_data.count() gives me 2557 entries. If a left merge only adds the rows from the right table that can be matched to rows from the left table, how do I end up with more entries in the final table than I have in my leftmost table?

I tried adding .drop_duplicates() to the end of my definition but that didn’t change anything.

https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/dspath-multiple-tables-pandas/projects/multi-tables-proj

2 Likes

I am running into the same issue. Were you able to figure it out?

Same here. If you look at the video tutorial, they have the same thing at the very beginning:

print(len(visits_cart))
#returns a result of 2052
#visits_cart = pd.merge(visits, cart, how=“left”)

print(len(visits)
#returns a result of 2000

…and I’ve just worked out the answer. Basically in the cart DataFrame there are duplicate values of user_id. If you run the following code:
cart_unique = cart.drop_duplicates(“user_id”)
visits_cart = visits.merge(cart_unique, how=“left”)

visits_cart ends up having 2000 rows, just like visit

1 Like