Help with pandas

Hello,

I'm working on the Page Visits Funnel project, doing research for Cool T-shirts:

https://www.codecademy.com/paths/data-science/tracks/data-processing-pandas/modules/e0ebc837507d014014cc356d97a1250c/projects/multi-tables-proj

There's something I just don't understand, and I feel I'm not totally getting the differences between the types of join.

There are four tables to work with: visits, cart, checkout and purchase. The goal is to calculate how many users abandon the purchasing process at each step.

We first take the “visits” table, which has 2000 rows, i.e. 2000 instances of someone visiting the website. However, when I perform a left join with “cart”, I see that the visits_time column in the resulting table has swelled to 2052 rows. How is this possible? If I'm not mistaken, what I'm getting with this left join is a column with all the visit times and another with the cart times. In the latter, there are many null entries, i.e. all those who visited the website but did not proceed to the cart. This seems pretty easy to understand, but I still can't see what those 52 extra entries in the visit times are. Where do they come from?

I feel I'm missing something quite obvious here, and would appreciate it if someone could help me with this :slight_smile:

Thanks!!

@jesusgamiz,

Great question. As with most coding questions, the answer can be found in the documentation.

I highly encourage you to read more about DataFrame.merge() there. However, I’ll give you the brief answer here as well.

The left merge in pandas is similar to the SQL LEFT OUTER JOIN. It keeps every row of the left table and attaches the matching rows from the right table; keys that appear only in the right table are dropped. However, what if there are multiple rows in the right table that use the same key and that key is in the left table? All of those rows are included, so the left row is repeated once for each match (remember, left outer merge).
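To make that concrete, here is a minimal sketch with made-up data. The three-row tables and their values are invented for illustration and are not the project's real data; only the column names user_id, visit_time and cart_time follow the project.

```python
import pandas as pd

# Toy tables, invented for illustration (not the project's real data).
visits = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "visit_time": ["10:00", "10:05", "10:10"],
})
cart = pd.DataFrame({
    "user_id": ["a", "a", "c"],  # user "a" reached the cart twice
    "cart_time": ["10:02", "10:20", "10:15"],
})

# Left merge: every visits row is kept, and repeated once per matching cart row.
visits_cart = pd.merge(visits, cart, how="left", on="user_id")

print(visits_cart)
print(len(visits), "rows before,", len(visits_cart), "rows after")
```

User "a" shows up twice in the result (one row per cart entry), user "b" shows up once with a null cart_time, and user "c" shows up once, so three visits turn into four merged rows.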

Here, the visits and cart tables are merged on their shared column: user_id. If you take a close look at the resulting visits_cart table, you will see that there are some rows with duplicate user_id and visit_time values, but with different cart_time values:
[image: rows of the visits_cart table with repeated user_id and visit_time values]
That is because there were multiple rows in the cart table with that user_id (think: someone visited the site and went to their cart more than once).
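If you want to check this yourself, something along these lines should work. It is only a sketch on invented data: it assumes user_id is unique in visits, and the tiny tables below stand in for the real ones. The validate argument of merge is optional and simply makes pandas raise an error when the keys are not one-to-one.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins for the real tables (values invented for illustration).
visits = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "visit_time": ["09:00", "09:10", "09:20", "09:30"],
})
cart = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u3"],  # u1 has three cart rows
    "cart_time": ["09:05", "09:40", "09:55", "09:25"],
})

# Cart rows whose user_id also occurs in visits.
matching = cart[cart["user_id"].isin(visits["user_id"])]

# Every repeat of a user_id beyond its first occurrence adds one merged row
# (assuming user_id is unique in visits).
extra_rows = int(matching["user_id"].duplicated().sum())

merged = pd.merge(visits, cart, how="left", on="user_id")
print(len(visits), "+", extra_rows, "=", len(merged))

# Optional guard: ask pandas to fail if the merge is not one-to-one.
try:
    pd.merge(visits, cart, how="left", on="user_id", validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print("one-to-one check failed:", raised)
```

Run against the real visits and cart tables, the same count of repeated user_id values should account for the extra rows you are seeing.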

Hope this clears things up a little.

Happy coding!
