Hello! So I am a bit confused about why we would ever use left_on, right_on. It adds an extra column which just repeats another column and seems unnecessarily complicated when compared to just renaming the column in the merge() call.
I just don’t understand in what situation we would prefer to do this, since the result seems to always be more clunky.
Left table we are merging on column ‘product id’ right table we are merging on column ‘id’
But in the merged table the columns that get changed to id_x and id_y (or renamed using suffix) is the ‘id’ column in both tables, not the ‘product id’ column in the left table
According to the Pandas document of the merge method, suffixes are applied to overlapping column names in the left and right side, respectively. In our example, only id is overlapping and others are not, so suffixes are added to id.
Hi,
I’m having a hard time understanding what right_on and left_on actually do. I have read through the exercise several times and read through the pandas documentation, but i’m still really unclear as to what the point of them are.
Thanks for this explanation. I understand now the function of the code.
The only thing I don’t understand is why do we need two identical columns? And why does the code changes the order of the rows? Which order do this follow?
Does identical columns mean product_id and id_products in the merged table? I think it depends on case by case whether it is necessary. If the column names are different, it can be better to keep both. If you think you don’t need it, you can drop it later.
The values in the product_id column of the left table are ordered 3, 2, 1, 4 except for duplicates. It seems that they are organized in this order.
Thanks for your answer. My second question was about which rule the method follows to reordering the rows. I guess I was thinking about a dataFrame where order of rows matters. Maybe a date ordered dataFrame. But I think this is a very specific question. If you know something about it or have a source you can share eventually, I’d be grateful.
If you’re talking about merges themselves then it can be a bit complex, in some circumstances you can maintain them and in others they’re somewhat forgotten (is the join on one or more indices?). Check the docs for some of the options pandas.DataFrame.merge — pandas 1.3.4 documentation
I haven’t followed the source code in detail, so I don’t know the exact rules. I’ve just guessed from the merged results. Checking the product_id column in the left table from the top, first process the rows whose product_id is 3, then process the rows whose product_id is 2, … and so on.
As @tgrtim mentioned, we have some options to control ordering. For example, if we set sort=True, the result is sorted in lexicographic order with respect to product_id.
orders_products = pd.merge(..., sort=True)
If you want to sort by a column which order matters, such as timestamp, you can use .sort_values().
I find it very unattractive to have two columns id_orders and id_products while the other columns end with id. Specifying a prefix on merge is not possible, is it?
Was curious about this and looked into it. Doesn’t appear to be an option within the Pandas merge method.
You can always modify the name afterwards or use add_prefix. It’s mainly to deal with the situation when your final table would otherwise end up with identical column names after the merge and the suffix is there to differentiate them in those situations. You can always process it further later down the stream.