FAQ: Working with Multiple DataFrames - Merge on Specific Columns II

This community-built FAQ covers the “Merge on Specific Columns II” exercise from the lesson “Working with Multiple DataFrames”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

Data Analysis with Pandas

FAQs on the exercise Merge on Specific Columns II

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply (reply) below!

Agree with a comment or answer? Like (like) to up-vote the contribution!

Need broader help or resources? Head here.

Looking for motivation to keep learning? Join our wider discussions.

Learn more about how to use this guide.

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

Hello! So I am a bit confused about why we would ever use left_on, right_on. It adds an extra column which just repeats another column and seems unnecessarily complicated when compared to just renaming the column in the merge() call.

I just don’t understand in what situation we would prefer to do this, since the result seems to always be more clunky.

5 Likes

In what principle the columns are sorted after being merged?

How come it is the ‘id’ columns that get renamed to id_x and id_y?

import codecademylib
import pandas as pd

orders = pd.read_csv('orders.csv')
print(orders)
products = pd.read_csv('products.csv')
print(products)

orders_products = pd.merge(
  orders,
  products,
  left_on = 'product_id',
  right_on = 'id',
  suffixes = ['_orders', '_products']
)
print(orders_products)

Left table we are merging on column ‘product id’ right table we are merging on column ‘id’

But in the merged table the columns that get changed to id_x and id_y (or renamed using suffix) is the ‘id’ column in both tables, not the ‘product id’ column in the left table

1 Like

According to the Pandas document of the merge method, suffixes are applied to overlapping column names in the left and right side, respectively. In our example, only id is overlapping and others are not, so suffixes are added to id.

2 Likes

Can anyone explain if we are merging id_x and id_y, why arent they equal in the table?
Thank

Hi,
I’m having a hard time understanding what right_on and left_on actually do. I have read through the exercise several times and read through the pandas documentation, but i’m still really unclear as to what the point of them are.

Any explanations would be much appreciated.

1 Like

Thanks for this explanation. I understand now the function of the code.
The only thing I don’t understand is why do we need two identical columns? And why does the code changes the order of the rows? Which order do this follow?

Does identical columns mean product_id and id_products in the merged table? I think it depends on case by case whether it is necessary. If the column names are different, it can be better to keep both. If you think you don’t need it, you can drop it later.

The values in the product_id column of the left table are ordered 3, 2, 1, 4 except for duplicates. It seems that they are organized in this order.

2 Likes

Thanks for your answer. My second question was about which rule the method follows to reordering the rows. I guess I was thinking about a dataFrame where order of rows matters. Maybe a date ordered dataFrame. But I think this is a very specific question. If you know something about it or have a source you can share eventually, I’d be grateful.

1 Like

The standard sorting comes from the index of the DataFrame- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html. This can be a simple numerical range, by date, category and more if you look into it.

If you’re talking about merges themselves then it can be a bit complex, in some circumstances you can maintain them and in others they’re somewhat forgotten (is the join on one or more indices?). Check the docs for some of the options pandas.DataFrame.merge — pandas 1.3.4 documentation

1 Like

I haven’t followed the source code in detail, so I don’t know the exact rules. I’ve just guessed from the merged results. Checking the product_id column in the left table from the top, first process the rows whose product_id is 3, then process the rows whose product_id is 2, … and so on.

As @tgrtim mentioned, we have some options to control ordering. For example, if we set sort=True, the result is sorted in lexicographic order with respect to product_id.

orders_products = pd.merge(..., sort=True)

If you want to sort by a column which order matters, such as timestamp, you can use .sort_values().

orders_products = pd.merge(...).sort_values(by=['timestamp'])
2 Likes

I find it very unattractive to have two columns id_orders and id_products while the other columns end with id. Specifying a prefix on merge is not possible, is it?

Was curious about this and looked into it. Doesn’t appear to be an option within the Pandas merge method.

You can always modify the name afterwards or use add_prefix. It’s mainly to deal with the situation when your final table would otherwise end up with identical column names after the merge and the suffix is there to differentiate them in those situations. You can always process it further later down the stream.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html