Can we merge on more than one specific column?

Question

In the context of this exercise, can we merge on more than one specific column?

Answer

Yes, you can perform a merge on one column, or on multiple specified columns, by passing in a list of the column names for each dataframe.

When listing multiple column names, it will only return rows for which all the column values match. Furthermore, the number of columns listed must match, and the order they are listed will matter.

Example

# This will match the values for 
# column "a" with "c" 
# and column "b" with "d".

pd.merge(
  df1,
  df2,
  left_on=["a", "b"],
  right_on=["c", "d"]
)
6 Likes

I think this two sentences mean different things, in the first I assume that ‘a’, ‘b’, ‘c’ and ‘d’ will only appear in the merged table if all four columns have the same value in a given row.

Whereas in the second I assume that they only need to have the same value in the pairs (‘a’ and ‘c’) and (‘b’ and ‘d’) but ‘a’ and ‘b’ can be different and that row will still be in the merged table.

Witch one of the two assumptions is true?

2 Likes

My guess is the second one. The “all” in the first quote likely is just referencing the two pairs, not all four of them together.

2 Likes

@adrin_estvez for clarity. When we are merging the data frames. The first column specified in left_on array will match with the first column in right_on array. and the second column in left_on array will match the second column in right_on array.
For a row in the left dataframe to match with a row in the right dataframe then the values must be equal:

left_dataframe[ left_on[0] ] == right_dataframe [ right_on[0]] & left_dataframe[ left_on[1] ] == right_dataframe [ right_on[1]]

Here’s an example:

import pandas as pd

personal = pd.DataFrame([[“lionel”, “Messi”, 36,10, “Argentina”],[“Luis”, “Suarez”, 36, 9, “Uruguay”], [“junior”, “Neymar”, 31, 10, “Brazil”]],columns=[“first_name”, “last_name”, “age”, “shirt_number”, “country”])

print(personal)

stats = pd.DataFrame([[“lionel”, “Messi”, 21, 20, “PSG”],[“Junior”, “Neymar”, 18, 17, “PSG” ], [“Luis”, “Suarez”, 11, 8, “Gremio”] ], columns = [“name”, “surname”,“goals”, “assits”, “club”])

print(stats)

all_data = pd.merge(personal, stats, left_on = [“first_name”, “last_name”], right_on = [“name”, “surname”] )
print(all_data)

first_name last_name age shirt_number country
0 lionel Messi 36 10 Argentina
1 Luis Suarez 36 9 Uruguay
2 junior Neymar 31 10 Brazil
name surname goals assits club
0 lionel Messi 21 20 PSG
1 Junior Neymar 18 17 PSG
2 Luis Suarez 11 8 Gremio
first_name last_name age shirt_number country name surname goals
0 lionel Messi 36 10 Argentina lionel Messi 21
1 Luis Suarez 36 9 Uruguay Luis Suarez 11

assits club
0 20 PSG
1 8 Gremio

Notice how The row for Neymar did not match because in the personal dataframe first_name is lower case junior and in the stats data frame it is uppercase hence first_name != name column even though the last_name == surname.