FAQ: Joining Tables in R - Join on Specific Columns II

This community-built FAQ covers the “Join on Specific Columns II” exercise from the lesson “Joining Tables in R”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Learn R

FAQs on the exercise Join on Specific Columns II

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply (reply) below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply (reply) below!
You can also find further discussion and get answers to your questions over in #get-help.

Agree with a comment or answer? Like (like) to up-vote the contribution!

Need broader help or resources? Head to #get-help and #community:tips-and-resources. If you are wanting feedback or inspiration for a project, check out #project.

Looking for motivation to keep learning? Join our wider discussions in #community

Learn more about how to use this guide.

Found a bug? Report it online, or post in #community:Codecademy-Bug-Reporting

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

Hello. I was just wondering why in this exercise when you add a suffix argument to product_orders it doesn’t add a suffix to the product id? This was my code:

products_orders <- products %>%
inner_join(orders,
by = c(‘id’ = ‘product_id’),
suffix = c(’_product’, ‘_order’))
products_orders

but in the resulting table the first column was still just labeled id.

4 Likes

Firstly, in my opinion, the explanatory text seems inconsistent and inaccurate. The text up to “The resulting data frame will look like this:” and the embedded dataframe is fine. But, after that i.e. from “If we use this syntax, …” onward, the explanation doesn’t seem quite accurate. I think they should have used a slightly different example so that the subsequent discussion about suffixes would have been accurate and applicable to the example.

Putting that aside and getting back to your observation about the exercise, here is what I think is happening.
The products table has columns: id, description, price
The orders table has columns: id, product_id, customer_id, quantity, timestamp

As per the instructions, we can do an inner_join using the code you have mentioned i.e.:

products_orders <- products %>%
  inner_join(orders,
            by = c('id' = 'product_id'),
            suffix = c('_product', '_order'))

This does an inner_join using the id column from products and the product_id column from orders.
The resulting dataframe products_orders has all the columns from products and all the columns (except product_id) from orders. Because of the statement c('id' = 'product_id'), the common column between the two tables is labelled as id in the products_orders dataframe (taking on the label of the left table). The product_id column isn’t included as a separate explicitly labelled column, because the id column of products_orders table already holds the information common to both tables.
The products_orders dataframe also inherits an id column from the orders table. This shows up as id_order in the final table because of our specified suffixes.
If we read the R documentation for inner_join (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join), the thing to note is:
suffix: If there are non-joined duplicate variables in x and y , these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.
The key phrase is non-joined duplicate variables.
The columns of products_orders are: id, description, price, id_order, customer_id, quantity, timestamp
The first id column was actually used when we did the inner_join, so the suffixes don’t apply to it. The second id which was inherited from the orders table and wasn’t used in the join qualifies as being a non-joined duplicate variable. So, it takes on the suffix '_order'.

As an experiment, you can try playing around with the code to verify the above. Suppose, we omit the suffixes, then we see that the products_orders table has columns id and id.y

Now, let us do a inner_join with suffixes but on two nonsense columns (the motivation being that the final table will inherit an id column from each table and neither of the two id columns would have been used to do the join).

So, I did a nonsense inner_join between the price and product_id columns from the two tables. If you look at the screenshot, the resulting table has both an id_product and a id_order column BECAUSE neither of these columns was used to do the join. Hence, they qualify as being non-joined duplicate variables and the suffixes are applied to both of them.

4 Likes

i am not able to understand this although i can solve the exercise,

isn’t the function c() used for defining a vector in R?
so how does it work.

Yes, the c function can be used to create vectors in R.

join function documentation
c function documentation

If we look at the documentation for the join function, we can see that the suffix argument should “be a character vector of length 2”. We accomplish this by using the c function to create a vector of length 2 with the first member of the vector being used for non-joined duplicate variables from the left table, while the second member of the vector refers to those from the right table. e.g.

suffix = c('_product', '_order')

Again, if we look at the documentation for the join function, we can see that if the columns we are joining on have different names in the two tables, then we are expected to “To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b .”

In our case, we are creating a named vector using the statement

by = c('id' = 'product_id')

This creates a vector with a single member ‘product_id’. We give this member the name ‘id’. For more on named vectors, you may want to look at: