Firstly, in my opinion, the explanatory text seems inconsistent and inaccurate. The text up to “The resulting data frame will look like this:” and the embedded dataframe is fine. But, after that i.e. from “If we use this syntax, …” onward, the explanation doesn’t seem quite accurate. I think they should have used a slightly different example so that the subsequent discussion about suffixes would have been accurate and applicable to the example.
Putting that aside and getting back to your observation about the exercise, here is what I think is happening.
products table has columns:
id, description, price
orders table has columns:
id, product_id, customer_id, quantity, timestamp
As per the instructions, we can do an inner_join using the code you have mentioned i.e.:
products_orders <- products %>%
by = c('id' = 'product_id'),
suffix = c('_product', '_order'))
This does an
inner_join using the
id column from
products and the
product_id column from
The resulting dataframe
products_orders has all the columns from products and all the columns (except
product_id) from orders. Because of the statement
c('id' = 'product_id'), the common column between the two tables is labelled as
id in the
products_orders dataframe (taking on the label of the left table). The
product_id column isn’t included as a separate explicitly labelled column, because the
id column of
products_orders table already holds the information common to both tables.
The products_orders dataframe also inherits an id column from the orders table. This shows up as
id_order in the final table because of our specified suffixes.
If we read the R documentation for inner_join (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join), the thing to note is:
suffix: If there are non-joined duplicate variables in x
, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.
The key phrase is non-joined duplicate variables.
The columns of
id, description, price, id_order, customer_id, quantity, timestamp
id column was actually used when we did the inner_join, so the suffixes don’t apply to it. The second
id which was inherited from the
orders table and wasn’t used in the join qualifies as being a non-joined duplicate variable. So, it takes on the suffix
As an experiment, you can try playing around with the code to verify the above. Suppose, we omit the suffixes, then we see that the products_orders table has columns id and id.y
Now, let us do a inner_join with suffixes but on two nonsense columns (the motivation being that the final table will inherit an id column from each table and neither of the two id columns would have been used to do the join).
So, I did a nonsense inner_join between the price and product_id columns from the two tables. If you look at the screenshot, the resulting table has both an id_product and a id_order column BECAUSE neither of these columns was used to do the join. Hence, they qualify as being non-joined duplicate variables and the suffixes are applied to both of them.