ANALYZE REAL DATA WITH SQL exercise question

Hey can someone explain me why is it different when filtering in this 2 places?

this is how the solution is:

WITH last_touch AS (
SELECT user_id,
max(timestamp) as last_touch_at
FROM page_visits
where page_name= “4 - purchase” <----
GROUP BY user_id)
SELECT
–lt.user_id,
–lt.last_touch_at,
pv.utm_source,
pv.utm_campaign,
count (utm_campaign)
FROM last_touch lt
JOIN page_visits pv
ON lt.user_id = pv.user_id
AND lt.last_touch_at = pv.timestamp
group by utm_campaign
order by count (utm_campaign) desc;

with a similar result my answer was:

WITH last_touch AS (
SELECT user_id,
max(timestamp) as last_touch_at
FROM page_visits
GROUP BY user_id)
SELECT
–lt.user_id,
–lt.last_touch_at,
pv.utm_source,
pv.utm_campaign,
count (utm_campaign)
FROM last_touch lt
JOIN page_visits pv
ON lt.user_id = pv.user_id
AND lt.last_touch_at = pv.timestamp
where pv.page_name= “4 - purchase” <----
group by utm_campaign
order by count (utm_campaign) desc

@eyalbre,

Welcome to the forums!

Do you mind pasting the link to the exercise as well?

Thanks.

sure, the link to the exercise is:
https://www.codecademy.com/paths/data-science/tracks/analyze-data-sql-analyze-real-data/modules/analyze-data-sql-marketing-attribution/projects/analyze-data-sql-attribution-queries

@eyalbre,

There is a short answer and a long answer to this question. I highly encourage you to read through my longer explanation (in the dropdown below) in order to fully understand what is going on and how to figure out what was going on, using other SQL queries.

The short answer is that putting the WHERE statement inside your WITH statement finds the last touch that lead to a purchase for each purchaser, while putting your WHERE statement outside your WITH statement finds the absolute last touch for any user who has made a purchase (whether or not the absolute last touch lead to a purchase as well) and then filters that down to only those absolute last touches that were on the purchase page.

Click here for the in-depth explanation

The key difference between the two queries is how the WHERE statement is filtering the data depending on its location. Let’s break it down.

Query 1 (or last_touch1)

I am going to start with the query that has the WHERE statement outside of the WITH statement (absolute last touch).

Looking at the WITH statement, we can see that we

  • take the page_visits table,
  • group it by user_id, then
  • select user_id and the maximum value of time_stamp for each user:

Click here for code
WITH last_touch AS (
SELECT user_id,
  MAX(timestamp) as last_touch_at
FROM page_visits
GROUP BY user_id)

This finds the absolute the last touch.

In the remainder of our query, we

  • take our last_touch CTE and join it with the original page_visits table,
    • joining on user_id AND lt.last_touch_at = pv.timestamp ensures that we are only including records for each user’s last touch.
  • then we filter it to only those records (of last touches) where the page name is “4 - purchase”,
  • and finally we group, order and select the columns we want:

Click here for code
SELECT
  lt.user_id,
  lt.last_touch_at,
  pv.utm_source,
  pv.utm_campaign,
  COUNT (utm_campaign)
FROM last_touch lt
JOIN page_visits pv
  ON lt.user_id = pv.user_id
  AND lt.last_touch_at = pv.timestamp
WHERE pv.page_name = “4 - purchase” 
GROUP BY utm_campaign
ORDER BY COUNT (utm_campaign) DESC;

If you counted them, you would see that there are 358 users whose absolute last touch was on the purchase page (as opposed to 361 total purchasers).

Query 2 (or last_touch2)

In the other query, the WHERE statement is inside of the WITH statement.

In this case, the WITH statement:

  • takes the page_visits table,
  • filters it to the records where page_name is “4 - purchase”,
  • then groups this by user_id, and
  • selects the user_id and maximum timestamp for each user:

Click here for code
WITH last_touch AS (
SELECT user_id,
  MAX(timestamp) as last_touch_at
FROM page_visits
WHERE page_name = “4 - purchase” 
GROUP BY user_id)

When we do it this way, we are selecting the MAX(timestamp) from only the records where the page name was “4 - purchase” and labeling them as last_touch_at. In reality, this would represent the last time the user purchased, and not necessarily their last touch on the website. However, this is important for attributing that purchase to the correct UTM source. We don’t want to attribute that purchase to the UTM source of the user’s absolute last touch if the absolute last touch was not the purchase.

The query then goes on to:

  • take our last_touch CTE and join it with the original page_visits table,
    • joining on user_id AND lt.last_touch_at = pv.timestamp results in only the records for each user’s last time on the purchases page (i.e., one record for each purchaser).
  • group, order by and select the columns we want:

Click here for code
SELECT
  lt.user_id,
  lt.last_touch_at,
  pv.utm_source,
  pv.utm_campaign,
  COUNT(utm_campaign)
FROM last_touch lt
JOIN page_visits pv
  ON lt.user_id = pv.user_id
  AND lt.last_touch_at = pv.timestamp
GROUP BY utm_campaign
ORDER BY COUNT (utm_campaign) DESC;

If you counted them, you would see that there are 361 records (the same as total purchasers), because last_touch_at was defined as the most recent time that person was on the purchase page, regardless of whether that was the user’s absolute last touch.

Click here for instructions on reverse engineering this finding

How to reverse engineer this insight (w/o SQL order of execution knowledge)

Even if you don’t have a solid grasp on SQL order of execution yet, you can figure this out with a little bit of time and some basic knowledge of using subqueries.

Let’s say that you noticed there were a different number of users in these two queries (358 and 361) and wanted to find out why. One way you could do this is to put each query in its own CTE using a WITH statement, and then do a LEFT OUTER JOIN, with the larger one on the left. This will result in NULL values where there is information in the left table and not in the right one.

To accomplish this, we will need to move our initial WITH statements into subqueries in the FROM statement. Then we will wrap each one in a WITH statement, calling them last_touch1 and last_touch2, join them with a LEFT OUTER JOIN and select all columns.

The code will look like this:

WITH last_touch1 AS (
SELECT lt.user_id, lt.last_touch_at,
    pv.page_name,
    pv.utm_source, 
    pv.utm_campaign
FROM (SELECT user_id,
        MAX(timestamp) as last_touch_at
    FROM page_visits
    GROUP BY user_id) lt
JOIN page_visits pv
    ON lt.user_id = pv.user_id
    AND lt.last_touch_at = pv.timestamp
WHERE pv.page_name = '4 - purchase'
),
last_touch2 AS (
SELECT lt.user_id, lt.last_touch_at,
    pv.page_name,
    pv.utm_source, 
    pv.utm_campaign
FROM (SELECT user_id,
        MAX(timestamp) as last_touch_at
    FROM page_visits
    WHERE page_name = '4 - purchase'
    GROUP BY user_id) lt
JOIN page_visits pv
    ON lt.user_id = pv.user_id
    AND lt.last_touch_at = pv.timestamp
    )
SELECT * 
FROM last_touch2
LEFT OUTER JOIN last_touch1
  ON last_touch2.user_id = last_touch1.user_id;

and the output will look something like this (notice the null values):

From there, you can scroll through your results and find the user_ids by hand for each row containing nulls.
Or, you can expedite the process by adding one more line to the bottom of your query with a WHERE statement:

WHERE last_touch1.user_id IS NULL;

Doing so will return this result:


These are the thee users who were purchasers, but whose absolute last touch was not on the purchase page.

What does that mean? Well, it could mean two things. They either purchased an item, but then continue browsing; or, they came back again after their purchase, but did not buy anything the second time around.

Can we confirm this is actually what they did? Yes! Now that we know the user_ids we are looking for, this is very easy:

SELECT * 
FROM page_visits
WHERE user_id IN (94852, 76466, 29180)
ORDER BY user_id, timestamp;

In the results, we can see that each of these three people followed a similar pattern:
image

  • They each came to the website and added something to their cart on their first visit.
  • Then, on their second visit they checked out their cart and purchased the item(s).
  • And finally, they browsed the website on a third day and added something to their carts…but did not buy.

This is why they were included in last_touch2 (the last touch that lead to a purchase for each purchaser), but not in last_touch1 (the absolute last touch for any user who has made a purchase, filtered down to those whose absolute last touch was on the purchase page).

Hope this helps clarify things! Happy coding!

@heatherkushner975531,

You may find this interesting or informative as well, given your recent post.