WARBY PARKER Project: help analyzing survey questions using advanced queries

I am new to SQL and have been working through the course and I’m not working on the Warby Parker Project and I’d like to complete a more in-depth analysis of questions 2-5 by looking at the “gender” of the user based on how the first question was answered. This would gain insight in how Male and Female respondents answered the other 4 questions, but specifically wanting to look at question 5. “When was your last eye exam” to see if men or women are more likely to respond. I wrote an SQL query to create a temporary table and assign a “gender” which I joined back to the survey table which returned what appears to be what I am looking for, but I don’t know how to proceed.

CREATE TEMPORARY TABLE gender AS
SELECT CASE
    WHEN response = "Men's Styles" THEN 'Male'
    WHEN response = "Women's Styles" THEN 'Female'
    ELSE "Prefer Not to Answer"
    END AS "Gender",
  user_id
FROM survey;

SELECT survey.user_id, survey.question, survey.response, gender
FROM survey
JOIN gender
ON survey.user_id = gender.user_id;

I have tried “GROUP BY” and did not get the intended results. I have also tried adding a WHERE clause but got no results back (and no errors)…

SELECT survey.user_id, survey.question, survey.response, gender
FROM survey
JOIN gender
ON survey.user_id = gender.user_id
WHERE survey.question = "5. When was your last eye exam?";

Something to consider: what about women who purchase men’s frames or vice versa?
What were the results?

You can create a CTE with just the following:

WITH gender AS
    (query goes 
    here);

SELECT...

Did you try using COUNT for survey.response and GROUP BY gender?

1 Like

It is only an assumption, and not a great one, but it’s all that is available without an actual question on the survey asking about gender. This could be a possible recommendation to add such a question to allow for further insight and consider inclusivity in future marketing campaigns.

Back to the problem I did somewhat manage to get the answer using the query below but I found that count(distinct suvey.user_id) was placing the total count for each response in the “Prefer Not to Answer” response when grouped. I figured this out and ended up subtracting the male + female counts from the Prefer Not to Answer count to get the count for the Prefer Not to Answer.

CREATE TEMPORARY TABLE gender AS
SELECT CASE
    WHEN response = "Men's Styles" THEN 'Male'
    WHEN response = "Women's Styles" THEN 'Female'
    ELSE "Prefer Not to Answer"
    END AS "Gender",
  user_id
FROM survey;

SELECT survey.question, survey.response, gender, count(distinct gender.user_id)
FROM survey
JOIN gender
ON survey.user_id = gender.user_id
  WHERE question = "5. When was your last eye exam?"
GROUP BY 2, 3;

I’m so close to getting the correct output using SQL but I’m not sure how to stop the count(distinct gender.user_id) from counting all the responses in Prefer Not to Answer.

FWIW, one thing you’re doing is presuming that those who purchase a certain style are of a particular gender and that may or may not be true. There are unknowns and it’s injecting bias into the analysis.
Rather, you could speculate that people didn’t answer Q5 b/c they find it too personal, don’t have insurance/haven’t see an eye doc in ages, got bored w/the survey questions, etc. etc.

You could see what frames are the most popular/brought in the most $$, what shapes of frames are the most popular, or, what has more purchases: women’s or men’s frames? Those are actual numbers that you can pull from the data.

If you look at the WP site, they’ve updated their quiz to 8 questions now. (I’m not sure if the project was updated or not to reflect that (it’s an older project)). Plus, they now send people 5 pairs (there’s no more 3 pairs).

1 Like

So a whole bunch going on here and some great lessons from @lisalisaj.

First some data quality issues that have already been discussed:

-- data quality issues
with questions_answered as (
  select
    user_id,
    count(question) as num_responses
  from
    survey
  group by
    user_id
  order by count(question) asc
)

select
  num_responses,
  count(distinct user_id) as users_that_answered
from questions_answered
group by
  num_responses
order by
  num_responses desc
;

You’ll see that as @lisalisaj pointed out quite a lot of people didn’t finish the survey. This will definitely impact the analysis you’re doing. There’s nothing wrong with that per say but you need to account for how you’ll handle it. Real life data is going to have ambiguity like this.

Now onto the query. I made the same mistake you did initially. Take a look at your temp table and you’ll see the data is kind of odd. Your conditions are too loose in your case statements (what happens when the answer to the question is "1-3 Years")? If you compare it to my case in my gender_ids CTE, you’ll see how I was able to discern there was an issue right away.

I’d echo that a CTE is a better choice but I have no great reason to suggest that since when you’re comparing performance, the winner depends on the scenario. I prefer CTEs for readability. I used two below but one would be sufficient:

with gender_ids as (
  select
    user_id,
    case response 
      when "Men's Styles" then "M"
      when "Women's Styles" Then "F"
      when "I'm not sure. Let's skip it." Then "U"
      else "?"
    end as gender
  from survey
  where
    question like '1.%'
),

eye_exam as (
  select
    user_id,
    response
  from survey
  where
    question like '5.%'
)

select
  g.gender,
  ee.response,
  count(g.user_id) as counts
from
  gender_ids as g
left join
  eye_exam as ee
  on g.user_id = ee.user_id
group by
   g.gender,
   ee.response
order by
  g.gender,
  ee.response   
;

So what’s different here than what you have? What do you notice about the answer to your question? When do you think people are going to be on the Warby Parker site? etc?

A good extension of what you’re trying to do is to take your original test, add in some of the other scenarios suggested by @lisalisaj, put it all together and make a little presentation highlighting what you’ve got. If you’re feeling ambitious, record it. Either way, this is one little way to go beyond the projects here like you intended to and to set yourself apart from other candidates.

2 Likes

Thank you for the response. I have updated the project and abandoned this idea altogether because like you stated it was presuming a lot about the users. I am focusing on other aspects of the quiz as you suggested.

1 Like

Thank you for the response. This has been extremely helpful and thought provoking.

After reviewing my code and yours I noticed 2 differences, including the mistake(s) I made. For starters I used “No Preference” instead of “I’m not sure. Let’s skip it.” Secondly, the response “I’m not sure. Let’s skip it.” is a response for other questions so adding the ‘where’ statement focuses the query on responses limited to question 1.

There are many solutions to the problem and I agree that the CTE is definitely more readable. I also learned that I need to focus on the data that is available and not to make too many assumptions.

Thanks for the suggestions and I will work on using the original test and playing around with other scenerios now that I have a clear path forward.

1 Like