Data Science Independent Project #2 – Explore a Sample Database

Project: Exploring a SQLite Sample Database

This project will take you off-platform and get you started in your own developer environment! Never done that before? Not to worry - we’ve shared some resources to help you down below. This project can be completed entirely on your own - or, you can join our Community Discord and find someone to work with! Jump to the community support section to hear more about this.

This project is broken down into key questions that your client or company is looking to answer. As a data scientist, you’ll often become a resource to help businesses answer the key questions about the efficacy of existing or potential strategies & projects.

Overview

Objective

You are the lead data analyst for a popular music store. Help them analyze their sales and service! Download the SQLite sample database provided here.

Pre-requisites

In order to complete this project, we suggest that you have familiarity with the content in the following courses or lessons on the Codecademy platform:

  1. Manipulation
  2. Queries
  3. Aggregate Functions
  4. Multiple Tables

Suggested Technologies

Depending on where you are on your Path, there may be multiple technology options you can use to complete this project - we suggest the following:

  1. DB Browser for SQLite

Project Tasks

Get started - hosting your project

DB Browser for SQLite is a visual tool for working with SQLite databases. Follow the link to download the application for your computer.

  • SQLite can store an entire database in a single file, which usually has a .sqlite or .db extension. To open a database, select Open Database at the top of the window and browse for the file. Alternatively, you can choose to create a New Database by saving a file with the .sqlite or .db extension.
  • To import data from a CSV file into a table, select “File > Import > Table from CSV file…” and browse for the CSV file. (Note: All fields imported from the CSV file will have a data type of TEXT. Be sure to convert fields to numeric type as needed. See here for how to do that.)
  • You can download the data you’ll be using for this specific project here.

There are several tabs near the top of the window for working with the data:

  • Database Structure: View the tables in your database and the columns they contain.
  • Browse Data: Browse the data for each table.
  • Execute SQL: Write and execute SQL queries.

Basic Requirements

Let’s break this project down into a couple different parts.

Analyzing the Data:

  • Which tracks appeared in the most playlists? how many playlist did they appear in?

  • Which track generated the most revenue? which album? which genre?

  • Which countries have the highest sales revenue? What percent of total revenue does each country make up?

  • How many customers did each employee support, what is the average revenue for each sale, and what is their total sale?

Additional Challenges

Intermediate Challenge

  • Do longer or shorter length albums tend to generate more revenue?
    • Hint: We can use the WITH clause to create a temporary table that determines the number of tracks in each album, then group by the length of the album to compare the average revenue generated for each.
  • Is the number of times a track appear in any playlist a good indicator of sales?
    • Hint: We can use the WITH clause to create a temporary table that determines the number of times each track appears in a playlist, then group by the number of times to compare the average revenue generated for each.

Advanced Challenge

  • How much revenue is generated each year, and what is its percent change from the previous year?
    • Hint: The InvoiceDate field is formatted as ‘yyyy-mm-dd hh:mm:ss’. Try taking a look at using the strftime() function to help extract just the year. Then, we can use a subquery in the SELECT statement to query the total revenue from the previous year. Remember that strftime() returns the date as a string, so we would need to CAST it to an integer type for this part. Finally, since we cannot refer to a column alias in the SELECT statement, it may be useful to use the WITH clause to query the previous year total in a temporary table, and then calculate the percent change in the final SELECT statement.

Resources & Support

Project-specific resources

  1. SQLite Date and Time Functions
  2. SQLite strftime() Function
  3. CAST expression
  4. SQLite Documentation
  5. SQLite Tutorial

General Resources

  1. How to get set-up for coding on your computer
  2. What is a Relational Database Management System?
  3. What you need to know about Git, GitHub & Coding in Teams
  4. How developer teams work
  5. First steps in tackling a group project
  6. Resource on writing pseudocode to get started with off-platform projects

Community Support

Looking for additional help or someone to work with (or somewhere to brag about your finished project)? Join our Community Discord to meet other learners like yourself!

Once you’re done…

Share in the Forums for feedback and to see some other ways of solving this problem!

13 Likes

Thanks for setting this project up, it was a great independent learning task!

I found that as I was going through it would have been handy to have an idea of the answers, so I could tell whether or not my solutions were correct or not.

I also misinterpreted “Is the number of times a track appear in any playlist a good indicator of sales?” because of the ambiguity of the word sales - when I attempted the task I had thought it meant unit sales, but when reviewing the hint it became clear it was average revenue rather than average unit sales.

I’ve shared my commented queries here - curious to see how others did it.

10 Likes

Hi,
I did the first part where it asks for tracks that appear in the most playlists, but I was wondering how we can find out which playlists these tracks appear in, without manually looking up the track ID in the playlist_track table?

Thank you for the project.

If anyone is interested, here are my answers to the tasks.

2 Likes

You can use this script to check that:
select tracks.TrackId , tracks.Name as ‘Track Name’, playlists.PlaylistId, playlists.Name as ‘Playlist Name’
from tracks left join playlist_track on tracks.TrackId = playlist_track.TrackId left join playlists on playlist_track.PlaylistId = playlists.PlaylistId
order by playlists.PlaylistId;

The list is ordered by PlaylistId.

Here are my answers to the project.

Hey newtr, thanks for sharing.

I believe you made a mistake on the second intermediate challenge; you’ve divided the total revenue per playlist frequency (i.e. 2-5) with the total number of songs which generated revenue per playlist frequency.

Your code gives an answer of about 1.0 for everything; not surprising, since you’re basically calculating occurrence*AVG(revenue)/occurrence.

I believe you need to account for songs which are part of the different playlist frequencies (i.e. 2-5) but which don’t make any sales. This should show that the less playlists a song appears on, the more revenue it makes on average.

Just commenting so others can see.

Hello, I believe your queries for revenue calculations are incorrect.
In your queries,

ifnull(sum(invoice_items.UnitPrice),0) as ‘Total sales of a track’

is incomplete, invoice_items lists unit price for tracks and the sum of unit prices would give just the price per track, but the way to calculate total revenue
generated by tracks would be

SUM(Total)

GROUP BY TrackId

which represents number of invoices sold per track * price per track, which would be the total revenue generated by the track. This is already stored in the invoice table as ‘Total’.

Here’s a Table Relationship diagram for the project.

These are my answers to the exercises, for reference.

Hi @kodes.k, I have a question about your solution code for the advanced challenge, particularly these lines:

SELECT curr.Year, curr.Previous_Year, curr.Revenue_for_Year, ROUND((curr.Revenue_for_Year - prev.Revenue_for_Year)/prev.Revenue_for_Year * 1.0, 2) AS 'Pct_change_compared_to_last_year'
FROM revenue curr LEFT JOIN revenue prev
ON curr.Previous_Year = prev.Year;

Since you named your temporary table revenue, I was just wondering why/how you refer to that as prev.Revenue_for_Year for instance? Same question for curr.Year, curr.Previous_Year, ect.

Also, I hope you don’t mind me pointing it out, but I believe that the result of this calculation:

ROUND((curr.Revenue_for_Year - prev.Revenue_for_Year)/prev.Revenue_for_Year

should be * 100

Prev refers to previous year’s revenue while curr refers to current year’s revenue.
Both tables initially hold the same values (they are just aliases to the revenue tables).

Here’s the output of a select * on revenue for instance:

#	Year	Previous_Year	Revenue_for_Year
1	2013	      2012	                  450.58
2	2012	      2011	                  477.53
3	2011	      2010	                  469.58
4	2010	      2009	                  481.45
5	2009	      2008	                  449.46

What prev and curr do is serve as reference to the same table; revenue, i.e I’m performing a self join on the table.
What I’ve done is:
Take revenue (curr) on the left, and take revenue(prev) on the right.
Perform a left join, joining all values in curr.previous_year = prev.year where previous_year is a temporary column solely for the purpose of comparing previous year values.
This fetches rows with current year’s revenue and previous year’s revenue.

Once joined, the tables look like this
Code_vr9CGXbN8C

and then I proceed with subtracting right from left (previous from current) and calculate percentage.
You’re correct, multiplying it by 100 should lead to proper percent values. I’ve left them as a decimal in the table.

1 Like

Thanks for your explanation @kodes.k - what I’m struggling with is that in the first part of your solution, you name your columns in in your temporary revenue table as Year, Previous Year, and Revenue_for_Year

So I’m just curious how SQL knows that curr.Revenue_for_Year - prev.Revenue_for_Year is actually a reference to Year - Previous_Year in the revenue table.

Perform a left join, joining all values in curr.previous_year = prev.year where previous_year is a temporary column solely for the purpose of comparing previous year values.

Again, with the reference to prev.year, I thought that to make a reference to the revenue table this would need to be revenue.year

Sorry to bother you with all the questions, but I just can’t seem to get my head around it! :weary:

Hello!
I believe the results from the first Query are wrong.
I analyzed your results and this song has 5 different TrackID´s but is the same song name ‘2 Minutes To Midnight’
which means the number of playlists are duplicating
TrackID Name # PlayLists
1357 2 Minutes To Midnight 2
1221 2 Minutes To Midnight 3
1319 2 Minutes To Midnight 3
1345 2 Minutes To Midnight 3
1289 2 Minutes To Midnight 2

Here is a Query of the Track ‘2 Minutes To Midnight’ and the Playlist Names.
You can see the Playlist name repeats depending on the TrackId which is not UNIQUE by Track Song.
Track Name TrkId Playlist Name
|2 Minutes To Midnight|1221|Music|
|2 Minutes To Midnight|1221|90’s Music|
|2 Minutes To Midnight|1221|Music|
|2 Minutes To Midnight|1289|Music|
|2 Minutes To Midnight|1289|Music|
|2 Minutes To Midnight|1319|Music|
|2 Minutes To Midnight|1319|90’s Music|
|2 Minutes To Midnight|1319|Music|
|2 Minutes To Midnight|1345|Music|
|2 Minutes To Midnight|1345|Music|
|2 Minutes To Midnight|1345|Heavy Metal Classic|
|2 Minutes To Midnight|1357|Music|
|2 Minutes To Midnight|1357|Music|
The right answer should be Music, Heavy Metal Classic and 90’s Music
This is 3 Playlists for the ‘2 Minutes To Midnight’ song, regardless of the TrackId.

Regards!

1 Like

Hello!
The results in this query has the same problem as the query before.
2 Minutes To Midnight is repeated 5 times with repeated Playlists names.
This because the 2 Minutes To Midnight has 5 different TrackId´s insteas of One.

regards!

Your questions are appropriate, part of the data science course goes from aggregates straight to common table expressions (CTEs), which can be intimidating, more so for beginners.
I feel the course creators could start off with subqueries first to allow for a easier transition to the sql part of this course.

To clarify the columns,
Year: Column to hold yearly revenue
Previous Year : (Temporary column) Year - 1
Revenue_For_Year : Aggregate (SUM of Total) by Year

As for the tables, there is a single table revenue, of which two instances are created ‘curr’ and ‘prev’. Both hold the same values.

The key part of the query here is the join clause, specifically

FROM revenue curr LEFT JOIN revenue prev
ON curr.Previous_Year = prev.Year;

I want all of previous_year’s revenue wherever possible. However, as you can see the last row in ‘Year’ is 2009 which means I won’t have 2008’s revenue.
Remember, an SQL LEFT JOIN does a match taking all the rows present on the left, matching corresponding values on the right. Where a value is not present, it is replaced by null.

That’s why the join is done on curr.previous_year = prev.year, which matches
curr’s values(2012…2008) to prev’s values(2013…2009).
Only matching year values and 2008 are kept since it’s a LEFT JOIN.

However, you could also do this by performing a RIGHT JOIN. In this case, what you’d do is:

FROM revenue curr RIGHT JOIN revenue prev
ON curr.year= prev.previous_year;

Unfortunately sqlite does not support right join, but you could use other sql processors to try and see for yourself.
Hope this helps!

2 Likes

That’s really helpful - thank you!

Hi @kodes.k,

I believe this is incorrect. invoice_items indeed lists the Unit Price, but if you SUM(invoice_items.UnitPrice) and GROUP BY TrackId, you are adding up one unit for each time the track id is invoiced. UnitPrice * Units = Total Price. Units here is each time the TrackId is mentioned in invoice_items, which, if you order the table by TrackId, we see is as many times as it is invoiced.

If you SUM(invoices.Total) and GROUP BY TrackId, you are adding up the entire invoice total (which includes the track you want but also any other tracks bought with it) each time the TrackId is found in invoice_items.TrackId (because it is linked with tracks). This results in a much larger number.

Even if my explanation is not right, you can verify this with code.

SELECT SUM(invoices.Total) FROM invoices;

2328.6

SELECT SUM(invoice_items.UnitPrice) FROM invoice_items;

2328.6

The sum is the same, and both are the total revenues of the given data, so therefore we know we cannot exceed this number. The revenues given in your table clearly exceed this number, and the numbers given using UnitPrice add up to the total revenue. You can verify this using a WITH statement.

WITH track_revenue AS(
SELECT tracks.name,
SUM(invoice_items.UnitPrice) AS ‘TotalRevenue’
FROM tracks
JOIN invoice_items
ON tracks.TrackId = invoice_items.TrackId
GROUP BY 1
ORDER BY 2 DESC
)
SELECT SUM(track_revenue.TotalRevenue) FROM track_revenue;

2328.6

WITH track_revenue AS(
SELECT tracks.TrackId, tracks.Name, SUM(Total) AS ‘TotalRevenue’
FROM tracks JOIN invoice_items
ON tracks.TrackId = invoice_items.TrackId
JOIN invoices
ON invoice_items.InvoiceId = invoices.InvoiceId
GROUP BY 1
ORDER BY 3 DESC
)
SELECT SUM(track_revenue.TotalRevenue) FROM track_revenue;

20848.6

Obviously much higher. There is no need to JOIN the invoices table in this case. Because we’re not given answers for these, I always verify to make sure my values make sense.

Thanks! Sharing my queries here.

This took me a while to do all of the questions but I tried and learned a bunch in the process outside the course so far.

1 Like

I have a question about the last task in the basic section: How many customers did each employee support?
When I wrote the query for the first part of the task, I got the number of clients per employee of around 20.
The query went like:

select employees.EmployeeId, count(customers.customerid)
from employees join customers on employees.employeeid=customers.supportrepid
group by EmployeeId

When I joined another table that was needed for the other part of the task, the number of clients per employee surged to ~140.

select employees.EmployeeId, count(customers.CustomerId), 
from employees join customers on employees.employeeid=customers.supportrepid
join invoices on invoices.customerid=customers.customerid
group by EmployeeId

What am I doing wrong in the first query that so many customers get lost?