Data Science Independent Project #3 – Education & Census Data

Project: Analyzing Education & Census Data

This project will take you off-platform and get you started in your own developer environment! Never done that before? Not to worry - we’ve shared some resources to help you down below. This project can be completed entirely on your own - or, you can join our Community Discord and find someone to work with! Jump to the community support section to hear more about this.

This project is broken down into key questions that your client or company is looking to answer. As a data scientist, you’ll often become a resource to help businesses answer the key questions about the efficacy of existing or potential strategies & projects.

Overview

Objective

Your advice is needed by a team of policymakers seeking to make more informed decisions on education. Help them investigate how external factors may influence performance in state assessment exams for public high school students.

Pre-requisites

In order to complete this project, we suggest that you have familiarity with the content in the following courses or lessons on the Codecademy platform:

  1. Manipulation
  2. Queries
  3. Aggregate Functions
  4. Multiple Tables

Suggested Technologies

First, get the data. Depending on where you are on your Path, there may be multiple technology options you can use to complete this project - we suggest the following:

  1. DB Browser for SQLite

Get started - hosting your project

DB Browser for SQLite is a visual tool for working with SQLite databases. Follow the link to download the application for your computer.

  • SQLite can store an entire database in a single file, which usually has a .sqlite or .db extension. To open a database, select Open Database at the top of the window and browse for the file. Alternatively, you can choose to create a New Database by saving a file with the .sqlite or .db extension.
  • To import data from a CSV file into a table, select “File > Import > Table from CSV file…” and browse for the CSV file. (Note: All fields imported from the CSV file will have a data type of TEXT. Be sure to convert fields to numeric type as needed. See here for how to do that.)

There are several tabs near the top of the window for working with the data:

  • Database Structure: View the tables in your database and the columns they contain.
  • Browse Data: Browse the data for each table.
  • Execute SQL: Write and execute SQL queries.

Project Tasks

You can download the data you’ll be using for this specific project here and here.

Basic Requirements

Let’s break this project down into a couple different parts.

Explore tables: Start off by exploring each table separately.

  • How many public high schools are in each zip code? in each state?
  • The locale_code column in the high school data corresponds to various levels of urbanization as listed below. Use the CASE statement to display the corresponding locale_text and locale_size in your query result.
    • Hint: Try taking a look at using the substr() function to help look at each part of the locale_code for determining locale_text and locale_size.
locale_text locale_code (locale_size)
City 11 (Large), 12 (Midsize), 13 (Small)
Suburb 21 (Large), 22 (Midsize), 23 (Small)
Town 31 (Fringe), 32 (Distant), 33 (Remote)
Rural 41 (Fringe), 42 (Distant), 43 (Remote)
  • What is the minimum, maximum, and average median_household_income of the nation? for each state?

Joint analysis: Join the tables together for even more analysis.

  • Do characteristics of the zip-code area, such as median household income, influence students’ performance in high school?
    • Hint: One option would be to use the CASE statement to divide the median_household_income into income ranges (e.g., <$50k, $50k-$100k, $100k+) and find the average exam scores for each.

Additional Challenges

Intermediate Challenge

  • On average, do students perform better on the math or reading exam? Find the number of states where students do better on the math exam, and vice versa.
    • Hint: We can use the WITH clause to create a temporary table of average exam scores for each state, with an additional column for whether the average for math or reading is higher. (Note: Some states may not have standardized assessments, so be sure to also include an option for No Exam Data) Then, in your final SELECT statement, find the number of states fitting each condition.

Advanced Challenge

  • What is the average proficiency on state assessment exams for each zip code, and how do they compare to other zip codes in the same state?
    • Note: Exam standards may vary by state, so limit comparison within states. Some states may not have exams. We can use the WITH clause to create a temporary table of exam score statistic for each state (e.g., min/max/avg) - then join it to each zip-code level data to compare.

Resources & Support

Project-specific resources

  1. SQLite Documentation &
  2. SQLite Tutorial
  3. SQLite substr() Function
  4. US Census Data
  5. Public HS Data
  6. State Assessment Data

General Resources

  1. How to get set-up for coding on your computer
  2. What is a Relational Database Management System?
  3. What you need to know about Git, GitHub & Coding in Teams
  4. How developer teams work
  5. First steps in tackling a group project
  6. Resource on writing pseudocode to get started with off-platform projects

Community Support

Looking for additional help or someone to work with (or somewhere to brag about your finished project)? Join our Community Discord to meet other learners like yourself!

Once you’re done…

Share in the Forums for feedback and to see some other ways of solving this problem!

12 Likes

Join Analysis using CASE statement:
So this is how I was able to analyze some of the data on household income and influence on student’s performance on math and reading scores.

Answer

SELECT
CASE
WHEN (median_household_income < 50000) THEN ‘Low Income’
WHEN (median_household_income BETWEEN 50000 AND 100000)
THEN ‘Middle Income’
WHEN (median_household_income > 100000) THEN ‘High Income’
END AS ‘Income Range’,

ROUND(AVG(p.pct_proficient_math), 2) AS ‘Average Math Scores’, ROUND(AVG(pct_proficient_reading), 2) AS
‘Average Reading Scores’

FROM census_data as ‘c’
LEFT JOIN public_hs_data as ‘p’
ON c.zip_code = p.zip_code
WHERE p.pct_proficient_math IS NOT NULL
GROUP BY 1
ORDER BY 2;

You can change the ORDER BY the columns to see the ranges of math scores versus reading scores for high income earners vs. Middle Income earners vs. low income earners. You can also add another column near the SELECT statement to see how the data is compared by states, zip codes etc. But don’t forget to change the GROUP BY column if you do so. If anybody has a different way of doing it, I would love to see how you did yours different from mine.

1 Like

Intermediate Challenge:

  • On average, do students perform better on the math or reading exam? Find the number of states where students do better on the math exam, and vice versa.
    • Hint: We can use the WITH clause to create a temporary table of average exam scores for each state, with an additional column for whether the average for math or reading is higher. (Note: Some states may not have standardized assessments, so be sure to also include an option for No Exam Data ) Then, in your final SELECT statement, find the number of states fitting each condition.
Answer

WITH table1 AS (
SELECT c.state_code AS ‘States’, ROUND(AVG(pct_proficient_math), 2)
AS ‘Math Scores’,
ROUND(AVG(pct_proficient_reading), 2) AS ‘Reading Scores’,

CASE
WHEN (AVG(p.pct_proficient_math) > AVG(p.pct_proficient_reading)) THEN ‘Math’
WHEN (AVG(p.pct_proficient_math) < AVG(p.pct_proficient_reading)) THEN
‘Reading’
ELSE ‘No Exam Data’
END AS ‘Math vs. Reading’

FROM census_data as ‘c’
LEFT JOIN public_hs_data as ‘p’
ON c.zip_code = p.zip_code
GROUP BY 1
ORDER BY 2 DESC)

SELECT *
FROM table1 as ‘t’;

The answer is actually similar to the Joint Analysis above, with the exception of using WITH clause to create a temporary table AND adding another column to that table. As always if you have a better way of answering the question, please do share.

2 Likes

Hello, I’m working on this project and in this task " What is the minimum, maximum, and average median_household_income of the nation? for each state?" happens something interesting: I get the minimum and average median_household_income, but the maximum returns NULL, why is that?

The code I’m using is: SELECT MIN(median_household_income), MAX(median_household_income), AVG(median_household_income) FROM ‘census_data - census_data’;

Because some of the people who filled out the census did not want to disclose their median household income, the median household income box for those people says NULL. So the code would be:
SELECT MIN(median_household_income), MAX(median_household_income), AVG(median_household_income)
FROM census_data
WHERE median_household_income != ‘NULL’;

3 Likes

Thank you very much!

Hi @netsurfer21224 @css4475885439 @proskhmer08

I have just tried to download the data for this project as a CSV. When I open it up in SQL, the names of the columns are “field 1, field 2” ect (see screenshot).

I was just wondering when you completed this project, if you started off by renaming all of the columns? Or if I’m doing something wrong? I’m on a Mac by the way @alyssavigil @coffeencake I wonder if you might be able to help please?

Hello @callmej9,

I’m going to retrace my steps to recall what I did exactly.

If you look at row 1, all your column names are there. You shouldn’t have to rename the columns.

Hello @callmej9,

When you import the file, select Column names in first line.

1 Like

Thank you so much - I will try that!

1 Like

Hi. Yes. Check that box for Column names in first line. That should fix your problem.

2 Likes

These are my answers to the questions, feedback is always welcome

Basic Requirements

How many public high schools are in each zip code? In each state?

SELECT  COUNT(school_name) as schools,  zip_code
FROM public_hs
GROUP BY zip_code
ORDER BY 1 DESC; 


SELECT  COUNT(school_name) as schools,  state_code
FROM public_hs
GROUP BY 2
ORDER BY 1 DESC; 

The locale_code column in the high school data corresponds to 
various levels of urbanization as listed below. 
Use the CASE statement to display the corresponding locale_text and 
locale_size in your query result.

SELECT school_name, locale_code,

CASE WHEN locale_code <= 13 THEN 'City'
WHEN locale_code <= 23 THEN 'Suburb'
WHEN locale_code <= 33 THEN 'Town'
WHEN locale_code <= 43 THEN 'Rural'
END AS 'Urbanization',

CASE WHEN  locale_code <=23 THEN
CASE substr(locale_code, 2, 2)
WHEN '1' THEN 'Large'
WHEN '2' THEN 'Midsize'
WHEN '3' THEN 'Small'
END 

WHEN locale_code >=31 THEN 
CASE substr(locale_code, 2, 2)
WHEN '1' THEN 'Fringe'
WHEN '2' THEN 'Distant'
WHEN '3' THEN 'Remote'
ELSE 'No data'
END

END AS 'Size'
FROM public_hs;


What's the minimum, maximum and average median_household_income of the nation?
For each state?

For the nation:

SELECT MIN(median_household_income) AS Min, MAX(median_household_income) as Max, ROUND(AVG(median_household_income), 2) as Average
FROM  census_data
WHERE median_household_income != 'NULL';

For each state:

SELECT state_code, MIN(median_household_income) AS Min, MAX(median_household_income) as Max, ROUND(AVG(median_household_income), 2) as Average
FROM  census_data
WHERE median_household_income != 'NULL'
GROUP BY state_code;

Join the 2 tables together
Do characteristics of the zip code area, such as median household income, influence 
student's performance in high school?

SELECT ROUND(AVG(pct_proficient_math), 0) AS 'Math Results', ROUND (AVG(pct_proficient_reading), 0) AS 'Reading Results', 

CASE WHEN median_household_income <50000 THEN '<$50k'
WHEN median_household_income >= 50000 AND median_household_income <= 100000 THEN '$50-100k'
WHEN median_household_income > 100000 THEN '$100k+'
ELSE 'No data available'
END AS 'Income bracket'

FROM public_hs
JOIN census_data
ON  public_hs.zip_code = census_data.zip_code
WHERE median_household_income != 'NULL'
GROUP BY 3
ORDER BY 3 DESC; 

Intermediate Challenge:

On average, do students perform better on the math or reading exam?
Find the number of states where students do better on the math exam and vice versa.

WITH highest_results AS (SELECT ROUND(AVG(pct_proficient_math), 0) AS 'Math Results', ROUND (AVG(pct_proficient_reading), 0) AS 'Reading Results', state_code, 
CASE WHEN ROUND(AVG(pct_proficient_math), 0) > ROUND (AVG(pct_proficient_reading), 0) THEN 'Math'
WHEN ROUND (AVG(pct_proficient_reading), 0) > ROUND(AVG(pct_proficient_math), 0) THEN 'Reading'
ELSE 'No Exam Data'
END AS 'Highest_Subject'
FROM public_hs
GROUP BY 3)

SELECT COUNT(*) as 'Number of States', Highest_Subject
FROM highest_results
GROUP BY Highest_Subject;

Advanced Challenge:

What's the average proficiency on state assessment exams for each zip code, and how do they compare to other zip codes in the same state? 
Figure out state average and then how zip codes compare to that average 

WITH state_results AS (SELECT public_hs.state_code as 'State', ROUND(AVG(pct_proficient_math), 0) AS 'State_Math_Avg', ROUND (AVG(pct_proficient_reading), 0) AS 'State_Reading_Avg', MAX(pct_proficient_math) AS 'Max Math Score', MIN(pct_proficient_math) AS 'Min Math Score',
MAX(pct_proficient_reading) AS 'Max Reading Score', MIN(pct_proficient_reading) AS 'Min Reading Score'
FROM public_hs
GROUP BY 1)


SELECT state_results.state, public_hs.zip_code, state_results.State_Math_Avg, ROUND(AVG(pct_proficient_math), 0) AS 'Zip_Math_Avg', state_results.State_Reading_Avg, 
ROUND(AVG(pct_proficient_reading), 0) AS 'Zip_Reading_Avg'
FROM public_hs
JOIN state_results
ON public_hs.state_code = state_results.state
GROUP BY 2; 
6 Likes

Hi! @coffeencake - Any chance you could me with this?

I imported the CSV (including column names). When I submit a query, I keep getting the error: “no such table.”

Any ideas how to fix this? Did I skip a step?

Thanks!

Hello @davidlucente65104858,

Can you upload a screen print of your table import? As well, can you include your select statement?

@coffeencake

sorry couldn’t fit these in one post as a new user

Hi @davidlucente65104858,

Thanks for letting us know that you couldn’t fit both images as a new user.

The table was created as public_hs_data - public_hs_data. To fix the problem, try modifying the table name to just public_hs_data.

Let me know if that works.

1 Like

In that case I have another question. I tried to use case
“WHERE median_household_income IS NOT NULL”
And it didn’t work. Why is that?

Hi guys. I dont have Excel on my laptop now. Can i still do this task? Because I am having problem with importing database to DB browser.