Data Science Independent Project #5 – Analyze Airfare Data

Project: Analyzing Airfare Data

This project will take you off-platform and get you started in your own developer environment! Never done that before? Not to worry - we’ve shared some resources to help you down below. This project can be completed entirely on your own - or, you can join the data-science channel in our Community Discord and find someone to work with! Jump to the community support section to hear more about this.

This project is broken down into key questions that your client or company is looking to answer. As a data scientist, you’ll often become a resource to help businesses answer the key questions about the efficacy of existing or potential strategies & projects.

Overview

Objective

As the data analyst of a company, you are presented with the airfare data covering the top 1,000 contiguous state city-pair markets. Help them analyze this data and identify trends.

Pre-requisites

In order to complete this project, we suggest that you have familiarity with the content in the following courses or lessons on the Codecademy platform:

  1. Queries
  2. Aggregate Functions

Suggested Technologies

Depending on where you are on your Path, there may be multiple technology options you can use to complete this project - we suggest the following:

  1. DB Browser for SQLite

Project Tasks

Get started - hosting your project

DB Browser for SQLite is a visual tool for working with SQLite databases. Follow the link to download the application for your computer.

  • SQLite can store an entire database in a single file, which usually has a .sqlite or .db extension. To open a database, select Open Database at the top of the window and browse for the file. Alternatively, you can choose to create a New Database by saving a file with the .sqlite or .db extension.
  • To import data from a CSV file into a table, select “File > Import > Table from CSV file…” and browse for the CSV file. (Note: All fields imported from the CSV file will have a data type of TEXT. Be sure to convert fields to numeric type as needed. See here for how to do that.)

There are several tabs near the top of the window for working with the data:

  • Database Structure: View the tables in your database and the columns they contain.
  • Browse Data: Browse the data for each table.
  • Execute SQL: Write and execute SQL queries.

Basic Requirements

You can download the data you’ll be using for this specific project here.

Let’s break this project down into a couple different parts.

Exploration: Familiarize yourself with the dataset.

  • What range of years are represented in the data?
  • What are the shortest and longest-distanced flights, and between which 2 cities are they?
    • Note: When we imported the data from a CSV file, all fields are treated as a string. Make sure to convert the value field into a numeric type if you will be ordering by that field. See here for a hint.
  • How many distinct cities are represented in the data (regardless of whether it is the source or destination)?
    • Hint: We can use UNION to help fetch data from both the city1 and city2 columns. Note the distinction between UNION and UNION ALL.

Analysis: Further explore and analyze the data.

  • Which airline appear most frequently as the carrier with the lowest fare (ie. carrier_low)? How about the airline with the largest market share (ie. carrier_lg)?
  • How many instances are there where the carrier with the largest market share is not the carrier with the lowest fare? What is the average difference in fare?

Additional Challenges

Intermediate Challenge

  • What is the percent change in average fare from 2007 to 2017 by flight? How about from 1997 to 2017?
    • Hint: We can use the WITH clause to create temporary tables containing the airfares, then join them together to compare the change over time.
  • How would you describe the overall trend in airfares from 1997 to 2017, as compared 2007 to 2017?

Advanced Challenge

  • What is the average fare for each quarter? Which quarter of the year has the highest overall average fare? lowest?
    • Note: Not all flights (ie. each city-pair route) have data from all 4 quarters - which may skew the average. Let’s try considering only flights that have data available for all 4 quarters.
  • Considering only the flights that have data available on all 4 quarters of the year, which quarter has the highest overall average fare? lowest? Try breaking it down by year as well.
    • Hint: To consider only flights that have data available for all 4 quarters, we could join the table with itself - each of those tables should be filtered to have data from one quarter.

Resources & Support

Project-specific resources

  1. SQLite Documentation
  2. SQLite Tutorial
  3. View Data Variable Description

General Resources

  1. How to get set-up for coding on your computer
  2. What is a Relational Database Management System?
  3. What you need to know about Git, GitHub & Coding in Teams
  4. How developer teams work
  5. First steps in tackling a group project
  6. Resource on writing pseudocode to get started with off-platform projects

Community Support

Want to look for additional help or someone to work with (or somewhere to brag about your finished project)? Join our Community Discord to meet other learners like yourself!

Once you’re done…

Share in the Forums for feedback and to see some other ways of solving this problem!

5 Likes

Hi ! Thank you for the project ! But… can access the data base with DB Browser… It needs a encryption key to open the data base downloaded from Google Sheet… Where can I get this key?

Thanks !

Greg

3 Likes

The same here. Did you understand how to solve the problem?

Hello, welcome to the forum :slight_smile:

  1. Download sheet https://docs.google.com/spreadsheets/d/1JOQ5bnInxX9Mpf_3xtsX38cw-G1uctiu5jTrjzkJoh0/edit#gid=809136041 as a .csv or .tsv file;
  2. Create a new database using DB Browser;
  3. Use File -> Import -> Table from CSV file... operation to import content of downloaded .csv or .tsv file as a new table in the created database.

No encryption keys needed :slight_smile:

5 Likes

I’m trying to import the csv file as a table in the DB Browser but I receive the error “Error importing data from record number 90021. Inserting row failed: not an error.” Any way I can fix this? Cheers.

2 Likes

Hi kpsimon,

I opened and saved the file in excel and the error was gone.
Hope this helps.

2 Likes

Hi, did anyone do this project?
I’m stuck and i would love some help.

I think in order to find the answer to:

“How many instances are there where the carrier with the largest market share is not the carrier with the lowest fare?”

I need to write:
SELECT COUNT(carrier_low)
FROM airfare_data
WHERE carrier_low IS NOT ‘WN’;

(“WN” is the airline with the largest market share)

But I don’t understand how to solve:

“What is the average difference in fare?”

Anyone?

2 Likes

I got the same error. Unfortunately I don’t have excel, so if you find a way to fix it without just opening it in excel let everyone know. Thank…

I have downloaded the file as .csv, opened in excel and then saved as .xlsx
Nothing works when I try to import in DB Browser.
I then opened the csv file in excel, copied all the rows and columns and pasted in a blank excel sheet and gave it a new name.
In DB Browser, created a new database, a new window pops up…close that window and now you have a database structure where there are all zeroes.
Now imported the new csv file, clicked OK on the new window (in DB Browser), it downloaded the table and now it works! I hope this helps.

1 Like

Hi, I was having the same issue (on a MacBook, so opening it in excel as suggested above wasn’t an option).

I managed to get it to work by saving the excel data as a .TSV, instead of a .CSV. I hope that helps!

This may be late to help you, but maybe it’ll help someone else with the same problem:

I solved this by typing,

SELECT ROUND(AVG(fare_lg - fare_low), 2) AS ‘Average Difference in Fare’
FROM airfare_data
WHERE carrier_low != carrier_lg;

This returns an average difference between the largest market share fare and the lowest fare (make sure it’s in that order so you don’t accidentally wind up with a negative number!), rounded to the second decimal place to keep it neat.

You also want to add the above WHERE clause so that the average doesn’t include any fares that are the same.

If you want to make this nice and neat, you can add this as a second column to the query counting the mismatched carriers. You’ll wind up with fewer result tables from this project that way.

2 Likes

For everyone that is having issues with importing this table (Note: I use SQLite3 and Git Bash):

I found that saving the spreadsheet to my Google Drive allowed me to change the title of the table from “airfare_data.csv” to “airfare_data”. From there, I downloaded it as a csv file. This may not make a difference, but I couldn’t get anything to work right until I changed the table name.

Within the command line, I created an SQLite file named airfare_data.sqlite by typing
sqlite3 airfare_data.sqlite

To import data from the csv file to the sqlite file, I typed

.mode csv
.input airline_data.csv airline_data

(Note: for the .input command, the first phrase is the file you are inputting and the second phrase is what you want to name your table.)

From there, every column is read as text. This will mess with your queries, so you need to change that. Some programs will let you do that. However, if you’re using SQLite, you’ll need to follow this guide to create a new table with the correct data types, import data from the initial table, and save it. https://www.techonthenet.com/sqlite/tables/alter_table.php

From there, everything is hunky dory!

1 Like

hey guys,

no matter what i try i still get in DB Browser the same error when trying to import the data:

Error importing data from record number 90021.

Inserting row failed: not an error

does anyone has an idea how can I import data?

i don’t mind also to use an sql editor which is online

Thanks

Tomer

Similarly to others, I’m having the error:

Error importing data from record number 90021.

Inserting row failed: not an error

I don’t have Excel, as others have mentioned as well, and am lost on how to properly upload the file. Could anyone help?

please can i get a link to join the challenge

Thank you so much! helped alot.

I FIXED THE PROBLEM
“Error importing data from record number 90021. Inserting row failed: not an error.”

Like many people here, I do not have excel, so the solution other have posted here was not helpful. Also, since I am very new to coding, I am not sure where command line is in sqlite. So here’s how I finally solved this problem (For TL;DR, go to the bottom).

Here’s what I did. I followed petrmorgoun500341077 's advice to save the file as .TSV instead of .CSV, but then all of my information from the first several columns was in the same “field” or column, specifically column “year” all the way to column “city 1” are all merged. Then I noticed that city 1 was not ALL merged with other column, only “half” of city 1 was, which was weird, then I noticed it’s because city 1 has data that contain a comma (,) which was what sqlite used to separate column, aka “field”. I then tried to start everything new, upload the .tsv file, but this time at the “Field separator” drop down selection, I chose “Tab” instead of “,”. After that just click okay, and everything works now. Finally, I can properly start on this project.

TL;DR:

  1. Save the file as .TSV
  2. File -> Import -> Table from CSV file... import the airfare .TSV (you may have to select “All files” next to “File name” to see the .TSV that you downloaded)
  3. Select “Tab” instead of “,” in Field Separator option.
  4. Select “Ok”.

Hopefully this help you: kpsimon dsprehe89 @giga1946953891 @inspotrection and future codecademians.

1 Like

Hi, I think I solved this, but I am wondering how other have solved this:

What are the shortest and longest-distanced flights, and between which 2 cities are they?

My code is:

SELECT MIN(nsmiles), MAX(nsmiles) FROM airfare;

Then once I got the shortest miles and longest miles, I typed

SELECT DISTINCT city1, city2 FROM airfare
WHERE nsmiles=109 OR nsmiles= 2724;

But I feel like there’s an easier way to do it. I want to type

SELECT DISTINCT city1, city2 FROM airfare
WHERE nsmiles=MIN(nsmiles) OR nsmiles=MAX(nsmiles);

but apparently that’s misuse of aggregates. What solutions do you guys have?

Anyone know what to join the two temporary tables on when answering this question?

What is the percent change in average fare from 2007 to 2017 by flight? How about from 1997 to 2017?

Two separate tables:

SELECT MAX(nsmiles), city1, city2
FROM airfare_data

And then

SELECT MIN(nsmiles), city1, city2
FROM airfare_data