Data Science Independent Project #5 – Analyze Airfare Data

I formated the LINE column that way just because I wanted to see the joined table first.
Just to check if everything is all right.
The table with filtered flights from my previous post is here:

WITH 
	q1(Year, Line, Fare_Q1) AS
	(SELECT Year, city1 || " -> " || city2, fare
	FROM airfare_data
	WHERE quarter = 1),
	
	q2(Year_2, Line_2, Fare_Q2) AS
	(SELECT Year, city1 || " -> " || city2,	fare
	FROM airfare_data
	WHERE quarter = 2),
	
	q3(Year_3, Line_3, Fare_Q3) AS
	(SELECT Year, city1 || " -> " || city2, fare
	FROM airfare_data
	WHERE quarter = 3),
	
	q4(Year_4, Line_4, Fare_Q4) AS
	(SELECT Year, city1 || " -> " || city2, fare
	FROM airfare_data
	WHERE quarter = 4)
	
SELECT 
	Year, Line, Fare_Q1, Fare_Q2, Fare_Q3, Fare_Q4
FROM 
	q1, q2, q3,q4
WHERE 
	q1.Year = q2.Year_2 AND q1.Year = q3.Year_3 AND q1.Year = q4.Year_4
	AND 
	q1.Line = q2.Line_2 AND q1.Line = q3.Line_3 AND q1.Line = q4.Line_4

ORDER BY 1,2;

Hello, thanks for your response :slight_smile: i had reviewed my code and adjusted it, now the result is the same as your.
However i use much simpler way. Let me share:

alter table data
add line_year as (city1 || " → " || city2 || “_” || year);

With
data1 (Year,line,line_year,quarter) as
(Select year,city1 || " → " || city2,city1 || " → " || city2 || “_” || year,quarter
from data
group by 2,1
Having count(distinct quarter) = 4),
final(year,line,quarter,fare) as
(Select data.year,data.line,data.quarter,data.fare
from data
join data1
on data.line_year = data1.line_year)

select Quarter,round(avg(fare),2) as AvgFare from final
group by 1;"

WELL DONE!
Yes, it’s much more elegant, simple and faster.

I was not yet introduced to HAVING clause.
Thank you!

But what bothers me a little bit is altering the original table.
Everytime you try the code (after altering the table)
you have to delete the new column line_year from the original table
(or delete first two lines of the code)
The original table is changed.

I would prefere to leave the original table as it is, and instead of modifying it
I would create another temporary table.

I think there is a mistake in line:
Select data.year,data.line,data.quarter,data.fare
It should be data1.line.
But, you don’t need that column at all.
Actually, there is a lot of columns you don’t need at all in your temporyary tables
to make it even more simple.
For example, you need only line_year in the table where you filter flights that have data available on all 4 quarters of the year

One more time, thank you for showing me the way!

WITH
--adding line_year column
	table_1 (Year, quarter, fare, line_year) AS
	(SELECT Year, quarter, fare, city1 || " → " || city2 || "_" || year
	FROM airfare_data),

-- filtering only flights that have data available on all 4 quarters of the year
	table_2 (line_year) AS
	(SELECT line_year
	FROM table_1
	GROUP BY 1
	HAVING COUNT(DISTINCT quarter) = 4),

-- joining tables, but actually just filtering table_1	
	final (year, quarter, fare) as
	(SELECT table_1.year, table_1.quarter, table_1.fare
	FROM table_1
	JOIN table_2
	ON table_1.line_year = table_2.line_year)

SELECT Quarter, ROUND(AVG(fare), 2) AS AvgFare 
FROM final
GROUP BY 1;

P.S. My English is not at the top level, so forgive me if there is something not clear, or if there are mistakes in my writing.

Hi there, no problem. I fully understand and agree with your new way :wink:
I started learning coding just one month ago, and part of my answer really was developed from your previous code, so i should be the one to say “Thank you” ! :grinning:

Hey let’s connect outside of this to learn together, shall we? Here is my personal email: [email protected]

Thanks @rorygower for sharing your answer! I have a few questions for how it works:

  1. How are you ensuring that you are filtering by years that have all four quarters of data? I see that you use HAVING count(DISTINCT quarter)=4 to filter. But I don’t see where you do this by year.

The issue that I imagine may occur is that, for example, the route from NYC to LA may have data for Q1, Q2 and Q3 only in 1999, and then data for all four quarters in 2000. How does your query ensure it is not factoring in Q4 of 2000, so as not to include data from year 1999?

I also have more fundamental questions about the way your query works:

  1. How do your WITH clause subqueries work exactly? I see a union between two identical subqueries. What purpose does this solve? I imagine it relates to how you use city1 and city2 in your final WHERE statement, but I don’t really understand what purpose it is serving: WHERE routes.city1=airfare_data.city1 AND routes.city2=airfare_data.city2

  2. In your main query, I notice that your FROM simply lists the main table, airfare_data, and your CTE table, ‘routes’. Don’t you have to use some sort of JOIN between the two tables?

Thank you in advance! I look forward to hearing from you.

@netsurfer21224 and @callmej9 As you both note, city1 is not a unique identifier as it must pair with city2 in order to be unique. To address the lack of a unique ID for each route (E.g. NYC to LA, versus NYC to Miami), I created a new column called ‘Flight’ which concatenates citymarketid1 and citymarketid2 as the unique identifier:

citymarketid_1|| ‘-’ || citymarketid_2 AS Flight,

Note: The syntax for concatenation in SQLite is different from the SQL standard: CONCAT().*

Here’s my full code for the question of annual average fares by flight for 97, 07 and 17:

-- Subquery: Annual avg fares by Flight for 97, 07, 17 -- WITH Seventeen AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf17 FROM airfare_data WHERE Year = 2017 GROUP BY Flight ), Seven AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf07 FROM airfare_data WHERE Year = 2007 GROUP BY Flight ), NinetySeven AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf97 FROM airfare_data WHERE Year = 1997 GROUP BY Flight ) -- Main query: Output all 3 years by Flight -- SELECT NinetySeven.Flight, avgf97, avgf07, avgf17, ROUND((avgf17 - avgf07) / avgf07 * 100,1) AS '% Change 2007 vs. 2017', ROUND((avgf17 - avgf97) / avgf97 * 100,1) AS '% Change 1997 vs. 2017' FROM NinetySeven LEFT JOIN Seven ON NinetySeven.Flight = Seven.Flight LEFT JOIN Seventeen ON Seven.Flight = Seventeen.Flight

Here’s how I calculated the average change in fares:

-- Subquery: Annual avg fares by Flight for 97, 07, 17 -- WITH Seventeen AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf17 FROM airfare_data WHERE Year = 2017 GROUP BY Flight ), Seven AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf07 FROM airfare_data WHERE Year = 2007 GROUP BY Flight ), NinetySeven AS( SELECT citymarketid_1|| '-' || citymarketid_2 AS Flight, ROUND(AVG(fare),0) AS avgf97 FROM airfare_data WHERE Year = 1997 GROUP BY Flight ) -- Main query: Avg fare percent changes for 97, 07, 17-- SELECT ROUND(AVG((avgf07 - avgf97) / avgf97 * 100),1) AS 'Avg Change 97 vs 07', ROUND(AVG((avgf17 - avgf07) / avgf07 * 100),1) AS 'Avg Change 07 vs 17', ROUND(AVG((avgf17 - avgf97) / avgf97 * 100),1) AS 'Avg Change 97 vs 17' FROM NinetySeven LEFT JOIN Seven ON NinetySeven.Flight = Seven.Flight LEFT JOIN Seventeen ON Seven.Flight = Seventeen.Flight

Some weird stuff on syntax:

  • I wasn’t able to name my subqueries as numbers: e.g. “1997.” But when I renamed them with text, e.g.: “NinetySeven” it worked. No idea why.

  • I had the same issue in naming the subquery columns. For example, it didn’t accept “1997 Avg Fare”, but it did accept: “avg97.” Again, no idea why.

Hi I am trying to get access to the dataset for this exercise but every time I click on the link it says I don’t have it. I am a codeacademy pro member so please get me access.

Hi there!
Sorry about this bug, I shared the problem with the team and they’ll get the link fixed soon.

I tried based on your instructions but DB browser keeps quiting many times and I haven’t been able to work on it yet.
Could you please help me out? Thank you so much.

It did not work after creating a new database , edit table definition came out so I have to write something and create table by myself so how can I use file → import…

It is more tricky to open the file than to learn how SQL works. It took me one houur with reading all the comments and finally nothing works out…

This is pretty late, but you can also use subqueries like this to retrieve both the MIN and MAX flights at the same time :

SELECT DISTINCT city1, city2, nsmiles AS 'miles'
FROM airfare_data
WHERE nsmiles = (SELECT MIN(nsmiles) FROM airfare_data) OR
       nsmiles = (SELECT MAX(nsmiles) FROM airfare_data);

Here are all my answers if anyone wants to comment on any of it.

/* Exploration */

–What range of years are represented in the data?

SELECT (MIN(year) || ' - ' || MAX(year)) AS 'Range of Years'
FROM airfare_data;

–What are the shortest and longest-distanced flights, and between which 2 cities are they?

SELECT DISTINCT city1, city2, nsmiles AS 'miles'
FROM airfare_data
WHERE nsmiles = (SELECT MIN(nsmiles) FROM airfare_data) OR
	nsmiles = (SELECT MAX(nsmiles) FROM airfare_data);

-----OR-----

SELECT city1, city2, MIN(nsmiles) as distance FROM airfare_data
UNION
SELECT city1, city2, MAX(nsmiles) as distance FROM airfare_data;

–How many distinct cities are represented in the data (source or destination)?
SELECT city1
FROM airfare_data
UNION
SELECT city2
FROM airfare_data
ORDER BY 1;

/* Analysis */

–Which airline appear most frequently as the carrier with the lowest fare(ie. carrier_low)?

SELECT carrier_low, COUNT(*) AS 'Frequency'
FROM airfare_data
WHERE carrier_low != 'NULL'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;

–How about the airline with the largest market share (ie. carrier_lg)?

SELECT carrier_lg, COUNT(*) AS 'Frequency'
FROM airfare_data
WHERE carrier_low != 'NULL'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;

–How many instances are there where the carrier with the largest market share is not
– the carrier with the lowest fare? What is the average difference in fare?

SELECT COUNT(carrier_lg) AS 'Num of Lg Carriers w/o Lowest Fare'
FROM airfare_data
WHERE carrier_lg != carrier_low;

/* Intermediate Challenges */

–What is the percent change in average fare from 2007 to 2017 by flight?
–REMINDER: PERCENT CHANGE = ((value2 - value1) / value1)*100

WITH fares1 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 2007
	GROUP BY 1
),
	fares2 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 2017
	GROUP BY 1
)
SELECT 
	fares1.year, fares1.Average, fares2.year, fares2.Average,
	ROUND(((fares2.Average-fares1.Average)/fares1.Average)*100, 1) AS 'Percent Change'
FROM fares1, fares2;

–How about from 1997 to 2017?

WITH fares1 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 1997
	GROUP BY 1
),
	fares2 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 2017
	GROUP BY 1
)
SELECT 
	fares1.year, fares1.Average, fares2.year, fares2.Average,
	ROUND(((fares2.Average-fares1.Average)/fares1.Average)*100, 1) AS 'Percent Change'
FROM fares1, fares2;

–How would you describe the overall trend in airfares from 2007 to 2017, as compared 2007 to 2017?

WITH fares1 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 1997
	GROUP BY 1
),
	fares2 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 2007
	GROUP BY 1
),
	fares3 AS(
	SELECT year, ROUND(AVG(fare), 2) as 'Average'
	FROM airfare_data
	WHERE year = 2017
	GROUP BY 1
)
SELECT 
	ROUND(((fares3.Average-fares1.Average)/fares1.Average)*100, 1) AS '% Chg 1997-2017',
	ROUND(((fares3.Average-fares2.Average)/fares2.Average)*100, 1) AS '% Chg 2007-2017'
FROM fares1, fares2, fares3;

/* Advanced Challenges */

–What is the average fare for each quarter per year(all flights)? Which quarter has the highest average fare? Lowest?

WITH Q1 AS(
	SELECT year, quarter, ROUND(AVG(fare), 2) AS 'Q1_Avg_Fare'
	FROM airfare_data
	WHERE quarter = 1
	GROUP BY 1
),
	Q2 AS(
	SELECT year, quarter, ROUND(AVG(fare), 2) AS 'Q2_Avg_Fare'
	FROM airfare_data
	WHERE quarter = 2
	GROUP BY 1
),
	Q3 AS(
	SELECT year, quarter, ROUND(AVG(fare), 2) AS 'Q3_Avg_Fare'
	FROM airfare_data
	WHERE quarter = 3
	GROUP BY 1
),
	Q4 AS(
	SELECT year, quarter, ROUND(AVG(fare), 2) AS 'Q4_Avg_Fare'
	FROM airfare_data
	WHERE quarter = 4
	GROUP BY 1
)
SELECT Q4.year, 
	Q1.Q1_Avg_Fare, Q2.Q2_Avg_Fare, Q3.Q3_Avg_Fare, Q4.Q4_Avg_Fare
FROM Q1
JOIN Q2
	ON Q1.year = Q2.year
JOIN Q3
	ON Q2.year = Q3.year
JOIN Q4
	ON Q3.year = Q4.year;

/* Considering only flights that have data availabe on all 4 quarters of the year,
which quarter has the highest overall average fare? Lowest? Try breaking down by year as well. */

--Add column to table to create a unique identifier for each flight.
--ALTER TABLE airfare_data
--ADD flight_year AS (city1 || ' -> ' || city2 || '_' || year);

WITH
data1 (Year, flight, flight_year, quarter) AS(
	SELECT 
		year,
		city1 || ' -> ' || city2,
		city1 || ' -> ' || city2 || '_' || year,
		quarter
	FROM airfare_data
	GROUP BY 2,1
	HAVING COUNT(DISTINCT quarter) = 4
),
final(year, flight, quarter, fare) AS(
	SELECT airfare_data.year, data1.flight, airfare_data.quarter, airfare_data.fare
	FROM airfare_data
	JOIN data1
	ON airfare_data.flight_year = data1.flight_year)

SELECT year, flight, Quarter, ROUND(AVG(fare), 2) as AvgFare 
FROM final
GROUP BY 1, 2, 3
ORDER BY 1, 2;

Can someone point me to the answers please? I’d like to know whether what I’m doing is right. Then I can find better ways of getting there :slight_smile:

1 Like

SOLUTION FOR THE ADVANCED CHALLENGE

This analysis would be of great to help to Solve the Advanced Challenge:

  • What is the average fare for each quarter? Which quarter of the year has the highest overall average fare? lowest?
/*Find the cost of every flight.
Goal: Trying to extract the needed column from the main table "AirfareData" before filtering the Flights in every Quarter*/

WITH Flights_cost AS (
SELECT Year, 
	Quarter,
	City1,
	City2,
	fare
FROM AirfareData),


/*Find the flights that have data in all 4 Quarter*/

Flights_In_4Qs AS (
SELECT Year,
	City1,
	City2,
	COUNT(Quarter) No_count
FROM AirfareData
GROUP BY 1, 2, 3
HAVING No_count = 4)

/*INNER JOIN the two tables "Flights_cost" and "Flights_In_4Qs" to filter all the Flights that does not have Data in the all 4 Quarter
Then you GROUP BY the Quarter and ORDER BY Overall average fare cost in each quarters of every year*/

SELECT fc.Quarter,
	ROUND (AVG(fc.Fare), 2) Overall_Average_Fare
FROM Flights_cost fc 
JOIN Flights_In_4Qs fq
	ON fc.Year = fq.Year
	AND fc.City1 = fq.City1
	AND fc.City2 = fq.City2
GROUP BY 1
ORDER BY 2 DESC;
  • Considering only the flights that have data available on all 4 quarters of the year, which quarter has the highest overall average fare? lowest? Try breaking it down by year as well.
/*Find the cost of every flight.
Goal: Trying to extract the needed column from the main table "AirfareData" before filtering the Flights in every Quarter*/

WITH Flights_cost AS (
SELECT Year, 
	Quarter,
	City1,
	City2,
	fare
FROM AirfareData),


/*Find the flights that have data in all 4 Quarter*/

Flights_In_4Qs AS (
SELECT Year,
	City1,
	City2,
	COUNT(Quarter) No_count
FROM AirfareData
GROUP BY 1, 2, 3
HAVING No_count = 4)

/*INNER JOIN the two tables "Flights_cost" and "Flights_In_4Qs" to filter all the Flights that do not have Data in all 4 Quarter
Then you GROUP BY the Year, Quarter and ORDER BY Overall average fare cost in each quarter of every year*/

SELECT fc.Year,
	fc.Quarter,
	ROUND (AVG(fc.Fare), 2) Overall_Average_Fare
FROM Flights_cost fc 
JOIN Flights_In_4Qs fq
	ON fc.Year = fq.Year
	AND fc.City1 = fq.City1
	AND fc.City2 = fq.City2
GROUP BY 1, 2
ORDER BY 3 DESC;

NB: I am also learning SQL and decided to share my findings because I could find a well-analysed solution in the comment which would greatly help beginners so I decided to share mine. You can look over it and if you see any mistake in the Solution you can let me know. It would be much appreciated :clap:t5:.

1 Like

■■■■ @factoradic, When checking the link I get an error the file does not exist. Can you please update the link? Sorry if you updated in a forum in another location and I missed it.

Thanks

1 Like

That’s my take on the Advanced Challenge part 2:

WITH routes AS(
SELECT Year, city1, city2
FROM airfare_data
GROUP BY 1,2,3
HAVING COUNT(DISTINCT quarter) = 4
)
SELECT airfare_data.Year, airfare_data.quarter, ROUND(AVG(airfare_data.Fare),2) AS Average_Fare
FROM airfare_data
JOIN routes ON routes.city1=airfare_data.city1 AND routes.city2=airfare_data.city2 AND routes.Year = airfare_data.Year
GROUP BY 1,2
ORDER BY 3 DESC;

Hi everyone, I worked on the project and got some results which I would like to know if they match with yours.

Here my process:

ARRANGING THE DATA: I opened the CSV file where the name of the columns where: ‘field1’, ‘field2’…, so, I changed the name of the colums and the names on the first row (Year…). I did it one by one using:

ALTER TABLE airfare_data
RENAME ‘field1’ TO ‘Year’;

Note: I am not sure if I needed to do that or if there is another way to do it more efficiently, please let me know in the comments if you have the answer.


QUESTIONS:

  1. What range of years are represented in the data?
SELECT MIN(Year) FROM airfare_data;

Answer: 1996

SELECT MAX(Year) FROM airfare_data;

Answer: 2018

  1. What are the shortest and longest-distanced flights, and between which 2 cities are they?
SELECT city1, city2, CAST(nsmiles AS INTEGER)
FROM airfare_data
GROUP BY nsmiles
ORDER BY CAST(nsmiles AS INTEGER) ASC
LIMIT 1;

Answer:
Shortest: Los Angeles, CA (Metropolitan Area) - San Diego, CA - 109 nsmiles

SELECT city1, city2, CAST(nsmiles AS INTEGER)
FROM airfare_data
GROUP BY nsmiles
ORDER BY CAST(nsmiles AS INTEGER) DESC
LIMIT 1;

Answer:
Longest: Miami, FL (Metropolitan Area) - Seattle, WA - 2724

  1. How many distinct cities are represented in the data (regardless of whether it is the source or destination)?
SELECT city1 FROM airfare_data
	UNION
SELECT city2 FROM airfare_data;

Answer:
city1 + city2 = 164 distinct cities

SELECT COUNT(DISTINCT city2) FROM airfare_data;

Answer:
city1 = 139
city2 = 127

  1. Which airline appears most frequently as the carrier with the lowest fare (ie. carrier_low)?
SELECT carrier_lg, COUNT(carrier_lg)
FROM airfare_data
GROUP BY carrier_lg
ORDER BY COUNT(carrier_lg) DESC
LIMIT 1;

Answer:
WN - 23659

  1. How about the airline with the largest market share (ie. carrier_lg)?
SELECT carrier_low, COUNT(carrier_low)
FROM airfare_data
GROUP BY carrier_low
ORDER BY COUNT(carrier_low) DESC
LIMIT 1;

Answer:
WN - 29652

  1. How many instances are there where the carrier with the largest market share is not the carrier with the lowest fare?
SELECT COUNT(*)
FROM airfare_data
WHERE carrier_lg != carrier_low;

Answer:
59851

  1. What is the average difference in fare?
SELECT avg(fare_lg - fare_low)
FROM airfare_data;

Anwer:
32.887297984848


INTERMEDIATE CHALLENGE

  1. What is the percent change 90 in average fare from 2007 to 2017 by flight?
WITH A AS 
(SELECT avg(fare_lg - fare_low) as AverageFare1
	FROM airfare_data 
	WHERE Year = 2007),
	
B AS 
(SELECT avg(fare_lg - fare_low) as AverageFare2
	FROM airfare_data 
	WHERE Year = 2017)

SELECT ((AverageFare1 - AverageFare2) / AverageFare2 * 100) FROM A, B;

Answer:
-29.3009688879531

  1. How about from 1997 to 2017?
WITH A AS 
(SELECT avg(fare_lg - fare_low) as AverageFare1
	FROM airfare_data 
	WHERE Year = 1997),
	
B AS 
(SELECT avg(fare_lg - fare_low) as AverageFare2
	FROM airfare_data 
	WHERE Year = 2017)

SELECT ((AverageFare1 - AverageFare2) / AverageFare2 * 100) FROM A, B;

Answer:
-21.2960392834909

  1. How would you describe the overall trend in airfares from 1997 to 2017, as compared 2007 to 2017?
    The flights became cheaper with time

Thanks i understand my mistake…

Looking through previous comments I think I may have interpreted the Intermediate Challenge a bit differently than others -
What is the [percent change ] in average fare from 2007 to 2017 by flight? How about from 1997 to 2017?

I took the phrasing by flight to mean that the result should show the percent change in fares for each individual route; i.e.


WITH avgfares2007 AS(
SELECT 
	ROUND(AVG( fare), 1) AS 'AvgFare',
	city1 AS 'Departure',
	city2 AS 'Destination'
FROM airfare_data
WHERE year = '2007'
GROUP BY 2, 3
),

avgfares2017 AS (
SELECT 
	ROUND(AVG( fare), 1) AS 'AvgFare',
	city1 AS 'Departure',
	city2 AS 'Destination'
FROM airfare_data
WHERE year = '2017'
GROUP BY 2, 3
)

SELECT 
	ROUND((((avgfares2017.AvgFare - avgfares2007.AvgFare) / avgfares2007.AvgFare) * 100), 1) AS '% Change -07 to -17',
	avgfares2017.Departure,
	avgfares2017.Destination
FROM avgfares2007
JOIN avgfares2017 
			ON avgfares2017.Departure = avgfares2007.Departure
GROUP BY 2, 3;

This returns a table with 1097 rows and shows the % change for each route ( Albany > Atlanta, Albany > Charlotte, Albany > Chicago etc)

So, did I overkill it?