# FAQ: Subqueries - Correlated Subqueries I

This community-built FAQ covers the “Correlated Subqueries I” exercise from the lesson “Subqueries”.

## FAQs on the exercise Correlated Subqueries I

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.


I don’t understand the importance of
WHERE carrier = f.carrier

The average is calculated per carrier, and each flight’s “distance” is then compared with the average for its own “carrier”.

This was confusing for me too.

I think that because the query reads the same table, “flights”, twice, SQL needs a way to tell the two references apart. The outer query is aliased as f, so “f.carrier” is the carrier of the flight currently being checked, while the unqualified “carrier” inside the subquery belongs to the rows being averaged. WHERE carrier = f.carrier therefore restricts the average (AVG) to flights from the same carrier as the current flight. I think SQL needs that distinction before the < or > operator in the query can give us the ids of flights that are below or above the average for their own carrier.

This is basically what @smilexdrus has stated and what I think I understood from it. I’m just trying to be more explanatory about it.
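To make the explanation above concrete, here is a minimal sketch that runs the exercise query through Python's built-in sqlite3 module. The six rows are made up for illustration (they are not the course data), and the table only has the three columns the query needs.

```python
import sqlite3

# Hypothetical sample data; the real flights table has more rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

# Per-carrier averages for reference: AA is 2000, FL is about 583.33.
carrier_avgs = conn.execute(
    "SELECT carrier, AVG(distance) FROM flights GROUP BY carrier ORDER BY carrier"
).fetchall()

# The exercise query: f.carrier is the carrier of the outer row being tested,
# so each flight is compared with the average of its own carrier.
above_avg_ids = [row[0] for row in conn.execute("""
    SELECT id
    FROM flights AS f
    WHERE distance > (
        SELECT AVG(distance)
        FROM flights
        WHERE carrier = f.carrier)
    ORDER BY id
""")]

print(carrier_avgs)
print(above_avg_ids)  # [3, 5, 6]
```

Flight 5 (590) is returned because it is above the FL average, even though it is far below the AA average.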

Why is there the following?
f.origin = flights.origin

Aren’t they both referring to the same chart and data?

My understanding is that this will calculate the average distance for each carrier every time it comes up in the flight list.
i.e. the average for each carrier is calculated multiple times.

Is that true?
If so, is this a wasteful/slow way of doing it?
If so, how would you go about doing it otherwise? Would you create a table of carrier names and averages then look up the value in that? Would that actually speed things up?

Alex

I don’t understand this:
SELECT id
FROM flights AS f
Why “as f”?
I tried omitting the “as f” part and used “WHERE carrier = flights.carrier” instead of “WHERE carrier = f.carrier”. Does it mean the same thing?

The table alias (f) is used to distinguish between the two times the flights table is queried. Simply put, the query refers to the same table twice:

Once to select IDs where the distance is greater than…
Once to select the average distance.

The two are combined to create the result.

The query is confusing because it mixes two styles: one reference to the table is aliased and the other is not. Personally I would write the query as follows:

SELECT a.id
FROM flights AS a
WHERE a.distance > (
SELECT AVG(b.distance)
FROM flights AS b
WHERE b.carrier = a.carrier);

This makes it much clearer what is actually being done.

Coming back to my earlier explanation:

Once to select IDs where the distance is greater than… <- this comes from aliased table a.
Once to select the average distance. <- this comes from aliased table b.
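On the earlier question of whether dropping the alias and writing WHERE carrier = flights.carrier means the same thing: as far as I can tell it does not. Inside the subquery, flights.carrier resolves to the subquery's own flights table, so the condition is always true and every flight gets compared with the overall average. Below is a small sketch using Python's sqlite3 and a made-up six-row table (the data and the `ids` helper are hypothetical) that shows the difference.

```python
import sqlite3

# Hypothetical sample data, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# With the alias: the subquery is tied to the outer row through f.carrier.
with_alias = ids("""
    SELECT id FROM flights AS f
    WHERE distance > (
        SELECT AVG(distance) FROM flights WHERE carrier = f.carrier)
    ORDER BY id
""")

# Without the alias: flights.carrier refers to the inner table itself,
# so the condition compares each inner row with itself.
without_alias = ids("""
    SELECT id FROM flights
    WHERE distance > (
        SELECT AVG(distance) FROM flights WHERE carrier = flights.carrier)
    ORDER BY id
""")

# Plain comparison with the overall average, for reference.
overall = ids("""
    SELECT id FROM flights
    WHERE distance > (SELECT AVG(distance) FROM flights)
    ORDER BY id
""")

print(with_alias)     # [3, 5, 6]
print(without_alias)  # [2, 3]
print(overall)        # [2, 3]
```

The un-aliased version runs without error, which is what makes it easy to miss: it just quietly answers a different question.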

Reading the exercise instructions:
“Find the id of the flights whose distance is below average for their carrier”,
I thought I had to calculate the average for every distinct carrier and then compare that result with the distance of each flight (id). To me, “for their carrier” meant running a GROUP BY carrier in the subquery.

So, first I wrote this query separately:
SELECT carrier, AVG(distance)
FROM flights
GROUP BY carrier
to have a clear picture of the average distances for every carrier

and then I wrote the code I thought as the solution:
SELECT id, carrier, distance
FROM flights
WHERE distance < (
SELECT AVG(distance)
FROM flights
GROUP BY carrier)
ORDER BY carrier;

Apart from id, I also selected the carrier and distance columns to make the results easier to analyze, and I ordered the outer query by carrier so that I could compare the results with my first “experimental” query.

Unfortunately, my solution turned out to be wrong because 1) it differed from the Codecademy solution code and 2) some results were not logical.
For example, for the FL carrier the AVG(distance) is 583.16.
My solution included flight id 12038, whose distance is 590. This shouldn’t be the case: I was looking for ids whose distance is lower than average, not higher!

In the official solution code (to which I also added the carrier and distance columns and an ORDER BY carrier in the outer query), this flight id is, as expected, not included.

I wonder what is wrong with my “GROUP BY carrier” approach in the subquery. Why are some of the calculated results wrong?

Finally, what does the line “WHERE carrier = f.carrier” in the official solution code actually contribute? Which calculations are made, and in what order, when the code is executed so that the results come out as expected?
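One likely explanation for the GROUP BY result, sketched with Python's sqlite3 and made-up data: a subquery used with < must produce a single value, but the GROUP BY version produces one row per carrier. SQLite silently uses only the first of those rows (some other databases raise an error instead), so every flight ends up compared with one carrier's average. The table contents and the `ids` helper below are hypothetical, and which carrier's row comes first is not something to rely on.

```python
import sqlite3

# Hypothetical sample data, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# The subquery on its own returns one average per carrier (two rows here).
group_rows = conn.execute(
    "SELECT AVG(distance) FROM flights GROUP BY carrier"
).fetchall()

# GROUP BY attempt: only the first row of the subquery is used, so all
# flights are compared with a single carrier's average.
grouped = ids("""
    SELECT id FROM flights
    WHERE distance < (
        SELECT AVG(distance) FROM flights GROUP BY carrier)
    ORDER BY id
""")

# Correlated version: each flight is compared with its own carrier's average.
correlated = ids("""
    SELECT id FROM flights AS f
    WHERE distance < (
        SELECT AVG(distance) FROM flights WHERE carrier = f.carrier)
    ORDER BY id
""")

print(len(group_rows))  # 2
print(grouped)          # differs from the correlated result
print(correlated)       # [1, 4]
```

If the AA average (2000) is the row that gets used, the FL flight with distance 590 passes the < test even though it is above the FL average, which matches the kind of illogical result described above.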

Let’s go one by one.

Let’s take this statement:

1. select avg(distance) from flights;
-- will give you the average distance over the whole table.

Now let’s check this statement:

2. select avg(distance) from flights where carrier = 'AA';
-- will give the result specific to carrier AA.

Note: you can see each carrier’s total and average distance by running the query below, just to get a picture:

3. select carrier, sum(distance), avg(distance) from flights group by carrier;

OK, now that you have the picture,

let’s look at the solution provided.

SELECT id
FROM flights AS f
WHERE distance > (
SELECT AVG(distance)
FROM flights
WHERE carrier = f.carrier);

Now, can you relate this query to the 2nd one I mentioned?
Replace f.carrier with some value and think about it.
For each carrier, the inner query gets the avg(distance), e.g. for AA.
The outer query then goes through each id and compares its distance with that average, e.g. id 18341, whose distance is 1593.
Its carrier is AA, so what is the average distance for that carrier? That is what the inner query returns.
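The "replace f.carrier with some value" idea can be spelled out as an explicit loop. This is only a conceptual sketch in Python's sqlite3 with made-up rows: it mimics what the correlated query means, not necessarily how a database engine actually executes it.

```python
import sqlite3

# Hypothetical sample data, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

# Outer loop: every flight, one row at a time.
rows = conn.execute(
    "SELECT id, carrier, distance FROM flights ORDER BY id"
).fetchall()

manual_ids = []
for flight_id, carrier, distance in rows:
    # Inner query with f.carrier replaced by this row's carrier.
    (carrier_avg,) = conn.execute(
        "SELECT AVG(distance) FROM flights WHERE carrier = ?", (carrier,)
    ).fetchone()
    if distance > carrier_avg:
        manual_ids.append(flight_id)

# The same logic as a single correlated query.
correlated_ids = [row[0] for row in conn.execute("""
    SELECT id FROM flights AS f
    WHERE distance > (
        SELECT AVG(distance) FROM flights WHERE carrier = f.carrier)
    ORDER BY id
""")]

print(manual_ids)      # [3, 5, 6]
print(correlated_ids)  # [3, 5, 6]
```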

Hoping someone can help me understand how the solution works. As I understand it, the subquery is executed first which creates a dataset that is then used in the outer query. This does not seem to be the case here as the aliased carrier field is used within the subquery. I am used to aliasing the subquery itself and then doing the comparison in the outer query. Is this a difference between a correlated and non-correlated subquery? Thank you for any information!
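One way to see the difference being asked about here: a non-correlated subquery can be run entirely on its own, while a correlated one cannot, because it refers to a column of the outer query. The sketch below uses Python's sqlite3 and a made-up table to try both; the variable names are only for illustration.

```python
import sqlite3

# Hypothetical sample data, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

non_correlated_inner = "SELECT AVG(distance) FROM flights"
correlated_inner = "SELECT AVG(distance) FROM flights WHERE carrier = f.carrier"

# A non-correlated subquery is a complete query: it runs once, by itself.
overall_avg = conn.execute(non_correlated_inner).fetchone()[0]

# A correlated subquery depends on the outer row, so on its own it fails:
# there is no table or alias f for f.carrier to refer to.
try:
    conn.execute(correlated_inner)
    standalone_error = None
except sqlite3.OperationalError as exc:
    standalone_error = str(exc)

print(overall_avg)
print(standalone_error)
```

So the "run the subquery first, then use its result" picture fits the non-correlated case; in the correlated case the subquery's result depends on which outer row is being looked at.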

I had the same question as @objectace76309. It seems like this runs in polynomial time when it feels like it could be linear. What would be the best way to write this query so that it runs faster?
In general, are there any good resources people would recommend for SQL beginners to learn to write leaner queries and design more efficient databases, to avoid stuff like this?

I have the same questions as hamelton4242 and @objectace76309. Creating a temporary table with the average distance grouped by carrier, and then comparing each flight’s distance against its carrier’s average, is how I would imagine doing it.

```
WITH avgdist AS (
  SELECT carrier, AVG(distance) AS average_distance
  FROM flights
  GROUP BY carrier
)
SELECT flights.id, flights.carrier
FROM flights
JOIN avgdist ON flights.carrier = avgdist.carrier
WHERE flights.distance < avgdist.average_distance
ORDER BY flights.id;
```

I know this is a lot longer but if the WITH clause only has to be run once and then called, is it more efficient?
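Whether the WITH version is actually faster depends on the database and its query planner, so it is worth measuring on real data. What can be checked easily is that both forms return the same flights. Here is a small sketch with Python's sqlite3 and made-up rows (the data and the `ids` helper are hypothetical):

```python
import sqlite3

# Hypothetical sample data, not the course data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(1, "AA", 1000), (2, "AA", 2000), (3, "AA", 3000),
     (4, "FL", 500), (5, "FL", 590), (6, "FL", 660)],
)

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# WITH + JOIN: the per-carrier averages are written as one grouped query.
cte_ids = ids("""
    WITH avgdist AS (
        SELECT carrier, AVG(distance) AS average_distance
        FROM flights
        GROUP BY carrier
    )
    SELECT flights.id
    FROM flights
    JOIN avgdist ON flights.carrier = avgdist.carrier
    WHERE flights.distance < avgdist.average_distance
    ORDER BY flights.id
""")

# Correlated subquery, as in the exercise (with < for "below average").
correlated_ids = ids("""
    SELECT id FROM flights AS f
    WHERE distance < (
        SELECT AVG(distance) FROM flights WHERE carrier = f.carrier)
    ORDER BY id
""")

print(cte_ids)         # [1, 4]
print(correlated_ids)  # [1, 4]
```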

I found an SQL optimization video online that describes your solution as the best one, for the reason you mention.
They explicitly recommend not using subqueries, for performance reasons.

It’s not mentioned in the course.

But what I think is worth mentioning is that the difference between non-correlated and correlated subqueries lies in the subquery’s “WHERE” clause: in a correlated subquery it refers to a column from the outer query (here f.carrier), while in a non-correlated subquery it does not.