Solution / Lesson for Correlated Subqueries II, 6/7


#1

This particular lesson of codeacademy's isn't very good. Everything is being told and nothing is being taught, so I decided I would try to make a guide to help out other users like myself learn what is going on with SQL in this section. This is more of an approach of how to play with SQL in order to learn it. I encourage the user to play around to see things for themselves - the editor allows this and we should take advantage of that. It's long, but hopefully clearer.

TL;DR it doesn't actually sort by time, there is no time clause in the code. f.id < flights.id is more about the ordering in the new table rather than the value of the truth statement, and so the inner subquery won't work at all without the outer one. Subquery essentially acts as a for loop. Follow this guide, it takes like 5 minutes and you can see for yourself.

First, the basic SELECT clause.

SELECT * 
FROM flights
ORDER BY carrier;

The result you get should be all of the columns in the table, ordered by carrier.

  • Change SELECT * to SELECT (col_name, col_name) i.e.
    SELECT carrier, id, origin and now change the order of of columns to
    SELECT origin, id, carrier - the difference in order shows up in the result.
  • Change SELECT to SELECT COUNT(*) to see the total number of rows
  • Try SELECT COUNT(id) to retrieve the total of non-null rows with ids
  • You can change ORDER BY carrier to another column name, i.e. ORDER BY id
  • Add a WHERE clause under FROM flights: WHERE id < 300 or any number
  • Play with count, select, and where to see different results in the table.

Now I have something like this:

SELECT COUNT(*)
FROM flights
WHERE id < 200
ORDER BY id;

Now I want to figure out what's going on with flights f and f.id < flights.id so let's see how we can play with this "f" to see what's going on.

  • Change id < 200 to flights.id < 400 (change the number to verify a new result), run and you should see a new number count come up. Leave this alone.
  • Don't use f, pick a random letter or word, I choose "pickle" and change the line FROM flights to FROM flights pickle and run the code again.

You should receive an error that says "no such column flights.id" when flights.id clearly worked before. This means that in the line FROM flights f, the SQL is essentially renaming the table "flights" to "pickle"

  • Now try changing flights.id < 400 to pickle.id < 100 and run. It should work, returning the count of how many flights with an id below the number given.
  • Now change the clause to WHERE pickle.carrier = 'AA' to test.
  • Again if you have FROM flights pickle then WHERE flights.carrier = 'AA' wouldn't work, go ahead and try again to see. flights has been renamed.
  • Lastly, change the clause to WHERE pickle.id < flights.id and run. Then try WHERE pickle.carrier = flights.carrier and there will be an error both times saying that flights.id and flights.carrier do not exist.

This is because we renamed "flights" to pickle and now when the code is referencing "flights" it needs to come from the outer query, the next "FROM flights" in the code, so we need to go ahead and look at the outer query.

Now, the code on the left of the page works, and we've got only the code inside the subquery, but errors come up when it runs on its own. We know that what is in the subquery is essentially a count, a number. So let's look at the code in the higher structure with a random number to see what happens. Comment out with "/*" and try this:

SELECT carrier, id, 200 +1
AS flight_sequence_number
FROM flights
ORDER BY carrier;

Notice that on the left of the page it says carrier, id, (SELECT COUNT(*) .... ) +1. Notice that comma after "id" and know that the result in the subquery is a number when it's working, and also part of the selection, i.e. when our WHERE clause isn't pickles.id < flights.id or pickles.carrier = flights.carrier, so I went ahead and just put in a number and left the "+1". Now run the code and and you will see a table of carrier, id, and a flight sequence number of 201 for EVERY entry. This wasn't what we wanted, but we know this runs without errors.

Since the flights are being sequenced by time, let's add dep_time to the selection and to see the sorting by carrier, I'm going to add an extra WHERE clause:

SELECT carrier, id, dep_time, 200 +1
AS flight_sequence_number
FROM flights
WHERE flights.carrier = 'AA'
ORDER BY flight_sequence_number;

and you will see a result of all AA flights with the same sequence number. Now let's throw in our subquery from earlier, replacing the "200 +1" which gives us:

SELECT carrier, id, dep_date, dep_time,
		(SELECT COUNT (*)
		FROM flights pickle
		WHERE pickle.id < flights.id
		AND pickle.carrier=flights.carrier) +1
AS flight_sequence_number
FROM flights
WHERE flights.carrier = 'AA'
ORDER BY flight_sequence_number;

Now what you will see is a table of AA flights, ordered by ascending id number. I believe the instructions on the left said it wanted it to be ordered by time, but there is nothing in the code selecting that anything be ordered by time, instead it is where pickle.id < flights.id and pickle.carrier = flights.carrier. Plus if you look at the times in the table's results it's all over the place. This leads me to believe that what pickle.id < flights.id is saying, is to put the numbers in the new table, pickle, in ascending order. It's about order, not about the value of the truth statement.

Play Around!

  • change it to flights.id < pickle.id or just change the "<" to ">". Run it. id order switches.
  • change it to pickle.carrier = 'AA' and lower WHERE flights.carrier = 'DL' and you'll get a completely messed up flight sequence chart.
    *change it to flights.carrier = 'AA' and WHERE flights.carrier = 'DL' and again, you'll get a messed up sequence chart with the only value being '1'
  • Get rid of the "+1" and you will see the exact same chart, in the same order, except the sequence will start with "0" instead. That +1 isn't telling it to sequence, the +1 is just telling it what index to start at - that subquery essentially acts as a for loop.
  • Get rid of the second WHERE clause with a comment out line " --" in front.

My final code:

SELECT origin, id, dep_date, dep_time,
		(SELECT COUNT (*)
		FROM flights pickle
		WHERE pickle.id < flights.id
		AND pickle.carrier=flights.carrier) +1
AS flight_sequence_number
FROM flights
-- WHERE flights.carrier = 'DL'
ORDER BY carrier, flight_sequence_number;

And what I get is a list of flights, sequenced by their id number, ordered by carrier. Now, for the solution, just change "carrier" to "origin" and you have:

SELECT origin, id, dep_date, dep_time,
		(SELECT COUNT (*)
		FROM flights pickle
		WHERE pickle.id < flights.id
		AND pickle.origin=flights.origin) +1
AS flight_sequence_number
FROM flights
-- WHERE flights.carrier = 'DL'
ORDER BY origin, flight_sequence_number;

You can keep playing around and switching out variables to get a better sense of the subquery, and what works and what doesn't. Ultimately, I think because the subquery is a count, SQL knows to treat it like a sort of for loop, which is why it doesn't work when you just throw in a random number without all of the conditions of the subquery's WHERE clause.

Hope this helps someone. This exercise actually did help me figure it out as I was stuck for quite some time on this section.

:beers:


Correlated Subqueries II Logic
#2

Thanks so much - this really helped! One question - Is it correct to say that the 'ordering' part of the code

WHERE pickle.id < flights.id
AND pickle.carrier=flights.carrier

works because the SELECT COUNT (*) value will initially come out as '0', since the first 'pickle.id' value in the table will have a corresponding flights.id of equal value? And then each subsequent 'flights.id' will have exactly one more count of 'pickle.id' where that value is less than 'flights.id'


#3

I'm glad it helped!

I'm not entirely sure of your question, but I'll try my best:

So, if you run the SELECT COUNT(*) the value of the rows will be something like 499, it won't be zero, but everything in the subquery isn't just a count, it's a whole setup of this newly defined table "pickle" and how it's structured, I think SQL knows to treat this kind of tabular setup as a sort of for loop. SQL knows there's a table there, even though we only asked for the count. I'm not entirely sure myself as I am still learning while making this, but it isn't iterating pickle.id from 0 onwards, it's iterating flight_sequence_number from 0 onwards (and then the +1).

What I think is actually happening is that it knows to make a new table named pickle, and it knows to put in the first flight as the lowest id, since pickle.id < flights.id. I would read "<" as a command that says "put the least id in first, and then of what's left, put the next lowest." If it were only moving things that were less than flights.id then there would always be one flights.id value left, so it is definitely linked to the copying/new table operation happening in FROM flights pickle.

It's not matching the two id numbers and copying over, it's literally moving everything from flights to pickle, which is why flights.id doesn't exist when you run the subquery alone, but works when it's run inside another query because there is a "FROM flights" in the outer query, so the flights.id actually gets a value to work with - that's so super confusing, I know. The fact that there are two separate FROM commands is important here - I think one is telling it to rename FROM flights pickle, and the other is just a reference point, FROM flights. I hope I didn't confuse you too much, I'm realizing that this is how it works as I'm writing.

Man, this lesson really needs a lot of work!


#4

It helped a little, but I still don`t understand.
I have tried "FROM flights AS f" and it worked. So the table was really given an alias.
BUT if table has now other name, why
WHERE f.id < flights.id
AND f.origin=flights.origin
still work?
f.id should now be valid and flights.id should be invalid.
Or no?


#5

That's correct.
"f" is just an alias, which is why I made my own to reiterate that point, and that f is in no way related to the word 'flights'.

If you run just the subquery alone, flights.id will be invalid, because you've changed the name. That's part of the exercise, and you get to see that if you run it alone. Nesting it in another query however, changes that.

flights.id will work in the correlated query with the structure above it AND below it. The "Select" above the subquery in question is selection FROM flights, as written below our subquery, which is really just a construction of our new table, f, so the flights.id in the subquery is referencing the 'from' that comes after the subquery.

It's running all at once, altogether, linking back and forth from the outer structure to the inner, which is why it is a correlated subquery.


#6