Off Platform Project: Best of Baseball Awards

Hello All,

I think I may have posted or added to another post before, but this is my first creating one.
I have been working through the Off Platform Project in the ‘Design Databases with PostgreSQL’ skill path.
Looking at a lot of other posts similar asking for help, I figured I would post what I have as I sat for almost 3 hours trying to crack just the first question.
The hardest part here was undoubtedly the fact that multiple team names are linked with the same id. Example: Philadelphia Phillies franchise. They have had 4 other names and because they were a continuous franchise, associated with one team id.
By using, lets say basic SQL (and I am no expert by any means) you can find the answer but have multiple team names listed, as seen in this screenshot.
question 1 screen shot

After i got this far. I added the row_number and a partition to give me a rank. I guess I could have used rank here, but this seemed to work for me.
I was then left with old team names like the washington senators in 2016, instead of the washington nationals.
I used this code to ‘solve’ the problem. except for that naming issue.
Also, I imagine if you are not a baseball fan, this may have been where you stopped, maybe not knowing the names were wrong.

with avgweight as
(
select t.name, round(avg(p.weight),2) as avg_weight, b.yearid from batting b
join people p
on b.playerid = p.playerid
join teams t
on b.teamid = t.teamid
group by 1,3
order by 3 desc
)
select * from (select *,
row_number () over (
partition by yearid
order by avg_weight desc) as rank from avgweight)temp
where rank = 1
group by 3, 1, 2, 4
order by 4, 3 desc

question 1 so close

After getting over the frustation, I combined the batting and teams tables to see if i can find something. that led me to this.

select batting.yearid, batting.teamid, teams.id, teams.teamid, teams.name from batting
join teams
on batting.teamid = teams.teamid
order by 5 desc

and right at the top of the query, there was the answer with the Worcester Ruby Legs (great name by the way). 3 years, and 3 different id numbers. not exclusive to the year by the way. Each year, 1880, 1881 and 1882 have 3 id numbers: 86, 94 and 108.
That is where I realized I joined the wrong columns. By joining the teamid columns, I locked it into a singular name.
And thus I was stuck with the first name based on the teamid column instead of the fluidity of the singular id column on the teams table.
After adjusting my join statement to look like this…

with avgweight as
(
select t.name, round(avg(p.weight),2) as avg_weight, b.yearid from batting b
join people p
on b.playerid = p.playerid
join teams t
on b.team_id = t.id
group by 1,3
order by 3 desc
)

select * from (select *,
row_number () over (
partition by yearid
order by avg_weight desc) as rank from avgweight)temp
where rank = 1
group by 3, 1, 2, 4
order by 4, 3 desc

I got this query result, and thus completing the first question… 4 more questions to go and one where i get to make it up.
Here we go!!!

Question 1 solution

Question 2: Shortest Slugger.
After going through the first question and working through those challenges, this question was just simply changing a column, and flipping to ascending order compared to descending order.
Still started with a temporary table, calling it shortheight. the second expression was changed from rounded avg weight, to height, and the column title changed to avg height. With the code below, you get the following output

with shortheight as
(
select t.name, round(avg(p.height),2) as avg_height, b.yearid from batting b
join people p
on b.playerid = p.playerid
join teams t
on b.team_id = t.id
group by 1,3
order by 3 desc
)

select * from shortheight

Question 2 temp table query

This gives you the average height of every team per year in descending order.
Using the next bit of code, you then single out the lowest avg height per team per year.

select * from (select *,
row_number () over (
partition by yearid
order by avg_height asc) as rank from shortheight)temp
where rank = 1
group by 3, 1, 2, 4
order by 4, 3 desc

As with the previous question, this added a row number based on year, in ascending order. Asked it to rank only the top result, or in this case the shortest average height per team, and ordered it in descending order based on year.
And it gives you the following output with 149 results.

Question 2 temp table query

On to the next question. Happy coding.

Now we are rolling.
Question 3 is the biggest spender per year award. As a baseball fan, without seeing the number, you know that the New York Yankees and LA Dodgers will be the answer to every year from 2000 to the most recent season.
This proved to be true. Those two teams spent the most since 1999 up to 2016 (end of the data). And the Yankees dominated that one, the LA Dodgers only topped the Yankees in two of those years.
On to the code.
Continuing to use temporary tables, this one scrapped the 3 table and went back to a simple two table join.
But it wasn’t simple. Like with the first question about the heaviest team per year, you needed to make sure you joined the right columns. you had id and teamid from the teams table and on the salaries table you could choose from id, teamid and team_id.

Temporary table with three columns, linking Salaries and Teams table by team_id and id columns respectively.

with bigmoney as
(
SELECT s.yearid, t.name, SUM(s.salary) AS Total_Dollars
FROM salaries s
JOIN teams t on s.team_id = t.id
GROUP BY 1,2
ORDER BY 3 DESC
)

with the temp table set up as ‘bigmoney’

select * from (select *,
row_number () over (
partition by yearid
order by Total_Dollars desc) as rank from bigmoney)temp
where rank = 1
order by 1 desc

using the same partition and adjusting the order by request, gives you this output.

Question 3 final output

Two to go!

Question 4: Cost Per Win in 2010.

Now that we are rolling along, finishing question 4 is simple but there are two key items needed.

  1. a where clause in my temporary table. Since we are only looking at a single year, the where clause of yearid = 2010 is needed
    and 2. I used a cast function to turn the salary column from double precision to decimal.
with CPW as
(
SELECT s.yearid, t.name, cast(SUM(s.salary) as decimal) AS Total_Salary, t.w
FROM salaries s
JOIN teams t ON s.team_id = t.id
where s.yearid = 2010
	GROUP BY 1,2, 4
ORDER BY 3 DESC
	)

The casting to decimal allowed me to shrink the cost per win column to per million, which i found to be easier to read.
Rather than selecting select *, i went with using all the other columns and dropping the year as it is already separated by the where clause above.
This left me with the team name, total salary in millions, wins, and then the calculation of salary/wins to give you cost per year.

select name, round(Total_Salary/1000000,2), w, round((total_salary/w)/1000000,2) as "cost per win in millions" from CPW
order by 3 desc

I sorted by wins because I liked seeing Tampa Bay with 95 wins at a cost of $750,000 per win, compared to the New York Yankees having one less win (95) but at a cost of $2.17 million per win.

Cost Per Win

Last question to go. Priciest pitcher award. Get to see Kevin Brown to the Yankees on the list. There’s a name from the past.

Question 5: Priciest Pitcher Award
Given all the practice with CTE and using the Row number and Rank functions, this should have been easier.
As the hint in the question says, you need to join 3 columns on two tables and then join a third table.
That threw me off. I was able to join two columns with the two tables (Salaries and Pitching) and I tried to link that third hint column with the third table (People). That got me a result that I liked, but there was no way Zack Greinke cost the most per start when he took the rubber for 26 games for Arizona. The answer per year is essentially a large salary with 10-13 games started.
And to follow up on my Kevin Brown reference, he does appear in the answer twice, but not with they Yankees as I suggested. It was with Florida, before he signed for New York. Around the mid 00’s there was this ‘plague’ when a SP signed a big money free agency deal, then got injured in spring training and never played their first year. It’s a thing. Look it up.
Now on to the code.
Again, started with a CTE and threw in a cast to convert the salary column to decimal. To save some steps later, made the cost per game started into the CTE.
Then joined all the table with specific columns like so…

with PPitcher as
(
select s.yearid, p.namefirst, p.namelast, pi.gs, cast(s.salary as decimal), round(cast(s.salary as decimal)/pi.gs/1000000,2) as "cpg in millions" from pitching pi
join salaries s
on pi.teamid = s.teamid 
	and pi.yearid = s.yearid
	and pi.playerid = s.playerid
join people p
on s.playerid = p.playerid
where pi.gs > 9
group by 1, 2,3,4,5
	)

Another note about the above code, you needed the where clause with a minimum amount of games started. I went with greater than 9 but you could use greater than or equal to 10 as well.

From there, it was standard structure based on my previous answers, use the window function and row_number to make a rank, partition it over the correct column, in this case by yearid and order by the Cost per Game in Millions to give you the answer below.

select * from (select *, 
row_number () over (
	partition by yearid
order by "cpg in millions" desc ) as rank from ppitcher)temp	
where rank = 1
order by 1 desc, 6 desc, 7 desc

And here is the output. Check out 2001 and 2002 for the Kevin Brown reference.
And as you can see, most pitchers on this list started in 10-13 games, with a fiew putting in almost a full season like Johan Santana, and Carlos Zambrano in the 2010’s.

Happy Coding!

I’ve been struggling with this project for such a long time! Still on the first task! Your solution seems to work (finally something does!).
Thank you so much for sharing!
Now I’ll just need to understand what was that you’ve done here!

I got really wordy trying to explain the whole thought process.
The main trap I think everyone gets into is joining the wrong table and that is why you have incorrect team names.
The other thing for me, using rank to help sort.
Once you work through this with the temp tables. Any thing you need me to explain, I’d be happy to expand

Thank you so much for answering!
So my initial code looked like this:

WITH previous_results AS (
SELECT
teams.name,
batting.yearid,
ROUND(AVG(people.weight), 0) AS avg
FROM teams
JOIN batting
ON teams.teamid = batting.teamid
JOIN people
ON batting.playerid = people.playerid
GROUP BY 1, 2
ORDER BY 2 DESC
)
SELECT
previous_results.yearid,
MAX(previous_results.avg)
FROM previous_results
GROUP BY 1
ORDER BY 1 DESC;

And it works fine, but doesn’t display the team names! Haha. As soon as I add team names to the mix, I get the list of all teams by year. This is the first time I’ve heard about the partition and rank options (thank you again here!).
But I’m still a little bit confused. Why do you need the row_number?
Could you explain these lines of your code please?
(select *,
row_number () over (
partition by yearid
order by avg_weight desc) as rank from avgweight)temp

I get the (partition by yearid order by avg_weight desc) part, but not the context.

Thanks a lot!!!

Sorry about the delay. I saw the response and just a life delay.

When I just run the temp table i called avgweight, you get the team name, avgweight and the year. and just by itself it is filtered by year descending.
great start
but as you can see the avg weight isn’t in descending order and it lists the cincinnati redlegs and the cincinnati reds, below that you have the cleveland blues, bronchos, indians and naps.
This is where i was stuck and couldn’t remove these old team names.
AS a baseball fan I know it is the cleveland indians and the cincinnati reds. so I could not figure out how to remove the cincinnati redlegs or the other 3 team names that are not the cleveland indians.

I have seen answers that had a wrong team name in earlier years. their query was correct but just incorrect team name. I think it was 2016. Where the washington nationals is the heaviest team, but it would display the washington senators instead of the current team name.
I believe that is because based on teamid, senators is first so it would display the first name under the teamid.
Now that i think about it maybe you could descend that column to get nationals… anywho.

This is where the rank came in partitioned by year. By adding that column in, i was able to get the correct team name for the team id that is in the table.

the rest of that, where rank = 1, group by and order by is self explanatory now.

but the reason for the rank was because i could not get the correct team name. that is basically it.
a long way to go around, but i just couldn’t figure it out without that row and partition.

if you have another way that worked, please share.
As i have learned, there are many different ways to query and get the same result.

Thank you so much for taking time to explain. I I’ve found one more way to solve my issue but it doesn’t solve the team names mix up in any way, so I guess your solution is more accurate!

Just in case sharing it here, but once again, I don’t think it solves the teams’ names issue.

WITH avg_result AS (
SELECT
teams.name AS team,
batting.yearid AS year,
ROUND(AVG(people.weight), 0) AS average
FROM people, teams, batting
WHERE people.playerid = batting.playerid AND teams.id = batting.team_id
GROUP BY 1, 2
),
max_avg AS (
SELECT
year,
MAX(average) AS max_average
FROM avg_result
GROUP BY year
)
SELECT
avg_result.year,
avg_result.team,
max_avg.max_average
FROM avg_result
JOIN max_avg ON avg_result.year = max_avg.year AND avg_result.average = max_avg.max_average
ORDER BY avg_result.year DESC;

This is fantastic. An alternate way of the same solution.
There is one small change that is needed so you do not produce multiples on the same year.

If you want to spend a little time on it, I am sure you will get it.
I took your query, made the change, and boom. accurate.

I’ll post on the weekend just to keep the suspense alive. Great work and thanks for sharing. I wasn’t in the mind to use a second CTE yet we used it in training. and i feel i used it in another answer…

Thanks again!

1 Like

in your with statement with avg_result as…

if you change round(avg(people.weight), 0)… to (people.weight), 2)…

you will get the answer in full. no duplicates for the same year.

great work and thanks again for the education.