Capstone Project - Analyze Data With R - Analyzing Effect of Outlier Performances on NHL Player Notoriety


title: “R Capstone Project”
output: html_document
date: “2023-01-11”

ANALYZING EFFECT OF OUTLIER PERFORMANCES ON NHL PLAYER NOTORIETY

Introduction

This is my capstone project for the Analyze Data with R Skill Path course on CodeCademy. The assignment for this project is to investigate a topic of my choosing using Google Trends data. I have elected to analyze the connection between goals scored in the NHL and the Google Trends hits for the players that score them. More specifically, I want to see if players who have a significant outlier season (One season where they score many more goals than they would normally score in their career) have a positive effect on Google Trends hits later in their career. The thesis here is that a player is raised to prominence by a season of unusual success, and then enjoys greater notoriety in the seasons that follow. This notoriety would remain regardless of whether or not the player also enjoyed greater goal production in later seasons.

Data Preparation

For this project, in addition to the general use packages of dplyr, tidyr, and ggplot2, I’ll be relying on fastRhockey to bring in player data, and gtrendsR to bring in Google Trends data.

# Import Libraries
library(dplyr)
library(tidyr)
library(fastRhockey)
library(gtrendsR)
library(ggplot2)

First, I need to set up some initial data frames in order to facilitate later code.

  • TeamIDs contains a list of the Team IDs for all NHL franchises in the fastRhockey package.

  • PlayerList will contain a list of all Players and their unique player IDs in the fastRhockey package.

  • PlayerSummary will contain the initial data evaluation that will allow us to find our outlier players.

# Set up data frames
TeamIDs <- list(1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,28,29,30,52,53,54,55)
PlayerList <- data.frame(matrix(ncol = 2, nrow = 0))
PlayerSummary <- data.frame(matrix(ncol = 5, nrow = 0))
colnames(PlayerSummary) <- c('ID','Name','Average','Max','Delta')

Here I use a for loop in order to collect a list of all the players on NHL rosters for the current season. I eliminate all goalies from this list, since they have no goals data, and add just the player’s names and their ID to the PlayerList data frame.

# For loop to build player id list
for(
  i in TeamIDs
) {
  TeamRoster <- try(nhl_teams_roster(team_id = i, season = 20222023))
  TeamRoster <- select(TeamRoster,player_full_name,player_id,position_code)
  TeamRoster <- subset(TeamRoster,TeamRoster$position_code != 'G')
  TeamRoster <- TeamRoster[c(1,2)]
  PlayerList <- rbind(PlayerList,TeamRoster)
}
PlayerList <- unique(PlayerList)

Finding the Outlier Players

In order to find the players we should investigate for our analysis, we need to collect data on NHL players. The players who will be best suited for our analysis will have a single season of goals which is much higher than their average goal scoring per season rate.

To find these players, I’ve used a for loop which performs this analysis on all of the players we identified in PlayerList.

  1. First, we make sure that the function will actually produce a data frame once it is pulled for a player. There are a number of players who lack data in this package, and this allows us to avoid code breaking errors.

  2. The player in question has all of their data pulled by the nhl_player_stats function from fastRhockey. We filter out any seasons which were not played in the NHL, and then group these seasons by season and summarize their goals for those seasons. This allows us to combine single season performance for players who were traded during the year, and played for multiple teams.

  3. In order to properly perform our analysis, a few adjustments need to be made to the numbers. First, the current season should be removed, since it is incomplete. Next, the 12-13 season, the 19-20 season, and the 20-21 seasons were all incomplete and played with fewer than the standard 82 games, due to either lockout or the COVID-19 pandemic. The goal values for these seasons are increased to their projections of if the league had played a full 82 game season. It should be noted that since the goal counts include the postseason goals scored, this mathematical adjustment is not perfect, and is more reliable for players who had fewer playoff games in those years.

  4. The highest goal value among all their seasons, their average number of goals per season, and the difference between them are calculated for all players, and added to the PlayerSummary data frame. This last value, known as Delta, is converted to a number, and then PlayerSummary is sorted by it.

# For loop to gather goals stats
for(
  i in 1:nrow(PlayerList)
) {
  if(
    is.data.frame(try(nhl_player_stats(PlayerList[i,2]))) == TRUE
  ) {
    Player <- nhl_player_stats(PlayerList[i,2])
    Player <- Player[Player$league_name == "National Hockey League", ]
    Player <- Player %>%
      group_by(season) %>%
      summarise(sum(stat_goals)) %>%
      na.omit()
    Player <- subset(Player,Player$season != '20222023')
    colnames(Player) <- c('Season','Goals')
    Player$Goals[Player$Season == 20122013] <- (Player$Goals[Player$Season == 20122013])*(82/48)
    Player$Goals[Player$Season == 20192020] <- (Player$Goals[Player$Season == 20192020])*(82/68)
    Player$Goals[Player$Season == 20202021] <- (Player$Goals[Player$Season == 20202021])*(82/56)
    PlayerSummary[nrow(PlayerSummary) + 1,] <- c(PlayerList[[i,2]],PlayerList[[i,1]],mean(Player$Goals),max(Player$Goals),max(Player$Goals)-mean(Player$Goals))
  }
}
PlayerSummary$Delta <- as.numeric(as.character(PlayerSummary$Delta))
PlayerSummary <- arrange(PlayerSummary, -Delta)
print(PlayerSummary)

It looks like we’ve got our outlier players! A number of these look very promising, but we need to start by eliminating Tage Thompson and Troy Terry. Both of these players are only in their sixth year in the league, and thus will not have enough data to make a meaningful judgment on this question. With those two out of the way, let’s take a look at the top five players here - Chris Kreider, William Karlsson, Mika Zibanejad, Eric Staal, and Jeff Carter.

To dig into these players, I built a function which would perform the same data pulling and cleanup as the last step, but would give a data frame showing the player’s goal history instead of aggregating the data. This allows us to see when in a player’s career their outlier year occurred.

#Function to pull out goals per season for players
PlayerStatSetup <- function(PlayerName,PlayerID) {
  PlayerName <- nhl_player_stats(PlayerID)
  PlayerName <- PlayerName[PlayerName$league_name == "National Hockey League", ]
  PlayerName <- PlayerName %>%
    group_by(season) %>%
    summarise(sum(stat_goals))
  Remove <- c("20222023",NA)
  PlayerName <- PlayerName[! PlayerName$season %in% Remove, ]
  colnames(PlayerName) <- c('Season','Goals')
  PlayerName$Goals[PlayerName$Season == 20122013] <- (PlayerName$Goals[PlayerName$Season == 20122013])*(82/48)
  PlayerName$Goals[PlayerName$Season == 20192020] <- (PlayerName$Goals[PlayerName$Season == 20192020])*(82/68)
  PlayerName$Goals[PlayerName$Season == 20202021] <- (PlayerName$Goals[PlayerName$Season == 20202021])*(82/56)
  return(PlayerName)
}

#Analyzing top players goal histories
Kreider <- PlayerStatSetup(Kreider,8475184)
Karlsson <- PlayerStatSetup(Karlsson,8476448)
Zibanejad <- PlayerStatSetup(Zibanejad,8476459)
Staal <- PlayerStatSetup(Staal,8470595)
Carter <- PlayerStatSetup(Carter,8470604)
print(Kreider)
print(Karlsson)
print(Zibanejad)
print(Staal)
print(Carter)
  • Chris Kreider is a fantastic example of a player with an outlier season. He had been a good, but not exceptional player until last season, when he doubled his previous best season with 62 goals! Unfortunately, since Kreider’s outlier season is his most recent one, that eliminates him as a viable example for our research.

  • William Karlsson is another great example! Having floundered in Anaheim and Columbus before being grabbed by Vegas in the 2017 Expansion Draft, Karlsson exploded with 50 goals in his breakout season. He did enjoy increased goal production compared to before his outlier season (Though not to a significant degree), so though Karlsson is a viable candidate for analysis this should be taken into account.

  • Mika Zibanejad is another fantastic candidate for our analysis! He broke out in the COVID-19 shortened 19-20 season, and was projected to score more than 50 goals over 82 games! Since he only played in three playoff games during that season as well, Zibanejad’s numbers are not significantly altered by the 82 game adjustment.

  • Eric Staal is a very strange member of this list. His outlier season came in his second season, and despite a 50 goal campaign in his fifth season, never reached those initial heights again. Unfortunately, since his outlier year occurred during the very start of his career, and any Google Trends data from before that point does not exist (Due to Google Trends starting in 2004), Staal is not a viable candidate for our research.

  • Jeff Carter rounds us out as another good candidate. Carter’s best season came in the lockout shortened 2012-2013, where he was projected to put up 55 goals. Unfortunately for our analysis, Carter played 18 games with a Conference Finalist Los Angeles Kings, meaning that the six goals he scored in the postseason received a four goal bump in his final numbers. This is not a massive error, but one worth noting.

With this in mind, our best candidates appear to be William Karlsson, Mika Zibanejad, and Jeff Carter.

Pulling Google Trends Data

Initially, this data was pulled with just the player’s names as the keywords. However, when that testing was done, William Karlsson was shown to have very high Google Trends hits before he ever would have set foot in the NHL. Thus, NHL was added to the search terms to narrow down the results to make sure there wasn’t noise from people who had the same name as our subjects.

The data was them filtered so that each player’s stats were limited to those seasons where they were actually in the NHL.

#Generating Player Google Trends
trends <- gtrends(keyword=c("William Karlsson nhl","Mika Zibanejad nhl","Jeff Carter nhl"),time = "all")
trends_over_time <- trends$interest_over_time
trends_over_time <- trends_over_time %>% mutate(hits = as.numeric(hits))
trends_over_time <- trends_over_time %>% replace_na(list(hits = 0))

Karlsson_Trends <- trends_over_time %>% 
  subset(trends_over_time$keyword == "William Karlsson nhl") %>%
  filter(date > as.Date("2014-06-01"))
Zibanejad_Trends <- trends_over_time %>% 
  subset(trends_over_time$keyword == "Mika Zibanejad nhl") %>%
  filter(date > as.Date("2011-06-01"))
Carter_Trends <- trends_over_time %>% 
  subset(trends_over_time$keyword == "Jeff Carter nhl") %>%
  filter(date > as.Date("2005-06-01"))

ggplot() + geom_line(Carter_Trends, mapping = aes(date, hits*6, color = "Carter")) + geom_line(Karlsson_Trends, mapping = aes(date, hits*6, color = "Karlsson")) + geom_line(Zibanejad_Trends, mapping = aes(date, hits*6, color = "Zibanejad"))

If we want to compare this data to the goals data, we’ll need to adjust the “Season” column in those data frames so that they show a date rather than a character string. Originally, I had it fill in the dates from the end of each season, but in later iterations changes it to be the first day of the month in which the season ended. This will allow for correlation comparisons later.

#Converting Seasons to Dates and Plotting Sample Players Goals over Time
NameFunction <- function(q){
    q$Season[q$Season == "20052006"] <- "2006-6-01"
    q$Season[q$Season == "20062007"] <- "2007-6-01"
    q$Season[q$Season == "20072008"] <- "2008-6-01"
    q$Season[q$Season == "20082009"] <- "2009-6-01"
    q$Season[q$Season == "20092010"] <- "2010-6-01"
    q$Season[q$Season == "20102011"] <- "2011-6-01"
    q$Season[q$Season == "20112012"] <- "2012-6-01"
    q$Season[q$Season == "20122013"] <- "2013-6-01"
    q$Season[q$Season == "20132014"] <- "2014-6-01"
    q$Season[q$Season == "20142015"] <- "2015-6-01"
    q$Season[q$Season == "20152016"] <- "2016-6-01"
    q$Season[q$Season == "20162017"] <- "2017-6-01"
    q$Season[q$Season == "20172018"] <- "2018-6-01"
    q$Season[q$Season == "20182019"] <- "2019-6-01"
    q$Season[q$Season == "20192020"] <- "2020-9-01"
    q$Season[q$Season == "20202021"] <- "2021-6-01"
    q$Season[q$Season == "20212022"] <- "2022-6-01"
    q$Season <- as.POSIXct( q$Season)
    return(q)
}
Carter <- NameFunction(Carter)
Karlsson <- NameFunction(Karlsson)
Zibanejad <- NameFunction(Zibanejad)
ggplot() + geom_line(Carter, mapping = aes(x = Season, y = Goals, color = "Carter")) + geom_line(Karlsson, mapping = aes(x = Season, y = Goals, color = "Karlsson")) + geom_line(Zibanejad, mapping = aes(x = Season, y = Goals, color = "Zibanejad"))

Data Analysis

Now it’s time to visualize the relationship between goals and trends hits for each of our subject players.

#Plotting Players Goals and Google Hits over Time
ggplot() + geom_line(Carter, mapping = aes(x = Season, y = Goals, color = "Goals")) + geom_line(Carter_Trends, mapping = aes(date, hits*6, color = "Google Trends Hits")) +
  scale_y_continuous(name = "Goals", sec.axis = sec_axis(~., name="Google Trends Hits")) + 
  labs(title = "Jeff Carter Goals and Trends Hits")
ggplot() + geom_line(Karlsson, mapping = aes(x = Season, y = Goals, color = "Goals")) + geom_line(Karlsson_Trends, mapping = aes(date, hits*6, color = "Google Trends Hits")) +
  scale_y_continuous(name = "Goals", sec.axis = sec_axis(~., name="Google Trends Hits")) + 
  labs(title = "William Karlsson Goals and Trends Hits")
ggplot() + geom_line(Zibanejad, mapping = aes(x = Season, y = Goals, color = "Goals")) + geom_line(Zibanejad_Trends, mapping = aes(date, hits*6, color = "Google Trends Hits")) +
  scale_y_continuous(name = "Goals", sec.axis = sec_axis(~., name="Google Trends Hits")) + 
  labs(title = "Mika Zibanejad Goals and Trends Hits")
  • Jeff Carter does not seem to line up very well with our theory, with the bulk of his trends hits coming early in his career, before his outlier season. Furthermore, his trends hits before his outlier season do not seem to match his goals count, while those that come after his outlier season do. Ultimately, a very confounding result for our analysis.

  • William Karlsson, in comparison, follows our projections to a T. His breakout season comes with a massive spike in Google Trends hits, and we can see clear increases in his Google Trends hits after his outlier season. However, it also seems to line up fairly closely to his goals per year as well.

  • Mika Zibanejad might be the best example of our theory. His Google Trends hits follow his goals output fairly closely up until his outlier season. There he has a massive spike, and then further massive spikes in the years after, which are higher than his goal production. Unfortunately, there aren’t that many season after his outlier year, which raises the question of if his greater notoriety will continue.

Visual analysis can only take us so far. It’s time to do some math. Let’s start by finding out how much someone’s Google Trends changed after their outlier season.

#Analysis to determine impact of outlier year on notoriety
KarlssonBefore <- Karlsson_Trends %>%
  filter(date < as.Date("2017-06-01"))
KBAvg <- mean(KarlssonBefore$hits)
KarlssonAfter <- Karlsson_Trends %>%
  filter(date > as.Date("2018-06-01"))
KAAvg <- mean(KarlssonAfter$hits)
paste("William Karlsson recieved ",KAAvg/KBAvg," times the Google Hits after their outlier season.")

CarterBefore <- Carter_Trends %>%
  filter(date < as.Date("2017-06-01"))
CBAvg <- mean(CarterBefore$hits)
CarterAfter <- Carter_Trends %>%
  filter(date > as.Date("2018-06-01"))
CAAvg <- mean(CarterAfter$hits)
paste("Jeff Carter recieved ",CAAvg/CBAvg," times the Google Hits after their outlier season.")

ZibanejadBefore <- Zibanejad_Trends %>%
  filter(date < as.Date("2017-06-01"))
ZBAvg <- mean(ZibanejadBefore$hits)
ZibanejadAfter <- Zibanejad_Trends %>%
  filter(date > as.Date("2018-06-01"))
ZAAvg <- mean(ZibanejadAfter$hits)
paste("Mika Zibanejad recieved ",ZAAvg/ZBAvg," times the Google Hits after their outlier season.")

Hm. Well, two out of three ain’t bad, but it’s not a ringing endorsement either. That said, remember how William Karlsson’s goals output tracked his Google Trends hits through each year of his career? What if Google Trends follows goals output more closely, and notoriety from a peak season doesn’t have lingering effects? Let’s run the same code as above, but this time for goals. If the change in Google Hits is close to the change in Goals, that will suggest that notoriety isn’t sticky, and you’ve got to keep scoring if you want people to search your name.

#Analysis to determine impact of outlier year on goals
KarlssonGoalsBefore <- Karlsson %>%
  filter(Season < as.Date("2017-06-01"))
KGBAvg <- mean(KarlssonGoalsBefore$Goals)
KarlssonGoalsAfter <- Karlsson %>%
  filter(Season > as.Date("2018-06-01"))
KGAAvg <- mean(KarlssonGoalsAfter$Goals)
paste("William Karlsson scored ",KGAAvg/KGBAvg," times the goals after their outlier season.")

CarterGoalsBefore <- Carter %>%
  filter(Season < as.Date("2017-06-01"))
CGBAvg <- mean(CarterGoalsBefore$Goals)
CarterGoalsAfter <- Carter %>%
  filter(Season > as.Date("2018-06-01"))
CGAAvg <- mean(CarterGoalsAfter$Goals)
paste("Jeff Carter scored ",CGAAvg/CGBAvg," times the goals after their outlier season.")

ZibanejadGoalsBefore <- Zibanejad %>%
  filter(Season < as.Date("2017-06-01"))
ZGBAvg <- mean(ZibanejadGoalsBefore$Goals)
ZibanejadGoalsAfter <- Zibanejad %>%
  filter(Season > as.Date("2018-06-01"))
ZGAAvg <- mean(ZibanejadGoalsAfter$Goals)
paste("Mika Zibanejad scored ",ZGAAvg/ZGBAvg," times the goals after their outlier season.")

So we need two things to be true to prove our hypothesis true. Firstly, the average number of Google Hits per season needs to be greater after someone’s outlier season. This is true for Karlsson and Zibanejad. Secondly, we need the ratio of Google Hits before and after the outlier season to be greater than the ratio of goals before and after the outlier season. This shows that the notoriety effects of the outlier season have stuck around, rather than falling off when goal production dropped.

Here, Carter continues to confound our data, with his goal production after his outlier season also dropping, but by a lower rate. Karlsson and Zibanejad do both conform to our model, but for Zibanejad it’s not by much.

At this point, it’s worth investigating how closely goals and Google Hits actually correlate. Here, I’ve modified all the trends data so that the Google hits are all grouped with the seasons in which they happened, allowing for a direct comparison between all the goals scored, and all the Google Hits received in a single season. Then, I ran a cor.test on them to find how closely the two figures were related.

#Jeff Carter Correlation Data
Carter$YearCode <- seq.int(nrow(Carter))
Carter_Trends$YearCode = 1
for (
  y in 1:nrow(Carter)
) {
  Carter_Trends <- within(Carter_Trends, YearCode[date > Carter[[y,1]]] <- Carter[[y,3]]+1)
}
Carter_Trends <- Carter_Trends %>%
  group_by(YearCode) %>%
  summarize(sum(hits))
CarterData <- merge(Carter,Carter_Trends,by="YearCode")
CarterCorrelation <- cor.test(CarterData$Goals,CarterData$`sum(hits)`)

#William Karlsson Correlation Data
Karlsson$YearCode <- seq.int(nrow(Karlsson))
Karlsson_Trends$YearCode = 1
for (
  y in 1:nrow(Karlsson)
) {
  Karlsson_Trends <- within(Karlsson_Trends, YearCode[date > Karlsson[[y,1]]] <- Karlsson[[y,3]]+1)
}
Karlsson_Trends <- Karlsson_Trends %>%
  group_by(YearCode) %>%
  summarize(sum(hits))
KarlssonData <- merge(Karlsson,Karlsson_Trends,by="YearCode")
KarlssonCorrelation <- cor.test(KarlssonData$Goals,KarlssonData$`sum(hits)`)

#Mika Zibanejad Correlation Data
Zibanejad$YearCode <- seq.int(nrow(Zibanejad))
Zibanejad_Trends$YearCode = 1
for (
  y in 1:nrow(Zibanejad)
) {
  Zibanejad_Trends <- within(Zibanejad_Trends, YearCode[date > Zibanejad[[y,1]]] <- Zibanejad[[y,3]]+1)
}
Zibanejad_Trends <- Zibanejad_Trends %>%
  group_by(YearCode) %>%
  summarize(sum(hits))
ZibanejadData <- merge(Zibanejad,Zibanejad_Trends,by="YearCode")
ZibanejadCorrelation <- cor.test(ZibanejadData$Goals,ZibanejadData$`sum(hits)`)

#Examining Correlation Data
print(CarterCorrelation)
print(KarlssonCorrelation)
print(ZibanejadCorrelation)

Once again, Jeff Carter ruins everything (As a longtime Edmonton Oilers fan, I said that for about a decade). With a ludicrously high p-value, we cannot assume a connection between his goals scored and his Google hits by any metric. The real question to ask with him is why he had such high Google hits counts so early in his career. Karlsson and Sibanejad, on the other hand, have very, very low p-values, and we can likely assume a relationship between their goals and their Google hits. Karlsson in particular has a very high correlation between his goal counts and Google hits, supporting earlier analysis

Conclusions and Follow Up

In conclusion, we probably can’t conclude much from this analysis. Disappointing, but that’s the way it goes sometimes. Jeff Carter is too bizarre of a case to make many sweeping claims based on the data, and William Karlsson would suggest that there isn’t a strong lingering effect from an outlier season’s notoriety. If I had to hazard a guess why this is the case, it would be that after an outlier season, NHL fans start to care more about the new season’s goal leaders, leaving the previous season’s behind.

If I were to continue analysis in this vein, I would wonder how player notoriety changed once they were traded (Or moved in free agency) from a small market team to a large market team.