TylerAPollard/ST558-Project-1

ST558 Project 1

Tyler Pollard 6/15/2021

Reading and Summarizing Data from the National Hockey League’s (NHL) API

Introduction

This vignette shows how to read and summarize National Hockey League (NHL) data using the tidyverse. We will use the NHL Records API and the NHL Stats API. The vignette first builds functions that query each API's endpoints, with optional modifiers based on a user-specified team, and then walks through an exploratory data analysis of the returned data.

Load Packages

Before reading and summarizing the data, some necessary packages must be loaded.

library(httr)
library(jsonlite)
library(rmarkdown)
library(knitr)
library(tidyverse)
library(dplyr)
library(readr)
library(DT)
library(lubridate)
library(ggplot2)
library(stringr)

NHL Record API

First we create a helper function that takes an endpoint name from the user and returns the corresponding data from the NHL Records API as a tibble. The functions built on this helper give access to franchise information, team totals, season records, goalie records, skater records, and admin history for each team.

API_info <- function(call){
  base <- "https://records.nhl.com/site/api"
  get_info <- GET(paste0(base, "/", call))
  get_info_text <- content(get_info, as = "text")
  get_info_json <- fromJSON(get_info_text, flatten = TRUE)
  get_info_df <- as.data.frame(get_info_json)
  get_info_tbl <- as_tibble(get_info_df)  # as_tibble() replaces the deprecated tbl_df()
  return(get_info_tbl)
}
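
To see where the data. prefix on the column names used throughout this vignette comes from, the sketch below runs the same parsing steps as API_info on a small inline JSON string instead of a live request. The payload and its values are made up for illustration; only the overall shape (a data array plus a total count) is assumed, based on the column names that appear later in this vignette.

```r
library(jsonlite)

# Hypothetical payload; the shape (a "data" array plus "total") is an assumption
# inferred from column names such as data.id and total used in this vignette.
sample_json <- '{"data":[{"id":1,"fullName":"Team A"},{"id":2,"fullName":"Team B"}],"total":2}'

# Same parsing steps as API_info(), minus the HTTP request
parsed <- fromJSON(sample_json, flatten = TRUE)
sample_df <- as.data.frame(parsed)

# Each field inside "data" becomes a column prefixed with "data."
names(sample_df)
```

Working from a local string like this can also be a convenient way to test downstream code when the API is unavailable.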

Franchise

This function allows the user to access basic franchise information for each team. No input is required by the user to call this data.

franchise_info <- function(){
  franchise_df <- API_info("franchise")
  franchise_df <- franchise_df %>% select(data.id, data.mostRecentTeamId, data.fullName, data.teamPlaceName,
                                                  data.teamCommonName, data.teamAbbrev, data.firstSeasonId,
                                                  data.lastSeasonId, total) %>% arrange(data.mostRecentTeamId)
  return(franchise_df)
}

Franchise IDs

The following function outputs the possible franchise IDs that are used in the functions for season records, goalie records, skater records, and admin history for each franchise.

franchise_identifer <- function(){
  franchise_IDs <- franchise_info()
  franchise_IDs <- franchise_IDs %>% select(data.fullName, data.id) %>% arrange(data.id)
  colnames(franchise_IDs) <- c("Team Name" , "Franchise ID")
  return(franchise_IDs)
}
kable(franchise_identifer())
Team Name Franchise ID
Montréal Canadiens 1
Montreal Wanderers 2
St. Louis Eagles 3
Hamilton Tigers 4
Toronto Maple Leafs 5
Boston Bruins 6
Montreal Maroons 7
Brooklyn Americans 8
Philadelphia Quakers 9
New York Rangers 10
Chicago Blackhawks 11
Detroit Red Wings 12
Cleveland Barons 13
Los Angeles Kings 14
Dallas Stars 15
Philadelphia Flyers 16
Pittsburgh Penguins 17
St. Louis Blues 18
Buffalo Sabres 19
Vancouver Canucks 20
Calgary Flames 21
New York Islanders 22
New Jersey Devils 23
Washington Capitals 24
Edmonton Oilers 25
Carolina Hurricanes 26
Colorado Avalanche 27
Arizona Coyotes 28
San Jose Sharks 29
Ottawa Senators 30
Tampa Bay Lightning 31
Anaheim Ducks 32
Florida Panthers 33
Nashville Predators 34
Winnipeg Jets 35
Columbus Blue Jackets 36
Minnesota Wild 37
Vegas Golden Knights 38
Seattle Kraken 39

This function lets the user specify a franchise by either franchise name or franchise ID. It stops with an instructional error message if the user enters an invalid franchise name or franchise ID.

franchise_ID <- function(franchise){
  if(is.numeric(franchise)){
    if(franchise > 0 & franchise < 39)
      return(franchise)
    else{
      stop("Please enter a valid franchise ID from 1-38")
    }
  }else{
    ID_df <- API_info("franchise-team-totals")
    if(franchise %in% ID_df$data.teamName){
      ID_num <- ID_df %>% select(data.teamName, data.franchiseId) %>% filter(data.teamName == franchise) %>% select(data.franchiseId) %>% filter(row_number() == 1)
      return(ID_num[[1]][1])
    }else{
      stop("Please enter a valid franchise name")
    }
  }
}
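
As a minimal sketch of the same name-or-ID logic without a network call, the hypothetical helper below resolves a franchise against a small local lookup table. The helper name resolve_id and the lookup table are illustrative and not part of the vignette's functions; the three name and ID pairs are copied from the franchise ID table above.

```r
# Illustrative lookup table; rows copied from the franchise ID table above
lookup <- data.frame(
  name = c("Boston Bruins", "Washington Capitals", "Seattle Kraken"),
  id = c(6, 24, 39),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accept either a numeric ID or a team name
resolve_id <- function(x, lookup) {
  if (is.numeric(x)) {
    if (!(x %in% lookup$id)) stop("Please enter a valid franchise ID")
    return(x)
  }
  hit <- lookup$id[lookup$name == x]
  if (length(hit) == 0) stop("Please enter a valid franchise name")
  hit[1]
}

resolve_id("Washington Capitals", lookup)
resolve_id(6, lookup)
```

Checking a numeric ID against the table itself, rather than against a fixed range, is one way to keep the validation in step with the data if franchises are added.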

Franchise Team Totals

This function allows the user to access team totals for each franchise, split by regular season and playoff data. The active-franchise and game-type codes are recoded as labeled factors so they are easier for the user to read.

franchise_team_totals <- function(){
  team_df <- API_info("franchise-team-totals")
  team_df$data.activeFranchise <- as_factor(team_df$data.activeFranchise)
  levels(team_df$data.activeFranchise) <- list("No" = 0, "Yes" = 1)
  team_df$data.gameTypeId <- as_factor(team_df$data.gameTypeId)
  levels(team_df$data.gameTypeId) <- list("Regular Season" = 2, "Playoffs" = 3)
  team_df <- team_df %>% select(data.id, data.franchiseId, data.teamName, data.triCode, data.teamId, data.activeFranchise, data.firstSeasonId, data.lastSeasonId, everything()) %>% arrange(data.franchiseId)
  return(team_df)
}
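
The recoding step in the function above assigns a named list to levels(). The sketch below shows that idiom in isolation on a short, made-up vector of game type codes, using base R's factor() in place of as_factor() so it runs without extra packages.

```r
# Game type codes as they might arrive from the API (illustrative values)
game_type <- factor(c(2, 3, 2, 3))

# Assigning a named list to levels() maps each old level to a new label
levels(game_type) <- list("Regular Season" = "2", "Playoffs" = "3")

table(game_type)
```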

Season Records

The following function allows the user to access season records for all of the franchises. The user is given the option to filter the data by franchise or franchise ID.

season_record <- function(franchise = NULL){
  if(is.null(franchise)){
    season_df <- API_info("franchise-season-records")
    season_df <- season_df %>% select(data.id, data.franchiseName, data.franchiseId, everything())
    return(season_df)
  }else{
    season_df <- API_info(paste0("franchise-season-records?cayenneExp=franchiseId=", franchise_ID(franchise)))
    season_df <- season_df %>% select(data.id, data.franchiseName, data.franchiseId, everything())
    return(season_df)
  }
}
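
The only difference between the two branches above is the cayenneExp modifier appended to the endpoint. As a sketch, the hypothetical helper below builds the request URL in one place; the name build_records_url is illustrative, while the base URL and the modifier string are the ones used in this vignette.

```r
# Hypothetical helper: build a Records API URL with an optional franchise filter
build_records_url <- function(endpoint, franchise_id = NULL) {
  base <- "https://records.nhl.com/site/api"
  url <- paste0(base, "/", endpoint)
  if (!is.null(franchise_id)) {
    # Same modifier string used by season_record(), goalie_record(), and skater_record()
    url <- paste0(url, "?cayenneExp=franchiseId=", franchise_id)
  }
  url
}

build_records_url("franchise-season-records")
build_records_url("franchise-season-records", 24)
```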

Goalie Records

This function allows the user to access goalie records for all of the franchises. The user is given the option to filter the data by franchise or franchise ID.

goalie_record <- function(franchise = NULL){
  if(is.null(franchise)){
    goalie_df <- API_info("franchise-goalie-records")
    return(goalie_df)
  }else{
    goalie_df <- API_info(paste0("franchise-goalie-records?cayenneExp=franchiseId=", franchise_ID(franchise)))
    return(goalie_df)
  }
}

Skater Records

The skater_record function allows the user to access skater records for all of the franchises. The user is given the option to filter the data by franchise or franchise ID.

skater_record <- function(franchise = NULL){
  if(is.null(franchise)){
    skater_df <- API_info("franchise-skater-records")
    return(skater_df)
  }else{
    skater_df <- API_info(paste0("franchise-skater-records?cayenneExp=franchiseId=", franchise_ID(franchise)))
    return(skater_df)
  }
}

Admin History and Retired Numbers

The admin_history function allows the user to access admin history and retired numbers for every franchise. The user may also specify a franchise in order to pull data for a team of their choice.

admin_history <- function(franchise = NULL){
  if(is.null(franchise)){
    admin_history_df <- API_info("franchise-detail")
    admin_history_df <- admin_history_df %>% select(data.mostRecentTeamId, data.teamFullName, data.teamAbbrev, data.id, everything())
    return(admin_history_df)
  }else{
    admin_history_df <- API_info(paste0("franchise-detail?cayenneExp=mostRecentTeamId=", team_ID(franchise)))
    admin_history_df <- admin_history_df %>% select(data.mostRecentTeamId, data.teamFullName, data.teamAbbrev, data.id, everything())
    return(admin_history_df)
  }
}

Team IDs

This function outputs the various team IDs for every franchise in the database. These are the only valid inputs to the admin_history function.

team_identifer <- function(){
  team_IDs <- franchise_team_totals()
  team_IDs <- team_IDs %>% select(data.teamName, data.teamId) %>% arrange(data.teamId)
  team_IDs <- team_IDs[seq(1, length(team_IDs$data.teamName), by = 2),]
  colnames(team_IDs) <- c("Team Name" , "Team ID")
  return(team_IDs)
}
kable(team_identifer())
Team Name Team ID
New Jersey Devils 1
New York Islanders 2
New York Rangers 3
Philadelphia Flyers 4
Pittsburgh Penguins 5
Boston Bruins 6
Buffalo Sabres 7
Montréal Canadiens 8
Ottawa Senators 9
Toronto Maple Leafs 10
Atlanta Thrashers 11
Carolina Hurricanes 12
Florida Panthers 13
Tampa Bay Lightning 14
Washington Capitals 15
Chicago Blackhawks 16
Detroit Red Wings 17
Nashville Predators 18
St. Louis Blues 19
Calgary Flames 20
Colorado Avalanche 21
Edmonton Oilers 22
Vancouver Canucks 23
Anaheim Ducks 24
Dallas Stars 25
Los Angeles Kings 26
Phoenix Coyotes 27
San Jose Sharks 28
Columbus Blue Jackets 29
Minnesota Wild 30
Minnesota North Stars 31
Quebec Nordiques 32
Winnipeg Jets (1979) 33
Hartford Whalers 34
Colorado Rockies 35
Ottawa Senators (1917) 36
Hamilton Tigers 37
Pittsburgh Pirates 38
Detroit Cougars 40
Montreal Wanderers 41
Montreal Maroons 43
New York Americans 44
St. Louis Eagles 45
Oakland Seals 46
Atlanta Flames 47
Cleveland Barons 49
Detroit Falcons 50
Winnipeg Jets 52
Arizona Coyotes 53
Vegas Golden Knights 54
California Golden Seals 56
Toronto Arenas 57
Toronto St. Patricks 58

The following function is the team-to-ID mapper. It accepts either the character team name (e.g. “Washington Capitals”) or a team ID. If the input is invalid, the function stops with an instructional error message.

team_ID <- function(franchise){
  if(is.numeric(franchise)){
    if(franchise > 0 & franchise <= 58){
      if(franchise %in% c(11,27,31,32,33,34,35,36,38,40,42,44,46,47,48,50,55)){
        stop("No team ID for 11, 27, 31, 32, 33, 34, 35, 36, 38, 40, 42, 44, 46, 47, 48, 50, 55")
      }else{
        return(franchise)
      }
    }else{
      stop("Please enter a valid team ID from 1-58")
    }
  }else{
    ID_df <- API_info("franchise")
    if(franchise %in% ID_df$data.fullName){
      ID_num <- ID_df %>% select(data.fullName, data.mostRecentTeamId) %>% filter(data.fullName == franchise) %>% select(data.mostRecentTeamId) %>% filter(row_number() == 1)
      return(ID_num[[1]][1])
    }else{
      stop("Please enter a valid franchise name")
    }
  }
}

NHL Stats API

The next API we will access is the NHL Stats API. The following function queries its teams endpoint. Calling the function with no input returns the stats for every team; specifying a team name or team ID returns the stats for that team only.

team_stats <- function(franchise = NULL){
  base <- "https://statsapi.web.nhl.com/api/v1/teams"
  if(is.null(franchise)){
    url <- paste0(base, "?expand=team.stats")
  }else{
    ID <- team_ID(franchise)
    url <- paste0(base, "/", ID, "?expand=team.stats")
  }
  get_stats <- GET(url)
  get_stats_text <- content(get_stats, as = "text")
  get_stats_json <- fromJSON(get_stats_text, flatten = TRUE)
  get_stats_df <- as.data.frame(get_stats_json)
  get_stats_tbl <- as_tibble(get_stats_df)  # as_tibble() replaces the deprecated tbl_df()
  return(get_stats_tbl)
}

NHL One Stop Call Function

The NHL_info function is a one-stop shop for any of the above endpoints. If the user specifies a franchise, it is passed through to the chosen endpoint as a modifier. This function is used to produce the tables and plots below. The possible inputs for endpoints are as follows:

  • “franchise” to access the franchise info for all of the recorded franchises
  • “team totals” to access the data from each team parsed by regular season and playoffs
  • “season records” to access season records for each team
  • “goalie records” to access goalie records for each team
  • “skater records” to access skater records for each team
  • “franchise history” to access admin history and retired numbers for each team
  • “stats” to access the most recent statistics for each team
NHL_info <- function(endpoint, ...){
  if(endpoint == "franchise"){
    franchise_info()
  }else if(endpoint == "team totals"){
    franchise_team_totals()
  }else if(endpoint == "season records"){
    season_record(...)
  }else if(endpoint == "goalie records"){
    goalie_record(...)
  }else if(endpoint == "skater records"){
    skater_record(...)
  }else if(endpoint == "franchise history"){
    admin_history(...)
  }else if(endpoint == "stats"){
    team_stats(...)
  }else{
    stop("Please enter valid endpoint")
  }
}
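
The same dispatch pattern can also be written with switch(). The sketch below is one possible alternative, not the vignette's implementation: the stub functions and the name info_sketch are hypothetical stand-ins so the example runs without calling the API.

```r
# Stub endpoint functions standing in for the real ones (hypothetical)
franchise_stub <- function() "franchise data"
season_stub <- function(franchise = NULL) {
  if (is.null(franchise)) {
    "season records for all franchises"
  } else {
    paste("season records for", franchise)
  }
}

# switch()-based dispatcher; extra arguments are forwarded through ...
info_sketch <- function(endpoint, ...) {
  switch(endpoint,
    "franchise" = franchise_stub(),
    "season records" = season_stub(...),
    stop("Please enter valid endpoint")  # unnamed last argument is the default
  )
}

info_sketch("franchise")
info_sketch("season records", franchise = "Washington Capitals")
```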

NHL Exploratory Data Analysis

Newer Teams

This contingency table looks at the teams from the franchise endpoint of the NHL Records API based on when they joined the NHL. I was curious to see how many teams joined the NHL after I was born, so I categorized all of the teams as older or younger than me. For reference, I was born in 1995. Below it can be seen that 6 of the 39 teams joined the league in or after 1995 and are therefore younger than me.

franchise_df <- NHL_info("franchise")
franchise_df$data.firstSeasonId <- paste0(substr(franchise_df$data.firstSeasonId,1,4),"/",substr(franchise_df$data.firstSeasonId,5,8))
franchise_df$year.split <- str_split(franchise_df$data.firstSeasonId, "/")
for(j in 1:nrow(franchise_df)){
  if(as.numeric(franchise_df$year.split[[j]][1]) >= 1995){
    franchise_df$new.team[j] <- "Younger than me"
  }else{
    franchise_df$new.team[j] <- "Older than me"
  }
}
franchise_table <- table(franchise_df$new.team)
kable(franchise_table, col.names = c("Team Age", "Count"))
Team Age Count
Older than me 33
Younger than me 6
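
The loop above can also be written without str_split or a for loop. As a sketch, the vectorized version below applies the same 1995 cutoff to a few made-up season IDs in the same eight-digit format; the vector first_season_id is illustrative and is not pulled from the API.

```r
# Illustrative season IDs in the API's YYYYYYYY format (start year, then end year)
first_season_id <- c(19171918, 19261927, 19981999, 20172018)

# First four characters give the starting year of the first season
start_year <- as.numeric(substr(as.character(first_season_id), 1, 4))

# Vectorized classification with the same cutoff as the loop above
team_age <- ifelse(start_year >= 1995, "Younger than me", "Older than me")

table(team_age)
```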

Active Contingency Table

Below is a contingency table that counts the teams with regular season and playoff records, split by active and inactive franchises. It shows that 5 of 13 inactive teams and 43 of 44 active teams have made the playoffs. Summing the Regular Season column gives a total of 57 teams represented in this endpoint. Based on my knowledge of the NHL, I know that there are only 31 active teams currently in the league, so this information is misleading. For a better look at the active teams that have made the playoffs, see below.

team_totals_df <- NHL_info("team totals")
active_df <- team_totals_df %>% filter(data.gameTypeId == "Regular Season")
active_table <- table(team_totals_df$data.activeFranchise, team_totals_df$data.gameTypeId)
kable(active_table)
Regular Season Playoffs
No 13 5
Yes 44 43

Record Contingency Tables

This is a simple contingency table that shows the number of active teams that have ever made the playoffs after the regular season. There are 31 active teams, as seen in the Regular Season row, because all teams participate in the regular season, and all 31 of those teams have made the playoffs before. This data was found by joining information from the NHL Records API and the NHL Stats API.

team_totals_df <- NHL_info("team totals")
colnames(team_totals_df)[3] <- "teams.name"
stats_df <- NHL_info("stats")
playoff_df <- left_join(stats_df, team_totals_df)
playoff_table <- table(playoff_df$data.gameTypeId)
kable(playoff_table, col.names = c("Game Type", "Number of Teams"))
Game Type Number of Teams
Regular Season 31
Playoffs 31

Goalie Single Game Records

Below is a summary table of the median values for goalie single-game records, split by active and inactive players. The number of goalies in each group is included to show how much data is being summarized. Based on the median of the most shots faced in one game, goalies today are slightly busier than earlier goalies, but not by much. I also included the median of the most saves and the most goals allowed in one game to see whether goalie skill has increased or decreased. Based on these values, goalie skill appears to have stayed about the same, with a possible slight increase in effectiveness.

goalie_df <- NHL_info("goalie records")
goalie_df$data.activePlayer <- as_factor(goalie_df$data.activePlayer)
levels(goalie_df$data.activePlayer) <- list("Active" = TRUE, "Inactive" = FALSE)
goalie_game_df <- goalie_df %>% group_by(data.activePlayer) %>% summarise(`Number of Goalies` = n(), `Most Shots` = round(median(data.mostShotsAgainstOneGame, na.rm = TRUE)), `Most Saves` = round(median(data.mostSavesOneGame, na.rm = TRUE)), `Most Goals` = round(median(data.mostGoalsAgainstOneGame, na.rm = TRUE)))
colnames(goalie_game_df)[1] <- "Players Status"
kable(goalie_game_df)
Players Status Number of Goalies Most Shots Most Saves Most Goals
Active 149 46 43 6
Inactive 929 44 40 7

Washington Capitals Penalty Minutes

Below is a summary table of penalty minutes for the Washington Capitals, split by position. The number of skaters at each position is included to show how much data goes into each average. Based on the averages, left wings and defensemen spend the most time in the penalty box, whereas centers seem to play a bit cleaner. However, I found it interesting that the highest individual penalty-minute total by far belongs to a center.

WSH_skater_df <- NHL_info("skater records", franchise = "Washington Capitals")
for(k in 1:nrow(WSH_skater_df)){
  WSH_skater_df$full.name[k] <- paste(WSH_skater_df$data.firstName[k], WSH_skater_df$data.lastName[k], sep = " ")
}
WSH_skater_df$data.positionCode <- as_factor(WSH_skater_df$data.positionCode)
levels(WSH_skater_df$data.positionCode) <- list("Left Wing" = "L", "Right Wing" = "R", "Center" = "C", "Defensemen" = "D")
WSH_pen_mins <- WSH_skater_df %>% group_by(data.positionCode) %>% summarise(`Number of Skaters` = n(), `Average Penalty Minutes` = round(mean(data.penaltyMinutes)), `Max Penalty Minutes` = max(data.penaltyMinutes))
colnames(WSH_pen_mins)[1] <- "Position"
kable(WSH_pen_mins)
Position Number of Skaters Average Penalty Minutes Max Penalty Minutes
Left Wing 106 120 1220
Right Wing 109 99 1123
Center 125 86 2003
Defensemen 181 121 1628

Barplot of Goalies

Below is a barplot that shows the number of goalies with at least 10 shutouts in a season, separated by team and ordered from most to least. The purpose of this plot is to show the teams that have had the best goalies throughout history. It should be noted that not all teams are represented here, meaning that not every team has had a goalie with 10 or more shutouts in a season. The teams with the most such goalies are the Detroit Red Wings and the Boston Bruins with 4 goalies each, followed by the Toronto Maple Leafs, New York Rangers, and Montréal Canadiens with 3 goalies each.

library(forcats)
goalie_df <- NHL_info("goalie records")
goalie_df <- goalie_df %>% filter(data.mostShutoutsOneSeason >= 10) %>% arrange(data.mostShutoutsOneSeason)
ggplot(data = goalie_df, aes(y = fct_rev(fct_infreq(data.franchiseName)))) + geom_bar(fill = "blue") + labs(title = "Number of Goalies with 10 or more Shutouts in a Season", x = "Number of Goalies", y = "Team Name")

Histogram of Win Percentage

This plot shows a histogram of the winning percentage for all teams, scaled to density and split by regular season and playoffs. Overlaid is the density curve. From the plots it can be seen that the average winning percentage for both the regular season and the playoffs is around .5, which means that teams on average win as many games as they lose. This makes sense because each game ends with a single winner and a single loser. Each plot is left skewed, showing that more teams have been unsuccessful in both game types.

team_totals_df <- NHL_info("team totals")
team_totals_df <- team_totals_df %>% group_by(data.teamName)
team_totals_df$win.percentage <- team_totals_df$data.wins/(team_totals_df$data.wins + team_totals_df$data.losses)
ggplot(data = team_totals_df, aes(x  = win.percentage)) + geom_histogram(aes(y = after_stat(density)), bins = 20) + geom_density(lwd = 2, color = "red", position = "stack") + facet_grid(cols = vars(data.gameTypeId)) + labs(title = "Winning Percentage by Game Type", x = "Win Percentage", y = "Density")

Histogram of Washington Capitals Skater Seasons Played

Below is a histogram of the number of seasons played by every Washington Capitals skater. The plot is right skewed, meaning that most skaters who have played for the Capitals stayed for only a season or two. The number of skaters drops off beyond 10 seasons, which makes sense because that is a very long time to spend with one franchise.

WSH_skater_df <- NHL_info("skater records", "Washington Capitals")
ggplot(data = WSH_skater_df, aes(x = data.seasons)) + geom_histogram(fill = "red") + labs(title = "Washington Capitals Skater Seasons Played", x = "Seasons", y = "Number of Skaters")

Boxplot of Divisions

Below are boxplots of season points by division with the individual team points overlaid. Points for each team are calculated as follows:

  • 2 points for each win
  • 1 point for an overtime/shootout loss
  • 0 points for each regulation loss

The more points a team has, the better it is considered to be. From the visual below it can be seen that the MassMutual East was the best division on average; however, the best teams from the Discover Central and Honda West were better than the best teams in the MassMutual East.
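
The points rule listed above can be written as a one-line function. This is a small sketch with made-up win and loss counts; the function name season_points and its argument names are hypothetical and do not correspond to fields in the API.

```r
# Hypothetical helper: 2 points per win, 1 per overtime/shootout loss,
# 0 per regulation loss (so regulation losses do not appear in the formula)
season_points <- function(wins, ot_losses) {
  2 * wins + ot_losses
}

# Two made-up team records: 36 wins with 8 OT losses, 27 wins with 9 OT losses
season_points(wins = c(36, 27), ot_losses = c(8, 9))
```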

team_stats_df_1 <- team_stats()
team_stats_df <- team_stats_df_1 %>% select(teams.name, teams.division.name, teams.teamStats)
points <- c()
for(i in 1:nrow(team_stats_df)){
  points <- c(points, ((team_stats_df[[3]][[i]])[[1]][[1]])$stat.pts[1])
}
team_stats_df <- team_stats_df[-32,]
team_stats_df <- cbind(team_stats_df, points)
team_stats_df <- team_stats_df %>% select(-teams.teamStats)
team_stats_df$teams.division.name <- as_factor(team_stats_df$teams.division.name)
team_stats_df$points <- as.numeric(as.character(team_stats_df$points))
ggplot(data = team_stats_df, aes(x = teams.division.name, y = points)) + geom_boxplot() + geom_point(aes(color = teams.division.name), position = "jitter") + labs(title = "Boxplots of Season Points by Division", x = "Division", y = "Season Points") + scale_color_discrete(name = "Divisions")

Goals by Season Scatter Plot

The following plots show the number of goals scored by skaters against the number of seasons they played, with one panel per position. Overlaid is a smoothed trend line with error bounds to show the trend in the data. Based on the trend lines, it can be seen that as the number of seasons played increases, so does the number of goals scored. This makes sense because skaters who play more seasons play more games and have more opportunities to score. I also think it is interesting that the slopes of the trend lines are almost identical for the left wings, right wings, and centers, meaning that those positions score about the same number of goals for a given number of seasons played. The slope for the defensemen is much smaller than the other three, which also makes sense because that position has fewer opportunities to score.

skater_df <- NHL_info("skater records")
skater_df$data.positionCode <- as_factor(skater_df$data.positionCode)
levels(skater_df$data.positionCode) <- list("Left Wing" = "L", "Right Wing" = "R", "Center" = "C", "Defensemen" = "D")
ggplot(data = skater_df, aes(x = data.seasons, y = data.goals)) + geom_point() + geom_smooth() + facet_grid(cols = vars(data.positionCode)) + labs(title = "Goals Scored by Number of Seasons", x = "Number of Seasons", y = "Goals Scored")

Conclusion

In conclusion, this analysis was an introduction to accessing the various NHL APIs and conducting an exploratory data analysis on the different endpoints. The visuals shown here are only a starting point for what can be created from these NHL endpoints. I highly recommend using these functions and plotting code to explore more about the NHL based on your favorite team.
