GitHub - Eyring-Miller-Muriithi-capstone/shot-caller-for-ballers

Data Science Capstone--Take The Shot or Not.

Project Summary

In this Capstone project, our team has gathered data from the NBA api with a goal to predict whether or not a player should take a shot. The goal is to create a machine learning model that will allow players the opportunity to take a shot based on previous activities leading to the chance.

You can view our team's presentation deck here: Take the shot or not presentation

Initial Questions

Can a model predict when a player should choose to make a 3 point field goal or not?

Is home court advantage really an advantage?

Do players have a favorite shooting position

Do players need rest in order for them to perform better

Project Objectives

Creating clusters for nba shooting locations

Creating tiers for the league's 3 point shooters

Visualizing players 3 point shooting locations

Predicting whether a player should take the shot or not

Data Dictionary

Feature	Datatype	Definition
player	84228 non-null: object	player's official name
player_id	84228 non-null: int64	player's unique id
team	84228 non-null: object	team name
team_id	84228 non-null: int64	team's unique id
game_id	84228 non-null: int64	game's unique id
home	84228 non-null: bool	home games identified by 1
period	84228 non-null: int64	1st 2nd 3rd and 4th periods, data on overtime also included
abs_time	84228 non-null: int64	engineered: how much a player is on the court based on their rotational data
play_time	84228 non-null: float64	engineered: how much a player is on the court based on rotational data
since_rest	84228 non-null: float64	player's time on the court since the last time they rested
loc_x	84228 non-null: int64	location of three point shot on the court
loc_y	84228 non-null: int64	location of three point shot on the court
zone	84228 non-null: object	engineered: shooting clusters engineered with KMeans
shot_type	84228 non-null: object	the type of shot based on the player's shooting mechanics
score_margin	84228 non-null: int64	the difference in scores of the winning vs losing team
points	84228 non-null: int64	points per player per game
shot_result	84228 non-null: object	boolean shot missed/ made
games_played	84228 non-null: int64	number of games played in the season
game_3pa	84228 non-null: int64	3point shots attempted per game
game_3pm	84228 non-null: int64	3point shots made per game
game_3miss	84228 non-null: int64	3point shots missed in a game
cum_3pa	84228 non-null: int64	cumulative 3 point shots attempts in a season
cum_3pm	84228 non-null: int64	cumulative 3 point shots made in a season
cum_3miss	84228 non-null: int64	cumulative 3 point shots missed in a season
cum_3pct	84228 non-null: float64	cumulative 3 point shot percent for the season
tm_v1	84228 non-null: float64	engineered tiers for shooters
tm_v2	84228 non-null: float64	engineered tiers for shooters, most accurate
tm_v3	84228 non-null: float64	engineered tiers for shooters
distance	84228 non-null: float64	distance of shots from the rim
game_event_id	84228 non-null: int64	chronological game events

Plan

We plan to use the NBA api to acquire individual players game statistics for the model.

Clean the data

Visualize the data through the EDA process

Conduct a number of hypothesis tests

Conduct an ensemble of classification models

Draw conclusions

Initial Hypotheses

Hypothesis 1 -

H_0: Is there a statistical relationship on made shots between home and away games

Hypothesis 2 -

H_0: Does the period/quarter have a statistical relationship with made shots

Hypothesis 3 -

H_0: Is there a relationship between the zones and how accurate a player is

Hypothesis 4 -

H_0: Is there a relationship between distance the shot is taken and the shot accuracy

Hypothesis 5 -

H_0: Does rest affect a players shooting ability

Hypothesis 6 -

H_0: Is the engineered feature tm_v2 accurately showing which players are better at 3 point shooting

Hypothesis 7 -

H_0: Is there a relationship between score margin and whether or not a shot is made

Executive Summary - Conclusions & Next Steps

Actions: Using official stats from the NBA.com/stats database, with endpoints accessed through a third party API client, we collected data on shots taken in the 2021-2022 season and filtered for three point shots. This data includes a number of in-game and player specific features, including both raw data and engineered features. After an in-depth analysis, we split the dataset into training, validation and unseen test sets and applied a number of classification models varying hyperparameters. We also created a informational dashboard for players and coaches to use.

Conclusions: We discovered that ‘total game playing time for a player, score margin, points, games played, distance of shot, shooting zone, if it’s the fourth period, and our unique player metric are all significant factors of making a three point shot (as measured at the time of shot). Unfortunately, even with extensive analysis, for a general league-wide model we were only able to beat baseline by 0.9% (0.6% absolute improvement), and thus we feel it is inadequate to recommend at this time. However, when modeling by individual player, we typically get better results.

Recommendations: There is a lot of additional data found in the stats that we feel have a strong impact on three point success, such as distance from defender, how many dribbles before the shot, and time remaining on the shot clock. However they are filters, not returned data, so a more rigorous (and computationally intensive) acquisition can be performed to get these features. Also, to improve player-specific models, we can use multiple years of data, including playoffs and even other leagues contained in the database (G-League, WNBA).

Project Steps:

Acquire

> - Our analysis and model are focused on three pointers taken in the most recent (2021-2022) regular season. Out goal was to acquire a dataframe that had every three point shot taken by eveyr player who took one, along with the result (Made or Missed) and a number of features calculated for the player and game right up to the moment the shot was taken. The [Acquire Workbook](*/acquire_workbook.ipynb) associated with this report contains the step by step process of building our initial, raw dataset. In brief:

1. Get team-player data from 'teamplayerdashboard' endpoint and use it to acquire all shots taken in the '21-'22 Regular Season using shot information from the 'shotchartdetail' endpoint.
2. Isolate 3-pt shots from this data and remove outliers. We used shot distance, acquired from 'shotchartdetail', as the metric to determine outliers.  Outliers were shots taken from a distance 1.5 times the IQR of all three point shots.  This eliminated all shots taken from 30 feet and beyond (2188 shots, or 2.5% of all three pointers).
3. Apply a KMeans clustering algorithm on all remaining three point attempts to identify the natural 'zones' from which players are shooting.  Using a k-elbow test we determined seven (7) clusters were optimal, improving upon the NBA standard five (5) shot zones.

Prepare

For additional pre-processing/cleaning, we perform the following actions detailed in the Wrangle Workbook:

1. Create cumulative three point stats for all players across all games.
2. Add location back in using Pythagorean Theorem on loc_x and loc_y (was lost in the acqusition process; this is also an improvement as it is a float and does not bin by integer foot).
3. Add back in 'game_event_id', as it is needed for the Tableau Dashboards.
4. Encode categoricals, scale numericals, split into train/validate/test, and finally seperate the X_ (features) from the y_ (target).
5. Return the following dataframes:
- df: Complete initial dataframe with pre-processing perfomed, used for Univariate Distribution Analysis
- df_outlier: A dataframe containing the outlier three point shots, in case needed in future
- X_train_exp: A dataframe for exploring the train data split (uneconded and unscaled)
- X_train, X_validate, X_test: Scaled and encoded features for train/validate/test, used for bi/multivariate EDA and modeling
- y_train, y_validate, y_test: Target (actual outcomes), used for bi/multivariate EDA and modeling

EDA

From the visualization exploration combined with statistical testing these are the features that were significant for modeling:

  - Location on court (zone and distance)
  - Playtime (how long a player has been on court)
  - Score margin
  - Number of games played
  - Calculated player three point shooting ability (Jem-metric)

Modeling

After isolating the scaled and encoded features into our modeling feature (X) dataframes, we built a function to train a total of 125 model-hyperparameter combinations, including ensemble models, on our data. The function applies the model to the validate dataset to check for overfitting of the training model. The function and associated modeling details are found in the Modeling Workbook.

Base models include:

K-Nearest Neighbors (k = 1 to 25)
Decision Tree (max_depth = 1 to 25)
Logistical Regression
Random Forest (leaf = 1 to 3; depth = 2 to 5)
Extra Trees (leaf = 1 to 3; depth = 2 to 5; n_estimators = 100, 150, 200, 250)
Ensemble models of the above.

Model results

Taking our key features, we used a Decision Tree Classification Model that returned a 0.7% increase (overall 1.1% over baseline) in predicting whether a shot would be made or missed from each zone on the court. Individual player models seem more promising, but need more data to run modeling on.

Individual player 3pt assessment

We created a new metric, Jem-metric of the Top 3pt shooters scale could give the NBA a better estimation of which players have the best 3pt shot efficiency from game-to-game.

Takeaways

- For a general, league-wide three point shooting model, using the drivers above as key features, we exceeded the predictive power of a baseline model by 0.6 percentage points (0.9% improvement over base).
- Applying modeling to individual players, the best models vary as well as the results, although almost all beat baseline on validation data (including one by 8.8%).  Unfornately, running the model on the test data of a single individual example, we actually performed more poorly than the league-wide.  However this can be due to a signifigant smaller data size whcih can be improved upon by brining in additional years of data on the player.

Reproduce Our Project

In order to reproduce this project, you will need all the necessary files listed below to run the final project notebook.

Read this README.md
Use the acquire.py to run the tome_prep function
Use the wrangle.py file to run the wrangle_prep function to get the cleaned, split, encoded and scaled data
Use the explore.py to do the univariate, bivariate analysis
Continue with your analysis and draw new conclusions

Name		Name	Last commit message	Last commit date
Latest commit History 118 Commits
Notebooks		Notebooks
.gitignore		.gitignore
README.md		README.md
Take_the_Shot_or_Not_Final.ipynb		Take_the_Shot_or_Not_Final.ipynb
acquire.py		acquire.py
acquire_workbook.ipynb		acquire_workbook.ipynb
all_last_season_shots.csv		all_last_season_shots.csv
all_last_season_shots_3pt_clusters.csv		all_last_season_shots_3pt_clusters.csv
core_df.csv		core_df.csv
explore.py		explore.py
explore_workbook.ipynb		explore_workbook.ipynb
last_season_outlier_3PA.csv		last_season_outlier_3PA.csv
league_3pa.csv		league_3pa.csv
modeling.py		modeling.py
modeling_workbook.ipynb		modeling_workbook.ipynb
predictions.csv		predictions.csv
team_player_ids.csv		team_player_ids.csv
wrangle.py		wrangle.py
wrangle_workbook.ipynb		wrangle_workbook.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data Science Capstone--Take The Shot or Not.

Project Summary

Initial Questions

Project Objectives

Data Dictionary

Plan

Initial Hypotheses

Executive Summary - Conclusions & Next Steps

Project Steps:

Acquire

Prepare

EDA

Modeling

Model results

Individual player 3pt assessment

Takeaways

Reproduce Our Project

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Eyring-Miller-Muriithi-capstone/shot-caller-for-ballers

Folders and files

Latest commit

History

Repository files navigation

Data Science Capstone--Take The Shot or Not.

Project Summary

Initial Questions

Project Objectives

Data Dictionary

Plan

Initial Hypotheses

Executive Summary - Conclusions & Next Steps

Project Steps:

Acquire

Prepare

EDA

Modeling

Model results

Individual player 3pt assessment

Takeaways

Reproduce Our Project

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages