NBA Shot-Log Project

Mohamed Hassan-El Serafi
5 min read · Oct 28, 2021


After finishing the first two Data Science projects, I started feeling confident about my capabilities. The data visualization and linear regression projects helped prepare me for the next project, which involved data classification. I learned from the previous project about continuous and categorical variables and what each type represented. When I learned that this project would deal with categorical variables, I instantly thought of a project idea.

Since I was a child, I have loved basketball. I played basketball almost every day during grade school, watched almost every Knicks game, and followed the stats of Knicks players and other NBA players. Basketball is largely judged on binary outcomes: wins and losses, and made and missed shot attempts. This fit perfectly with the guidelines of the project. I searched Kaggle for datasets that contained statistics of NBA players, with a focus on shot attempts. To my surprise, it didn’t take long to find one. It was a detailed breakdown of every shot attempt by every NBA player during the 2014–2015 season. It included information such as teams, location (home or away), game clock, shot clock, shot distance, closest defender distance, dribbles before the shot attempt, the amount of time a player had the ball before the shot attempt, whether the shot attempt was made or missed, the outcome of the game for the shooter’s team (win or loss), and the names of the defender and shooter. I decided to use the made or missed shot attempt as the target for my project. With the amount of information provided in this dataset, I felt I could determine which feature or features affected whether a shot was made or missed.

For my data cleaning and exploratory data analysis, my first action was to convert the made/missed shot attempt column into binary numerical values. Missed shots were converted to 0 and made shots were converted to 1. Most of the data did not have missing values. In fact, the only column that had missing values was the Shot_Clock column. I filled those values with the median value of the column. Next I created a dummy variable for the Location column, assigning 0 to away games and 1 to home games. Because there were over 128,000 rows of shot attempts from hundreds of players, I wanted to focus on a subset of NBA players. To do this, I used a value count to find the top 20 players in the dataset, created a new column for those 20 players, and then created dummy variables for each player. Additionally, I dropped features that I thought did not affect the outcome of the shot attempt. In my judgment, features such as final margin, matchup, and period don’t influence whether a player makes or misses a shot, so they weren’t included.
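The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up five-row sample, not the code used in the project: the column names (`SHOT_RESULT`, `SHOT_CLOCK`, `LOCATION`, `FINAL_MARGIN`, `player_name`), the `"made"` and `"H"` encodings, and the helper `clean_shots` are all assumptions for the sake of the example.

```python
import pandas as pd

# Made-up sample rows; column names and value encodings are assumptions.
shots = pd.DataFrame({
    "player_name": ["a", "a", "b", "b", "c"],
    "LOCATION": ["H", "A", "A", "H", "H"],
    "SHOT_CLOCK": [10.0, None, 4.0, 20.0, 12.0],
    "SHOT_DIST": [2.1, 24.5, 18.0, 3.3, 25.1],
    "SHOT_RESULT": ["made", "missed", "missed", "made", "missed"],
    "FINAL_MARGIN": [5, -3, -3, 7, 7],
})


def clean_shots(df, top_n=20):
    """Hypothetical helper mirroring the cleaning steps described above."""
    df = df.copy()
    # Target: 1 for a made shot, 0 for a missed shot.
    df["made"] = (df["SHOT_RESULT"] == "made").astype(int)
    # Fill missing shot-clock values with the column median.
    df["SHOT_CLOCK"] = df["SHOT_CLOCK"].fillna(df["SHOT_CLOCK"].median())
    # Location dummy: 1 for home games, 0 for away games.
    df["home"] = (df["LOCATION"] == "H").astype(int)
    # Keep only the players with the most shot attempts.
    top_players = df["player_name"].value_counts().head(top_n).index
    df = df[df["player_name"].isin(top_players)]
    # One dummy column per remaining player.
    dummies = pd.get_dummies(df["player_name"], prefix="player", dtype=int)
    df = pd.concat([df, dummies], axis=1)
    # Drop the raw columns that were encoded or judged irrelevant.
    return df.drop(
        columns=["SHOT_RESULT", "LOCATION", "FINAL_MARGIN", "player_name"]
    )


cleaned = clean_shots(shots, top_n=2)
print(cleaned)
```

With the real dataset, `top_n` would stay at 20; it is set to 2 here only so the tiny sample shows the filtering step.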

After I finished cleaning the data, I began to create my classification models. I used Logistic Regression, K-Nearest Neighbors, Decision Tree, Bagged Tree, Random Forest, and XGBoost as my initial models. Based on the initial results, I decided to perform hyperparameter tuning on the Decision Tree and XGBoost models. Before I began the hyperparameter tuning process, I ran feature importances for each model. The feature that scored the highest was Shot Distance, which became helpful later on when analyzing individual players. I used GridSearch to perform hyperparameter tuning for both models. After running GridSearch on the XGBoost model, I modified the max depth to see if it would affect the training accuracy of the model. It increased the training accuracy by a little over 6 percentage points, from 62.8% to 68.95%. I decided to use XGBoost as my final model. The overall testing accuracy of the model was 59 percent, which is higher than the 50 percent expected from guessing made or missed at random. When running the feature importances for the final model, Shot Distance once again scored the highest among all features:
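For readers who want to try the tuning workflow, here is a small sketch of grid search followed by a feature-importance check. It is not the project's code: it uses scikit-learn's `DecisionTreeClassifier` rather than XGBoost, runs on synthetic data generated on the spot, and the feature names, parameter grid, and probability formula are all invented for illustration. It also prints a majority-class baseline, which can be a useful reference point alongside the 50 percent figure.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the real shot log).
rng = np.random.default_rng(0)
n = 2000
shot_dist = rng.uniform(0, 30, n)    # feet from the basket
def_dist = rng.uniform(0, 10, n)     # feet to the closest defender
shot_clock = rng.uniform(0, 24, n)   # seconds left on the shot clock

# Invented rule: shots get harder with distance, easier when open.
p_made = 0.65 - 0.012 * shot_dist + 0.01 * def_dist
made = (rng.uniform(0, 1, n) < p_made).astype(int)

X = np.column_stack([shot_dist, def_dist, shot_clock])
feature_names = ["shot_dist", "close_def_dist", "shot_clock"]

X_train, X_test, y_train, y_test = train_test_split(
    X, made, test_size=0.25, random_state=0
)

# Try every combination in the grid with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 20, 50]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

best = search.best_estimator_
train_acc = best.score(X_train, y_train)
test_acc = best.score(X_test, y_test)
# Accuracy from always predicting the more common class.
baseline_acc = max(y_test.mean(), 1 - y_test.mean())
importances = dict(zip(feature_names, best.feature_importances_))

print("best params:", search.best_params_)
print("train / test / baseline:", train_acc, test_acc, baseline_acc)
print("importances:", importances)
```

Comparing `train_acc` with `test_acc` is worth doing whenever a setting such as max depth is raised, since a deeper tree can lift training accuracy without improving results on unseen shots.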

Since Shot Distance scored the highest in all the models, I decided to explore the shot distance of individual players in this data subset. First I created a graph that displayed the percentages of made and missed shot attempts for each player:

Next I created a plot showing the general shot distances of made and missed shot attempts among the 20 players in the data subset:

As illustrated above, shot attempts overall came predominantly from within 0–5 feet and from about 25 feet from the basket, with made baskets coming primarily from 0–5 feet. From there, I analyzed the shot distance of the 6 players with the highest percentage of made shot attempts, using a bar plot for each player. The trends in the overall made and missed shot attempts were also present in the separate bar plots of the 6 players: high numbers of shot attempts from 0–5 feet and from about 25 feet, with few shot attempts in between. As someone who regularly follows basketball, I found these findings unsurprising, considering that the 3-point line is approximately 24 feet from the basket (23 feet 9 inches at the top of the arc, 22 feet in the corners) and NBA players are attempting more 3-point shots now than ever before. In addition, players who have high field goal percentages usually attempt their shots close to the basket.
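The distance breakdown described above can be approximated by grouping shots into 5-foot bins. The sketch below uses eight made-up shots, and the column names `SHOT_DIST` and `made` as well as the bin edges are assumptions; it only shows the binning idea, not the plots from the project.

```python
import pandas as pd

# Made-up shots; "made" is 1 for a made shot and 0 for a miss.
shots = pd.DataFrame({
    "SHOT_DIST": [1.2, 3.8, 4.5, 12.0, 17.5, 24.3, 25.0, 26.1],
    "made": [1, 1, 0, 0, 1, 0, 1, 0],
})

# 5-foot bins; each bin includes its right edge, so 25.0 lands in "20-25".
bins = [0, 5, 10, 15, 20, 25, 30]
labels = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30"]
shots["dist_bin"] = pd.cut(
    shots["SHOT_DIST"], bins=bins, labels=labels, include_lowest=True
)

# Attempts and makes per bin; observed=False keeps empty bins in the table.
summary = shots.groupby("dist_bin", observed=False)["made"].agg(
    attempts="count", made="sum"
)
# Field goal percentage per bin (NaN where a bin has no attempts).
summary["fg_pct"] = summary["made"] / summary["attempts"]

print(summary)
```

Calling `summary[["attempts", "made"]].plot(kind="bar")` on a table like this would give a bar plot of attempts and makes by distance, assuming matplotlib is installed.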

Overall I had a lot of fun working on this project. Because I was familiar with the origins of the dataset, I was better able to interpret the results. The one piece of data I wished this dataset had was the court coordinates of each shot attempt. With those coordinates, I could have created a shot chart showing where on the basketball court the shot attempts came from, which could reveal other facets of made and missed shot attempts, such as whether it’s better to shoot from the right or left side of the court, or where along the 3-point line 3-point field goals are successful. Nevertheless, I gained a lot of confidence creating models using this dataset, and I’m excited about the final two projects coming up.
