My Experience Using Tidyverse in R
As someone who has primarily used Python, I thought R wouldn’t be that different. Some of the syntax and libraries would be unique, but I expected that using loops and functions to build machine learning models, visualizations, and other outputs would be much the same as in Python. While I had some familiarity with R, I had never built a project with it. Recently, I was assigned a project as Part 2 of a hiring process, using NBA team statistics from the 2005–06 through 2021–22 seasons. The assignment consisted of 14 questions focusing on coding, visualization, modeling, and communication. Unlike Part 1, which could be completed in any programming language, Part 2 had to be done in R, with the recommendation to use Tidyverse instead of base R and explicit loops.
Besides a brief introductory tutorial, I didn’t know much about Tidyverse. The visualizations it produced looked good, but the Matplotlib and Seaborn libraries in Python produce similar results. While my limited knowledge presented a challenge, I was excited to explore this new set of tools and learn as much as I could while preparing my responses. I relied on YouTube channels such as R Programming 101 and online sources such as StackOverflow and Tidyverse.org to understand the packages in Tidyverse and the proper syntax for the project questions. Those packages provide functions for tasks such as reading data from CSV (comma-separated values) files, manipulating data, creating or deleting columns, and creating visualizations. Together they make up Tidyverse and offer a fascinating way to write code that chains functions together instead of relying on explicit loops.
While exploring the dataset and thinking about how to answer the questions, I began to understand how these packages operate. The core packages of Tidyverse include tibble, dplyr, tidyr, ggplot2, readr, and purrr. While I didn’t use all of these packages for this project, I’ll discuss the functions I did use and how I applied them to the required tasks.
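Pipe Operator (%>%): This operator helped streamline all of the code chunks I wrote. It passes the result on its left as the first argument to the function on its right, so a sequence of steps reads top to bottom instead of as nested function calls. It does not replace for loops, but it made my code cleaner. The operator comes from the magrittr package and is made available when you load dplyr, and it is vital for chaining functions together and organizing code chunks. The magrittr documentation on Tidyverse.org explains it in more detail.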
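To make the pipe concrete, here is a minimal sketch. The `teams` data frame and its columns are made up for illustration and are not the project’s dataset. It compares a nested call with the piped version of the same two steps.

```r
library(dplyr)

# Hypothetical toy data; column names and values are assumptions.
teams <- tibble(
  team = c("A", "B", "C", "D"),
  season = c(2020, 2020, 2021, 2021),
  wins = c(50, 30, 45, 20)
)

# Without the pipe, nested calls read from the inside out.
nested <- arrange(filter(teams, wins >= 30), desc(wins))

# With the pipe, the same steps read top to bottom.
# Each result becomes the first argument of the next function.
piped <- teams %>%
  filter(wins >= 30) %>%
  arrange(desc(wins))

# Both approaches return the same tibble.
print(identical(nested, piped))
```

The piped version does not do anything the nested version cannot; it only changes the reading order, which is what made longer chains easier for me to follow.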
dplyr:
mutate(): Used to create, modify, and/or delete columns. One of the questions required creating a column that tracked the number of times a team missed the playoffs. The mutate() reference page in the dplyr documentation has more details and examples.
filter(): Used to subset a data frame, retaining all the rows that meet specific conditions. For one of the questions, I used filter() to get all the seasons starting from 2016, and all the seasons in which a team missed the playoffs. The filter() reference page has more examples.
select(): Used to choose one or more variables (columns) to keep. While I didn’t use this function much for this project, the select() reference page has a lot of great examples.
summarise(): Collapses a data frame into a summary, returning a new data frame with one row per group. Grouping the data with group_by() first is recommended; otherwise the output contains a single row summarizing all of the observations in the input. More information can be found on the summarise() reference page.
rowwise(): Groups a data frame by row so that later operations are computed one row at a time. This came in handy when I had to count the number of times each team missed the playoffs and then reset the count when the team made the playoffs. The rowwise() reference page covers the function in more detail.
arrange(): Equivalent to .sort_values() in pandas; it sorts the rows of a data frame in ascending or descending order based on the values of selected columns. More information can be found on the arrange() reference page.
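The sketch below shows how these verbs can fit together on a small, made-up table of team seasons. All column names and values are assumptions, not the project’s data. For the missed-playoffs count that resets, it uses a grouped cumsum() rather than rowwise(), so treat it as one possible approach and not the one I used in the project.

```r
library(dplyr)

# Hypothetical team-season data; names and values are made up.
seasons <- tibble(
  team = rep(c("A", "B"), each = 4),
  season = rep(2015:2018, times = 2),
  wins = c(30, 28, 45, 50, 50, 25, 22, 30),
  made_playoffs = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# mutate(): add a win percentage column (an 82-game season is assumed).
seasons <- seasons %>%
  mutate(win_pct = wins / 82)

# filter() + select(): seasons from 2016 on in which the team missed.
missed_recent <- seasons %>%
  filter(season >= 2016, !made_playoffs) %>%
  select(team, season, wins)

# group_by() + summarise() + arrange(): one row per team, most misses first.
missed_by_team <- seasons %>%
  group_by(team) %>%
  summarise(missed = sum(!made_playoffs), avg_wins = mean(wins)) %>%
  arrange(desc(missed))

# Running count of consecutive missed seasons that resets to 0 whenever
# the team makes the playoffs. "spell" is a helper column that increases
# by one at each playoff appearance, so each spell is counted separately.
streaks <- seasons %>%
  arrange(team, season) %>%
  group_by(team) %>%
  mutate(spell = cumsum(made_playoffs)) %>%
  group_by(team, spell) %>%
  mutate(miss_streak = cumsum(!made_playoffs)) %>%
  ungroup() %>%
  select(-spell)

print(missed_by_team)
print(streaks)
```

Without the group_by() call, summarise() would return a single row for the whole table, which is the behavior described in the summarise() entry above.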
ggplot2:
This package is used for data visualization. With Tidyverse, manipulated data can be passed straight into ggplot() using the pipe operator. Inside a ggplot call, layers and other components are added with the + symbol rather than the pipe. A plot needs three main components: the data frame, the variables to plot (such as x and y), and the type of visualization you want (points, bars, lines, etc.). I’ll go into detail about a few functions that can be used when creating visualizations with ggplot2.
aes(): Maps variables to visual properties of the plot, most often x and y. If there’s a third variable you want to include and it’s categorical, you can map it to color. I created a visualization using age, team, and made/missed playoffs, mapping made/missed playoffs to color. More examples can be found on the aes() reference page in the ggplot2 documentation.
geom_*(): There are several types of layers to choose from in the geom_ family of functions, such as geom_point(), geom_bar(), geom_boxplot(), and geom_dotplot(), to represent the data points of the variables. The ggplot2 cheatsheet goes into detail on the different geoms.
facet_grid(): Splits a plot into multiple panels based on the values of one or more variables. This function was useful when I created two panels, one showing the average age of teams that made the playoffs and the other showing the average age of teams that missed the playoffs. The facet_grid() reference page has more information.
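As a rough illustration of how these pieces combine, the sketch below builds a two-panel plot similar in spirit to the one described above. The `team_ages` data and its column names are invented for the example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data; not the project's dataset.
team_ages <- tibble(
  team = rep(c("A", "B", "C", "D"), times = 2),
  season = rep(c(2020, 2021), each = 4),
  avg_age = c(25.1, 27.3, 26.0, 28.4, 25.8, 27.9, 26.5, 27.2),
  playoffs = rep(c("Missed", "Made", "Missed", "Made"), times = 2)
)

p <- team_ages %>%                     # the pipe hands the data to ggplot()
  ggplot(aes(x = team, y = avg_age, color = playoffs)) +
  geom_point(size = 3) +               # one point per team-season
  facet_grid(. ~ playoffs) +           # one panel per playoff outcome
  labs(x = "Team", y = "Average age", color = "Playoffs")

print(p)
```

Note the switch from %>% to + once ggplot() is called: the pipe passes data between functions, while + adds layers to a plot.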
In terms of building machine learning models with R, there’s more exploration that I need to do. I created a logistic regression model predicting next season’s winning percentage using two variables of my choice. As with building models in Python, I split the data into train and test sets. In R, the model is fit with the glm() function (generalized linear model), which takes a formula relating the target variable to one or more predictors, the data to use (train or test), and a family, which is set to binomial for logistic regression. Once the model was created, I used it to predict the target variable. This is different from Python, where the usual order of operations is to instantiate the algorithm of choice, fit, and predict. Python also has plotting tools for confusion matrices that I did not find an equivalent for in R during this project; these help when analyzing and comparing the performance of different machine learning models.
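The workflow can be sketched with base R alone. Everything in this example is simulated: the predictors `net_rating` and `avg_age` and the binary outcome `made_playoffs` are assumptions chosen for illustration, not the variables or target I used in the project.

```r
set.seed(42)

# Simulated data; every column name here is hypothetical.
n <- 200
teams <- data.frame(
  net_rating = rnorm(n),
  avg_age = rnorm(n, mean = 26, sd = 1.5)
)
# Simulate a binary outcome that depends on net_rating.
teams$made_playoffs <- rbinom(n, size = 1, prob = plogis(1.5 * teams$net_rating))

# Split into train (80%) and test (20%) sets.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- teams[train_idx, ]
test <- teams[-train_idx, ]

# Fit the logistic regression: formula, data, and family in one call.
model <- glm(made_playoffs ~ net_rating + avg_age,
             data = train, family = binomial)

# Predict probabilities on the test set, then threshold at 0.5.
test$prob <- predict(model, newdata = test, type = "response")
test$pred <- as.integer(test$prob > 0.5)

# A plain-text confusion matrix and the overall accuracy.
conf_mat <- table(actual = test$made_playoffs, predicted = test$pred)
accuracy <- mean(test$pred == test$made_playoffs)

print(conf_mat)
print(accuracy)
```

Here table() gives a text confusion matrix rather than a plot, and the 0.5 cutoff is an arbitrary choice for the sketch.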
Overall, using Tidyverse was a positive experience. R is widely regarded as a language for statistical modeling, and Tidyverse enhances those capabilities. There was also a simplicity to using it that was different than Python. When reading error messages from code that failed, for example, I found less jargon in R than in Python, which made it easier to identify the errors. If there was a simple syntax error, such as a missing or unexpected parenthesis, an X mark would appear beside the code before I ran it, which I felt saved time. There’s much more to explore with Tidyverse, and I encourage people to check it out.