## Adding 3 & 5 Together: ##
3 + 5
[1] 8
You may be asking yourself, out of all of the possible visualization softwares which exist, why should I spend time learning and using R?
Great question!
R is a useful tool and worthwhile to learn for several reasons:
R is a command-line, object-oriented programming language commonly used for data analysis and statistics.
Command-line means that we have to give R commands in order for us to get it to do something. So for instance, if we wanted to know what the sum of 3 and 5 were, we can use R to solve this problem for us:
## Adding 3 & 5 Together: ##
3 + 5
[1] 8
Object-oriented means that we can save individual pieces of output as some name that we can use later. This is a super handy feature, especially when you have complicated scripts!
## Saving the result of 3 + 5 as a ##
<- 3 + 5
a a
[1] 8
What can R do? Well, for the purpose of data analytics, I am yet to find a limit of what it can do!
In this class, we will be using R as a tool for visualizing categorical and quantitative data through various means.
In order to visualize data in R, we need to be able to import data into R. There are a variety of ways of importing data into R, but they largely depend on the type of datafile that you are importing (e.g., Excel file, CSV file, text file, etc.). While there are lots of different files which can be imported into R (Google and ChatGPT are excellent resources for searching for code for how to do something), we’ll mostly use two types in this class: Excel and CSV
Let’s try importing the HEART.csv
file into R. This file is part of the famous Framingham Heart Study.
Since we set up our first class session by connecting to GitHub, we should have this file in our project folder already.
To read in this CSV file, we will use the read_csv
function which is part of the readr
package. Every function in R requires arguments specified inside of parentheses. We can think of packages like toolboxes within a mechanic’s workshop. Each toolbox contains different tools. A tool is like a function; we use specific tools (functions) to solve specific problems!
read_csv
is a tool we use to solve the problem of reading CSV files into R.read_csv
tools is stored within the readr
toolbox (package).While many packages come installed in RStudio automatically, there are far, far more which we have to install from the web, including readr
.
To install a package, we use the install.packages
function:
## Installing the readr package ##
install.packages('readr')
We can also think of functions like mathematical functions; we have to supply the function with special inputs called arguments in order to get the desired output. For instance, in the install.packages
function, we had to specify to the function which package we wanted to install.
How do we know what arguments to specify for a given function? There are lots of different ways, but one way we can do so is by using the ?
operator.
## What are the arguments of read_csv? ##
::read_csv ?readr
As we can see, there are a lot of arguments we can specify. However, we don’t need to specify most of them in this case. All of the arguments which have an =
after them, like “col_names = TRUE
”, will retain that specific argument unless you explicitly change it. I refer to these as optional arguments, those arguments whose value doesn’t necessarily need to be changed in order to get the function to execute.
To note, col_names = TRUE
means that the columns of the CSV file have names. If they don’t, then we would change it to, col_names = FALSE
and the column names will have generic names, (V1, V2, ... , VN
).
Conversely, any argument which does not have an already specified value is a required argument, or one the user (that’s you!) must fill in for the function to execute. In the case of read_csv
, that required argument is the file path!
library(readr)
<- read_csv("Introduction to R and RStudio/HEART.csv") heart
What about an Excel file? How do we import those? What if we want to read in the esoph.xlsx
file? In this case, we can use a function called read_xlsx
which is part of the package readxl
. So just as before, we can run:
install.packages('readxl')
library(readxl)
<- read_xlsx("Introduction to R and RStudio/esoph.xlsx") esoph
So far we have imported some datasets into the RStudio environment…how do we know they imported correctly? The best way that I have found which uses a combination of some traditional functions is a function called glimpse
which is part of an incredibly useful data wrangling package called dplyr
:
install.packages('dplyr')
library(dplyr)
|>
heart glimpse()
Rows: 5,209
Columns: 17
$ Status <chr> "Dead", "Dead", "Alive", "Alive", "Alive", "Alive", "Al…
$ DeathCause <chr> "Other", "Cancer", NA, NA, NA, NA, NA, "Other", NA, "Ce…
$ AgeCHDdiag <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57, 55, 79,…
$ Sex <chr> "Female", "Female", "Female", "Female", "Male", "Female…
$ AgeAtStart <dbl> 29, 41, 57, 39, 42, 58, 36, 53, 35, 52, 39, 33, 33, 57,…
$ Height <dbl> 62.50, 59.75, 62.25, 65.75, 66.00, 61.75, 64.75, 65.50,…
$ Weight <dbl> 140, 194, 132, 158, 156, 131, 136, 130, 194, 129, 179, …
$ Diastolic <dbl> 78, 92, 90, 80, 76, 92, 80, 80, 68, 78, 76, 68, 90, 76,…
$ Systolic <dbl> 124, 144, 170, 128, 110, 176, 112, 114, 132, 124, 128, …
$ MRW <dbl> 121, 183, 114, 123, 116, 117, 110, 99, 124, 106, 133, 1…
$ Smoking <dbl> 0, 0, 10, 0, 20, 0, 15, 0, 0, 5, 30, 0, 0, 15, 30, 10, …
$ AgeAtDeath <dbl> 55, 57, NA, NA, NA, NA, NA, 77, NA, 82, NA, NA, NA, NA,…
$ Cholesterol <dbl> NA, 181, 250, 242, 281, 196, 196, 276, 211, 284, 225, 2…
$ Chol_Status <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…
As we visually inspect the first few rows of the HEART dataframe, we can see that the “Sex” variable appears to be categorical whereas the “Weight” variable appears to be quantitative.
You may be asking yourself, “why does this matter?” It’s important for two primary reasons:
readr::read_csv
, we didn’t have to tell R what types of variables each column was; it by default scans each column and makes a best guess as to what type of variable the column contains.
glimpse
!|>
heart glimpse()
Rows: 5,209
Columns: 17
$ Status <chr> "Dead", "Dead", "Alive", "Alive", "Alive", "Alive", "Al…
$ DeathCause <chr> "Other", "Cancer", NA, NA, NA, NA, NA, "Other", NA, "Ce…
$ AgeCHDdiag <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57, 55, 79,…
$ Sex <chr> "Female", "Female", "Female", "Female", "Male", "Female…
$ AgeAtStart <dbl> 29, 41, 57, 39, 42, 58, 36, 53, 35, 52, 39, 33, 33, 57,…
$ Height <dbl> 62.50, 59.75, 62.25, 65.75, 66.00, 61.75, 64.75, 65.50,…
$ Weight <dbl> 140, 194, 132, 158, 156, 131, 136, 130, 194, 129, 179, …
$ Diastolic <dbl> 78, 92, 90, 80, 76, 92, 80, 80, 68, 78, 76, 68, 90, 76,…
$ Systolic <dbl> 124, 144, 170, 128, 110, 176, 112, 114, 132, 124, 128, …
$ MRW <dbl> 121, 183, 114, 123, 116, 117, 110, 99, 124, 106, 133, 1…
$ Smoking <dbl> 0, 0, 10, 0, 20, 0, 15, 0, 0, 5, 30, 0, 0, 15, 30, 10, …
$ AgeAtDeath <dbl> 55, 57, NA, NA, NA, NA, NA, 77, NA, 82, NA, NA, NA, NA,…
$ Cholesterol <dbl> NA, 181, 250, 242, 281, 196, 196, 276, 211, 284, 225, 2…
$ Chol_Status <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…
Sex, for example, we can imagine is a categorical variable as its values, male and female, are characteristics and not numbers.
We can tell R read it in as a categorical variable because it is coded as character (notice the <chr>
to the right of the Sex variable name).
Height, on the other hand, we can imagine is a quantitative variable as its values are numbers!
<dbl>
to the right of the Height variable name).Let’s say I wanted to find the average or mean Age at Death from the heart
dataframe. How would I go about doing that?
First, I need to know how to isolate that single variable by itself. To do this, we make use of the dollar-sign operator after the name of our dataframe.
You can think of the dollar-sign operator like a door to your home. The name of the dataframe is the house itself, the dollar-sign is the door, and the variable name is the person we want to talk to inside of the house.
So the structure is: House$Person
If we enter the following command into our console, we can see that we have returned to us the observations contained withing the Age at Death column:
$AgeAtDeath heart
[1] 55 57 NA NA NA NA NA 77 NA 82
So now that we know how to isolate a single variable within a dataframe, let’s try finding the mean of this quantitative variable using the below code:
mean(heart$AgeAtDeath)
[1] NA
What is returned to us in NA
meaning “Not-Applicable.” How does this make any sense? Sometimes, when I get output that is different than what I’m expecting, I check the documentation to see if there are any arguments within the given function that might need changed to have the function return output I expect. Remember, we can do that with the ?
operator:
?mean
What we find in the documentation is the structure of the function, just like we saw for read_csv
. Here, the structure of mean
is:
mean(x, trim = 0, na.rm = FALSE, ...)
Below, we can see the descriptions of each argument. When we examine na.rm
it states: “a logical evaluating to TRUE
or FALSE
indicating whether NA
values should be stripped before the computation proceeds.”
So in other words, if a numeric column contains missing values and we try to calculate the mean of said column, when the mean function approaches a missing value, it simply doesn’t know what to do with it. This makes sense as a missing value could be anything (0, 1000, \(\pi\), or anything else!). So instead of trying to guess what the value is, it just returns NA
. To have it just omit the missing values to have the function return to us the sample mean, we need to run the code:
mean(heart$AgeAtDeath,na.rm=T)
[1] 70.53641
Now we get a result from the function that makes more sense!
Let’s say I have a large dataframe with lots of columns, as you might see in your own fields of study. But, for whatever analysis I’m wanting to do, I don’t need all of the columns, just a few. In such a case, it might be useful to subset the dataframe and select only the columns we need.
How do we go about doing this? Like many things in R, there are a few different ways to yield the same result, but I’m going to show you what I consider the most straightforward method, which uses the dplyr
package.
Using the heart
dataframe, suppose I want to create a new dataframe which only contains the last four columns: Chol_Status, BP_Status, Weight_Status, and Smoking_Status. To do this, we will use the select
function from within the dplyr
package.
<- heart |>
heart_status select(Chol_Status,BP_Status,
Weight_Status,Smoking_Status)
|>
heart_status glimpse()
Rows: 5,209
Columns: 4
$ Chol_Status <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…
Okay, great! But what if we instead wanted to subset the heart
dataframe by characteristics of the rows, instead? For example, let’s say in the new heart_status
dataframe we just created, we want to create a new dataframe where we only have those participants whose Weight_Status
is “Overweight.” To do this, we can now instead use the filter
function which is also contained within the powerful and useful dplyr
package:
<- heart_status |>
heart_status_ow filter(Weight_Status == 'Overweight')
If you’re familiar with SQL, dplyr
operates in much the same manner. So most anything we can do from a data wrangling perspective with SQL, we can also with dplyr
.