1  Introduction to R Programming

1.1 Why Use R?

You may be asking yourself, out of all of the possible visualization softwares which exist, why should I spend time learning and using R?

Great question!

R is a useful tool and worthwhile to learn for several reasons:

  1. It’s free!
  2. Because it’s open source, thousands of people have contributed packages and functions at a pace that proprietary softwares can’t compete with
  3. It is very flexible and robust meaning there’s a lot you can do with it (including creating this very website!)
  4. It is becoming widely used across many industries
  5. We can create visualizations and perform quantitative analyses in the same system.

1.2 So What is R?

R is a command-line, object-oriented programming language commonly used for data analysis and statistics.

Command-line means that we have to give R commands in order for us to get it to do something. So for instance, if we wanted to know what the sum of 3 and 5 were, we can use R to solve this problem for us:

## Adding 3 & 5 Together: ##
3 + 5
[1] 8

Object-oriented means that we can save individual pieces of output as some name that we can use later. This is a super handy feature, especially when you have complicated scripts!

## Saving the result of 3 + 5 as a ##
a <- 3 + 5
a
[1] 8

1.3 What Can R Do?

What can R do? Well, for the purpose of data analytics, I am yet to find a limit of what it can do!

In this class, we will be using R as a tool for visualizing categorical and quantitative data through various means.

1.4 Functions in R

In order to visualize data in R, we need to be able to import data into R. There are a variety of ways of importing data into R, but they largely depend on the type of datafile that you are importing (e.g., Excel file, CSV file, text file, etc.). While there are lots of different files which can be imported into R (Google and ChatGPT are excellent resources for searching for code for how to do something), we’ll mostly use two types in this class: Excel and CSV

Let’s try importing the HEART.csv file into R. This file is part of the famous Framingham Heart Study.

Since we set up our first class session by connecting to GitHub, we should have this file in our project folder already.

To read in this CSV file, we will use the read_csv function which is part of the readr package. Every function in R requires arguments specified inside of parentheses. We can think of packages like toolboxes within a mechanic’s workshop. Each toolbox contains different tools. A tool is like a function; we use specific tools (functions) to solve specific problems!

  • read_csv is a tool we use to solve the problem of reading CSV files into R.
  • The read_csv tools is stored within the readr toolbox (package).

1.5 Importing Data into R

While many packages come installed in RStudio automatically, there are far, far more which we have to install from the web, including readr.

To install a package, we use the install.packages function:

## Installing the readr package ##
install.packages('readr')

We can also think of functions like mathematical functions; we have to supply the function with special inputs called arguments in order to get the desired output. For instance, in the install.packages function, we had to specify to the function which package we wanted to install.

How do we know what arguments to specify for a given function? There are lots of different ways, but one way we can do so is by using the ? operator.

## What are the arguments of read_csv? ##
?readr::read_csv

As we can see, there are a lot of arguments we can specify. However, we don’t need to specify most of them in this case. All of the arguments which have an = after them, like “col_names = TRUE”, will retain that specific argument unless you explicitly change it. I refer to these as optional arguments, those arguments whose value doesn’t necessarily need to be changed in order to get the function to execute.

To note, col_names = TRUE means that the columns of the CSV file have names. If they don’t, then we would change it to, col_names = FALSE and the column names will have generic names, (V1, V2, ... , VN).

Conversely, any argument which does not have an already specified value is a required argument, or one the user (that’s you!) must fill in for the function to execute. In the case of read_csv, that required argument is the file path!

library(readr)
heart <- read_csv("Introduction to R and RStudio/HEART.csv")

What about an Excel file? How do we import those? What if we want to read in the esoph.xlsx file? In this case, we can use a function called read_xlsx which is part of the package readxl. So just as before, we can run:

install.packages('readxl')
library(readxl)
esoph <- read_xlsx("Introduction to R and RStudio/esoph.xlsx")

1.6 Exploring Dataframes using R

So far we have imported some datasets into the RStudio environment…how do we know they imported correctly? The best way that I have found which uses a combination of some traditional functions is a function called glimpse which is part of an incredibly useful data wrangling package called dplyr:

install.packages('dplyr')
library(dplyr)
heart |>
  glimpse()
Rows: 5,209
Columns: 17
$ Status         <chr> "Dead", "Dead", "Alive", "Alive", "Alive", "Alive", "Al…
$ DeathCause     <chr> "Other", "Cancer", NA, NA, NA, NA, NA, "Other", NA, "Ce…
$ AgeCHDdiag     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57, 55, 79,…
$ Sex            <chr> "Female", "Female", "Female", "Female", "Male", "Female…
$ AgeAtStart     <dbl> 29, 41, 57, 39, 42, 58, 36, 53, 35, 52, 39, 33, 33, 57,…
$ Height         <dbl> 62.50, 59.75, 62.25, 65.75, 66.00, 61.75, 64.75, 65.50,…
$ Weight         <dbl> 140, 194, 132, 158, 156, 131, 136, 130, 194, 129, 179, …
$ Diastolic      <dbl> 78, 92, 90, 80, 76, 92, 80, 80, 68, 78, 76, 68, 90, 76,…
$ Systolic       <dbl> 124, 144, 170, 128, 110, 176, 112, 114, 132, 124, 128, …
$ MRW            <dbl> 121, 183, 114, 123, 116, 117, 110, 99, 124, 106, 133, 1…
$ Smoking        <dbl> 0, 0, 10, 0, 20, 0, 15, 0, 0, 5, 30, 0, 0, 15, 30, 10, …
$ AgeAtDeath     <dbl> 55, 57, NA, NA, NA, NA, NA, 77, NA, 82, NA, NA, NA, NA,…
$ Cholesterol    <dbl> NA, 181, 250, 242, 281, 196, 196, 276, 211, 284, 225, 2…
$ Chol_Status    <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status      <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status  <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…

As we visually inspect the first few rows of the HEART dataframe, we can see that the “Sex” variable appears to be categorical whereas the “Weight” variable appears to be quantitative.

  • A quantitative variable is something which can be measured with a number, like dollars, time, height, weight, blood pressure, etc. R refers to these as “numeric” variables.
  • A categorical variable is just the opposite. It is something which cannot be quantified and is more of a quality. These are things like sex, country of origin, hair color, cause of death etc. R refers to these as “character” variables.

You may be asking yourself, “why does this matter?” It’s important for two primary reasons:

  1. The type of variable we are working with dictates to us which visualization methods would be most appropriate.
  2. In terms of R programming, we can look at the variable “Sex” and “Height” in the heart dataframe and conclude that these are categorical and quantitative variables, respectively. But when we read the heart dataframe into R using readr::read_csv, we didn’t have to tell R what types of variables each column was; it by default scans each column and makes a best guess as to what type of variable the column contains.
    • So how can we know that R properly recognized the variables in the heart dataframe?
    • Let’s take a second glimpse!
heart |>
  glimpse()
Rows: 5,209
Columns: 17
$ Status         <chr> "Dead", "Dead", "Alive", "Alive", "Alive", "Alive", "Al…
$ DeathCause     <chr> "Other", "Cancer", NA, NA, NA, NA, NA, "Other", NA, "Ce…
$ AgeCHDdiag     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57, 55, 79,…
$ Sex            <chr> "Female", "Female", "Female", "Female", "Male", "Female…
$ AgeAtStart     <dbl> 29, 41, 57, 39, 42, 58, 36, 53, 35, 52, 39, 33, 33, 57,…
$ Height         <dbl> 62.50, 59.75, 62.25, 65.75, 66.00, 61.75, 64.75, 65.50,…
$ Weight         <dbl> 140, 194, 132, 158, 156, 131, 136, 130, 194, 129, 179, …
$ Diastolic      <dbl> 78, 92, 90, 80, 76, 92, 80, 80, 68, 78, 76, 68, 90, 76,…
$ Systolic       <dbl> 124, 144, 170, 128, 110, 176, 112, 114, 132, 124, 128, …
$ MRW            <dbl> 121, 183, 114, 123, 116, 117, 110, 99, 124, 106, 133, 1…
$ Smoking        <dbl> 0, 0, 10, 0, 20, 0, 15, 0, 0, 5, 30, 0, 0, 15, 30, 10, …
$ AgeAtDeath     <dbl> 55, 57, NA, NA, NA, NA, NA, 77, NA, 82, NA, NA, NA, NA,…
$ Cholesterol    <dbl> NA, 181, 250, 242, 281, 196, 196, 276, 211, 284, 225, 2…
$ Chol_Status    <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status      <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status  <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…

Sex, for example, we can imagine is a categorical variable as its values, male and female, are characteristics and not numbers.

  • We can tell R read it in as a categorical variable because it is coded as character (notice the <chr> to the right of the Sex variable name).

  • Height, on the other hand, we can imagine is a quantitative variable as its values are numbers!

    • We can tell R read it in as a quantitative variable because it is coded as double (notice the <dbl> to the right of the Height variable name).

1.7 Examining Subsets of Dataframes

1.7.1 Single Variable Analysis

Let’s say I wanted to find the average or mean Age at Death from the heart dataframe. How would I go about doing that?

First, I need to know how to isolate that single variable by itself. To do this, we make use of the dollar-sign operator after the name of our dataframe.

You can think of the dollar-sign operator like a door to your home. The name of the dataframe is the house itself, the dollar-sign is the door, and the variable name is the person we want to talk to inside of the house.

So the structure is: House$Person

If we enter the following command into our console, we can see that we have returned to us the observations contained withing the Age at Death column:

heart$AgeAtDeath
 [1] 55 57 NA NA NA NA NA 77 NA 82

So now that we know how to isolate a single variable within a dataframe, let’s try finding the mean of this quantitative variable using the below code:

mean(heart$AgeAtDeath)
[1] NA

What is returned to us in NA meaning “Not-Applicable.” How does this make any sense? Sometimes, when I get output that is different than what I’m expecting, I check the documentation to see if there are any arguments within the given function that might need changed to have the function return output I expect. Remember, we can do that with the ? operator:

?mean

What we find in the documentation is the structure of the function, just like we saw for read_csv. Here, the structure of mean is:

mean(x, trim = 0, na.rm = FALSE, ...)

Below, we can see the descriptions of each argument. When we examine na.rm it states: “a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.”

So in other words, if a numeric column contains missing values and we try to calculate the mean of said column, when the mean function approaches a missing value, it simply doesn’t know what to do with it. This makes sense as a missing value could be anything (0, 1000, \(\pi\), or anything else!). So instead of trying to guess what the value is, it just returns NA. To have it just omit the missing values to have the function return to us the sample mean, we need to run the code:

mean(heart$AgeAtDeath,na.rm=T)
[1] 70.53641

Now we get a result from the function that makes more sense!

1.7.2 Selecting Subsets of Columns

Let’s say I have a large dataframe with lots of columns, as you might see in your own fields of study. But, for whatever analysis I’m wanting to do, I don’t need all of the columns, just a few. In such a case, it might be useful to subset the dataframe and select only the columns we need.

How do we go about doing this? Like many things in R, there are a few different ways to yield the same result, but I’m going to show you what I consider the most straightforward method, which uses the dplyr package.

Using the heart dataframe, suppose I want to create a new dataframe which only contains the last four columns: Chol_Status, BP_Status, Weight_Status, and Smoking_Status. To do this, we will use the select function from within the dplyr package.

heart_status <- heart |>
  select(Chol_Status,BP_Status,
         Weight_Status,Smoking_Status)

heart_status |>
  glimpse()
Rows: 5,209
Columns: 4
$ Chol_Status    <chr> NA, "Desirable", "High", "High", "High", "Desirable", "…
$ BP_Status      <chr> "Normal", "High", "High", "Normal", "Optimal", "High", …
$ Weight_Status  <chr> "Overweight", "Overweight", "Overweight", "Overweight",…
$ Smoking_Status <chr> "Non-smoker", "Non-smoker", "Moderate (6-15)", "Non-smo…

Okay, great! But what if we instead wanted to subset the heart dataframe by characteristics of the rows, instead? For example, let’s say in the new heart_status dataframe we just created, we want to create a new dataframe where we only have those participants whose Weight_Status is “Overweight.” To do this, we can now instead use the filter function which is also contained within the powerful and useful dplyr package:

heart_status_ow <- heart_status |>
  filter(Weight_Status == 'Overweight')

If you’re familiar with SQL, dplyr operates in much the same manner. So most anything we can do from a data wrangling perspective with SQL, we can also with dplyr.