Module 1: Introduction to R and RStudio

Published

August 9, 2024

The goal of this lesson is to introduce you to R and RStudio while teaching you how to perform data preparation using these tools. In this part, we will go over forming an understanding of statistics, learning about observations and variables, and being able to move around and get comfortable with R. I will also teach you how use and load data.

At a Glance

In order to succeed in this lesson, you will need to start by having both R and RStudio downloaded. Then, the only way to learn R is to use it in various ways and to practice as much as you can.
It is also important to note that this is a statistics class and that R is a statistical computing software. Because of that, we need to not only pay attention to what we are typing in, but understand why we are typing it in the ways suggested, and also how we could do it differently to get similar if not the same results.

Lesson Objectives

Be Introduced to R and R Studio.
Set up R and R Studio.
Use basic Built-in Functions in R.
Enter and load data into R.

Consider While Reading

This material is so important because it is likely the start of your R journey and will provide the groundwork for learning a modern approach to calculating statistics. R has a fairly steep learning curve, but, once you are over it, it becomes fairly easy to figure out new things. It is important to make connections to what we are doing and why we are doing it a certain way. There are rules to learning R, and you will get better with constant practice. Try to avoid just typing in code. Instead, determine why the code was typed the way it was and try to figure out all the other ways the code could also be typed to get the same answer. Practice, practice, practice!

What is Statistics?

Statistics is the methodology of extracting useful information from a data set.
Numerical results are not very useful unless they are accompanied with clearly stated actionable business insights.
To do good statistical analysis, you must do the following:
- Find the right data.
- Use the appropriate statistical tools.
- Clearly communicate the numerical information in written language.
With knowledge of statistics:
- Avoid risk of making uninformed decisions and costly mistakes.
- Differentiate between sound statistical conclusions and questionable conclusions.
Data and analytics capabilities have made a leap forward.
- Growing availability of vast amounts of data.
- Improved computational power.
- Development of sophisticated algorithms.

Two Main Branches of Statistics

Descriptive Statistics - collecting, organizing, and presenting the data.
Inferential Statistics - drawing conclusions about a population based on sample data from that population.
- A population consists of all items of interest.
- A sample is a subset of the population.
- A sample statistic is calculated from the sample data and is used to make inferences about the population parameter.
Reasons for sampling from the population:
- Too expensive to gather information on the entire population.
- Often impossible to gather information on the entire population.

Setting up R

R Script Files

Using R Script Files:
- A .R script is simply a text file containing a set of commands and comments. The script can be saved and used later to rerun the code. The script can also be documented with comments and edited again and again to suit your needs.
Using the Console
- Entering and running code at the R command line is effective and simple. However, each time you want to execute a set of commands, you must re-enter them at the command line. Nothing saves for later.
Complex commands are particularly difficult causing you to re-entering the code to fix any errors typographical or otherwise.R script files help to solve this issue.

Create a New R Script File

To save your notes from this lesson, create a .R file named module1.R and save it to your project file you made in the last class.
There are a couple of parts to this module, and we can add code for the entire module in one file so that our code is stacked nicely together.
For each new module, start a new file and save it to your project folder.

Using R

Text in R

Comments

Use comments to organize and explain your code in R scripts by including 1 or more than 1 hashtag.
Aim to write clear, self-explanatory code that minimizes the need for excessive comments.
Add helpful comments where necessary to ensure anyone, including your future self, can understand and run the code.
If something doesn’t work, avoid deleting it immediately. Instead, comment it out while troubleshooting or exploring alternatives.
Essentially, we add comments to our code to document our work and add notes to our self or to others.

# This is a comment for documentation or annotation

Add the code above to your R file and run each line using Ctl + Enter (PC) or Cmd + Return (MAC) or select all lines and click Run.
Take note that nothing prints in the console after running a comment.

A Prolog

A prolog is a set of comments at the top of a code file that provides information about what is in the file. It also names the files and resources used that facilitates identification. Including a prolog is considered coding best practice.
On your own R Script File, add your own prolog following the template as shown.
An informal prolog is below:

####################################
# Project name: Module 1
# Project purpose: To create an R script file to learn about R. 
# Code author name: [Enter Your Name]
# Date last edited: [Enter Date Here]
# Data used: NA
# Libraries used: NA
####################################

Then, as we work through our .R script and add data files or libraries to our code, we go back and edit the prolog.

String

In R, a string is a sequence of characters enclosed in quotes, used to represent text data.
R accepts single quotes or double quotes when marking a string. However, if you use a single quote to start, use a single quote to end. The same for double quotes - ensure the pairing is the same quote type.
You sometimes need to be careful with nested quotes, but generally it does not matter which you use.

"This is a string"

[1] "This is a string"

"This is also a string"

[1] "This is also a string"

Note on R Markdown

These files were formatted with RMarkdown. RMarkdown is a simple formatting syntax for authoring documents of a variety of types, including PowerPoint and html files.
On the document, RMarkdown prints the command and then follows the command with the output after 2 hashtags.
In your R Script File, you only need to type in the command and then run your code to get the same output as presented here.

Running Commands

There are a few ways to run commands via your .R file.
- You can click Ctr + Enter on each line (Cmd + Return on a Mac).
- You can select all the lines you want to run and select Ctr + Enter (Cmd + Return on a Mac).
- You can select all the lines you want to run and select the run button as shown in the Figure.

Now that I have asked you to add a couple lines of code, after this point, when R code is shown on this file, you should add it to your .R script file along with any notes you want. I won’t explicitly say - “add this code.”

R is an Interactive Calculator

An important facet of R is that it should serve as your sole calculator.
Try these commands in your .R file by typing them in and clicking Ctr + Enter on each line for a PC and Cmd + Return on a Mac computer.

3 + 4

[1] 7

3 * 4

[1] 12

3/4

[1] 0.75

3 + 4 * 100^2

[1] 40003

Take note that order of operations holds in R:
The order of operations in R, as in mathematics, determines the sequence in which expressions are evaluated. This follows the standard rules often remembered by the acronym PEMDAS: Parentheses, Exponents, Multiplication and Division, and Addition and Subtraction (from left to right). Parentheses have the highest precedence and are used to group expressions, ensuring they are evaluated first. Operators with the same level of precedence, like multiplication and division, are evaluated from left to right. When in doubt, use parentheses to make the evaluation order explicit and improve the readability of your code.
- Parentheses ()
- Exponents ^ and $**$
- Division $/$, Multiplication $*$, modulo, and integer division
- Addition + and Subtraction -
- In R, modulo (%%) and integer division (%/%) also follow the order of operations, with both being evaluated after multiplication, division, and exponentiation, but before addition and subtraction.
This means that modulo and integer division have the same priority level as multiplication and division, where modulo is just the remainder.

2 + 3 * 5 - 7^2%%4 + (5/2)

[1] 18.5

5/2  #parentheses: = 2.5

[1] 2.5

2 + 3 * 5 - 7^2%%4 + 2.5

[1] 18.5

7^2  #exponent:= 49

[1] 49

2 + 3 * 5 - 49%%4 + 2.5

[1] 18.5

3 * 5  #multiplication: = 15

[1] 15

2 + 15 - 49%%4 + 2.5

[1] 18.5

49%%4  #modulo: = 1

[1] 1

2 + 15 - 1 + 2.5

[1] 18.5

# This is compared to integer division below
49%/%4  #integer division: = 12

[1] 12

2 + 15  #addition: = 17

[1] 17

17 - 1  #subtraction: = 16

[1] 16

16 + 2.5  #addition: = 18.5

[1] 18.5

Creating Objects

Entering and Storing Variables in R requires you to make an assignment.
- We use the assignment operator ‘<-’ to assign a value or expression to a variable.
- We typically do not use the = sign in R even though it works because it also means other things in R.
Some examples are below to add to your .R file.

states <- 29
A <- "Apple"
# Equivalent statement to above - again = is less used in R.
A = "Apple"
print(A)

[1] "Apple"

# Equivalent statement to above
A

[1] "Apple"

B <- 3 + 4 * 12
B

[1] 51

Naming Objects

Line length limit: 80
Always use a consistent way of annotating code.
Camel case is capitalizing the first letter of each word in the object name, with the exception of the first word.
Dot case puts a dot between words in a variable name while camel case capitalizes each word in the variable name.
Object names appear on the left of assignment operator. We say an object receives or is assigned the value of the expression on the right.

Naming Constants: A Constant contains a single numeric value.

The recommended format for constants is starting with a “k” and then using camel case. (e.g., kStates).

Naming Functions: Functions are objects that perform a series of R commands to do something in particular.

The recommended format for Functions is to use Camel case with the first letter capitalized. (e.g., MultiplyByTwo).

Naming Variables: A Variable is a measured characteristic of some entity.

The recommended format for variables is to use either the dot case or camel case. e.g., filled.script.month or filledScriptMonth.
A valid variable name consists of letters, numbers, along with the dot or underline characters.
A variable name must start with a letter, or the dot when not followed by a number.
A variable cannot contain spaces.
Variable names are case sensitive: x is different from X just as Age is different from AGE.
The value on the right must be a number, string, an expression, or another variable.
Some Examples Using Variable Rules:

AB.1 <- "Allowed?"
# Does not follow rules - not allowed Try the statement below with no
# hashtag to see the error message .123 <- 'Allowed?'
A.A123 <- "Allowed?"
G123AB <- "Allowed?"
# Recommended format for constants
kStates <- 29

Different R coders have different preferences, but consistency is key in making sure your code is easy to follow and for others to read. In this course, we will generally use the recommendation in the text which are listed above.
We tend to use one letter variable names (i.e., x) for placeholders or for simple functions (like naming a vector).

Built-in Functions

R has thousands of built-in functions including those for summary statistics. Below, we use a few built-in functions with constant numbers. The sqrt(), max(), and min() functions compute the square root of a number, and find the maximum and minimum numbers in a vector.

sqrt(100)

[1] 10

max(100, 200, 300)

[1] 300

min(100, 200, 300)

[1] 100

We can also create variables to use within built-in functions.

Below, we create a vector x and use a few built-in functions as examples.

The sort() function sorts a vector from small to large.

x <- c(1, 2, 3, 3, 100, -10, 40)  #Creating a Vector x
sort(x)  #Sorting the Vector x from Small to Large

[1] -10   1   2   3   3  40 100

max(x)  #Finding Largest Element of Vector x

[1] 100

min(x)  #Finding Smallest Element of Vector x

[1] -10

Built-in Functions: Setting an Argument

The standard format to a built-in function is functionName(argument)
- For example, the square root function structure is listed as sqrt(x), where x is a numeric or complex vector or array.

# Here, we are setting a required argument x to a value of 100. When
# a value is set, it turns it to a parameter of the function.
sqrt(x = 100)

[1] 10

# Because there is only one argument and it is required, we can
# eliminate its name x= from our function call. This is discussed
# below.
sqrt(100)

[1] 10

There is a little variety in how we can write functions to get the same results.
A parameter is what a function can take as input. It is a placeholder and hence does not have a concrete value. An argument is a value passed during function invocation.
There are some default values set up in R in which arguments have already been set.
There are a few functions with no parameters like Sys.time() which produces the date and time. If you are not sure how many parameters a function has, you should look it up in the help.

Default Values

There are many default values set up in R in which arguments have already been set to a particular value or field.
Default values have been set when you see the = value in the instructions. If we don’t want to change it, we don’t need to include it in our function call.
When only one argument is required, the argument is usually not set to have a default value.

Built-in Functions: Using More than One Argument

For functions with more than one parameter, we must determine what arguments we want to include, and whether a default value was set and if we want to change it. Default values have been set when you see the = value in the instructions. If we don’t want to change it, we don’t need to include it in our function call.
- For example, the default S3 method for the seq() function is listed as the following: seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),length.out = NULL, along.with = NULL, …)
- Default values have been set on each parameter, but we can change some of them to get a meaningful result.
- For example, we set the from, to, and by parameter to get a sequence from 0 to 30 in increments of 5.

# We can use the following code.
seq(from = 0, to = 30, by = 5)

[1]  0  5 10 15 20 25 30

We can simplify this function call even further:
- If we use the same order of parameters as the instructions, we can eliminate the argument= from the function.
- Since we do list the values to the arguments in same order as the function is defined, we can eliminate the from=, to=, and by= to simplify the statement.
```
# Equivalent statement as above
seq(0, 30, 5)
```
```
[1]  0  5 10 15 20 25 30
```
If you leave off the by parameter, it defaults at 1.

# Leaving by= to default value of 1
seq(0, 30)

 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[26] 25 26 27 28 29 30

There can be a little hurdle deciding when you need the argument value in the function call. The general rule is that if you don’t know, include it. If it makes more sense to you to include it, include it.

Tips on Arguments

Always look up a built-in function to see the arguments you can use.
Arguments are always named when you define a function.
When you call a function, you do not have to specify the name of the argument.
Arguments have default values, which is used if you do not specify a value for that argument yourself.
An argument list comprises of comma-separated values that contain the various formal arguments.
Default arguments are specified as follows: parameter = expression

y <- 10:20
sort(y)

 [1] 10 11 12 13 14 15 16 17 18 19 20

sort(y, decreasing = FALSE)

 [1] 10 11 12 13 14 15 16 17 18 19 20

Saving

You can save your work in the file menu or the save shortcut using Ctrl + S or Cmd + S depending on your Operating System.
You will routinely be asked to save your workspace image, and you don’t need to save this unless specifically asked. It saves the output we have generated so far.
You can stop this from happening by setting the Tools > Global Options > Under Workspace changing this to Never.
Be careful with this option because it won’t save what you don’t run.

Calling a Library

In R, a package is a collection of R functions, data and compiled code. The location where the packages are stored is called the library.
Libraries need to be activated one time in each new R session.
You can access a function from a library one time using library::function()

# Use to activate library in an R session.
library(tidyverse)
library(dplyr)

You can access a function from a library one time only using library::function()
- Useful if only using one function from the library.
We will return to this in data prep.

## Below is an example that would use dplyr for one select function
## to select variable1 from the oldData and save it as a new object
## NewData. Since we don’t have datasets yet, we will revisit this.
## NewData <- dplyr::select(oldData, variable1)

Some libraries are part of other global libraries:
- dplyr is part of tidyverse, there is actually no need to activate it if tidyverse is active, however, sometimes it helps when conflicts are present
- An example of a conflict is the use of a select function which shows up in both the dplyr and MASS package. If both libraries are active, R does not know which to use.
- tidyverse has many libraries included in it.

The Environment

You can evaluate your Environment Tab to see your Variables we have defined in R Studio.
Use the following functions to view and remove defined variables in your Global Environment

ls()  #Lists all variables in Global Environment

[1] "A"       "A.A123"  "AB.1"    "B"       "G123AB"  "kStates" "states" 
[8] "x"       "y"

rm(states)  #Removes variable named states
rm(list = ls())  #Clears all variables from Global Environment

Entering Data in R

Creating a Vector

A vector is the simplest type of data structure in R.
- A vector is a set of data elements that are saved together as the same type.
- We have many ways to create vectors with some examples below.
Use c() function, which is a generic function which combines its arguments into a vector or list.

c(1, 2, 3, 4, 5)  #Print a Vector 1:5

[1] 1 2 3 4 5

If numbers are aligned, can use the “:“ symbol to include numbers and all in between. This is considered an array.

1:5  #Print a Vector 1:5

[1] 1 2 3 4 5

Use seq() function to make a vector given a sequence.

seq(from = 0, to = 30, by = 5)  #Creates a sequence vector from 0 to 30 in increments on 5

[1]  0  5 10 15 20 25 30

Use rep() function to repeat the elements of a vector.

rep(x = 1:3, times = 4)  #Repeat the elements of the vector 4 times

 [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(x = 1:3, each = 3)  #Repeat the elements of a vector 3 times each

[1] 1 1 1 2 2 2 3 3 3

Creating a Matrix

A matrix is another type of object like a vector or a list.
- A matrix has a rectangular format with rows and columns.
- A matrix uses matrix() function
- You can include the byrow = argument to tell the function whether to fill across or down first.
- You can also include the dimnames() function in addition to the matrix() to assign names to rows and columns.

Using matrix() function, we can create a matrix with 3 rows and 3 columns as shown below.

Take note how the matrix fills in the new data.

# Creating a Variable X that has 9 Values.
x <- 1:9
# Setting the matrix.
matrix(x, nrow = 3, ncol = 3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Note – we do not need to name the arguments because we go in the
# correct order.  The function below simplifies the statement and
# provides the same answer as above.
matrix(x, 3, 3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Setting More Arguments in a Matrix

The byrow argument fills the Matrix across the row
Below, we can use the byrow statement and assign it to a variable m.

m <- matrix(1:9, 3, 3, byrow = TRUE)  #Fills the Matrix Across the Row and assigns it to variable m
m  #Printing the matrix in the console

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

The dimnames() function adds labels to either the row and the column. In this case below both are added to our matrix m.

dimnames(x = m) <- list(c("2020", "2021", "2022"), c("low", "medium", "high"))
m  #Printing the matrix in the console

     low medium high
2020   1      2    3
2021   4      5    6
2022   7      8    9

You try to make a matrix of 25 items, or a 5 by 5, and fill the matrix across the row and assign the matrix to the name m2.
You should get the answer below.

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

Differences between Data Frames and Matrices

In a data frame the columns contain different types of data, but in a matrix all the elements are the same type of data. A matrix is usually numbers.
A matrix can be looked at as a vector with additional methods or dimensions, while a data frame is a list.

Creating a Data Frame

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. In a data frame the rows are observations and columns are variables.
- Data frames are generic data objects to store tabular data.
- The column names should be non-empty.
- The row names should be unique.
- The data stored in a data frame can be of numeric, factor or character type.
- Each column should contain same number of data items.
- Combing vectors into a data frame using the data.frame() function
Going back to the basics in statistics, we need to define an observation and variable so that we can know how to use them effectively in R in creating data frame objects.
- An Observation is a single row of data in a data frame that usually represents one person or other entity.
- A Variable is a measured characteristic of some entity (e.g., income, years of education, sex, height, blood pressure, smoking status, etc.).
In data frames in R, the columns are variables that contain information about the observations (rows).
Below, we can create vectors for state, year enacted, personal oz limit medical marijuana.

state <- c("Alaska", "Arizona", "Arkansas")
year.legal <- c(1998, 2010, 2016)
ounce.lim <- c(1, 2.5, 3)

Then, we can combine the 3 vectors into a data frame and name the data frame pot.legal.

pot.legal <- data.frame(state, year.legal, ounce.lim)

Next, check your global environment to confirm data frame was created.

Observations: States being measured.
Variables: Information about each state (State Name, Year Weed was legalized, and the Ounce Limit permitted).

# Shows the number of columns or variables
ncol(pot.legal)

[1] 3

# Shows the number of rows or observations
nrow(pot.legal)

[1] 3

# Shows both the number of rows (observations and columns
# (variables).
dim(pot.legal)

[1] 3 3

Working Directories

In R, when working with data files stored on your computer, it’s important to understand the difference between absolute and relative file paths. An absolute reference gives the complete path to the file, starting from the root directory. This path is specific to your system, and it doesn’t change regardless of where the R script is located. For example, on a Windows machine, you might use something like read.csv("C:/Users/username/Documents/data.csv"). This path will always point to the same file, but it can make your code less portable since it only works on your machine or if others have the exact same file structure.
On the other hand, a relative reference specifies the file’s path relative to the location of your R script or working directory. It is more flexible because it assumes the file is located in a directory relative to the current project or script. For example, if your script and data file are in the same folder, you could use read.csv("data.csv"). If the file is in a subdirectory, you would reference it relatively like read.csv("data/data.csv"). Relative paths make your code more portable and easier to share since it will work as long as the folder structure remains consistent.
Using relative paths is often a best practice, especially in collaborative projects or when sharing code. You can check your current working directory in R with getwd() and set it with setwd().

Absolute File Paths

Absolute vs. Relative Links
An absolute file path provides the complete location of a file, starting from the root directory of your computer.
- Always points to the same file.
- Independent of the script’s location.
- Example: read.csv("C:/Users/username/Documents/data.csv")
Pro
- Reliable for your system
- No Ambiguity in locating files
Cons
- Not portable; requires the same file structure across systems.
- Harder to share code with collaborators.

Relative File Paths

A relative file path specifies the file location based on the working directory of your R project or script.
- Changes based on the working directory.
- Often starts from the project folder.
- Example: read.csv("data/myfile.csv")
Pros
- More portable; works across systems if the project structure is consistent.
- Easier collaboration when sharing code and project files.
Cons
- Requires setting the working directory correctly (getwd() and setwd() can help).

Setting up a Working Directory

You should have the data files from our LMS in a data folder on your computer. Your project folder would contain that data folder.
Before importing and manipulating data, you must find and edit your working directory to directly connect to your project folder!
These functions are good to put at the top of your R files if you have many projects going at the same time.

getwd()  #Alerts you to what folder you are currently set to as your working directory
# For example, my working directory is set to the following:
# setwd('C:/Users/Desktop/ProbStat') #Allows you to reset the working
# directory to something of your choice.

In R, when using the setwd() function, notice the forward slashes instead of backslashes.
You can also go to Tools > Global Options > General and reset your default working directory when not in a project. This will pre-select your working directory when you start R.
Or if in a project, like we should be, you can click the More tab as shown in the Figure below, and set your project folder as your working directory.

Reading in Data

When importing data from outside sources, you can do the following:

You can import data from base R or an R package using data() function.
You can also link directly to a file on the web.
You can import data through from your computer through common file extensions:
- .csv: comma separated values;
- .txt: text file;
- .xls or .xlsx: Excel file;
- .sav: SPSS file;
- .sasb7dat: SAS file;
- .xpt: SAS transfer file;
- .dta: Stata file.

Each different file type requires a unique function to read in the file. With all the variety in file types, it is best to look it up in the R Community to help.

Use data() function

All we need is the data() function to read in a data set that is part of R. R has many built in libraries now, so there are many data sets we can use for testing and learning statistics in R.

# The mtcars data set is part of R, so no new package needs to be
# downloaded.
data("mtcars")

Load a data from a package

-   There are also a lot of packages that house data sets. It is fairly easy to make a package that contains data and load it into CRAN. These packages need to be installed into your R one time. Then, each time you open R, you need to reload the library using the library() function.
-   When your run the install.packages() function, do not include the \# symbol. Then, after you run it one time, comment it out. There is no need to run this code a second time unless something happens to your RStudio.

# install.packages('MASS') #only need to install package one time in
# R
library(MASS)

data("Insurance")
head(Insurance)

  District  Group   Age Holders Claims
1        1    <1l   <25     197     38
2        1    <1l 25-29     264     35
3        1    <1l 30-35     246     20
4        1    <1l   >35    1680    156
5        1 1-1.5l   <25     284     63
6        1 1-1.5l 25-29     536     84

Load data from outside sources.

After your working directory is set up, then you can read in datasets into RStudio from outside sources.
Reading in a .csv file is extremely popular way to read in data.
There are a few functions to read in .csv files. And these functions would change based on the file type you are importing.

read.csv() function

Extremely popular way to read in data.
read.csv() is a base R function that comes built-in with R: No library necessary.
All your datasets should be in a data folder in your working directory so that you and I have the same working directory. This creates a relative path to our working directory.
The structure of the function is datasetName <- read.csv(“data/dataset.csv”).

gss.2016 <- read.csv(file = "data/gss2016.csv")
# or equivalently
gss.2016 <- read.csv("data/gss2016.csv")
# Examine the contents of the file
summary(object = gss.2016)

    grass               age           
 Length:2867        Length:2867       
 Class :character   Class :character  
 Mode  :character   Mode  :character

# Or equivalently, we can shorten this to the following code
summary(gss.2016)

    grass               age           
 Length:2867        Length:2867       
 Class :character   Class :character  
 Mode  :character   Mode  :character

read_csv() function

read_csv() is a function from the readr package, which is part of the tidyverse ecosystem.
read_csv() is generally faster than read.csv() as it’s optimized for speed, making it more efficient, particularly for large datasets.
In R, both read.csv() and read_csv() are used to read in CSV files, but they come from different packages and have important differences. read.csv() is part of base R and is widely used for loading CSV files into data frames, as in data <- read.csv("data/data.csv"). It can be slower with large datasets and automatically converts strings to factors unless stringsAsFactors = FALSE.
read_csv(), from the readr package in the tidyverse, is faster and better suited for large datasets. You’d use it like data <- readr::read_csv("data/data.csv"). It doesn’t convert strings to factors by default and provides clearer error messages.read_csv() is often preferred for performance and better handling of data types, especially in larger datasets or tidyverse projects.

# install.packages(tidyverse) ## Only need to install one time on
# your computer. #install.packages links have been commented out
# during processing of RMarkdown.  Activate the library, which you
# need to access each time you open R and RStudio
library(tidyverse)

# Now open the data file to evaluate with tidyverse
gss.2016b <- read_csv(file = "data/gss2016.csv")

Accessing Variables

You can directly access a variable from a dataset using the $ symbol followed by the variable name.
The $ symbol facilitates data manipulation operations by allowing easy access to variables for calculations, transformations, or other analyses. For example:

head(Insurance$Claims)  #lists the first 6 Claims in the Insurance dataset.

[1]  38  35  20 156  63  84

sd(Insurance$Claims)  #provides the standard deviation of all Claims in the Insurance dataset.

[1] 71.1624

Summarize Data

In R, summary() and summarize() serve different purposes. summary() is part of base R and gives a quick overview of data, returning descriptive statistics for each column. For example, summary(mtcars) provides the min, max, median, and mean for numeric columns and counts for factors. It’s useful for a broad snapshot of your dataset.
In contrast, summarize() (or summarise()) is from the dplyr package and allows for custom summaries. For instance, mtcars %>% summarize(avg_mpg = mean(mpg), max_hp = max(hp)) returns the average miles per gallon and the maximum horsepower. It’s more flexible and is often used with group_by() for grouped calculations. In conclusion, summary() gives automatic overviews, while summarize() is better for tailored summaries.
Use the summary() function to examine the contents of the file for a dataset.

summary(object = Insurance)

 District    Group       Age        Holders            Claims      
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00  
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50  
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00  
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23  
                                 3rd Qu.: 327.50   3rd Qu.: 55.50  
                                 Max.   :3582.00   Max.   :400.00

Again, we can eliminate the object = because it is the first argument and is required.

summary(Insurance)

 District    Group       Age        Holders            Claims      
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00  
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50  
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00  
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23  
                                 3rd Qu.: 327.50   3rd Qu.: 55.50  
                                 Max.   :3582.00   Max.   :400.00

Explicit Use of Libraries

You can activate a library one time using library::function() format
For example, we can use the summarize() function from dplyr which is part of tidyverse installed earlier.
Since dplyr is part of tidyverse, there is actually no need to activate it when we have already activated tidyverse in this session, however, it does help when conflicts are present. More on that later.
- The line below says to take the the Insurance data object and summarize the Mean of the Holders variable using the dplyr library.

dplyr::summarize(Insurance, mean(Holders))

  mean(Holders)
1      364.9844

In the line of code above, we see package::function(). If we initiate the library like below, we do not need the beginning of the statement. The code below provides the same answer as the way written above.

library(dplyr)
summarize(Insurance, mean(Holders))

  mean(Holders)
1      364.9844

Evaluating Data Types

A data type of a variable specifies the type of data that is stored inside that variable. R is Called a Dynamically Typed Language, meaning that a variable itself is not declared of any data type. Rather it receives the data type of the R object that is assigned to it. We can change a variable’s data type if we want through coercion, but it will inherit one based on the object assigned to it. We will learn some common ways to do this in the data preparation section.
Evaluating data types in R is crucial for understanding and managing datasets effectively, as different data types require specific handling and operations. R provides several functions to identify and evaluate data types, such as class() to check an object’s class, typeof() to determine its internal storage mode, and is.numeric(), is.character(), or is.factor() to test for specific types. These functions help ensure that data is in the correct format for analysis or modeling tasks. For instance, numeric data types are essential for calculations, while factors or characters are used for categorical data.
There are a number of data types in R that are common to programming and statistical analysis.
- Factor (Nominal or Ordinal)
- Numeric (Real or Discrete (Integer))
- Character
- Logical
- Date

Coercing Data Types

In R, coercion is the process of converting one data type to another, often done automatically by R when necessary. For example, if you combine numeric and character data in a vector, R will coerce the numeric values to characters, as seen in c(1, “A”), which results in c(“1”, “A”). Coercion can also be done manually using functions like as.numeric() or as.factor() to change data types explicitly.
To determine whether a variable is numeric or categorical (factor), you can use the class() function. For example, if you have a variable age, running class(age) will tell you if it is stored as “numeric” or “factor”. Numeric variables represent quantities, such as continuous or discrete numbers, while categorical variables represent groups or categories, typically stored as factors. For example, a variable with levels like “Male” and “Female” would be a categorical (factor) variable.
Sometimes when you read in a dataset all the variables are already in the correct type. Other times, you need to force it into the type you need to conduct the analysis through a process called coercion.

Factor data type

Factor data types can be ordinal or nominal. Ordinal and nominal variables are both types of categorical variables but differ in their characteristics. Ordinal variables have a meaningful order or ranking among categories (e.g., “low,” “medium,” “high”), but the intervals between ranks are not necessarily equal. Nominal variables, on the other hand, represent categories without any inherent order (e.g., “red,” “blue,” “green”). Understanding this distinction is essential for selecting appropriate statistical methods, as ordinal variables often allow for rank-based analyses, while nominal variables do not.
- Ordinal: Contain categories that have some logical order (e.g. categories of age).
- Nominal: Have categories that have no logical order (e.g., religious affiliation and marital status).
R will treat each unique value of a factor as a different level.

Ordinal Variable

Ordinal data may be categorized and ranked with respect to some characteristic or trait.
- For example, instructors are often evaluated on an ordinal scale (excellent, good, fair, poor).
- A scale allows us to code the data based on order, assuming equal distance between scale items (aka likert items).
- You can make an ordinal factor data type in R or you can convert the order to meaningful numbers.
To recode numbers in R, we would code poor to excellent = 1, 2, 3, 4 respectively.
The recode() function from the dplyr package within the tidyverse ecosystem is used to replace specific values in a vector or variable with new values.
It allows you to map old values to new ones in a simple and readable manner.
We will break down the mutate() function and the piping operator a more in the next module. For now, here is an example of how to recode data into numerical values.

library(tidyverse)
# Take a vector representing evaluation scores, named evaluate 
evaluate <- c("excellent", "good", "fair", "poor", "excellent", "good")

data <- data.frame(evaluate)
data <- data %>%
     mutate(evaluate = recode(evaluate,
            "excellent" = 4,
            "good" = 3,
            "fair" = 2,
            "poor" = 1))
data

Nominal Variable

With nominal variables, data are simply categories for grouping.
For example, coding race/ethnicity might have a category value of White, Black, Native American, Asian/Pacific Islander, Other.
Qualitative values may be converted to quantitative values for analysis purposes. + White = 1, Black = 2, etc. This conversion to numerical representation of the category would be needed to run some analysis.
- Sometimes, R does this on our behalf depending on commands used.
We can force a variable into a factor data type using the as.factor() command.
If we use the read.csv() command, we can sometimes do this by setting an argument $stringsAsFactors=TRUE$. We will do this later in the semester.

Numerical data type

Real Number

In R, a real number refers to any numerical value that includes both integers and decimals, as opposed to categorical or character data. The as.numeric() function is used to explicitly convert data into numeric format, allowing R to treat it as a real number for calculations. However, if the input to as.numeric() cannot be logically converted (e.g., non-numeric characters), R will return NA with a warning, highlighting the importance of ensuring the input is suitable for numeric conversion.
The as.numeric() function takes an input vector, such as characters or factors, and attempts to coerce it into numeric values. For instance, as.numeric(c(“1”, “2”, “3”)) returns a numeric vector of 1, 2, and 3. If applied to factors, it returns the underlying integer codes, not the original levels, so caution is needed to avoid misinterpretation. When conversion is not possible (e.g., as.numeric(“abc”)), it results in NA with a warning.
A numerical data type is a vector of numbers that can be Real or Integer, where continuous (Real) variables can take any value along some continuum, and integers take on whole numbers.
Two ways to create:
- We can create a numeric variable by ensuring our value we assign is a number!
- We can force a variable into an real number data type by using the as.numeric() command.

# Assign Rhode Island limit for medical marijuana in ounces per
# person
kOuncesRhode <- 2.5
# Identify the data type
class(x = kOuncesRhode)

[1] "numeric"

Integer Number

Discrete (Integer) Variables: In R, integer discrete data types represent whole numbers, making them ideal for scenarios where only non-fractional values are meaningful, such as counts, rankings, or categorical levels encoded as numbers. Integer values are stored more efficiently than numeric (floating-point) values, providing performance benefits in memory usage and computations. Integers can be explicitly created using the L suffix (e.g., x <- 5L) or by coercing other data types with as.integer(). Operations on integers, such as arithmetic, comparisons, or indexing, behave consistently with their discrete nature. Integer data types are particularly useful for tasks like looping, array indexing, or when working with data where precision beyond whole numbers is unnecessary or irrelevant.
For an example of a discrete variable, we could collect information on the number of children in a family or number of points scored in a basketball game.

# Assign the value of 4 to a constant called kTestInteger and set as
# an integer
kTestInteger <- as.integer(4)
class(kTestInteger)  #Confirm the data type is an integer

[1] "integer"

# Use as.integer() to truncate the variable ouncesRhode
Trunc <- as.integer(kOuncesRhode)
Trunc

[1] 2

Character data type

The character data type in R is used to store text or string data, making it essential for handling names, labels, descriptions, or any non-numeric information. Character data is represented as sequences of characters enclosed in quotes (e.g., “hello” or ‘world’).
Character data types can be wrapped in either single or double quotation marks (e.g, “hello” or ‘hello’).
Character data types can include letters, words, or numbers that cannot logically be included in calculations (e.g., a zip code).
- A quick example is below that shows how to assign a character value to a variable.

# Make string constants
Q1 <- "A"
Q2 <- "B"
# Check the data type
class(x = Q1)

[1] "character"

Logical data type

Logical data types in R represent values that are either TRUE or FALSE, often used for conditional operations and data filtering. Logical values can be the result of comparisons (e.g., x > 10) or explicitly assigned. Logical vectors are particularly useful in subsetting data; for instance, data[data$column > 5, ] selects rows where the column values are greater than 5. Logical operators like & (and), | (or), and ! (not) allow for more complex conditions. Logical data is also essential in control structures such as if, else, and loops. By enabling dynamic and conditional programming, logical types play a key role in efficient data manipulation and analysis.
- A quick example is below that shows how to assign a logical value to a variable.

# Store the result of 6 > 8 in a constant called kSixEight
kSixEight <- 6 > 8
# Can use comparison tests with the following == >= <= > < <> !=
kSixEight  # Print kSixEight

[1] FALSE

# Determine the data type of kSixEight
class(x = kSixEight)

[1] "logical"

Date data type

The date data type in R is used to represent calendar dates, allowing for accurate storage and manipulation of time-related data. Dates in R are typically stored as objects of class Date, which internally represent the number of days since January 1, 1970. This format facilitates arithmetic operations, such as calculating differences between dates or adding/subtracting days. Dates can be created and manipulated using functions like as.Date() for converting character strings to date objects, or Sys.Date() for retrieving the current date. Handling dates correctly is essential in time series analysis, scheduling, and any task involving chronological data, as improper formatting or assumptions can lead to errors in analysis.

# Convert date info in format 'mm/dd/yyyy' using as.Date
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
str(dates)

 Date[1:2], format: "1965-01-05" "1975-08-16"

The lubridate package provides additional tools for working with dates, such as parsing and extracting components like year, month, and day. This package makes dates a lot easier to work with.

# Convert date info in format 'mm/dd/yyyy' using lubridate
library(lubridate)
strDates <- c("01/05/1965", "08/16/1975")
dates <- mdy(strDates)
str(dates)

 Date[1:2], format: "1965-01-05" "1975-08-16"

If you are only given a year and a month, you can use the ym() command to turn it to a date. But take note that it will add a day to the value as a placeholder.

# Convert date info in format 'yyyymm' using lubridate
stryyyymm <- c("202201", "202003", "202204")
dates <- ym(stryyyymm)
str(dates)

 Date[1:3], format: "2022-01-01" "2020-03-01" "2022-04-01"

# Convert date info in format 'mm/dd/yyyy' using as.Date
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")
str(dates)

 Date[1:2], format: "1965-01-05" "1975-08-16"

UseDates <- read.csv("data/Lubridate.csv")
UseDates$Date <- ym(UseDates$Date)

# You can use the month or year function to access the month or year
# from the date
UseDates$month <- month(UseDates$Date)
UseDates$year <- year(UseDates$Date)

## Now, you can access months or year numerically
summary(UseDates$month)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   7.000   6.538   9.000  12.000

summary(UseDates$year)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2019    2020    2021    2021    2021    2022

## You could code the seasons using a mutate function making a new
## variable 'season'
library(tidyverse)
UseDates <- UseDates %>%
    mutate(season = case_when(month %in% 3:5 ~ "Spring", month %in% 6:8 ~
        "Summer", month %in% 9:11 ~ "Fall", TRUE ~ "Winter"))

UseDates$season <- as.factor(UseDates$season)
summary(UseDates$season)

  Fall Spring Summer Winter 
     9      9     12      9

## You could use a filter statement to select a particular year.  In
## the example below, I save the filtered data into a new data frame
## UseDate2022.
UseDate2022 <- filter(UseDates, year == 2022)

Nominal Example with Dataset

library(tidyverse)
gss.2016 <- read_csv(file = "data/gss2016.csv")

# Examine the variable types with summary and class functions.
summary(gss.2016)

    grass               age           
 Length:2867        Length:2867       
 Class :character   Class :character  
 Mode  :character   Mode  :character

class(gss.2016$grass)  #Check the data type.

[1] "character"

gss.2016$grass <- as.factor(gss.2016$grass)  #Turn to a factor.
class(gss.2016$grass)  #Confirming it is now correct.

[1] "factor"

Numerical Example with Dataset

We need to ensure data can be coded as numeric before using the as.numeric() command. For example, to handle the variable age, it seems like numerical values except one value of “89 OR OLDER”. If as.numeric() command was used on this variable, it would put all the 89 and older observations as NAs. To force it to be a numerical variable, and keep that the sample participants were the oldest value, we need to recode it and then use the as.numeric() command to coerce it into a number.
Recoding the 89 and older to 89 does cause the data to lack integrity in its current form because it will treat the people over 89 years old as 89. But, we are limited here because this needs to be a numerical variable for us to proceed. We will learn a step later on in this section to transform the age variable into categories so that we bring back our data integrity.

class(gss.2016$age)

[1] "character"

# Recode '89 OR OLDER' into just '89'
gss.2016$age <- recode(gss.2016$age, `89 OR OLDER` = "89")
# Convert to numeric data type
gss.2016$age <- as.numeric(gss.2016$age)
summary(gss.2016)  #Conduct final check confirming correct data types

       grass           age       
 DK       : 110   Min.   :18.00  
 IAP      : 911   1st Qu.:34.00  
 LEGAL    :1126   Median :49.00  
 NOT LEGAL: 717   Mean   :49.16  
 NA's     :   3   3rd Qu.:62.00  
                  Max.   :89.00  
                  NA's   :10

Handling Missing Data

Missing data is a common issue in data analysis and can arise for various reasons, such as data collection errors, non-responses in surveys, or data corruption. In R, handling missing data is crucial to ensure accurate and reliable analysis. Missing values are typically represented by NA (Not Available) in R.
Missing data needs to be closely evaluated and verified within each variable whether the data is truly blank, has no answer, or is marked with a character value such as the text N/A.
Missing data needs to be closely evaluated to see if the missing value is meaningful or not. If the variable that has many missing values is deemed unimportant or can be represented using a proxy variable that does not have missing values, the variable may be excluded from the analysis.
After a data set is loaded, there are three common strategies for dealing with missing values.

The omission strategy recommends that observations with missing values be excluded from subsequent analysis.
The imputation strategy recommends that the missing values be replaced with some reasonable imputed values. For example, imputing missing values using techniques like mean/median substitution or regression models can be considered.
- Numeric variables: replace with the average.
- Categorical variables: replace with the predominant category.
Ignore your missing data if the function works without it.
- When you ignore missing data because your function works without it, the missing values are typically excluded from the calculations by default. In R, many functions, such as mean(), sum(), or lm(), have arguments like na.rm = TRUE to explicitly remove missing values during computation. Ignoring missing data can simplify the analysis, but it comes with potential consequences.

The choice of approach depends on the nature of the missingness, which can be categorized as Missing Completely at Random, Missing at Random, or Missing Not at Random. Addressing missing data appropriately is essential to maintain the validity of statistical analyses and avoid biases.

Limitations of Using a Missing Data Technique

Always recommend a closer evaluation of missing data.
There are limitations of the techniques listed above (omission, imputation, and ignore).
Reduction in Sample Size: Ignoring missing data leads to a smaller effective sample size, which may reduce the power of your analysis and the reliability of the results.
Bias: If the missing data are not Missing Completely at Random, ignoring them may introduce bias. For example, if specific groups or patterns are overrepresented in the remaining data, the results may not generalize to the full dataset.
Distorted Metrics: Calculations that ignore missing values (e.g., averages, sums) might not reflect the true population parameters, especially if the missing data are systematically different from the observed data. In addition, if a large number of values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
Incorrect Inferences: Ignoring missing data without considering its nature could lead to incorrect conclusions, as the analysis only reflects the subset of available data.
Consider a dataset used to predict factors that lead to intubation due to COVID-19. Suppose one variable, “Number of pregnancies,” contains missing data (NAs) for all men, as the question is not applicable to them. If we were to compare this variable with another, “Intubated due to COVID-19: Yes/No,” simply omitting the rows with blanks (NAs) could lead to the exclusion of an entire gender, distorting the analysis. In this case, a different approach to handling missing data would be more appropriate to ensure the dataset remains representative. Additionally, if a value is not blank but is considered missing for analysis purposes, the data should be consistently processed (e.g., mutated or recoded) to align with the chosen technique for handling true missing values.

na.rm

The na.rm parameter in R is a convenient way to handle missing values (NA) within functions that perform calculations on datasets. The parameter stands for “NA remove” and, when set to TRUE, instructs the function to exclude missing values from the computation.

y <- c(1, 2, NA, 3, 4, NA)
# These lines runs, but do not give you anything useful.
sum(y)

[1] NA

mean(y)

[1] NA

Many functions in R include parameters that will ignore NAs for you.
sum() and mean() are examples of this, and most summary statistics like median() and var() and max() also use the na.rm parameter to ignore the NAs. Always check the help to determine if na.rm is a parameter.

sum(y, na.rm = TRUE)

[1] 10

mean(y, na.rm = TRUE)

[1] 2.5

# na.omit removes the NAs from the vector of dataset. Here, it works
# for removing NAs from the vector.
y <- na.omit(y)

na.rm in a dataset

Using a dataset, we need the _data$variable_ format, like mean(data$column, na.rm = TRUE) calculates the mean of the non-missing values in the specified column or variable. This approach is straightforward and useful when the presence of missing values would otherwise cause an error or return an NA result.

gig <- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings = "")
summary(gig)

   EmployeeID         Wage               Industry           Job     
 Min.   :  1.0   Min.   :24.28   Automotive  :190   Engineer  :191  
 1st Qu.:151.8   1st Qu.:34.19   Construction:366   Other     : 88  
 Median :302.5   Median :41.88   Tech        : 38   Accountant: 83  
 Mean   :302.5   Mean   :40.08   NA's        : 10   Programmer: 80  
 3rd Qu.:453.2   3rd Qu.:45.87                      Sales Rep : 77  
 Max.   :604.0   Max.   :51.00                      (Other)   : 69  
                                                    NA's      : 16

mean(gig$Wage, na.rm = TRUE)

[1] 40.0828

is.na()

In R, the is.na() function is used to check for missing (NA) values in objects like vectors, data frames, or arrays. It returns a logical vector of the same length as the input object, where TRUE indicates a missing value and FALSE indicates a non-missing value.
To conduct an example, read in the data set called gig.csv from your working directory.

# Counts the number of all NA values in the entire dataset
CountAllBlanks <- sum(is.na(gig))
CountAllBlanks

[1] 26

# Gives the observation number of the observations that include NA
# values
which(is.na(gig$Industry))

 [1]  24 139 361 378 441 446 479 500 531 565

# Produces a dataset with observations that have NA values in the
# Industry field.
ShowBlankObservations <- gig %>%
    filter(is.na(Industry))
ShowBlankObservations

   EmployeeID  Wage Industry        Job
1          24 42.58     <NA>  Sales Rep
2         139 42.18     <NA>   Engineer
3         361 31.33     <NA>      Other
4         378 48.09     <NA>      Other
5         441 32.35     <NA> Accountant
6         446 30.76     <NA> Accountant
7         479 42.85     <NA> Consultant
8         500 43.13     <NA>  Sales Rep
9         531 43.13     <NA>   Engineer
10        565 38.98     <NA> Accountant

# Counts the number of observations that have NA values in the
# Industry field. Industry is categorical, so we can count values
# based on it.
CountBlanks <- sum(is.na(gig$Industry))
CountBlanks

[1] 10

library(tidyverse)
# Counts the number of observations that have NA values in the Wage
# field.
CountBlanks <- sum(is.na(gig$Wage))
CountBlanks

[1] 0

na_if()

The na_if() function in tidyr is used to replace specific values in a column with NA (missing) values. This function can be particularly useful when you want to standardize missing values across a dataset or when you want to replace certain values with NA for further data processing

TurnNA <- gig %>%
    mutate(Job = na_if(Job, "Other"))
head(TurnNA)

  EmployeeID  Wage     Industry        Job
1          1 32.81 Construction    Analyst
2          2 46.00   Automotive   Engineer
3          3 43.13 Construction  Sales Rep
4          4 48.09   Automotive       <NA>
5          5 43.62   Automotive Accountant
6          6 46.98 Construction   Engineer

na.omit() vs. drop_na()

Both functions return a new object with the rows containing missing values removed.
na.omit() is a base R function, so it doesn’t require any additional package installation where drop_na() requires loading the tidyr package, which is part of the tidyverse ecosystem.
drop_na() fits well into tidyverse pipelines, making it easy to integrate with other tidyverse functions where na.omit() can also be used in pipelines but might require additional steps to fit seamlessly.

# install.packages('Amelia')
library(Amelia)
data("africa")
summary(africa)

      year              country       gdp_pc            infl        
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.400  
 1st Qu.:1977   Burundi     :20   1st Qu.: 513.8   1st Qu.:  4.760  
 Median :1982   Cameroon    :20   Median :1035.5   Median :  8.725  
 Mean   :1982   Congo       :20   Mean   :1058.4   Mean   : 12.753  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1244.8   3rd Qu.: 13.560  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.890  
                                  NA's   :2                         
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4332190  
 Median : 59.59   Median :0.1667   Median : 5853565  
 Mean   : 62.60   Mean   :0.2889   Mean   : 5765594  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7355000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390  
 NA's   :5

summary(africa$gdp_pc)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  376.0   513.8  1035.5  1058.4  1244.8  2723.0       2

summary(africa$trade)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  24.35   38.52   59.59   62.60   81.16  134.11       5

africa1 <- na.omit(africa)
summary(africa1)

      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390

## to drop all at once.
africa2 <- africa %>%
    drop_na()
summary(africa2)

      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390

You try to load the airquality dataset from base R and look at a summary of the dataset.

Sum the number of NAs in airquality.
Omit all the NAs from airquality and save it in a new data object called airqual and take a new summary of it.

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0

[1] 44

     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day       
 Min.   :5.000   Min.   : 1.00  
 1st Qu.:6.000   1st Qu.: 9.00  
 Median :7.000   Median :16.00  
 Mean   :7.216   Mean   :15.95  
 3rd Qu.:9.000   3rd Qu.:22.50  
 Max.   :9.000   Max.   :31.00

Using AI

Use the following prompts on a generative AI, like chatGPT, to learn more about introductory topics.

Explain the difference between descriptive and inferential statistics and provide real-life examples of both.
What is the purpose of using R in statistical analysis, and what are the key benefits of using RStudio as a graphical interface?
What happens when you assign the same variable multiple values in R? Write an example script that demonstrates this behavior and explains the output.
Create a script that demonstrates how to assign values to variables using both numeric and character data types. Then, explain how these assignments are stored in RStudio’s environment.
In R, what is the role of the assignment operator <- Demonstrate its use by creating a few variables for numeric and character data types.
Demonstrate how to create a vector in R using the c() function. Use this vector to perform basic operations like addition and multiplication.
Write a script that reads a CSV file into R using read.csv(). Summarize the dataset and explain how the columns and rows are structured.
How can you access specific columns of a data frame using the $ operator? Provide an example using a sample dataset in R.
Explain how to use the summary() function in R to summarize a dataset. Write a script that loads a dataset and runs summary() on it.
What is the difference between ordinal and nominal variables, and how can you recode these data types in R?
Explain the steps to coerce a character variable into a factor using the as.factor() function in R.
What strategies can you use to handle missing data in R, and how does na.omit() differ from drop_na()?

Summary

In this lesson, we went over information to make sure you had some basics to really start learning R. There are a lot of ways to get the same things done in R; you have to find the way that works best for you. As we learn R, you will get used to doing things your way to be able to slice and evaluate the data to find rich information from the data sets we look at. As long as the data was handled properly, it does not matter how we reach our goal using R as long as we do it ourselves.