Data Preparation

This lesson covers data preparation in R — the process of inspecting, cleaning, and reshaping a dataset so it is ready for analysis. By the end, you will be able to evaluate a dataset for quality issues, correct variable types, handle missing values, and use dplyr to filter, select, sort, create, and summarize data with confidence.

We begin with data types and coercion: R assigns a type to every variable when a dataset is loaded, and the wrong type causes silent errors that look fine but produce wrong results. A number stored as a character will break a mean calculation without warning. A category stored as numeric will produce a meaningless average. Checking and correcting types before doing anything else is the first habit of a careful analyst.

We then move to the dplyr toolkit: a set of functions designed to manipulate data frames in plain, readable steps. filter() subsets rows by condition, select() reduces columns to what you need, arrange() sorts observations, and mutate() creates new variables or recodes existing ones. The pipe operator (%>%) chains these steps together so that a sequence of transformations reads like a sentence rather than a tangle of nested functions.

The lesson continues with grouping and summarizing: using group_by() with summarize() to compute statistics separately for each level of a categorical variable. This is one of the most frequently used patterns in real analysis — comparing means, counts, or totals across groups is the starting point for almost every business question.

We close with missing data and visualization: identifying where values are absent and deciding whether to omit, impute, or work around them, then producing a scatterplot and grouped bar chart to confirm that the cleaned data tells a coherent story before any formal analysis begins.

By the end of this lesson, you should be able to load a messy dataset, diagnose its problems, apply a dplyr cleaning workflow, and produce a visualization that reflects the prepared data. Work through every code example in your own .R script file alongside the reading.

AI and Data Preparation: Where the Stakes Are Highest

Data preparation is the phase of analysis where AI errors are most consequential — and most invisible. AI can write dplyr pipelines, suggest coercion functions, and generate filter conditions quickly and correctly. But the decisions that matter most in this phase are not about syntax:

What AI handles well, versus what requires your judgment:

  • AI: writing filter(), mutate(), and select() pipelines. You: deciding which observations to filter out, and whether that introduces bias.
  • AI: suggesting as.factor() or as.numeric() coercions. You: recognizing when a variable is miscoded in a way that silently corrupts analysis.
  • AI: generating group_by() + summarize() code. You: deciding which groups make sense to compare for your business question.
  • AI: proposing na.omit() or drop_na() for missing data. You: deciding whether removing missing values on a key demographic introduces selection bias.
  • AI: producing clean, readable code. You: noticing when the cleaning logic is technically valid but analytically wrong.

The critical insight: removing missing data is not automatically “cleaning.” If missing values are concentrated in a particular region, income bracket, or customer segment, removing them can systematically distort every analysis that follows. AI will execute whatever you ask it without flagging that issue — because it is a statistical judgment, not a code error. This is why understanding the why behind each data preparation step matters as much as learning the functions.

At a Glance

  • This lesson is about developing the habits of a careful analyst before any modeling begins. R will not warn you when a number is stored as text or a category is coded as a number — it will simply produce wrong results. The skills here — checking types, correcting them, filtering and reshaping with dplyr, handling missing values, and verifying your work with a visualization — are not preliminary steps you do once and move on from. They are the foundation of every analysis you will run in the MSBA program. The goal is not just to learn the functions, but to internalize a workflow: inspect first, clean deliberately, verify visually.

Lesson Objectives

  • Load a dataset, evaluate variable types using class() and summary(), and apply coercion functions (as.factor(), as.numeric(), as.character()) to correct type mismatches before analysis begins.
  • Use the core dplyr verbs — filter(), select(), arrange(), mutate(), and group_by() with summarize() — to clean, reshape, and aggregate a data frame, chaining steps together with the pipe operator (%>%).
  • Identify and handle missing values using is.na(), sum(is.na()), and na.omit(), and explain the implications of removing versus retaining missing observations.
  • Produce a scatterplot and a grouped bar chart from a prepared dataset and interpret the visual output as a quality check on the cleaning workflow.

Consider While Reading

  • Every dplyr function in this module does one thing and does it cleanly — the power comes from chaining them. As you work through each function individually, keep asking: how would I combine this with the previous step using %>%? The pipe is not just a convenience; it is the habit that makes your code readable and your workflow reproducible.
  • Data preparation is where most analytical errors are introduced, not in the modeling steps that follow. A wrong data type, an undetected missing value, or a miscoded category will silently corrupt every result downstream. Get in the habit of running summary() and str() on your dataset before touching anything else — these two commands together will surface most problems immediately.
  • When you use group_by() with summarize(), you are shifting from thinking about individual rows to thinking about groups. This is one of the most important conceptual transitions in data analysis. Notice how the same variable — say, AnnualSpend — means something different when you look at its overall mean versus its mean broken out by Region or Churned. The grouped summary is where business questions actually get answered.

Data Types and Coercion

Before cleaning or transforming a dataset, you need to confirm that R is treating each variable as the correct type. A variable stored as text when it should be numeric will silently break calculations; a number stored as a factor will produce the wrong chart. This section covers how to identify data types and how to convert between them — skills you will use immediately in the dplyr examples that follow.

Evaluating Data Types

  • The data type of a variable specifies what kind of data is stored inside it. R is a dynamically typed language, meaning a variable is not declared as any particular type; it inherits the data type of the R object assigned to it. We can change a variable’s data type through coercion, covered later in this section.
  • Evaluating data types in R is crucial for understanding and managing datasets effectively, as different data types require specific handling and operations. R provides several functions to identify and evaluate data types, such as class() to check an object’s class, typeof() to determine its internal storage mode, and is.numeric(), is.character(), or is.factor() to test for specific types. These functions help ensure that data is in the correct format for analysis or modeling tasks. For instance, numeric data types are essential for calculations, while factors or characters are used for categorical data. A short example applying these functions appears after the list below.
  • There are a number of data types in R that are common to programming and statistical analysis.
    • Factor (Nominal or Ordinal)
    • Numeric (Real or Discrete (Integer))
    • Character
    • Logical
    • Date
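A minimal sketch applying the evaluation functions above; the values here are hypothetical, chosen to illustrate one of each common type:
# Hypothetical values, one per common type
xNum  <- 2.5                       # numeric
xInt  <- 5L                        # integer (discrete)
xChr  <- "hello"                   # character
xLgl  <- TRUE                      # logical
xFct  <- factor(c("low", "high"))  # factor
xDate <- as.Date("2024-01-15")     # date
class(xNum)        # "numeric"
typeof(xInt)       # "integer"
is.factor(xFct)    # TRUE
is.character(xChr) # TRUE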

Coercing Data Types

  • In R, coercion is the process of converting one data type to another, often done automatically by R when necessary. For example, if you combine numeric and character data in a vector, R will coerce the numeric values to characters, as seen in c(1, "A"), which results in c("1", "A"). Coercion can also be done manually using functions like as.numeric() or as.factor() to change data types explicitly.
  • To determine whether a variable is numeric or categorical (factor), you can use the class() function. For example, if you have a variable age, running class(age) will tell you if it is stored as “numeric” or “factor”. Numeric variables represent quantities, such as continuous or discrete numbers, while categorical variables represent groups or categories, typically stored as factors. For example, a variable with levels like “Male” and “Female” would be a categorical (factor) variable.
  • Sometimes when you read in a dataset all the variables are already in the correct type. Other times, you need to force it into the type you need to conduct the analysis through a process called coercion.

Factor Data Type

  • Factor data types can be ordinal or nominal. Ordinal and nominal variables are both types of categorical variables but differ in their characteristics. Ordinal variables have a meaningful order or ranking among categories (e.g., “low,” “medium,” “high”), but the intervals between ranks are not necessarily equal. Nominal variables, on the other hand, represent categories without any inherent order (e.g., “red,” “blue,” “green”). Understanding this distinction is essential for selecting appropriate statistical methods, as ordinal variables often allow for rank-based analyses, while nominal variables do not.
    • Ordinal: Contain categories that have some logical order (e.g. categories of age).
    • Nominal: Have categories that have no logical order (e.g., religious affiliation and marital status).
  • R will treat each unique value of a factor as a different level.

Ordinal Variable

  • Ordinal data may be categorized and ranked with respect to some characteristic or trait.
    • For example, instructors are often evaluated on an ordinal scale (excellent, good, fair, poor).
    • A scale allows us to code the data based on order, often assuming equal distance between scale points (as with Likert items).
    • You can make an ordinal factor data type in R or convert the order to meaningful numbers; both approaches are shown below.
  • To recode as numbers in R, we would code poor through excellent as 1, 2, 3, 4 respectively.
  • The recode() function from the dplyr package within the tidyverse ecosystem is used to replace specific values in a vector or variable with new values.
  • It allows you to map old values to new ones in a simple and readable manner.
library(tidyverse)
# Take a vector representing evaluation scores, named evaluate 
evaluate <- c("excellent", "good", "fair", "poor", "excellent", "good")

data <- data.frame(evaluate)
data <- data %>%
     mutate(evaluate = recode(evaluate,
            "excellent" = 4,
            "good" = 3,
            "fair" = 2,
            "poor" = 1))
data
  evaluate
1        4
2        3
3        2
4        1
5        4
6        3
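As an alternative to numeric recoding, we can keep the original labels and declare the order directly with an ordered factor; a minimal sketch using the evaluate vector from above:
# Keep the labels but declare the logical order explicitly
evaluateOrdinal <- factor(evaluate,
     levels = c("poor", "fair", "good", "excellent"),
     ordered = TRUE)
class(evaluateOrdinal)
[1] "ordered" "factor"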

Nominal Variable

  • With nominal variables, data are simply categories for grouping.
  • For example, coding race/ethnicity might have a category value of White, Black, Native American, Asian/Pacific Islander, Other.
  • Qualitative values may be converted to quantitative values for analysis purposes (e.g., White = 1, Black = 2, and so on). This conversion to a numerical representation of the category is needed to run some analyses.
    • Sometimes, R does this on our behalf depending on commands used.
  • We can force a variable into a factor data type using the as.factor() command.
  • If we use the read.csv() command, we can sometimes do this by setting the argument stringsAsFactors = TRUE. We will do this later in the lesson.

Numeric Data Type

  • The as.numeric() function in R is used to convert data into numeric format, allowing for mathematical and statistical operations. It takes an input vector, such as characters or factors, and attempts to coerce it into numeric values. For instance, as.numeric(c("1", "2", "3")) returns a numeric vector of 1, 2, and 3. If applied to factors, it returns the underlying integer codes, not the original levels, so caution is needed to avoid misinterpretation. When conversion is not possible (e.g., as.numeric("abc")), it results in NA with a warning.

  • A numerical data type is a vector of numbers that can be Real or Integer, where continuous (Real) variables can take any value along some continuum, and integers take on whole numbers.

  • Two ways to create:

    • We can create a numeric variable by ensuring our value we assign is a number!
    • We can force a variable into a real number data type by using the as.numeric() command.
    # Assign Rhode Island limit for medical marijuana in ounces per person
    kOuncesRhode <- 2.5
    #Identify the data type 
    class(x = kOuncesRhode) 
    [1] "numeric"
  • Discrete (Integer) Variables: In R, integer discrete data types represent whole numbers, making them ideal for scenarios where only non-fractional values are meaningful, such as counts, rankings, or categorical levels encoded as numbers. Integer values are stored more efficiently than numeric (floating-point) values, providing performance benefits in memory usage and computations. Integers can be explicitly created using the L suffix (e.g., x <- 5L) or by coercing other data types with as.integer(). Operations on integers, such as arithmetic, comparisons, or indexing, behave consistently with their discrete nature. Integer data types are particularly useful for tasks like looping, array indexing, or when working with data where precision beyond whole numbers is unnecessary or irrelevant.

  • For an example of a discrete variable, we could collect information on the number of children in a family or number of points scored in a basketball game.

# Assign the value of 4 to a constant called kTestInteger and set as an integer
kTestInteger <- as.integer(4)
class(kTestInteger) #Confirm the data type is an integer 
[1] "integer"
#Use as.integer() to truncate the variable kOuncesRhode
Trunc <- as.integer(kOuncesRhode); Trunc
[1] 2

Character Data Type

  • The character data type in R is used to store text or string data, making it essential for handling names, labels, descriptions, or any non-numeric information. Character data is represented as sequences of characters enclosed in quotes (e.g., "hello" or 'world').
  • Character data types can be wrapped in either single or double quotation marks (e.g., "hello" or 'hello').
  • Character data types can include letters, words, or numbers that cannot logically be included in calculations (e.g., a zip code).
    • A quick example is below that shows how to assign a character value to a variable.
# Make string constants 
Q1 <- "A"
Q2 <- 'B'
# Check the data type
class(x = Q1)
[1] "character"

Logical Data Type

  • Logical data types in R represent values that are either TRUE or FALSE, often used for conditional operations and data filtering. Logical values can be the result of comparisons (e.g., x > 10) or explicitly assigned. Logical vectors are particularly useful in subsetting data; for instance, data[data$column > 5, ] selects rows where the column values are greater than 5. Logical operators like & (and), | (or), and ! (not) allow for more complex conditions. Logical data is also essential in control structures such as if, else, and loops. By enabling dynamic and conditional programming, logical types play a key role in efficient data manipulation and analysis. A short subsetting sketch follows the assignment example below.

    • A quick example is below that shows how to assign a logical value to a variable.
    # Store the result of 6 > 8 in a constant called kSixEight
    kSixEight <- 6 > 8 
    # Can use comparison tests with the following: == != >= <= > < 
    kSixEight # Print kSixEight
    [1] FALSE
    # Determine the data type of kSixEight
    class(x = kSixEight)
    [1] "logical"

Nominal Example with Dataset

library(tidyverse)
gss.2016 <- read_csv(file = "data/gss2016.csv") 
#Examine the variable types with summary and class functions.
summary(gss.2016)
    grass               age           
 Length:2867        Length:2867       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
class(gss.2016$grass) #Check the data type.
[1] "character"
gss.2016$grass <- as.factor(gss.2016$grass) #Turn to a factor.
class(gss.2016$grass) #Confirming it is now correct.
[1] "factor"

Numerical Example with Dataset

  • We need to ensure data can be coded as numeric before using the as.numeric() command. For example, the age variable looks numeric except for one value, “89 OR OLDER”. If as.numeric() were applied directly, all “89 OR OLDER” observations would become NAs. To coerce age to numeric while preserving the fact that these participants are the oldest in the sample, we first recode that value and then apply as.numeric().
  • Recoding “89 OR OLDER” to 89 does cost some data integrity in its current form, because people over 89 years old are treated as exactly 89. But we are limited here because this needs to be a numerical variable for us to proceed. Later in this section we will transform the age variable into categories, restoring that integrity.
class(gss.2016$age)
[1] "character"
#Recode "89 OR OLDER" into just "89"
gss.2016$age <- recode(gss.2016$age, "89 OR OLDER" = "89")
# Convert to numeric data type
gss.2016$age <- as.numeric(gss.2016$age) 
summary(gss.2016) #Conduct final check confirming correct data types
       grass           age       
 DK       : 110   Min.   :18.00  
 IAP      : 911   1st Qu.:34.00  
 LEGAL    :1126   Median :49.00  
 NOT LEGAL: 717   Mean   :49.16  
 NA's     :   3   3rd Qu.:62.00  
                  Max.   :89.00  
                  NA's   :10     

Common dplyr Functions

Arrange

  • Sorting or arranging the dataset allows you to specify an order based on variable values.
  • Sorting allows us to review the range of values for each variable, and we can sort based on a single or multiple variables.
  • Notice the difference between sort() and arrange() functions below.
    • The sort() function sorts a vector.
    • The arrange() function sorts a dataset based on a variable.
  • To conduct an example, read in the data set called gig.csv from your working directory.
library(tidyverse)

gig <- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings="")
dim(gig)
[1] 604   4
head(gig)
  EmployeeID  Wage     Industry        Job
1          1 32.81 Construction    Analyst
2          2 46.00   Automotive   Engineer
3          3 43.13 Construction  Sales Rep
4          4 48.09   Automotive      Other
5          5 43.62   Automotive Accountant
6          6 46.98 Construction   Engineer
  • Using the arrange() function, we pass the dataset, followed by a comma, and then the variable we want to sort by. This arranges from smallest to largest.

  • Below is code to rearrange data based on Wage and save it in a new object.

sortTidy <- arrange(gig, Wage)
head(sortTidy)
  EmployeeID  Wage     Industry        Job
1        467 24.28 Construction   Engineer
2        547 24.28 Construction  Sales Rep
3        580 24.28 Construction Accountant
4        559 24.42 Construction   Engineer
5         16 24.76   Automotive Programmer
6        221 24.76   Automotive Programmer
  • We can apply a desc() function inside the arrange function to re-sort from high to low like shown below.
sortTidyDesc <- arrange(gig, desc(Wage))
head(sortTidyDesc)
  EmployeeID  Wage     Industry        Job
1        110 51.00 Construction      Other
2         79 50.00   Automotive   Engineer
3        348 49.91 Construction Accountant
4        373 49.91 Construction Accountant
5        599 49.84   Automotive   Engineer
6         70 49.77 Construction Accountant
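arrange() also accepts multiple variables, sorting by each in turn; a brief sketch (output not shown here):
# Sort alphabetically by Industry, then by Wage from high to low within each industry
sortMulti <- arrange(gig, Industry, desc(Wage))
head(sortMulti)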

Subsetting and Filtering

  • Subsetting or filtering a data frame is the process of indexing, or extracting a portion of the data set that is relevant for subsequent statistical analysis. Subsetting allows you to work with a subset of your data, which is essential for data analysis and manipulation. One of the most common ways to subset in R is by using square brackets []. We can also use the filter() function from tidyverse.

  • We use subsets to do the following:

    • View data based on specific data values or ranges.
    • Compare two or more subsets of the data.
    • Eliminate observations that contain missing values, low-quality data, or outliers.
    • Exclude variables that contain redundant information, or variables with excessive amounts of missing values.
  • When working with data frames, you can subset by rows and columns using two indices inside the square brackets: data[row, column]. For example, if you have df <- data.frame(a = 1:3, b = c("X", "Y", "Z")), df[1, 2] would return the value "X", which is the first row and second column. If you want the entire first row, you would use df[1, ], and to get the second column, you’d use df[, 2].

  • You can also use logical conditions to subset. For instance, x[x > 20] would return all values in x greater than 20, and in a data frame, you could filter rows where a certain condition holds, such as df[df$a > 1, ], which returns rows where column a has values greater than 1.

  • Let’s do an example using the customers.csv file we read in as customers in the last lesson. Base R provides several methods for subsetting data structures; the example below uses the square-bracket dataset[row, column] format.

customers <- read.csv("data/customers.csv", stringsAsFactors = TRUE)

#To subset, note the dataset[row,column] format
#Results hidden to save space, but be sure to try this code in your .R file. 
#Data in 1st row
customers[1,] 
#Data in 2nd column
customers[,2] 
#Data for 2nd column/1st observation (row)
customers[1,2] 
#First 3 columns of data
customers[,1:3] 
  • One of the most powerful and intuitive ways to subset data frames in R is the filter() function from the dplyr package, which is part of the tidyverse.
  • The filter function is used to subset rows of a data frame based on certain conditions.
  • The below example filters data by the College variable when category values are “Yes” and saves the filtered dataset into an object called college.
#Filtering by whether the customer has a "Yes" for college. 
#Saving this filter into a new object college which you should see in your global environment. 
college <- filter(customers, College == "Yes")
#Showing first 6 records of college - note the College variable is all Yes's. 
head(college)
   CustID    Sex     Race  BirthDate College HHSize Income Spending Orders
1 1530016 Female    Black 12/16/1986     Yes      5  53000      241      3
2 1531136   Male    White   5/9/1993     Yes      5  94000      843     12
3 1532160   Male    Black  5/22/1966     Yes      2  64000      719      9
4 1532307   Male    White  9/16/1964     Yes      4  60000      582     13
5 1532387   Male    White  8/27/1957     Yes      2  67000      452      9
6 1533017 Female Hispanic  5/14/1985     Yes      3  84000      153      2
  Channel
1      SM
2      TV
3      TV
4      SM
5      SM
6     Web
  • Using the filter command, we can add filters pretty easily by using an & for and, or an | for or. The statement below filters by College and Income and saves the new dataset in an object called twoFilters.
twoFilters <- filter(customers, College == "Yes" & Income < 50000)
head(twoFilters)
   CustID    Sex     Race  BirthDate College HHSize Income Spending Orders
1 1533697 Female    Asian  10/8/1974     Yes      3  42000      247      3
2 1535063 Female    White 12/17/1982     Yes      3  42000      313      4
3 1544417   Male Hispanic  3/14/1980     Yes      4  46000      369      3
4 1547864 Female Hispanic  6/15/1987     Yes      2  44000      500      5
5 1550969 Female    White   4/8/1978     Yes      4  47000      774     16
6 1553660 Female    White   8/2/1988     Yes      2  47000      745      5
  Channel
1     Web
2      TV
3      TV
4      TV
5      TV
6      SM
  • Next, we can write an or statement. The example below uses the filter command to match more than one category in the same field by placing | between the conditions.
TwoRaces <- filter(customers, Race == "Black" | Race == "White")
head(TwoRaces)
   CustID    Sex  Race  BirthDate College HHSize Income Spending Orders Channel
1 1530016 Female Black 12/16/1986     Yes      5  53000      241      3      SM
2 1531136   Male White   5/9/1993     Yes      5  94000      843     12      TV
3 1532160   Male Black  5/22/1966     Yes      2  64000      719      9      TV
4 1532307   Male White  9/16/1964     Yes      4  60000      582     13      SM
5 1532387   Male White  8/27/1957     Yes      2  67000      452      9      SM
6 1533791   Male White 10/27/1999     Yes      1  97000     1028     17     Web
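When the or condition tests several values of the same variable, the %in% operator is a common shorthand; a sketch equivalent to the filter above:
# Same result as Race == "Black" | Race == "White"
TwoRaces2 <- filter(customers, Race %in% c("Black", "White"))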
  • The str_detect() function is used to detect the presence or absence of a pattern (regular expression) in a string or vector of strings. It returns a logical vector indicating whether the pattern was found in each element of the input vector.
  • Using str_detect() inside filter() allows you to pull observations based on the inclusion of a string pattern.
Birthday1985 <- filter(customers, str_detect(BirthDate, "1985"))

Select

  • In R, the select() function is part of the dplyr package, which is used for data manipulation. The select() function is specifically designed to subset or choose specific columns from a data frame. It allows you to select variables (columns) by their names or indices.
  • The statement below selects the Income, Spending, and Orders variables from the customers dataset and forms them into a new dataset called smallData. (An identical request written with the piping operator appears in the next section.)
smallData <- select(customers, Income, Spending, Orders)
head(smallData)
  Income Spending Orders
1  53000      241      3
2  94000      843     12
3  64000      719      9
4  60000      582     13
5  47000      845      7
6  67000      452      9
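select() also accepts column ranges and negation, which is handy with wide datasets; a brief sketch (not run in the lesson):
# Columns Income through Orders, selected as a range
rangeData <- select(customers, Income:Orders)
# Every column except CustID
noID <- select(customers, -CustID)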

Piping (Chaining) Operator

  • The pipe operator takes the output of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side. This enables you to chain multiple functions together, making the code easier to understand and debug.
  • If we want to keep our code tidy, we can add the piping operator (%>%) to help combine our lines of code into a new object or overwriting the same object.
  • This operator allows us to pass the result of one function/argument to the other one in sequence.
  • The below example uses a select() function to pull Income, Spending, and Orders variables from the customers dataset and save it as a new object called smallData. It is an identical request to the one directly above, but written with the piping operator.
smallData <- customers %>% select(Income, Spending, Orders)
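The pipe pays off most when several verbs are chained; a sketch combining filter(), select(), and arrange() in one readable sequence:
# College customers, three columns, sorted by Income from high to low
collegeSmall <- customers %>%
     filter(College == "Yes") %>%
     select(Income, Spending, Orders) %>%
     arrange(desc(Income))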

Counting

  • Counting allows us to gain a better understanding and insights into the data.

  • This helps to verify that the data set is complete or determine if there are missing values.

  • In R, the length() function returns the number of elements in a vector, list, or any other object with a length attribute. It essentially counts the number of elements in the specified object.

#Gives the length of Industry
length(gig$Industry)
[1] 604
  • For counting using the tidyverse, we typically use the filter() and count() functions together: filter by a value or condition, then count the filtered rows.
  • In the function below, I use the piping operator to link together the filter and count functions into one command.
  • Note that we need a piping operator (%>%) before each new function that is part of the chunk.
# Counting with a Categorical Variable
# Here we are filtering by Automotive Industry and then counting the number and saving it in a new object called countAuto
countAuto <- gig %>%
     filter(Industry=="Automotive") %>%
     count()
countAuto #190
    n
1 190
  • Below, we are filtering by Wage and then counting.
# Counting with a Numerical Variable
# We could also save this in an object. 
gig %>%
  filter(Wage > 30) %>%
  count() ##536
    n
1 536
  • We learned that there are 190 employees in the automotive industry and there are 536 employees who earn more than $30 per hour.

  • We could also calculate the number of people with wages less than or equal to 30.

#We find 68 Wages under or equal to 30
WageLess30 <- gig %>%
  filter(Wage <= 30) %>%
  count() #
WageLess30
   n
1 68
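As a shortcut, count() can also take a variable directly, tallying every level at once without a separate filter step; a brief sketch:
# Frequency of each Industry level, including NAs
gig %>% count(Industry)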
  • How many Accountants are there in the Job column of the gig data set? Use filter() and count() to calculate this answer, which is shown below.
   n
1 83

Handling Missing Data

  • Missing data is a common issue in data analysis and can arise for various reasons, such as data collection errors, non-responses in surveys, or data corruption. In R, handling missing data is crucial to ensure accurate and reliable analysis. Missing values are typically represented by NA (Not Available) in R.

  • Missing data needs to be closely evaluated within each variable to verify whether a value is truly blank, represents a non-response, or is marked with a character value such as the text N/A.

  • Missing data needs to be closely evaluated to see if the missing value is meaningful or not. If the variable that has many missing values is deemed unimportant or can be represented using a proxy variable that does not have missing values, the variable may be excluded from the analysis.

  • After a data set is loaded, there are three common strategies for dealing with missing values.

  1. The omission strategy recommends that observations with missing values be excluded from subsequent analysis.

  2. The imputation strategy recommends that the missing values be replaced with some reasonable imputed values. For example, imputing missing values using techniques like mean/median substitution or regression models can be considered.

    • Numeric variables: replace with the average.
    • Categorical variables: replace with the predominant category.
  3. Ignore your missing data if the function works without it.

    • When you ignore missing data because your function works without it, the missing values are typically excluded from the calculations by default. In R, many functions, such as mean(), sum(), or lm(), have arguments like na.rm = TRUE to explicitly remove missing values during computation. Ignoring missing data can simplify the analysis, but it comes with potential consequences.
  • The choice of approach depends on the nature of the missingness, which can be categorized as Missing Completely at Random, Missing at Random, or Missing Not at Random. Addressing missing data appropriately is essential to maintain the validity of statistical analyses and avoid biases.
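A minimal sketch of the imputation strategy (strategy 2 above) applied to a hypothetical numeric vector:
# Hypothetical vector with one missing value
v <- c(10, 20, NA, 40)
# Replace the NA with the mean of the observed values
v[is.na(v)] <- mean(v, na.rm = TRUE)
v
[1] 10.00000 20.00000 23.33333 40.00000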

Limitations of Using a Missing Data Technique

  • The techniques listed above (omission, imputation, and ignoring) all have limitations, which is why closer evaluation of missing data is recommended:

  • Reduction in Sample Size: Ignoring missing data leads to a smaller effective sample size, which may reduce the power of your analysis and the reliability of the results.

  • Bias: If the missing data are not Missing Completely at Random, ignoring them may introduce bias. For example, if specific groups or patterns are overrepresented in the remaining data, the results may not generalize to the full dataset.

  • Distorted Metrics: Calculations that ignore missing values (e.g., averages, sums) might not reflect the true population parameters, especially if the missing data are systematically different from the observed data. In addition, if a large number of values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.

  • Incorrect Inferences: Ignoring missing data without considering its nature could lead to incorrect conclusions, as the analysis only reflects the subset of available data.

  • Consider a dataset used to predict factors that lead to intubation due to COVID-19. Suppose one variable, “Number of pregnancies,” contains missing data (NAs) for all men, as the question is not applicable to them. If we were to compare this variable with another, “Intubated due to COVID-19: Yes/No,” simply omitting the rows with blanks (NAs) could lead to the exclusion of an entire gender, distorting the analysis. In this case, a different approach to handling missing data would be more appropriate to ensure the dataset remains representative. Additionally, if a value is not blank but is considered missing for analysis purposes, the data should be consistently processed (e.g., mutated or recoded) to align with the chosen technique for handling true missing values.

na.rm

  • The na.rm parameter in R is a convenient way to handle missing values (NA) within functions that perform calculations on datasets. The parameter stands for “NA remove” and, when set to TRUE, instructs the function to exclude missing values from the computation.
y <- c(1, 2, NA, 3, 4, NA)
# These lines run, but do not give you anything useful.
sum(y) 
[1] NA
mean(y)
[1] NA
  • Many functions in R include parameters that will ignore NAs for you.

  • sum() and mean() are examples of this, and most summary statistics like median() and var() and max() also use the na.rm parameter to ignore the NAs. Always check the help to determine if na.rm is a parameter.

sum(y, na.rm=TRUE) 
[1] 10
mean(y, na.rm=TRUE)
[1] 2.5
# na.omit() removes the NAs from a vector or dataset. Here, it removes the NAs from the vector.
y <- na.omit(y) 

na.rm with a Dataset

  • With a dataset, we use the data$variable format: mean(data$column, na.rm = TRUE) calculates the mean of the non-missing values in the specified column. This approach is straightforward and useful when the presence of missing values would otherwise cause an error or return an NA result.
summary(gig)
   EmployeeID         Wage               Industry           Job     
 Min.   :  1.0   Min.   :24.28   Automotive  :190   Engineer  :191  
 1st Qu.:151.8   1st Qu.:34.19   Construction:366   Other     : 88  
 Median :302.5   Median :41.88   Tech        : 38   Accountant: 83  
 Mean   :302.5   Mean   :40.08   NA's        : 10   Programmer: 80  
 3rd Qu.:453.2   3rd Qu.:45.87                      Sales Rep : 77  
 Max.   :604.0   Max.   :51.00                      (Other)   : 69  
                                                    NA's      : 16  
mean(gig$Wage, na.rm=TRUE)
[1] 40.0828

is.na()

  • In R, the is.na() function is used to check for missing (NA) values in objects like vectors, data frames, or arrays. It returns a logical vector of the same length as the input object, where TRUE indicates a missing value and FALSE indicates a non-missing value.
#Counts the number of all NA values in the entire dataset
CountAllBlanks <- sum(is.na(gig)); CountAllBlanks 
[1] 26
#Gives the row numbers of the observations that have NA values in the Industry field
which(is.na(gig$Industry))
 [1]  24 139 361 378 441 446 479 500 531 565
#Produces a dataset with observations that have NA values in the Industry field. 
ShowBlankObservations <- gig %>%
     filter(is.na(Industry))
ShowBlankObservations
   EmployeeID  Wage Industry        Job
1          24 42.58     <NA>  Sales Rep
2         139 42.18     <NA>   Engineer
3         361 31.33     <NA>      Other
4         378 48.09     <NA>      Other
5         441 32.35     <NA> Accountant
6         446 30.76     <NA> Accountant
7         479 42.85     <NA> Consultant
8         500 43.13     <NA>  Sales Rep
9         531 43.13     <NA>   Engineer
10        565 38.98     <NA> Accountant
#Counts the number of observations that have NA values in the Industry field. Industry is categorical, so we can count values based on it. 
CountBlanks <- sum(is.na(gig$Industry)); CountBlanks 
[1] 10
#Counts the number of observations that have NA values in the Wage field. 
CountBlanks <- sum(is.na(gig$Wage)); CountBlanks 
[1] 0

na_if()

  • The na_if() function in dplyr is used to replace specific values in a column with NA (missing) values. This function is particularly useful when you want to standardize missing values across a dataset or replace certain values with NA for further data processing.
TurnNA <- gig %>%
     mutate(Job = na_if(Job, "Other"))
head(TurnNA)
  EmployeeID  Wage     Industry        Job
1          1 32.81 Construction    Analyst
2          2 46.00   Automotive   Engineer
3          3 43.13 Construction  Sales Rep
4          4 48.09   Automotive       <NA>
5          5 43.62   Automotive Accountant
6          6 46.98 Construction   Engineer

na.omit() vs. drop_na()

  • Both functions return a new object with the rows containing missing values removed.

  • na.omit() is a base R function, so it doesn’t require any additional package installation, whereas drop_na() requires loading the tidyr package, which is part of the tidyverse ecosystem.

  • drop_na() fits well into tidyverse pipelines, making it easy to integrate with other tidyverse functions, whereas na.omit() can also be used in pipelines but might require additional steps to fit seamlessly.

#install.packages("Amelia")
library(Amelia)
data("africa")
summary(africa)
      year              country       gdp_pc            infl        
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.400  
 1st Qu.:1977   Burundi     :20   1st Qu.: 513.8   1st Qu.:  4.760  
 Median :1982   Cameroon    :20   Median :1035.5   Median :  8.725  
 Mean   :1982   Congo       :20   Mean   :1058.4   Mean   : 12.753  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1244.8   3rd Qu.: 13.560  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.890  
                                  NA's   :2                         
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4332190  
 Median : 59.59   Median :0.1667   Median : 5853565  
 Mean   : 62.60   Mean   :0.2889   Mean   : 5765594  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7355000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390  
 NA's   :5                                           
summary(africa$gdp_pc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  376.0   513.8  1035.5  1058.4  1244.8  2723.0       2 
summary(africa$trade)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  24.35   38.52   59.59   62.60   81.16  134.11       5 
africa1 <- na.omit(africa)
summary(africa1)
      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390  
##to drop all at once. 
africa2 <- africa %>% drop_na()
summary(africa2)
      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390  
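Unlike na.omit(), drop_na() can also target specific columns, dropping only the rows that are missing in those columns; a brief sketch (output not shown):
# Drop only rows where gdp_pc is NA; rows missing trade are kept
africa3 <- africa %>% drop_na(gdp_pc)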
  • Now you try: load the airquality dataset from base R and look at a summary of the dataset.

    • Sum the number of NAs in airquality.
    • Omit all the NAs from airquality and save it in a new data object called airqual and take a new summary of it.
         Ozone           Solar.R           Wind             Temp      
     Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
     1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
     Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
     Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
     3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
     Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
     NA's   :37       NA's   :7                                       
         Month            Day      
     Min.   :5.000   Min.   : 1.0  
     1st Qu.:6.000   1st Qu.: 8.0  
     Median :7.000   Median :16.0  
     Mean   :6.993   Mean   :15.8  
     3rd Qu.:8.000   3rd Qu.:23.0  
     Max.   :9.000   Max.   :31.0  
    
    [1] 44
         Ozone          Solar.R           Wind            Temp      
     Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
     1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
     Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
     Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
     3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
     Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
         Month            Day       
     Min.   :5.000   Min.   : 1.00  
     1st Qu.:6.000   1st Qu.: 9.00  
     Median :7.000   Median :16.00  
     Mean   :7.216   Mean   :15.95  
     3rd Qu.:9.000   3rd Qu.:22.50  
     Max.   :9.000   Max.   :31.00  

Summarize

  • The summarize() command is used to create summary statistics for groups of observations in a data frame.
  • In R, summary() and summarize() serve different purposes. summary() is part of base R and gives a quick overview of data, returning descriptive statistics for each column. For example, summary(mtcars) provides the min, max, median, and mean for numeric columns and counts for factors. It’s useful for a broad snapshot of your dataset.
  • In contrast, summarize() (or summarise()) is from the dplyr package and allows for custom summaries. For instance, mtcars %>% summarize(avg_mpg = mean(mpg), max_hp = max(hp)) returns the average miles per gallon and the maximum horsepower. It’s more flexible and is often used with group_by() for grouped calculations. In conclusion, summary() gives automatic overviews, while summarize() is better for tailored summaries.
  • In the example below, we can summarize more than one thing into tidy output.
gig %>%
     drop_na() %>% 
     summarize(mean.wage = mean(Wage),
               sd.wage = sd(Wage),
               var.wage = var(Wage),
               med.wage = median(Wage),
               iqr.wage = IQR(Wage))
  mean.wage  sd.wage var.wage med.wage iqr.wage
1  40.14567 7.047058 49.66103    41.82   11.465

Group_by

  • group_by() is used for grouping data by one or more variables. When you use group_by() on a data frame, it doesn’t actually perform any computations immediately. Instead, it sets up the data frame in such a way that any subsequent operations are performed within these groups.
  • summarize() is often used in combination with group_by() to calculate summary statistics within groups.
##summarize data by Industry variable. 
groupedData <- gig %>%
     group_by(Industry) %>%
     summarize(meanWage = mean(Wage))
groupedData
# A tibble: 4 × 2
  Industry     meanWage
  <fct>           <dbl>
1 Automotive       43.4
2 Construction     38.3
3 Tech             40.7
4 <NA>             39.5
##same function with na's dropped. 
groupedData <- gig %>%
     drop_na() %>%
     group_by(Industry) %>%
     summarize(meanWage = mean(Wage))
groupedData
# A tibble: 3 × 2
  Industry     meanWage
  <fct>           <dbl>
1 Automotive       43.4
2 Construction     38.4
3 Tech             40.7
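group_by() supports several summaries at once, and n() counts the rows in each group; a sketch building on the pipeline above (output not shown):
gig %>%
     drop_na() %>%
     group_by(Industry) %>%
     summarize(meanWage = mean(Wage),
               sdWage = sd(Wage),
               employees = n())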

case_when()

One of the most common data preparation tasks is creating or transforming variables based on conditions — for example, converting a numeric score into a letter grade, or grouping continuous values into labeled categories. The case_when() function from the dplyr package is used to create new variables based on multiple conditions. It works like a series of if…else if…else statements and is especially useful for assigning values based on ranges or logical categories.

  • The case_when() and recode() functions in R both allow you to transform or recode values in a vector, but they differ in flexibility and use cases. recode() is best for straightforward value replacement, where each input value maps directly to a new value, such as turning "good" into 3.
  • In contrast, case_when() is more flexible and powerful, allowing you to specify complex conditional logic, including comparisons and ranges (e.g., x > 100 ~ "High"). case_when() evaluates conditions in order and is particularly useful when the transformation depends on logical expressions rather than exact matching. For categorical data that maps cleanly one-to-one, recode() is convenient; for anything requiring conditional logic, case_when() is preferred.

  • Recall the recode() approach to the ordinal evaluate variable:
evaluate <- c("excellent", "good", "fair", "poor", "excellent", "good")
data <- data.frame(evaluate)

dataRecode <- data %>%
     mutate(evaluate = recode(evaluate,
            "excellent" = 4,
            "good" = 3,
            "fair" = 2,
            "poor" = 1))
  • If we alter this to a case_when, we would include the following.
dataCase_When <- data %>%
  mutate(evaluate = case_when(
    evaluate == "excellent" ~ 4,
    evaluate == "good" ~ 3,
    evaluate == "fair" ~ 2,
    evaluate == "poor" ~ 1,
    TRUE ~ NA_real_  # for safety in case of unexpected values
  ))
dataCase_When
  evaluate
1        4
2        3
3        2
4        1
5        4
6        3
  • If we were to give a range, we could do that with case_when using the >, < or >= or <= signs. An example is below.
# Sample vector of numbers
score <- c(9, 6, 3, 8, 5, 10, 2)

# Categorize using case_when
category <- case_when(
  score > 8        ~ "High",
  score >= 5       ~ "Medium",
  score < 5        ~ "Low"
)

# Combine into a data frame to view
df <- data.frame(score, category); df
  score category
1     9     High
2     6   Medium
3     3      Low
4     8   Medium
5     5   Medium
6    10     High
7     2      Low

With the recoding tools covered — recode() for direct value replacement and case_when() for conditional logic — we now move to the core dplyr verbs for manipulating full datasets. These functions form the backbone of day-to-day data preparation in R.

Mutate

  • mutate() is part of the dplyr package, which is used for data manipulation. The mutate() function is specifically designed to create new variables (columns) or modify existing variables in a data frame. It is commonly used in data wrangling tasks to add calculated columns or transform existing ones.
  • One example is below, but note that there are many things you can do with the mutate function.
#making a new variable called calculation that multiplies gdp_pc by infl variables in the africa1 dataset. 
africa.mutated <- mutate(africa1, calculation = gdp_pc * infl)
head(africa.mutated)
  year      country gdp_pc  infl trade    civlib population calculation
1 1972 Burkina Faso    377 -2.92 29.69 0.5000000    5848380    -1100.84
2 1973 Burkina Faso    376  7.60 31.31 0.5000000    5958700     2857.60
3 1974 Burkina Faso    393  8.72 35.22 0.3333333    6075700     3426.96
4 1975 Burkina Faso    416 18.76 40.11 0.3333333    6202000     7804.16
5 1976 Burkina Faso    435 -8.40 37.76 0.5000000    6341030    -3654.00
6 1977 Burkina Faso    448 29.99 41.11 0.6666667    6486870    13435.52
  • Below is an example with the iris dataset, which is part of base R.
data("iris")
##Selecting 2 variables from the iris dataset: Sepal.Length and Petal.Length
selected_data <-  select(iris, Sepal.Length, Petal.Length)
head(selected_data)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7
# Filter rows based on a condition: Species = setosa
filtered_data <-  filter(iris, Species == "setosa")
head(filtered_data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# Arrange rows by the Sepal.Length column
arranged_data <-  arrange(iris, Sepal.Length)
# Create a new column by mutating the data by transforming Petal.Width to the log form. 
mutated_data <- mutate(iris, Petal.Width_Log = log(Petal.Width))
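The same verbs can be chained into a single pipeline; a sketch, assuming the goal is the setosa rows with a log column added (output not shown):
iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Petal.Length, Petal.Width) %>%
  mutate(Petal.Width_Log = log(Petal.Width)) %>%
  arrange(Sepal.Length) %>%
  head()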

Advanced Visualization

Single Variable Visualization with Data Prep

The histogram, density plot, and boxplot covered in the Descriptive Statistics lesson worked with clean, ready-to-plot data. The examples here show the more common real-world workflow: a variable arrives with numeric codes, missing value sentinels, or incorrect data types that must be fixed before the chart will make sense. Cleaning comes first — visualization follows.

  • Let’s start an example from scratch using a real dataset. We examine the AUQ300 variable from the NHANES survey, which represents gun use.
  • We load the full dataset and clean all the relevant variables at once so the same nhanes.clean object can be used throughout this section and in the full example below.
nhanes <- read.csv("data/nhanes2012.csv")
head(nhanes)
summary(nhanes$AUQ300) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   2.000   1.656   2.000   7.000    4689 

AUQ300

Recode Variable if Needed

  • AUQ300 needs to be a factor variable with 1 equaling Yes and 2 equaling No. We use recode_factor() inside mutate().
  • recode_factor() transforms the levels of a categorical variable into a new set of labels.
  • recode() is generic and applies to numeric, categorical, or text data.
  • We select and recode all six variables of interest in a single pipeline so the object is ready for both the bar chart examples below and the full multi-variable example later.
nhanes.clean <- nhanes %>%
  dplyr::select(AUQ300, AUQ310, AUQ320, AUQ060, AUQ070, AUQ080) %>%
  mutate(AUQ300 = recode_factor(AUQ300, '1' = 'Yes', '2' = 'No')) %>%
  mutate(AUQ310 = recode_factor(AUQ310,
           '1' = "1 to less than 100",
           '2' = "100 to less than 1000",
           '3' = "1000 to less than 10k",
           '4' = "10k to less than 50k",
           '5' = "50k or more",
           '7' = "Refused",
           '9' = "Don't know")) %>%
  mutate(AUQ060 = recode_factor(AUQ060, '1' = 'Yes', '2' = 'No')) %>%
  mutate(AUQ070 = recode_factor(AUQ070, '1' = 'Yes', '2' = 'No')) %>%
  mutate(AUQ080 = recode_factor(AUQ080, '1' = 'Yes', '2' = 'No')) %>%
  mutate(AUQ320 = recode_factor(AUQ320,
           '1' = 'Always', '2' = 'Usually',
           '3' = 'About half the time', '4' = 'Seldom', '5' = 'Never'))
summary(nhanes.clean)
  AUQ300                       AUQ310                     AUQ320    
 Yes :1613   1 to less than 100   : 701   Always             : 583  
 No  :3061   100 to less than 1000: 423   Usually            : 152  
 NA's:4690   1000 to less than 10k: 291   About half the time: 123  
             10k to less than 50k : 106   Seldom             : 110  
             50k or more          :  66   Never              : 642  
             Don't know           :  26   NA's               :7754  
             NA's                 :7751                             
  AUQ060      AUQ070      AUQ080    
 Yes :2128   Yes : 564   Yes : 159  
 No  : 745   No  : 210   No  :  53  
 NA's:6491   NA's:8590   NA's:9152  
                                    
                                    
                                    
                                    

Get Bar Roughly Plotted

# Without piping operator
ggplot(nhanes.clean, aes(x = AUQ300)) + geom_bar()

Bar Graph of AUQ300 Variable (Gun Use) Generated by R

#With piping operator
nhanes.clean %>%
  ggplot(aes(x = AUQ300)) + geom_bar()

Bar Graph of AUQ300 Variable (Gun Use) Generated by R

Add Functions to Clean Chart

  • Use drop_na() to remove missing values from the variable before plotting so they do not appear as a category.
  • Add axis labels with labs(x = ..., y = ...).
nhanes.clean %>%
  drop_na(AUQ300) %>%
  ggplot(aes(x = AUQ300)) + geom_bar() +
  labs(x = "Gun use", y = "Number of participants")

Bar Graph of AUQ300 Variable with labels

  • From the bar graph, we can see that almost double the number of people have not fired a firearm for any reason compared to those who have.

Adding Color

  • When fill is mapped to a variable inside aes(), ggplot assigns a distinct color to each category automatically.
nhanes.clean %>%
  drop_na(AUQ300) %>%
  ggplot(aes(AUQ300, fill=AUQ300)) +
  geom_bar() +
  labs(x = "Gun use", y = "Number of participants", 
       subtitle = "Filled inside the aes()") 

Bar Graph of AUQ300 Variable with color

Data Prep and Then Visualized

This example walks through cleaning the gss.2016 dataset and then plotting the result, showing how data preparation and visualization connect in practice.

gss.2016 <- read.csv(file = "data/gss2016.csv")

The grass variable captures whether respondents believe marijuana should be legal, but it arrives as a character and contains placeholder codes ("DK", "IAP") that need to be converted to NA. The age variable has the value "89 OR OLDER" which prevents numeric coercion. We handle all of this in a single pipeline and create an age category variable at the end.

gss.2016.cleaned <- gss.2016 %>%
  mutate(grass = as.factor(grass)) %>%
  mutate(grass = na_if(x = grass, y = "DK")) %>%
  mutate(grass = na_if(x = grass, y = "IAP")) %>%
  mutate(grass = droplevels(x = grass)) %>%
  mutate(age = recode(age, "89 OR OLDER" = "89")) %>%
  mutate(age = as.numeric(x = age)) %>%
  mutate(age.cat = as.factor(case_when(
       age < 30 ~ "< 30",
       age >= 30 & age <= 59 ~ "30 - 59",
       age >= 60 & age <= 74 ~ "60 - 74",
       age >= 75 ~ "75+",
       TRUE ~ NA_character_
  )))
summary(gss.2016.cleaned)
       grass           age           age.cat    
 LEGAL    :1126   Min.   :18.00   < 30   : 481  
 NOT LEGAL: 717   1st Qu.:34.00   30 - 59:1517  
 NA's     :1024   Median :49.00   60 - 74: 598  
                  Mean   :49.16   75+    : 261  
                  3rd Qu.:62.00   NA's   :  10  
                  Max.   :89.00                 
                  NA's   :10                    

With the data cleaned, we can plot directly. geom_bar() counts observations automatically — no need to pre-compute frequencies. drop_na() removes missing values so they don’t appear as a bar category.

ggplot(gss.2016.cleaned, aes(grass)) + geom_bar()

gss.2016.cleaned %>% 
     drop_na() %>%
     ggplot(aes(grass)) + geom_bar(fill=c("red", "blue")) + 
     labs(x = "Should marijuana be legal", y="Frequency of Responses")

Bar Graph Generated by R

Edit The Graphic

  • We can expand to include the age.cat variable on the x axis, with bars filled by grass category.
gss.2016.cleaned %>% 
     drop_na() %>%
     ggplot(aes(age.cat, fill=grass)) + geom_bar() + labs(x="Age Category", y="Frequency of responses")

  • Add position = "dodge" to place the bars side by side (grouped) rather than stacked.
gss.2016.cleaned %>% 
     drop_na() %>%
     ggplot(aes(age.cat, fill=grass)) + geom_bar(position="dodge") + labs(x="Age Category", y="Frequency of responses")

  • We can further edit to show percentages on the y axis using after_stat(count).
gss.2016.cleaned %>% 
     drop_na() %>%
     ggplot(aes(age.cat, y = 100*(after_stat(count))/sum(after_stat(count)), 
                fill=grass)) + 
     geom_bar(position = 'dodge')+  
     theme_minimal()+ 
     labs(x = "Age Category",y = "Percent of responses")

Grouped Bar Graph Generated by R

Multivariable Data Visualization

Now that we can clean and prepare data, we can start to visualize it. The goals of this section are to explore patterns based on groups or between two or more variables. Visualization is one of the most important steps in any analysis — it helps you understand your data quickly, catch problems early, and communicate findings clearly.

Histograms, density plots, and boxplots for a single continuous variable were covered in the Descriptive Statistics lesson, and single-variable bar charts requiring data preparation are in the section above. This section focuses entirely on two-variable charts: grouped and stacked bar charts for two categorical variables, boxplots across groups, and scatterplots for two continuous variables.

library(tidyverse)
Note

Recall that ggplot2 builds charts in layers using + — each layer adds a geometry, label, or theme on top of the last. The same layering concepts from the Descriptive Statistics lesson apply to every chart in this section.

  • Combinations of 2 variable types for graphing:
    • Two categorical / factor variables.
    • One categorical / factor and one continuous / numeric variable.
    • Two continuous / numeric variables.

Bar Graphs for Two Categorical Variables

  • There are two formats available: Grouped and Stacked. As a preview, the frequency table below shows the mtcars counts for each combination of vs and gear; we build this table with group_by() and count() in the Grouped Bar Graph section that follows.
# A tibble: 6 × 3
# Groups:   vs, gear [6]
     vs  gear     n
  <dbl> <dbl> <int>
1     0     3    12
2     0     4     2
3     0     5     4
4     1     3     3
5     1     4    10
6     1     5     1

Grouped Bar Graph

  • A grouped bar graph allows comparison of multiple sets of data items, with a single color used to denote a specific series across all sets.
  • Use group_by() and count() to generate frequencies first, then plot with stat = "identity" and position = "dodge".
mtcars <- mtcars %>% 
    mutate(vs=as.factor(vs)) %>% 
    mutate(gear=as.factor(gear))

countsDF <- mtcars %>% 
    group_by(vs, gear) %>%
    count()
summary(countsDF)
 vs    gear        n         
 0:3   3:2   Min.   : 1.000  
 1:3   4:2   1st Qu.: 2.250  
       5:2   Median : 3.500  
             Mean   : 5.333  
             3rd Qu.: 8.500  
             Max.   :12.000  
ggplot(countsDF, aes(x = gear, y = n, fill = vs)) +
     geom_bar(stat = "identity", position = "dodge") +
     labs(title = "Grouped Car Distribution by Gears and VS",
     x = "Number of Gears", y = "Count") +
     theme_minimal()

Grouped Bar Graph Generated by R

Stacked Bar Graph

  • A stacked bar graph extends the standard bar chart to two categorical variables by dividing each bar into sub-bars, one per level of the second variable.
  • Remove the position = "dodge" argument to stack instead of group.
ggplot(countsDF, aes(x = gear, y = n, fill = vs)) +
  geom_bar(stat = "identity") +
  labs(title = "Stacked Car Distribution",
       x = "Number of Gears",
       y = "Count") +
  theme_minimal()

Stacked Bar Graph Generated from ggplot2 by R

Bar Graph for Continuous Across Groups

  • Instead of counting observations, we can display a summary statistic of a continuous variable (such as the group mean) for each group.
  • Use group_by() and summarise() to calculate the group statistic first, then pass it to geom_bar(stat = "identity").
avg_mpg <- mtcars %>%
  group_by(gear, vs) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE))
ggplot(avg_mpg, aes(gear, mpg, fill = vs)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Average MPG by VS and Gear")

Grouped Bar Graph Generated from ggplot2 by R

ggplot(avg_mpg, aes(gear, mpg, fill = vs)) +
     geom_bar(stat = "identity", position = "dodge", color="black") +  
     ggtitle("Average MPG by VS and Gear")+
     scale_fill_manual(values=c("yellow", "brown"))

Boxplot for Continuous Across Groups

  • When a grouping variable is added to a boxplot, we get one boxplot per group. This allows direct comparison of distributions across categories.
  • The categorical variable must be a factor before plotting.
mtcars %>%
  ggplot(aes(x = gear, y = mpg, fill = gear)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c("gray", "red", "blue")) +
  theme_minimal()

Boxplot Generated from ggplot2 by R

mtcars %>%
     ggplot(aes(x = vs, y = mpg, fill = vs)) +
     geom_boxplot(show.legend = FALSE) +
     scale_fill_manual(values = c("gray", "red")) +
     theme_minimal() 

Scatterplot for Two Continuous Variables

  • A scatterplot is used to determine if two continuous variables are related.
    • Each point is a pairing: \((x_1, y_1), (x_2, y_2),\) etc.
  • Our goal with a scatterplot is to characterize the relationship visually: positive, negative, or nonexistent.

Scatterplot Results

  • Let’s work through a clean example examining the relationship between income and years of education.
Edu <- read.csv("data/education.csv")
plot(Edu$Income ~ Edu$Education, ylab = "Income", xlab = "Education")

Scatterplot Generated by R

  • Working with ggplot:
    • Layer 1: ggplot() with aes() pointing to x and y variables.
    • Layer 2: geom_point() to add the observation points.
    • Additional layers: labs(), geom_smooth(), and others.
ggplot(Edu, aes(x=Education, y=Income)) +
     geom_point() +
     labs(y= "Income", x = "Education") 

Scatterplot Generated from ggplot2 by R

  • We can add a trendline using geom_smooth(method = "lm") to fit a linear regression line and visualize the direction of the relationship.
ggplot(Edu, aes(x=Education, y=Income)) +
    geom_point() +
    labs(y= "Income", x = "Education") +
    geom_smooth(method="lm", color="#789F90")

Scatterplot With geom_smooth lm method Generated from ggplot2 by R

  • Let’s look at a few more examples using mtcars and see if the relationship is positive, negative, or nonexistent.
ggplot(mtcars, aes(x=disp, y=mpg))+ 
  geom_point() + 
  geom_smooth(method="lm", color="#789F90")

Scatterplot examples in R

ggplot(mtcars, aes(x=hp, y=mpg))+ geom_point() +
  geom_smooth(method="lm", color="#789F90")

Scatterplot examples in R

ggplot(mtcars, aes(x=qsec, y=mpg))+geom_point() + geom_smooth(method="lm", color="#789F90")

Scatterplot examples in R

  • When one of the variables is categorical rather than continuous, a boxplot is more appropriate than a scatterplot. If cyl is treated as numeric, only one boxplot appears. Converting it to a factor fixes this.
ggplot(mtcars, aes(cyl, mpg)) + geom_point() + geom_smooth(method="lm")

ggplot2 Example with Incorrect Data Type

ggplot(mtcars, aes(cyl, mpg)) + geom_boxplot()

ggplot2 Example with Boxplot

mtcars <- mtcars %>% 
     mutate(cyl = as.factor(cyl))
ggplot(mtcars, aes(cyl, mpg)) + geom_boxplot()

Full Example

The following examples tie together data preparation and visualization using two real public health datasets. Each one requires cleaning before plotting — the nhanes dataset uses numeric codes that need recoding, and the BRFSS dataset has sentinel values that must be converted to NA before the distributions make sense.

nhanes Dataset example

  • The nhanes dataset includes auditory health variables alongside gun use variables, making it an interesting case for exploring relationships between two categorical variables.

  • Variables: AUQ060/070/080 measure ability to hear across a room; AUQ300/310/320 relate to firearm use and hearing protection.

  • nhanes.clean was fully prepared in the Bar Graph with Data Wrangling section above, with all six variables recoded and ready to use.

  • NA values are left in and handled on a chart-by-chart basis using drop_na() so we do not unnecessarily reduce the dataset.

nhanes.clean %>% 
  drop_na(AUQ310) %>% drop_na(AUQ060) %>% 
  ggplot(aes(x=AUQ310, fill=AUQ060)) + geom_bar(position="dodge") + 
  labs(x="How many rounds fired", title="Hearing Whisper vs. Rounds Fired", y="Frequency")

nhanes.clean %>% 
  drop_na(AUQ320) %>% drop_na(AUQ060) %>% 
  ggplot(aes(x=AUQ320, fill=AUQ060)) + geom_bar(position="dodge") + 
  labs(x="Wear hearing protection", title="Hearing Whisper vs. Hearing Protection Use", y="Frequency")

  • Try running the other variable combinations on your own to see what patterns you can find.

brfss Dataset Example

The BRFSS (Behavioral Risk Factor Surveillance System) dataset illustrates how data preparation and visualization come together. The qualitative variable TRNSGNDR requires coercion and recoding before it can be plotted; the continuous variable PHYSHLTH needs its sentinel codes cleaned before analysis. This example ties together the coercion, mutate(), and visualization skills from this lesson.

  • The full codebook from which these variable codings are taken is brfss_2014_codebook.pdf.

Evaluate the Codebook Before Making Decisions
brfss <- read.csv("data/brfss.csv")
summary(brfss)
    TRNSGNDR        X_AGEG5YR          X_RACE         X_INCOMG    
 Min.   :1.000    Min.   : 1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:4.000    1st Qu.: 5.000   1st Qu.:1.000   1st Qu.:3.000  
 Median :4.000    Median : 8.000   Median :1.000   Median :5.000  
 Mean   :4.059    Mean   : 7.822   Mean   :1.992   Mean   :4.481  
 3rd Qu.:4.000    3rd Qu.:10.000   3rd Qu.:1.000   3rd Qu.:5.000  
 Max.   :9.000    Max.   :14.000   Max.   :9.000   Max.   :9.000  
 NA's   :310602                    NA's   :94                     
    X_EDUCAG        HLTHPLN1         HADMAM          X_AGE80     
 Min.   :1.000   Min.   :1.000   Min.   :1.000    Min.   :18.00  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000    1st Qu.:44.00  
 Median :3.000   Median :1.000   Median :1.000    Median :58.00  
 Mean   :2.966   Mean   :1.108   Mean   :1.215    Mean   :55.49  
 3rd Qu.:4.000   3rd Qu.:1.000   3rd Qu.:1.000    3rd Qu.:69.00  
 Max.   :9.000   Max.   :9.000   Max.   :9.000    Max.   :80.00  
                                 NA's   :208322                  
    PHYSHLTH   
 Min.   : 1.0  
 1st Qu.:20.0  
 Median :88.0  
 Mean   :61.2  
 3rd Qu.:88.0  
 Max.   :99.0  
 NA's   :4     

Qualitative Variable

  • As an example, the analysis below examines how gender identity is reported, a healthcare measurement issue where definitions and codings vary. The variable comes from the 2014 Behavioral Risk Factor Surveillance System (brfss) dataset, which includes many other variables besides reported gender.
#Summarize the TRNSGNDR variable
summary(object = brfss$TRNSGNDR) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   4.000   4.000   4.059   4.000   9.000  310602 
#Find frequencies 
table(brfss$TRNSGNDR) 

     1      2      3      4      7      9 
   363    212    116 150765   1138   1468 
  • Since this table of raw numeric codes is not very informative, we need to recode the values with meaningful labels.
  • Check the class of the variable to see the issue with analyzing it as a categorical variable.
class(brfss$TRNSGNDR)
[1] "integer"
  • Because TRNSGNDR is stored as an integer, it needs to be converted to a factor before recoding. The mutate() pipeline below handles the coercion and recoding in a single step.
brfss.cleaned <- brfss %>% 
  mutate(TRNSGNDR = as.factor(TRNSGNDR)) %>%
  mutate(TRNSGNDR = recode_factor(TRNSGNDR,
      '1' = 'Male to female',
      '2' = 'Female to male',
      '3' = 'Gender non-conforming',
      '4' = 'Not transgender',
      '7' = 'Not sure',
      '9' = 'Refused'))
  • We can use the levels() command to show the factor levels made with the mutate() command above.
levels(brfss.cleaned$TRNSGNDR)
[1] "Male to female"        "Female to male"        "Gender non-conforming"
[4] "Not transgender"       "Not sure"              "Refused"              
  • Check the summary.
summary(brfss.cleaned$TRNSGNDR)
       Male to female        Female to male Gender non-conforming 
                  363                   212                   116 
      Not transgender              Not sure               Refused 
               150765                  1138                  1468 
                 NA's 
               310602 
  • Take a good look at the frequencies in the output above. The largest category is the NA’s, followed by “Not transgender”. After removing the NA’s, the “Not transgender” category accounts for over 97% of the remaining observations.
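A quick way to verify that percentage, as a minimal sketch using base R on the cleaned dataset:

# table() drops NA's by default, so these are percentages of non-missing responses
round(100 * prop.table(table(brfss.cleaned$TRNSGNDR)), 2)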

Quantitative Variable

  • Let’s use the cleaned dataset to make more changes to the continuous variable PHYSHLTH. According to the codebook, only the first two codings carry usable information: values 1-30 record the number of days of physical illness or injury, and 88 means 0 such days.
    • Before we can get an accurate plot, the variable needs a little more preparation.
    • Specifically, we need to null out the 77 and 99 placeholder codes and recode the 88 values to the numeric value 0, for 0 days of illness and injury.
brfss.cleaned <- brfss %>% 
  mutate(TRNSGNDR = recode_factor(TRNSGNDR,
      '1' = 'Male to female',
      '2' = 'Female to male',
      '3' = 'Gender non-conforming',
      '4' = 'Not transgender',
      '7' = 'Not sure',
      '9' = 'Refused')) %>%
  #Turn the 77 values to NA's. 
  mutate(PHYSHLTH = na_if(PHYSHLTH, y = 77)) %>%
  #Turn the 99 values to NA's. 
  mutate(PHYSHLTH = na_if(PHYSHLTH, y = 99)) %>%
  #Recode the 88 values to be numeric value of 0. 
  mutate(PHYSHLTH = recode(PHYSHLTH, '88' = 0L))

Histogram

  • The histogram (plotted below) shows that most people report between 0 and 10 unhealthy days per 30 days.

  • Next, evaluate mean, median, and mode for the PHYSHLTH variable after ignoring the blanks.

mean(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 4.224106
median(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 0
names(x = sort(x = table(brfss.cleaned$PHYSHLTH), decreasing = TRUE))[1]
[1] "0"
  • While the mean is 4.22, both the median and the most common value are 0.
## Spread to Report with the Mean
var(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 77.00419
sd(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 8.775203
## Spread to Report with the Median
summary(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   4.224   3.000  30.000   10303 
range(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1]  0 30
max(brfss.cleaned$PHYSHLTH, na.rm=TRUE)-min(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 30
IQR(brfss.cleaned$PHYSHLTH, na.rm=TRUE)
[1] 3
library(semTools)
# Plot the data
brfss.cleaned %>% 
  ggplot(aes(PHYSHLTH)) + geom_histogram()

# Calculate Skewness and Kurtosis
skew(brfss.cleaned$PHYSHLTH)
skew (g1)        se         z         p 
    2.209     0.004   607.905     0.000 
kurtosis(brfss.cleaned$PHYSHLTH)
Excess Kur (g2)              se               z               p 
          3.474           0.007         478.063           0.000 
  • The skew results give a z of 607.905 (6.079054e+02), far above the cutoff of 7 used for large samples. This indicates a clear right skew, which means the data are not normally distributed.
  • The kurtosis z of 478.063 likewise far exceeds the cutoff, so the distribution is also strongly leptokurtic (excess kurtosis of 3.474).

Review and Practice

Using AI

Use the following prompts in our chatbot below to explore data preparation and visualization with ggplot2 further.

Understanding data preparation concepts:

  • What is the difference between na.omit() and drop_na() in R? When would removing rows with missing values introduce analytical bias, and what is an alternative approach?

  • Explain the difference between filter(), select(), and mutate() in dplyr. Write an example of each using a customer dataset with columns for CustomerID, Region, Age, and AnnualSpend.

  • What does the pipe operator %>% do in R, and why does it make dplyr code more readable? Show an example of the same operation written with and without the pipe.

  • When using group_by() with summarize(), how do you decide which grouping variable to use? Give an example where the grouped result changes the business interpretation compared to the overall summary.

Prompting AI effectively for data preparation tasks:

Vague prompt: “Clean my data”
Specific prompt: “In my customer_data dataframe, coerce PurchaseDate to Date using as.Date(), recode Region from numeric (1, 2, 3) to factor with levels East, West, Central using mutate() and recode(), and use drop_na(AnnualSpend) to remove rows where the outcome variable is missing. Then run summary() and report any remaining type mismatches.”

Vague prompt: “Summarize by group”
Specific prompt: “Using group_by() and summarize(), calculate the mean, median, and count of AnnualSpend broken out by Region and Churned in customer_data. Arrange the output from highest to lowest mean spend.”

Vague prompt: “Fix missing values”
Specific prompt: “In customer_data, identify which columns have missing values using sum(is.na()) applied to each column. For AnnualSpend, impute missing values with the column median. For Region, print the rows with missing values so I can review them before deciding whether to remove or impute.”

Visualization prompts:

  • How can I modify the appearance of a ggplot bar chart to include custom colors for each bar, and what are the best practices for choosing colors in data visualization?

  • What is the role of layering in ggplot, and how can adding multiple layers, such as labels, themes, and lines, improve the readability of a plot?

  • When should a density plot be used instead of a histogram, and how does each visualization help in understanding the distribution of continuous data?

  • How can I use a boxplot in R to identify and visualize outliers in my dataset, and what additional steps should I take to handle these outliers?

  • How can I create a scatter plot in ggplot to explore relationships between two continuous variables, and how do I add a trendline to help interpret the results?

  • What are the steps for creating a grouped bar chart in ggplot, and how does this visualization help in comparing multiple categories or groups within a dataset?

Data Types and Coercion Lab

1. What R function would you use to check the data type of a column? Write the command to check the type of the Age column in a dataset called survey.

class(survey$Age)

2. A dataset has a Rating column stored as character with values "1", "2", "3". Write the code to convert it to numeric.

survey$Rating <- as.numeric(survey$Rating)
class(survey$Rating)  # confirm: "numeric"

3. An instructor evaluation column contains "excellent", "good", "fair", "poor". Write a mutate() pipeline using recode() to convert these to 4, 3, 2, 1 respectively and store them as integers.

library(tidyverse)
survey <- survey %>%
  mutate(Rating = recode(Rating,
    "excellent" = 4L,
    "good"      = 3L,
    "fair"      = 2L,
    "poor"      = 1L))

4. Load dataprep.csv as houseprices. The BuildDay column contains dates stored as character strings (e.g., "1986-08"). Use the lubridate package to convert it to a proper date type, then confirm the conversion with class().

library(lubridate)
houseprices <- read.csv("data/dataprep.csv")
houseprices$BuildDay <- ym(houseprices$BuildDay)  # "1986-08" is year-month format
class(houseprices$BuildDay)  # "Date"
head(houseprices$BuildDay)

ym() parses year-month formatted strings. After conversion, BuildDay is a proper Date column that R can sort, filter, and compute differences on correctly.

Using dplyr Lab

Use the dataprep.csv dataset loaded as houseprices.

houseprices <- read.csv("data/dataprep.csv")

1. Filter the dataset to show only houses with 5 or more bedrooms. Save the result as filtered and count the observations.

filtered <- filter(houseprices, Beds >= 5)
nrow(filtered)  # 9 houses

9 houses have 5 or more bedrooms.

2. Arrange the full dataset by Price in descending order. Save as ArrangedData and print the head. What is the highest price?

ArrangedData <- arrange(houseprices, desc(Price))
head(ArrangedData)

The highest price is $840,000 — it will appear in the first row.

3. Using a single pipeline, filter to houses with at least 3 bedrooms, select only Price, Beds, and Sqft, and arrange by Price descending.

houseprices %>%
  filter(Beds >= 3) %>%
  select(Price, Beds, Sqft) %>%
  arrange(desc(Price))

Continue using houseprices.

1. Create a new categorical variable called HouseSizeCategory using case_when():
  • Small: Sqft < 2000
  • Medium: Sqft between 2000 and 3000 (inclusive)
  • Large: Sqft > 3000

houseprices <- houseprices %>%
  mutate(HouseSizeCategory = case_when(
    Sqft < 2000             ~ "Small",
    Sqft >= 2000 & Sqft <= 3000 ~ "Medium",
    Sqft > 3000             ~ "Large"
  ))

2. Calculate the frequency of each HouseSizeCategory using table(). List the count for each category.

table(houseprices$HouseSizeCategory)

Large: 14 | Medium: 25 | Small: 40

3. Calculate the average Price and average Sqft grouped by HouseSizeCategory. Fill in the table below with your results.

houseprices %>%
  group_by(HouseSizeCategory) %>%
  summarise(avg_price = mean(Price, na.rm = TRUE),
            avg_sqft  = mean(Sqft,  na.rm = TRUE))
HouseSizeCategory avg_price avg_sqft
Large $574,786 3,370 sq ft
Medium $596,696 2,488 sq ft
Small $457,568 1,505 sq ft

Visualization Lab

For this lab, switch to the dataviz.csv coffee dataset rather than the datasets used in the teaching examples above. Load it as coffee. It contains 1,311 coffee quality reviews with variables including Country.of.Origin, Processing.Method, TotalScore, Flavor, and Aroma, among others. Use the scores vector for the outlier question.

coffee <- read.csv("data/dataviz.csv")

1. Create a bar chart showing the count of reviews by Country.of.Origin. Because many countries appear, filter to the top 6 countries first using dplyr. Give it a custom fill color and a theme_minimal(). What do you observe?

library(tidyverse)
top6 <- coffee %>%
  count(Country.of.Origin, sort = TRUE) %>%
  slice_head(n = 6) %>%
  pull(Country.of.Origin)

coffee %>%
  filter(Country.of.Origin %in% top6) %>%
  # fct_infreq() (forcats, loaded with tidyverse) orders the bars by count, largest first
  ggplot(aes(x = fct_infreq(Country.of.Origin), fill = Country.of.Origin)) +
  geom_bar(show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Coffee Reviews by Country (Top 6)",
       x = "Country", y = "Number of Reviews")

Mexico leads with 236 reviews, followed by Colombia (183) and Guatemala (181). The counts are highly uneven: a handful of countries dominate the dataset.

2. Make a density plot of TotalScore. Add a vertical dashed line at the mean. Then run skew() and kurtosis() from semTools. Is TotalScore skewed? Does the visual agree with the statistics?

# Filter out the extreme low values (data quality issue — a few scores near 0)
coffee_clean <- filter(coffee, TotalScore > 50)

ggplot(coffee_clean, aes(x = TotalScore)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  geom_vline(aes(xintercept = mean(TotalScore)), color = "red", linetype = "dashed") +
  labs(title = "Density Plot of Coffee Total Score", x = "Total Score")

library(semTools)
skew(coffee_clean$TotalScore)
kurtosis(coffee_clean$TotalScore)

TotalScore is left-skewed — the mean (≈82.2) is slightly below the median (≈82.5), and the distribution has a longer tail to the left. The density plot will show the peak shifted right with a tail extending left.

3. Make a boxplot of TotalScore (using coffee_clean). Do any outliers appear? Then use the scores vector below to manually verify outlier detection using the IQR method.

scores <- c(55, 56, 58, 60, 62, 65, 66, 67, 70, 120)
# Boxplot
ggplot(coffee_clean, aes(x = TotalScore)) +
  geom_boxplot(fill = "lightblue", color = "navy") +
  labs(title = "Boxplot of Coffee Total Score", x = "Total Score")

# Manual outlier detection on the scores vector
Q1   <- quantile(scores)[2]; Q1     # 58.50
Q3   <- quantile(scores)[4]; Q3     # 66.75
IQRv <- IQR(scores); IQRv           # 8.25
LB   <- Q1 - 1.5 * IQRv; LB        # 46.12
UB   <- Q3 + 1.5 * IQRv; UB        # 79.12
scores[scores < LB | scores > UB]   # 120 is an outlier

Q1: 58.50 | Q3: 66.75 | IQR: 8.25 | Lower bound: 46.12 | Upper bound: 79.12 | Outlier: 120

The TotalScore boxplot will show outliers at the low end — a handful of very low scores (the data quality issues) appear as individual points well below the whisker.

4. Create three different charts using the coffee dataset (or coffee_clean). Try at least one chart type not yet used in this lab. Write your three commands below and note what each reveals.

Many correct answers are possible. For example:

# Histogram of Flavor scores
ggplot(coffee_clean, aes(Flavor)) +
  geom_histogram(binwidth = 0.1, fill = "goldenrod", color = "white") +
  labs(title = "Distribution of Flavor Scores")

# Bar chart of Processing Method
coffee %>%
  filter(Processing.Method != "") %>%
  ggplot(aes(x = Processing.Method, fill = Processing.Method)) +
  geom_bar(show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Reviews by Processing Method", x = "Method", y = "Count") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Scatterplot of Aroma vs Flavor
ggplot(coffee_clean, aes(x = Aroma, y = Flavor)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Aroma vs. Flavor Rating")

Washed / Wet is by far the most common processing method. Aroma and Flavor show a strong positive relationship — coffees rated highly on aroma tend to score well on flavor too.

Interactive R Lesson: Data Preparation with dplyr

Note

You will use filter(), select(), arrange(), mutate(), the pipe operator, and geom_point() on the Auto dataset — all in the browser. Work through each section, then take the scored quiz.

First-time load: The interactive R environment may take 10–20 seconds to initialize on your first visit. Once the Run Code buttons become active, you are ready to go.

Tip

Loading data in WebR vs. RStudio: In this browser environment, datasets are accessed directly from the ISLR package using ISLR::Auto. In your own .R file in RStudio, you would instead use library(ISLR) followed by data(Auto). Both approaches give you the same dataset — the syntax differs because WebR handles packages differently than a local R session.

Part 1: Loading and Summarizing Data

We will work with the Auto dataset, which contains information on cars including mpg, horsepower, weight, acceleration, and more.

Summarize the Auto dataset:
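If you want to follow along in your own script, a minimal sketch of the command you would run:

Auto <- ISLR::Auto   # WebR-style access; in RStudio use library(ISLR) then data(Auto)
summary(Auto)        # quick overview: min, max, mean, median, quartiles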

Tip

summary() is part of base R and gives you a quick overview — min, max, mean, median, and quartiles for numeric variables. Note that the data is mostly numerical.


Part 2: Subsetting with Square Brackets

Before using dplyr functions, you can subset a data frame using [rows, columns] — base R syntax that is useful for quick slices.

Produce the first 4 rows of all columns:
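A sketch of the bracket-subsetting command:

Auto[1:4, ]   # rows 1 through 4; blank after the comma means all columns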

Tip

Auto[1:4, ] means rows 1 through 4, and the empty space after the comma means all columns. The structure dataset[rows, columns] is the general pattern.


Part 3: filter() — Subsetting Rows by Condition

The filter() function from dplyr is more powerful than bracket subsetting for conditional row selection.

Filter cars where horsepower is above 100:
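A sketch of the non-piped version (the piped equivalent appears in the Note below):

AutoFilter <- filter(Auto, horsepower > 100)
summary(AutoFilter$horsepower)   # the minimum should now be above 100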

Note

Notice the minimum of AutoFilter$horsepower is above 100 — the filter worked. An alternate way to write the same thing using the pipe operator is:

AutoFilter <- Auto %>% filter(horsepower > 100)

Both produce identical results. The pipe operator (%>%) passes the left-hand object as the first argument to the right-hand function — making chains of operations easier to read.


Part 4: select() — Choosing Columns

select() lets you pull specific variables from a dataset, creating a smaller, focused data frame.

Select horsepower, acceleration, and year:
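A sketch of the command (the object name AutoSelect matches the Tip below):

AutoSelect <- select(Auto, horsepower, acceleration, year)
head(AutoSelect)   # only the three chosen columns remain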

Tip

Notice that AutoSelect now contains only your three chosen columns. This is useful before running analysis so you are not carrying around variables you do not need.

If you encounter conflicts with MASS::select(), add dplyr:: before the function: dplyr::select(Auto, horsepower, acceleration, year).


Part 5: arrange() — Sorting Data

arrange() sorts a data frame by one or more variables. By default it sorts ascending (smallest to largest).

Sort by weight (ascending):

Sort by weight (descending) using desc():
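Sketches for both directions (the object names here are illustrative):

AutoAsc  <- arrange(Auto, weight)         # lightest cars first
AutoDesc <- arrange(Auto, desc(weight))   # heaviest cars first
head(AutoDesc)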

Tip

desc() wraps the variable name inside arrange() to reverse the sort order. You should see the heaviest cars at the top now.


Part 6: mutate() — Creating New Variables

mutate() adds new columns or modifies existing ones. It is commonly used to recode variables or create calculated fields.

Create a new mpg_category variable using ifelse():
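A sketch using mutate() with ifelse(), matching the 30-mpg cutoff described in the Tip below:

Auto <- Auto %>%
  mutate(mpg_category = ifelse(mpg > 30, "high", "low"))
table(Auto$mpg_category)   # counts of high- vs. low-mpg cars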

Tip

ifelse(condition, value_if_true, value_if_false) is a vectorized if-else. Any car with mpg > 30 is labeled "high", all others "low". Notice mpg_category now appears as a new column in the summary.


Part 7: Scatterplot — Visualizing Two Continuous Variables

A scatterplot shows the relationship between two continuous variables. Our goal is to characterize the relationship by visual inspection — positive, negative, or no relationship.

Scatterplot of Horsepower vs. MPG:

Add a linear trend line using stat_smooth(method="lm"):
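Sketches for both charts:

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point()

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm")   # straight (linear model) trend line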

Tip

stat_smooth(method = "lm") adds a straight linear model line. Without method="lm" you get a curved loess line instead. The downward slope here indicates a negative relationship — higher horsepower cars tend to have lower mpg.

Now try the Credit dataset relationship — Limit vs. Rating:
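A sketch, assuming Credit is accessed the same way as Auto and putting Rating on the x axis (either orientation shows the positive relationship):

Credit <- ISLR::Credit
ggplot(Credit, aes(x = Rating, y = Limit)) +
  geom_point() +
  stat_smooth(method = "lm")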

Note

As Rating increases, so does the credit Limit — there is a positive relationship between the two variables. The upward slope of the trend line confirms this.


Scored Quiz: Data Preparation

Data Preparation Quiz

Question 1 of 6

1. What does Auto[1:4, ] return?




Note

No need to submit this quiz anywhere. This exercise is for your benefit, to help you learn R.


Summary

In this lesson, we worked through the full data preparation and visualization workflow in R. We began with data types and coercion — understanding how R stores variables as factors, numeric, character, and logical types, and how to convert between them using as.factor(), as.numeric(), and recode(). We also covered date conversion using the lubridate package with the dataprep.csv dataset.

The core of the lesson covered the dplyr toolkit: arrange() to sort data, filter() to subset rows, select() to choose columns, and the pipe operator (%>%) to chain operations cleanly. We used count() and length() to tally observations, worked through the full range of missing data techniques including na.rm, is.na(), na_if(), na.omit(), and drop_na(), and used summarise() and group_by() to compute grouped statistics. case_when() and mutate() handle recoding and creating new variables.

The visualization section is organized by complexity. Single-variable bar charts that require data cleaning (recoding factor levels, handling missing values) are covered first using the nhanes and GSS datasets — these show how data preparation and visualization connect in practice. Multi-variable charts follow: grouped and stacked bar charts for two categorical variables, boxplots by group, and scatterplots with geom_smooth() for two continuous variables. Full worked examples using the nhanes and BRFSS datasets tie everything together.

What comes next: The Hypothesis Testing lesson introduces t-tests — the first formal inferential tool in the course. The visualization and reasoning-error principles from this lesson apply directly: a t-test result is only meaningful if the question was well-posed, the data are appropriate, and the output is communicated honestly. The Trifecta Checkup extends naturally to interpreting statistical output: does the analysis answer the right question, with the right data, communicated clearly?