Data Preparation

Published

August 9, 2024

The goal of this lesson is to teach you how to clean datasets for use in analytics. This lesson focuses on dplyr. dplyr is a package in R that provides a set of functions for data manipulation tasks. These functions are designed to be intuitive and efficient, making it easier to work with data frames or tibbles (a modern reimagining of data frames provided by the tibble package).

####################################
# Project name: Data Preparation
# Data used: gss.2016, brfss.csv, customers.csv, gig.csv from Blackboard, iris from datasets
# Libraries used: tidyverse, semTools, lubridate
####################################

At a Glance

In order to succeed in this lesson, we need to be able to evaluate variables and understand how to clean and prepare data to make variables easier to use and in the correct form. This sometimes includes subsetting and filtering data alongside other techniques.

Lesson Objectives

Create variables and identify and change data types.
Learn how to clean data via dplyr.

Consider While Reading

We often spend a considerable amount of time inspecting and preparing the data for the subsequent analysis. This includes the following:
- Evaluating Data Types
- Sorting Data
- Selecting Variables
- Filtering Data
- Counting Data
- Handling Missing Values
- Summarizing
- Grouping Data

Evaluating Data Types

The data type of a variable specifies the type of data that is stored inside that variable. R is called a dynamically typed language, meaning that a variable itself is not declared with any data type. Rather, it receives the data type of the R object that is assigned to it. We can change a variable’s data type through coercion if we want, but it will initially inherit one based on the object assigned to it. We will learn some common ways to do this in the data preparation section.
Evaluating data types in R is crucial for understanding and managing datasets effectively, as different data types require specific handling and operations. R provides several functions to identify and evaluate data types, such as class() to check an object’s class, typeof() to determine its internal storage mode, and is.numeric(), is.character(), or is.factor() to test for specific types. These functions help ensure that data is in the correct format for analysis or modeling tasks. For instance, numeric data types are essential for calculations, while factors or characters are used for categorical data.
There are a number of data types in R that are common to programming and statistical analysis.
- Factor (Nominal or Ordinal)
- Numeric (Real or Discrete (Integer))
- Character
- Logical
- Date
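As a minimal sketch of the functions named above, the following checks one illustrative value of each common type. The object names are made up for this example and do not come from the lesson datasets.

```r
# Illustrative objects (hypothetical names, not from the lesson datasets)
wage <- 32.81                                         # real number
children <- 3L                                        # integer (note the L suffix)
job <- "Analyst"                                      # character
industry <- factor(c("Tech", "Automotive", "Tech"))   # factor
employed <- TRUE                                      # logical
hired <- as.Date("2024-08-09")                        # date

class(wage)        # "numeric"
typeof(wage)       # "double"
class(children)    # "integer"
class(job)         # "character"
class(industry)    # "factor"
typeof(industry)   # "integer" (factors are stored as integer codes)
class(employed)    # "logical"
class(hired)       # "Date"
is.numeric(wage)   # TRUE
is.factor(job)     # FALSE
```

Note that class() and typeof() can disagree: class() reports how R treats the object, while typeof() reports how it is stored.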

Coercing Data Types

In R, coercion is the process of converting one data type to another, often done automatically by R when necessary. For example, if you combine numeric and character data in a vector, R will coerce the numeric values to characters, as seen in c(1, “A”), which results in c(“1”, “A”). Coercion can also be done manually using functions like as.numeric() or as.factor() to change data types explicitly.
To determine whether a variable is numeric or categorical (factor), you can use the class() function. For example, if you have a variable age, running class(age) will tell you if it is stored as “numeric” or “factor”. Numeric variables represent quantities, such as continuous or discrete numbers, while categorical variables represent groups or categories, typically stored as factors. For example, a variable with levels like “Male” and “Female” would be a categorical (factor) variable.
Sometimes when you read in a dataset all the variables are already in the correct type. Other times, you need to force it into the type you need to conduct the analysis through a process called coercion.
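The two kinds of coercion described above can be sketched as follows. The vectors are illustrative and not taken from any lesson dataset.

```r
# Automatic coercion: mixing numbers and text gives a character vector
mixed <- c(1, "A")
mixed                # "1" "A"
class(mixed)         # "character"

# Manual coercion with the as.*() functions
scores <- c("10", "20", "30")
scores_num <- as.numeric(scores)
class(scores_num)    # "numeric"
sum(scores_num)      # 60

sex <- c("Male", "Female", "Female")
sex_factor <- as.factor(sex)
class(sex_factor)    # "factor"
levels(sex_factor)   # "Female" "Male" (levels are sorted alphabetically)
```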

Factor data type:

Factor data types can be ordinal or nominal. Ordinal and nominal variables are both types of categorical variables but differ in their characteristics. Ordinal variables have a meaningful order or ranking among categories (e.g., “low,” “medium,” “high”), but the intervals between ranks are not necessarily equal. Nominal variables, on the other hand, represent categories without any inherent order (e.g., “red,” “blue,” “green”). Understanding this distinction is essential for selecting appropriate statistical methods, as ordinal variables often allow for rank-based analyses, while nominal variables do not.
- Ordinal: Contain categories that have some logical order (e.g. categories of age).
- Nominal: Have categories that have no logical order (e.g., religious affiliation and marital status).
R will treat each unique value of a factor as a different level.

Ordinal Variable

Ordinal data may be categorized and ranked with respect to some characteristic or trait.
- For example, instructors are often evaluated on an ordinal scale (excellent, good, fair, poor).
- A scale allows us to code the data based on order. Treating the codes as numbers assumes equal distance between scale items, as is commonly done with Likert items.
- You can make an ordinal factor data type in R or you can convert the order to meaningful numbers.
To recode numbers in R, we would code poor to excellent = 1, 2, 3, 4 respectively.
The recode() function from the dplyr package within the tidyverse ecosystem is used to replace specific values in a vector or variable with new values.
It allows you to map old values to new ones in a simple and readable manner.

library(tidyverse)
# Take a vector representing evaluation scores, named evaluate 
evaluate <- c("excellent", "good", "fair", "poor", "excellent", "good")

data <- data.frame(evaluate)
data <- data %>%
     mutate(evaluate = recode(evaluate,
            "excellent" = 4,
            "good" = 3,
            "fair" = 2,
            "poor" = 1))
data
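The other option mentioned above, an ordinal factor, might look like the sketch below. It reuses the evaluate vector from the recode() example; the name evaluate_ord is made up for illustration.

```r
evaluate <- c("excellent", "good", "fair", "poor", "excellent", "good")

# Build an ordered factor, listing the levels from lowest to highest
evaluate_ord <- factor(evaluate,
                       levels = c("poor", "fair", "good", "excellent"),
                       ordered = TRUE)

class(evaluate_ord)                 # "ordered" "factor"
evaluate_ord[1] > evaluate_ord[2]   # TRUE: excellent ranks above good
as.integer(evaluate_ord)            # 4 3 2 1 4 3, the same codes recode() produced
table(evaluate_ord)                 # counts are reported in level order
```

An ordered factor keeps the readable labels while still allowing comparisons such as > and <, so no numeric recoding is needed until an analysis requires it.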

Nominal Variable

With nominal variables, data are simply categories for grouping.
For example, coding race/ethnicity might have a category value of White, Black, Native American, Asian/Pacific Islander, Other.
Qualitative values may be converted to quantitative values for analysis purposes (e.g., White = 1, Black = 2, etc.). This conversion to a numerical representation of the category would be needed to run some analyses.
- Sometimes, R does this on our behalf depending on commands used.
We can force a variable into a factor data type using the as.factor() command.
If we use the read.csv() command, we can sometimes do this by setting the argument stringsAsFactors = TRUE. We will do this later in the lesson.

Numerical data types:

The as.numeric() function in R is used to convert data into numeric format, allowing for mathematical and statistical operations. It takes an input vector, such as characters or factors, and attempts to coerce it into numeric values. For instance, as.numeric(c(“1”, “2”, “3”)) returns a numeric vector of 1, 2, and 3. If applied to factors, it returns the underlying integer codes, not the original levels, so caution is needed to avoid misinterpretation. When conversion is not possible (e.g., as.numeric(“abc”)), it results in NA with a warning.
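The caution about factors is easy to miss, so here is a small sketch of the pitfall and one common way around it. The factor f is an illustrative example.

```r
# A factor whose labels look like numbers (illustrative values)
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                 # 1 2 1 3: the integer codes, not the labels
as.numeric(as.character(f))   # 10 20 10 30: the values we wanted

# Text that cannot be read as a number becomes NA, with a warning
as.numeric("abc")             # NA
```

Converting to character first makes R read the labels rather than the underlying codes.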
A numerical data type is a vector of numbers that can be Real or Integer, where continuous (Real) variables can take any value along some continuum, and integers take on whole numbers.
Two ways to create:
- We can create a numeric variable by ensuring our value we assign is a number!
- We can force a variable into a real number data type by using the as.numeric() command.
```
# Assign Rhode Island limit for medical marijuana in ounces per
# person
kOuncesRhode <- 2.5
# Identify the data type
class(x = kOuncesRhode)
```
```
[1] "numeric"
```
Discrete (Integer) Variables: In R, integer discrete data types represent whole numbers, making them ideal for scenarios where only non-fractional values are meaningful, such as counts, rankings, or categorical levels encoded as numbers. Integer values are stored more efficiently than numeric (floating-point) values, providing performance benefits in memory usage and computations. Integers can be explicitly created using the L suffix (e.g., x <- 5L) or by coercing other data types with as.integer(). Operations on integers, such as arithmetic, comparisons, or indexing, behave consistently with their discrete nature. Integer data types are particularly useful for tasks like looping, array indexing, or when working with data where precision beyond whole numbers is unnecessary or irrelevant.
For an example of a discrete variable, we could collect information on the number of children in a family or number of points scored in a basketball game.

# Assign the value of 4 to a constant called kTestInteger and set as
# an integer
kTestInteger <- as.integer(4)
class(kTestInteger)  #Confirm the data type is an integer

[1] "integer"

# Use as.integer() to truncate the constant kOuncesRhode
Trunc <- as.integer(kOuncesRhode)
Trunc

[1] 2

Character data type

The character data type in R is used to store text or string data, making it essential for handling names, labels, descriptions, or any non-numeric information. Character data is represented as sequences of characters enclosed in either single or double quotation marks (e.g., “hello” or ‘hello’).
Character data types can include letters, words, or numbers that cannot logically be included in calculations (e.g., a zip code).
- A quick example is below that shows how to assign a character value to a variable.

# Make string constants
Q1 <- "A"
Q2 <- "B"
# Check the data type
class(x = Q1)

[1] "character"

Logical data type

Logical data types in R represent values that are either TRUE or FALSE, often used for conditional operations and data filtering. Logical values can be the result of comparisons (e.g., x > 10) or explicitly assigned. Logical vectors are particularly useful in subsetting data; for instance, data[data$column > 5, ] selects rows where the column values are greater than 5. Logical operators like & (and), | (or), and ! (not) allow for more complex conditions. Logical data is also essential in control structures such as if, else, and loops. By enabling dynamic and conditional programming, logical types play a key role in efficient data manipulation and analysis.
- A quick example is below that shows how to assign a logical value to a variable.
```
# Store the result of 6 > 8 in a constant called kSixEight
kSixEight <- 6 > 8
# Can use comparison tests with the following: == >= <= > < !=
kSixEight  # Print kSixEight
```
```
[1] FALSE
```
```
# Determine the data type of kSixEight
class(x = kSixEight)
```
```
[1] "logical"
```
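Logical vectors are most useful for subsetting, as described above. The following sketch uses a small made-up data frame (df and its columns are hypothetical) to show a comparison, row selection, and the &, |, and ! operators.

```r
# Illustrative data frame (not one of the lesson datasets)
df <- data.frame(id = 1:5, wage = c(24, 31, 45, 29, 50))

high <- df$wage > 30     # logical vector: FALSE TRUE TRUE FALSE TRUE
df[high, ]               # rows where wage is above 30
sum(high)                # 3, because each TRUE counts as 1

# Combine conditions with & (and), | (or), and ! (not)
df[df$wage > 30 & df$wage < 50, ]    # ids 2 and 3
df[df$wage < 25 | df$wage >= 50, ]   # ids 1 and 5
df[!high, ]                          # ids 1 and 4
```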

Nominal Example with Dataset

library(tidyverse)
gss.2016 <- read_csv(file = "data/gss2016.csv")

# Examine the variable types with summary and class functions.
summary(gss.2016)

    grass               age           
 Length:2867        Length:2867       
 Class :character   Class :character  
 Mode  :character   Mode  :character

class(gss.2016$grass)  #Check the data type.

[1] "character"

gss.2016$grass <- as.factor(gss.2016$grass)  #Turn to a factor.
class(gss.2016$grass)  #Confirming it is now correct.

[1] "factor"

Numerical Example with Dataset

We need to ensure data can be coded as numeric before using the as.numeric() command. For example, the variable age looks numerical except for one value, “89 OR OLDER”. If the as.numeric() command were used on this variable as is, every “89 OR OLDER” observation would become NA. To make it a numerical variable while keeping those participants as the oldest value in the sample, we first recode the value to “89” and then use the as.numeric() command to coerce the variable into a number.
Recoding the 89 and older to 89 does cause the data to lack integrity in its current form because it will treat the people over 89 years old as 89. But, we are limited here because this needs to be a numerical variable for us to proceed. We will learn a step later on in this section to transform the age variable into categories so that we bring back our data integrity.

class(gss.2016$age)

[1] "character"

# Recode '89 OR OLDER' into just '89'
gss.2016$age <- recode(gss.2016$age, `89 OR OLDER` = "89")
# Convert to numeric data type
gss.2016$age <- as.numeric(gss.2016$age)
summary(gss.2016)  #Conduct final check confirming correct data types

       grass           age       
 DK       : 110   Min.   :18.00  
 IAP      : 911   1st Qu.:34.00  
 LEGAL    :1126   Median :49.00  
 NOT LEGAL: 717   Mean   :49.16  
 NA's     :   3   3rd Qu.:62.00  
                  Max.   :89.00  
                  NA's   :10
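To see why the recode step has to come first, here is a small sketch on a toy vector standing in for the age column. The values are illustrative, and base R replacement is used in place of recode() only to keep the example free of packages.

```r
# Toy stand-in for the age column (values are illustrative)
age <- c("25", "89 OR OLDER", "47", NA)

# Direct conversion loses the top-coded value (and raises a warning)
suppressWarnings(as.numeric(age))   # 25 NA 47 NA

# Recode first, then convert
age[age %in% "89 OR OLDER"] <- "89"
age_num <- as.numeric(age)
age_num                             # 25 89 47 NA
```

Only the value that was missing to begin with remains NA after the recode.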

Common dplyr Functions

Arrange

Sorting or arranging the dataset allows you to specify an order based on variable values.
Sorting allows us to review the range of values for each variable, and we can sort based on a single or multiple variables.
Notice the difference between sort() and arrange() functions below.
- The sort() function sorts a vector.
- The arrange() function sorts a dataset based on a variable.
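A minimal sketch of that difference, using a made-up data frame (toy is hypothetical and is not gig.csv). The last call also shows sorting on more than one variable.

```r
library(dplyr)

# Toy data (illustrative values)
toy <- data.frame(EmployeeID = 1:4,
                  Wage = c(43.1, 24.3, 50.0, 32.8),
                  Industry = c("Tech", "Auto", "Auto", "Tech"))

# sort() takes a vector and returns only the sorted values
sort(toy$Wage)                       # 24.3 32.8 43.1 50.0

# arrange() takes a data frame and reorders whole rows
arrange(toy, Wage)                   # EmployeeID order becomes 2, 4, 1, 3

# Several variables: Industry first, then Wage from high to low within each
arrange(toy, Industry, desc(Wage))   # EmployeeID order becomes 3, 2, 1, 4
```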
To conduct an example, read in the data set called gig.csv from your working directory.

gig <- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings = "")
dim(gig)

[1] 604   4

head(gig)

  EmployeeID  Wage     Industry        Job
1          1 32.81 Construction    Analyst
2          2 46.00   Automotive   Engineer
3          3 43.13 Construction  Sales Rep
4          4 48.09   Automotive      Other
5          5 43.62   Automotive Accountant
6          6 46.98 Construction   Engineer

Using the arrange() function, we add the dataset, followed by a comma and then add in the variable we want to sort. This arranges from small to large.
Below is code to rearrange data based on Wage and save it in a new object.

sortTidy <- arrange(gig, Wage)
head(sortTidy)

  EmployeeID  Wage     Industry        Job
1        467 24.28 Construction   Engineer
2        547 24.28 Construction  Sales Rep
3        580 24.28 Construction Accountant
4        559 24.42 Construction   Engineer
5         16 24.76   Automotive Programmer
6        221 24.76   Automotive Programmer

We can apply a desc() function inside the arrange function to re-sort from high to low like shown below.

sortTidyDesc <- arrange(gig, desc(Wage))
head(sortTidyDesc)

  EmployeeID  Wage     Industry        Job
1        110 51.00 Construction      Other
2         79 50.00   Automotive   Engineer
3        348 49.91 Construction Accountant
4        373 49.91 Construction Accountant
5        599 49.84   Automotive   Engineer
6         70 49.77 Construction Accountant

Subsetting or Filtering

Subsetting or filtering a data frame is the process of indexing, or extracting a portion of the data set that is relevant for subsequent statistical analysis. Subsetting allows you to work with a subset of your data, which is essential for data analysis and manipulation. One of the most common ways to subset in R is by using square brackets []. We can also use the filter() function from tidyverse.
We use subsets to do the following:
- View data based on specific data values or ranges.
- Compare two or more subsets of the data.
- Eliminate observations that contain missing values, low-quality data, or outliers.
- Exclude variables that contain redundant information, or variables with excessive amounts of missing values.
When working with data frames, you can subset by rows and columns using two indices inside the square brackets: data[row, column]. For example, if you have df <- data.frame(a = 1:3, b = c(“X”, “Y”, “Z”)), df[1, 2] would return the value “X”, which is the first row and second column. If you want the entire first row, you would use df[1, ], and to get the second column, you’d use df[, 2].
You can also use logical conditions to subset. For instance, x[x > 20] would return all values in x greater than 20, and in a data frame, you could filter rows where a certain condition holds, such as df[df$a > 1, ], which returns rows where column a has values greater than 1.
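The bracket examples described above can be run as written. This sketch assumes R 4.0 or later, where data.frame() keeps text columns as character by default; the vector x is made up for illustration.

```r
df <- data.frame(a = 1:3, b = c("X", "Y", "Z"))

df[1, 2]         # "X": first row, second column
df[1, ]          # the entire first row
df[, 2]          # the second column as a vector: "X" "Y" "Z"
df[df$a > 1, ]   # rows where a is greater than 1 (rows 2 and 3)

x <- c(10, 25, 30, 15)
x[x > 20]        # 25 30
```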
Let’s do an example using the customers.csv file we read in earlier as customers in the last lesson. Base R provides several methods for subsetting data structures. Below uses base R by using the square brackets dataset[row, column] format.

customers <- read.csv("data/customers.csv", stringsAsFactors = TRUE)

# To subset, note the dataset[row, column] format.
# Results are hidden to save space, but be sure to try this code in your .R file.
# Data in 1st row
customers[1, ]
# Data in 2nd column
customers[, 2]
# Data for 2nd column/1st observation (row)
customers[1, 2]
# First 3 columns of data
customers[, 1:3]

One of the most powerful and intuitive ways to subset data frames in R is by using the filter() function from the dplyr package, which is part of the tidyverse. Tidyverse is extremely popular when filtering data.
The filter function is used to subset rows of a data frame based on certain conditions.
The below example filters data by the College variable when category values are “Yes” and saves the filtered dataset into an object called college.

# Filtering by whether the customer has a 'Yes' for college.  Saving
# this filter into a new object college which you should see in your
# global environment.
college <- filter(customers, College == "Yes")
# Showing first 6 records of college - note the College variable is
# all Yes's.
head(college)

   CustID    Sex     Race  BirthDate College HHSize Income Spending Orders
1 1530016 Female    Black 12/16/1986     Yes      5  53000      241      3
2 1531136   Male    White   5/9/1993     Yes      5  94000      843     12
3 1532160   Male    Black  5/22/1966     Yes      2  64000      719      9
4 1532307   Male    White  9/16/1964     Yes      4  60000      582     13
5 1532387   Male    White  8/27/1957     Yes      2  67000      452      9
6 1533017 Female Hispanic  5/14/1985     Yes      3  84000      153      2
  Channel
1      SM
2      TV
3      TV
4      SM
5      SM
6     Web

Using the filter command, we can combine conditions by using an & for and, or an | for or. The statement below filters by College and Income and saves the new dataset in an object called twoFilters.

twoFilters <- filter(customers, College == "Yes" & Income < 50000)
head(twoFilters)

   CustID    Sex     Race  BirthDate College HHSize Income Spending Orders
1 1533697 Female    Asian  10/8/1974     Yes      3  42000      247      3
2 1535063 Female    White 12/17/1982     Yes      3  42000      313      4
3 1544417   Male Hispanic  3/14/1980     Yes      4  46000      369      3
4 1547864 Female Hispanic  6/15/1987     Yes      2  44000      500      5
5 1550969 Female    White   4/8/1978     Yes      4  47000      774     16
6 1553660 Female    White   8/2/1988     Yes      2  47000      745      5
  Channel
1     Web
2      TV
3      TV
4      TV
5      TV
6      SM

Next, we can do an or statement. The example below uses the filter command to filter by more than one category in the same field using the | in between the categories.

TwoRaces <- filter(customers, Race == "Black" | Race == "White")
head(TwoRaces)

   CustID    Sex  Race  BirthDate College HHSize Income Spending Orders Channel
1 1530016 Female Black 12/16/1986     Yes      5  53000      241      3      SM
2 1531136   Male White   5/9/1993     Yes      5  94000      843     12      TV
3 1532160   Male Black  5/22/1966     Yes      2  64000      719      9      TV
4 1532307   Male White  9/16/1964     Yes      4  60000      582     13      SM
5 1532387   Male White  8/27/1957     Yes      2  67000      452      9      SM
6 1533791   Male White 10/27/1999     Yes      1  97000     1028     17     Web

The str_detect() function is used to detect the presence or absence of a pattern (regular expression) in a string or vector of strings. It returns a logical vector indicating whether the pattern was found in each element of the input vector.
Using str_detect() inside a filter() call allows you to pull observations based on the inclusion of a string pattern.

library(tidyverse)
# Keep customers whose BirthDate text contains "1985"
Birthday1985 <- filter(customers, str_detect(BirthDate, "1985"))
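To see what str_detect() returns before it is handed to filter(), here is a short sketch on a made-up character vector in the same m/d/yyyy style. The vector birth is hypothetical.

```r
library(stringr)

# Illustrative birth dates stored as text
birth <- c("12/16/1986", "5/14/1985", "8/2/1988", "1/19/1985")

str_detect(birth, "1985")          # FALSE TRUE FALSE TRUE
birth[str_detect(birth, "1985")]   # "5/14/1985" "1/19/1985"
birth[!str_detect(birth, "1985")]  # "12/16/1986" "8/2/1988"
```

filter() keeps the rows where this logical vector is TRUE, which is why the two functions combine so naturally.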

Select

In R, the select() function is part of the dplyr package, which is used for data manipulation. The select() function is specifically designed to subset or choose specific columns from a data frame. It allows you to select variables (columns) by their names or indices.
The statement below selects the Income, Spending, and Orders variables from the customers dataset and forms them into a new dataset called smallData.
The same request is written with the chaining operator in the next section.

smallData <- select(customers, Income, Spending, Orders)
head(smallData)

  Income Spending Orders
1  53000      241      3
2  94000      843     12
3  64000      719      9
4  60000      582     13
5  47000      845      7
6  67000      452      9
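Besides listing names, select() accepts positions, ranges, exclusions, and helper functions. The sketch below uses a small made-up data frame (toy is hypothetical) that borrows a few column names from customers.

```r
library(dplyr)

# Toy data with a few of the customers column names (values illustrative)
toy <- data.frame(CustID = 1:3,
                  Income = c(53000, 94000, 64000),
                  Spending = c(241, 843, 719),
                  Orders = c(3, 12, 9))

select(toy, Income, Spending)    # by name
select(toy, 2:4)                 # by position
select(toy, Income:Orders)       # a range of adjacent columns
select(toy, -CustID)             # everything except CustID
select(toy, starts_with("S"))    # helper: names beginning with "S"
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.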

Piping (Chaining) Operator

The pipe operator takes the output of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side. This enables you to chain multiple functions together, making the code easier to understand and debug.
If we want to keep our code tidy, we can add the piping operator (%>%) to help combine our lines of code into a new object or overwriting the same object.
This operator allows us to pass the result of one function/argument to the other one in sequence.
The example below uses the select() function to pull the Income, Spending, and Orders variables from the customers dataset and save them as a new object called smallData. It is an identical request to the one directly above, but written with the piping operator.

smallData <- customers %>%
    select(Income, Spending, Orders)
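The benefit of the pipe is clearer once several functions are chained. This sketch compares a nested call with its piped equivalent on a made-up data frame (toy, nested, and piped are hypothetical names).

```r
library(dplyr)

toy <- data.frame(College = c("Yes", "No", "Yes", "Yes"),
                  Income = c(42000, 61000, 97000, 53000),
                  Spending = c(247, 500, 1028, 241))

# Nested calls have to be read from the inside out
nested <- arrange(select(filter(toy, College == "Yes"), Income, Spending), Income)

# The piped version reads top to bottom, one step per line
piped <- toy %>%
    filter(College == "Yes") %>%
    select(Income, Spending) %>%
    arrange(Income)

identical(nested, piped)   # TRUE
```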

Counting

Counting allows us to gain a better understanding and insights into the data.
This helps to verify that the data set is complete or determine if there are missing values.
In R, the length() function returns the number of elements in a vector, list, or any other object with a length attribute. It essentially counts the number of elements in the specified object.

# Gives the length of Industry
length(gig$Industry)

[1] 604

For counting using tidyverse, we typically use the filter and count function together to filter by a value or state and then count the filtered data.
In the function below, I use the piping operator to link together the filter and count functions into one command.
Note that we need a piping operator (%>%) before each new function that is part of the chunk.

# Counting with a categorical variable.
# Here we filter by the Automotive industry, count the rows, and save
# the result in a new object called countAuto.
countAuto <- gig %>%
    filter(Industry == "Automotive") %>%
    count()
countAuto  #190

    n
1 190

Below, we are filtering by Wage and then counting.

# Counting with a numerical variable.
# We could also save this in an object.
gig %>%
    filter(Wage > 30) %>%
    count()  ##536

    n
1 536

We learned that there are 190 employees in the automotive industry and there are 536 employees who earn more than $30 per hour.
We could also calculate the number of people with wages under or equal to 30.

# We find 68 Wages under or equal to 30
WageLess30 <- gig %>%
    filter(Wage <= 30) %>%
    count()
WageLess30

   n
1 68

How many Accountants are there in the Job category of the gig data set? Use filter and count to calculate this answer. The answer is shown below.

   n
1 83
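When a count is wanted for every category at once, count() can also be given a variable name directly. This is a sketch on made-up data (toy is hypothetical), not a replacement for the filter-then-count pattern used above.

```r
library(dplyr)

toy <- data.frame(Industry = c("Tech", "Automotive", "Tech",
                               "Construction", "Tech"))

# One row per category, with the number of observations in column n
toy %>%
    count(Industry)                # Automotive 1, Construction 1, Tech 3

# sort = TRUE lists the largest group first
toy %>%
    count(Industry, sort = TRUE)   # Tech 3 comes first
```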

Handling Missing Data

Missing data is a common issue in data analysis and can arise for various reasons, such as data collection errors, non-responses in surveys, or data corruption. In R, handling missing data is crucial to ensure accurate and reliable analysis. Missing values are typically represented by NA (Not Available) in R.
Missing data needs to be closely evaluated and verified within each variable whether the data is truly blank, has no answer, or is marked with a character value such as the text N/A.
Missing data needs to be closely evaluated to see if the missing value is meaningful or not. If the variable that has many missing values is deemed unimportant or can be represented using a proxy variable that does not have missing values, the variable may be excluded from the analysis.
After a data set is loaded, there are three common strategies for dealing with missing values.

The omission strategy recommends that observations with missing values be excluded from subsequent analysis.
The imputation strategy recommends that the missing values be replaced with some reasonable imputed values. For example, imputing missing values using techniques like mean/median substitution or regression models can be considered.
- Numeric variables: replace with the average.
- Categorical variables: replace with the predominant category.
Ignore your missing data if the function works without it.
- When you ignore missing data, the missing values are left in the dataset and excluded only at computation time. In R, summary functions such as mean() and sum() return NA by default when missing values are present and accept na.rm = TRUE to remove them during computation, while modeling functions such as lm() drop incomplete rows by default through their na.action argument. Ignoring missing data can simplify the analysis, but it comes with potential consequences.
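The simple imputation rules listed above (average for numeric variables, predominant category for categorical variables) can be sketched in base R as follows. The data frame toy and the name top_category are made up for illustration, and this is only one of several ways to impute.

```r
# Toy data with missing values (illustrative, not a lesson dataset)
toy <- data.frame(Wage = c(30, NA, 50, 40),
                  Industry = c("Tech", "Tech", NA, "Automotive"),
                  stringsAsFactors = FALSE)

# Numeric variable: replace NA with the mean of the observed values
toy$Wage[is.na(toy$Wage)] <- mean(toy$Wage, na.rm = TRUE)

# Categorical variable: replace NA with the most frequent category
top_category <- names(which.max(table(toy$Industry)))
toy$Industry[is.na(toy$Industry)] <- top_category

toy   # Wage is 30 40 50 40; Industry is Tech Tech Tech Automotive
```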

The choice of approach depends on the nature of the missingness, which can be categorized as Missing Completely at Random, Missing at Random, or Missing Not at Random. Addressing missing data appropriately is essential to maintain the validity of statistical analyses and avoid biases.

Limitations of Using a Missing Data Technique

Recommended Closer Evaluation of Missing Data
There are limitations of the techniques listed above (omission, imputation, and ignore).
Reduction in Sample Size: Ignoring missing data leads to a smaller effective sample size, which may reduce the power of your analysis and the reliability of the results.
Bias: If the missing data are not Missing Completely at Random, ignoring them may introduce bias. For example, if specific groups or patterns are overrepresented in the remaining data, the results may not generalize to the full dataset.
Distorted Metrics: Calculations that ignore missing values (e.g., averages, sums) might not reflect the true population parameters, especially if the missing data are systematically different from the observed data. In addition, if a large number of values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
Incorrect Inferences: Ignoring missing data without considering its nature could lead to incorrect conclusions, as the analysis only reflects the subset of available data.
Consider a dataset used to predict factors that lead to intubation due to COVID-19. Suppose one variable, “Number of pregnancies,” contains missing data (NAs) for all men, as the question is not applicable to them. If we were to compare this variable with another, “Intubated due to COVID-19: Yes/No,” simply omitting the rows with blanks (NAs) could lead to the exclusion of an entire gender, distorting the analysis. In this case, a different approach to handling missing data would be more appropriate to ensure the dataset remains representative. Additionally, if a value is not blank but is considered missing for analysis purposes, the data should be consistently processed (e.g., mutated or recoded) to align with the chosen technique for handling true missing values.

na.rm

The na.rm parameter in R is a convenient way to handle missing values (NA) within functions that perform calculations on datasets. The parameter stands for “NA remove” and, when set to TRUE, instructs the function to exclude missing values from the computation.

y <- c(1, 2, NA, 3, 4, NA)
# These lines run, but do not give you anything useful.
sum(y)

[1] NA

mean(y)

[1] NA

Many functions in R include parameters that will ignore NAs for you.
sum() and mean() are examples of this, and most summary statistics like median() and var() and max() also use the na.rm parameter to ignore the NAs. Always check the help to determine if na.rm is a parameter.

sum(y, na.rm = TRUE)

[1] 10

mean(y, na.rm = TRUE)

[1] 2.5

# na.omit removes the NAs from a vector or dataset. Here, it works
# for removing NAs from the vector.
y <- na.omit(y)

na.rm with a dataset

With a dataset, we use the data$variable format. For example, mean(data$column, na.rm = TRUE) calculates the mean of the non-missing values in the specified column or variable. This approach is straightforward and useful when the presence of missing values would otherwise cause an error or return an NA result.

gig <- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings = "")
summary(gig)

   EmployeeID         Wage               Industry           Job     
 Min.   :  1.0   Min.   :24.28   Automotive  :190   Engineer  :191  
 1st Qu.:151.8   1st Qu.:34.19   Construction:366   Other     : 88  
 Median :302.5   Median :41.88   Tech        : 38   Accountant: 83  
 Mean   :302.5   Mean   :40.08   NA's        : 10   Programmer: 80  
 3rd Qu.:453.2   3rd Qu.:45.87                      Sales Rep : 77  
 Max.   :604.0   Max.   :51.00                      (Other)   : 69  
                                                    NA's      : 16

mean(gig$Wage, na.rm = TRUE)

[1] 40.0828

is.na()

In R, the is.na() function is used to check for missing (NA) values in objects like vectors, data frames, or arrays. It returns a logical object of the same shape as the input (a logical vector for a vector, a logical matrix for a data frame), where TRUE indicates a missing value and FALSE indicates a non-missing value.

# Counts the number of all NA values in the entire dataset
CountAllBlanks <- sum(is.na(gig))
CountAllBlanks

[1] 26

# Gives the observation number of the observations that include NA
# values
which(is.na(gig$Industry))

 [1]  24 139 361 378 441 446 479 500 531 565

# Produces a dataset with observations that have NA values in the
# Industry field.
ShowBlankObservations <- gig %>%
    filter(is.na(Industry))
ShowBlankObservations

   EmployeeID  Wage Industry        Job
1          24 42.58     <NA>  Sales Rep
2         139 42.18     <NA>   Engineer
3         361 31.33     <NA>      Other
4         378 48.09     <NA>      Other
5         441 32.35     <NA> Accountant
6         446 30.76     <NA> Accountant
7         479 42.85     <NA> Consultant
8         500 43.13     <NA>  Sales Rep
9         531 43.13     <NA>   Engineer
10        565 38.98     <NA> Accountant

# Counts the number of observations that have NA values in the
# Industry field. Industry is categorical, so we can count values
# based on it.
CountBlanks <- sum(is.na(gig$Industry))
CountBlanks

[1] 10

library(tidyverse)
# Counts the number of observations that have NA values in the Wage
# field.
CountBlanks <- sum(is.na(gig$Wage))
CountBlanks

[1] 0
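Rather than checking one column at a time, is.na() can be combined with colSums() to count the NAs in every column in one step. The sketch below uses a made-up data frame (toy is hypothetical).

```r
# Toy data (illustrative values)
toy <- data.frame(Wage = c(30, 45, 28),
                  Industry = c(NA, NA, "Tech"),
                  Job = c(NA, "Engineer", "Analyst"))

is.na(toy)                 # logical matrix, one cell per value
colSums(is.na(toy))        # NAs per column: Wage 0, Industry 2, Job 1
rowSums(is.na(toy))        # NAs per row: 2 1 0
sum(complete.cases(toy))   # 1 row has no missing values
```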

na_if()

The na_if() function in dplyr is used to replace specific values in a column with NA (missing) values. This function can be particularly useful when you want to standardize missing values across a dataset or when you want to replace certain values with NA for further data processing.

TurnNA <- gig %>%
    mutate(Job = na_if(Job, "Other"))
head(TurnNA)

  EmployeeID  Wage     Industry        Job
1          1 32.81 Construction    Analyst
2          2 46.00   Automotive   Engineer
3          3 43.13 Construction  Sales Rep
4          4 48.09   Automotive       <NA>
5          5 43.62   Automotive Accountant
6          6 46.98 Construction   Engineer
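A common use of na_if() is converting a text placeholder such as N/A into a true NA so that is.na() and na.rm can see it. The following is a sketch on made-up data (toy is hypothetical).

```r
library(dplyr)

# Toy data where missing answers were typed as the text "N/A"
toy <- data.frame(Job = c("Analyst", "N/A", "Engineer", "N/A"),
                  stringsAsFactors = FALSE)

sum(is.na(toy$Job))   # 0: R sees "N/A" as ordinary text

toy <- toy %>%
    mutate(Job = na_if(Job, "N/A"))

sum(is.na(toy$Job))   # 2: the placeholders are now true NAs
```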

na.omit() vs. drop_na()

Both functions return a new object with the rows containing missing values removed.
na.omit() is a base R function, so it doesn’t require any additional package installation, whereas drop_na() requires loading the tidyr package, which is part of the tidyverse ecosystem.
drop_na() fits well into tidyverse pipelines, making it easy to integrate with other tidyverse functions, whereas na.omit() can also be used in pipelines but might require additional steps to fit seamlessly.
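One practical difference is that drop_na() can be limited to specific columns. The sketch below compares the two functions on a made-up data frame (toy and its columns are hypothetical).

```r
library(tidyr)

toy <- data.frame(gdp = c(500, NA, 1200, 900),
                  trade = c(40, 55, NA, 70))

nrow(na.omit(toy))        # 2: rows with an NA in any column are removed
nrow(drop_na(toy))        # 2: same result for the whole data frame
nrow(drop_na(toy, gdp))   # 3: only rows with an NA in gdp are removed
```

Restricting the check to the variables an analysis actually uses can preserve observations that a blanket na.omit() would discard.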

# install.packages('Amelia')
library(Amelia)
data("africa")
summary(africa)

      year              country       gdp_pc            infl        
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.400  
 1st Qu.:1977   Burundi     :20   1st Qu.: 513.8   1st Qu.:  4.760  
 Median :1982   Cameroon    :20   Median :1035.5   Median :  8.725  
 Mean   :1982   Congo       :20   Mean   :1058.4   Mean   : 12.753  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1244.8   3rd Qu.: 13.560  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.890  
                                  NA's   :2                         
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4332190  
 Median : 59.59   Median :0.1667   Median : 5853565  
 Mean   : 62.60   Mean   :0.2889   Mean   : 5765594  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7355000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390  
 NA's   :5

summary(africa$gdp_pc)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  376.0   513.8  1035.5  1058.4  1244.8  2723.0       2

summary(africa$trade)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  24.35   38.52   59.59   62.60   81.16  134.11       5

africa1 <- na.omit(africa)
summary(africa1)

      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390

## to drop all at once.
africa2 <- africa %>%
    drop_na()
summary(africa2)

      year              country       gdp_pc            infl       
 Min.   :1972   Burkina Faso:20   Min.   : 376.0   Min.   : -8.40  
 1st Qu.:1976   Burundi     :17   1st Qu.: 511.5   1st Qu.:  4.67  
 Median :1981   Cameroon    :18   Median :1062.0   Median :  8.72  
 Mean   :1981   Congo       :20   Mean   :1071.8   Mean   : 12.91  
 3rd Qu.:1986   Senegal     :20   3rd Qu.:1266.0   3rd Qu.: 13.57  
 Max.   :1991   Zambia      :20   Max.   :2723.0   Max.   :127.89  
     trade            civlib         population      
 Min.   : 24.35   Min.   :0.0000   Min.   : 1332490  
 1st Qu.: 38.52   1st Qu.:0.1667   1st Qu.: 4186485  
 Median : 59.59   Median :0.1667   Median : 5858750  
 Mean   : 62.60   Mean   :0.2899   Mean   : 5749761  
 3rd Qu.: 81.16   3rd Qu.:0.3333   3rd Qu.: 7383000  
 Max.   :134.11   Max.   :0.6667   Max.   :11825390
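
One practical difference is that drop_na() can be limited to specific columns, while na.omit() always checks every column. The sketch below uses a made-up data frame (toy, with columns a and b, both assumptions for illustration) to compare the row counts.

```r
library(tidyr)

# Hypothetical toy data with NA values in two different columns
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

n_omit   <- nrow(na.omit(toy))    # rows complete in every column
n_drop   <- nrow(drop_na(toy))    # same rule as na.omit()
n_drop_a <- nrow(drop_na(toy, a)) # only requires column a to be non-missing

c(n_omit, n_drop, n_drop_a)
```

Limiting the check to the columns you actually analyze keeps more observations than dropping every incomplete row.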

You try: load the airquality dataset from base R and look at a summary of the dataset.

Sum the number of NAs in airquality.
Omit all the NAs from airquality, save the result in a new data object called airqual, and take a new summary of it.

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0

[1] 44

     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day       
 Min.   :5.000   Min.   : 1.00  
 1st Qu.:6.000   1st Qu.: 9.00  
 Median :7.000   Median :16.00  
 Mean   :7.216   Mean   :15.95  
 3rd Qu.:9.000   3rd Qu.:22.50  
 Max.   :9.000   Max.   :31.00
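
One possible way to produce the output above is sketched below. It uses only base R functions already introduced in this lesson; other approaches (for example drop_na()) would work as well.

```r
# airquality ships with base R (datasets package)
data("airquality")
summary(airquality)

# Total number of NA values across all columns
sum(is.na(airquality))

# Remove every row that contains an NA, then summarize the result
airqual <- na.omit(airquality)
summary(airqual)
```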

Summarize

The summarize() command is used to create summary statistics for groups of observations in a data frame.
In R, summary() and summarize() serve different purposes. summary() is part of base R and gives a quick overview of data, returning descriptive statistics for each column. For example, summary(mtcars) provides the min, max, median, and mean for numeric columns and counts for factors. It’s useful for a broad snapshot of your dataset.
In contrast, summarize() (or summarise()) is from the dplyr package and allows for custom summaries. For instance, mtcars %>% summarize(avg_mpg = mean(mpg), max_hp = max(hp)) returns the average miles per gallon and the maximum horsepower. It’s more flexible and is often used with group_by() for grouped calculations. In conclusion, summary() gives automatic overviews, while summarize() is better for tailored summaries.
In the example below, we can summarize more than one thing into tidy output.

gig %>%
    drop_na() %>%
    summarize(mean.days = mean(Wage), sd.days = sd(Wage), var.days = var(Wage),
        med.days = median(Wage), iqr.days = IQR(Wage))

  mean.days  sd.days var.days med.days iqr.days
1  40.14567 7.047058 49.66103    41.82   11.465
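
The mtcars comparison described above can be run directly, since mtcars ships with base R. This is a small sketch of the two calls side by side; the object name mtcars_summary is just an illustrative choice.

```r
library(dplyr)
data("mtcars")

# Base R: automatic overview of one column
summary(mtcars$mpg)

# dplyr: only the statistics we ask for, returned as a one-row data frame
mtcars_summary <- mtcars %>%
    summarize(avg_mpg = mean(mpg), max_hp = max(hp))
mtcars_summary
```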

Group_by

group_by() is used for grouping data by one or more variables. When you use group_by() on a data frame, it doesn’t actually perform any computations immediately. Instead, it sets up the data frame in such a way that any subsequent operations are performed within these groups.
summarize() is often used in combination with group_by() to calculate summary statistics within groups.

## summarize data by Industry variable.
groupedData <- gig %>%
    group_by(Industry) %>%
    summarize(meanWage = mean(Wage))
groupedData

# A tibble: 4 × 2
  Industry     meanWage
  <fct>           <dbl>
1 Automotive       43.4
2 Construction     38.3
3 Tech             40.7
4 <NA>             39.5

## same function with na's dropped.
groupedData <- gig %>%
    drop_na() %>%
    group_by(Industry) %>%
    summarize(meanWage = mean(Wage))
groupedData

# A tibble: 3 × 2
  Industry     meanWage
  <fct>           <dbl>
1 Automotive       43.4
2 Construction     38.4
3 Tech             40.7
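
A grouped summary can return several statistics at once, including the group size from n(). The sketch below uses the built-in iris dataset rather than gig so that it runs without the course files; the column names n, meanPetal, and sdPetal are illustrative choices.

```r
library(dplyr)
data("iris")

# One row per species, with a count and two summary statistics
iris_by_species <- iris %>%
    group_by(Species) %>%
    summarize(n = n(), meanPetal = mean(Petal.Length), sdPetal = sd(Petal.Length))
iris_by_species
```

Reporting n next to each mean makes it easier to judge how much data supports each group's statistic.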

Mutate

mutate() is part of the dplyr package, which is used for data manipulation. The mutate() function is specifically designed to create new variables (columns) or modify existing variables in a data frame. It is commonly used in data wrangling tasks to add calculated columns or transform existing ones.
One example is below, but note that there are many things you can do with the mutate function.

# making a new variable called calculation that multiplies gdp_pc by
# infl variables in the africa1 dataset.
africa.mutated <- mutate(africa1, calculation = gdp_pc * infl)
head(africa.mutated)

  year      country gdp_pc  infl trade    civlib population calculation
1 1972 Burkina Faso    377 -2.92 29.69 0.5000000    5848380    -1100.84
2 1973 Burkina Faso    376  7.60 31.31 0.5000000    5958700     2857.60
3 1974 Burkina Faso    393  8.72 35.22 0.3333333    6075700     3426.96
4 1975 Burkina Faso    416 18.76 40.11 0.3333333    6202000     7804.16
5 1976 Burkina Faso    435 -8.40 37.76 0.5000000    6341030    -3654.00
6 1977 Burkina Faso    448 29.99 41.11 0.6666667    6486870    13435.52

Below is an example with the iris dataset, which is part of base R.

data("iris")
## Selecting 2 variables from the iris dataset: Sepal.Length and
## Petal.Length
selected_data <- select(iris, Sepal.Length, Petal.Length)
head(selected_data)

  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

# Filter rows based on a condition: Species = setosa
filtered_data <- filter(iris, Species == "setosa")
head(filtered_data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# Arrange rows by the Sepal.Length column
arranged_data <- arrange(iris, Sepal.Length)
# Create a new column by mutating the data by transforming
# Petal.Width to the log form.
mutated_data <- mutate(iris, Petal.Width_Log = log(Petal.Width))
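
arrange() sorts in ascending order by default. As a short sketch, wrapping a column in desc() reverses the order, and listing several columns sorts by the first and breaks ties with the next. The object names below are illustrative.

```r
library(dplyr)
data("iris")

# Largest Sepal.Length first
iris_desc <- arrange(iris, desc(Sepal.Length))
head(iris_desc, 3)

# Sort by Species, then by descending Sepal.Length within each species
iris_multi <- arrange(iris, Species, desc(Sepal.Length))
head(iris_multi, 3)
```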

Full Examples

gss.2016 Data Cleaning

First, because we made some edits to the data set, reread the original version using the read.csv() command. This brings the data set back to its original form. It is always a good idea to read the dataset back in when you are unsure whether a mistake during data preparation has compromised data integrity.

gss.2016 <- read.csv(file = "data/gss2016.csv")

Before we remove any missing data, we need it to be the correct data type. In this case, grass should be a factor.

# We coerced this variable earlier, but the object was called
# gss.2016.  Since we reread in the data set, this needs to be done
# again.
gss.2016$grass <- as.factor(gss.2016$grass)

The statement below is equivalent to the statement above, but written with the piping operator. It overwrites gss.2016 after coercing grass to a factor.
We added the mutate() function because we are going to add other data cleaning tasks to this statement.

gss.2016 <- gss.2016 %>%
    mutate(grass = as.factor(grass))

Piping to More Functions: Missing Data

In the code below, the as.factor() command has been moved inside a mutate() statement (from the tidyverse) and piped to a second mutate() that uses the na_if() command to handle missing data. If you use more than one data manipulation statement, the mutate() command helps organize your code: use one mutate() for each major change you are making.
In the code below, we created a new object, gss.2016.cleaned, to store the cleaned version of the dataset. This helps maintain data integrity because your original dataset stays intact, and each time you rerun the entire chunk, all the changes are applied at once.

gss.2016.cleaned <- gss.2016 %>%
    # Moved coercion statement into a mutate function to keep code
    # tidy
    mutate(grass = as.factor(grass)) %>%
    # Turn the DK value into NA (missing)
    mutate(grass = na_if(x = grass, y = "DK"))

# Check the summary, there should be 110 + 3 = 113 in the NA category
summary(object = gss.2016.cleaned)

       grass          age           
 DK       :   0   Length:2867       
 IAP      : 911   Class :character  
 LEGAL    :1126   Mode  :character  
 NOT LEGAL: 717                     
 NA's     : 113

Drop Levels

The droplevels function is part of base R and is used to drop unused levels from factor variables in a data frame. It works by removing any levels from a factor variable that are not present in the data.
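
A small sketch on a made-up factor shows why this matters: removing values from a factor does not remove their levels. The answers vector below is hypothetical and only mimics the grass categories.

```r
# Hypothetical toy factor mimicking the grass categories
answers <- factor(c("LEGAL", "NOT LEGAL", "DK", "LEGAL"))

# Subsetting removes the DK value but keeps DK as an (empty) level
kept <- answers[answers != "DK"]
levels(kept)

# droplevels() removes levels that no longer occur in the data
kept_clean <- droplevels(kept)
levels(kept_clean)
```

Empty levels otherwise show up as zero-count rows in summary() and table() output.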

Next, we want to edit our code to convert IAP and DK to NA values and drop levels that are empty.

Note the Piping operator added to the end of the DK line so you can keep going with new commands editing gss.2016.cleaned.

gss.2016.cleaned <- gss.2016 %>%
    mutate(grass = as.factor(grass)) %>%
    # Added piping operator
    mutate(grass = na_if(x = grass, y = "DK")) %>%
    # Turn to na if value of grass = IAP
    mutate(grass = na_if(x = grass, y = "IAP")) %>%
    # Drop levels in grass that have no values
    mutate(grass = droplevels(x = grass))
# Check what you just did
summary(gss.2016.cleaned)

       grass          age           
 LEGAL    :1126   Length:2867       
 NOT LEGAL: 717   Class :character  
 NA's     :1024   Mode  :character

Coercing to Numeric

Next, we handle a numerical variable, age. Age cannot be coerced directly to a numeric data type because it has “89 OR OLDER” as a value, which as.numeric() would turn into NA. Before using the as.numeric() command, we need to recode it. We did this above as a stand-alone statement.

gss.2016.cleaned <- gss.2016 %>%
    mutate(grass = as.factor(grass)) %>%
    mutate(grass = na_if(x = grass, y = "DK")) %>%
    mutate(grass = na_if(x = grass, y = "IAP")) %>%
    # Added piping operator
    mutate(grass = droplevels(x = grass)) %>%
    # Ensure variable can be coded as numeric and fix if necessary.
    mutate(age = recode(age, `89 OR OLDER` = "89")) %>%
    # Coerce into numeric
    mutate(age = as.numeric(x = age))

# Check what you just did
summary(gss.2016.cleaned)

       grass           age       
 LEGAL    :1126   Min.   :18.00  
 NOT LEGAL: 717   1st Qu.:34.00  
 NA's     :1024   Median :49.00  
                  Mean   :49.16  
                  3rd Qu.:62.00  
                  Max.   :89.00  
                  NA's   :10

The recode() command that is part of dplyr is like the ifelse() command that is in base R. There are a lot of ways to recode in R.
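
To make the comparison concrete, the sketch below recodes the same made-up character vector both ways. The age_raw values are hypothetical and only imitate the age column.

```r
library(dplyr)

# Hypothetical values imitating the age column before coercion
age_raw <- c("25", "89 OR OLDER", "41")

# dplyr: name the old value on the left and the new value on the right
age_recode <- recode(age_raw, `89 OR OLDER` = "89")

# base R: test a condition, then supply the replacement and the fallback
age_ifelse <- ifelse(age_raw == "89 OR OLDER", "89", age_raw)

identical(age_recode, age_ifelse)
as.numeric(age_recode)
```

recode() tends to read more cleanly when several values need replacing, because each old = new pair is listed once instead of nesting ifelse() calls.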

Finally, we want to take our numerical variable, age, and cut it at certain breaks to make categories that can be easily analyzed.

This also ensures that anyone above 89 is coded correctly in a category instead of as the value 89. This again brings back data integrity.
The cut() function generates class limits and bins used in frequency distributions (and histograms) for quantitative data.
Here, we are using it to cut age into a categorical variable.

gss.2016.cleaned <- gss.2016 %>%
    mutate(grass = as.factor(grass)) %>%
    mutate(grass = na_if(x = grass, y = "DK")) %>%
    mutate(grass = na_if(x = grass, y = "IAP")) %>%
    mutate(grass = droplevels(grass)) %>%
    mutate(age = recode(age, `89 OR OLDER` = "89")) %>%
    # Added piping operator
    mutate(age = as.numeric(age)) %>%
    # Cut numeric variable into groupings
    mutate(age.cat = cut(age, breaks = c(-Inf, 29, 59, 74, Inf),
        labels = c("< 30", "30 - 59", "60 - 74", "75+")))

# Check what you just did
summary(gss.2016.cleaned)

       grass           age           age.cat    
 LEGAL    :1126   Min.   :18.00   < 30   : 481  
 NOT LEGAL: 717   1st Qu.:34.00   30 - 59:1517  
 NA's     :1024   Median :49.00   60 - 74: 598  
                  Mean   :49.16   75+    : 261  
                  3rd Qu.:62.00   NA's   :  10  
                  Max.   :89.00                 
                  NA's   :10
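
By default, cut() makes each interval closed on the right, so a value equal to a break falls in the lower group. The sketch below checks this on a few made-up boundary ages, using the same breaks and labels as above; the ages vector is an assumption for illustration.

```r
# Hypothetical ages chosen to sit on or next to each break
ages <- c(18, 29, 30, 59, 60, 74, 75, 89)

age_groups <- cut(ages, breaks = c(-Inf, 29, 59, 74, Inf),
    labels = c("< 30", "30 - 59", "60 - 74", "75+"))

# Show each age next to the category it was assigned
data.frame(ages, age_groups)
```

Testing the boundary values like this is a quick way to confirm that the labels match the intervals actually produced.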

brfss Data Cleaning

The full codebook from which this screenshot is taken is brfss_2014_codebook.pdf.

Evaluate CodeBook Before Making Decisions

brfss <- read.csv("data/brfss.csv")
summary(brfss)

    TRNSGNDR        X_AGEG5YR          X_RACE         X_INCOMG    
 Min.   :1.00     Min.   : 1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:4.00     1st Qu.: 5.000   1st Qu.:1.000   1st Qu.:3.000  
 Median :4.00     Median : 8.000   Median :1.000   Median :5.000  
 Mean   :4.06     Mean   : 7.822   Mean   :1.992   Mean   :4.481  
 3rd Qu.:4.00     3rd Qu.:10.000   3rd Qu.:1.000   3rd Qu.:5.000  
 Max.   :9.00     Max.   :14.000   Max.   :9.000   Max.   :9.000  
 NA's   :310602                    NA's   :94                     
    X_EDUCAG        HLTHPLN1         HADMAM          X_AGE80     
 Min.   :1.000   Min.   :1.000   Min.   :1.00     Min.   :18.00  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00     1st Qu.:44.00  
 Median :3.000   Median :1.000   Median :1.00     Median :58.00  
 Mean   :2.966   Mean   :1.108   Mean   :1.22     Mean   :55.49  
 3rd Qu.:4.000   3rd Qu.:1.000   3rd Qu.:1.00     3rd Qu.:69.00  
 Max.   :9.000   Max.   :9.000   Max.   :9.00     Max.   :80.00  
                                 NA's   :208322                  
    PHYSHLTH   
 Min.   : 1.0  
 1st Qu.:20.0  
 Median :88.0  
 Mean   :61.2  
 3rd Qu.:88.0  
 Max.   :99.0  
 NA's   :4

Qualitative Variable

The example below looks at a healthcare data issue: how reported gender is recorded under different definitions. The data come from the Behavioral Risk Factor Surveillance System (brfss) dataset (2014), which includes lots of other variables besides reported gender.

# Load the data
brfss <- read.csv("data/brfss.csv")
# Summarize the TRNSGNDR variable
summary(object = brfss$TRNSGNDR)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    4.00    4.00    4.06    4.00    9.00  310602

# Find frequencies
table(brfss$TRNSGNDR)


     1      2      3      4      7      9 
   363    212    116 150765   1138   1468

Since this table is not very informative, we need to do some edits.
Check the class of the variable to see the issue with analyzing it as a categorical variable.

class(brfss$TRNSGNDR)

[1] "integer"

First, we need to change the TRNSGNDR variable to a factor using as.factor().

# Change variable from numeric to factor
brfss$TRNSGNDR <- as.factor(brfss$TRNSGNDR)
# Check data type again to ensure factor
class(brfss$TRNSGNDR)

[1] "factor"

Then, we need to do some data cleaning on the TRNSGNDR Variable.

brfss.cleaned <- brfss %>% 
  mutate(TRNSGNDR = recode_factor(TRNSGNDR,
      '1' = 'Male to female',
      '2' = 'Female to male',
      '3' = 'Gender non-conforming',
      '4' = 'Not transgender',
      '7' = 'Not sure',
      '9' = 'Refused'))

We can use the levels() command to show the factor levels made with the mutate() command above.

levels(brfss.cleaned$TRNSGNDR)

[1] "Male to female"        "Female to male"        "Gender non-conforming"
[4] "Not transgender"       "Not sure"              "Refused"

Check the summary.

summary(brfss.cleaned$TRNSGNDR)

       Male to female        Female to male Gender non-conforming 
                  363                   212                   116 
      Not transgender              Not sure               Refused 
               150765                  1138                  1468 
                 NA's 
               310602

Take a good look at the table to interpret the frequencies in the output above. The highest percentage was the “NA’s” category, followed by “Not transgender”. Removing the NA’s moved the “Not transgender” category to over 97% of observations.

Quantitative Variable

Let’s use the cleaned dataset to make more changes to the continuous variable PHYSHLTH. In the codebook, it looks like the data is most applicable to the first 2 categories: the 1-30 days coding and the 88 coding, which means 0 days of physical illness and injury.
- Using cleaned data, we need to prep the variable a little more before getting an accurate plot.
- Specifically, we need to null out the 77 and 99 values and make sure the 88 coding is set to be 0 for 0 days of illness and injury.

brfss.cleaned <- brfss %>%
    mutate(TRNSGNDR = recode_factor(TRNSGNDR, `1` = "Male to female", `2` = "Female to male",
        `3` = "Gender non-conforming", `4` = "Not transgender", `7` = "Not sure",
        `9` = "Refused")) %>%
    # Turn the 77 values to NA's.
    mutate(PHYSHLTH = na_if(PHYSHLTH, y = 77)) %>%
    # Turn the 99 values to NA's.
    mutate(PHYSHLTH = na_if(PHYSHLTH, y = 99)) %>%
    # Recode the 88 values to be numeric value of 0.
    mutate(PHYSHLTH = recode(PHYSHLTH, `88` = 0L))

The histogram showed most people have between 0 and 10 unhealthy days per 30 days.
Next, evaluate mean, median, and mode for the PHYSHLTH variable after ignoring the blanks.

mean(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1] 4.224106

median(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1] 0

names(x = sort(x = table(brfss.cleaned$PHYSHLTH), decreasing = TRUE))[1]

[1] "0"

While the mean is higher at 4.22, the median and the mode (the most common value) are both 0.
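
Base R has no built-in function for the statistical mode, which is why the table() and sort() combination is used above. A minimal sketch of wrapping the same idea in a reusable helper follows. The name stat_mode and the days vector are hypothetical, and when several values tie, this version returns only the first.

```r
# stat_mode() is a hypothetical helper name, not a base R function
stat_mode <- function(x) {
    x <- x[!is.na(x)]                 # ignore missing values
    counts <- table(x)                # frequency of each distinct value
    names(counts)[which.max(counts)]  # most frequent value, as a character string
}

# Hypothetical toy data: days of poor health
days <- c(0, 0, 3, 30, 0, 2, NA, 3)
stat_mode(days)
```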

## Spread to Report with the Mean
var(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1] 77.00419

sd(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1] 8.775203

## Spread to Report with Median
summary(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   4.224   3.000  30.000   10303

range(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1]  0 30

max(brfss.cleaned$PHYSHLTH, na.rm = TRUE) - min(brfss.cleaned$PHYSHLTH,
    na.rm = TRUE)

[1] 30

IQR(brfss.cleaned$PHYSHLTH, na.rm = TRUE)

[1] 3

library(semTools)
# Plot the data
brfss.cleaned %>%
    ggplot(aes(PHYSHLTH)) + geom_histogram()

# Calculate Skewness and Kurtosis
skew(brfss.cleaned$PHYSHLTH)

   skew (g1)           se            z            p 
2.209078e+00 3.633918e-03 6.079054e+02 0.000000e+00

kurtosis(brfss.cleaned$PHYSHLTH)

Excess Kur (g2)              se               z               p 
   3.474487e+00    7.267836e-03    4.780634e+02    0.000000e+00

The skew results provide a z of 607.905 (6.079054e+02) which is much higher than 7 (for large datasets). This indicates a clear right skew which means the data is not normally distributed.
The kurtosis results also indicate a very leptokurtic distribution, with a z of 478.063 (4.780634e+02).
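
To see where a z score like this comes from, the sketch below computes a moment-based skewness on made-up data and divides it by a standard error. It assumes the common large-sample approximation sqrt(6 / n) for the standard error; semTools may use a slightly different estimator for the skewness itself, so treat this as an illustration of the idea rather than a reimplementation of skew().

```r
# Hypothetical right-skewed toy data
x <- c(0, 0, 0, 0, 1, 1, 2, 3, 5, 30)
n <- length(x)

# Second and third central moments
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)

g1 <- m3 / m2^(3 / 2)  # moment-based skewness
se <- sqrt(6 / n)      # assumed large-sample standard error
z  <- g1 / se          # skewness relative to its standard error

c(skew = g1, se = se, z = z)
```

Because the standard error shrinks as n grows, even modest skewness produces a very large z in a dataset with hundreds of thousands of rows.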

Using Filters Example

The example below filters the brfss data by the values of certain variables.
- The first filter() chose observations that were in any one of the three transgender categories included in the data. It uses the | “or” operator.
- The second filter() chose people in an age category above category 4 but below category 12, that is, age categories 5 through 11.
- The last filter() used !is.na() to choose observations where the HADMAM variable was not NA.
Next, we reduce the data set to contain only the variables used to create the table by using the select() command.
Next, we change all the remaining variables in the data set to factors using the mutate_all() command. This not only changes the strings to factors, but also changes the numerical variables to factors.

Finally, we use mutate() commands to change the variable categories to something meaningful (from the codebook).

Notice the apostrophe in Don’t in the X_INCOMG recode. Because the label is wrapped in double quotes, the apostrophe does not end the string. If you wrap the label in single quotes instead, put a backslash before the apostrophe (\') so that R does not treat it as the closing quote.
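
The two quoting styles produce exactly the same string, which a quick check confirms. The variable names below are illustrative.

```r
# Single quotes: the apostrophe must be escaped with a backslash
single_quoted <- 'Don\'t know/not sure/missing'

# Double quotes: the apostrophe needs no escaping
double_quoted <- "Don't know/not sure/missing"

identical(single_quoted, double_quoted)
```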

brfss_small <- brfss.cleaned %>%
    filter(TRNSGNDR == "Male to female" | TRNSGNDR == "Female to male" |
        TRNSGNDR == "Gender non-conforming") %>%
    filter(X_AGEG5YR > 4 & X_AGEG5YR < 12) %>%
    filter(!is.na(HADMAM)) %>%
    select(TRNSGNDR, X_AGEG5YR, X_RACE, X_INCOMG, X_EDUCAG, HLTHPLN1, HADMAM) %>%
    mutate_all(as.factor) %>%
    # The next few mutates add labels to categorical variables based
    # on the codebook.
    mutate(X_AGEG5YR = recode_factor(X_AGEG5YR, `5` = "40-44", `6` = "45-49",
        `7` = "50-54", `8` = "55-59", `9` = "60-64", `10` = "65-69", `11` = "70-74")) %>%
    mutate(X_INCOMG = recode_factor(X_INCOMG, `1` = "Less than 15,000",
        `2` = "15,000 to less than 25,000", `3` = "25,000 to less than 35,000",
        `4` = "35,000 to less than 50,000", `5` = "50,000 or more", `9` = "Don't know/not sure/missing")) %>%
    mutate(X_EDUCAG = recode_factor(X_EDUCAG, `1` = "Did not graduate high school",
        `2` = "Graduated high school", `3` = "Attended college/technical school",
        `4` = "Graduated from college/technical school", `9` = NA_character_)) %>%
    mutate(HLTHPLN1 = recode_factor(HLTHPLN1, `1` = "Yes", `2` = "No",
        `7` = "Don't know/not sure/missing", `9` = "Refused")) %>%
    mutate(X_RACE = recode_factor(X_RACE, `1` = "White", `2` = "Black",
        `3` = "Native American", `4` = "Asian/Pacific Islander", `5` = "Other",
        `6` = "Other", `7` = "Other", `8` = "Other", `9` = "Other"))
# print a summary
summary(brfss_small)

                  TRNSGNDR   X_AGEG5YR                     X_RACE   
 Male to female       : 77   40-44:27   White                 :152  
 Female to male       :113   45-49:27   Black                 : 31  
 Gender non-conforming: 32   50-54:32   Native American       :  4  
 Not transgender      :  0   55-59:44   Asian/Pacific Islander:  6  
 Not sure             :  0   60-64:44   Other                 : 29  
 Refused              :  0   65-69:24                               
                             70-74:24                               
                        X_INCOMG                                     X_EDUCAG 
 Less than 15,000           :46   Did not graduate high school           :24  
 15,000 to less than 25,000 :44   Graduated high school                  :86  
 25,000 to less than 35,000 :19   Attended college/technical school      :68  
 35,000 to less than 50,000 :26   Graduated from college/technical school:44  
 50,000 or more             :65                                               
 Don't know/not sure/missing:22                                               

 HLTHPLN1  HADMAM 
 Yes:198   1:198  
 No : 24   2: 22  
           9:  2
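
Two parts of this pipeline have common alternative spellings. A chain of == comparisons joined with | can be written with %in%, and in current versions of dplyr mutate_all() is superseded by mutate() combined with across(). The sketch below shows both on a made-up tibble (toy, with columns group and code, all hypothetical) rather than on the brfss data.

```r
library(dplyr)

# Hypothetical toy data standing in for a coded survey extract
toy <- tibble(group = c("a", "b", "c", "d"), code = c(1L, 2L, 1L, 9L))

toy_small <- toy %>%
    # Same idea as group == "a" | group == "b" | group == "c"
    filter(group %in% c("a", "b", "c")) %>%
    # Same idea as mutate_all(as.factor)
    mutate(across(everything(), as.factor))

summary(toy_small)
```

Both versions give the same result; %in% is mainly easier to read and extend when the list of accepted values grows.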

This data set full of categorical variables is now fully cleaned and ready to be analyzed!

Using AI

Use the following prompts on a generative AI, like ChatGPT, to learn more about data preparation activities.
What is the difference between ordinal and nominal variables, and how can you recode these data types in R?
Explain the steps to coerce a character variable into a factor using the as.factor() function in R.
How can you use the filter() function to subset a dataset based on multiple conditions using & and | in R?
How does subsetting using square brackets [] differ from using the filter() function in R?
What strategies can you use to handle missing data in R, and how does na.omit() differ from drop_na()?
How does the mutate() function help in transforming and creating new variables, and what are some practical examples?
What is the purpose of the group_by() function, and how does it interact with summarize() to create summary statistics in R?
Explain how you can use arrange() to sort a dataset by one or more variables and demonstrate sorting both in ascending and descending order.
Why is the piping operator %>% useful in R, and how does it improve the readability and structure of your code?
How would you use summarize() to calculate mean, median, and standard deviation for a numerical variable in R?

Summary

In this lesson, we worked through the basics of data cleaning. Data cleaning is an essential step in any analysis, and there are many ways to do it. The examples here use popular functions from dplyr (part of the tidyverse).