####################################
# Project name: Data Preparation
# Data used: gss.2016, brfss.csv, customers.csv, gig.csv from Blackboard, iris from datasets
# Libraries used: tidyverse, semTools, lubridate
####################################
Data Preparation
- The goal of this lesson is to teach you how to clean datasets for use in analytics. This lesson focuses on dplyr, a package in R that provides a set of functions for data manipulation tasks. These functions are designed to be intuitive and efficient, making it easier to work with data frames or tibbles (a modern reimagining of data frames provided by the tibble package).
At a Glance
- In order to succeed in this lesson, we need to be able to evaluate variables and understand how to clean and prepare data so that variables are easier to use and in the correct form. This sometimes includes subsetting and filtering data alongside other techniques.
Lesson Objectives
- Create variables and identify and change data types.
- Learn how to clean data via dplyr.
Consider While Reading
- We often spend a considerable amount of time inspecting and preparing the data for the subsequent analysis. This includes the following:
- Evaluating Data Types
- Sorting Data
- Selecting Variables
- Filtering Data
- Counting Data
- Handling Missing Values
- Summarizing
- Grouping Data
Evaluating Data Types
- The data type of a variable specifies the type of data that is stored inside that variable. R is called a dynamically typed language, meaning that a variable itself is not declared as any particular data type. Rather, it receives the data type of the R object that is assigned to it. We can change a variable’s data type through coercion if we want; we will learn some common ways to do this in the data preparation section.
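- A quick sketch illustrates this dynamic typing (the variable name here is arbitrary):
x <- 5  # x receives the numeric type from the value assigned to it
class(x)  # "numeric"
x <- "five"  # reassigning a character value changes the type x holds
class(x)  # "character"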
- Evaluating data types in R is crucial for understanding and managing datasets effectively, as different data types require specific handling and operations. R provides several functions to identify and evaluate data types, such as class() to check an object’s class, typeof() to determine its internal storage mode, and is.numeric(), is.character(), or is.factor() to test for specific types. These functions help ensure that data is in the correct format for analysis or modeling tasks. For instance, numeric data types are essential for calculations, while factors or characters are used for categorical data.
- There are a number of data types in R that are common to programming and statistical analysis.
- Factor (Nominal or Ordinal)
- Numeric (Real or Discrete (Integer))
- Character
- Logical
- Date
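- As a quick illustration, the functions mentioned above can check each of these types (the example values are hypothetical):
f <- factor(c("low", "high"))  # factor
n <- 3.14                      # numeric
ch <- "hello"                  # character
lg <- TRUE                     # logical
d <- as.Date("2016-01-01")     # Date
class(d)         # "Date"
typeof(n)        # "double" -- the internal storage mode
is.numeric(n)    # TRUE
is.factor(f)     # TRUE
is.character(ch) # TRUE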
Coercing Data Types
- In R, coercion is the process of converting one data type to another, often done automatically by R when necessary. For example, if you combine numeric and character data in a vector, R will coerce the numeric values to characters, as seen in c(1, "A"), which results in c("1", "A"). Coercion can also be done manually using functions like as.numeric() or as.factor() to change data types explicitly.
- To determine whether a variable is numeric or categorical (factor), you can use the class() function. For example, if you have a variable age, running class(age) will tell you if it is stored as “numeric” or “factor”. Numeric variables represent quantities, such as continuous or discrete numbers, while categorical variables represent groups or categories, typically stored as factors. For example, a variable with levels like “Male” and “Female” would be a categorical (factor) variable.
- Sometimes when you read in a dataset all the variables are already in the correct type. Other times, you need to force it into the type you need to conduct the analysis through a process called coercion.
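- A brief sketch of manual coercion (example values are hypothetical):
as.numeric("3.14")  # character to numeric: 3.14
as.numeric("abc")   # cannot be converted: NA with a warning
as.factor(c("Male", "Female", "Male"))  # character to factor with two levels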
Factor data type:
- Factor data types can be ordinal or nominal. Ordinal and nominal variables are both types of categorical variables but differ in their characteristics. Ordinal variables have a meaningful order or ranking among categories (e.g., “low,” “medium,” “high”), but the intervals between ranks are not necessarily equal. Nominal variables, on the other hand, represent categories without any inherent order (e.g., “red,” “blue,” “green”). Understanding this distinction is essential for selecting appropriate statistical methods, as ordinal variables often allow for rank-based analyses, while nominal variables do not.
- Ordinal: Contain categories that have some logical order (e.g. categories of age).
- Nominal: Have categories that have no logical order (e.g., religious affiliation and marital status).
- R will treat each unique value of a factor as a different level.
Ordinal Variable
- Ordinal data may be categorized and ranked with respect to some characteristic or trait.
- For example, instructors are often evaluated on an ordinal scale (excellent, good, fair, poor).
- A scale allows us to code the data based on order, assuming equal distance between scale items (i.e., Likert items).
- You can make an ordinal factor data type in R or you can convert the order to meaningful numbers.
- To recode numbers in R, we would code poor to excellent = 1, 2, 3, 4 respectively.
- The recode() function from the dplyr package within the tidyverse ecosystem is used to replace specific values in a vector or variable with new values.
- It allows you to map old values to new ones in a simple and readable manner.
library(tidyverse)
# Take a vector representing evaluation scores, named evaluate
<- c("excellent", "good", "fair", "poor", "excellent", "good")
evaluate
<- data.frame(evaluate)
data <- data %>%
data mutate(evaluate = recode(evaluate,
"excellent" = 4,
"good" = 3,
"fair" = 2,
"poor" = 1))
data
evaluate
1 4
2 3
3 2
4 1
5 4
6 3
Nominal Variable
- With nominal variables, data are simply categories for grouping.
- For example, coding race/ethnicity might have a category value of White, Black, Native American, Asian/Pacific Islander, Other.
- Qualitative values may be converted to quantitative values for analysis purposes (e.g., White = 1, Black = 2, etc.). This conversion to a numerical representation of the category would be needed to run some analyses.
- Sometimes, R does this on our behalf depending on commands used.
- We can force a variable into a factor data type using the as.factor() command.
- If we use the read.csv() command, we can sometimes do this by setting the argument stringsAsFactors = TRUE. We will do this later in the lesson.
Numerical data types:
The as.numeric() function in R is used to convert data into numeric format, allowing for mathematical and statistical operations. It takes an input vector, such as characters or factors, and attempts to coerce it into numeric values. For instance, as.numeric(c("1", "2", "3")) returns a numeric vector of 1, 2, and 3. If applied to factors, it returns the underlying integer codes, not the original levels, so caution is needed to avoid misinterpretation. When conversion is not possible (e.g., as.numeric("abc")), it results in NA with a warning.
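The factor pitfall mentioned above is worth seeing directly; a minimal sketch:
f <- factor(c("10", "20", "30"))
as.numeric(f)                # returns the underlying codes: 1 2 3
as.numeric(as.character(f))  # returns the intended values: 10 20 30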
A numerical data type is a vector of numbers that can be Real or Integer, where continuous (Real) variables can take any value along some continuum, and integers take on whole numbers.
Two ways to create:
- We can create a numeric variable by ensuring our value we assign is a number!
- We can force a variable into a real number data type by using the as.numeric() command.
# Assign Rhode Island limit for medical marijuana in ounces per person
kOuncesRhode <- 2.5
# Identify the data type
class(x = kOuncesRhode)
[1] "numeric"
Discrete (Integer) Variables: In R, integer discrete data types represent whole numbers, making them ideal for scenarios where only non-fractional values are meaningful, such as counts, rankings, or categorical levels encoded as numbers. Integer values are stored more efficiently than numeric (floating-point) values, providing performance benefits in memory usage and computations. Integers can be explicitly created using the L suffix (e.g., x <- 5L) or by coercing other data types with as.integer(). Operations on integers, such as arithmetic, comparisons, or indexing, behave consistently with their discrete nature. Integer data types are particularly useful for tasks like looping, array indexing, or when working with data where precision beyond whole numbers is unnecessary or irrelevant.
For an example of a discrete variable, we could collect information on the number of children in a family or number of points scored in a basketball game.
# Assign the value of 4 to a constant called kTestInteger and set as
# an integer
kTestInteger <- as.integer(4)
class(kTestInteger)  # Confirm the data type is an integer
[1] "integer"
# Use as.integer() to truncate the variable kOuncesRhode
Trunc <- as.integer(kOuncesRhode)
Trunc
[1] 2
Character data type
- The character data type in R is used to store text or string data, making it essential for handling names, labels, descriptions, or any non-numeric information. Character data is represented as sequences of characters enclosed in quotes (e.g., “hello” or ‘world’).
- Character data types can be wrapped in either single or double quotation marks (e.g., "hello" or 'hello').
- Character data types can include letters, words, or numbers that cannot logically be included in calculations (e.g., a zip code).
- A quick example is below that shows how to assign a character value to a variable.
# Make string constants
<- "A"
Q1 <- "B"
Q2 # Check the data type
class(x = Q1)
[1] "character"
Logical data type
Logical data types in R represent values that are either TRUE or FALSE, often used for conditional operations and data filtering. Logical values can be the result of comparisons (e.g., x > 10) or explicitly assigned. Logical vectors are particularly useful in subsetting data; for instance, data[data$column > 5, ] selects rows where the column values are greater than 5. Logical operators like & (and), | (or), and ! (not) allow for more complex conditions. Logical data is also essential in control structures such as if, else, and loops. By enabling dynamic and conditional programming, logical types play a key role in efficient data manipulation and analysis.
- A quick example is below that shows how to assign a logical value to a variable.
# Store the result of 6 > 8 in a constant called kSixEight
kSixEight <- 6 > 8
# Can use comparison tests with the following: == >= <= > < !=
# Print kSixEight
kSixEight
[1] FALSE
# Determine the data type of kSixEight
class(x = kSixEight)
[1] "logical"
Nominal Example with Dataset
library(tidyverse)
gss.2016 <- read_csv(file = "data/gss2016.csv")
# Examine the variable types with summary and class functions.
summary(gss.2016)
grass age
Length:2867 Length:2867
Class :character Class :character
Mode :character Mode :character
class(gss.2016$grass) #Check the data type.
[1] "character"
gss.2016$grass <- as.factor(gss.2016$grass)  # Turn to a factor.
class(gss.2016$grass)  # Confirming it is now correct.
[1] "factor"
Numerical Example with Dataset
- We need to ensure data can be coded as numeric before using the as.numeric() command. For example, the variable age contains mostly numerical values, except for one value of "89 OR OLDER". If the as.numeric() command were used on this variable directly, it would turn all the 89-and-older observations into NAs. To force it to be a numerical variable while keeping the information that those participants were the oldest group, we need to recode it and then use the as.numeric() command to coerce it into a number.
- Recoding the 89 and older to 89 does cause the data to lack integrity in its current form because it will treat the people over 89 years old as 89. But, we are limited here because this needs to be a numerical variable for us to proceed. We will learn a step later on in this section to transform the age variable into categories so that we bring back our data integrity.
class(gss.2016$age)
[1] "character"
# Recode '89 OR OLDER' into just '89'
gss.2016$age <- recode(gss.2016$age, `89 OR OLDER` = "89")
# Convert to numeric data type
gss.2016$age <- as.numeric(gss.2016$age)
summary(gss.2016)  # Conduct final check confirming correct data types
grass age
DK : 110 Min. :18.00
IAP : 911 1st Qu.:34.00
LEGAL :1126 Median :49.00
NOT LEGAL: 717 Mean :49.16
NA's : 3 3rd Qu.:62.00
Max. :89.00
NA's :10
Common dplyr Functions
Arrange
- Sorting or arranging the dataset allows you to specify an order based on variable values.
- Sorting allows us to review the range of values for each variable, and we can sort based on a single or multiple variables.
- Notice the difference between sort() and arrange() functions below.
- The sort() function sorts a vector.
- The arrange() function sorts a dataset based on a variable.
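- For a concrete contrast, a quick sketch of sort() on a simple vector (values hypothetical); arrange() on a dataset is demonstrated next:
wages <- c(46.00, 32.81, 43.13)
sort(wages)  # returns the sorted vector: 32.81 43.13 46.00
sort(wages, decreasing = TRUE)  # high to low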
- To conduct an example, read in the data set called gig.csv from your working directory.
<- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings = "")
gig dim(gig)
[1] 604 4
head(gig)
EmployeeID Wage Industry Job
1 1 32.81 Construction Analyst
2 2 46.00 Automotive Engineer
3 3 43.13 Construction Sales Rep
4 4 48.09 Automotive Other
5 5 43.62 Automotive Accountant
6 6 46.98 Construction Engineer
Using the arrange() function, we add the dataset, followed by a comma, and then the variable we want to sort by. This arranges from small to large.
Below is code to rearrange data based on Wage and save it in a new object.
sortTidy <- arrange(gig, Wage)
head(sortTidy)
EmployeeID Wage Industry Job
1 467 24.28 Construction Engineer
2 547 24.28 Construction Sales Rep
3 580 24.28 Construction Accountant
4 559 24.42 Construction Engineer
5 16 24.76 Automotive Programmer
6 221 24.76 Automotive Programmer
- We can apply a desc() function inside the arrange function to re-sort from high to low like shown below.
sortTidyDesc <- arrange(gig, desc(Wage))
head(sortTidyDesc)
EmployeeID Wage Industry Job
1 110 51.00 Construction Other
2 79 50.00 Automotive Engineer
3 348 49.91 Construction Accountant
4 373 49.91 Construction Accountant
5 599 49.84 Automotive Engineer
6 70 49.77 Construction Accountant
Subsetting or Filtering
Subsetting or filtering a data frame is the process of indexing, or extracting a portion of the data set that is relevant for subsequent statistical analysis. Subsetting allows you to work with a subset of your data, which is essential for data analysis and manipulation. One of the most common ways to subset in R is by using square brackets []. We can also use the filter() function from tidyverse.
We use subsets to do the following:
- View data based on specific data values or ranges.
- Compare two or more subsets of the data.
- Eliminate observations that contain missing values, low-quality data, or outliers.
- Exclude variables that contain redundant information, or variables with excessive amounts of missing values.
When working with data frames, you can subset by rows and columns using two indices inside the square brackets: data[row, column]. For example, if you have df <- data.frame(a = 1:3, b = c("X", "Y", "Z")), df[1, 2] would return the value "X", which is the first row and second column. If you want the entire first row, you would use df[1, ], and to get the second column, you'd use df[, 2].
You can also use logical conditions to subset. For instance, x[x > 20] would return all values in x greater than 20, and in a data frame, you could filter rows where a certain condition holds, such as df[df$a > 1, ], which returns rows where column a has values greater than 1.
Let’s do an example using the customers.csv file we read in earlier as customers in the last lesson. Base R provides several methods for subsetting data structures. Below uses base R by using the square brackets dataset[row, column] format.
<- read.csv("data/customers.csv", stringsAsFactors = TRUE)
customers
# To subset, note the dataset[row,column] format Results hidden to
# save space, but be sure to try this code in your .R file. Data in
# 1st row
1, ]
customers[# Data in 2nd column
2]
customers[, # Data for 2nd column/1st observation (row)
1, 2]
customers[# First 3 columns of data
1:3] customers[,
- One of the most powerful and intuitive ways to subset data frames in R is by using the filter() function from the dplyr package, which is part of the tidyverse. The tidyverse approach is extremely popular for filtering data.
- The filter function is used to subset rows of a data frame based on certain conditions.
- The below example filters data by the College variable when category values are “Yes” and saves the filtered dataset into an object called college.
# Filtering by whether the customer has a 'Yes' for college. Saving
# this filter into a new object college which you should see in your
# global environment.
college <- filter(customers, College == "Yes")
# Showing first 6 records of college - note the College variable is
# all Yes's.
head(college)
CustID Sex Race BirthDate College HHSize Income Spending Orders
1 1530016 Female Black 12/16/1986 Yes 5 53000 241 3
2 1531136 Male White 5/9/1993 Yes 5 94000 843 12
3 1532160 Male Black 5/22/1966 Yes 2 64000 719 9
4 1532307 Male White 9/16/1964 Yes 4 60000 582 13
5 1532387 Male White 8/27/1957 Yes 2 67000 452 9
6 1533017 Female Hispanic 5/14/1985 Yes 3 84000 153 2
Channel
1 SM
2 TV
3 TV
4 SM
5 SM
6 Web
- Using the filter command, we can add filters pretty easily by using an & for and, or an | for or. The statement below filters by College and Income and saves the new dataset in an object called twoFilters.
<- filter(customers, College == "Yes" & Income < 50000)
twoFilters head(twoFilters)
CustID Sex Race BirthDate College HHSize Income Spending Orders
1 1533697 Female Asian 10/8/1974 Yes 3 42000 247 3
2 1535063 Female White 12/17/1982 Yes 3 42000 313 4
3 1544417 Male Hispanic 3/14/1980 Yes 4 46000 369 3
4 1547864 Female Hispanic 6/15/1987 Yes 2 44000 500 5
5 1550969 Female White 4/8/1978 Yes 4 47000 774 16
6 1553660 Female White 8/2/1988 Yes 2 47000 745 5
Channel
1 Web
2 TV
3 TV
4 TV
5 TV
6 SM
- Next, we can do an or statement. The example below uses the filter command to filter by more than one category in the same field using the | in between the categories.
TwoRaces <- filter(customers, Race == "Black" | Race == "White")
head(TwoRaces)
CustID Sex Race BirthDate College HHSize Income Spending Orders Channel
1 1530016 Female Black 12/16/1986 Yes 5 53000 241 3 SM
2 1531136 Male White 5/9/1993 Yes 5 94000 843 12 TV
3 1532160 Male Black 5/22/1966 Yes 2 64000 719 9 TV
4 1532307 Male White 9/16/1964 Yes 4 60000 582 13 SM
5 1532387 Male White 8/27/1957 Yes 2 67000 452 9 SM
6 1533791 Male White 10/27/1999 Yes 1 97000 1028 17 Web
- The str_detect() function is used to detect the presence or absence of a pattern (regular expression) in a string or vector of strings. It returns a logical vector indicating whether the pattern was found in each element of the input vector.
- Using str_detect() with the filter() function allows you to pull observations based on the inclusion of a string pattern.
library(tidyverse)
Birthday1985 <- filter(customers, str_detect(BirthDate, "1985"))
Select
- In R, the select() function is part of the dplyr package, which is used for data manipulation. The select() function is specifically designed to subset or choose specific columns from a data frame. It allows you to select variables (columns) by their names or indices.
- Both statements below select Income, Spending, and Orders variables from the customers dataset and form them into a new dataset called smallData.
- The statements are written with and without the chaining operator.
smallData <- select(customers, Income, Spending, Orders)
head(smallData)
Income Spending Orders
1 53000 241 3
2 94000 843 12
3 64000 719 9
4 60000 582 13
5 47000 845 7
6 67000 452 9
Piping (Chaining) Operator
- The pipe operator takes the output of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side. This enables you to chain multiple functions together, making the code easier to understand and debug.
- If we want to keep our code tidy, we can use the piping operator (%>%) to combine our lines of code, saving the result in a new object or overwriting the same object.
- This operator allows us to pass the result of one function/argument to the other one in sequence.
- The below example uses a select function to pull the Income, Spending, and Orders variables from the customers dataset and save them as a new object called smallData. It is an identical request to the one directly above, but written with the piping operator.
smallData <- customers %>%
  select(Income, Spending, Orders)
Counting
Counting allows us to gain a better understanding and insights into the data.
This helps to verify that the data set is complete or determine if there are missing values.
In R, the length() function returns the number of elements in a vector, list, or any other object with a length attribute. It essentially counts the number of elements in the specified object.
# Gives the length of Industry
length(gig$Industry)
[1] 604
- For counting using tidyverse, we typically use the filter and count function together to filter by a value or state and then count the filtered data.
- In the function below, I use the piping operator to link together the filter and count functions into one command.
- Note that a piping operator (%>%) is needed at the end of each line whose result feeds into the next function in the chunk.
# Counting with a Categorical Variable. Here we are filtering by
# Automotive Industry and then counting the number and saving it in a
# new object called countAuto.
countAuto <- gig %>%
  filter(Industry == "Automotive") %>%
  count()
countAuto  # 190
n
1 190
- Below, we are filtering by Wage and then counting.
# Counting with a Numerical Variable. We could also save this in an
# object.
gig %>%
  filter(Wage > 30) %>%
  count()  ## 536
n
1 536
We learned that there are 190 employees in the automotive industry and there are 536 employees who earn more than $30 per hour.
We could also calculate the number of people with wages under or equal to 30.
# We find 68 Wages under or equal to 30
WageLess30 <- gig %>%
  filter(Wage <= 30) %>%
  count()
WageLess30
n
1 68
- How many Accountants are there in the Job category of the gig data set? The answer is shown below; use filter and count to calculate it (one possible solution is sketched after the output).
n
1 83
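- One possible solution sketch (the object name countAccountants is arbitrary):
countAccountants <- gig %>%
  filter(Job == "Accountant") %>%
  count()
countAccountants  # 83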
Handling Missing Data
Missing data is a common issue in data analysis and can arise for various reasons, such as data collection errors, non-responses in surveys, or data corruption. In R, handling missing data is crucial to ensure accurate and reliable analysis. Missing values are typically represented by NA (Not Available) in R.
Missing data needs to be closely evaluated and verified within each variable to determine whether the data is truly blank, has no answer, or is marked with a character value such as the text N/A.
It also needs to be closely evaluated to see whether the missing value is meaningful or not. If a variable that has many missing values is deemed unimportant, or can be represented using a proxy variable that does not have missing values, the variable may be excluded from the analysis.
After a data set is loaded, there are three common strategies for dealing with missing values.
The omission strategy recommends that observations with missing values be excluded from subsequent analysis.
The imputation strategy recommends that the missing values be replaced with some reasonable imputed values. For example, imputing missing values using techniques like mean/median substitution or regression models can be considered.
- Numeric variables: replace with the average.
- Categorical variables: replace with the predominant category.
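A minimal sketch of the imputation strategy using dplyr and tidyr's replace_na() (the data frame and column names here are hypothetical):
library(tidyverse)
# A hypothetical data frame with missing values
df <- tibble(income = c(50000, NA, 70000, 60000),
             channel = c("Web", "TV", NA, "TV"))
df <- df %>%
  # Numeric: replace NA with the average of the observed values
  mutate(income = replace_na(income, mean(income, na.rm = TRUE))) %>%
  # Categorical: replace NA with the predominant (most common) category
  mutate(channel = replace_na(channel, names(sort(table(channel), decreasing = TRUE))[1]))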
The ignore strategy recommends ignoring your missing data if the function works without it.
- When you ignore missing data because your function works without it, the missing values are typically excluded from the calculations by default. In R, many functions, such as mean(), sum(), or lm(), have arguments like na.rm = TRUE to explicitly remove missing values during computation. Ignoring missing data can simplify the analysis, but it comes with potential consequences.
- The choice of approach depends on the nature of the missingness, which can be categorized as Missing Completely at Random, Missing at Random, or Missing Not at Random. Addressing missing data appropriately is essential to maintain the validity of statistical analyses and avoid biases.
Limitations of Using a Missing Data Technique
Recommended Closer Evaluation of Missing Data
There are limitations of the techniques listed above (omission, imputation, and ignore).
Reduction in Sample Size: Ignoring missing data leads to a smaller effective sample size, which may reduce the power of your analysis and the reliability of the results.
Bias: If the missing data are not Missing Completely at Random, ignoring them may introduce bias. For example, if specific groups or patterns are overrepresented in the remaining data, the results may not generalize to the full dataset.
Distorted Metrics: Calculations that ignore missing values (e.g., averages, sums) might not reflect the true population parameters, especially if the missing data are systematically different from the observed data. In addition, if a large number of values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
Incorrect Inferences: Ignoring missing data without considering its nature could lead to incorrect conclusions, as the analysis only reflects the subset of available data.
Consider a dataset used to predict factors that lead to intubation due to COVID-19. Suppose one variable, “Number of pregnancies,” contains missing data (NAs) for all men, as the question is not applicable to them. If we were to compare this variable with another, “Intubated due to COVID-19: Yes/No,” simply omitting the rows with blanks (NAs) could lead to the exclusion of an entire gender, distorting the analysis. In this case, a different approach to handling missing data would be more appropriate to ensure the dataset remains representative. Additionally, if a value is not blank but is considered missing for analysis purposes, the data should be consistently processed (e.g., mutated or recoded) to align with the chosen technique for handling true missing values.
na.rm
- The na.rm parameter in R is a convenient way to handle missing values (NA) within functions that perform calculations on datasets. The parameter stands for “NA remove” and, when set to TRUE, instructs the function to exclude missing values from the computation.
y <- c(1, 2, NA, 3, 4, NA)
# These lines run, but do not give you anything useful.
sum(y)
[1] NA
mean(y)
[1] NA
Many functions in R include parameters that will ignore NAs for you.
sum() and mean() are examples of this, and most summary statistics like median(), var(), and max() also use the na.rm parameter to ignore the NAs. Always check the help to determine if na.rm is a parameter.
sum(y, na.rm = TRUE)
[1] 10
mean(y, na.rm = TRUE)
[1] 2.5
# na.omit removes the NAs from the vector or dataset. Here, it works
# for removing NAs from the vector.
y <- na.omit(y)
na.rm with a dataset
- Using a dataset, we need the data$variable format; for example, mean(data$column, na.rm = TRUE) calculates the mean of the non-missing values in the specified column or variable. This approach is straightforward and useful when the presence of missing values would otherwise cause an error or return an NA result.
<- read.csv("data/gig.csv", stringsAsFactors = TRUE, na.strings = "")
gig summary(gig)
EmployeeID Wage Industry Job
Min. : 1.0 Min. :24.28 Automotive :190 Engineer :191
1st Qu.:151.8 1st Qu.:34.19 Construction:366 Other : 88
Median :302.5 Median :41.88 Tech : 38 Accountant: 83
Mean :302.5 Mean :40.08 NA's : 10 Programmer: 80
3rd Qu.:453.2 3rd Qu.:45.87 Sales Rep : 77
Max. :604.0 Max. :51.00 (Other) : 69
NA's : 16
mean(gig$Wage, na.rm = TRUE)
[1] 40.0828
is.na()
- In R, the is.na() function is used to check for missing (NA) values in objects like vectors, data frames, or arrays. It returns a logical vector of the same length as the input object, where TRUE indicates a missing value and FALSE indicates a non-missing value.
# Counts the number of all NA values in the entire dataset
CountAllBlanks <- sum(is.na(gig))
CountAllBlanks
[1] 26
# Gives the observation number of the observations that include NA
# values
which(is.na(gig$Industry))
[1] 24 139 361 378 441 446 479 500 531 565
# Produces a dataset with observations that have NA values in the
# Industry field.
ShowBlankObservations <- gig %>%
  filter(is.na(Industry))
ShowBlankObservations
EmployeeID Wage Industry Job
1 24 42.58 <NA> Sales Rep
2 139 42.18 <NA> Engineer
3 361 31.33 <NA> Other
4 378 48.09 <NA> Other
5 441 32.35 <NA> Accountant
6 446 30.76 <NA> Accountant
7 479 42.85 <NA> Consultant
8 500 43.13 <NA> Sales Rep
9 531 43.13 <NA> Engineer
10 565 38.98 <NA> Accountant
# Counts the number of observations that have NA values in the
# Industry field. Industry is categorical, so we can count values
# based on it.
CountBlanks <- sum(is.na(gig$Industry))
CountBlanks
[1] 10
library(tidyverse)
# Counts the number of observations that have NA values in the Wage
# field.
CountBlanks <- sum(is.na(gig$Wage))
CountBlanks
[1] 0
na_if()
- The na_if() function in dplyr is used to replace specific values in a column with NA (missing) values. This function can be particularly useful when you want to standardize missing values across a dataset or when you want to replace certain values with NA for further data processing.
TurnNA <- gig %>%
  mutate(Job = na_if(Job, "Other"))
head(TurnNA)
EmployeeID Wage Industry Job
1 1 32.81 Construction Analyst
2 2 46.00 Automotive Engineer
3 3 43.13 Construction Sales Rep
4 4 48.09 Automotive <NA>
5 5 43.62 Automotive Accountant
6 6 46.98 Construction Engineer
na.omit() vs. drop_na()
Both functions return a new object with the rows containing missing values removed.
na.omit() is a base R function, so it doesn't require any additional package installation, whereas drop_na() requires loading the tidyr package, which is part of the tidyverse ecosystem.
drop_na() fits well into tidyverse pipelines, making it easy to integrate with other tidyverse functions, whereas na.omit() can also be used in pipelines but might require additional steps to fit seamlessly.
# install.packages('Amelia')
library(Amelia)
data("africa")
summary(africa)
year country gdp_pc infl
Min. :1972 Burkina Faso:20 Min. : 376.0 Min. : -8.400
1st Qu.:1977 Burundi :20 1st Qu.: 513.8 1st Qu.: 4.760
Median :1982 Cameroon :20 Median :1035.5 Median : 8.725
Mean :1982 Congo :20 Mean :1058.4 Mean : 12.753
3rd Qu.:1986 Senegal :20 3rd Qu.:1244.8 3rd Qu.: 13.560
Max. :1991 Zambia :20 Max. :2723.0 Max. :127.890
NA's :2
trade civlib population
Min. : 24.35 Min. :0.0000 Min. : 1332490
1st Qu.: 38.52 1st Qu.:0.1667 1st Qu.: 4332190
Median : 59.59 Median :0.1667 Median : 5853565
Mean : 62.60 Mean :0.2889 Mean : 5765594
3rd Qu.: 81.16 3rd Qu.:0.3333 3rd Qu.: 7355000
Max. :134.11 Max. :0.6667 Max. :11825390
NA's :5
summary(africa$gdp_pc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
376.0 513.8 1035.5 1058.4 1244.8 2723.0 2
summary(africa$trade)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
24.35 38.52 59.59 62.60 81.16 134.11 5
africa1 <- na.omit(africa)
summary(africa1)
year country gdp_pc infl
Min. :1972 Burkina Faso:20 Min. : 376.0 Min. : -8.40
1st Qu.:1976 Burundi :17 1st Qu.: 511.5 1st Qu.: 4.67
Median :1981 Cameroon :18 Median :1062.0 Median : 8.72
Mean :1981 Congo :20 Mean :1071.8 Mean : 12.91
3rd Qu.:1986 Senegal :20 3rd Qu.:1266.0 3rd Qu.: 13.57
Max. :1991 Zambia :20 Max. :2723.0 Max. :127.89
trade civlib population
Min. : 24.35 Min. :0.0000 Min. : 1332490
1st Qu.: 38.52 1st Qu.:0.1667 1st Qu.: 4186485
Median : 59.59 Median :0.1667 Median : 5858750
Mean : 62.60 Mean :0.2899 Mean : 5749761
3rd Qu.: 81.16 3rd Qu.:0.3333 3rd Qu.: 7383000
Max. :134.11 Max. :0.6667 Max. :11825390
## to drop all at once.
africa2 <- africa %>%
  drop_na()
summary(africa2)
year country gdp_pc infl
Min. :1972 Burkina Faso:20 Min. : 376.0 Min. : -8.40
1st Qu.:1976 Burundi :17 1st Qu.: 511.5 1st Qu.: 4.67
Median :1981 Cameroon :18 Median :1062.0 Median : 8.72
Mean :1981 Congo :20 Mean :1071.8 Mean : 12.91
3rd Qu.:1986 Senegal :20 3rd Qu.:1266.0 3rd Qu.: 13.57
Max. :1991 Zambia :20 Max. :2723.0 Max. :127.89
trade civlib population
Min. : 24.35 Min. :0.0000 Min. : 1332490
1st Qu.: 38.52 1st Qu.:0.1667 1st Qu.: 4186485
Median : 59.59 Median :0.1667 Median : 5858750
Mean : 62.60 Mean :0.2899 Mean : 5749761
3rd Qu.: 81.16 3rd Qu.:0.3333 3rd Qu.: 7383000
Max. :134.11 Max. :0.6667 Max. :11825390
You try: load the airquality dataset from base R and look at a summary of the dataset.
- Sum the number of NAs in airquality.
- Omit all the NAs from airquality and save it in a new data object called airqual and take a new summary of it.
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
     Month            Day
 Min.   :5.000   Min.   : 1.0
 1st Qu.:6.000   1st Qu.: 8.0
 Median :7.000   Median :16.0
 Mean   :6.993   Mean   :15.8
 3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :9.000   Max.   :31.0
[1] 44
     Ozone          Solar.R           Wind            Temp
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00
     Month            Day
 Min.   :5.000   Min.   : 1.00
 1st Qu.:6.000   1st Qu.: 9.00
 Median :7.000   Median :16.00
 Mean   :7.216   Mean   :15.95
 3rd Qu.:9.000   3rd Qu.:22.50
 Max.   :9.000   Max.   :31.00
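- One way to complete this exercise (a sketch; it reproduces the outputs above):
data("airquality")
summary(airquality)
# Sum the number of NAs in airquality
sum(is.na(airquality))  # 44
# Omit all the NAs, save as airqual, and take a new summary
airqual <- na.omit(airquality)
summary(airqual)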
Summarize
- The summarize() command is used to create summary statistics for groups of observations in a data frame.
- In R, summary() and summarize() serve different purposes. summary() is part of base R and gives a quick overview of data, returning descriptive statistics for each column. For example, summary(mtcars) provides the min, max, median, and mean for numeric columns and counts for factors. It’s useful for a broad snapshot of your dataset.
- In contrast, summarize() (or summarise()) is from the dplyr package and allows for custom summaries. For instance, mtcars %>% summarize(avg_mpg = mean(mpg), max_hp = max(hp)) returns the average miles per gallon and the maximum horsepower. It’s more flexible and is often used with group_by() for grouped calculations. In conclusion, summary() gives automatic overviews, while summarize() is better for tailored summaries.
- In the example below, we can summarize more than one thing into tidy output.
gig %>%
  drop_na() %>%
  summarize(mean.days = mean(Wage), sd.days = sd(Wage), var.days = var(Wage),
            med.days = median(Wage), iqr.days = IQR(Wage))
mean.days sd.days var.days med.days iqr.days
1 40.14567 7.047058 49.66103 41.82 11.465
Group_by
- group_by() is used for grouping data by one or more variables. When you use group_by() on a data frame, it doesn't actually perform any computations immediately. Instead, it sets up the data frame in such a way that any subsequent operations are performed within these groups.
- summarize() is often used in combination with group_by() to calculate summary statistics within groups.
## summarize data by Industry variable.
groupedData <- gig %>%
  group_by(Industry) %>%
  summarize(meanWage = mean(Wage))
groupedData
# A tibble: 4 × 2
Industry meanWage
<fct> <dbl>
1 Automotive 43.4
2 Construction 38.3
3 Tech 40.7
4 <NA> 39.5
## same function with na's dropped.
groupedData <- gig %>%
  drop_na() %>%
  group_by(Industry) %>%
  summarize(meanWage = mean(Wage))
groupedData
# A tibble: 3 × 2
Industry meanWage
<fct> <dbl>
1 Automotive 43.4
2 Construction 38.4
3 Tech 40.7
Mutate
- mutate() is part of the dplyr package, which is used for data manipulation. The mutate() function is specifically designed to create new variables (columns) or modify existing variables in a data frame. It is commonly used in data wrangling tasks to add calculated columns or transform existing ones.
- One example is below, but note that there are many things you can do with the mutate function.
# making a new variable called calculation that multiplies gdp_pc by
# infl variables in the africa1 dataset.
africa.mutated <- mutate(africa1, calculation = gdp_pc * infl)
head(africa.mutated)
year country gdp_pc infl trade civlib population calculation
1 1972 Burkina Faso 377 -2.92 29.69 0.5000000 5848380 -1100.84
2 1973 Burkina Faso 376 7.60 31.31 0.5000000 5958700 2857.60
3 1974 Burkina Faso 393 8.72 35.22 0.3333333 6075700 3426.96
4 1975 Burkina Faso 416 18.76 40.11 0.3333333 6202000 7804.16
5 1976 Burkina Faso 435 -8.40 37.76 0.5000000 6341030 -3654.00
6 1977 Burkina Faso 448 29.99 41.11 0.6666667 6486870 13435.52
- Below is an example with the iris dataset, which is part of base R.
data("iris")
## Selecting 2 variables from the iris dataset: Sepal.Length and
## Petal.Length
selected_data <- select(iris, Sepal.Length, Petal.Length)
head(selected_data)
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
4 4.6 1.5
5 5.0 1.4
6 5.4 1.7
# Filter rows based on a condition: Species = setosa
filtered_data <- filter(iris, Species == "setosa")
head(filtered_data)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# Arrange rows by the Sepal.Length column
arranged_data <- arrange(iris, Sepal.Length)
# Create a new column by mutating the data by transforming
# Petal.Width to the log form.
mutated_data <- mutate(iris, Petal.Width_Log = log(Petal.Width))
Full Examples
gss.2016 Data Cleaning
- First, because we made some edits to the data set, reread it in using the read.csv() command. This brings the data set back to its original form. It is always a good idea to read the dataset back in when you are unsure about whether you have made a mistake during data preparation that could cause a lack of data integrity.
gss.2016 <- read.csv(file = "data/gss2016.csv")
- Before we remove any missing data, we need it to be the correct data type. In this case, grass should be a factor.
# We coerced this variable earlier, but the object was called
# gss.2016. Since we reread in the data set, this needs to be done
# again.
gss.2016$grass <- as.factor(gss.2016$grass)
- The statement below is an equivalent to the function above, but written with the piping operator. It is overwriting gss.2016 after conducting the coercion to factor.
- We added the mutate function because we are going to add other data cleaning tasks to this statement.
gss.2016 <- gss.2016 %>%
  mutate(grass = as.factor(grass))
Piping to More Functions: Missing Data
- In the code below, the as.factor() command has been moved inside a broader mutate() statement (from the tidyverse library), and the na_if() command that handles missing data is piped onto it. If you use more than one data manipulation statement, the mutate() command helps organize your code, with one mutate() for each major change you are making.
- In the code below, we created a new object gss.2016.cleaned to store the cleaned version of the dataset. This helps maintain data integrity because your original dataset is still intact, and each time you rerun the entire chunk, it applies all of the changes at once.
gss.2016.cleaned <- gss.2016 %>%
  # Moved coercion statement into a mutate function to keep code
  # tidy
  mutate(grass = as.factor(grass)) %>%
  # Turn the DK ('don't know') values into NA
  mutate(grass = na_if(x = grass, y = "DK"))
# Check the summary, there should be 110 + 3 = 113 in the NA category
summary(object = gss.2016.cleaned)
grass age
DK : 0 Length:2867
IAP : 911 Class :character
LEGAL :1126 Mode :character
NOT LEGAL: 717
NA's : 113
Drop Levels
The droplevels function is part of base R and is used to drop unused levels from factor variables in a data frame. It works by removing any levels from a factor variable that are not present in the data.
Next, we want to edit our code to convert IAP and DK to NA values and drop levels that are empty.
- Note the Piping operator added to the end of the DK line so you can keep going with new commands editing gss.2016.cleaned.
gss.2016.cleaned <- gss.2016 %>%
  mutate(grass = as.factor(grass)) %>%
  # Added piping operator
  mutate(grass = na_if(x = grass, y = "DK")) %>%
  # Turn to NA if value of grass = IAP
  mutate(grass = na_if(x = grass, y = "IAP")) %>%
  # Drop levels in grass that have no values
  mutate(grass = droplevels(x = grass))
# Check what you just did
summary(gss.2016.cleaned)
       grass           age
 LEGAL    :1126   Length:2867
 NOT LEGAL: 717   Class :character
 NA's     :1024   Mode  :character
Coercing to Numeric
- Next, we handle a numerical variable, age. Again, age cannot be a numerical data type as-is because it has "89 OR OLDER" as a value. Before using the as.numeric() command, we need to recode it. We did this above as a stand-alone statement.
gss.2016.cleaned <- gss.2016 %>%
  mutate(grass = as.factor(grass)) %>%
mutate(grass = na_if(x = grass, y = "DK")) %>%
mutate(grass = na_if(x = grass, y = "IAP")) %>%
# Added piping operator
mutate(grass = droplevels(x = grass)) %>%
# Ensure variable can be coded as numeric and fix if necessary.
mutate(age = recode(age, `89 OR OLDER` = "89")) %>%
# Coerce into numeric
mutate(age = as.numeric(x = age))
# Check what you just did
summary(gss.2016.cleaned)
grass age
LEGAL :1126 Min. :18.00
NOT LEGAL: 717 1st Qu.:34.00
NA's :1024 Median :49.00
Mean :49.16
3rd Qu.:62.00
Max. :89.00
NA's :10
The recode() command that is part of dplyr is like the ifelse() command that is in base R. There are a lot of ways to recode in R.
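For comparison, a base R equivalent of the recode() step above could be sketched with ifelse():
# Base R equivalent of recode(age, `89 OR OLDER` = "89")
gss.2016$age <- ifelse(gss.2016$age == "89 OR OLDER", "89", gss.2016$age)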
Finally, we want to take our numerical variable, age, and cut it at certain breaks to make categories that can be easily analyzed.
- This also ensures that anyone above 89 is coded correctly in a category instead of as the value 89. This again brings back data integrity.
- The cut() function generates class limits and bins used in frequency distributions (and histograms) for quantitative data.
- Here, we are using it to cut age into a categorical variable.
gss.2016.cleaned <- gss.2016 %>%
  mutate(grass = as.factor(grass)) %>%
  mutate(grass = na_if(x = grass, y = "DK")) %>%
  mutate(grass = na_if(x = grass, y = "IAP")) %>%
  mutate(grass = droplevels(grass)) %>%
  mutate(age = recode(age, `89 OR OLDER` = "89")) %>%
  # Added piping operator
  mutate(age = as.numeric(age)) %>%
  # Cut numeric variable into groupings
  mutate(age.cat = cut(age,
                       breaks = c(-Inf, 29, 59, 74, Inf),
                       labels = c("< 30", "30 - 59", "60 - 74", "75+")))
# Check what you just did
summary(gss.2016.cleaned)
       grass           age           age.cat
 LEGAL    :1126   Min.   :18.00   < 30   : 481
 NOT LEGAL: 717   1st Qu.:34.00   30 - 59:1517
 NA's     :1024   Median :49.00   60 - 74: 598
                  Mean   :49.16   75+    : 261
                  3rd Qu.:62.00   NA's   :  10
                  Max.   :89.00
                  NA's   :10
brfss Data Cleaning
- The full codebook for these variables is brfss_2014_codebook.pdf.
<- read.csv("data/brfss.csv")
brfss summary(brfss)
TRNSGNDR X_AGEG5YR X_RACE X_INCOMG
Min. :1.00 Min. : 1.000 Min. :1.000 Min. :1.000
1st Qu.:4.00 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:3.000
Median :4.00 Median : 8.000 Median :1.000 Median :5.000
Mean :4.06 Mean : 7.822 Mean :1.992 Mean :4.481
3rd Qu.:4.00 3rd Qu.:10.000 3rd Qu.:1.000 3rd Qu.:5.000
Max. :9.00 Max. :14.000 Max. :9.000 Max. :9.000
NA's :310602 NA's :94
X_EDUCAG HLTHPLN1 HADMAM X_AGE80
Min. :1.000 Min. :1.000 Min. :1.00 Min. :18.00
1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:44.00
Median :3.000 Median :1.000 Median :1.00 Median :58.00
Mean :2.966 Mean :1.108 Mean :1.22 Mean :55.49
3rd Qu.:4.000 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:69.00
Max. :9.000 Max. :9.000 Max. :9.00 Max. :80.00
NA's :208322
PHYSHLTH
Min. : 1.0
1st Qu.:20.0
Median :88.0
Mean :61.2
3rd Qu.:88.0
Max. :99.0
NA's :4
Qualitative Variable
- To look at an example, the one below seeks to understand the healthcare issue in reporting gender based on different definitions. The dataset is part of the Behavioral Risk Factor Surveillance System (brfss) dataset (2014), which includes lots of other variables besides reported gender.
# Load the data
<- read.csv("data/brfss.csv")
brfss # Summarize the TRNSGNDR variable
summary(object = brfss$TRNSGNDR)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 4.00 4.00 4.06 4.00 9.00 310602
# Find frequencies
table(brfss$TRNSGNDR)
1 2 3 4 7 9
363 212 116 150765 1138 1468
- Since this table is not very informative, we need to do some edits.
- Check the class of the variable to see the issue with analyzing it as a categorical variable.
class(brfss$TRNSGNDR)
[1] "integer"
- First, we need to change the TRNSGNDR variable to a factor using as.factor().
# Change variable from numeric to factor
brfss$TRNSGNDR <- as.factor(brfss$TRNSGNDR)
# Check data type again to ensure factor
class(brfss$TRNSGNDR)
[1] "factor"
- Then, we need to do some data cleaning on the TRNSGNDR Variable.
brfss.cleaned <- brfss %>%
  mutate(TRNSGNDR = recode_factor(TRNSGNDR,
'1' = 'Male to female',
'2' = 'Female to male',
'3' = 'Gender non-conforming',
'4' = 'Not transgender',
'7' = 'Not sure',
'9' = 'Refused'))
- We can use the levels() command to show the factor levels made with the mutate() command above.
levels(brfss.cleaned$TRNSGNDR)
[1] "Male to female" "Female to male" "Gender non-conforming"
[4] "Not transgender" "Not sure" "Refused"
- Check the summary.
summary(brfss.cleaned$TRNSGNDR)
Male to female Female to male Gender non-conforming
363 212 116
Not transgender Not sure Refused
150765 1138 1468
NA's
310602
- Take a good look at the table to interpret the frequencies in the output above. The highest percentage was the “NA’s” category, followed by “Not transgender”. Removing the NA’s moved the “Not transgender” category to over 97% of observations.
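- One way to verify that percentage (a sketch; note that table() excludes NAs by default):
# Proportions among non-missing responses
round(prop.table(table(brfss.cleaned$TRNSGNDR)), 3)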
Quantitative Variable
- Let's use the cleaned dataset to make more changes to the continuous variable PHYSHLTH. In the codebook, it looks like the data is most applicable to the first two categories: the 1-30 days coding and the 88 coding, which means 0 days of physical illness and injury.
- Using cleaned data, we need to prep the variable a little more before getting an accurate plot.
- Specifically, we need to null out the 77 and 99 values and make sure the 88 coding is set to be 0 for 0 days of illness and injury.
brfss.cleaned <- brfss %>%
  mutate(TRNSGNDR = recode_factor(TRNSGNDR, `1` = "Male to female", `2` = "Female to male",
`3` = "Gender non-conforming", `4` = "Not transgender", `7` = "Not sure",
`9` = "Refused")) %>%
# Turn the 77 values to NA's.
mutate(PHYSHLTH = na_if(PHYSHLTH, y = 77)) %>%
# Turn the 99 values to NA's.
mutate(PHYSHLTH = na_if(PHYSHLTH, y = 99)) %>%
# Recode the 88 values to be numeric value of 0.
mutate(PHYSHLTH = recode(PHYSHLTH, `88` = 0L))
The histogram showed most people have between 0 and 10 unhealthy days per 30 days.
Next, evaluate mean, median, and mode for the PHYSHLTH variable after ignoring the blanks.
mean(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 4.224106
median(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 0
names(x = sort(x = table(brfss.cleaned$PHYSHLTH), decreasing = TRUE))[1]
[1] "0"
- While the mean is higher at 4.22, the median and most common number is 0.
## Spread to Report with the Mean
var(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 77.00419
sd(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 8.775203
## Spread to Report with Median
summary(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 0.000 0.000 4.224 3.000 30.000 10303
range(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 0 30
max(brfss.cleaned$PHYSHLTH, na.rm = TRUE) - min(brfss.cleaned$PHYSHLTH,
na.rm = TRUE)
[1] 30
IQR(brfss.cleaned$PHYSHLTH, na.rm = TRUE)
[1] 3
library(semTools)
# Plot the data
brfss.cleaned %>%
  ggplot(aes(PHYSHLTH)) + geom_histogram()
# Calculate Skewness and Kurtosis
skew(brfss.cleaned$PHYSHLTH)
skew (g1) se z p
2.209078e+00 3.633918e-03 6.079054e+02 0.000000e+00
kurtosis(brfss.cleaned$PHYSHLTH)
Excess Kur (g2) se z p
3.474487e+00 7.267836e-03 4.780634e+02 0.000000e+00
- The skew results provide a z of 607.905 (6.079054e+02) which is much higher than 7 (for large datasets). This indicates a clear right skew which means the data is not normally distributed.
- The kurtosis results are also very leptokurtic with a score of 478.063.
Using Filters Example
Below is an example using the brfss data to filter by certain variable statuses.
- The first filter() chose observations that were any one of the three transgender categories included in the data. We used the | ("or") operator for this filter().
- The second filter chose people in an age category above category 4 but below category 12, in the age categories 5 through 11.
- The last filter used !is.na() to choose observations where the HADMAM variable was not NA.
Next, we reduce the data set to contain only the variables used to create the table by using the select() command.
Next, we change all the remaining variables in the data set to factors using the mutate_all() command. This not only changes the strings to factors, but also changes the numerical variables to factors.
Finally, we use mutate() commands to change the variable categories to something meaningful (from the codebook).
- Notice that an apostrophe inside a single-quoted string (e.g., 'Don\'t') must be escaped with a backslash to prevent it from ending the quotation in the .R file. Using double quotes around the string, as in the code below, bypasses this.
brfss_small <- brfss.cleaned %>%
  filter(TRNSGNDR == "Male to female" | TRNSGNDR == "Female to male" |
           TRNSGNDR == "Gender non-conforming") %>%
  filter(X_AGEG5YR > 4 & X_AGEG5YR < 12) %>%
  filter(!is.na(HADMAM)) %>%
  select(TRNSGNDR, X_AGEG5YR, X_RACE, X_INCOMG, X_EDUCAG, HLTHPLN1, HADMAM) %>%
  mutate_all(as.factor) %>%
  # The next few mutates add labels to categorical variables based
  # on the codebook.
  mutate(X_AGEG5YR = recode_factor(X_AGEG5YR, `5` = "40-44", `6` = "45-49",
                                   `7` = "50-54", `8` = "55-59", `9` = "60-64",
                                   `10` = "65-69", `11` = "70-74")) %>%
  mutate(X_INCOMG = recode_factor(X_INCOMG, `1` = "Less than 15,000",
                                  `2` = "15,000 to less than 25,000",
                                  `3` = "25,000 to less than 35,000",
                                  `4` = "35,000 to less than 50,000",
                                  `5` = "50,000 or more",
                                  `9` = "Don't know/not sure/missing")) %>%
  mutate(X_EDUCAG = recode_factor(X_EDUCAG, `1` = "Did not graduate high school",
                                  `2` = "Graduated high school",
                                  `3` = "Attended college/technical school",
                                  `4` = "Graduated from college/technical school",
                                  `9` = NA_character_)) %>%
  mutate(HLTHPLN1 = recode_factor(HLTHPLN1, `1` = "Yes", `2` = "No",
                                  `7` = "Don't know/not sure/missing",
                                  `9` = "Refused")) %>%
  mutate(X_RACE = recode_factor(X_RACE, `1` = "White", `2` = "Black",
                                `3` = "Native American", `4` = "Asian/Pacific Islander",
                                `5` = "Other", `6` = "Other", `7` = "Other",
                                `8` = "Other", `9` = "Other"))
# print a summary
summary(brfss_small)
                  TRNSGNDR   X_AGEG5YR                    X_RACE
 Male to female       : 77   40-44:27   White                 :152
 Female to male       :113   45-49:27   Black                 : 31
 Gender non-conforming: 32   50-54:32   Native American       :  4
 Not transgender      :  0   55-59:44   Asian/Pacific Islander:  6
 Not sure             :  0   60-64:44   Other                 : 29
 Refused              :  0   65-69:24
                             70-74:24
                       X_INCOMG                                   X_EDUCAG
 Less than 15,000           :46   Did not graduate high school           :24
 15,000 to less than 25,000 :44   Graduated high school                  :86
 25,000 to less than 35,000 :19   Attended college/technical school      :68
 35,000 to less than 50,000 :26   Graduated from college/technical school:44
 50,000 or more             :65
 Don't know/not sure/missing:22
 HLTHPLN1  HADMAM
 Yes:198   1:198
 No : 24   2: 22
           9:  2
This data set full of categorical variables is now fully cleaned and ready to be analyzed!
Using AI
Use the following prompts with a generative AI, like ChatGPT, to learn more about data preparation activities.
What is the difference between ordinal and nominal variables, and how can you recode these data types in R?
Explain the steps to coerce a character variable into a factor using the as.factor() function in R.
How can you use the filter() function to subset a dataset based on multiple conditions using & and | in R?
How does subsetting using square brackets [] differ from using the filter() function in R?
What strategies can you use to handle missing data in R, and how does na.omit() differ from drop_na()?
How does the mutate() function help in transforming and creating new variables, and what are some practical examples?
What is the purpose of the group_by() function, and how does it interact with summarize() to create summary statistics in R?
Explain how you can use arrange() to sort a dataset by one or more variables and demonstrate sorting both in ascending and descending order.
Why is the piping operator %>% useful in R, and how does it improve the readability and structure of your code?
How would you use summarize() to calculate mean, median, and standard deviation for a numerical variable in R?
Summary
- In this lesson, we worked through the basics of data cleaning. Data cleaning is very important, and there are many ways to do it. Provided are some examples using popular functions in dplyr (part of the tidyverse).