####################################
# Project name: Descriptive Statistics
# Data used: customers.csv
# Libraries used: tidyverse, semTools
####################################
3 Descriptive Statistics
- The goal of this lesson is to teach you how to summarize descriptive statistics for both quantitative and qualitative data in R. To succeed in this lesson, we need to learn how to describe data by understanding the difference between quantitative and qualitative data and by examining its context. The context of the data gives us important information about how the data was collected and its intended use. The second goal is to teach you about data types, coercion, and how to transform variables that do not follow a normal distribution.
3.0.1 Lesson Objectives
- Choose and conduct descriptive analyses for categorical (factor) variables.
- Choose and conduct descriptive analyses for continuous (numeric) variables.
3.0.2 Consider While Reading
- In any analysis, it is critically important to understand your data set by evaluating descriptive statistics. For qualitative data, you should know how to calculate frequencies and proportions and make a user-friendly display of results. For quantitative data, you should know what a histogram is and how it is used to describe quantitative data. We also should know how to describe variables by their center and spread of the distribution.
3.1 Summarizing Qualitative Data
- Qualitative data is information that cannot easily be counted, measured, or expressed using numbers.
- Nominal variables: a type of categorical variable that represents discrete categories or groups with no inherent order or ranking
- gender (male, female)
- marital status (single, married, divorced)
- eye color (blue, brown, green)
- Ordinal variables: categories possess a natural order or ranking
- a Likert scale measuring agreement with a statement (e.g., strongly disagree, disagree, neutral, agree, strongly agree)
- A frequency distribution shows the number of observations in each category for a factor or categorical variable.
- Guidelines when constructing frequency distribution:
- Classes or categories are mutually exclusive (they are all unique).
- Classes or categories are exhaustive (a full list of categories).
- To calculate frequencies, first, start with a variable that has categorical data.
# Create a vector with some data that could be categorical
<- c("A", "B", "A", "C", "A", "B", "A", "C", "A", "B")
Sample_Vector # Create a data frame with the vector
<- data.frame(Sample_Vector) data
- To count the number of each category value, we can use the table() command.
- The output shows a top row of categories and a bottom row that contains the number of observations in the category.
# Create a table of frequencies
frequencies <- table(data$Sample_Vector)
frequencies
A B C
5 3 2
- Relative frequency is how often something happens divided by all outcomes.
- The relative frequency is calculated by \(f_i/n\), where \(f_i\) is the frequency of class \(i\) and \(n\) is the total frequency.
- We can use the prop.table() command to calculate relative frequency by dividing each category’s frequency by the sample size.
# Calculate proportions
proportions <- prop.table(frequencies)
- The cumulative relative frequency is given by \(cf_i/n\), where \(cf_i\) is the cumulative frequency of class \(i\).
- The cumsum() function calculates cumulative sums, which give the cumulative distribution of the data.
# Calculate cumulative frequencies
cumulfreq <- cumsum(frequencies)
# Calculate cumulative proportions
cumulproportions <- cumsum(prop.table(frequencies))
- The rbind() function is used to combine multiple data frames or matrices by row. The name “rbind” stands for “row bind”. Since the data produced by the table is in rows, we can use rbind to link them together.
# combine into table
frequency_table <- rbind(frequencies, proportions, cumulfreq, cumulproportions)
# Print the table
frequency_table
A B C
frequencies 5.0 3.0 2.0
proportions 0.5 0.3 0.2
cumulfreq 5.0 8.0 10.0
cumulproportions 0.5 0.8 1.0
- We can transpose a table using the t() command, which flips the dataset.
TransposedData <- t(frequency_table)
TransposedData
frequencies proportions cumulfreq cumulproportions
A 5 0.5 5 0.5
B 3 0.3 8 0.8
C 2 0.2 10 1.0
- Finally, sometimes we need to transform our calculations into a dataset.
- The as.data.frame function is used to coerce or convert an object into a data frame.
- as.data.frame() is used when you have an existing object (such as a list, matrix, or vector) that needs to be coerced into a data frame. data.frame(), on the other hand, creates a data frame from scratch by specifying the data directly, typically from individual vectors or lists.
- as.data.frame() therefore accepts a wider variety of inputs (lists, matrices, and vectors), while data.frame() directly accepts vectors and lists to construct the data frame.
TransposedData <- as.data.frame(TransposedData)
TransposedData
frequencies proportions cumulfreq cumulproportions
A 5 0.5 5 0.5
B 3 0.3 8 0.8
C 2 0.2 10 1.0
3.2 Summarizing Quantitative Data
3.2.1 Defining and Calculating Central Tendency
- The term central location refers to how numerical data tend to cluster around some middle or central value.
- Measures of central location attempt to find a typical or central value that describes a variable.
- Why frequency distributions do not work for numeric variables:
- Numeric variables are measured on a continuum, so nearly every value can be unique.
- Instead, we calculate descriptive statistics including central tendency and spread of the values for a numeric variable.
- We will examine the three most widely used measures of central location: mean, median, and mode.
- Then we discuss a percentile: a measure of relative position.
3.2.1.1 Using the Mean
The arithmetic mean or simply the mean is a primary measure of central location. It is often referred to as the average. Simply add up all the observations and divide by the number of observations.
The numerator (top of the fraction) is the sum (sigma) of all the values of x from the first value (i = 1) to the last value (n); this sum is then divided by the number of values (n).
\(\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\)
Consider the salaries of employees at a company:
We can use the mean() command to calculate the mean in R.
# Create Vector of Salaries
salaries <- c(40000, 40000, 65000, 90000, 145000, 150000, 550000)
# Calculate the mean using the mean() command
mean(salaries)
[1] 154285.7
- Note that due to at least one outlier this mean does not reflect the typical salary - more on that later.
- If we edit our vector to include NAs, we have to account for them. The na.rm argument shown below is a common way to handle NAs in functions that do not allow for them.
salaries2 <- c(40000, 40000, 65000, 90000, 145000, 150000, 550000, NA, NA)
# Calculate the mean using the mean() command. Notice that it does not
# work.
mean(salaries2)
[1] NA
# Add in na.rm parameter to get it to produce the mean with no NAs.
mean(salaries2, na.rm = TRUE)
[1] 154285.7
- Note that there are other types of means like the weighted mean or the geometric mean.
- The weighted mean uses weights to determine the importance of each data point of a variable. It is calculated by \(\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\), where \(w_i\) are the weights associated with the values \(x_i\).
- An example is below.
values <- c(4, 7, 10, 5, 6)
weights <- c(1, 2, 3, 4, 5)
weighted_mean <- weighted.mean(values, weights)
weighted_mean
[1] 6.533333
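- The geometric mean has no dedicated function in base R; a minimal sketch using logarithms is below (the helper name geo_mean is our own, and it assumes all values are positive).
# Geometric mean: exponentiate the mean of the logs
geo_mean <- function(x) exp(mean(log(x)))
geo_mean(values)  # approximately 6.09, slightly below the arithmetic mean of 6.4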
3.2.1.2 Using the Median
- The median is another measure of central location that is not affected by outliers.
- When the data are arranged in ascending order, the median is:
- The middle value if the number of observations is odd, or
- The average of the two middle values if the number of observations is even.
- Consider the sorted salaries of employees presented earlier, which contain an odd number of observations.
- On the same salaries vector created above, use the median() command to calculate the median in R.
# Calculate the median using the median() command
median(salaries)
[1] 90000
- Now compare to the mean and note the large difference in numbers signifying that at least one outlier is most likely present.
- Specifically, if the mean and median are different, it is likely the variable is skewed and contains outliers.
mean(salaries)
[1] 154285.7
- For another example, consider the sorted data below that contains an even number of values.
GrowthFund <- c(-38.32, 1.71, 3.17, 5.99, 12.56, 13.47, 16.89, 16.96, 32.16,
    36.29)
- When data contains an even number of values, the median is the average of the 2 sorted middle numbers (12.56 and 13.47).
median(GrowthFund)
[1] 13.015
(12.56 + 13.47)/2
[1] 13.015
# The mean is still the average
mean(GrowthFund)
[1] 10.088
3.2.1.3 Using the Mode
- The mode is another measure of central location.
- The mode is the most frequently occurring value in a data set.
- The mode is useful in summarizing categorical data but can also be used to summarize quantitative data.
- A data set can have no mode, one mode (unimodal), two modes (bimodal) or many modes (multimodal).
- The mode is less useful when there are more than three modes.
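- Base R has no built-in mode function; a minimal helper is sketched below (the name get_mode is our own). It returns every value tied for the highest frequency.
# Return the most frequent value(s) in a vector
get_mode <- function(x) {
    freqs <- table(x)
    names(freqs)[freqs == max(freqs)]
}
get_mode(c("A", "B", "A", "C"))  # returns "A" (unimodal)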
3.2.1.4 Example of Function with Salary Variable
While this is a small vector, when working with a large dataset and a function like sort(x = table(salaries), decreasing = TRUE), appending [1:5] is a way to focus on the top results after the frequencies have been computed and sorted. Specifically, table(salaries) calculates the frequency of each unique salary, sort(…, decreasing = TRUE) orders these frequencies from highest to lowest, and [1:5] selects the first five entries in the sorted list. This is useful when the dataset contains many unique values, as it allows you to quickly identify and extract the top 5 most frequent salaries, providing a concise summary without being overwhelmed by the full distribution.
Consider the salary of employees presented earlier. 40,000 appears 2 times and is the mode because that occurs most often.
# Try this command with and without it.
sort(x = table(salaries), decreasing = TRUE)[1:5]
salaries
40000 65000 90000 145000 150000
2 1 1 1 1
3.2.1.5 Finding No Mode
- Look at the sort(table()) commands with the GrowthFund Vector we made earlier.
- I added a 1:5 in square brackets at the end of the statement to produce the 5 highest frequencies found in the vector.
sort(table(GrowthFund), decreasing = TRUE)[1:5]
GrowthFund
-38.32 1.71 3.17 5.99 12.56
1 1 1 1 1
- Even if you use this command, you still need to evaluate the data more systematically to verify the mode. If the highest frequency of the sorted table is 1, then there is no mode.
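- A minimal sketch of that systematic check is below: report the most frequent value(s) only when some value actually repeats.
# If the highest frequency is 1, there is no mode
freqs <- table(GrowthFund)
if (max(freqs) == 1) "no mode" else names(freqs)[freqs == max(freqs)]
# returns "no mode" here because every GrowthFund value is unique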
3.3 Defining and Calculating Spread
- Spread is a measure of how far values are from the central value.
- Each measure of central tendency has one or more corresponding measures of spread.
- Mean: use variance or standard deviation to measure spread.
- Skewness and kurtosis help describe the shape of the distribution as well.
- Median: use range or interquartile range (IQR) to measure spread.
- Mode: use the index of qualitative variation to measure spread.
- We are not formally testing this here with a function.
3.3.1 Spread to Report with the Mean
3.3.1.1 Evaluating Skewness
- Skewness is a measure of the extent to which a distribution is skewed.
- Can evaluate skewness visually with histogram.
- A histogram is a visual representation of a frequency or a relative frequency distribution.
- Bar height represents the respective class frequency (or relative frequency).
- Bar width represents the class width.
3.3.1.2 Skewed Distributions: Median Not Same as Mean
- Sometimes, it is difficult to tell from a histogram whether skewness is present or whether the data is relatively normal or symmetric.
- If Mean is less than Median and Mode, then the variable is Left-Skewed.
- If the Mean is greater than the Median and Mode, then the variable is Right-Skewed.
- If the Mean is about equal to the Median and Mode, then the variable has a symmetric distribution.
- In R, we can easily look at mean and median with the summary() command.
- Mean is great when data are normally distributed (data is not skewed).
- Mean is not a good representation of skewed data where outliers are present.
- Adding together a set of values that includes a few very large or very small values (like those on the far left of a left-skewed distribution or the far right of a right-skewed distribution) results in a very large or very small total in the numerator of the mean equation above, so the mean will be large or small relative to the actual middle of the data.
- There are a number of ways to check for skewness; two functions are below. The skewness() function from the e1071 package and the skew() function from the semTools package both measure skewness, but they differ slightly in how they scale the result.
- By default, e1071::skewness() uses a sample-based formula (dividing by \(n\)). In contrast, semTools::skew() uses Mardia’s definition and is more commonly applied in structural equation modeling contexts.
- As a result, the values may differ slightly, especially for small samples or skewed distributions. Both are valid but should be used consistently depending on the analytical context.
3.3.1.3 Using skewness command from e1071
- A standard calculation of skewness using the mean is based on the third standardized moment of a distribution. The formula is:
\[ \text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 \]
- where \(x_i\) = each observation
- \(\bar{x}\) = sample mean
- \(s\) = sample standard deviation
- \(n\) = sample size
- This formula measures the asymmetry of the distribution:
- Positive skew (right-skewed): longer tail on the right
- Negative skew (left-skewed): longer tail on the left
- Zero skew: symmetric distribution
e1071::skewness(salaries)
[1] 1.415062
## a positive number indicates a longer tail to the right.
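- As a sketch of how the formula maps to code, we can compute it by hand using the sample standard deviation; this should reproduce e1071::skewness() under its default settings.
# Manual skewness: divide the sum of cubed standardized deviations by n
n <- length(salaries)
sum(((salaries - mean(salaries))/sd(salaries))^3)/n  # approximately 1.415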
3.3.1.4 Using skew() Command from semTools
- We will use the skew() command from the semTools package.
- The install.packages() command is commented out below; uncomment and run it once on your machine to install the package, then comment it back out.
# Install the semTools package if necessary.
# install.packages('semTools')
# Activate the library
library(semTools)
- After the package is installed and loaded, run the skew() command on the salaries vector made above.
skew(salaries)
skew (g1) se z p
2.311 0.926 2.496 0.013
3.3.1.5 Interpreting the skew() Command Results
se = standard error
z = skew/se
If the sample size is small (n < 50), z values outside the –2 to 2 range are a problem.
If the sample size is between 50 and 300, z values outside the –3.29 to 3.29 range are a problem.
For large samples (n > 300), using a visual is recommended over the statistics, but generally z values outside the range of –7 to 7 can be considered problematic.
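A hypothetical helper that applies these cutoffs is sketched below; the function name and its design are our own, not part of semTools.
# Flag a skew or kurtosis z value as problematic given the sample size
z_problem <- function(z, n) {
    cutoff <- if (n < 50) 2 else if (n <= 300) 3.29 else 7
    abs(z) > cutoff
}
z_problem(2.496, n = 7)  # TRUE for the salaries example discussed next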
Salary: Our sample size was small (n < 50), so the z value of 2.496 for the salaries vector indicates there is a problem with skewness.
GrowthFund: We can check the skew of GrowthFund.
skew(GrowthFund)
skew (g1) se z p
-1.381 0.775 -1.783 0.075
- GrowthFund was also considered a small sample size, so the same -2/2 thresholds are used. Here, our z value is -1.783, which is in the normal range. This indicates there is no problem with skewness.
3.3.2 Histograms
- A histogram is a graphical representation of the distribution of numerical data.
- It consists of a series of contiguous rectangles, or bars, where the area of each bar corresponds to the frequency of observations within a particular range or bin of values.
- The x-axis typically represents the range of values being measured, while the y-axis represents the frequency or count of observations falling within each range.
- Histograms are commonly used in statistics and data analysis to visualize the distribution of a dataset and identify patterns or trends.
- They are particularly useful for understanding the central tendency, variability, and shape of the data distribution - this includes our observation of skewness.
- Histograms work much better with larger datasets.
3.3.2.1 Commands to Make a Histogram
hist() command in base R.
geom_histogram() command in ggplot2 package.
A histogram using the GrowthFund vector does not look that great because its sample size is so small.
hist(GrowthFund)
3.3.2.2 hist vs geom_histogram
- In R, hist() and geom_histogram() are both used to create histograms, but they belong to different packages and have slightly different functionalities.
# Making an appropriate data.frame to use the hist() command
HousePrice <- c(430, 520, 460, 475, 670, 521, 670, 417, 533, 525, 538,
    370, 530, 525, 430, 330, 575, 555, 521, 350, 399, 560, 440, 425, 669,
    660, 702, 540, 460, 588, 445, 412, 735, 537, 630, 430)
HousePrice <- data.frame(HousePrice)
- hist(): This function is from the base R graphics package and is used to create histograms. It provides a simple way to visualize the distribution of a single variable.
# Using base R to create the histogram.
hist(HousePrice$HousePrice, breaks = 5, main = "A Histogram", xlab = "House Prices (in $1,000s)",
col = "yellow")
library(tidyverse)
- geom_histogram(): This function is from the ggplot2 package, which is part of the tidyverse. It is used to create histograms as part of a more flexible and powerful plotting system.
# Using geom_histogram() command to create the histogram.
ggplot(HousePrice, aes(x = HousePrice)) + geom_histogram(binwidth = 100,
boundary = 300, color = "black", fill = "yellow") + labs(title = "A Histogram",
x = "House Prices (in $1,000s)", y = "Frequency")
We could add more parameters here to make the 2 histograms look identical, but this configuration of parameters is very close. Take note that there are a lot more parameters you can add to the geom_histogram() command than you can with base R to make it look more professional. Be sure to look them up and also check with the notes in the book, which focuses on geom_histogram instead of hist().
Variance is a measure of spread for numeric variables: essentially the average of the squared differences between each observation and the mean of that variable. \[\text{Population: } Var(X) = \sigma^2 = \frac{\sum (x_i-\mu)^2}{N}\] \[\text{Sample: } Var(x) = s^2 = \frac{\sum (x_i-\bar{x})^2}{n-1}\]
Standard deviation is the square root of the variance.
- Use var() command and sd() command to calculate sample variance and sample standard deviation.
## Calculated from Small Sample
x <- c(1, 2, 3, 4, 5)
sum((x - mean(x))^2/(5 - 1))
[1] 2.5
var(x)
[1] 2.5
sqrt(var(x))
[1] 1.581139
sd(x)
[1] 1.581139
sd(HousePrice$HousePrice) #102.6059
[1] 102.6059
var(HousePrice$HousePrice) #10527.97
[1] 10527.97
skew(HousePrice$HousePrice) #normal
skew (g1) se z p
0.317 0.408 0.777 0.437
Looking at Spread for a Larger Dataset
<- read.csv("data/customers.csv")
customers summary(customers$Spending, na.rm = TRUE) #mean and median
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 383.8 662.0 659.6 962.2 1250.0
mean(customers$Spending, na.rm = TRUE) #mean by itself
[1] 659.555
median(customers$Spending, na.rm = TRUE) #median by itself
[1] 662
Spread to Report with the Mean
sd(customers$Spending, na.rm = TRUE)
[1] 350.2876
var(customers$Spending, na.rm = TRUE)
[1] 122701.4
3.3.2.3 Kurtosis in Evaluating Mean Spread
Kurtosis is the sharpness of the peak of a frequency-distribution curve or more formally a measure of how many observations are in the tails of a distribution.
The formula for kurtosis is as follows: Kurtosis = \(\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \left( \frac{(X_i - \bar{X})^4}{s^4} \right) - \frac{3(n-1)^2}{(n-2)(n-3)}\)
Where:
- \(n\) is the sample size
- \(X_i\) is each individual value
- \(\bar{X}\) is the mean of the data
- \(s\) is the standard deviation
- A normal distribution will have a kurtosis value of three, where distributions with kurtosis around 3 are described as mesokurtic, significantly higher than 3 indicate leptokurtic, and significantly under 3 indicate platykurtic.
- The kurtosis() command from the semTools package subtracts 3 from the kurtosis, so we can evaluate values by comparing them to 0. Positive values indicate a leptokurtic distribution and negative values indicate a platykurtic one. To see whether kurtosis (leptokurtic or platykurtic) is significant, we first evaluate the z-score to see if the variable is normal or not. The same cutoff values used for skew also apply to the kurtosis z for small, medium, and large sample sizes.
- The rules of determining problematic distributions with regards to kurtosis are below.
- If the sample size is small (n < 50), z values outside the –2 to 2 range are a problem.
- If the sample size is between 50 and 300, z values outside the –3.29 to 3.29 range are a problem.
- For large samples (n > 300), using a visual is recommended over the statistics, but generally z values outside the range of –7 to 7 can be considered problematic.
- If kurtosis is found, then evaluate the excess kur score to see if it is positive or negative to determine whether it is leptokurtic or platykurtic.
# z-value is 3.0398, which is > 2 indicating leptokurtic
# Small sample size: range is -2 to 2
kurtosis(salaries)
Excess Kur (g2) se z p
5.629 1.852 3.040 0.002
# z-value is 2.20528007, which is > 2 indicating leptokurtic
# Small sample size: range is -2 to 2
kurtosis(GrowthFund)
Excess Kur (g2) se z p
3.416 1.549 2.205 0.027
# Small sample size: range is -2 to 2
# Skewness and kurtosis are both in range.
skew(HousePrice$HousePrice) #normal
skew (g1) se z p
0.317 0.408 0.777 0.437
kurtosis(HousePrice$HousePrice) #normal
Excess Kur (g2) se z p
-0.540 0.816 -0.661 0.508
- Let’s do a few more examples using the customers dataset.
# Noted sample size at 200 observations or a medium sample size.
# Using threshold –3.29 to 3.29 to assess normality.
#-3.4245446445 is below -3.29 so kurtosis is present
# Negative kurtosis value indicates platykurtic
kurtosis(customers$Spending)
Excess Kur (g2) se z p
-1.186 0.346 -3.425 0.001
ggplot(customers, aes(Spending)) + geom_histogram(binwidth = 100, fill = "pink",
    color = "black")
semTools::skew(customers$Spending)  ## normal, indicating no skewness
skew (g1) se z p
-0.018 0.173 -0.106 0.916
# Normal: 2.977622119 is in between -3.29 and 3.29
kurtosis(customers$Income)
Excess Kur (g2) se z p
1.031 0.346 2.978 0.003
ggplot(customers, aes(Income)) + geom_histogram(binwidth = 10000, fill = "pink",
color = "black")
semTools::skew(customers$Income)  # Skewed right
skew (g1) se z p
0.874 0.173 5.047 0.000
#-3.7251961028 is below -3.29 so kurtosis is present
# Negative kurtosis value indicates platykurtic
kurtosis(customers$HHSize)
Excess Kur (g2) se z p
-1.290 0.346 -3.725 0.000
ggplot(customers, aes(HHSize)) + geom_histogram(binwidth = 1, fill = "pink",
color = "black")
semTools::skew(customers$HHSize)  # normal
skew (g1) se z p
-0.089 0.173 -0.513 0.608
# Normal: -0.20056607 is in between -3.29 and 3.29
kurtosis(customers$Orders)
Excess Kur (g2) se z p
-0.069 0.346 -0.201 0.841
ggplot(customers, aes(Orders)) + geom_histogram(binwidth = 5, fill = "pink",
    color = "black")
semTools::skew(customers$Orders)  ## skewed right
skew (g1) se z p
0.789 0.173 4.553 0.000
3.3.3 Spread to Report with the Median
Range = Maximum Value – Minimum Value.
- Simplest measure.
- Focuses on Extreme values.
- Use the commands diff(range()) or max() - min().
IQR: Difference between the first and third quartiles.
- Use IQR() command or quantile() command.
summary(customers$Spending, na.rm = TRUE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 383.8 662.0 659.6 962.2 1250.0
diff(range(customers$Spending, na.rm = TRUE))
[1] 1200
max(customers$Spending, na.rm = TRUE) - min(customers$Spending, na.rm = TRUE)
[1] 1200
IQR(customers$Spending, na.rm = TRUE)
[1] 578.5
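- The quantile() command mentioned above returns the quartiles directly; their difference reproduces the IQR (a quick sketch):
quantile(customers$Spending, probs = c(0.25, 0.75), na.rm = TRUE)
# subtracting the 25th percentile from the 75th gives the IQR of 578.5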
3.3.4 Spread to Report with the Mode
- While there is no great function to test for spread around the mode, you can look at the data and see whether it is concentrated in one or two categories. If it is, then the spread is distorted toward those high-frequency values; one option for quantifying this is sketched below.
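- The index of qualitative variation (IQV) equals 0 when all observations fall in one category and approaches 1 when they are spread evenly across categories. A minimal sketch is below; the helper name iqv is our own.
# IQV = K(1 - sum(p^2))/(K - 1), with K categories and proportions p
iqv <- function(x) {
    p <- prop.table(table(x))
    K <- length(p)
    K * (1 - sum(p^2))/(K - 1)
}
iqv(Sample_Vector)  # approximately 0.93 for the A/B/C vector made earlier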
4 Evaluating Data Types
- The data type of a variable specifies the type of data stored inside that variable. R is called a dynamically typed language, meaning that a variable itself is not declared as any data type; rather, it receives the data type of the R object that is assigned to it. We can change a variable's data type through coercion, but it will inherit one based on the object assigned to it. We will learn some common ways to do this in the data preparation section.
- Evaluating data types in R is crucial for understanding and managing datasets effectively, as different data types require specific handling and operations. R provides several functions to identify and evaluate data types, such as class() to check an object’s class, typeof() to determine its internal storage mode, and is.numeric(), is.character(), or is.factor() to test for specific types. These functions help ensure that data is in the correct format for analysis or modeling tasks. For instance, numeric data types are essential for calculations, while factors or characters are used for categorical data.
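- A quick sketch of these inspection functions on a simple value:
x <- 3.14
class(x)  # "numeric"
typeof(x)  # "double"
is.numeric(x)  # TRUE
is.character(x)  # FALSE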
- There are a number of data types in R that are common to programming and statistical analysis.
- Factor (Nominal or Ordinal)
- Numeric (Real or Discrete (Integer))
- Character
- Logical
- Date
5 Coercing Data Types
- In R, coercion is the process of converting one data type to another, often done automatically by R when necessary. For example, if you combine numeric and character data in a vector, R will coerce the numeric values to characters, as seen in c(1, "A"), which results in c("1", "A"). Coercion can also be done manually using functions like as.numeric() or as.factor() to change data types explicitly.
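- A minimal demonstration of this implicit coercion:
v <- c(1, "A")  # the numeric 1 is silently coerced to character
v  # "1" "A"
class(v)  # "character"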
- To determine whether a variable is numeric or categorical (factor), you can use the class() function. For example, if you have a variable age, running class(age) will tell you if it is stored as “numeric” or “factor”. Numeric variables represent quantities, such as continuous or discrete numbers, while categorical variables represent groups or categories, typically stored as factors. For example, a variable with levels like “Male” and “Female” would be a categorical (factor) variable.
- Sometimes when you read in a dataset all the variables are already in the correct type. Other times, you need to force it into the type you need to conduct the analysis through a process called coercion.
5.0.1 Factor data type:
- Factor data types can be ordinal or nominal. Ordinal and nominal variables are both types of categorical variables but differ in their characteristics. Ordinal variables have a meaningful order or ranking among categories (e.g., “low,” “medium,” “high”), but the intervals between ranks are not necessarily equal. Nominal variables, on the other hand, represent categories without any inherent order (e.g., “red,” “blue,” “green”). Understanding this distinction is essential for selecting appropriate statistical methods, as ordinal variables often allow for rank-based analyses, while nominal variables do not.
- Ordinal: Contain categories that have some logical order (e.g. categories of age).
- Nominal: Have categories that have no logical order (e.g., religious affiliation and marital status).
- R will treat each unique value of a factor as a different level.
5.0.1.1 Ordinal Variable
- Ordinal data may be categorized and ranked with respect to some characteristic or trait.
- For example, instructors are often evaluated on an ordinal scale (excellent, good, fair, poor).
- A scale allows us to code the data based on order, assuming equal distance between scale items (aka Likert items).
- You can make an ordinal factor data type in R or you can convert the order to meaningful numbers.
- To recode numbers in R, we would code poor to excellent = 1, 2, 3, 4 respectively.
- The recode() function from the dplyr package within the tidyverse ecosystem is used to replace specific values in a vector or variable with new values.
- It allows you to map old values to new ones in a simple and readable manner.
library(tidyverse)
# Take a vector representing evaluation scores, named evaluate
<- c("excellent", "good", "fair", "poor", "excellent", "good")
evaluate
<- data.frame(evaluate)
data <- data %>%
data mutate(evaluate = recode(evaluate,
"excellent" = 4,
"good" = 3,
"fair" = 2,
"poor" = 1))
data
evaluate
1 4
2 3
3 2
4 1
5 4
6 3
5.0.1.2 Nominal Variable
- With nominal variables, data are simply categories for grouping.
- For example, coding race/ethnicity might have a category value of White, Black, Native American, Asian/Pacific Islander, Other.
- Qualitative values may be converted to quantitative values for analysis purposes.
- White = 1, Black = 2, etc. This conversion to a numerical representation of the category would be needed to run some analyses.
- Sometimes, R does this on our behalf depending on commands used.
- We can force a variable into a factor data type using the as.factor() command.
- If we use the read.csv() command, we can sometimes do this by setting the argument stringsAsFactors = TRUE. We will do this later in the lesson.
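- A small sketch of coercing a nominal character vector into a factor (the example data is our own):
race <- c("White", "Black", "Asian", "White")
race <- as.factor(race)
levels(race)  # "Asian" "Black" "White" (levels sort alphabetically)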
5.0.2 Numerical data types:
The as.numeric() function in R is used to convert data into numeric format, allowing for mathematical and statistical operations. It takes an input vector, such as characters or factors, and attempts to coerce it into numeric values. For instance, as.numeric(c("1", "2", "3")) returns a numeric vector of 1, 2, and 3. If applied to factors, it returns the underlying integer codes, not the original levels, so caution is needed to avoid misinterpretation. When conversion is not possible (e.g., as.numeric("abc")), it results in NA with a warning.
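A short sketch of the factor caution described above:
f <- factor(c("10", "20", "30"))
as.numeric(f)  # 1 2 3: the underlying level codes, not the values
as.numeric(as.character(f))  # 10 20 30: convert via character first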
A numerical data type is a vector of numbers that can be Real or Integer, where continuous (Real) variables can take any value along some continuum, and integers take on whole numbers.
Two ways to create:
- We can create a numeric variable by ensuring our value we assign is a number!
- We can force a variable into a real number data type by using the as.numeric() command.
# Assign Rhode Island limit for medical marijuana in ounces per person
kOuncesRhode <- 2.5
# Identify the data type
class(x = kOuncesRhode)
[1] "numeric"
Discrete (Integer) Variables: In R, integer discrete data types represent whole numbers, making them ideal for scenarios where only non-fractional values are meaningful, such as counts, rankings, or categorical levels encoded as numbers. Integer values are stored more efficiently than numeric (floating-point) values, providing performance benefits in memory usage and computations. Integers can be explicitly created using the L suffix (e.g., x <- 5L) or by coercing other data types with as.integer(). Operations on integers, such as arithmetic, comparisons, or indexing, behave consistently with their discrete nature. Integer data types are particularly useful for tasks like looping, array indexing, or when working with data where precision beyond whole numbers is unnecessary or irrelevant.
For an example of a discrete variable, we could collect information on the number of children in a family or number of points scored in a basketball game.
# Assign the value of 4 to a constant called kTestInteger and set as
# an integer
kTestInteger <- as.integer(4)
class(kTestInteger)  #Confirm the data type is an integer
[1] "integer"
# Use as.integer() to truncate the variable kOuncesRhode
Trunc <- as.integer(kOuncesRhode)
Trunc
[1] 2
5.0.3 Character data type
- The character data type in R is used to store text or string data, making it essential for handling names, labels, descriptions, or any non-numeric information. Character data is represented as sequences of characters enclosed in quotes (e.g., "hello" or 'world').
- Character data types can be wrapped in either single or double quotation marks (e.g., "hello" or 'hello').
- Character data types can include letters, words, or numbers that cannot logically be included in calculations (e.g., a zip code).
- A quick example is below that shows how to assign a character value to a variable.
# Make string constants
<- "A"
Q1 <- "B"
Q2 # Check the data type
class(x = Q1)
[1] "character"
5.0.4 Logical data type
Logical data types in R represent values that are either TRUE or FALSE, often used for conditional operations and data filtering. Logical values can be the result of comparisons (e.g., x > 10) or explicitly assigned. Logical vectors are particularly useful in subsetting data; for instance, data[data$column > 5, ] selects rows where the column values are greater than 5. Logical operators like & (and), | (or), and ! (not) allow for more complex conditions. Logical data is also essential in control structures such as if, else, and loops. By enabling dynamic and conditional programming, logical types play a key role in efficient data manipulation and analysis.
- A quick example is below that shows how to assign a logical value to a variable.
# Store the result of 6 > 8 in a constant called kSixEight
kSixEight <- 6 > 8
# Can use comparison tests with the following: == >= <= > < !=
# Print kSixEight
kSixEight
[1] FALSE
# Determine the data type of kSixEight
class(x = kSixEight)
[1] "logical"
5.0.5 Nominal Example with Dataset
library(tidyverse)
gss.2016 <- read_csv(file = "data/gss2016.csv")
# Examine the variable types with summary and class functions.
summary(gss.2016)
grass age
Length:2867 Length:2867
Class :character Class :character
Mode :character Mode :character
class(gss.2016$grass) #Check the data type.
[1] "character"
gss.2016$grass <- as.factor(gss.2016$grass)  #Turn to a factor.
class(gss.2016$grass)  #Confirming it is now correct.
[1] "factor"
5.0.6 Numerical Example with Dataset
- We need to ensure data can be coded as numeric before using the as.numeric() command. For example, the variable age contains numerical values except for one value of "89 OR OLDER". If the as.numeric() command were used on this variable directly, it would turn all of the 89-and-older observations into NAs. To force it to be a numerical variable while keeping the fact that these participants were the oldest, we need to recode that value and then use the as.numeric() command to coerce the variable into a number.
- Recoding the 89 and older to 89 does cause the data to lack integrity in its current form because it will treat the people over 89 years old as 89. But, we are limited here because this needs to be a numerical variable for us to proceed. We will learn a step later on in this section to transform the age variable into categories so that we bring back our data integrity.
class(gss.2016$age)
[1] "character"
# Recode '89 OR OLDER' into just '89'
gss.2016$age <- recode(gss.2016$age, `89 OR OLDER` = "89")
# Convert to numeric data type
gss.2016$age <- as.numeric(gss.2016$age)
summary(gss.2016)  #Conduct final check confirming correct data types
grass age
DK : 110 Min. :18.00
IAP : 911 1st Qu.:34.00
LEGAL :1126 Median :49.00
NOT LEGAL: 717 Mean :49.16
NA's : 3 3rd Qu.:62.00
Max. :89.00
NA's :10
6 Transformations of Variables
If data is not normally distributed, we need to conduct a transformation. When we transform a variable, we hope to change the shape to normal so that we can continue to calculate under the rules of the normal distribution. For variables that are right skewed, a few transformations that could work to make the variable more normally distributed are: square root, cube root, reciprocal, and log.
First, read in the opioid dataset so we can see a variable that is considered not normal.
# Distance to substance abuse facility with medication-assisted
# treatment
<- read.csv("data/opioidFacility.csv")
dist.mat # Review the data
summary(dist.mat)
STATEFP COUNTYFP YEAR INDICATOR
Min. : 1.00 Min. : 1.0 Min. :2017 Length:3214
1st Qu.:19.00 1st Qu.: 35.0 1st Qu.:2017 Class :character
Median :30.00 Median : 79.0 Median :2017 Mode :character
Mean :31.25 Mean :101.9 Mean :2017
3rd Qu.:46.00 3rd Qu.:133.0 3rd Qu.:2017
Max. :72.00 Max. :840.0 Max. :2017
VALUE STATE STATEABBREVIATION COUNTY
Min. : 0.00 Length:3214 Length:3214 Length:3214
1st Qu.: 9.25 Class :character Class :character Class :character
Median : 18.17 Mode :character Mode :character Mode :character
Mean : 24.04
3rd Qu.: 31.00
Max. :414.86
# Graph the distance variable which is called Value but represents
# miles. Note that this graph does not look normal - instead, it
# looks right or positive skewed.
dist.mat %>%
    ggplot(aes(VALUE)) + geom_histogram(fill = "#7463AC", color = "white") +
theme_minimal() + labs(x = "Miles to nearest substance abuse facility",
y = "Number of counties")
Next, apply the 4 recommended transformations to the variable to see which one works best. We cannot see the result until we graph each transformation.
- This requires 4 separate calculations using mutate() commands.
dist.mat.cleaned <- dist.mat %>%
    mutate(miles.cube.root = VALUE^(1/3)) %>%
    mutate(miles.log = log(x = VALUE)) %>%
    mutate(miles.inverse = 1/VALUE) %>%
    mutate(miles.sqrt = sqrt(x = VALUE))
Now, graph the variable with the 4 recommended transformations to see which is most normal (bell shaped).
cuberoot <- dist.mat.cleaned %>%
    ggplot(aes(x = miles.cube.root)) + geom_histogram(fill = "#7463AC",
    color = "white") + theme_minimal() + labs(x = "Cube root of miles to nearest facility",
    y = "Number of counties")
logged <- dist.mat.cleaned %>%
    ggplot(aes(x = miles.log)) + geom_histogram(fill = "#7463AC", color = "white") +
    theme_minimal() + labs(x = "Log of miles to nearest facility", y = "")
inversed <- dist.mat.cleaned %>%
    ggplot(aes(x = miles.inverse)) + geom_histogram(fill = "#7463AC", color = "white") +
    theme_minimal() + xlim(0, 1) + labs(x = "Inverse of miles to nearest facility",
    y = "Number of counties")
squareroot <- dist.mat.cleaned %>%
    ggplot(aes(x = miles.sqrt)) + geom_histogram(fill = "#7463AC", color = "white") +
    theme_minimal() + labs(x = "Square root of miles to nearest facility",
    y = "")
- We can show all 4 graphs at one time to directly compare. Ensure your plot window is large enough to see this.
gridExtra::grid.arrange(cuberoot, logged, inversed, squareroot)
Finally, determine if any of the transformations helped. In this example, we determined that the cube root was the most normal transformation; its graph contains a nice bell-shaped curve.
Let’s use that new variable in the analysis. Start by summarizing the descriptive statistics, including retrieving the mean and standard deviation for cube root of miles, which are values that are required in the probability calculations.
dist.mat.cleaned %>%
    drop_na(miles.cube.root) %>%
summarize(mean.tran.dist = mean(x = miles.cube.root), sd.tran.dist = sd(x = miles.cube.root))
mean.tran.dist sd.tran.dist
1 2.662915 0.7923114
- 2.66 and .79 are the values we pulled for mean and standard deviation. We can use that information to calculate probabilities based on the functions we mentioned above.
- So, what is the probability that the cube root of X is less than 3, that is, that a county is less than 27 miles from the facility (27 = 3^3)?
- We estimate that about 66% of counties fall in the shaded area, having to travel less than 27 miles to the nearest facility.
- This means that (1 - 0.6665403)*100 is the percentage of counties having to travel more than 27 miles to the nearest facility.
27^(1/3)
[1] 3
3^3
[1] 27
# P(X< cuberoot(27) = P(X < 3)
pnorm(3, 2.66, 0.79) ##about 66% likely
[1] 0.6665403
# P(X > 3) #about 33% likely
pnorm(3, 2.66, 0.79, lower.tail = FALSE)
[1] 0.3334597
1 - pnorm(3, 2.66, 0.79)
[1] 0.3334597
- We estimate that about 20% of counties fall in the shaded area, having to travel < 8 miles to nearest facility (8 = 2^3).
pnorm(2, 2.66, 0.79)
[1] 0.2017342
- We can use the equation to calculate the z-score for a county where you have to drive 15 miles to a facility.
## z = (x - m)/sd; since we are on the cube root scale, we raise x to the 1/3 power
(15^(1/3) - 2.66)/0.79
[1] -0.2453012
The transformed distance of a facility 15 miles away is .24 standard deviations LOWER than the mean transformed distance.
Next, we can calculate z for a county with residents who have to travel 50 miles to the nearest facility. In the transformed miles variable, this would be the cube root of 50, or a value of 3.68.
(50^(1/3) - 2.66)/0.79  #[1] 1.296242
[1] 1.296242
- This indicates that the transformed distance to a facility with MAT for this example county is 1.29 standard deviations above the mean transformed distance from a county to a facility with MAT.
6.0.1 Transformation Second Example
- Taking a second example, let us look at the PHYSHLTH variable from the gender dataset (brfss.csv). We worked with this dataset in an earlier lesson. In doing so, we cleaned the data.
- I copied over that data preparation code for the variable of interest (PHYSHLTH) and tidied it up for this example. To remind ourselves, the question being asked was the following: "Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?"
- If ever you are using the MASS package and dplyr, the select function may have a conflict where R does not know which to use. If you get an error when using select, add dplyr:: in front of the statement to ensure you are using select from dplyr to select variables.
gender <- read.csv("data/brfss.csv")
# Review the data
summary(gender)
TRNSGNDR X_AGEG5YR X_RACE X_INCOMG
Min. :1.000 Min. : 1.000 Min. :1.000 Min. :1.000
1st Qu.:4.000 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:3.000
Median :4.000 Median : 8.000 Median :1.000 Median :5.000
Mean :4.059 Mean : 7.822 Mean :1.992 Mean :4.481
3rd Qu.:4.000 3rd Qu.:10.000 3rd Qu.:1.000 3rd Qu.:5.000
Max. :9.000 Max. :14.000 Max. :9.000 Max. :9.000
NA's :310602 NA's :94
X_EDUCAG HLTHPLN1 HADMAM X_AGE80
Min. :1.000 Min. :1.000 Min. :1.000 Min. :18.00
1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:44.00
Median :3.000 Median :1.000 Median :1.000 Median :58.00
Mean :2.966 Mean :1.108 Mean :1.215 Mean :55.49
3rd Qu.:4.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:69.00
Max. :9.000 Max. :9.000 Max. :9.000 Max. :80.00
NA's :208322
PHYSHLTH
Min. : 1.0
1st Qu.:20.0
Median :88.0
Mean :61.2
3rd Qu.:88.0
Max. :99.0
NA's :4
# PHYSHLTH example
gender.clean <- gender %>%
    dplyr::select(PHYSHLTH) %>%
    drop_na() %>%
    # Turn the 77 values to NA, since 77 meant don't know or not sure
    # in the brfss codebook
    mutate(PHYSHLTH = na_if(PHYSHLTH, y = 77)) %>%
    # Turn the 99 values to NA, since 99 meant refused in the brfss
    # codebook.
    mutate(PHYSHLTH = na_if(PHYSHLTH, y = 99)) %>%
    # Recode the 88 values to 0, since the number 88 meant 0 days of
    # illness in the brfss codebook.
    mutate(PHYSHLTH = recode(PHYSHLTH, `88` = 0L))
table(gender.clean$PHYSHLTH)
0 1 2 3 4 5 6 7 8 9 10
291696 19505 24890 14713 7644 12931 2140 8049 1478 325 9437
11 12 13 14 15 16 17 18 19 20 21
133 908 92 4558 8638 221 153 279 51 5554 1111
22 23 24 25 26 27 28 29 30
132 80 98 2270 149 204 831 390 35701
summary(gender.clean)
PHYSHLTH
Min. : 0.000
1st Qu.: 0.000
Median : 0.000
Mean : 4.224
3rd Qu.: 3.000
Max. :30.000
NA's :10299
- Once here, we graph PHYSHLTH.
gender.clean %>%
    ggplot(aes(PHYSHLTH)) + geom_histogram(fill = "#7463AC", color = "white") +
theme_minimal() + labs(x = "Number of Days Sick", y = "Frequency")
We determined from the descriptive statistics lesson that this variable had severe skewness (positive). Most people had 0 days of illness.
Next, we run the 4 calculations by mutating the variable and saving all 4 transformation under new variable names.
genderTransform <- gender.clean %>%
    mutate(phy.cube.root = PHYSHLTH^(1/3)) %>%
mutate(phy.log = log(x = PHYSHLTH)) %>%
mutate(phy.inverse = 1/PHYSHLTH) %>%
mutate(phy.sqrt = sqrt(x = PHYSHLTH))
- Next, we create the 4 graphs for each of the 4 transformations labelled above to see if one helps.
cuberoot <- genderTransform %>%
    ggplot(aes(x = phy.cube.root)) + geom_histogram(fill = "#7463AC", color = "white",
    binwidth = 0.5) + theme_minimal() + labs(x = "Cube root", y = "")
logged <- genderTransform %>%
    ggplot(aes(x = phy.log)) + geom_histogram(fill = "#7463AC", color = "white",
    binwidth = 0.5) + theme_minimal() + labs(x = "Log", y = "")
inversed <- genderTransform %>%
    ggplot(aes(x = phy.inverse)) + xlim(0, 1) + geom_histogram(fill = "#7463AC",
    color = "white", binwidth = 0.05) + theme_minimal() + labs(x = "Inverse",
    y = "")
squareroot <- genderTransform %>%
    ggplot(aes(x = phy.sqrt)) + geom_histogram(fill = "#7463AC", color = "white",
    binwidth = 1) + theme_minimal() + labs(x = "Square root", y = "")
- Finally, we plot the graphs using gridExtra so that we can see all 4.
gridExtra::grid.arrange(cuberoot, logged, inversed, squareroot)
- In this example, NOT ONE transformation helped. If this happens, something else would need to occur before correctly using the variable. Examples could be to run a non-linear model or to categorize the data into bins, especially since there was a large frequency of people who were not ill.
6.1 Using AI
Use the following prompts on a generative AI, like ChatGPT, to learn more about descriptive statistics.
What is the difference between mean, median, and mode in describing data distributions, and how can each be used to understand the shape of a distribution?
How do mean and median help identify whether a distribution is skewed, and what does it tell us about the dataset?
Can you explain how the mean, median, and mode behave in normal, positively skewed, and negatively skewed distributions?
What are standard deviation (SD) and variance, and how do they measure the spread of data in a distribution?
Explain the differences between range, interquartile range (IQR), and standard deviation in describing the variability in a dataset.
How does a high standard deviation or variance affect the interpretation of a dataset compared to a low standard deviation?
What is skewness, and how does it affect the shape of a distribution? How can we identify positive and negative skew?
How is kurtosis defined in the semTools package in R, and what does it tell us about the tails of a distribution?
How would you compare and contrast the roles of skewness and kurtosis in identifying the shape and behavior of a distribution?
6.2 Summary
- In this lesson, we explored descriptive statistics, including measures such as skewness and kurtosis. We discussed variables and scales of measurement, and how to summarize both qualitative and quantitative data. We also covered data types, coercion, and techniques for transforming variables that do not follow a normal distribution.