# Load the required packages:
library(knitr) # dynamic report generation
library(readr) # useful for import/reading rectangular data
library(dplyr) # useful for data manipulation tasks
library(tidyr) # tidy data
library(lubridate) # for dates and date-times
library(ggplot2) # for visualisations
library(magrittr) # for pipes and double assignment
library(kableExtra) # dynamic tables
# set chunk options
::opts_chunk$set(
knitrecho = TRUE,
eval = TRUE,
message = FALSE,
warning = FALSE,
comment = "")
R comes with a huge variety of packages which are a collection of functions, data and documentations that make the task of data wrangling more efficient (Wickham & Grolemund 2017). I used the tidyverse packages readr and dplyr for importing, reading and manipulating data. To load these packages, I used base R library() function and loaded the required packages to complete this assessment. I have taken use of knitr opts_chunk$set() function and used comment = ’’ argument which will remove the default ## prefix in the code output document (Xie 2020).
I also loaded the knitr package by Xie (2020) which is useful for creating nice reports and tables with R. I have set the setting df_print: kable within the YAML header in R Markdown which will use the knitr::kable function as the default display of data frames for a nicer presentation.
The data that I will be exploring for this assignment is the White House office staff salaries dataset which I found through Kaggle’s website. The data set was uploaded by Kaggle member Adiquant (2020) and comes in a CSV file format. The dataset contains the complete White House staff salary data from 1997-2020 and is available for download from:
# Read and import the data and save it as a data frame
<- read_csv("wh_staff_dataset.csv")
data # display data frame
head(data) %>%
kable(align = "l") %>%
kable_paper(c("striped", "hover", "responsive"), full_width = TRUE)
year | name | gender | status | salary | pay_basis | position_title |
---|---|---|---|---|---|---|
1997 | Abedin,Huma M. | Female | Employee | 27500 | Per Annum | OFFICE MANAGER |
1997 | Abrams,Lori E. | Female | Employee | 35000 | Per Annum | DEPUTY DIRECTOR, WHITE HOUSE GIFTS |
1997 | Alcorn,Brian A. | Male | Employee | 25000 | Per Annum | ASSISTANT DIRECTOR FOR ADVANCE |
1997 | Allen,Jeannetta Pam | Female | Employee | 22000 | Per Annum | STAFF ASSISTANT |
1997 | Allison,Donna Tate | Female | Employee | 27624 | Per Annum | WHITE HOUSE TELEPHONE OPERATOR |
1997 | Alston,Mildred C. | Female | Employee | 47000 | Per Annum | SPECIAL ASSISTANT PERSONAL CORRESPONDENCE |
To import and read the data into R, I used the read_csv() function from the readr package developed by Wickham and Hester (2020). I then assigned the dataset to an R object called ‘data’. I prefer to use the readr package instead of base R for reading files into R as they are typically about 10x faster (Wickham & Grolemund 2017).
A great feature of read_csv() function is that it doesn’t automatically convert characters to factors. Another useful feature is the printing of column specifications along with column names and their corresponding data types. This feature is a great way to quickly get a sense of variables and whether they are in the correct data type.
To display the first n rows, I used base R head() function. By default it will print the first 6 rows of the data frame and is a great way to get a view of the imported data. To control how many rows get printed, we can use the argument n= and specify a desired number of rows i.e. 10L or 10 (R Core Team 2020).
This data set contains White House office staff data from 1997 to 2020. I am interested in exploring this dataset, apply tidy data principles and manipulate the data in a manner to compare salaries grouped by gender and practice the data wrangling methods learned in this module. The source of the data is as follows:
The variables in this data set are:
# make a copy of data to work with
<- data
wh
# check dimensions
dim(wh)
[1] 10600 7
# check data types
str(wh, give.attr = FALSE)
spec_tbl_df [10,600 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ year : num [1:10600] 1997 1997 1997 1997 1997 ...
$ name : chr [1:10600] "Abedin,Huma M." "Abrams,Lori E." "Alcorn,Brian A." "Allen,Jeannetta Pam" ...
$ gender : chr [1:10600] "Female" "Female" "Male" "Female" ...
$ status : chr [1:10600] "Employee" "Employee" "Employee" "Employee" ...
$ salary : num [1:10600] 27500 35000 25000 22000 27624 ...
$ pay_basis : chr [1:10600] "Per Annum" "Per Annum" "Per Annum" "Per Annum" ...
$ position_title: chr [1:10600] "OFFICE MANAGER" "DEPUTY DIRECTOR, WHITE HOUSE GIFTS" "ASSISTANT DIRECTOR FOR ADVANCE" "STAFF ASSISTANT" ...
# convert the categorical variables from character to factor
$gender <- as.factor(wh$gender)
wh$status <- as.factor(wh$status)
wh$pay_basis <- as.factor(wh$pay_basis)
wh$position_title <- as.factor(wh$position_title)
wh
# check factor levels for gender
levels(wh$gender)
[1] "Female" "Male"
# check factor levels for status
levels(wh$status)
[1] "Detailee" "Employee" "Employee (part-time)"
# check factor levels for pay_bias
levels(wh$pay_basis)
[1] "Per Annum" "Per Diem"
# 'position_title" will have too many levels,
# first check its length rather than having it print all with levels() function
length(levels(wh$position_title))
[1] 2384
# one way to display only a few levels is by using print() with factor(), then use max.levels argument
print(factor(wh$position_title[1:3]), max.levels = 3)
[1] OFFICE MANAGER DEPUTY DIRECTOR, WHITE HOUSE GIFTS
[3] ASSISTANT DIRECTOR FOR ADVANCE
Levels: ASSISTANT DIRECTOR FOR ADVANCE DEPUTY DIRECTOR, WHITE HOUSE GIFTS OFFICE MANAGER
# a better way is by using str() function which will help us see if I made the changes correctly,
# as well as see a summary of levels
str(wh, give.attr = FALSE)
spec_tbl_df [10,600 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ year : num [1:10600] 1997 1997 1997 1997 1997 ...
$ name : chr [1:10600] "Abedin,Huma M." "Abrams,Lori E." "Alcorn,Brian A." "Allen,Jeannetta Pam" ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 2 1 2 1 ...
$ status : Factor w/ 3 levels "Detailee","Employee",..: 2 2 2 2 2 2 1 1 2 2 ...
$ salary : num [1:10600] 27500 35000 25000 22000 27624 ...
$ pay_basis : Factor w/ 2 levels "Per Annum","Per Diem": 1 1 1 1 1 1 1 1 1 1 ...
$ position_title: Factor w/ 2384 levels ". SENIOR ADVISOR TO THE PRESIDENT",..: 1430 898 53 2288 2373 1779 396 396 2373 1534 ...
# rename levels of 'status' for better consistency
<- wh %>%
wh mutate(status = recode(status,
Employee = 'Employee, full-time',
'Employee (part-time)' = 'Employee, part-time'))
# check to see if the above modification worked correctly
levels(wh$status)
[1] "Detailee" "Employee, full-time" "Employee, part-time"
# I am happy with level names and order
# check column names
colnames(wh)
[1] "year" "name" "gender" "status"
[5] "salary" "pay_basis" "position_title"
# rename 'pay_basis' and 'position_title' to be consistent with other column names
<- wh %>%
wh rename('pay period' = pay_basis,
'position' = position_title)
# check changes
colnames(wh)
[1] "year" "name" "gender" "status" "salary"
[6] "pay period" "position"
I first made a copy of my ‘data’ and assigned it to ‘WH’. This ensures I have the original dataset in its non-modified form for future reference if needed. Then I inspected the data frame as follows:
The numeric variables and the character variable ‘name’ are in the correct data type. However, the variables ‘gender’, ‘status’, ‘pay_basis’ and ‘position_title’ need to be converted from character type to factor as they are categorical variables. Factors are the correct way of storing categorical data in R as they make it easier to use categorical variables into statistical models since they get coded as numbers. I used the base R as.factor() function to convert each of these variable to factor type. To inspect factor variables:
I was happy with the order of factor levels for all variables as well as their names except for the names used for levels of ‘status’. I renamed ‘Employee’ and ‘Employee (part-time)’ to ‘Employee, full-time’ and ‘Employee, part-time’ so their names were consistent with one another as well as a more precise description. To modify the names of factor levels, I used the mutate() and recode() functions from the dplyr package (Wickham et al. 2020). These functions allow for the direct modification of factor levels values while preserving the existing order of the levels (Wickham et al. 2020). One limitation with this method of re-coding factor level labels is that it will not generate an error message if you try to re-code levels that do not exist, for example, if you make any spelling errors (McNamara & Horton 2017). Therefore, it is important to check that the changes were made correctly by using levels() function.
To inspect column names:
This data frame meets the tidy data principles set by Wickham and Grolemund (2017) since:
As stated by Wickham and Grolemund (2017), these three rules are inter-related which leads to two practical instructions:
Before moving on to the next task, I wanted to inspect missing values in the data frame. I used base R colSums() and is.na() functions to display the total missing values in each column:
To check their row number within a column, I used which() and is.na() functions.
After inspecting the NA values, I have decided to not remove any rows with NA values. I think it will be a waste to omit information in all other fields of that row just because one value was missing. From the information that was printed by using colSums() and is.na() functions, it can be deduced that there are no rows with only NA values across all its columns. However, if it was needed to specifically check which rows would meet those conditions, either of the codes using which() and rowSums() functions with conditional arguments would work. This also confirmed that there are no rows that satisfy those condition.
Since the amount of missing data from the numeric variable of the data is small relative to the size of the observations, we can use the na.rm = TRUE argument to exclude missing values from mathematical operations.
Lastly, I noticed that some values in ‘salary’ column were 0, therefore, I replaced those values to NA by using a logical condition and subsetting to over-write those values.
# Inspect for NA values, identify which columns contain NAs
colSums(is.na(wh))
year name gender status salary pay period position
0 3 3 0 5 7 2
# check the row number for NAs in column 'name'
which(is.na(wh$name))
[1] 3243 3406 4123
# check the row number for NAs in column 'gender'
which(is.na(wh$gender))
[1] 3243 3406 4123
# check the row number for NAs in column 'salary'
which(is.na(wh$salary))
[1] 10244 10289 10345 10413 10466
# check the row number for NAs in column 'pay period'
which(is.na(wh$`pay period`))
[1] 3243 3406 10244 10289 10345 10413 10466
# check the row number for NAs in column 'position'
which(is.na(wh$`position`))
[1] 3243 3406
# To specifically check if any rows have only NA values,
# either of the following codes can be used
which(rowSums(is.na(wh)) == ncol(wh))
integer(0)
which(!rowSums(!is.na(wh)))
integer(0)
#replace 0 in 'salary' to NA
== 0] <- NA wh[wh
# Summary statistics of 'salary' grouped by 'gender'
<- wh %>%
wh_stat1 group_by(gender) %>%
filter(!is.na(gender), salary>1000) %>%
summarise(
Min = min(salary, na.rm = T),
Median = median(salary, na.rm = T),
Max = max(salary, na.rm = T),
Mean = mean(salary, na.rm = T),
SD = sd(salary, na.rm = T)
)
# display results
head(wh_stat1)
gender | Min | Median | Max | Mean | SD |
---|---|---|---|---|---|
Female | 21000 | 55746 | 183000 | 68678.08 | 36144.52 |
Male | 21500 | 72700 | 239595 | 83201.26 | 42479.72 |
To display summary statistics of one numeric variable, ‘salary’, grouped by one categorical variable, ‘gender’, I used summerise() and group_by() functions from the dplyr package. This easily generates descriptive statistics by tables of groups. The summarise() function creates a new data frame which will have one or more rows for each combination of grouping variables and one column for each of the statistics that have been specified (Wickham et al. 2020). I also used the filter() function from dplyr which subsets observations based on their values and takes the name of the data frame as the first argument with the second and following arguments being the expressions that filter and subset the data frame (Wickham et al. 2020). I used conditional arguments to only select rows in ‘gender’ column that did not have NA values and only keep rows in ‘salary’ that had values greater than 1000 as any salary less than 1000 are most likely errors.
When using the summarise() function from dyplr, the na.rm=TRUE argument must be used with the functions I used inside summarise() to exclude missing values from the mathematical operations. If this is not done, most functions will just return NA.
The result of the descriptive statistics data frame show that women have on average consistently earned less than men as staff members of the White House office.
The result of the statistics above caused me to be interested in exploring what the summary would be for full-time staff and detailees at another time as well as explore these findings further in the future.
# Create a list of numeric value for each response to categorical variables
# First assign level names of each categorical variable
<- levels(wh$gender)
gender_names <- levels(wh$status)
status_names <- levels(wh$`pay period`)
pay_names <- levels(wh$position)
position_names
# Create each list to have a character vector and numeric,
# I chose to give each categorical value unique numbers
<- list(
list_gender gender_numeric = c(1:2),
gender_levels = c(gender_names)
)
<- list(
list_status status_numeric = c(3:5),
status_levels = c(status_names)
)
<- list(
list_pay pay_numeric = c(6:7),
pay_levels = c(pay_names)
)
<- list(
list_position position_numeric = c(8:(7+2384)),
position_levels = c(position_names)
)
# Check structure of lists - gender
str(list_gender)
List of 2
$ gender_numeric: int [1:2] 1 2
$ gender_levels : chr [1:2] "Female" "Male"
# Check structure of lists - status
str(list_status)
List of 2
$ status_numeric: int [1:3] 3 4 5
$ status_levels : chr [1:3] "Detailee" "Employee, full-time" "Employee, part-time"
# Check structure of lists - pay
str(list_pay)
List of 2
$ pay_numeric: int [1:2] 6 7
$ pay_levels : chr [1:2] "Per Annum" "Per Diem"
# Check structure of lists - position
str(list_position)
List of 2
$ position_numeric: int [1:2384] 8 9 10 11 12 13 14 15 16 17 ...
$ position_levels : chr [1:2384] ". SENIOR ADVISOR TO THE PRESIDENT" ". WATCH OFFICER" "ACCOUNT MANAGER" "ACCOUNTING MANAGER" ...
In this task I had to create a list containing a numeric value for each response to the categorical variables. To achieve this, I first assigned the levels of each categorical variable to represent a character vector. To create the lists, I used base R list() function. This function allows for names to be assigned to elements; therefore, creating an association between names and its value (R Cookbook 2nd Edition). The list of each variable contained two elements. One element being a character vector of each response and the other an integer vector with unique numbers to represent each response to each variable. str() function was then used to examine the structure of the newly created lists.
# Join the list to wh df, numeric values must match categorical variables
# Convert each list to a data frame
<- as.data.frame(list_gender)
df_gender <- as.data.frame(list_status)
df_status <- as.data.frame(list_pay)
df_pay <- as.data.frame(list_position)
df_position
# Join each data frame to WH
<- wh %>% left_join(
wh_joined by = c('gender' = 'gender_levels')
df_gender,
)
<- wh_joined %>% left_join(
wh_joined by = c('status' = 'status_levels')
df_status,
)
<- wh_joined %>% left_join(
wh_joined by = c('pay period' = 'pay_levels')
df_pay,
)
<- wh_joined %>% left_join(
wh_joined by = c('position' = 'position_levels')
df_position,
)# check changes
str(wh_joined, give.attr = FALSE)
spec_tbl_df [10,600 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ year : num [1:10600] 1997 1997 1997 1997 1997 ...
$ name : chr [1:10600] "Abedin,Huma M." "Abrams,Lori E." "Alcorn,Brian A." "Allen,Jeannetta Pam" ...
$ gender : chr [1:10600] "Female" "Female" "Male" "Female" ...
$ status : chr [1:10600] "Employee, full-time" "Employee, full-time" "Employee, full-time" "Employee, full-time" ...
$ salary : num [1:10600] 27500 35000 25000 22000 27624 ...
$ pay period : chr [1:10600] "Per Annum" "Per Annum" "Per Annum" "Per Annum" ...
$ position : chr [1:10600] "OFFICE MANAGER" "DEPUTY DIRECTOR, WHITE HOUSE GIFTS" "ASSISTANT DIRECTOR FOR ADVANCE" "STAFF ASSISTANT" ...
$ gender_numeric : int [1:10600] 1 1 2 1 1 1 2 1 2 1 ...
$ status_numeric : int [1:10600] 4 4 4 4 4 4 3 3 4 4 ...
$ pay_numeric : int [1:10600] 6 6 6 6 6 6 6 6 6 6 ...
$ position_numeric: int [1:10600] 1437 905 60 2295 2380 1786 403 403 2380 1541 ...
To be able to join the lists on to the wh data frame, each list needs to be first converted to a data frame. Base R as.data.frame() function was used as this will convert each element of a list to a column in the data frame (R Core Team 2020). I then joined the data frames one by one to my original wh data frame by using left_join() function from the dplyr package. The key that will combine the data frames were the names given to the character vectors in the lists, which became the variable name for the converted data frames, as well as their matching variable names in the wh data frame. For example, gender column from wh and gender_levels column from df_gender will be the key that match and join the two data frames together. By checking the structure of wh_joined, we can see that the join was completed correctly and the variable values matched to the unique numeric values that were assigned to them.
# Subset 'wh' data frame using first 10 observations
<- wh %>% slice(1:10)
wh_sub10
# Check dimensions
dim(wh_sub10)
[1] 10 7
# Convert to matrix
<- as.matrix(wh_sub10)
wh_matrix # inspect matrix structure
str(wh_matrix)
chr [1:10, 1:7] "1997" "1997" "1997" "1997" "1997" "1997" "1997" "1997" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "year" "name" "gender" "status" ...
To subset the first 10 observations of the data frame, I used the slice() function from the dplyr package and assigned it to a new object. The slice() function allows for subsetting of rows by using their integer locations (Wickham et al. 2020). Then to convert the data frame to a matrix, I used base R as.matrix() function. When checking the structure of the matrix with str() function, it shows that all the data of the matrix has been converted to character data type. This has happened because a matrix is a homogeneous two dimensional structure and therefore, every element must be of the same type. As a result of combining different data types, they have been coerced to the most flexible type by default. Data types from least to most flexible are: logical<integer<double<character (Wickham 2015).
# Subset 'wh' data frame to include only the first and last variables.
# Assign it to new object
<- wh %>%
wh2 select(year, position)
# display new df
head(wh2)
year | position |
---|---|
1997 | OFFICE MANAGER |
1997 | DEPUTY DIRECTOR, WHITE HOUSE GIFTS |
1997 | ASSISTANT DIRECTOR FOR ADVANCE |
1997 | STAFF ASSISTANT |
1997 | WHITE HOUSE TELEPHONE OPERATOR |
1997 | SPECIAL ASSISTANT PERSONAL CORRESPONDENCE |
# Save object as an .RData into current work directory
save(wh2, file = "wh2.RData")
To subset the data frame so it only includes the first and last variable, select() function from ‘dplyr’ package was used. This function subsets columns using their names and types (Wickham et al. 2020). I assigned the new data frame to an object and named it ‘wh2’ which was then saved as an RData file by using the base R save() function. For the save() function to save the object as an RData file, the names of the objects to be saved is used as an argument as well as setting file = ” .RData” (R Core Team 2020).
# New data frame with 1 ordinal, 1 integer variable and 4 observations
<- data.frame(
df "Education" = c("High School Diploma", "Bachelor's",
"Master's", "Doctoral"),
"Total" = c(450L, 1000L, 200L, 50L),
stringsAsFactors = TRUE
)
# Order factor levels from low to high education
$Education <- df$Education %>% factor(
dflevels = c("High School Diploma", "Bachelor's", "Master's", "Doctoral"),
ordered = TRUE
)
# check structure
str(df)
'data.frame': 4 obs. of 2 variables:
$ Education: Ord.factor w/ 4 levels "High School Diploma"<..: 1 2 3 4
$ Total : int 450 1000 200 50
# check levels
levels(df$Education)
[1] "High School Diploma" "Bachelor's" "Master's"
[4] "Doctoral"
# Create additional numeric vector to represent proportions based on 'Total'
<- c(450/1700*100, 1000/1700*100,
Proportion 200/1700*100, 50/1700*100)
#Round the values of proportion to 2 digits
<- round(Proportion, 2)
Proportion
# Combine new variable to data frame using cbind()
<- cbind(df, Proportion)
df1 # check dimensions
dim(df1)
[1] 4 3
# check attributes
attributes(df1)
$names
[1] "Education" "Total" "Proportion"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4
# check structure
str(df1)
'data.frame': 4 obs. of 3 variables:
$ Education : Ord.factor w/ 4 levels "High School Diploma"<..: 1 2 3 4
$ Total : int 450 1000 200 50
$ Proportion: num 26.47 58.82 11.76 2.94
# display df1
head(df1)
Education | Total | Proportion |
---|---|---|
High School Diploma | 450 | 26.47 |
Bachelor’s | 1000 | 58.82 |
Master’s | 200 | 11.76 |
Doctoral | 50 | 2.94 |
To create a new data frame with 2 variables (1x ordinal & 1x integer) and 4 observations, I used base R data.frame() function. I assigned the name ‘Education’ to a vector of 4 character values representing levels of education to be the ordinal variable. The name ‘Total’ was assigned to a vector of 4 integer numbers to represent the number of people having the type of education level from an imaginary population sample. This is to be the integer variable. To ensure R reads the numbers as integer and not numeric (double), L must be placed after the desired number. I also used the argument stringsAsFactors=TRUE (default setting) so that the character column is converted to a factor column, ensuring my categorical ordinal variable is in the correct data type (R Core Team 2020).
Since R orders factor levels alphabetically by default, I used the factor() function to manually order the levels in the correct order. To achieve this, the factor() function levels= argument needs to be assigned a character vector of the levels and in the order required, as well as using the ordered=TRUE argument which is then assigned to over-write the ‘Education’ column via subsetting.
I used the str() function to then show the structure of my variable and levels() function to show the levels of the ordinal variable.
I created an additional numeric vector called ‘Proportion’ by using c() function along with R’s arithmetic operators to calculate the proportion of each integer value of the ‘Total’ variable. To add this vector to the data frame, I used base R cbind() function which combines R objects by columns and assigned it to ‘df1’. The new data frame now has 3 variables. I then checked the attributes and dimensions of ‘df1’ by using attributes() and dim() functions.
# create data frame with 1 common variable to df1 & 1 new categorical variable: # common variable will be 'Education'
# new categorical variable will be 'Abbreviation'
<- data.frame(
df2 "Education" = c("High School Diploma", "Bachelor's",
"Master's", "Doctoral"),
"Abbreviation" = c("HS", "BS", "MS", "PHD"),
stringsAsFactors = TRUE
)
#order factor levels
$Education <- df2$Education %>% factor(
df2levels = c("High School Diploma", "Bachelor's", "Master's", "Doctoral"),
ordered = TRUE
)
$Abbreviation <- df2$Abbreviation %>% factor(
df2levels = c("HS", "BS", "MS", "PHD"),
ordered = TRUE
)
#join df2 to df1 using left_join() function and assign it to df3
<- df2 %>%
df3 left_join(df1, by = "Education")
# display df3
head(df3)
Education | Abbreviation | Total | Proportion |
---|---|---|---|
High School Diploma | HS | 450 | 26.47 |
Bachelor’s | BS | 1000 | 58.82 |
Master’s | MS | 200 | 11.76 |
Doctoral | PHD | 50 | 2.94 |
# display structure of df3
str(df3)
'data.frame': 4 obs. of 4 variables:
$ Education : Ord.factor w/ 4 levels "High School Diploma"<..: 1 2 3 4
$ Abbreviation: Ord.factor w/ 4 levels "HS"<"BS"<"MS"<..: 1 2 3 4
$ Total : int 450 1000 200 50
$ Proportion : num 26.47 58.82 11.76 2.94
I created a new data frame called ‘df2’ which will share one common variable with ‘df1’ from the previous task and has one new categorical variable. The common variable will be ‘Education’ and the new variable will be ‘Abbreviation’. This variables values will be an abbreviations of the levels of education. This data frame was created by the same method as the previous task as well as the ordinal factors being ordered in the same manner.
To join this data frame to the first dataset, ‘df1’, I used the left_join() function from the dplyr package. This function allows for the addition of new variables to one data frame based on matching observations in another (Wickham et al. 2020). This will result in the combination of variables from my 2 data frames, with the variables being added to the right of the first data frame used as the first argument in the left_join() function. For the join to work, a key variable is needed, which in this case will be the ‘Education’ variable and I will set the key by using by=“Education” argument in left_join().
I wanted the new variable ‘Abbreviation’ to maintain it’s position after ‘Education’. Therefore, I chose to use ‘df2’ as the first argument in left_join() and ‘df1’ as the second argument. This will result in the variables from ‘df1’ being added to the right of ‘df2’ data frame.
Wickham, H & Hester, J 2020, readr: Read Rectangular Text Data, R package version 1.4.0, viewed 20 January 2021, URL.
Wickham, H et al. 2020, dplyr: A Grammar of Data Manipulation, R package version 1.0.2, viewed 20 January 2021, URL.
Xie, Y 2020, knitr: A General-Purpose Package for Dynamic Report Generation in R, R package version 1.30, viewed 20 January 2021, URL.
Adiquant 2020, Complete White House Staff Data from 1997-2020, Version 1.0, data file, viewed 20 January 2021, URL.
Wickham, H & Grolemund, G 2017, R for Data Science, 1st edn, O’Reilly Media, Sebastopol, California, USA.
R Core Team 2020, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, viewed 20 January 2021, URL.
McNamara, A & Horton, NJ 2017, ‘Wrangling categorical data in R’, PeerJ Preprints, 5:e3163v2 URL.
Long, JD & Teetor, P 2019, R Cookbook, 2nd edn, O’Reilly Media, Sebastopol, California, USA.
Wickham, H 2015, Advanced R, 1st edn, CRC Press, Boca Raton, Florida, USA.