Attrition refers to employees leaving an organization without the organization being able to immediately fill their vacancies. High rates of employee attrition often occur when key factors, such as inadequate compensation, poor work-life balance, ineffective management, lack of career advancement opportunities, and poor working conditions, are left unaddressed
Understanding who leaves and why is crucial for improving workplace policies, reducing turnover, and minimizing the negative impact on productivity, culture, and financial performance. High attrition can lead to increased recruitment and training costs, loss of institutional knowledge, and burnout among remaining staff. Luckily, there are solutions. By effectively leveraging data-driven insights from employee surveys, exit interviews, and feedback, organizations/companies can identify root causes to employee attrition and implement targeted strategies to improve retention and create a healthier work environment.
🔗 to Dataset: IBM HR Analytics Employee Attrition and Performance-Kaggle
This is a fictional data set created by IBM data scientists. It
includes the data of nearly 1500 employees and 35 variables. The dataset
contains information on employee demographics, income, job satisfaction,
job information, and whether or not the employee attrited.
| # | Variable | Definition |
|---|---|---|
| 1 | Age | Age of employee |
| 2 | Gender | Employee’s gender (male/female) |
| 3 | Attrition | Whether employee left company (Yes/No) |
| 4 | BusinessTravel | Frequency of travel (Rarely, Frequently, etc.) |
| 5 | DailyRate | Daily salary rate (USD) |
| 6 | Department | Department of employee (Sales, R&D, etc.) |
| 7 | DistanceFromHome | Distance from home to work (miles) |
| 8 | Education | Level of education achieved by employee (1.Below College, 2.College, etc.) |
| 9 | EducationField | Employee’s field of study |
| 10 | EmployeeCount | N/A |
| 11 | EmployeeNumber | Unique identifier for each employee |
| 12 | EnvironmentSatisfaction | Employee satisfaction with working environment (1-4) |
| 13 | HourlyRate | Hourly salary rate (USD) |
| 14 | JobInvolvement | Level of involvement required for employee’s job (1-4) |
| 15 | JobLevel | Employee’s level of work (1-5) |
| 16 | JobRole | Employee’s role in the organization (Sales exec., Lab Tech., etc.) |
| 17 | JobSatisfaction | Satisfaction rating of the job (1-4) |
| 18 | Marital Status | Employee’s marital status (S, M, D) |
| 19 | MonthlyIncome | Employee’s monthly salary ($) |
| 20 | Monthly Rate | Monthly salary rate (USD) |
| 21 | NumCompaniesWorked | # of companies the employee worked for |
| 22 | Over 18 | Whether or not the employee is 18 yrs old |
| 23 | OverTime | Whether or not the employee works overtime (yes/no) |
| 24 | PercentSalaryHike | Salary increase rate |
| 25 | PerformanceRating | Performance rating of employee (1-4) |
| 26 | RelationshipSatisfaction | Employee’s satisfaction with personal relationships (1-4) |
| 27 | StandardHours | Standard working hours for employees |
| 28 | StockOptionLevel | Employee’s stock option level |
| 29 | TotalWorkingYears | Total # of years the employee has worked |
| 30 | TrainingTimesLastYear | # of trainings the employee has taken |
| 31 | WorkLifeBalance | Employee’s perception of their work-life balance (1-4) |
| 32 | YearsAtCompany | # of years employee has been with current company |
| 33 | YearsInCurrentRole | # of years employee has been in current role |
| 34 | YearsSinceLastPromoted | # of years since last promotion |
| 35 | YearsWithCurrManager | # of years with current manager |
Disclosure: This data is intended for research purposes and data practice. Models designed from this data are not designed to predict future attrition X number of months from now. The primary goal is research, data exploration, and inference, rather than developing a predictive model for future attrition. Typically, machine learning workflows involve splitting data into training and testing sets to evaluate how well a model generalizes. However, in this case, the objective is not to predict employee attrition a specific number of months from now but rather to understand the key factors associated with attrition and their relative importance.
This analysis covers 4 key areas:
# 1.) Install & load necessary packages
library(RPostgres) #to connect to SQL code
library(DBI) #connecting to SQL database
# 2.) Enter database info
# connection <- dbConnect(Postgres(),
# dbname = "Name_Of_Your_Database",
# host = "yourHost",
# port = "####",
# user = "insert_user",
# password = "**********")Note: I changed the name of variables by just putting underscores. Just for my own satisfaction. Over18 was removed before querying giving us 34 variables now.
To begin the analysis, I first loaded the dataset into a new SQL table. This ensures data integrity, allows for efficient querying, and provides a clean workspace for further transformations and analysis.
While it’s best practice to check every variable for inconsistencies, I’ll focus on a few key text-based variables to save time. This helps catch common issues like misspellings or inconsistent formatting.
| business_travel |
|---|
| Non-Travel |
| Travel_Frequently |
| Travel_Rarely |
| gender |
|---|
| Female |
| Male |
| department |
|---|
| Human Resources |
| Research & Development |
| Sales |
Missing values can impact analysis and model performance. Here, I check for NULL values in the ‘attrition’ and ‘employee_number’ columns to ensure data completeness and decide how to handle any gaps.
Select COUNT(*) as null_count
From hr_attrition_transformed
Where attrition is NULL OR employee_number is NULL;| null_count |
|---|
| 0 |
I chose to remove these for their own specifc reasons. Either the information seemed irrelevant or I did not think it mattered to the overall analysis. Additionally:
Redundant Information: Some columns may duplicate information already captured in other variables. Keeping both can lead to multicollinearity in regression models.
Irrelevant to the Objective: If a column does not directly relate to predicting attrition or improving insights, removing it helps streamline the analysis.
Improved Model Performance: Keeping only relevant features reduces computational cost and prevents overfitting in machine learning models.
Alter Table hr_attrition_transformed
Drop Column daily_rate, --not useful; may vary
Drop Column hourly_rate, --not useful; may vary
Drop Column monthly_rate, --not useful; may vary
Drop Column distance_from_home, --range is 1-29 miles
Drop Column standard_hours, --nonsensical; same for all employees
Drop Column employee_count; --nonsensical; every value is 1The data contains multiple salary-related columns. To simplify the analysis and avoid redundancy, I created a single ‘annual_income’ variable that combines relevant income information into one value. The CSV was then saved and exported to R for further cleaning to prepare for the regression analysis. (See Below)
| annual_income |
|---|
| 41 |
| 49 |
| 223 |
To begin my work in R, I load the necessary packages for data manipulation and visualization. I also import the dataset, making it ready for further processing and modeling. This step ensures that all required tools and data are available for the job ahead, including my analysis.
# Load necessary packages
library(tidyverse) #for data wrangling and visualization
library(knitr) #for tables
library(ggpubr) #for plot designing
library(scales) #for plot designing
library(corrplot) #for correlation matrix
library(plotly) #for interactive plots
library(reshape2) #for plotting correlations
library(caret) #for model evaluation and balancing
library(car) #for checking multicollinearity
library(broom) #for tables
library(performance) #for model performance/evaluation
library(randomForest) #for variable importance
library(vip) #for graphing variable importance
library(pROC) #for ROC curve and AUC
options(scipen = 999) By reviewing the structure of the dataset, I identify that a few variables need to be converted into categorical variables and checked for the correct baseline. The table below lists the variables I’ll focus on.
| Variable | Baseline |
|---|---|
| Attrition | No |
| Business Travel | Non-travel |
| Department | N/A |
| Education | 1 |
| Education_Field | N/A |
| Environment Satisfaction | 1 |
| Gender | Male |
| Job Involvement | 1 |
| Job Level | 1 |
| Job Role | N/A |
| Job Satisfaction | 1 |
| Marital Status | N/A |
| Overtime | No |
| Performance Rating | 1 |
| Relationship Satisfaction | 1 |
| Stock Option Level | 0 |
| Work-Life Balance | 1 |
Standardizing these variables correctly will enhance the model’s
performance and make the results easier to interpret! ✅
In this step, I convert certain numeric/character variables into categorical variables using the as.factor() function in R. This is necessary for variables that represent categories (like job levels or satisfaction) but are currently stored as numbers. By doing so, I ensure that the model treats them as distinct categories, not continuous values, which will improve the accuracy and interpretability of my analysis.
hr_attrition_transformed <- hr_attrition_transformed %>%
mutate(attrition = as.factor(attrition),
business_travel = as.factor(business_travel),
department = as.factor(department),
education = as.factor(education),
education_field = as.factor(education_field),
environment_satisfaction = as.factor(environment_satisfaction),
gender = as.factor(gender),
job_involvement = as.factor(job_involvement),
job_level = as.factor(job_level),
job_role = as.factor(job_role),
job_satisfaction = as.factor(job_satisfaction),
marital_status = as.factor(marital_status),
overtime = as.factor(overtime),
performance_rating = as.factor(performance_rating),
relationship_satisfaction = as.factor(relationship_satisfaction),
stock_option_level = as.factor(stock_option_level),
work_life_balance = as.factor(work_life_balance))First, I inspect the current levels of the categorical variables using the levels() function. If the levels are already in the correct order, great — no changes needed! 🎉 However, if the order is incorrect, I use the factor() function to set the variables to the desired baseline or order, ensuring that the model processes them correctly.
## [1] "No" "Yes"
# releveling
hr_attrition_transformed <- hr_attrition_transformed %>%
mutate(business_travel = factor(business_travel,
levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently")),
gender = factor(gender, levels = c("Male", "Female")),
marital_status = factor(marital_status, levels = c("Single", "Married", "Divorced")))Looks like business_travel, gender, and marital_status were the only variables that needed a little re-leveling! Now that they are in the right order, I can move forward confidently, knowing the model will interpret it correctly. ✅
ALL DONE!✅
Select attrition,
COUNT(*) AS count_attrition
From hr_attrition_transformed
Group By attrition
Order By attrition;
-- NOTE: this is an imbalanced data set. Important for our model interpretation.| attrition | count_attrition |
|---|---|
| No | 1233 |
| Yes | 237 |
Select attrition,
MIN(annual_income) AS min_income,
MAX(annual_income) AS max_income,
ROUND(AVG(annual_income), 2) AS avg_income
From hr_attrition_transformed
Group By attrition
Order By attrition;
--NOTE: those leaving earn about 25K less. Range looks similar,
-- but distribution might be different. We can easier see the median in R.| attrition | min_income | max_income | avg_income |
|---|---|---|---|
| No | 12 | 239 | 81.49 |
| Yes | 12 | 238 | 56.95 |
Select business_travel, attrition,
Count(business_travel) AS count_bus_travel,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By business_travel),2) AS rate
From hr_attrition_transformed
Group By business_travel, attrition
Order By business_travel, attrition;
--NOTE: Those who travel at all have a higher rate of leaving.
-- Dont quite know how much this plays a role in leaving, however.| business_travel | attrition | count_bus_travel | rate |
|---|---|---|---|
| Non-Travel | No | 138 | 92.00 |
| Non-Travel | Yes | 12 | 8.00 |
| Travel_Frequently | No | 208 | 75.09 |
| Travel_Frequently | Yes | 69 | 24.91 |
| Travel_Rarely | No | 887 | 85.04 |
| Travel_Rarely | Yes | 156 | 14.96 |
Select department, attrition,
Count(department) AS count_department,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By department),2) AS rate
From hr_attrition_transformed
Group By department, attrition
Order By department, attrition;
--NOTE: In terms of total employees, sales has a large rate of attrition.| department | attrition | count_department | rate |
|---|---|---|---|
| Human Resources | No | 51 | 80.95 |
| Human Resources | Yes | 12 | 19.05 |
| Research & Development | No | 828 | 86.16 |
| Research & Development | Yes | 133 | 13.84 |
| Sales | No | 354 | 79.37 |
| Sales | Yes | 92 | 20.63 |
Select education_field, attrition,
Count(education_field) AS count_edufield,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By education_field),2) AS rate
From hr_attrition_transformed
Group By education_field, attrition
Order By education_field, attrition;
--NOTE: Those with technical degrees have the largest rate with more
-- than 100 people. Perhaps finding better jobs, higher pay for their
-- skills, etc...| education_field | attrition | count_edufield | rate |
|---|---|---|---|
| Human Resources | No | 20 | 74.07 |
| Human Resources | Yes | 7 | 25.93 |
| Life Sciences | No | 517 | 85.31 |
| Life Sciences | Yes | 89 | 14.69 |
| Marketing | No | 124 | 77.99 |
| Marketing | Yes | 35 | 22.01 |
| Medical | No | 401 | 86.42 |
| Medical | Yes | 63 | 13.58 |
| Other | No | 71 | 86.59 |
| Other | Yes | 11 | 13.41 |
Select gender, attrition,
Count(gender) AS count_gender,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By gender),2) AS rate
From hr_attrition_transformed
Group By gender, attrition
Order By gender, attrition;
--NOTE: men and women share about the same rate. No glaring differences.| gender | attrition | count_gender | rate |
|---|---|---|---|
| Female | No | 501 | 85.20 |
| Female | Yes | 87 | 14.80 |
| Male | No | 732 | 82.99 |
| Male | Yes | 150 | 17.01 |
Select job_involvement, attrition,
Count(job_involvement) AS count_involvement,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_involvement),2) AS rate
From hr_attrition_transformed
Group By job_involvement, attrition
Order By job_involvement, attrition;
--NOTE: Not surprisingly, those who rate low job involvement have high
-- attrition rates (33.73%). The proportion of employees is low compared
-- to the other ratings. Those who engage in their job stay and vice versa.| job_involvement | attrition | count_involvement | rate |
|---|---|---|---|
| 1 | No | 55 | 66.27 |
| 1 | Yes | 28 | 33.73 |
| 2 | No | 304 | 81.07 |
| 2 | Yes | 71 | 18.93 |
| 3 | No | 743 | 85.60 |
| 3 | Yes | 125 | 14.40 |
| 4 | No | 131 | 90.97 |
| 4 | Yes | 13 | 9.03 |
Select job_level, attrition,
Count(job_level) AS count_joblevel,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_level),2) AS rate
From hr_attrition_transformed
Group By job_level, attrition
Order By job_level, attrition;
--NOTE: Those with entry level positions (1) have drastically higher rates
-- of attrition (26.34%) than any other position group. All other groups
-- have a 85%+ retention rate.
-- (1=Entry Level, 2=Mid Level, 3=Senior, 4=Lead, 5=Executive)| job_level | attrition | count_joblevel | rate |
|---|---|---|---|
| 1 | No | 400 | 73.66 |
| 1 | Yes | 143 | 26.34 |
| 2 | No | 482 | 90.26 |
| 2 | Yes | 52 | 9.74 |
| 3 | No | 186 | 85.32 |
| 3 | Yes | 32 | 14.68 |
| 4 | No | 101 | 95.28 |
| 4 | Yes | 5 | 4.72 |
| 5 | No | 64 | 92.75 |
| 5 | Yes | 5 | 7.25 |
Select job_role, attrition,
Count(job_role) AS count_jobrole,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_role),2) AS rate
From hr_attrition_transformed
Group By job_role, attrition
Order By job_role, attrition;
--NOTE: Not too sure why Sales Representatives have a high rate of attrition (40%).
--Would need to look more into how that job role plays out in careers.| job_role | attrition | count_jobrole | rate |
|---|---|---|---|
| Healthcare Representative | No | 122 | 93.13 |
| Healthcare Representative | Yes | 9 | 6.87 |
| Human Resources | No | 40 | 76.92 |
| Human Resources | Yes | 12 | 23.08 |
| Laboratory Technician | No | 197 | 76.06 |
| Laboratory Technician | Yes | 62 | 23.94 |
| Manager | No | 97 | 95.10 |
| Manager | Yes | 5 | 4.90 |
| Manufacturing Director | No | 135 | 93.10 |
| Manufacturing Director | Yes | 10 | 6.90 |
Select marital_status, attrition,
Count(marital_status) AS count_marital_status,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By marital_status),2) AS rate
From hr_attrition_transformed
Group By marital_status, attrition
Order By marital_status, attrition;
--NOTE: single people are leaving at a higher rate, which could
-- be interpreted as maybe not having an established career or
-- the flexibility to look for other jobs without bigger consequences.| marital_status | attrition | count_marital_status | rate |
|---|---|---|---|
| Divorced | No | 294 | 89.91 |
| Divorced | Yes | 33 | 10.09 |
| Married | No | 589 | 87.52 |
| Married | Yes | 84 | 12.48 |
| Single | No | 350 | 74.47 |
| Single | Yes | 120 | 25.53 |
Select overtime, attrition,
Count(overtime) AS count_overtime,
ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By overtime),2) AS rate
From hr_attrition_transformed
Group By overtime, attrition
Order By overtime, attrition;
--NOTE: Working overtime looks like a big reason to why employees leave.
-- I would be curious to see how overtime is handled in these work places
-- (unpaid, low pay, bad conditions, etc..)| overtime | attrition | count_overtime | rate |
|---|---|---|---|
| No | No | 944 | 89.56 |
| No | Yes | 110 | 10.44 |
| Yes | No | 289 | 69.47 |
| Yes | Yes | 127 | 30.53 |
## age attrition business_travel
## Min. :18.00 No :1233 Non-Travel : 150
## 1st Qu.:30.00 Yes: 237 Travel_Rarely :1043
## Median :36.00 Travel_Frequently: 277
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
##
## department education education_field
## Human Resources : 63 1:170 Human Resources : 27
## Research & Development:961 2:282 Life Sciences :606
## Sales :446 3:572 Marketing :159
## 4:398 Medical :464
## 5: 48 Other : 82
## Technical Degree:132
##
## employee_number environment_satisfaction gender job_involvement
## Min. : 1.0 1:284 Male :882 1: 83
## 1st Qu.: 491.2 2:287 Female:588 2:375
## Median :1020.5 3:453 3:868
## Mean :1024.9 4:446 4:144
## 3rd Qu.:1555.8
## Max. :2068.0
##
## job_level job_role job_satisfaction marital_status
## 1:543 Sales Executive :326 1:289 Single :470
## 2:534 Research Scientist :292 2:280 Married :673
## 3:218 Laboratory Technician :259 3:442 Divorced:327
## 4:106 Manufacturing Director :145 4:459
## 5: 69 Healthcare Representative:131
## Manager :102
## (Other) :215
## monthly_income num_companies_worked overtime percent_salary_hike
## Min. : 1009 Min. :0.000 No :1054 Min. :11.00
## 1st Qu.: 2911 1st Qu.:1.000 Yes: 416 1st Qu.:12.00
## Median : 4919 Median :2.000 Median :14.00
## Mean : 6503 Mean :2.693 Mean :15.21
## 3rd Qu.: 8379 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :19999 Max. :9.000 Max. :25.00
##
## performance_rating relationship_satisfaction stock_option_level
## 3:1244 1:276 0:631
## 4: 226 2:303 1:596
## 3:459 2:158
## 4:432 3: 85
##
##
##
## total_working_years training_times_last_year work_life_balance
## Min. : 0.00 Min. :0.000 1: 80
## 1st Qu.: 6.00 1st Qu.:2.000 2:344
## Median :10.00 Median :3.000 3:893
## Mean :11.28 Mean :2.799 4:153
## 3rd Qu.:15.00 3rd Qu.:3.000
## Max. :40.00 Max. :6.000
##
## yrs_at_company yrs_in_current_role yrs_since_last_promotion
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 1.000
## Mean : 7.008 Mean : 4.229 Mean : 2.188
## 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000
## Max. :40.000 Max. :18.000 Max. :15.000
##
## yrs_with_curr_manager annual_income
## Min. : 0.000 Min. : 12.00
## 1st Qu.: 2.000 1st Qu.: 34.00
## Median : 3.000 Median : 58.50
## Mean : 4.123 Mean : 77.53
## 3rd Qu.: 7.000 3rd Qu.:100.00
## Max. :17.000 Max. :239.00
##
I’m not seeing anything in these distributions that might affect the analysis (e.g., outliers, anomalies). Most columns are right skewed which is what I expected due to the nature of the data.
# breaking down data into age groups for simplicity
# Note: Age range = 0-40
hr_attrition_transformed <- hr_attrition_transformed %>%
mutate(YearRange = cut(yrs_at_company,
breaks = c(0, 5, 10,
15, 20, 25,
30, 35, 40),
labels = c("0-5", "05-10",
"10-15", "15-20",
"20-25", "25-30",
"30-35", "35-40"),
include.lowest = TRUE))
# putting data into plot dataframe
plot1df <- hr_attrition_transformed %>%
group_by(YearRange) %>%
summarise(annual_income = round(mean(annual_income, na.rm=T),0))
#Line Graph
p1 <- ggplot(plot1df, aes(x=YearRange, y=annual_income, group = 1))+
scale_y_continuous(breaks = seq(10, 250, 20),
labels = label_number(suffix = "K"),
limits = c(10, 250))+
geom_line(color="royalblue4")+
geom_point(size = 2.7,
fill = "royalblue",
color = "black",
shape = 23)+
geom_text(label = "Max Income = $239,998",
x=6.5,
y=50000)+
geom_text(label = "Min Income = $12,108",
x=6.45,
y=30000)+
labs(title = "Relationship between annual salary and # of years at company",
x= "Years at Company",
y= "Annual Income ($)")+
theme_bw()+
theme(plot.title = element_text(hjust=0, size=12, face="bold"),
axis.title = element_text(size=10, face="bold"),
axis.text = element_text(size=8),
plot.background = element_rect(fill="gray95"))
#for interactive plots
ggplotly(p1)jobsat <- hr_attrition_transformed %>%
#job satisfaction
group_by(job_satisfaction, attrition) %>%
summarise(employee_count = n(), .groups = "drop") %>%
mutate(jobsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
ungroup() %>%
group_by(job_satisfaction) %>%
mutate(cumul_rate = sum(jobsat_rate)) %>%
ungroup() %>%
select(attrition, job_satisfaction, jobsat_rate, cumul_rate)
# Bar graph
p2 <- ggplot(jobsat, aes(x=job_satisfaction, y=jobsat_rate,
fill=attrition)) +
geom_col(stat = 'identity',
position = position_dodge(),
width = 0.8,
color="black") +
geom_text(aes(label = ifelse(attrition == "No",
paste0(round(cumul_rate, 1), "%"),
"")),
position = position_stack(vjust = 0.95), size = 4) +
scale_y_continuous(breaks = seq(0, 50, 10),
labels = label_number(suffix = "%"),
limits = c(0,50)) +
labs(title = "Job Satisfaction and Employee Attrition",
x = "Job Satisfaction Rating",
y = "Percentage of Employees",
fill = "Attrition Status") +
scale_fill_manual(values = c("darkseagreen4", "indianred")) +
theme_bw() +
theme(plot.title = element_text(hjust=0, face="bold", size=12),
plot.background = element_rect(fill="gray95"),
axis.title = element_text(hjust=0.5, face="bold", size=10),
axis.text = element_text(size=8),
legend.background = element_rect(fill="papayawhip"))
#interactive plot
ggplotly(p2)Employees with a job satisfaction rating of 3 or 4 make up the largest portion of the workforce, each accounting for around 30% of employees. This suggests that most employees are at least moderately satisfied with their jobs. The red portion representing employees who left the company remains relatively small across all satisfaction levels. However, lower satisfaction ratings (1 and 2) may still be linked to a higher risk of attrition, which would require further analysis to confirm.
Overall, attrition appears to be low across all job satisfaction groups, indicating that while job satisfaction may play a role in employee retention, it is not the sole determining factor.
envirosat <- hr_attrition_transformed %>%
#environment satisfaction
group_by(environment_satisfaction, attrition) %>%
summarise(employee_count = n(), .groups = "drop") %>%
mutate(envirosat_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
ungroup() %>%
group_by(environment_satisfaction) %>%
mutate(cumul_rate = sum(envirosat_rate)) %>%
ungroup() %>%
select(attrition, environment_satisfaction, envirosat_rate,
cumul_rate)
# Bar graph
p3 <- ggplot(envirosat, aes(x=environment_satisfaction,
y=envirosat_rate,
fill=attrition)) +
geom_col(stat = 'identity',
position = position_dodge(),
width = 0.8,
color="black") +
geom_text(aes(label = ifelse(attrition == "No",
paste0(round(cumul_rate, 1), "%"),
"")),
position = position_stack(vjust = 0.95), size = 4) +
scale_y_continuous(breaks = seq(0, 50, 10),
labels = label_number(suffix = "%"),
limits = c(0,50)) +
labs(title = "Work Environment Satisfaction and Employee Attrition",
x = "Work Environment Satisfaction Rating",
y = "Percentage of Employees",
fill = "Attrition Status") +
scale_fill_manual(values = c("darkslategray3", "darkgoldenrod")) +
theme_bw() +
theme(plot.title = element_text(hjust=0, face="bold", size=12),
plot.background = element_rect(fill="gray95"),
axis.title = element_text(hjust=0.5, face="bold", size=10),
axis.text = element_text(size=8),
legend.background = element_rect(fill="papayawhip"))
#interactive plot
ggplotly(p3)Similarly, employees who rate their work environment satisfaction high (3 or 4) make up the majority of employees in the company. Attrition rates remain low regardless of work environment rating (< 5%), indicating that the work environment of the employee may not be a large factor in deciding to leave a company or not. Furthermore, it could be that the company these employees work at has a positive, clean, and collaborative work environment, meaning there may be other factors (i.e., compensation, career growth) that the company is not strong in, contributing to higher attrition rates.
relsat <- hr_attrition_transformed %>%
#relationship satisfaction
group_by(relationship_satisfaction, attrition) %>%
summarise(employee_count = n(), .groups = "drop") %>%
mutate(relsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
ungroup() %>%
group_by(relationship_satisfaction) %>%
mutate(cumul_rate = sum(relsat_rate)) %>%
ungroup() %>%
select(attrition, relationship_satisfaction, relsat_rate, cumul_rate)
# Bar graph
p4 <- ggplot(relsat, aes(x=relationship_satisfaction,
y=relsat_rate,
fill=attrition)) +
geom_col(stat = 'identity',
position = position_dodge(),
width = 0.8,
color="black") +
geom_text(aes(label = ifelse(attrition == "No",
paste0(round(cumul_rate, 1), "%"),
"")),
position = position_stack(vjust = 0.95), size = 4) +
scale_y_continuous(breaks = seq(0, 50, 10),
labels = label_number(suffix = "%"),
limits = c(0,50)) +
labs(title = "Relationship Satisfaction and Employee Attrition",
x = "Relationship Satisfaction Rating",
y = "Percentage of Employees",
fill = "Attrition Status") +
scale_fill_manual(values = c("skyblue3", "salmon2")) +
theme_bw() +
theme(plot.title = element_text(hjust=0, face="bold", size=12),
plot.background = element_rect(fill="gray95"),
axis.title = element_text(hjust=0.5, face="bold", size=10),
axis.text = element_text(size=8),
legend.background = element_rect(fill="papayawhip"))
#interactive plot
ggplotly(p4)When examining employees’ personal lives, relationship satisfaction appears to have little impact on attrition. The majority of employees (60.6%) report high relationship satisfaction, with ratings of 3 or 4. This suggests that employees with varying levels of relationship satisfaction are equally likely to stay with the company.
worklife <- hr_attrition_transformed %>%
#work-life balance
group_by(work_life_balance, attrition) %>%
summarise(employee_count = n(), .groups = "drop") %>%
mutate(worklife_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
ungroup() %>%
group_by(work_life_balance) %>%
mutate(cumul_rate = sum(worklife_rate)) %>%
ungroup() %>%
select(attrition, work_life_balance, worklife_rate, cumul_rate)
# Bar graph
p5 <- ggplot(worklife, aes(x=work_life_balance,
y=worklife_rate,
fill=attrition)) +
geom_col(stat = 'identity',
position = position_dodge(),
width = 0.8,
color="black") +
geom_text(aes(label = ifelse(attrition == "No",
paste0(round(cumul_rate, 1), "%"),
"")),
position = position_stack(vjust = 0.9), size = 4) +
scale_y_continuous(breaks = seq(0, 57, 10),
labels = label_number(suffix = "%"),
limits = c(0,57)) +
labs(title = "Work-Life Balance and Employee Attrition",
x = "Work-Life Balance Rating",
y = "Percentage of Employees",
fill = "Attrition Status") +
scale_fill_manual(values = c("lightcyan4", "tomato3")) +
theme_bw() +
theme(plot.title = element_text(hjust=0, face="bold", size=12),
plot.background = element_rect(fill="gray95"),
axis.title = element_text(hjust=0.5, face="bold", size=10),
axis.text = element_text(size=8),
legend.background = element_rect(fill="papayawhip"))
#interactive plot
ggplotly(p5)The majority of employees (60.8%) rate their work-life balance as a 3, followed by 23.4% rating it as a 2. Only a small percentage of employees rate their work-life balance as either extremely poor (1) or excellent (4). Interestingly, attrition rates remain relatively low across all ratings, with only a slight increase for employees who rate their work-life balance as a 3.
Performance Rating, Employee Number and Percent Salary Hike show no correlation with Attrition. Therefore, we will remove these variables and store the refined dataset in a new data frame for our predictive analysis.
Additionally, Overtime exhibits a low-to-moderate positive correlation with Attrition (0.25), suggesting that employees who work overtime are more likely to leave.
ind_corr <- corrMatrix %>%
melt() %>%
filter(value > 0.7) %>% #threshold > 0.7
filter(Var1 != Var2) #avoid looking at redundant info
ind_corr## Var1 Var2 value
## 1 monthly_income job_level 0.9502999
## 2 total_working_years job_level 0.7822078
## 3 annual_income job_level 0.9501928
## 4 job_level monthly_income 0.9502999
## 5 total_working_years monthly_income 0.7728932
## 6 annual_income monthly_income 0.9999869
## 7 performance_rating percent_salary_hike 0.7735500
## 8 percent_salary_hike performance_rating 0.7735500
## 9 job_level total_working_years 0.7822078
## 10 monthly_income total_working_years 0.7728932
## 11 annual_income total_working_years 0.7728457
## 12 yrs_in_current_role yrs_at_company 0.7587537
## 13 yrs_with_curr_manager yrs_at_company 0.7692124
## 14 yrs_at_company yrs_in_current_role 0.7587537
## 15 yrs_with_curr_manager yrs_in_current_role 0.7143648
## 16 yrs_at_company yrs_with_curr_manager 0.7692124
## 17 yrs_in_current_role yrs_with_curr_manager 0.7143648
## 18 job_level annual_income 0.9501928
## 19 monthly_income annual_income 0.9999869
## 20 total_working_years annual_income 0.7728457
#removing selected columns from hr_attrition_transformed
HRfinaldata <- hr_attrition_transformed %>%
select(-c(7, 11, 12, 15, 19, 22, 26, 28, 30))
ALL DONE!✅
In this section, I will walk through the modeling process to identify the key factors contributing to employee attrition within this company. I start by building the model, followed by checking its assumptions to ensure reliability and that the data fits the model well. Next, I analyze the predicted probabilities and break down the impact of demographic, job-related, and work/life satisfaction factors on employee attrition. Lastly, I bring in results from my exploratory findings and tie it in with the analysis.
I chose the logistic regression model (Generalized Linear Model or GLM) because it can handle binary outcomes such as employee attrition. This model allows me to examine how various factors influence the likelihood of attrition, providing odds ratios that are intuitive and easy to explain. Logistic regression is well-suited for both continuous variables (like age and income) and categorical variables (such as education level and department). It also doesn’t require a linear relationship between predictors and the outcome, making it flexible for the types of data in this study. While more complex models may offer better prediction accuracy, logistic regression provides an interpretable approach that balances model fit with computational efficiency.
#Creating Lists of Predicted Probabilities averaged over all other variables
predictions <- ggeffects::ggeffect(HR_model)Predicted probabilities represent the likelihood of an event occurring based on the model’s analysis. In this case, the model calculates the probability that an employee will leave the company given a specific factor (e.g., age, job role, work-life balance). These probabilities are averaged over all other factors in the model, meaning that while we focus on one variable at a time in the plot, the prediction accounts for the presence of all other variables in the background. A probability close to 100% (or 1) suggests a high likelihood of attrition, while a probability close to 0% (or 0) suggests a low likelihood.
In the graphs below, the y-axis shows the predicted probability of attrition, while the x-axis represents the variable being analyzed (e.g., age or marital status). Since the probabilities are averaged over all other factors, the trends we see represent the general effect of each variable while holding everything else constant.
The model has an accuracy rate of 88.5%. It seems quite high, right? It’s important to note that the model has trouble identifying the actual quitters (as shown by the low recall of 43% below). In summary, the model is good at predicting who WON’T leave but less effective at identifying who will.
Tjur’s R2 = 0.33; A 33% indicates substantial separation between the predicted values of the No’s and Yes’s. That’s a good thing!
Precision = 75%; percentage of actual “non-quitters” correctly identified by the model.
Recall = 43%; percentage of “quitters” the model correctly identified.
F1-score = 54.7%; the model shows low accuracy in identifying attrition cases and minimizing false positives.
| Actual/Predicted | No (Predicted) | Yes (Predicted) |
|---|---|---|
| No (Actual) | 1199 (true neg) | 34 (false pos) |
| Yes (Actual) | 135 (false neg) | 102 (true pos) |
Our analysis identified overtime as the most significant factor contributing to employee attrition. This finding highlights the need for HR to assess overtime policies and employee workload management.
Key considerations include:
Employee Well-being & Work-Life Balance: Extended work hours contribute to fatigue, stress, and diminished job satisfaction, increasing the likelihood of turnover.
Compensation & Fairness: Are employees receiving appropriate compensation for overtime work? Is overtime equitably distributed among teams?
Workload & Project Management: Are urgent deadlines driving excessive overtime? Adjusting project timelines and increasing staffing levels could mitigate burnout.
To address these concerns, HR may consider conducting employee feedback surveys, reviewing scheduling practices, and assessing hiring needs to create a more sustainable work environment. Link to Source
While business travel can provide opportunities for professional growth and networking, frequent travel is strongly associated with higher attrition rates. Our analysis indicates that employees who travel frequently are 459% more likely to leave the company compared to those who do not travel at all.
Key challenges associated with frequent travel include:
Burnout & Stress: Extended time away from home can lead to exhaustion and decreased job satisfaction.
Health & Well-being: Frequent travel has been linked to higher stress levels and potential long-term health impacts.
Work-Life Balance: Employees with caregiving responsibilities or other personal commitments may find frequent travel disruptive.
HR should work closely with managers to assess the necessity of each trip. In today’s digital environment, virtual meetings and remote collaboration tools can often replace in-person travel while still achieving business objectives. By offering more flexible travel policies, companies can enhance retention and maintain a more engaged workforce.
A strong promote-from-within culture can significantly enhance employee retention and long-term company success. Organizations that prioritize internal growth create a sense of loyalty, motivation, and career stability among employees.
For example, Publix Supermarkets has successfully implemented this approach, allowing employees to advance from entry-level roles to management. As a result, Publix maintains an annual voluntary attrition rate of just 5%, compared to the 65% industry average in retail. Link to Source
Beyond retention benefits, internal promotions often align with stock ownership programs and loyalty incentives, which I will expand on in the next section. By investing in career development and providing clear advancement pathways, companies can reduce turnover and build a more engaged, experienced workforce.
Employees thrive when they feel recognized, valued, and appreciated for their contributions. A well-structured loyalty bonus or stock ownership program can significantly boost morale, engagement, and retention.
Instead of relying solely on small gestures like pizza parties or happy hours, companies should consider meaningful incentives that reflect employees’ contributions to business success. For example, Publix Supermarkets is an employee-owned company where team members working over 1,000 hours receive stock valued at an average of 10% of their compensation. This model fosters loyalty, financial investment in the company’s success, and long-term retention.
While salary increases may not always be feasible, alternative recognition programs such as performance-based bonuses, stock options, and milestone celebrations can help reinforce a culture of appreciation and commitment. By implementing these strategies, companies can strengthen employee loyalty and reduce voluntary turnover.
A supportive work environment is crucial for retaining top talent and maintaining productivity. Our analysis shows that when employees experience greater satisfaction in both their personal and professional lives, their likelihood of quitting or underperforming decreases significantly.
To foster this balance, HR can implement initiatives such as:
Flexible Work Arrangements: Evaluate whether certain roles truly require in-office presence. Offering hybrid or remote options can improve work-life balance, especially for employees with caregiving responsibilities.
Commuting Considerations: Long commutes can reduce overall job satisfaction. Flexible start times or remote work days can help employees better manage family and personal responsibilities.
Team-Building & Recognition Events: Hosting quarterly company outings or morale-boosting activities can strengthen team cohesion and engagement.
By prioritizing employee well-being and flexibility, companies can reduce turnover, improve job satisfaction, and enhance overall performance.
Effective leadership is a key driver of employee satisfaction, performance, and retention. However, for leadership evaluations to be meaningful, employees must feel safe providing honest feedback without fear of repercussions.
To strengthen trust and accountability, HR should consider:
Ensuring Anonymity in Employee Surveys: Employees should feel confident that their feedback is truly anonymous, fostering more accurate and constructive insights.
Acting on Feedback Transparently: Rather than penalizing employees for low ratings, HR should use feedback as an opportunity to support leadership development and team collaboration.
Encouraging Constructive Improvement: Leadership evaluations should prioritize growth, offering managers training, mentorship, and resources to enhance their management skills.
By creating a culture of trust, accountability, and continuous improvement, companies can strengthen leadership effectiveness and employee engagement, ultimately improving overall team success.
Comments:
At 0-15 years, the salary increases by an average of around 30K, reflecting typical career growth as employees gain experience and promotions. Between 15-30 years, there is a sharper rise in salary, suggesting that employees who stay longer with the company are receiving substantial promotions and pay raises. However, around 30-40 years, the salary tends to either drop or level off. This could be due to several reasons: employees who have been with the company for this long may be nearing retirement, which can lead to a plateau in salary; senior positions may be given to younger employees; or employees who have been with the company for many years may have reached the peak of their job opportunities, with fewer new advancement opportunities available.
Note: For scale, the y-axis was set to represent both the minimum and maximum annual income in the data.