Introduction

Attrition refers to employees leaving an organization without the organization being able to immediately fill their vacancies. High rates of employee attrition often occur when key factors, such as inadequate compensation, poor work-life balance, ineffective management, lack of career advancement opportunities, and poor working conditions, are left unaddressed

Understanding who leaves and why is crucial for improving workplace policies, reducing turnover, and minimizing the negative impact on productivity, culture, and financial performance. High attrition can lead to increased recruitment and training costs, loss of institutional knowledge, and burnout among remaining staff. Luckily, there are solutions. By effectively leveraging data-driven insights from employee surveys, exit interviews, and feedback, organizations/companies can identify root causes to employee attrition and implement targeted strategies to improve retention and create a healthier work environment.

Data Source

🔗 to Dataset: IBM HR Analytics Employee Attrition and Performance-Kaggle

This is a fictional data set created by IBM data scientists. It includes the data of nearly 1500 employees and 35 variables. The dataset contains information on employee demographics, income, job satisfaction, job information, and whether or not the employee attrited.

Click to View Dataset Variables and Definitions
# Variable Definition
1 Age Age of employee
2 Gender Employee’s gender (male/female)
3 Attrition Whether employee left company (Yes/No)
4 BusinessTravel Frequency of travel (Rarely, Frequently, etc.)
5 DailyRate Daily salary rate (USD)
6 Department Department of employee (Sales, R&D, etc.)
7 DistanceFromHome Distance from home to work (miles)
8 Education Level of education achieved by employee (1.Below College, 2.College, etc.)
9 EducationField Employee’s field of study
10 EmployeeCount N/A
11 EmployeeNumber Unique identifier for each employee
12 EnvironmentSatisfaction Employee satisfaction with working environment (1-4)
13 HourlyRate Hourly salary rate (USD)
14 JobInvolvement Level of involvement required for employee’s job (1-4)
15 JobLevel Employee’s level of work (1-5)
16 JobRole Employee’s role in the organization (Sales exec., Lab Tech., etc.)
17 JobSatisfaction Satisfaction rating of the job (1-4)
18 Marital Status Employee’s marital status (S, M, D)
19 MonthlyIncome Employee’s monthly salary ($)
20 Monthly Rate Monthly salary rate (USD)
21 NumCompaniesWorked # of companies the employee worked for
22 Over 18 Whether or not the employee is 18 yrs old
23 OverTime Whether or not the employee works overtime (yes/no)
24 PercentSalaryHike Salary increase rate
25 PerformanceRating Performance rating of employee (1-4)
26 RelationshipSatisfaction Employee’s satisfaction with personal relationships (1-4)
27 StandardHours Standard working hours for employees
28 StockOptionLevel Employee’s stock option level
29 TotalWorkingYears Total # of years the employee has worked
30 TrainingTimesLastYear # of trainings the employee has taken
31 WorkLifeBalance Employee’s perception of their work-life balance (1-4)
32 YearsAtCompany # of years employee has been with current company
33 YearsInCurrentRole # of years employee has been in current role
34 YearsSinceLastPromoted # of years since last promotion
35 YearsWithCurrManager # of years with current manager


Disclosure: This data is intended for research purposes and data practice. Models designed from this data are not designed to predict future attrition X number of months from now. The primary goal is research, data exploration, and inference, rather than developing a predictive model for future attrition. Typically, machine learning workflows involve splitting data into training and testing sets to evaluate how well a model generalizes. However, in this case, the objective is not to predict employee attrition a specific number of months from now but rather to understand the key factors associated with attrition and their relative importance.

Project Scope

This analysis covers 4 key areas:

1. Data Preparation (SQL & R)

  • Using SQL and R to inspect, clean, and prepare the dataset for analysis

2. Exploratory Data Analysis (SQL & R)

  • Perform initial EDA in SQL with the string and binary variables
  • Create visualizations in R to uncover relationships in the data with the numeric, binary, and string variables

3. Modeling and Analysis (R)

Logistic Regression

  • Applied generalized linear regression (logistic regression) to identify factors influencing employee attrition (binary outcome: stayed vs. quit)
  • Checked model assumptions
  • Constructed predictive probability plots to assess the variables individual contribution to attrition
  • Examined odds ratios (ORs) to quantify how each factor affects the likelihood of an employee leaving
  • Included demographic, job-related, and satisfaction-related variables to understand their impact

Model Evaluation

  • Using metrics such as accuracy, Tjur’s R2, Precision, Recall, and the F1-score to evaluate how well our model performed and what can be done for future models.

4. Conclusion and Company Insights

  • Wrapping up the project with real-world, real-business strategies on the most influential factors from this analysis to help combat the negative impacts employee attrition can have on a company

1. Data Preparation (SQL & R)

Connect to SQL database with R

# 1.) Install & load necessary packages
library(RPostgres)  #to connect to SQL code
library(DBI)        #connecting to SQL database

# 2.) Enter database info
# connection <- dbConnect(Postgres(),
#                  dbname = "Name_Of_Your_Database",
#                  host = "yourHost",
#                  port = "####",
#                  user = "insert_user",
#                  password = "**********")

Note: I changed the name of variables by just putting underscores. Just for my own satisfaction. Over18 was removed before querying giving us 34 variables now.

Data Cleaning

Step 1: Load data into new table

To begin the analysis, I first loaded the dataset into a new SQL table. This ensures data integrity, allows for efficient querying, and provides a clean workspace for further transformations and analysis.

Drop Table if Exists hr_attrition_transformed;
CREATE TABLE hr_attrition_transformed AS  --the new table for analysis in R and EDA.
Select *
From hr_attrition;
--34 variables

Step 2: Check data for typos/errors

While it’s best practice to check every variable for inconsistencies, I’ll focus on a few key text-based variables to save time. This helps catch common issues like misspellings or inconsistent formatting.

Select Distinct(business_travel)
From hr_attrition_transformed; --3 levels. No typos.
3 records
business_travel
Non-Travel
Travel_Frequently
Travel_Rarely
Select Distinct(gender)
From hr_attrition_transformed; -- No typos
2 records
gender
Female
Male
Select Distinct(department)
From hr_attrition_transformed; --No typos
3 records
department
Human Resources
Research & Development
Sales

Step 3: Identify NULL values in ‘attrition’ and ‘employee_number’

Missing values can impact analysis and model performance. Here, I check for NULL values in the ‘attrition’ and ‘employee_number’ columns to ensure data completeness and decide how to handle any gaps.

Select COUNT(*) as null_count
From hr_attrition_transformed
Where attrition is NULL OR employee_number is NULL;
1 records
null_count
0

Step 4: Remove unnecessary columns

I chose to remove these for their own specifc reasons. Either the information seemed irrelevant or I did not think it mattered to the overall analysis. Additionally:

  • Redundant Information: Some columns may duplicate information already captured in other variables. Keeping both can lead to multicollinearity in regression models.

  • Irrelevant to the Objective: If a column does not directly relate to predicting attrition or improving insights, removing it helps streamline the analysis.

  • Improved Model Performance: Keeping only relevant features reduces computational cost and prevents overfitting in machine learning models.

Alter Table hr_attrition_transformed
Drop Column daily_rate, --not useful; may vary
Drop Column hourly_rate, --not useful; may vary
Drop Column monthly_rate, --not useful; may vary
Drop Column distance_from_home, --range is 1-29 miles
Drop Column standard_hours, --nonsensical; same for all employees
Drop Column employee_count; --nonsensical; every value is 1

Step 5: Creating ‘annual_income’ variable to condense income info

The data contains multiple salary-related columns. To simplify the analysis and avoid redundancy, I created a single ‘annual_income’ variable that combines relevant income information into one value. The CSV was then saved and exported to R for further cleaning to prepare for the regression analysis. (See Below)

Alter Table hr_attrition_transformed
Add Column annual_income INT;
Update hr_attrition_transformed
Set annual_income = (monthly_income * 12)/1000;
Select annual_income
From hr_attrition_transformed
Limit 3;
3 records
annual_income
41
49
223

Step 6: Loading packages & dataset in R

To begin my work in R, I load the necessary packages for data manipulation and visualization. I also import the dataset, making it ready for further processing and modeling. This step ensures that all required tools and data are available for the job ahead, including my analysis.

# Load necessary packages
library(tidyverse)     #for data wrangling and visualization
library(knitr)         #for tables
library(ggpubr)        #for plot designing
library(scales)        #for plot designing
library(corrplot)      #for correlation matrix
library(plotly)        #for interactive plots
library(reshape2)      #for plotting correlations
library(caret)         #for model evaluation and balancing
library(car)           #for checking multicollinearity
library(broom)         #for tables
library(performance)   #for model performance/evaluation
library(randomForest)  #for variable importance
library(vip)           #for graphing variable importance
library(pROC)          #for ROC curve and AUC
options(scipen = 999) 
# Load in Data
hr_attrition_transformed <- read_csv("hr_attrition_transformed.csv")

Step 7: Viewing structure of dataset

str(hr_attrition_transformed)

By reviewing the structure of the dataset, I identify that a few variables need to be converted into categorical variables and checked for the correct baseline. The table below lists the variables I’ll focus on.

Click to View Variables
Variable Baseline
Attrition No
Business Travel Non-travel
Department N/A
Education 1
Education_Field N/A
Environment Satisfaction 1
Gender Male
Job Involvement 1
Job Level 1
Job Role N/A
Job Satisfaction 1
Marital Status N/A
Overtime No
Performance Rating 1
Relationship Satisfaction 1
Stock Option Level 0
Work-Life Balance 1


Standardizing these variables correctly will enhance the model’s performance and make the results easier to interpret! ✅

Step 8: Converting numeric/character variables to categorical

In this step, I convert certain numeric/character variables into categorical variables using the as.factor() function in R. This is necessary for variables that represent categories (like job levels or satisfaction) but are currently stored as numbers. By doing so, I ensure that the model treats them as distinct categories, not continuous values, which will improve the accuracy and interpretability of my analysis.

  hr_attrition_transformed <- hr_attrition_transformed %>% 
    mutate(attrition = as.factor(attrition),
           business_travel = as.factor(business_travel),
           department = as.factor(department),
           education = as.factor(education),
           education_field = as.factor(education_field),
           environment_satisfaction = as.factor(environment_satisfaction),
           gender = as.factor(gender),
           job_involvement = as.factor(job_involvement),
           job_level = as.factor(job_level),
           job_role = as.factor(job_role),
           job_satisfaction = as.factor(job_satisfaction),
           marital_status = as.factor(marital_status),
           overtime = as.factor(overtime),
           performance_rating = as.factor(performance_rating),
           relationship_satisfaction = as.factor(relationship_satisfaction),
           stock_option_level = as.factor(stock_option_level),
           work_life_balance = as.factor(work_life_balance))

Step 9: Releveling variables

First, I inspect the current levels of the categorical variables using the levels() function. If the levels are already in the correct order, great — no changes needed! 🎉 However, if the order is incorrect, I use the factor() function to set the variables to the desired baseline or order, ensuring that the model processes them correctly.

levels(hr_attrition_transformed$overtime)
## [1] "No"  "Yes"
# releveling
hr_attrition_transformed <- hr_attrition_transformed %>% 
  mutate(business_travel = factor(business_travel, 
                                  levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently")),
         gender = factor(gender, levels = c("Male", "Female")),
         marital_status = factor(marital_status, levels = c("Single", "Married", "Divorced")))

Looks like business_travel, gender, and marital_status were the only variables that needed a little re-leveling! Now that they are in the right order, I can move forward confidently, knowing the model will interpret it correctly. ✅


ALL DONE!


2. Exploratory Data Analysis (SQL & R)

Attrition Rates on Key Variables (SQL)

Attrition Count

Select attrition,
COUNT(*) AS count_attrition
From hr_attrition_transformed
Group By attrition
Order By attrition;

-- NOTE: this is an imbalanced data set. Important for our model interpretation.
2 records
attrition count_attrition
No 1233
Yes 237

Annual Income & Attrition

Select attrition, 
        MIN(annual_income) AS min_income, 
        MAX(annual_income) AS max_income,
        ROUND(AVG(annual_income), 2) AS avg_income 
From hr_attrition_transformed
Group By attrition
Order By attrition; 

--NOTE: those leaving earn about 25K less. Range looks similar, 
-- but distribution might be different. We can easier see the median in R.
2 records
attrition min_income max_income avg_income
No 12 239 81.49
Yes 12 238 56.95

Business Travel & Attrition

Select business_travel, attrition,
    Count(business_travel) AS count_bus_travel,
        ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By business_travel),2) AS rate
From hr_attrition_transformed
Group By business_travel, attrition
Order By business_travel, attrition;

--NOTE: Those who travel at all have a higher rate of leaving. 
-- Dont quite know how much this plays a role in leaving, however.
6 records
business_travel attrition count_bus_travel rate
Non-Travel No 138 92.00
Non-Travel Yes 12 8.00
Travel_Frequently No 208 75.09
Travel_Frequently Yes 69 24.91
Travel_Rarely No 887 85.04
Travel_Rarely Yes 156 14.96

Department & Attrition

Select department, attrition,
    Count(department) AS count_department,
        ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By department),2) AS rate
From hr_attrition_transformed
Group By department, attrition
Order By department, attrition;

--NOTE: In terms of total employees, sales has a large rate of attrition.
6 records
department attrition count_department rate
Human Resources No 51 80.95
Human Resources Yes 12 19.05
Research & Development No 828 86.16
Research & Development Yes 133 13.84
Sales No 354 79.37
Sales Yes 92 20.63

Education Field & Attrition

Select education_field, attrition,
    Count(education_field) AS count_edufield,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By education_field),2) AS rate
From hr_attrition_transformed
Group By education_field, attrition
Order By education_field, attrition;

--NOTE: Those with technical degrees have the largest rate with more
-- than 100 people. Perhaps finding better jobs, higher pay for their
-- skills, etc...
Displaying records 1 - 10
education_field attrition count_edufield rate
Human Resources No 20 74.07
Human Resources Yes 7 25.93
Life Sciences No 517 85.31
Life Sciences Yes 89 14.69
Marketing No 124 77.99
Marketing Yes 35 22.01
Medical No 401 86.42
Medical Yes 63 13.58
Other No 71 86.59
Other Yes 11 13.41

Gender & Attrition

Select gender, attrition,
    Count(gender) AS count_gender,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By gender),2) AS rate
From hr_attrition_transformed
Group By gender, attrition
Order By gender, attrition;

--NOTE: men and women share about the same rate. No glaring differences.
4 records
gender attrition count_gender rate
Female No 501 85.20
Female Yes 87 14.80
Male No 732 82.99
Male Yes 150 17.01

Job Involvement & Attrition

Select job_involvement, attrition,
    Count(job_involvement) AS count_involvement,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_involvement),2) AS rate
From hr_attrition_transformed
Group By job_involvement, attrition
Order By job_involvement, attrition;

--NOTE: Not surprisingly, those who rate low job involvement have high 
--      attrition rates (33.73%). The proportion of employees is low compared
--      to the other ratings. Those who engage in their job stay and vice versa.
8 records
job_involvement attrition count_involvement rate
1 No 55 66.27
1 Yes 28 33.73
2 No 304 81.07
2 Yes 71 18.93
3 No 743 85.60
3 Yes 125 14.40
4 No 131 90.97
4 Yes 13 9.03

Job Level & Attrition

Select job_level, attrition,
    Count(job_level) AS count_joblevel,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_level),2) AS rate
From hr_attrition_transformed
Group By job_level, attrition
Order By job_level, attrition;

--NOTE: Those with entry level positions (1) have drastically higher rates 
--      of attrition (26.34%) than any other position group. All other groups
--      have a 85%+ retention rate.
--      (1=Entry Level, 2=Mid Level, 3=Senior, 4=Lead, 5=Executive)
Displaying records 1 - 10
job_level attrition count_joblevel rate
1 No 400 73.66
1 Yes 143 26.34
2 No 482 90.26
2 Yes 52 9.74
3 No 186 85.32
3 Yes 32 14.68
4 No 101 95.28
4 Yes 5 4.72
5 No 64 92.75
5 Yes 5 7.25

Job Role & Attrition

Select job_role, attrition,
    Count(job_role) AS count_jobrole,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_role),2) AS rate
From hr_attrition_transformed
Group By job_role, attrition
Order By job_role, attrition;

--NOTE: Not too sure why Sales Representatives have a high rate of attrition (40%).
--Would need to look more into how that job role plays out in careers.
Displaying records 1 - 10
job_role attrition count_jobrole rate
Healthcare Representative No 122 93.13
Healthcare Representative Yes 9 6.87
Human Resources No 40 76.92
Human Resources Yes 12 23.08
Laboratory Technician No 197 76.06
Laboratory Technician Yes 62 23.94
Manager No 97 95.10
Manager Yes 5 4.90
Manufacturing Director No 135 93.10
Manufacturing Director Yes 10 6.90

Marital Status & Attrition

Select marital_status, attrition,
    Count(marital_status) AS count_marital_status,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By marital_status),2) AS rate
From hr_attrition_transformed
Group By marital_status, attrition
Order By marital_status, attrition;

--NOTE: single people are leaving at a higher rate, which could
-- be interpreted as maybe not having an established career or 
-- the flexibility to look for other jobs without bigger consequences.
6 records
marital_status attrition count_marital_status rate
Divorced No 294 89.91
Divorced Yes 33 10.09
Married No 589 87.52
Married Yes 84 12.48
Single No 350 74.47
Single Yes 120 25.53

Overtime & Attrition

Select overtime, attrition,
    Count(overtime) AS count_overtime,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By overtime),2) AS rate
From hr_attrition_transformed
Group By overtime, attrition
Order By overtime, attrition;

--NOTE: Working overtime looks like a big reason to why employees leave.
-- I would be curious to see how overtime is handled in these work places
-- (unpaid, low pay, bad conditions, etc..)
4 records
overtime attrition count_overtime rate
No No 944 89.56
No Yes 110 10.44
Yes No 289 69.47
Yes Yes 127 30.53

Descriptive Data

hr_attrition_transformed %>%
  select(c(1:29)) %>% 
  summary()
##       age        attrition           business_travel
##  Min.   :18.00   No :1233   Non-Travel       : 150  
##  1st Qu.:30.00   Yes: 237   Travel_Rarely    :1043  
##  Median :36.00              Travel_Frequently: 277  
##  Mean   :36.92                                      
##  3rd Qu.:43.00                                      
##  Max.   :60.00                                      
##                                                     
##                   department  education         education_field
##  Human Resources       : 63   1:170     Human Resources : 27   
##  Research & Development:961   2:282     Life Sciences   :606   
##  Sales                 :446   3:572     Marketing       :159   
##                               4:398     Medical         :464   
##                               5: 48     Other           : 82   
##                                         Technical Degree:132   
##                                                                
##  employee_number  environment_satisfaction    gender    job_involvement
##  Min.   :   1.0   1:284                    Male  :882   1: 83          
##  1st Qu.: 491.2   2:287                    Female:588   2:375          
##  Median :1020.5   3:453                                 3:868          
##  Mean   :1024.9   4:446                                 4:144          
##  3rd Qu.:1555.8                                                        
##  Max.   :2068.0                                                        
##                                                                        
##  job_level                      job_role   job_satisfaction  marital_status
##  1:543     Sales Executive          :326   1:289            Single  :470   
##  2:534     Research Scientist       :292   2:280            Married :673   
##  3:218     Laboratory Technician    :259   3:442            Divorced:327   
##  4:106     Manufacturing Director   :145   4:459                           
##  5: 69     Healthcare Representative:131                                   
##            Manager                  :102                                   
##            (Other)                  :215                                   
##  monthly_income  num_companies_worked overtime   percent_salary_hike
##  Min.   : 1009   Min.   :0.000        No :1054   Min.   :11.00      
##  1st Qu.: 2911   1st Qu.:1.000        Yes: 416   1st Qu.:12.00      
##  Median : 4919   Median :2.000                   Median :14.00      
##  Mean   : 6503   Mean   :2.693                   Mean   :15.21      
##  3rd Qu.: 8379   3rd Qu.:4.000                   3rd Qu.:18.00      
##  Max.   :19999   Max.   :9.000                   Max.   :25.00      
##                                                                     
##  performance_rating relationship_satisfaction stock_option_level
##  3:1244             1:276                     0:631             
##  4: 226             2:303                     1:596             
##                     3:459                     2:158             
##                     4:432                     3: 85             
##                                                                 
##                                                                 
##                                                                 
##  total_working_years training_times_last_year work_life_balance
##  Min.   : 0.00       Min.   :0.000            1: 80            
##  1st Qu.: 6.00       1st Qu.:2.000            2:344            
##  Median :10.00       Median :3.000            3:893            
##  Mean   :11.28       Mean   :2.799            4:153            
##  3rd Qu.:15.00       3rd Qu.:3.000                             
##  Max.   :40.00       Max.   :6.000                             
##                                                                
##  yrs_at_company   yrs_in_current_role yrs_since_last_promotion
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.000          
##  1st Qu.: 3.000   1st Qu.: 2.000      1st Qu.: 0.000          
##  Median : 5.000   Median : 3.000      Median : 1.000          
##  Mean   : 7.008   Mean   : 4.229      Mean   : 2.188          
##  3rd Qu.: 9.000   3rd Qu.: 7.000      3rd Qu.: 3.000          
##  Max.   :40.000   Max.   :18.000      Max.   :15.000          
##                                                               
##  yrs_with_curr_manager annual_income   
##  Min.   : 0.000        Min.   : 12.00  
##  1st Qu.: 2.000        1st Qu.: 34.00  
##  Median : 3.000        Median : 58.50  
##  Mean   : 4.123        Mean   : 77.53  
##  3rd Qu.: 7.000        3rd Qu.:100.00  
##  Max.   :17.000        Max.   :239.00  
## 

Distributions of Numeric Variables (R)

I’m not seeing anything in these distributions that might affect the analysis (e.g., outliers, anomalies). Most columns are right skewed which is what I expected due to the nature of the data.

Age

Annual Income

Companies Worked For

Percent Salary Hike

Total Working Years

Training Times Last Year

Years at Company

Years in Current Role

Years Since Last Promotion

Years with Current Manager


Other Visualizations & Insights (R)

Plot 1: Annual Income by Years at Company

# breaking down data into age groups for simplicity
# Note: Age range = 0-40
hr_attrition_transformed <- hr_attrition_transformed %>%
  mutate(YearRange = cut(yrs_at_company,
                           breaks = c(0, 5, 10, 
                                      15, 20, 25, 
                                      30, 35, 40),
                           labels = c("0-5", "05-10", 
                                      "10-15", "15-20", 
                                      "20-25", "25-30", 
                                      "30-35", "35-40"),
                           include.lowest = TRUE))

# putting data into plot dataframe
plot1df <- hr_attrition_transformed %>% 
  group_by(YearRange) %>% 
  summarise(annual_income = round(mean(annual_income, na.rm=T),0))

#Line Graph
p1 <- ggplot(plot1df, aes(x=YearRange, y=annual_income, group = 1))+
  scale_y_continuous(breaks = seq(10, 250, 20),
                     labels = label_number(suffix = "K"),
                     limits = c(10, 250))+
  geom_line(color="royalblue4")+
  geom_point(size = 2.7,
             fill = "royalblue",
             color = "black",
             shape = 23)+
  geom_text(label = "Max Income = $239,998",
            x=6.5,
            y=50000)+
  geom_text(label = "Min Income = $12,108",
            x=6.45,
            y=30000)+
  labs(title = "Relationship between annual salary and # of years at company",
       x= "Years at Company",
       y= "Annual Income ($)")+
  theme_bw()+
  theme(plot.title = element_text(hjust=0, size=12, face="bold"),
        axis.title = element_text(size=10, face="bold"),
        axis.text = element_text(size=8),
        plot.background = element_rect(fill="gray95"))

#for interactive plots
ggplotly(p1)

Comments:

At 0-15 years, the salary increases by an average of around 30K, reflecting typical career growth as employees gain experience and promotions. Between 15-30 years, there is a sharper rise in salary, suggesting that employees who stay longer with the company are receiving substantial promotions and pay raises. However, around 30-40 years, the salary tends to either drop or level off. This could be due to several reasons: employees who have been with the company for this long may be nearing retirement, which can lead to a plateau in salary; senior positions may be given to younger employees; or employees who have been with the company for many years may have reached the peak of their job opportunities, with fewer new advancement opportunities available.
Note: For scale, the y-axis was set to represent both the minimum and maximum annual income in the data.

Plot 2: Relationship between Job Satisfaction & Employee Attrition

jobsat <- hr_attrition_transformed %>% 
  #job satisfaction
  group_by(job_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(jobsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>% 
    group_by(job_satisfaction) %>%
  mutate(cumul_rate = sum(jobsat_rate)) %>% 
  ungroup() %>% 
  select(attrition, job_satisfaction, jobsat_rate, cumul_rate)

# Bar graph
p2 <- ggplot(jobsat, aes(x=job_satisfaction, y=jobsat_rate,
                    fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
    geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Job Satisfaction and Employee Attrition",
       x = "Job Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("darkseagreen4", "indianred")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p2)

Comments:

Employees with a job satisfaction rating of 3 or 4 make up the largest portion of the workforce, each accounting for around 30% of employees. This suggests that most employees are at least moderately satisfied with their jobs. The red portion representing employees who left the company remains relatively small across all satisfaction levels. However, lower satisfaction ratings (1 and 2) may still be linked to a higher risk of attrition, which would require further analysis to confirm.

Overall, attrition appears to be low across all job satisfaction groups, indicating that while job satisfaction may play a role in employee retention, it is not the sole determining factor.

Plot 3: Relationship between Work Environment Satisfaction & Employee Attrition

envirosat <- hr_attrition_transformed %>% 
  #environment satisfaction
  group_by(environment_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(envirosat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>%
      group_by(environment_satisfaction) %>%
  mutate(cumul_rate = sum(envirosat_rate)) %>% 
  ungroup() %>% 
  select(attrition, environment_satisfaction, envirosat_rate,
         cumul_rate)

# Bar graph
p3 <- ggplot(envirosat, aes(x=environment_satisfaction,
                            y=envirosat_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Work Environment Satisfaction and Employee Attrition",
       x = "Work Environment Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("darkslategray3", "darkgoldenrod")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p3)

Comments:

Similarly, employees who rate their work environment satisfaction high (3 or 4) make up the majority of employees in the company. Attrition rates remain low regardless of work environment rating (< 5%), indicating that the work environment of the employee may not be a large factor in deciding to leave a company or not. Furthermore, it could be that the company these employees work at has a positive, clean, and collaborative work environment, meaning there may be other factors (i.e., compensation, career growth) that the company is not strong in, contributing to higher attrition rates.

Plot 4: Relationship between Relationship Satisfaction & Employee Attrition

relsat <- hr_attrition_transformed %>% 
  #relationship satisfaction
  group_by(relationship_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(relsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>%
      group_by(relationship_satisfaction) %>%
  mutate(cumul_rate = sum(relsat_rate)) %>% 
  ungroup() %>% 
  select(attrition, relationship_satisfaction, relsat_rate, cumul_rate)

# Bar graph
p4 <- ggplot(relsat, aes(x=relationship_satisfaction,
                            y=relsat_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Relationship Satisfaction and Employee Attrition",
       x = "Relationship Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("skyblue3", "salmon2")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p4)

Comments:

When examining employees’ personal lives, relationship satisfaction appears to have little impact on attrition. The majority of employees (60.6%) report high relationship satisfaction, with ratings of 3 or 4. This suggests that employees with varying levels of relationship satisfaction are equally likely to stay with the company.

Plot 5: Relationship between Work-Life Balance & Employee Attrition

worklife <- hr_attrition_transformed %>% 
  #work-life balance
  group_by(work_life_balance, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(worklife_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
  ungroup() %>%
      group_by(work_life_balance) %>%
  mutate(cumul_rate = sum(worklife_rate)) %>% 
  ungroup() %>% 
  select(attrition, work_life_balance, worklife_rate, cumul_rate)

# Bar graph
p5 <- ggplot(worklife, aes(x=work_life_balance,
                            y=worklife_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.9), size = 4) +
    scale_y_continuous(breaks = seq(0, 57, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,57)) +
  labs(title = "Work-Life Balance and Employee Attrition",
       x = "Work-Life Balance Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("lightcyan4", "tomato3")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p5)

Comments:

The majority of employees (60.8%) rate their work-life balance as a 3, followed by 23.4% rating it as a 2. Only a small percentage of employees rate their work-life balance as either extremely poor (1) or excellent (4). Interestingly, attrition rates remain relatively low across all ratings, with only a slight increase for employees who rate their work-life balance as a 3.

Plot 6: Correlation Matrix

Comments:

Performance Rating, Employee Number and Percent Salary Hike show no correlation with Attrition. Therefore, we will remove these variables and store the refined dataset in a new data frame for our predictive analysis.

Additionally, Overtime exhibits a low-to-moderate positive correlation with Attrition (0.25), suggesting that employees who work overtime are more likely to leave.

Identifying & Removing Independent Variables with High Intercorrelation

ind_corr <- corrMatrix %>%
  melt() %>% 
  filter(value > 0.7) %>% #threshold > 0.7
  filter(Var1 != Var2) #avoid looking at redundant info

ind_corr
##                     Var1                  Var2     value
## 1         monthly_income             job_level 0.9502999
## 2    total_working_years             job_level 0.7822078
## 3          annual_income             job_level 0.9501928
## 4              job_level        monthly_income 0.9502999
## 5    total_working_years        monthly_income 0.7728932
## 6          annual_income        monthly_income 0.9999869
## 7     performance_rating   percent_salary_hike 0.7735500
## 8    percent_salary_hike    performance_rating 0.7735500
## 9              job_level   total_working_years 0.7822078
## 10        monthly_income   total_working_years 0.7728932
## 11         annual_income   total_working_years 0.7728457
## 12   yrs_in_current_role        yrs_at_company 0.7587537
## 13 yrs_with_curr_manager        yrs_at_company 0.7692124
## 14        yrs_at_company   yrs_in_current_role 0.7587537
## 15 yrs_with_curr_manager   yrs_in_current_role 0.7143648
## 16        yrs_at_company yrs_with_curr_manager 0.7692124
## 17   yrs_in_current_role yrs_with_curr_manager 0.7143648
## 18             job_level         annual_income 0.9501928
## 19        monthly_income         annual_income 0.9999869
## 20   total_working_years         annual_income 0.7728457

Removing:

  • employee_number: irrelevant to data
  • job_level: highly correlated with annual_income
  • job_role: multicollinearity with department
  • monthly_income: highly correlated with annual_income; redundant
  • performance_rating: highly correlated with percent_salary_hike
  • total_working_years: redundant with annual_income
  • yrs_in_current_role: highly correlated with yrs_at_company
  • yrs_with_curr_manager: similar as above
  • YearRange: irrelevant to final data set
#removing selected columns from hr_attrition_transformed
HRfinaldata <- hr_attrition_transformed %>% 
  select(-c(7, 11, 12, 15, 19, 22, 26, 28, 30)) 


ALL DONE!


3. Modeling & Analysis (R)

In this section, I will walk through the modeling process to identify the key factors contributing to employee attrition within this company. I start by building the model, followed by checking its assumptions to ensure reliability and that the data fits the model well. Next, I analyze the predicted probabilities and break down the impact of demographic, job-related, and work/life satisfaction factors on employee attrition. Lastly, I bring in results from my exploratory findings and tie it in with the analysis.

What to Expect in This Section
  • Running the Model: fitting a logistic regression model to see what influences attrition.
  • Checking Model Assumptions: ensuring the model meets key statistical assumtions for valid interpretation.
  • Predicted Probability Plots: visualizing how different factors averaged across each other influence the likelihood of attrition
  • Assessing Demographic Factors: understanding the impact of age, gender, education, and other demographics.
  • Assessing Job-Related Factors: understanding how income, working overtime, tenure, and others contribute to attrition.
  • Assessing Work-Life Satisfaction: understanding the role of work-life balance, job satisfaction, and personal life on attrition.
  • Model Evaluation: seeing the performance of the model and pointing out flaws that helps communicate my findings to stakeholders more effectively.

Logistic Regression

I chose the logistic regression model (Generalized Linear Model or GLM) because it can handle binary outcomes such as employee attrition. This model allows me to examine how various factors influence the likelihood of attrition, providing odds ratios that are intuitive and easy to explain. Logistic regression is well-suited for both continuous variables (like age and income) and categorical variables (such as education level and department). It also doesn’t require a linear relationship between predictors and the outcome, making it flexible for the types of data in this study. While more complex models may offer better prediction accuracy, logistic regression provides an interpretable approach that balances model fit with computational efficiency.

HR_model <- glm(attrition ~ ., 
                 data = HRfinaldata,
                 family = binomial)

Checking Model Assumptions

check_model(HR_model)

✅ Model fits the data well
✅ Residuals look uniform with a few outliers
✅ The outliers are not influential
✅ Multicollinearity is less than 5
✅ Residuals exhibit a uniform distribution

Predicted Probability Plots

#Creating Lists of Predicted Probabilities averaged over all other variables
predictions <- ggeffects::ggeffect(HR_model)

What are Predicted Probabilities?

Predicted probabilities represent the likelihood of an event occurring based on the model’s analysis. In this case, the model calculates the probability that an employee will leave the company given a specific factor (e.g., age, job role, work-life balance). These probabilities are averaged over all other factors in the model, meaning that while we focus on one variable at a time in the plot, the prediction accounts for the presence of all other variables in the background. A probability close to 100% (or 1) suggests a high likelihood of attrition, while a probability close to 0% (or 0) suggests a low likelihood.

In the graphs below, the y-axis shows the predicted probability of attrition, while the x-axis represents the variable being analyzed (e.g., age or marital status). Since the probabilities are averaged over all other factors, the trends we see represent the general effect of each variable while holding everything else constant.

Age

Business Travel

Department

Education

Education Field

Environment Satisfaction

Gender

Job Involvement

Job Satisfaction

Marital Status

# of Companies Worked For

Overtime

Percent Salary Hike

Relationship Satisfaction

Stock Option Level

Training Events Last Year

Work-Life Balance

Years at Company

Years Since Last Promotion

Annual Income

Assessing Demographic Factors

Key Takeaways:

  • Every 1-year increase in age decreases the odds of quitting by 4%
  • For females, the odds of quitting are 35% lower than for males.
  • For each additional company the employee has worked for, the odds of quitting increase by 16%
  • Education, Field of Education, and Marital Status do not show a statistically significant effect on the odds of quitting. While there may be some variation, the model does not provide enough evidence to say they influence employee attrition.

Assessing Work/Life Satisfaction

Key Takeaways:

  • Generally, higher environment and job satisfaction and greater work-life balance decrease the odds of an employee quitting
  • Similarly, greater relationship satisfaction decreases the odds of quitting

Model Evaluation

The model has an accuracy rate of 88.5%. It seems quite high, right? It’s important to note that the model has trouble identifying the actual quitters (as shown by the low recall of 43% below). In summary, the model is good at predicting who WON’T leave but less effective at identifying who will.

Tjur’s R2 = 0.33; A 33% indicates substantial separation between the predicted values of the No’s and Yes’s. That’s a good thing!

Precision = 75%; percentage of actual “non-quitters” correctly identified by the model.

Recall = 43%; percentage of “quitters” the model correctly identified.

F1-score = 54.7%; the model shows low accuracy in identifying attrition cases and minimizing false positives.

Confusion Matrix

Actual/Predicted No (Predicted) Yes (Predicted)
No (Actual) 1199 (true neg) 34 (false pos)
Yes (Actual) 135 (false neg) 102 (true pos)

4. Conclusion and Company Insights

Impact of Overtime on Employee Retention

Our analysis identified overtime as the most significant factor contributing to employee attrition. This finding highlights the need for HR to assess overtime policies and employee workload management.

Key considerations include:

  • Employee Well-being & Work-Life Balance: Extended work hours contribute to fatigue, stress, and diminished job satisfaction, increasing the likelihood of turnover.

  • Compensation & Fairness: Are employees receiving appropriate compensation for overtime work? Is overtime equitably distributed among teams?

  • Workload & Project Management: Are urgent deadlines driving excessive overtime? Adjusting project timelines and increasing staffing levels could mitigate burnout.

To address these concerns, HR may consider conducting employee feedback surveys, reviewing scheduling practices, and assessing hiring needs to create a more sustainable work environment. Link to Source

Reevaluating Business Travel

While business travel can provide opportunities for professional growth and networking, frequent travel is strongly associated with higher attrition rates. Our analysis indicates that employees who travel frequently are 459% more likely to leave the company compared to those who do not travel at all.

Key challenges associated with frequent travel include:

  • Burnout & Stress: Extended time away from home can lead to exhaustion and decreased job satisfaction.

  • Health & Well-being: Frequent travel has been linked to higher stress levels and potential long-term health impacts.

  • Work-Life Balance: Employees with caregiving responsibilities or other personal commitments may find frequent travel disruptive.

HR should work closely with managers to assess the necessity of each trip. In today’s digital environment, virtual meetings and remote collaboration tools can often replace in-person travel while still achieving business objectives. By offering more flexible travel policies, companies can enhance retention and maintain a more engaged workforce.

The Value of Promoting from Within

A strong promote-from-within culture can significantly enhance employee retention and long-term company success. Organizations that prioritize internal growth create a sense of loyalty, motivation, and career stability among employees.

For example, Publix Supermarkets has successfully implemented this approach, allowing employees to advance from entry-level roles to management. As a result, Publix maintains an annual voluntary attrition rate of just 5%, compared to the 65% industry average in retail. Link to Source

Beyond retention benefits, internal promotions often align with stock ownership programs and loyalty incentives, which I will expand on in the next section. By investing in career development and providing clear advancement pathways, companies can reduce turnover and build a more engaged, experienced workforce.

Enhancing Retention Through Loyalty Bonuses

Employees thrive when they feel recognized, valued, and appreciated for their contributions. A well-structured loyalty bonus or stock ownership program can significantly boost morale, engagement, and retention.

Instead of relying solely on small gestures like pizza parties or happy hours, companies should consider meaningful incentives that reflect employees’ contributions to business success. For example, Publix Supermarkets is an employee-owned company where team members working over 1,000 hours receive stock valued at an average of 10% of their compensation. This model fosters loyalty, financial investment in the company’s success, and long-term retention.

While salary increases may not always be feasible, alternative recognition programs such as performance-based bonuses, stock options, and milestone celebrations can help reinforce a culture of appreciation and commitment. By implementing these strategies, companies can strengthen employee loyalty and reduce voluntary turnover.

Improving Work-Life Balance

A supportive work environment is crucial for retaining top talent and maintaining productivity. Our analysis shows that when employees experience greater satisfaction in both their personal and professional lives, their likelihood of quitting or underperforming decreases significantly.

To foster this balance, HR can implement initiatives such as:

  • Flexible Work Arrangements: Evaluate whether certain roles truly require in-office presence. Offering hybrid or remote options can improve work-life balance, especially for employees with caregiving responsibilities.

  • Commuting Considerations: Long commutes can reduce overall job satisfaction. Flexible start times or remote work days can help employees better manage family and personal responsibilities.

  • Team-Building & Recognition Events: Hosting quarterly company outings or morale-boosting activities can strengthen team cohesion and engagement.

By prioritizing employee well-being and flexibility, companies can reduce turnover, improve job satisfaction, and enhance overall performance.

Leadership Accountability and Employee Trust

Effective leadership is a key driver of employee satisfaction, performance, and retention. However, for leadership evaluations to be meaningful, employees must feel safe providing honest feedback without fear of repercussions.

To strengthen trust and accountability, HR should consider:

  • Ensuring Anonymity in Employee Surveys: Employees should feel confident that their feedback is truly anonymous, fostering more accurate and constructive insights.

  • Acting on Feedback Transparently: Rather than penalizing employees for low ratings, HR should use feedback as an opportunity to support leadership development and team collaboration.

  • Encouraging Constructive Improvement: Leadership evaluations should prioritize growth, offering managers training, mentorship, and resources to enhance their management skills.

By creating a culture of trust, accountability, and continuous improvement, companies can strengthen leadership effectiveness and employee engagement, ultimately improving overall team success.