Introduction

Attrition refers to employees leaving an organization without the organization being able to immediately fill their vacancies. High rates of employee attrition often occur when key factors, such as inadequate compensation, poor work-life balance, ineffective management, lack of career advancement opportunities, and poor working conditions, are left unaddressed

Understanding who leaves and why is crucial for improving workplace policies, reducing turnover, and minimizing the negative impact on productivity, culture, and financial performance. High attrition can lead to increased recruitment and training costs, loss of institutional knowledge, and burnout among remaining staff. Luckily, there are solutions. By effectively leveraging data-driven insights from employee surveys, exit interviews, and feedback, organizations/companies can identify root causes to employee attrition and implement targeted strategies to improve retention and create a healthier work environment.

Data Source

🔗 to Dataset: IBM HR Analytics Employee Attrition and Performance-Kaggle

This is a fictional data set created by IBM data scientists. It includes the data of nearly 1500 employees and 35 variables. The dataset contains information on employee demographics, income, job satisfaction, job information, and whether or not the employee attrited.

Click to View Dataset Variables and Definitions

#	Variable	Definition
1	Age	Age of employee
2	Gender	Employee’s gender (male/female)
3	Attrition	Whether employee left company (Yes/No)
4	BusinessTravel	Frequency of travel (Rarely, Frequently, etc.)
5	DailyRate	Daily salary rate (USD)
6	Department	Department of employee (Sales, R&D, etc.)
7	DistanceFromHome	Distance from home to work (miles)
8	Education	Level of education achieved by employee (1.Below College, 2.College, etc.)
9	EducationField	Employee’s field of study
10	EmployeeCount	N/A
11	EmployeeNumber	Unique identifier for each employee
12	EnvironmentSatisfaction	Employee satisfaction with working environment (1-4)
13	HourlyRate	Hourly salary rate (USD)
14	JobInvolvement	Level of involvement required for employee’s job (1-4)
15	JobLevel	Employee’s level of work (1-5)
16	JobRole	Employee’s role in the organization (Sales exec., Lab Tech., etc.)
17	JobSatisfaction	Satisfaction rating of the job (1-4)
18	Marital Status	Employee’s marital status (S, M, D)
19	MonthlyIncome	Employee’s monthly salary ($)
20	Monthly Rate	Monthly salary rate (USD)
21	NumCompaniesWorked	# of companies the employee worked for
22	Over 18	Whether or not the employee is 18 yrs old
23	OverTime	Whether or not the employee works overtime (yes/no)
24	PercentSalaryHike	Salary increase rate
25	PerformanceRating	Performance rating of employee (1-4)
26	RelationshipSatisfaction	Employee’s satisfaction with personal relationships (1-4)
27	StandardHours	Standard working hours for employees
28	StockOptionLevel	Employee’s stock option level
29	TotalWorkingYears	Total # of years the employee has worked
30	TrainingTimesLastYear	# of trainings the employee has taken
31	WorkLifeBalance	Employee’s perception of their work-life balance (1-4)
32	YearsAtCompany	# of years employee has been with current company
33	YearsInCurrentRole	# of years employee has been in current role
34	YearsSinceLastPromoted	# of years since last promotion
35	YearsWithCurrManager	# of years with current manager

Disclosure: This data is intended for research purposes and data practice. Models designed from this data are not designed to predict future attrition X number of months from now. The primary goal is research, data exploration, and inference, rather than developing a predictive model for future attrition. Typically, machine learning workflows involve splitting data into training and testing sets to evaluate how well a model generalizes. However, in this case, the objective is not to predict employee attrition a specific number of months from now but rather to understand the key factors associated with attrition and their relative importance.

Project Scope

This analysis covers 4 key areas:

1. Data Preparation (SQL & R)

Using SQL and R to inspect, clean, and prepare the dataset for analysis

2. Exploratory Data Analysis (SQL & R)

Perform initial EDA in SQL with the string and binary variables
Create visualizations in R to uncover relationships in the data with the numeric, binary, and string variables

3. Modeling and Analysis (R)

Logistic Regression

Applied generalized linear regression (logistic regression) to identify factors influencing employee attrition (binary outcome: stayed vs. quit)
Checked model assumptions
Constructed predictive probability plots to assess the variables individual contribution to attrition
Examined odds ratios (ORs) to quantify how each factor affects the likelihood of an employee leaving
Included demographic, job-related, and satisfaction-related variables to understand their impact

Model Evaluation

Using metrics such as accuracy, Tjur’s R², Precision, Recall, and the F1-score to evaluate how well our model performed and what can be done for future models.

4. Conclusion and Company Insights

Wrapping up the project with real-world, real-business strategies on the most influential factors from this analysis to help combat the negative impacts employee attrition can have on a company

1. Data Preparation (SQL & R)

Connect to SQL database with R

# 1.) Install & load necessary packages
library(RPostgres)  #to connect to SQL code
library(DBI)        #connecting to SQL database

# 2.) Enter database info
# connection <- dbConnect(Postgres(),
#                  dbname = "Name_Of_Your_Database",
#                  host = "yourHost",
#                  port = "####",
#                  user = "insert_user",
#                  password = "**********")

Note: I changed the name of variables by just putting underscores. Just for my own satisfaction. Over18 was removed before querying giving us 34 variables now.

Data Cleaning

Step 1: Load data into new table

To begin the analysis, I first loaded the dataset into a new SQL table. This ensures data integrity, allows for efficient querying, and provides a clean workspace for further transformations and analysis.

Drop Table if Exists hr_attrition_transformed;

CREATE TABLE hr_attrition_transformed AS  --the new table for analysis in R and EDA.
Select *
From hr_attrition;
--34 variables

Step 2: Check data for typos/errors

While it’s best practice to check every variable for inconsistencies, I’ll focus on a few key text-based variables to save time. This helps catch common issues like misspellings or inconsistent formatting.

Select Distinct(business_travel)
From hr_attrition_transformed; --3 levels. No typos.

3 records
business_travel
Non-Travel
Travel_Frequently
Travel_Rarely

Select Distinct(gender)
From hr_attrition_transformed; -- No typos

2 records
gender
Female
Male

Select Distinct(department)
From hr_attrition_transformed; --No typos

3 records
department
Human Resources
Research & Development
Sales

Step 3: Identify NULL values in ‘attrition’ and ‘employee_number’

Missing values can impact analysis and model performance. Here, I check for NULL values in the ‘attrition’ and ‘employee_number’ columns to ensure data completeness and decide how to handle any gaps.

Select COUNT(*) as null_count
From hr_attrition_transformed
Where attrition is NULL OR employee_number is NULL;

1 records
null_count
0

Step 4: Remove unnecessary columns

I chose to remove these for their own specifc reasons. Either the information seemed irrelevant or I did not think it mattered to the overall analysis. Additionally:

Redundant Information: Some columns may duplicate information already captured in other variables. Keeping both can lead to multicollinearity in regression models.
Irrelevant to the Objective: If a column does not directly relate to predicting attrition or improving insights, removing it helps streamline the analysis.
Improved Model Performance: Keeping only relevant features reduces computational cost and prevents overfitting in machine learning models.

Alter Table hr_attrition_transformed
Drop Column daily_rate, --not useful; may vary
Drop Column hourly_rate, --not useful; may vary
Drop Column monthly_rate, --not useful; may vary
Drop Column distance_from_home, --range is 1-29 miles
Drop Column standard_hours, --nonsensical; same for all employees
Drop Column employee_count; --nonsensical; every value is 1

Step 5: Creating ‘annual_income’ variable to condense income info

The data contains multiple salary-related columns. To simplify the analysis and avoid redundancy, I created a single ‘annual_income’ variable that combines relevant income information into one value. The CSV was then saved and exported to R for further cleaning to prepare for the regression analysis. (See Below)

Alter Table hr_attrition_transformed
Add Column annual_income INT;

Update hr_attrition_transformed
Set annual_income = (monthly_income * 12)/1000;

Select annual_income
From hr_attrition_transformed
Limit 3;

3 records
annual_income
41
49
223

Step 6: Loading packages & dataset in R

To begin my work in R, I load the necessary packages for data manipulation and visualization. I also import the dataset, making it ready for further processing and modeling. This step ensures that all required tools and data are available for the job ahead, including my analysis.

# Load necessary packages
library(tidyverse)     #for data wrangling and visualization
library(knitr)         #for tables
library(ggpubr)        #for plot designing
library(scales)        #for plot designing
library(corrplot)      #for correlation matrix
library(plotly)        #for interactive plots
library(reshape2)      #for plotting correlations
library(caret)         #for model evaluation and balancing
library(car)           #for checking multicollinearity
library(broom)         #for tables
library(performance)   #for model performance/evaluation
library(randomForest)  #for variable importance
library(vip)           #for graphing variable importance
library(pROC)          #for ROC curve and AUC
options(scipen = 999)

# Load in Data
hr_attrition_transformed <- read_csv("hr_attrition_transformed.csv")

Step 7: Viewing structure of dataset

str(hr_attrition_transformed)

By reviewing the structure of the dataset, I identify that a few variables need to be converted into categorical variables and checked for the correct baseline. The table below lists the variables I’ll focus on.

Click to View Variables

Variable	Baseline
Attrition	No
Business Travel	Non-travel
Department	N/A
Education	1
Education_Field	N/A
Environment Satisfaction	1
Gender	Male
Job Involvement	1
Job Level	1
Job Role	N/A
Job Satisfaction	1
Marital Status	N/A
Overtime	No
Performance Rating	1
Relationship Satisfaction	1
Stock Option Level	0
Work-Life Balance	1

Standardizing these variables correctly will enhance the model’s performance and make the results easier to interpret! ✅

Step 8: Converting numeric/character variables to categorical

In this step, I convert certain numeric/character variables into categorical variables using the as.factor() function in R. This is necessary for variables that represent categories (like job levels or satisfaction) but are currently stored as numbers. By doing so, I ensure that the model treats them as distinct categories, not continuous values, which will improve the accuracy and interpretability of my analysis.

  hr_attrition_transformed <- hr_attrition_transformed %>% 
    mutate(attrition = as.factor(attrition),
           business_travel = as.factor(business_travel),
           department = as.factor(department),
           education = as.factor(education),
           education_field = as.factor(education_field),
           environment_satisfaction = as.factor(environment_satisfaction),
           gender = as.factor(gender),
           job_involvement = as.factor(job_involvement),
           job_level = as.factor(job_level),
           job_role = as.factor(job_role),
           job_satisfaction = as.factor(job_satisfaction),
           marital_status = as.factor(marital_status),
           overtime = as.factor(overtime),
           performance_rating = as.factor(performance_rating),
           relationship_satisfaction = as.factor(relationship_satisfaction),
           stock_option_level = as.factor(stock_option_level),
           work_life_balance = as.factor(work_life_balance))

Step 9: Releveling variables

First, I inspect the current levels of the categorical variables using the levels() function. If the levels are already in the correct order, great — no changes needed! 🎉 However, if the order is incorrect, I use the factor() function to set the variables to the desired baseline or order, ensuring that the model processes them correctly.

levels(hr_attrition_transformed$overtime)

## [1] "No"  "Yes"

# releveling
hr_attrition_transformed <- hr_attrition_transformed %>% 
  mutate(business_travel = factor(business_travel, 
                                  levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently")),
         gender = factor(gender, levels = c("Male", "Female")),
         marital_status = factor(marital_status, levels = c("Single", "Married", "Divorced")))

Looks like business_travel, gender, and marital_status were the only variables that needed a little re-leveling! Now that they are in the right order, I can move forward confidently, knowing the model will interpret it correctly. ✅

ALL DONE!✅

2. Exploratory Data Analysis (SQL & R)

Attrition Rates on Key Variables (SQL)

Attrition Count

Select attrition,
COUNT(*) AS count_attrition
From hr_attrition_transformed
Group By attrition
Order By attrition;

-- NOTE: this is an imbalanced data set. Important for our model interpretation.

2 records
attrition	count_attrition
No	1233
Yes	237

Annual Income & Attrition

Select attrition, 
        MIN(annual_income) AS min_income, 
        MAX(annual_income) AS max_income,
        ROUND(AVG(annual_income), 2) AS avg_income 
From hr_attrition_transformed
Group By attrition
Order By attrition; 

--NOTE: those leaving earn about 25K less. Range looks similar, 
-- but distribution might be different. We can easier see the median in R.

2 records
attrition	min_income	max_income	avg_income
No	12	239	81.49
Yes	12	238	56.95

Business Travel & Attrition

Select business_travel, attrition,
    Count(business_travel) AS count_bus_travel,
        ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By business_travel),2) AS rate
From hr_attrition_transformed
Group By business_travel, attrition
Order By business_travel, attrition;

--NOTE: Those who travel at all have a higher rate of leaving. 
-- Dont quite know how much this plays a role in leaving, however.

6 records
business_travel	attrition	count_bus_travel	rate
Non-Travel	No	138	92.00
Non-Travel	Yes	12	8.00
Travel_Frequently	No	208	75.09
Travel_Frequently	Yes	69	24.91
Travel_Rarely	No	887	85.04
Travel_Rarely	Yes	156	14.96

Department & Attrition

Select department, attrition,
    Count(department) AS count_department,
        ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By department),2) AS rate
From hr_attrition_transformed
Group By department, attrition
Order By department, attrition;

--NOTE: In terms of total employees, sales has a large rate of attrition.

6 records
department	attrition	count_department	rate
Human Resources	No	51	80.95
Human Resources	Yes	12	19.05
Research & Development	No	828	86.16
Research & Development	Yes	133	13.84
Sales	No	354	79.37
Sales	Yes	92	20.63

Education Field & Attrition

Select education_field, attrition,
    Count(education_field) AS count_edufield,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By education_field),2) AS rate
From hr_attrition_transformed
Group By education_field, attrition
Order By education_field, attrition;

--NOTE: Those with technical degrees have the largest rate with more
-- than 100 people. Perhaps finding better jobs, higher pay for their
-- skills, etc...

Displaying records 1 - 10
education_field	attrition	count_edufield	rate
Human Resources	No	20	74.07
Human Resources	Yes	7	25.93
Life Sciences	No	517	85.31
Life Sciences	Yes	89	14.69
Marketing	No	124	77.99
Marketing	Yes	35	22.01
Medical	No	401	86.42
Medical	Yes	63	13.58
Other	No	71	86.59
Other	Yes	11	13.41

Gender & Attrition

Select gender, attrition,
    Count(gender) AS count_gender,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By gender),2) AS rate
From hr_attrition_transformed
Group By gender, attrition
Order By gender, attrition;

--NOTE: men and women share about the same rate. No glaring differences.

4 records
gender	attrition	count_gender	rate
Female	No	501	85.20
Female	Yes	87	14.80
Male	No	732	82.99
Male	Yes	150	17.01

Job Involvement & Attrition

Select job_involvement, attrition,
    Count(job_involvement) AS count_involvement,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_involvement),2) AS rate
From hr_attrition_transformed
Group By job_involvement, attrition
Order By job_involvement, attrition;

--NOTE: Not surprisingly, those who rate low job involvement have high 
--      attrition rates (33.73%). The proportion of employees is low compared
--      to the other ratings. Those who engage in their job stay and vice versa.

8 records
job_involvement	attrition	count_involvement	rate
1	No	55	66.27
1	Yes	28	33.73
2	No	304	81.07
2	Yes	71	18.93
3	No	743	85.60
3	Yes	125	14.40
4	No	131	90.97
4	Yes	13	9.03

Job Level & Attrition

Select job_level, attrition,
    Count(job_level) AS count_joblevel,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_level),2) AS rate
From hr_attrition_transformed
Group By job_level, attrition
Order By job_level, attrition;

--NOTE: Those with entry level positions (1) have drastically higher rates 
--      of attrition (26.34%) than any other position group. All other groups
--      have a 85%+ retention rate.
--      (1=Entry Level, 2=Mid Level, 3=Senior, 4=Lead, 5=Executive)

Displaying records 1 - 10
job_level	attrition	count_joblevel	rate
1	No	400	73.66
1	Yes	143	26.34
2	No	482	90.26
2	Yes	52	9.74
3	No	186	85.32
3	Yes	32	14.68
4	No	101	95.28
4	Yes	5	4.72
5	No	64	92.75
5	Yes	5	7.25

Job Role & Attrition

Select job_role, attrition,
    Count(job_role) AS count_jobrole,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By job_role),2) AS rate
From hr_attrition_transformed
Group By job_role, attrition
Order By job_role, attrition;

--NOTE: Not too sure why Sales Representatives have a high rate of attrition (40%).
--Would need to look more into how that job role plays out in careers.

Displaying records 1 - 10
job_role	attrition	count_jobrole	rate
Healthcare Representative	No	122	93.13
Healthcare Representative	Yes	9	6.87
Human Resources	No	40	76.92
Human Resources	Yes	12	23.08
Laboratory Technician	No	197	76.06
Laboratory Technician	Yes	62	23.94
Manager	No	97	95.10
Manager	Yes	5	4.90
Manufacturing Director	No	135	93.10
Manufacturing Director	Yes	10	6.90

Marital Status & Attrition

Select marital_status, attrition,
    Count(marital_status) AS count_marital_status,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By marital_status),2) AS rate
From hr_attrition_transformed
Group By marital_status, attrition
Order By marital_status, attrition;

--NOTE: single people are leaving at a higher rate, which could
-- be interpreted as maybe not having an established career or 
-- the flexibility to look for other jobs without bigger consequences.

6 records
marital_status	attrition	count_marital_status	rate
Divorced	No	294	89.91
Divorced	Yes	33	10.09
Married	No	589	87.52
Married	Yes	84	12.48
Single	No	350	74.47
Single	Yes	120	25.53

Overtime & Attrition

Select overtime, attrition,
    Count(overtime) AS count_overtime,
    ROUND((Count(*) * 100.0) / Sum(Count(*)) OVER (Partition By overtime),2) AS rate
From hr_attrition_transformed
Group By overtime, attrition
Order By overtime, attrition;

--NOTE: Working overtime looks like a big reason to why employees leave.
-- I would be curious to see how overtime is handled in these work places
-- (unpaid, low pay, bad conditions, etc..)

4 records
overtime	attrition	count_overtime	rate
No	No	944	89.56
No	Yes	110	10.44
Yes	No	289	69.47
Yes	Yes	127	30.53

Descriptive Data

hr_attrition_transformed %>%
  select(c(1:29)) %>% 
  summary()

##       age        attrition           business_travel
##  Min.   :18.00   No :1233   Non-Travel       : 150  
##  1st Qu.:30.00   Yes: 237   Travel_Rarely    :1043  
##  Median :36.00              Travel_Frequently: 277  
##  Mean   :36.92                                      
##  3rd Qu.:43.00                                      
##  Max.   :60.00                                      
##                                                     
##                   department  education         education_field
##  Human Resources       : 63   1:170     Human Resources : 27   
##  Research & Development:961   2:282     Life Sciences   :606   
##  Sales                 :446   3:572     Marketing       :159   
##                               4:398     Medical         :464   
##                               5: 48     Other           : 82   
##                                         Technical Degree:132   
##                                                                
##  employee_number  environment_satisfaction    gender    job_involvement
##  Min.   :   1.0   1:284                    Male  :882   1: 83          
##  1st Qu.: 491.2   2:287                    Female:588   2:375          
##  Median :1020.5   3:453                                 3:868          
##  Mean   :1024.9   4:446                                 4:144          
##  3rd Qu.:1555.8                                                        
##  Max.   :2068.0                                                        
##                                                                        
##  job_level                      job_role   job_satisfaction  marital_status
##  1:543     Sales Executive          :326   1:289            Single  :470   
##  2:534     Research Scientist       :292   2:280            Married :673   
##  3:218     Laboratory Technician    :259   3:442            Divorced:327   
##  4:106     Manufacturing Director   :145   4:459                           
##  5: 69     Healthcare Representative:131                                   
##            Manager                  :102                                   
##            (Other)                  :215                                   
##  monthly_income  num_companies_worked overtime   percent_salary_hike
##  Min.   : 1009   Min.   :0.000        No :1054   Min.   :11.00      
##  1st Qu.: 2911   1st Qu.:1.000        Yes: 416   1st Qu.:12.00      
##  Median : 4919   Median :2.000                   Median :14.00      
##  Mean   : 6503   Mean   :2.693                   Mean   :15.21      
##  3rd Qu.: 8379   3rd Qu.:4.000                   3rd Qu.:18.00      
##  Max.   :19999   Max.   :9.000                   Max.   :25.00      
##                                                                     
##  performance_rating relationship_satisfaction stock_option_level
##  3:1244             1:276                     0:631             
##  4: 226             2:303                     1:596             
##                     3:459                     2:158             
##                     4:432                     3: 85             
##                                                                 
##                                                                 
##                                                                 
##  total_working_years training_times_last_year work_life_balance
##  Min.   : 0.00       Min.   :0.000            1: 80            
##  1st Qu.: 6.00       1st Qu.:2.000            2:344            
##  Median :10.00       Median :3.000            3:893            
##  Mean   :11.28       Mean   :2.799            4:153            
##  3rd Qu.:15.00       3rd Qu.:3.000                             
##  Max.   :40.00       Max.   :6.000                             
##                                                                
##  yrs_at_company   yrs_in_current_role yrs_since_last_promotion
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.000          
##  1st Qu.: 3.000   1st Qu.: 2.000      1st Qu.: 0.000          
##  Median : 5.000   Median : 3.000      Median : 1.000          
##  Mean   : 7.008   Mean   : 4.229      Mean   : 2.188          
##  3rd Qu.: 9.000   3rd Qu.: 7.000      3rd Qu.: 3.000          
##  Max.   :40.000   Max.   :18.000      Max.   :15.000          
##                                                               
##  yrs_with_curr_manager annual_income   
##  Min.   : 0.000        Min.   : 12.00  
##  1st Qu.: 2.000        1st Qu.: 34.00  
##  Median : 3.000        Median : 58.50  
##  Mean   : 4.123        Mean   : 77.53  
##  3rd Qu.: 7.000        3rd Qu.:100.00  
##  Max.   :17.000        Max.   :239.00  
##

Distributions of Numeric Variables (R)

I’m not seeing anything in these distributions that might affect the analysis (e.g., outliers, anomalies). Most columns are right skewed which is what I expected due to the nature of the data.

Age

Annual Income

Companies Worked For

Percent Salary Hike

Total Working Years

Training Times Last Year

Years at Company

Years in Current Role

Years Since Last Promotion

Years with Current Manager

Other Visualizations & Insights (R)

Plot 1: Annual Income by Years at Company

# breaking down data into age groups for simplicity
# Note: Age range = 0-40
hr_attrition_transformed <- hr_attrition_transformed %>%
  mutate(YearRange = cut(yrs_at_company,
                           breaks = c(0, 5, 10, 
                                      15, 20, 25, 
                                      30, 35, 40),
                           labels = c("0-5", "05-10", 
                                      "10-15", "15-20", 
                                      "20-25", "25-30", 
                                      "30-35", "35-40"),
                           include.lowest = TRUE))

# putting data into plot dataframe
plot1df <- hr_attrition_transformed %>% 
  group_by(YearRange) %>% 
  summarise(annual_income = round(mean(annual_income, na.rm=T),0))

#Line Graph
p1 <- ggplot(plot1df, aes(x=YearRange, y=annual_income, group = 1))+
  scale_y_continuous(breaks = seq(10, 250, 20),
                     labels = label_number(suffix = "K"),
                     limits = c(10, 250))+
  geom_line(color="royalblue4")+
  geom_point(size = 2.7,
             fill = "royalblue",
             color = "black",
             shape = 23)+
  geom_text(label = "Max Income = $239,998",
            x=6.5,
            y=50000)+
  geom_text(label = "Min Income = $12,108",
            x=6.45,
            y=30000)+
  labs(title = "Relationship between annual salary and # of years at company",
       x= "Years at Company",
       y= "Annual Income ($)")+
  theme_bw()+
  theme(plot.title = element_text(hjust=0, size=12, face="bold"),
        axis.title = element_text(size=10, face="bold"),
        axis.text = element_text(size=8),
        plot.background = element_rect(fill="gray95"))

#for interactive plots
ggplotly(p1)

Comments:

At 0-15 years, the salary increases by an average of around 30K, reflecting typical career growth as employees gain experience and promotions. Between 15-30 years, there is a sharper rise in salary, suggesting that employees who stay longer with the company are receiving substantial promotions and pay raises. However, around 30-40 years, the salary tends to either drop or level off. This could be due to several reasons: employees who have been with the company for this long may be nearing retirement, which can lead to a plateau in salary; senior positions may be given to younger employees; or employees who have been with the company for many years may have reached the peak of their job opportunities, with fewer new advancement opportunities available.
Note: For scale, the y-axis was set to represent both the minimum and maximum annual income in the data.

Plot 2: Relationship between Job Satisfaction & Employee Attrition

jobsat <- hr_attrition_transformed %>% 
  #job satisfaction
  group_by(job_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(jobsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>% 
    group_by(job_satisfaction) %>%
  mutate(cumul_rate = sum(jobsat_rate)) %>% 
  ungroup() %>% 
  select(attrition, job_satisfaction, jobsat_rate, cumul_rate)

# Bar graph
p2 <- ggplot(jobsat, aes(x=job_satisfaction, y=jobsat_rate,
                    fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
    geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Job Satisfaction and Employee Attrition",
       x = "Job Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("darkseagreen4", "indianred")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p2)

Comments:

Employees with a job satisfaction rating of 3 or 4 make up the largest portion of the workforce, each accounting for around 30% of employees. This suggests that most employees are at least moderately satisfied with their jobs. The red portion representing employees who left the company remains relatively small across all satisfaction levels. However, lower satisfaction ratings (1 and 2) may still be linked to a higher risk of attrition, which would require further analysis to confirm.

Overall, attrition appears to be low across all job satisfaction groups, indicating that while job satisfaction may play a role in employee retention, it is not the sole determining factor.

Plot 3: Relationship between Work Environment Satisfaction & Employee Attrition

envirosat <- hr_attrition_transformed %>% 
  #environment satisfaction
  group_by(environment_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(envirosat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>%
      group_by(environment_satisfaction) %>%
  mutate(cumul_rate = sum(envirosat_rate)) %>% 
  ungroup() %>% 
  select(attrition, environment_satisfaction, envirosat_rate,
         cumul_rate)

# Bar graph
p3 <- ggplot(envirosat, aes(x=environment_satisfaction,
                            y=envirosat_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Work Environment Satisfaction and Employee Attrition",
       x = "Work Environment Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("darkslategray3", "darkgoldenrod")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p3)

Comments:

Similarly, employees who rate their work environment satisfaction high (3 or 4) make up the majority of employees in the company. Attrition rates remain low regardless of work environment rating (< 5%), indicating that the work environment of the employee may not be a large factor in deciding to leave a company or not. Furthermore, it could be that the company these employees work at has a positive, clean, and collaborative work environment, meaning there may be other factors (i.e., compensation, career growth) that the company is not strong in, contributing to higher attrition rates.

Plot 4: Relationship between Relationship Satisfaction & Employee Attrition

relsat <- hr_attrition_transformed %>% 
  #relationship satisfaction
  group_by(relationship_satisfaction, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(relsat_rate = round(employee_count /sum(employee_count)*100, 2)) %>% 
  ungroup() %>%
      group_by(relationship_satisfaction) %>%
  mutate(cumul_rate = sum(relsat_rate)) %>% 
  ungroup() %>% 
  select(attrition, relationship_satisfaction, relsat_rate, cumul_rate)

# Bar graph
p4 <- ggplot(relsat, aes(x=relationship_satisfaction,
                            y=relsat_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.95), size = 4) +
    scale_y_continuous(breaks = seq(0, 50, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,50)) +
  labs(title = "Relationship Satisfaction and Employee Attrition",
       x = "Relationship Satisfaction Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("skyblue3", "salmon2")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p4)

Comments:

When examining employees’ personal lives, relationship satisfaction appears to have little impact on attrition. The majority of employees (60.6%) report high relationship satisfaction, with ratings of 3 or 4. This suggests that employees with varying levels of relationship satisfaction are equally likely to stay with the company.

Plot 5: Relationship between Work-Life Balance & Employee Attrition

worklife <- hr_attrition_transformed %>% 
  #work-life balance
  group_by(work_life_balance, attrition) %>% 
  summarise(employee_count = n(), .groups = "drop") %>% 
  mutate(worklife_rate = round(employee_count /sum(employee_count)*100, 2)) %>%
  ungroup() %>%
      group_by(work_life_balance) %>%
  mutate(cumul_rate = sum(worklife_rate)) %>% 
  ungroup() %>% 
  select(attrition, work_life_balance, worklife_rate, cumul_rate)

# Bar graph
p5 <- ggplot(worklife, aes(x=work_life_balance,
                            y=worklife_rate,
                            fill=attrition)) +
  geom_col(stat = 'identity',
           position = position_dodge(),
           width = 0.8,
           color="black") +
      geom_text(aes(label = ifelse(attrition == "No",
                                 paste0(round(cumul_rate, 1), "%"),
                                 "")), 
            position = position_stack(vjust = 0.9), size = 4) +
    scale_y_continuous(breaks = seq(0, 57, 10),
                     labels = label_number(suffix = "%"),
                     limits = c(0,57)) +
  labs(title = "Work-Life Balance and Employee Attrition",
       x = "Work-Life Balance Rating",
       y = "Percentage of Employees",
       fill = "Attrition Status") +
  scale_fill_manual(values = c("lightcyan4", "tomato3")) +
  theme_bw() +
  theme(plot.title = element_text(hjust=0, face="bold", size=12),
        plot.background = element_rect(fill="gray95"),
        axis.title = element_text(hjust=0.5, face="bold", size=10),
        axis.text = element_text(size=8),
        legend.background = element_rect(fill="papayawhip"))
  
#interactive plot
ggplotly(p5)

Comments:

The majority of employees (60.8%) rate their work-life balance as a 3, followed by 23.4% rating it as a 2. Only a small percentage of employees rate their work-life balance as either extremely poor (1) or excellent (4). Interestingly, attrition rates remain relatively low across all ratings, with only a slight increase for employees who rate their work-life balance as a 3.

Plot 6: Correlation Matrix

Comments:

Performance Rating, Employee Number and Percent Salary Hike show no correlation with Attrition. Therefore, we will remove these variables and store the refined dataset in a new data frame for our predictive analysis.

Additionally, Overtime exhibits a low-to-moderate positive correlation with Attrition (0.25), suggesting that employees who work overtime are more likely to leave.

Identifying & Removing Independent Variables with High Intercorrelation

ind_corr <- corrMatrix %>%
  melt() %>% 
  filter(value > 0.7) %>% #threshold > 0.7
  filter(Var1 != Var2) #avoid looking at redundant info

ind_corr

##                     Var1                  Var2     value
## 1         monthly_income             job_level 0.9502999
## 2    total_working_years             job_level 0.7822078
## 3          annual_income             job_level 0.9501928
## 4              job_level        monthly_income 0.9502999
## 5    total_working_years        monthly_income 0.7728932
## 6          annual_income        monthly_income 0.9999869
## 7     performance_rating   percent_salary_hike 0.7735500
## 8    percent_salary_hike    performance_rating 0.7735500
## 9              job_level   total_working_years 0.7822078
## 10        monthly_income   total_working_years 0.7728932
## 11         annual_income   total_working_years 0.7728457
## 12   yrs_in_current_role        yrs_at_company 0.7587537
## 13 yrs_with_curr_manager        yrs_at_company 0.7692124
## 14        yrs_at_company   yrs_in_current_role 0.7587537
## 15 yrs_with_curr_manager   yrs_in_current_role 0.7143648
## 16        yrs_at_company yrs_with_curr_manager 0.7692124
## 17   yrs_in_current_role yrs_with_curr_manager 0.7143648
## 18             job_level         annual_income 0.9501928
## 19        monthly_income         annual_income 0.9999869
## 20   total_working_years         annual_income 0.7728457

Removing:

employee_number: irrelevant to data
job_level: highly correlated with annual_income
job_role: multicollinearity with department
monthly_income: highly correlated with annual_income; redundant
performance_rating: highly correlated with percent_salary_hike
total_working_years: redundant with annual_income
yrs_in_current_role: highly correlated with yrs_at_company
yrs_with_curr_manager: similar as above
YearRange: irrelevant to final data set

#removing selected columns from hr_attrition_transformed
HRfinaldata <- hr_attrition_transformed %>% 
  select(-c(7, 11, 12, 15, 19, 22, 26, 28, 30))

ALL DONE!✅

3. Modeling & Analysis (R)

In this section, I will walk through the modeling process to identify the key factors contributing to employee attrition within this company. I start by building the model, followed by checking its assumptions to ensure reliability and that the data fits the model well. Next, I analyze the predicted probabilities and break down the impact of demographic, job-related, and work/life satisfaction factors on employee attrition. Lastly, I bring in results from my exploratory findings and tie it in with the analysis.

What to Expect in This Section

Running the Model: fitting a logistic regression model to see what influences attrition.
Checking Model Assumptions: ensuring the model meets key statistical assumtions for valid interpretation.
Predicted Probability Plots: visualizing how different factors averaged across each other influence the likelihood of attrition
Assessing Demographic Factors: understanding the impact of age, gender, education, and other demographics.
Assessing Job-Related Factors: understanding how income, working overtime, tenure, and others contribute to attrition.
Assessing Work-Life Satisfaction: understanding the role of work-life balance, job satisfaction, and personal life on attrition.
Model Evaluation: seeing the performance of the model and pointing out flaws that helps communicate my findings to stakeholders more effectively.

Logistic Regression

I chose the logistic regression model (Generalized Linear Model or GLM) because it can handle binary outcomes such as employee attrition. This model allows me to examine how various factors influence the likelihood of attrition, providing odds ratios that are intuitive and easy to explain. Logistic regression is well-suited for both continuous variables (like age and income) and categorical variables (such as education level and department). It also doesn’t require a linear relationship between predictors and the outcome, making it flexible for the types of data in this study. While more complex models may offer better prediction accuracy, logistic regression provides an interpretable approach that balances model fit with computational efficiency.

HR_model <- glm(attrition ~ ., 
                 data = HRfinaldata,
                 family = binomial)

Checking Model Assumptions

check_model(HR_model)

✅ Model fits the data well
✅ Residuals look uniform with a few outliers
✅ The outliers are not influential
✅ Multicollinearity is less than 5
✅ Residuals exhibit a uniform distribution

Predicted Probability Plots

#Creating Lists of Predicted Probabilities averaged over all other variables
predictions <- ggeffects::ggeffect(HR_model)

What are Predicted Probabilities?

Predicted probabilities represent the likelihood of an event occurring based on the model’s analysis. In this case, the model calculates the probability that an employee will leave the company given a specific factor (e.g., age, job role, work-life balance). These probabilities are averaged over all other factors in the model, meaning that while we focus on one variable at a time in the plot, the prediction accounts for the presence of all other variables in the background. A probability close to 100% (or 1) suggests a high likelihood of attrition, while a probability close to 0% (or 0) suggests a low likelihood.

In the graphs below, the y-axis shows the predicted probability of attrition, while the x-axis represents the variable being analyzed (e.g., age or marital status). Since the probabilities are averaged over all other factors, the trends we see represent the general effect of each variable while holding everything else constant.

Age

Business Travel

Department

Education

Education Field

Environment Satisfaction

Gender

Job Involvement

Job Satisfaction

Marital Status

# of Companies Worked For

Overtime

Percent Salary Hike

Relationship Satisfaction

Stock Option Level

Training Events Last Year

Work-Life Balance

Years at Company

Years Since Last Promotion

Annual Income

Assessing Demographic Factors

Key Takeaways:

Every 1-year increase in age decreases the odds of quitting by 4%
For females, the odds of quitting are 35% lower than for males.
For each additional company the employee has worked for, the odds of quitting increase by 16%
Education, Field of Education, and Marital Status do not show a statistically significant effect on the odds of quitting. While there may be some variation, the model does not provide enough evidence to say they influence employee attrition.

Assessing Work/Life Satisfaction

Key Takeaways:

Generally, higher environment and job satisfaction and greater work-life balance decrease the odds of an employee quitting
Similarly, greater relationship satisfaction decreases the odds of quitting

Model Evaluation

The model has an accuracy rate of 88.5%. It seems quite high, right? It’s important to note that the model has trouble identifying the actual quitters (as shown by the low recall of 43% below). In summary, the model is good at predicting who WON’T leave but less effective at identifying who will.

Tjur’s R² = 0.33; A 33% indicates substantial separation between the predicted values of the No’s and Yes’s. That’s a good thing!

Precision = 75%; percentage of actual “non-quitters” correctly identified by the model.

Recall = 43%; percentage of “quitters” the model correctly identified.

F1-score = 54.7%; the model shows low accuracy in identifying attrition cases and minimizing false positives.

Confusion Matrix

Actual/Predicted	No (Predicted)	Yes (Predicted)
No (Actual)	1199 (true neg)	34 (false pos)
Yes (Actual)	135 (false neg)	102 (true pos)

4. Conclusion and Company Insights

Impact of Overtime on Employee Retention

Our analysis identified overtime as the most significant factor contributing to employee attrition. This finding highlights the need for HR to assess overtime policies and employee workload management.

Key considerations include:

Employee Well-being & Work-Life Balance: Extended work hours contribute to fatigue, stress, and diminished job satisfaction, increasing the likelihood of turnover.
Compensation & Fairness: Are employees receiving appropriate compensation for overtime work? Is overtime equitably distributed among teams?
Workload & Project Management: Are urgent deadlines driving excessive overtime? Adjusting project timelines and increasing staffing levels could mitigate burnout.

To address these concerns, HR may consider conducting employee feedback surveys, reviewing scheduling practices, and assessing hiring needs to create a more sustainable work environment. Link to Source

Reevaluating Business Travel

While business travel can provide opportunities for professional growth and networking, frequent travel is strongly associated with higher attrition rates. Our analysis indicates that employees who travel frequently are 459% more likely to leave the company compared to those who do not travel at all.

Key challenges associated with frequent travel include:

Burnout & Stress: Extended time away from home can lead to exhaustion and decreased job satisfaction.
Health & Well-being: Frequent travel has been linked to higher stress levels and potential long-term health impacts.
Work-Life Balance: Employees with caregiving responsibilities or other personal commitments may find frequent travel disruptive.

HR should work closely with managers to assess the necessity of each trip. In today’s digital environment, virtual meetings and remote collaboration tools can often replace in-person travel while still achieving business objectives. By offering more flexible travel policies, companies can enhance retention and maintain a more engaged workforce.

The Value of Promoting from Within

A strong promote-from-within culture can significantly enhance employee retention and long-term company success. Organizations that prioritize internal growth create a sense of loyalty, motivation, and career stability among employees.

For example, Publix Supermarkets has successfully implemented this approach, allowing employees to advance from entry-level roles to management. As a result, Publix maintains an annual voluntary attrition rate of just 5%, compared to the 65% industry average in retail. Link to Source

Beyond retention benefits, internal promotions often align with stock ownership programs and loyalty incentives, which I will expand on in the next section. By investing in career development and providing clear advancement pathways, companies can reduce turnover and build a more engaged, experienced workforce.

Enhancing Retention Through Loyalty Bonuses

Employees thrive when they feel recognized, valued, and appreciated for their contributions. A well-structured loyalty bonus or stock ownership program can significantly boost morale, engagement, and retention.

Instead of relying solely on small gestures like pizza parties or happy hours, companies should consider meaningful incentives that reflect employees’ contributions to business success. For example, Publix Supermarkets is an employee-owned company where team members working over 1,000 hours receive stock valued at an average of 10% of their compensation. This model fosters loyalty, financial investment in the company’s success, and long-term retention.

While salary increases may not always be feasible, alternative recognition programs such as performance-based bonuses, stock options, and milestone celebrations can help reinforce a culture of appreciation and commitment. By implementing these strategies, companies can strengthen employee loyalty and reduce voluntary turnover.

Improving Work-Life Balance

A supportive work environment is crucial for retaining top talent and maintaining productivity. Our analysis shows that when employees experience greater satisfaction in both their personal and professional lives, their likelihood of quitting or underperforming decreases significantly.

To foster this balance, HR can implement initiatives such as:

Flexible Work Arrangements: Evaluate whether certain roles truly require in-office presence. Offering hybrid or remote options can improve work-life balance, especially for employees with caregiving responsibilities.
Commuting Considerations: Long commutes can reduce overall job satisfaction. Flexible start times or remote work days can help employees better manage family and personal responsibilities.
Team-Building & Recognition Events: Hosting quarterly company outings or morale-boosting activities can strengthen team cohesion and engagement.

By prioritizing employee well-being and flexibility, companies can reduce turnover, improve job satisfaction, and enhance overall performance.

Leadership Accountability and Employee Trust

Effective leadership is a key driver of employee satisfaction, performance, and retention. However, for leadership evaluations to be meaningful, employees must feel safe providing honest feedback without fear of repercussions.

To strengthen trust and accountability, HR should consider:

Ensuring Anonymity in Employee Surveys: Employees should feel confident that their feedback is truly anonymous, fostering more accurate and constructive insights.
Acting on Feedback Transparently: Rather than penalizing employees for low ratings, HR should use feedback as an opportunity to support leadership development and team collaboration.
Encouraging Constructive Improvement: Leadership evaluations should prioritize growth, offering managers training, mentorship, and resources to enhance their management skills.

By creating a culture of trust, accountability, and continuous improvement, companies can strengthen leadership effectiveness and employee engagement, ultimately improving overall team success.

HR- Employee Attrition and Analysis

Introduction

Data Source

Project Scope

1. Data Preparation (SQL & R)

2. Exploratory Data Analysis (SQL & R)

3. Modeling and Analysis (R)

Logistic Regression

Model Evaluation

4. Conclusion and Company Insights

1. Data Preparation (SQL & R)

Connect to SQL database with R

Data Cleaning

Step 1: Load data into new table

Step 2: Check data for typos/errors

Step 3: Identify NULL values in ‘attrition’ and ‘employee_number’

Step 4: Remove unnecessary columns

Step 5: Creating ‘annual_income’ variable to condense income info

Step 6: Loading packages & dataset in R

Step 7: Viewing structure of dataset

Step 8: Converting numeric/character variables to categorical

Step 9: Releveling variables

2. Exploratory Data Analysis (SQL & R)

Attrition Rates on Key Variables (SQL)

Attrition Count

Annual Income & Attrition

Business Travel & Attrition

Department & Attrition

Education Field & Attrition

Gender & Attrition

Job Involvement & Attrition

Job Level & Attrition

Job Role & Attrition

Marital Status & Attrition

Overtime & Attrition

Descriptive Data

Distributions of Numeric Variables (R)

Age

Annual Income

Companies Worked For

Percent Salary Hike

Total Working Years

Training Times Last Year

Years at Company

Years in Current Role

Years Since Last Promotion

Years with Current Manager

Other Visualizations & Insights (R)

Plot 1: Annual Income by Years at Company

Comments:

Plot 2: Relationship between Job Satisfaction & Employee Attrition

Comments:

Plot 3: Relationship between Work Environment Satisfaction & Employee Attrition

Comments:

Plot 4: Relationship between Relationship Satisfaction & Employee Attrition

Comments:

Plot 5: Relationship between Work-Life Balance & Employee Attrition

Comments:

Plot 6: Correlation Matrix

Comments:

Identifying & Removing Independent Variables with High Intercorrelation

Removing:

3. Modeling & Analysis (R)

Logistic Regression

Checking Model Assumptions

Predicted Probability Plots

What are Predicted Probabilities?

Age

Business Travel

Department

Education

Education Field

Environment Satisfaction

Gender

Job Involvement

Job Satisfaction

Marital Status

# of Companies Worked For

Overtime

Percent Salary Hike