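Introduction

This post builds a model that predicts whether an employee will leave the company, then uses it to make predictions. It covers:

Inspecting summary statistics of the dataset
Feature engineering
Splitting the dataset
Preprocessing the dataset
Visualizing the dataset
Training a model
Tuning the model
Evaluating the model
Making predictions with the model

Inspecting the dataset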

import numpy as np
import pandas as pd

# Load the data
url = 'https://cdn.jsdelivr.net/gh/liaochenlanruo/cdn@master/data/ML/HumanResourcesAnalytics/HR_comma_sep.csv'
df = pd.read_csv(url)
#df = pd.read_csv('HR_comma_sep.csv')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   sales                  14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	left	sales	salary
0	0.38	0.53	2	157	3	1	sales	low
1	0.80	0.86	5	262	6	1	sales	medium
2	0.11	0.88	7	272	4	1	sales	medium
3	0.72	0.87	5	223	5	1	sales	low
4	0.37	0.52	2	159	3	1	sales	low

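Column descriptions

satisfaction_level: employee satisfaction level
last_evaluation: score from the employee's last evaluation
number_project: number of projects the employee has worked on
average_montly_hours: average hours worked per month
time_spend_company: years at the company
Work_accident: whether the employee has had a work accident
left: whether the employee has left
promotion_last_5years: whether the employee was promoted in the last 5 years
sales: the employee's department
salary: salary level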

# Fix the column names: correct the "montly" typo and rename "sales" to "department"
df.rename(columns={'average_montly_hours': 'average_monthly_hours', 'sales': 'department'},
          inplace=True)
df.head()

	satisfaction_level	last_evaluation	number_project	average_monthly_hours	time_spend_company	left	department	salary
0	0.38	0.53	2	157	3	1	sales	low
1	0.80	0.86	5	262	6	1	sales	medium
2	0.11	0.88	7	272	4	1	sales	medium
3	0.72	0.87	5	223	5	1	sales	low
4	0.37	0.52	2	159	3	1	sales	low

# Show summary statistics for the dataset (numeric columns only)
df.describe()

	satisfaction_level	last_evaluation	number_project	average_monthly_hours	time_spend_company	Work_accident	left	promotion_last_5years
count	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000
mean	0.612834	0.716102	3.803054	201.050337	3.498233	0.144610	0.238083	0.021268
std	0.248631	0.171169	1.232592	49.943099	1.460136	0.351719	0.425924	0.144281
min	0.090000	0.360000	2.000000	96.000000	2.000000	0.000000	0.000000	0.000000
25%	0.440000	0.560000	3.000000	156.000000	3.000000	0.000000	0.000000	0.000000
50%	0.640000	0.720000	4.000000	200.000000	3.000000	0.000000	0.000000	0.000000
75%	0.820000	0.870000	5.000000	245.000000	4.000000	0.000000	0.000000	0.000000
max	1.000000	1.000000	7.000000	310.000000	10.000000	1.000000	1.000000	1.000000

# Count how often each value occurs
print('Departments:')
print(df['department'].value_counts())
print('\nSalary:')
print(df['salary'].value_counts())

Departments:
department
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: count, dtype: int64

Salary:
salary
low       7316
medium    6446
high      1237
Name: count, dtype: int64

# Record the type and value range of each feature

'''
satisfaction_level | Satisfaction level of employee based on survey | Continuous | [0.09, 1]
last_evaluation | Score based on employee's last evaluation | Continuous | [0.36, 1]
number_project | Number of projects | Continuous | [2, 7]
average_monthly_hours | Average monthly hours | Continuous | [96, 310]
time_spend_company | Years at company | Continuous | [2, 10]
Work_accident | Whether employee had a work accident | Categorical | {0, 1}
left | Whether employee had left (Outcome Variable) | Categorical | {0, 1}
promotion_last_5years | Whether employee had a promotion in the last 5 years | Categorical | {0, 1}
department | Department employee worked in | Categorical | 10 departments
salary | Level of employee's salary | Categorical | {low, medium, high}
'''

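Feature engineering

Look for pairs of features that are strongly correlated with each other and keep only one feature from each pair.
It is also worth checking which features correlate most strongly with the label (left); in this dataset that is satisfaction_level.

# Select all numeric columns of the DataFrame
numeric_df = df.select_dtypes(include=[np.number])
# Compute the pairwise correlation coefficients between the numeric columns
numeric_df.corr()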

	satisfaction_level	last_evaluation	number_project	average_monthly_hours	time_spend_company	Work_accident	left	promotion_last_5years
satisfaction_level	1.000000	0.105021	-0.142970	-0.020048	-0.100866	0.058697	-0.388375	0.025605
last_evaluation	0.105021	1.000000	0.349333	0.339742	0.131591	-0.007104	0.006567	-0.008684
number_project	-0.142970	0.349333	1.000000	0.417211	0.196786	-0.004741	0.023787	-0.006064
average_monthly_hours	-0.020048	0.339742	0.417211	1.000000	0.127755	-0.010143	0.071287	-0.003544
time_spend_company	-0.100866	0.131591	0.196786	0.127755	1.000000	0.002120	0.144822	0.067433
Work_accident	0.058697	-0.007104	-0.004741	-0.010143	0.002120	1.000000	-0.154622	0.039245
left	-0.388375	0.006567	0.023787	0.071287	0.144822	-0.154622	1.000000	-0.061788
promotion_last_5years	0.025605	-0.008684	-0.006064	-0.003544	0.067433	0.039245	-0.061788	1.000000
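The screening rule described at the start of this section can be sketched as follows. This is a minimal illustration on a small synthetic frame rather than the HR data; the helper name `high_corr_pairs` and the 0.8 threshold are assumptions made for the example, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic stand-in data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.05, size=200),
    'c': rng.normal(size=200),
})

pairs = high_corr_pairs(demo, threshold=0.8)
print(pairs)
```

In the correlation table above, the largest off-diagonal value is 0.417 (number_project vs average_monthly_hours), so a threshold of 0.8 would not flag any pair in this dataset. The strongest correlation with left is satisfaction_level at -0.388.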

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Attrition rate by department (each bar is the mean of left); HR has the highest rate
plot = sns.catplot(x='department', y='left', kind='bar', data=df)
plot.set_xticklabels(rotation=45, horizontalalignment='right');

# Attrition rate by salary level
plot = sns.catplot(x='salary', y='left', kind='bar', data=df);

# Salary level distribution in the management department
df[df['department']=='management']['salary'].value_counts().plot(kind='pie', title='Management salary level distribution');

# Salary level distribution in the R&D department
df[df['department']=='RandD']['salary'].value_counts().plot(kind='pie', title='R&D dept salary level distribution');

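# Plot histograms of satisfaction level for two groups: employees who left and employees who stayed
# Use 21 evenly spaced bin edges from 0.0001 to 1.0001, which gives 20 bins
bins = np.linspace(0.0001, 1.0001, 21)
# Select the satisfaction data for employees who left (df['left']==1) and who stayed (df['left']==0), then draw both histograms with the shared bins, a transparency (alpha) and a label
plt.hist(df[df['left']==1]['satisfaction_level'], bins=bins, alpha=0.7, label='Employees Left')
plt.hist(df[df['left']==0]['satisfaction_level'], bins=bins, alpha=0.5, label='Employees Stayed')
plt.xlabel('satisfaction_level')
# Limit the x axis to the range 0 to 1.05
plt.xlim((0,1.05))
# Place the legend in the best available position
plt.legend(loc='best');

Employees who left tended to report low satisfaction (0 to 0.5), although some employees with fairly high satisfaction (around 0.8) also left.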

# Last evaluation
bins = np.linspace(0.3501, 1.0001, 14)
plt.hist(df[df['left']==1]['last_evaluation'], bins=bins, alpha=1, label='Employees Left')
plt.hist(df[df['left']==0]['last_evaluation'], bins=bins, alpha=0.4, label='Employees Stayed')
plt.xlabel('last_evaluation')
plt.legend(loc='best');

Many employees with high evaluation scores (0.8 to 1.0) left, possibly because they are highly capable and moved on to better opportunities.

# Number of projects 
bins = np.linspace(1.5, 7.5, 7)
plt.hist(df[df['left']==1]['number_project'], bins=bins, alpha=1, label='Employees Left')
plt.hist(df[df['left']==0]['number_project'], bins=bins, alpha=0.4, label='Employees Stayed')
plt.xlabel('number_project')
plt.grid(axis='x')
plt.legend(loc='best');

Many employees with few projects left, possibly because they had too few opportunities to develop.

# Average monthly hours
bins = np.linspace(75, 325, 11)
plt.hist(df[df['left']==1]['average_monthly_hours'], bins=bins, alpha=1, label='Employees Left')
plt.hist(df[df['left']==0]['average_monthly_hours'], bins=bins, alpha=0.4, label='Employees Stayed')
plt.xlabel('average_monthly_hours')
plt.legend(loc='best');

Employees with either low or high monthly working hours were more likely to leave.

# Years at company 
bins = np.linspace(1.5, 10.5, 10)
plt.hist(df[df['left']==1]['time_spend_company'], bins=bins, alpha=1, label='Employees Left')
plt.hist(df[df['left']==0]['time_spend_company'], bins=bins, alpha=0.4, label='Employees Stayed')
plt.xlabel('time_spend_company')
plt.xlim((1,11))
plt.grid(axis='x')
plt.xticks(np.arange(2,11))
plt.legend(loc='best');

In this histogram, the number of departures is highest at 3 years of tenure and falls off as tenure increases.

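# Attrition rate by whether the employee had a work accident
plot = sns.catplot(x='Work_accident', y='left', kind='bar', data=df);

Employees who never had a work accident left at a higher rate, which is hard to explain.

# Attrition rate by whether the employee was promoted in the last 5 years
plot = sns.catplot(x='promotion_last_5years', y='left', kind='bar', data=df);

Employees who were not promoted left at a higher rate.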
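The bar plots in this section show the mean of left per category, which for a 0/1 column is the attrition rate. The same numbers can be read off directly with a groupby. The sketch below uses a tiny made-up frame (the name `toy` and its values are illustrative only), so the printed rates are not those of the real dataset.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset
toy = pd.DataFrame({
    'salary': ['low', 'low', 'low', 'low', 'medium', 'medium', 'high', 'high'],
    'left':   [1,     1,     0,     0,     1,        0,        0,      0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the attrition rate
rates = toy.groupby('salary')['left'].agg(['mean', 'count'])
rates = rates.rename(columns={'mean': 'attrition_rate', 'count': 'n_employees'})
print(rates.sort_values('attrition_rate', ascending=False))
```

Keeping the group size next to the rate helps when a category has few members, since a rate computed from a handful of rows is noisy.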

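Data preprocessing

Replacing categorical data with one-hot encoding

# Drop the label column (left)
X = df.drop('left', axis=1)
# Extract the label column
y = df['left']
# Drop the department and salary columns; their information is added back below via one-hot encoding
X.drop(['department','salary'], axis=1, inplace=True)

# One-hot encoding
# One-hot encode the salary level
salary_dummy = pd.get_dummies(df['salary'])
# One-hot encode the department
department_dummy = pd.get_dummies(df['department'])
X = pd.concat([X, salary_dummy], axis=1)
X = pd.concat([X, department_dummy], axis=1)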
X.head()

	satisfaction_level	last_evaluation	number_project	average_monthly_hours	time_spend_company	high	low	medium	IT	RandD	accounting	hr	management	marketing	product_mng	sales	support	technical
0	0.38	0.53	2	157	3	False	True	False	False	False	False	False	False	False	False	True	False	False
1	0.80	0.86	5	262	6	False	False	True	False	False	False	False	False	False	False	True	False	False
2	0.11	0.88	7	272	4	False	False	True	False	False	False	False	False	False	False	True	False	False
3	0.72	0.87	5	223	5	False	True	False	False	False	False	False	False	False	False	True	False	False
4	0.37	0.52	2	159	3	False	True	False	False	False	False	False	False	False	False	True	False	False
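As an alternative to dropping the columns and concatenating the dummies by hand, `pd.get_dummies` can encode selected columns of a frame in a single call. The sketch below is one possible variant on a made-up three-row frame; the prefixes `dept` and `salary` are assumptions chosen for the example, and they avoid bare column names such as `high` or `low`.

```python
import pandas as pd

# Made-up rows using the renamed column names from this post
toy = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11],
    'department': ['sales', 'hr', 'sales'],
    'salary': ['low', 'medium', 'high'],
})

# Encode both categorical columns at once; untouched columns are kept as they are
encoded = pd.get_dummies(toy, columns=['department', 'salary'],
                         prefix=['dept', 'salary'])
print(encoded.columns.tolist())
```

The encoded columns replace the originals, so no separate drop step is needed.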

Splitting into training and test sets

# Split into training and test sets (70%/30%)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
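The split above is random on every run. If reproducible results and a matching class balance in both parts are wanted, `train_test_split` accepts `random_state` and `stratify`. The following is a sketch on made-up arrays; the seed 42 and the names `X_toy` and `y_toy` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with 24% positives, close to the dataset's share of leavers
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 24 + [0] * 76)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is rare, because a purely random split can leave the test set with too few positive examples.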

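Standardizing the data

Features with large numeric values can be treated as more important by scale-sensitive algorithms, which distorts the results.
Large differences in scale between features also slow down model convergence.
The data should therefore be standardized.

# Standardization, shown here on a small example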
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_example = np.array([[ 10., -2.,  23.],
                      [ 5.,  32.,  211.],
                      [ 10.,  1., -130.]])
X_example = stdsc.fit_transform(X_example)
X_example = pd.DataFrame(X_example)
print (X_example)
X_example.describe()

          0         1         2
0  0.707107 -0.802454 -0.083658
1 -1.414214  1.409716  1.264429
2  0.707107 -0.607262 -1.180771

	0	1	2
count	3.000000e+00	3.000000e+00	3.000000e+00
mean	-2.960595e-16	-1.110223e-16	7.401487e-17
std	1.224745e+00	1.224745e+00	1.224745e+00
min	-1.414214e+00	-8.024539e-01	-1.180771e+00
25%	-3.535534e-01	-7.048582e-01	-6.322145e-01
50%	7.071068e-01	-6.072624e-01	-8.365788e-02
75%	7.071068e-01	4.012270e-01	5.903856e-01
max	7.071068e-01	1.409716e+00	1.264429e+00
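Standardization replaces each value with z = (x - mean) / std, computed per column. The sketch below reproduces the example by hand to show that `StandardScaler` divides by the population standard deviation (ddof=0). The variable names `X_demo`, `manual` and `scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[10., -2., 23.],
                   [5., 32., 211.],
                   [10., 1., -130.]])

# Manual z-score using the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)
scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))   # the two results agree
print(scaled.std(axis=0, ddof=0))    # population std of the scaled columns
print(scaled.std(axis=0, ddof=1))    # sample std, which is what describe() reports
```

This also explains the std row in the describe() output above: pandas reports the sample standard deviation (ddof=1), which for three rows is sqrt(3/2), about 1.2247, even though the columns have unit population standard deviation.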

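# Standardize the training and test sets: fit the scaler on the training set only, then reuse it for the test set
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# Fit the scaler on the training features and transform them
X_train_std = stdsc.fit_transform(X_train)
print(X_train_std[0])
# Transform the test features with the statistics learned from the training set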
X_test_std = stdsc.transform(X_test)

[ 1.40697692 -0.21068428 -0.65422416 -1.37529896 -1.02172591 -0.41080801
 -0.14595719 -0.30564365 -0.98084819  1.16499228 -0.2981308  -0.23781569
 -0.22665375 -0.23057496 -0.21332806 -0.24641294 -0.25073288  1.62416352
 -0.41712208 -0.47247431]

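Building the model

Random forest

# Cross validation
from sklearn.model_selection import ShuffleSplit

# 20 random shuffle-and-split iterations, each holding out 30% for validation (this is not a 20-fold partition)
cv = ShuffleSplit(n_splits=20, test_size=0.3)

# Build the random forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf_model = RandomForestClassifier()

# Parameter grid for the RF model: the number of trees
rf_param = {'n_estimators': range(1,11)}

# Search for the best parameter value (the best number of trees)
rf_grid = GridSearchCV(rf_model, rf_param, cv=cv)
rf_grid.fit(X_train, y_train)

# Print the best parameter and the best score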
print('Parameter with best score:')
print(rf_grid.best_params_)
print('Cross validation score:', rf_grid.best_score_)

Parameter with best score:
{'n_estimators': 9}
Cross validation score: 0.9835079365079364
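`ShuffleSplit` and k-fold cross validation are easy to confuse. The sketch below counts how often each sample lands in a validation set under both schemes, using ten made-up rows; `X_toy` and the counters are names introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_toy = np.arange(20).reshape(10, 2)

# ShuffleSplit: each split is an independent random draw, so a sample may
# land in several validation sets or in none
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
ss_val_counts = np.zeros(len(X_toy), dtype=int)
for _, val_idx in ss.split(X_toy):
    ss_val_counts[val_idx] += 1

# KFold: the folds partition the data, so every sample is validated exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kf_val_counts = np.zeros(len(X_toy), dtype=int)
for _, val_idx in kf.split(X_toy):
    kf_val_counts[val_idx] += 1

print('ShuffleSplit validation counts:', ss_val_counts)
print('KFold validation counts:       ', kf_val_counts)
```

Either scheme can be passed to `GridSearchCV` through its `cv` argument. With `ShuffleSplit`, the number of iterations and the validation fraction are chosen independently of each other.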

# Evaluate the model on the test set
best_rf = rf_grid.best_estimator_
print('Test score:', best_rf.score(X_test, y_test))

Test score: 0.9884444444444445
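Accuracy alone can hide how well the minority class (employees who left) is predicted. The sketch below shows one way to look closer, with a confusion matrix, a per-class report and predicted probabilities. It runs on synthetic data from `make_classification`, so every number it prints is illustrative and says nothing about the HR dataset; the names `clf`, `X_syn` and `y_syn` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR data: roughly 24% positives ("left" = 1)
X_syn, y_syn = make_classification(n_samples=1000, n_features=8, n_informative=4,
                                   weights=[0.76, 0.24], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3,
                                          stratify=y_syn, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred, target_names=['stayed', 'left']))

# Predicted probability of leaving for the first five test rows
proba_left = clf.predict_proba(X_te[:5])[:, 1]
print(proba_left)
```

On the real data, the same calls would be made with `best_rf`, `X_test` and `y_test` in place of the synthetic names.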

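# Inspect feature importances from the random forest.
# scikit-learn's feature_importances_ is impurity-based: it averages, over all trees, how much the splits on each feature reduce impurity.
# Shuffling one feature at a time (or adding noise) and measuring the change in prediction error is a different measure, called permutation importance.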
features = X.columns
feature_importances = best_rf.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)

features_df

	Features	Importance Score
0	satisfaction_level	0.260366
3	average_monthly_hours	0.186585
2	number_project	0.179788
4	time_spend_company	0.179571
1	last_evaluation	0.144083
5	Work_accident	0.011949
8	low	0.006395
7	high	0.005206
9	medium	0.003336
17	sales	0.003200
18	support	0.003070
19	technical	0.003039
11	RandD	0.002143
10	IT	0.002048
12	accounting	0.001887
6	promotion_last_5years	0.001799
14	management	0.001755
13	hr	0.001425
16	product_mng	0.001182
15	marketing	0.001173

# Sum the importance scores of the top five features
features_df['Importance Score'][:5].sum()

np.float64(0.9503925098929926)
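The idea of shuffling one feature and watching the error change is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below compares it with the impurity-based `feature_importances_` on synthetic data in which only the first feature carries signal; the data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on feature 0 only; features 1 and 2 are noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(600, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances, computed from the training data during fitting
print('impurity-based:', clf.feature_importances_)

# Permutation importances: mean drop in test accuracy when one column is shuffled
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print('permutation:   ', perm.importances_mean)
```

Permutation importance is measured on held-out data, so it reflects what the model actually relies on when predicting. Impurity-based scores are cheaper but are computed from the training data.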

Cluster-based analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv(url)

plt.figure(figsize = (8,8))
plt.subplot(1,2,1)
plt.plot(data.satisfaction_level[data.left == 1],data.last_evaluation[data.left == 1],'o', alpha = 0.1)
plt.ylabel('Last Evaluation')
plt.title('Employees who left')
plt.xlabel('Satisfaction level')

plt.subplot(1,2,2)
plt.title('Employees who stayed')
plt.plot(data.satisfaction_level[data.left == 0],data.last_evaluation[data.left == 0],'o', alpha = 0.1)
plt.xlim([0.4,1])
plt.ylabel('Last Evaluation')
plt.xlabel('Satisfaction level')

Text(0.5, 0, 'Satisfaction level')

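# Cluster analysis with KMeans
# Import the KMeans clustering algorithm
from sklearn.cluster import KMeans

# Select the employees who have left (left == 1) and drop every column except satisfaction_level and last_evaluation
# axis=1 drops columns: project count, average monthly hours, years at the company, work accident, left, promotion in the last 5 years, department (named sales in the raw file) and salary
kmeans_df = data[data.left == 1].drop(['number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'left', 'promotion_last_5years', 'sales', 'salary'], axis=1)

# Cluster the remaining two columns with KMeans, using 3 clusters and random seed 0 for reproducibility
# fit trains the model so that it learns the cluster structure of the data
kmeans = KMeans(n_clusters = 3, random_state = 0).fit(kmeans_df)

# Show the coordinates of each cluster centre
kmeans.cluster_centers_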

array([[0.41014545, 0.51698182],
       [0.80851586, 0.91170931],
       [0.11115466, 0.86930085]])

# Select the employees who left
left = data[data.left == 1]

# Use a boolean mask with .loc to assign the KMeans cluster labels to the rows of employees who left
left_labels = (data.left == 1)
data.loc[left_labels, 'label'] = kmeans.labels_

# Re-select the employees who left, now with their cluster labels
left = data[data.left == 1]

# Create a new figure
plt.figure()

# Label the x axis with the satisfaction level
plt.xlabel('Satisfaction Level')

# Label the y axis with the last evaluation score
plt.ylabel('Last Evaluation')

# Set the figure title
plt.title('3 Clusters of employees who left')

# Plot satisfaction level against last evaluation for each cluster
plt.plot(left.satisfaction_level[left.label==0], left.last_evaluation[left.label==0], 'o', alpha=0.2, color='r')
plt.plot(left.satisfaction_level[left.label==1], left.last_evaluation[left.label==1], 'o', alpha=0.2, color='g')
plt.plot(left.satisfaction_level[left.label==2], left.last_evaluation[left.label==2], 'o', alpha=0.2, color='b')

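# Add a legend naming each cluster, with its position and font size. The names must follow the label order 0, 1, 2.
# With the cluster centres printed above, cluster 0 (0.41, 0.52) is 'Bad Match', cluster 1 (0.81, 0.91) is 'Winners' and cluster 2 (0.11, 0.87) is 'Frustrated'
plt.legend(['Bad Match', 'Winners', 'Frustrated'], loc=3, fontsize=15, frameon=True);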
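KMeans numbers its clusters arbitrarily, so a legend written in a fixed order can attach the wrong name to a cluster when the code is rerun with another seed or library version. One way to guard against this is to derive each name from the cluster centre. The sketch below uses synthetic points, and the thresholds in the hypothetical helper `name_cluster` (0.6 for satisfaction, 0.7 for evaluation) are assumptions chosen for this illustration, not values from the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (satisfaction, evaluation) points around three made-up centres
rng = np.random.default_rng(0)
centres = np.array([[0.41, 0.52], [0.81, 0.91], [0.11, 0.87]])
points = np.vstack([c + rng.normal(scale=0.03, size=(100, 2)) for c in centres])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

def name_cluster(centre):
    """Hypothetical rule of thumb mapping a cluster centre to a group name."""
    satisfaction, evaluation = centre
    if satisfaction >= 0.6:
        return 'Winners'
    return 'Frustrated' if evaluation >= 0.7 else 'Bad Match'

# Map each label number to a name based on where its centre lies
names = {label: name_cluster(c) for label, c in enumerate(km.cluster_centers_)}
print(names)
```

The resulting dictionary can then supply the legend text for each label, for example `names[0]` for the points plotted with `label==0`.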

Notice: if you use the scripts in this article, please cite the link to this page and respect the author's work. Thank you!
