Exploratory Analysis and Visualization

Exploratory Analysis

Import data

In [1]:

import pandas as pd
import scipy.stats as ss
import numpy as np

In [2]:

df = pd.read_csv("./Data/HR_comma_sep.csv")
df.head(10)

Out[2]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	left	Department	salary
0	0.38	0.53	2	157	3	1	sales	low
1	0.80	0.86	5	262	6	1	sales	medium
2	0.11	0.88	7	272	4	1	sales	medium
3	0.72	0.87	5	223	5	1	sales	low
4	0.37	0.52	2	159	3	1	sales	low
5	0.41	0.50	2	153	3	1	sales	low
6	0.10	0.77	6	247	4	1	sales	low
7	0.92	0.85	5	259	5	1	sales	low
8	0.89	1.00	5	224	5	1	sales	low
9	0.42	0.53	2	142	3	1	sales	low

Trend Analysis

Mean, Median, Quantile, Mode
Standard deviation, Variance
Skewness and kurtosis, Normal distribution (kurtosis = 3, i.e. excess kurtosis = 0) and 3 important distributions
- Chi-squared distribution: $Q = \sum_{i=1}^{k}Z_i^2$
  - Often used in hypothesis testing and confidence interval estimation, particularly in goodness-of-fit tests, tests of independence in contingency tables, and variance estimation in ANOVA.
  - Q is a Chi-squared distributed variable.
  - $Z_i$ are independent standard normal variables.
  - k is the degrees of freedom, corresponding to the number of $Z_i^2$ terms being summed.
- t distribution: $t = \frac{Z}{\sqrt{Q/k}}$
  - It arises when estimating the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown.
  - Z is a standard normal variable.
  - Q is a Chi-squared variable with k degrees of freedom, independent of Z.
  - k is the degrees of freedom, usually n - 1 in the context of a sample of size n.
- F distribution: $F= \frac{Q_1/d_1}{Q_2/d_2}$
  - It is most commonly used in the Analysis of Variance (ANOVA), especially for comparing two variances through their ratio to judge whether they differ significantly.
  - $Q_1$ and $Q_2$ are independent chi-squared distributed variables with $d_1$ and $d_2$ degrees of freedom, respectively.
  - F is an F-distributed variable with $d_1$ and $d_2$ degrees of freedom.
Sampling
- Error
  - Error in sampling with replacement: $u_x=\sqrt{\sigma^2/n}$
  - Error in sampling without replacement: $u_x=\sqrt{\frac{\sigma^2}{n} (\frac{N-n}{N-1})}$
  - N is population size, n is sample size
- Sample Size
  - Proper Sample Size with replacement: $n=\frac{Z_{\alpha/2}^2\sigma^2}{E^2}$
    - $Z_{\alpha/2}$ is the z-score corresponding to the desired confidence level.
    - $\sigma$ is the standard deviation of the population.
    - E is the desired margin of error.
    - A 95% confidence level corresponds to $Z_{\alpha/2} = 1.96$, i.e. roughly 2 $\sigma$ from $\mu$
  - Example: You are conducting a survey to estimate the average height of adult males in a city. You want to be confident that your sample mean is close to the true population mean, so you decide to use the sample size formula to determine how many individuals you should include in your sample.
    - Desired Confidence Level: 95%
    - Population Standard Deviation ($\sigma$): Assume it's 4 inches
    - Margin of Error ($E$): ±0.5 inches of the true mean
    - $n=\frac{Z_{\alpha/2}^2\sigma^2}{E^2} = (\frac{1.96 * 4}{0.5})^2=15.68^2=245.86$, rounded up to 246
  - Proper Sample Size without replacement: $n=\frac{N\frac{Z_{\alpha/2}^2*\sigma^2}{E^2}}{N+\frac{Z_{\alpha/2}^2*\sigma^2}{E^2}-1}$
    - N is the population size, which must be known in advance.
    - $Z_{\alpha/2}$ is the z-score corresponding to the desired confidence level.
    - $\sigma$ is the standard deviation of the population.
    - E is the desired margin of error.
  - Repeating the same example, now knowing that there are 10,000 adult males in the city:
    - $n=\frac{N\frac{Z_{\alpha/2}^2*\sigma^2}{E^2}}{N+\frac{Z_{\alpha/2}^2*\sigma^2}{E^2}-1}= \frac{10000*\frac{1.96^2*4^2}{0.5^2}}{10000+\frac{1.96^2*4^2}{0.5^2}-1}\approx 239.99$, rounded up to 240
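The two sample size formulas can be checked numerically. The following is a minimal sketch of that arithmetic; the function names are hypothetical helpers introduced only for illustration, and the inputs are the ones from the height example (z = 1.96, sigma = 4, E = 0.5, N = 10,000).

```python
import math


def sample_size_with_replacement(z, sigma, margin):
    """n0 = z^2 * sigma^2 / E^2 (sampling with replacement)."""
    return (z * sigma / margin) ** 2


def sample_size_without_replacement(z, sigma, margin, population):
    """Finite-population version: n = N * n0 / (N + n0 - 1)."""
    n0 = sample_size_with_replacement(z, sigma, margin)
    return population * n0 / (population + n0 - 1)


z, sigma, margin = 1.96, 4.0, 0.5
n_inf = sample_size_with_replacement(z, sigma, margin)
n_fin = sample_size_without_replacement(z, sigma, margin, 10_000)

# A sample size must be a whole number, so round up.
print(round(n_inf, 2), math.ceil(n_inf))
print(round(n_fin, 2), math.ceil(n_fin))
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the requested value.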

In [3]:

df.describe()

Out[3]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years
count	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000
mean	0.612834	0.716102	3.803054	201.050337	3.498233	0.144610	0.238083	0.021268
std	0.248631	0.171169	1.232592	49.943099	1.460136	0.351719	0.425924	0.144281
min	0.090000	0.360000	2.000000	96.000000	2.000000	0.000000	0.000000	0.000000
25%	0.440000	0.560000	3.000000	156.000000	3.000000	0.000000	0.000000	0.000000
50%	0.640000	0.720000	4.000000	200.000000	3.000000	0.000000	0.000000	0.000000
75%	0.820000	0.870000	5.000000	245.000000	4.000000	0.000000	0.000000	0.000000
max	1.000000	1.000000	7.000000	310.000000	10.000000	1.000000	1.000000	1.000000

In [4]:

df["satisfaction_level"].skew(), df["satisfaction_level"].kurt()

Out[4]:

(-0.4763603412839644, -0.6708586220574557)

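MVSK: Mean, Variance, Skewness, Kurtosis. (Moments analysis) The kurtosis reported here, like the one returned by pandas kurt(), is excess kurtosis, so a normal distribution gives 0 rather than 3.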

In [5]:

ss.norm.stats(moments = "mvsk")

Out[5]:

(0.0, 1.0, 0.0, 0.0)
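To make the two kurtosis conventions concrete, here is a small sketch that compares them on simulated data. The seed and sample size are arbitrary choices for illustration, not values taken from the notebook.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
x = rng.normal(size=100_000)         # simulated standard normal sample

# Fisher definition (excess kurtosis): close to 0 for normal data
excess = ss.kurtosis(x, fisher=True)
# Pearson definition: close to 3 for normal data
pearson = ss.kurtosis(x, fisher=False)

print(excess, pearson, pearson - excess)
```

The two values always differ by exactly 3, which is why a reported kurtosis near 0 and the textbook statement "kurtosis = 3" describe the same normal shape.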

Distribution Analysis

t distribution : ss.t
Normal distribution : ss.norm
Chi-squared distribution : ss.chi2
F distribution : ss.f
PDF (Probability Density Function)
CDF (Cumulative Distribution Function)
PPF (Percent Point Function): inverse of the Cumulative Distribution Function (CDF)
RVS (Random Variates): draws random samples from the distribution

In [6]:

ss.norm.pdf(0.0), ss.norm.ppf(0.9), ss.norm.cdf(1.96)

Out[6]:

(0.3989422804014327, 1.2815515655446004, 0.9750021048517795)
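The relationship between cdf and ppf, and the way the t, chi-squared and F distributions connect to the normal, can be illustrated with a short sketch. The choice of 9 degrees of freedom is an arbitrary assumption for the example.

```python
import scipy.stats as ss

x = 1.96
p = ss.norm.cdf(x)            # P(Z <= 1.96)
x_back = ss.norm.ppf(p)       # ppf inverts cdf and recovers 1.96
z_95 = ss.norm.ppf(0.975)     # two-sided 95% critical value of Z

# Distributions with degrees of freedom take them as extra arguments.
t_crit = ss.t.ppf(0.975, df=9)
chi2_crit = ss.chi2.ppf(0.95, df=1)
f_crit = ss.f.ppf(0.95, dfn=1, dfd=9)

print(x_back, z_95)
print(chi2_crit, z_95 ** 2)   # chi2 with 1 df is the square of Z
print(f_crit, t_crit ** 2)    # F(1, k) is the square of t with k df
```

The last two lines print matching pairs, which follows directly from the definitions of Q, t and F in terms of standard normal variables.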

In [7]:

ss.norm.rvs(size = 10)

Out[7]:

array([-1.12758167,  0.61571217, -2.00672979, -0.42896313,  0.47855222,
        1.02556954, -1.53740147, -0.13296752, -0.34649074,  1.03586313])
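Because rvs draws fresh random numbers on every call, the array above will differ between runs. A minimal sketch of making the draw repeatable, assuming a seed of 42 chosen only for illustration:

```python
import numpy as np
import scipy.stats as ss

# The same random_state gives the same draws.
a = ss.norm.rvs(size=5, random_state=42)
b = ss.norm.rvs(size=5, random_state=42)

# loc and scale shift and stretch the standard normal draws.
c = ss.norm.rvs(loc=10, scale=2, size=5, random_state=42)

print(np.array_equal(a, b))
print(np.allclose(c, 10 + 2 * a))
```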

Sampling

In [8]:

df.sample(n = 10)

Out[8]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years	Department	salary
9508	0.57	0.76	2	176	3	0	0	0	technical	low
3003	0.75	0.66	5	177	2	0	0	0	sales	low
10501	0.32	0.40	2	132	3	0	0	0	technical	low
4032	0.63	0.76	4	217	2	1	0	0	IT	medium
12722	0.44	0.47	2	130	3	0	1	0	technical	low
12685	0.79	0.84	4	240	5	0	1	0	sales	medium
4288	0.53	0.72	3	228	3	0	0	0	sales	medium
2255	0.87	0.74	4	190	4	0	0	0	technical	medium
14846	0.39	0.57	2	127	3	0	1	0	sales	low
10778	0.92	0.98	3	257	3	0	0	1	sales	medium

In [9]:

df.sample(frac=0.001)

Out[9]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	Department	salary
14891	0.85	0.87	5	246	5	1	1	sales	medium
2153	0.80	0.83	3	163	3	0	0	sales	low
13064	0.58	0.38	4	203	5	0	0	sales	low
3405	0.69	0.96	3	210	3	0	0	support	low
2483	0.94	0.78	3	184	3	1	0	technical	medium
5456	0.56	0.68	3	269	3	1	0	technical	low
9596	0.79	0.55	5	242	2	0	0	support	low
12789	0.61	0.96	3	247	3	0	0	support	low
14654	0.81	0.98	5	245	5	0	1	IT	low
5068	0.76	0.50	4	245	3	0	0	hr	low
12806	0.55	0.93	5	196	3	0	0	IT	medium
11129	0.80	0.90	4	211	8	0	0	accounting	medium
12473	0.42	0.56	2	149	3	0	1	sales	low
2633	0.78	0.59	3	212	2	0	0	technical	low
942	0.11	0.89	6	301	4	0	1	accounting	low
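The two calls above sample without replacement and return different rows on each run. The sketch below shows, on a small made-up frame (the `toy` data and its column names are assumptions, not part of the HR dataset), how `random_state`, `replace` and grouped sampling change that behaviour.

```python
import pandas as pd

# Hypothetical toy data standing in for a real table
toy = pd.DataFrame({"id": range(100), "group": ["a", "b"] * 50})

# Repeatable sample without replacement: each row appears at most once
s1 = toy.sample(n=10, random_state=0)
s2 = toy.sample(n=10, random_state=0)

# Sampling with replacement (bootstrap-style): rows may repeat
boot = toy.sample(frac=1.0, replace=True, random_state=0)

# Stratified sample: the same number of rows from every group
strat = toy.groupby("group").sample(n=5, random_state=0)

print(s1.equals(s2), s1["id"].is_unique, len(boot))
print(strat["group"].value_counts().to_dict())
```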

Single Variable analysis

Outlier Analysis

Purpose: Identify data points that are significantly different from the majority of the data. Outliers can skew and mislead the training process of machine learning models resulting in longer training times, less accurate models, and ultimately poorer results.
Methods: Box plots, scatter plots, Z-score, IQR method.
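As a rough illustration of the IQR and Z-score methods named above, the sketch below flags outliers in a small made-up series. The data, the 1.5 x IQR fences and the |z| > 3 cutoff are common conventions assumed here, not values prescribed by this notebook.

```python
import pandas as pd

# Hypothetical data: 19 ordinary values and one extreme value
s = pd.Series([10, 11, 12, 13, 12, 11, 10, 12, 13, 11,
               12, 11, 12, 13, 10, 11, 12, 12, 11, 50])

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score method: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_mask = z.abs() > 3

print(s[iqr_mask].tolist())
print(s[z_mask].tolist())
```

The Z-score method uses the mean and standard deviation, which are themselves pulled by outliers, so the IQR method is often the more robust of the two on small samples.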

Comparison Analysis

Purpose: Understand the variable's characteristics compared to other variables or its behavior across different subgroups within the dataset.
Methods: Descriptive statistics (mean, median, mode), visualizations like histograms or bar charts for categorical data.

Structure Analysis

Purpose: Understand the type, category, and general composition of the data variable.
Components:
- Data Type Identification: Recognizing if the variable is nominal, ordinal, interval, or ratio.
- Missing Values Identification: Assessing the amount and pattern of missing data.
- Zero Variance Identification: Detecting if the variable has a single constant value or limited variance, which might be uninformative for certain analyses.
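The three structure checks can be sketched on a small made-up frame. The `toy` DataFrame and its columns are hypothetical; note that pandas dtypes describe storage, so the measurement scale (for example ordinal) has to be declared explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value and a constant column
toy = pd.DataFrame({
    "score": [0.4, 0.7, np.nan, 0.9],
    "level": ["low", "high", "low", "medium"],
    "constant": [1, 1, 1, 1],
})

# Data type identification: declare "level" as an ordered (ordinal) category
toy["level"] = pd.Categorical(
    toy["level"], categories=["low", "medium", "high"], ordered=True
)

# Missing values identification: share of missing entries per column
missing_share = toy.isnull().mean()

# Zero variance identification: columns with at most one distinct value
zero_variance = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

print(toy.dtypes)
print(missing_share)
print(zero_variance, toy["level"].max())
```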

Distribution Analysis

Purpose: Understand the distribution and frequency of data points.
Methods
- Graphical Representations: Histograms, density plots, and Q-Q plots for visual inspection of distribution.
- Statistical Tests: Kolmogorov-Smirnov test, Shapiro-Wilk test for normality testing.
- Summary Statistics: Skewness, kurtosis to understand the shape of distribution.
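The normality tests listed above are available in scipy.stats. This is a minimal sketch on simulated data (the seed, sample size and the two example samples are assumptions for illustration); it standardizes each sample before the Kolmogorov-Smirnov test, which makes the resulting p-value only approximate.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
samples = {
    "normal_like": rng.normal(size=500),
    "skewed": rng.exponential(size=500),
}

results = {}
for name, x in samples.items():
    # Shapiro-Wilk tests normality directly on the raw sample
    sw_stat, sw_p = ss.shapiro(x)
    # K-S compares against the standard normal, so standardize first
    # (parameters estimated from the data make this p-value approximate)
    z = (x - x.mean()) / x.std(ddof=1)
    ks_stat, ks_p = ss.kstest(z, "norm")
    results[name] = {"shapiro_p": sw_p, "ks_p": ks_p, "skew": ss.skew(x)}

for name, r in results.items():
    print(name, r)
```

A small p-value is evidence against normality; a large one only means the test found no clear departure, so it is best read together with a histogram or Q-Q plot.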

In [10]:

### SatisfactionLevel Analysis (check for missing values)

sl_s = df["satisfaction_level"]
sl_s.isnull().sum()
df[df["satisfaction_level"].isnull()]

Out[10]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years	Department	salary

In [11]:

sl_s.describe()

Out[11]:

count    14999.000000
mean         0.612834
std          0.248631
min          0.090000
25%          0.440000
50%          0.640000
75%          0.820000
max          1.000000
Name: satisfaction_level, dtype: float64

In [12]:

sl_s.skew(),sl_s.kurt()

Out[12]:

(-0.4763603412839644, -0.6708586220574557)

In [13]:

np.histogram(sl_s.values, bins= np.arange(0,1.1,0.1))

Out[13]:

(array([ 195, 1214,  532,  974, 1668, 2146, 1972, 2074, 2220, 2004],
       dtype=int64),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

In [14]:

### LastEvaluation Analysis

le_s = df["last_evaluation"]
le_s.isnull().sum()

Out[14]:

In [15]:

le_s.describe()

Out[15]:

count    14999.000000
mean         0.716102
std          0.171169
min          0.360000
25%          0.560000
50%          0.720000
75%          0.870000
max          1.000000
Name: last_evaluation, dtype: float64

In [16]:

np.histogram(le_s.values, bins= np.arange(0,1.1,0.1))

Out[16]:

(array([   0,    0,    0,  179, 1389, 3395, 2234, 2062, 2752, 2988],
       dtype=int64),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

In [17]:

### NumberProject Analysis

np_s = df["number_project"]
np_s[np_s.isnull()]

np_s.describe()

Out[17]:

count    14999.000000
mean         3.803054
std          1.232592
min          2.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          7.000000
Name: number_project, dtype: float64

In [18]:

np_s.skew(),np_s.kurt()

Out[18]:

(0.3377056123598222, -0.4954779519008947)

In [19]:

np_s.value_counts(normalize=True).sort_index()

Out[19]:

number_project
2    0.159211
3    0.270351
4    0.291019
5    0.184079
6    0.078272
7    0.017068
Name: proportion, dtype: float64

In [20]:

### AverageMonthlyHours Analysis
amh_s = df["average_montly_hours"]
amh_s.describe()

Out[20]:

count    14999.000000
mean       201.050337
std         49.943099
min         96.000000
25%        156.000000
50%        200.000000
75%        245.000000
max        310.000000
Name: average_montly_hours, dtype: float64

In [21]:

amh_s.skew(),amh_s.kurt()

Out[21]:

(0.0528419894163242, -1.1349815681924558)

In [22]:

np.histogram(amh_s,bins= 10)

Out[22]:

(array([ 367, 1240, 2733, 1722, 1628, 1712, 1906, 2240, 1127,  324],
       dtype=int64),
 array([ 96. , 117.4, 138.8, 160.2, 181.6, 203. , 224.4, 245.8, 267.2,
        288.6, 310. ]))

In [23]:

amh_s.value_counts(bins=10).sort_index()

Out[23]:

average_montly_hours
(95.785, 117.4]     367
(117.4, 138.8]     1240
(138.8, 160.2]     2733
(160.2, 181.6]     1722
(181.6, 203.0]     1700
(203.0, 224.4]     1640
(224.4, 245.8]     1906
(245.8, 267.2]     2240
(267.2, 288.6]     1127
(288.6, 310.0]      324
Name: count, dtype: int64
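The counts from np.histogram and value_counts(bins=10) differ in two adjacent bins (1628 and 1712 versus 1700 and 1640). A likely explanation is the treatment of values that sit exactly on a bin edge: np.histogram uses left-closed bins, while value_counts(bins=...) uses right-closed intervals, which would be consistent with 72 observations equal to 203.0. The sketch below reproduces the effect on a tiny made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical data with several values exactly on the middle edge (2)
s = pd.Series([1, 2, 2, 3, 3, 3])

# np.histogram: bins are [1, 2) and [2, 3]; a value of 2 goes to the upper bin
counts_np, edges = np.histogram(s, bins=2)

# value_counts(bins=2): intervals are right-closed, so 2 goes to the lower bin
counts_pd = s.value_counts(bins=2).sort_index()

print(counts_np.tolist(), edges.tolist())
print(counts_pd.tolist())
```

Both results are valid histograms of the same data; the totals agree and only the assignment of edge values changes.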

In [24]:

### TimeSpendCompany Analysis

tsc_s = df["time_spend_company"]
tsc_s.describe()

Out[24]:

count    14999.000000
mean         3.498233
std          1.460136
min          2.000000
25%          3.000000
50%          3.000000
75%          4.000000
max         10.000000
Name: time_spend_company, dtype: float64

In [25]:

tsc_s.value_counts().sort_index()

Out[25]:

time_spend_company
2     3244
3     6443
4     2557
5     1473
6      718
7      188
8      162
10     214
Name: count, dtype: int64

In [26]:

### WorkAccident Analysis

wa_s = df["Work_accident"]
wa_s.describe()

Out[26]:

count    14999.000000
mean         0.144610
std          0.351719
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Work_accident, dtype: float64

In [27]:

wa_s.value_counts().sort_index()

Out[27]:

Work_accident
0    12830
1     2169
Name: count, dtype: int64

In [28]:

### Left Analysis

l_s = df["left"]
l_s.value_counts()

Out[28]:

left
0    11428
1     3571
Name: count, dtype: int64

In [29]:

### PromotionLast5Years Analysis

pl5_s = df["promotion_last_5years"]
pl5_s.value_counts().sort_index()

Out[29]:

promotion_last_5years
0    14680
1      319
Name: count, dtype: int64

In [30]:

### Salary Analysis

s_s = df["salary"]
s_s.describe()

Out[30]:

count     14999
unique        3
top         low
freq       7316
Name: salary, dtype: object

In [31]:

s_s.value_counts()

Out[31]:

salary
low       7316
medium    6446
high      1237
Name: count, dtype: int64

In [32]:

df[s_s == "high"]

Out[32]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years	Department	salary
72	0.45	0.49	2	149	3	0	1	0	product_mng	high
111	0.09	0.85	6	289	4	0	1	0	hr	high
189	0.44	0.51	2	156	3	0	1	0	technical	high
267	0.45	0.53	2	129	3	0	1	0	technical	high
306	0.37	0.46	2	149	3	0	1	0	marketing	high
...	...	...	...	...	...	...	...	...	...	...
14829	0.45	0.57	2	148	3	0	1	0	marketing	high
14868	0.43	0.55	2	130	3	0	1	0	support	high
14902	0.45	0.46	2	159	3	0	1	0	hr	high
14941	0.43	0.49	2	131	3	0	1	0	RandD	high
14980	0.76	0.89	5	238	5	0	1	0	technical	high

1237 rows × 10 columns

In [33]:

### Department Analysis

d_s = df["Department"]
d_s.value_counts(normalize=True)

Out[33]:

Department
sales          0.276018
technical      0.181345
support        0.148610
IT             0.081805
product_mng    0.060137
marketing      0.057204
RandD          0.052470
accounting     0.051137
hr             0.049270
management     0.042003
Name: proportion, dtype: float64

In [34]:

### Comparison Analysis

df = df.dropna(axis=0,how="any")
df.shape

Out[34]:

(14999, 10)

In [35]:

df_num = df.select_dtypes(include=[np.number])
df_num["Department"] = df["Department"]
df_num.groupby("Department").mean()

Out[35]:

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years
Department
IT	0.618142	0.716830	3.816626	202.215974	3.468623	0.133659	0.222494	0.002445
RandD	0.619822	0.712122	3.853875	200.800508	3.367217	0.170267	0.153748	0.034307
accounting	0.582151	0.717718	3.825293	201.162973	3.522816	0.125163	0.265971	0.018253
hr	0.598809	0.708850	3.654939	198.684709	3.355886	0.120433	0.290934	0.020298
management	0.621349	0.724000	3.860317	201.249206	4.303175	0.163492	0.144444	0.109524
marketing	0.618601	0.715886	3.687646	199.385781	3.569930	0.160839	0.236597	0.050117
product_mng	0.619634	0.714756	3.807095	199.965632	3.475610	0.146341	0.219512	0.000000
sales	0.614447	0.709717	3.776329	200.911353	3.534058	0.141787	0.244928	0.024155
support	0.618300	0.723109	3.803948	200.758188	3.393001	0.154778	0.248991	0.008973
technical	0.607897	0.721099	3.877941	202.497426	3.411397	0.140074	0.256250	0.010294

In [36]:

df.loc[:,["last_evaluation","Department"]].groupby("Department").mean()

Out[36]:

	last_evaluation
Department
IT	0.716830
RandD	0.712122
accounting	0.717718
hr	0.708850
management	0.724000
marketing	0.715886
product_mng	0.714756
sales	0.709717
support	0.723109
technical	0.721099

In [37]:

df.loc[:,["time_spend_company","Department"]].groupby("Department")["time_spend_company"].apply(lambda x:x.max()-x.min())

Out[37]:

Department
IT             8
RandD          6
accounting     8
hr             6
management     8
marketing      8
product_mng    8
sales          8
support        8
technical      8
Name: time_spend_company, dtype: int64
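The per-group range computed above with apply and a lambda can also be expressed with built-in aggregations. This is a sketch on a small made-up frame (the `toy` data and column names are hypothetical), showing that both routes give the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for Department / time_spend_company
toy = pd.DataFrame({
    "dept": ["a", "a", "b", "b", "b"],
    "years": [2, 5, 3, 3, 10],
})

# Route 1: custom function applied to each group
by_apply = toy.groupby("dept")["years"].apply(lambda x: x.max() - x.min())

# Route 2: built-in min/max aggregations, then subtract the columns
summary = toy.groupby("dept")["years"].agg(["min", "max"])
by_agg = summary["max"] - summary["min"]

print(by_apply.tolist())
print(by_agg.tolist())
```

The second route keeps the minimum and maximum available for inspection, which can help when a range looks surprising.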

Visualization

Bar Plot

In [38]:

import seaborn as sns
import matplotlib.pyplot as plt

In [39]:

plt.title("SALARY")
plt.ylabel("# of observation")
plt.xlabel("Salary")
plt.axis([-0.5,2.5,0,8000])
plt.xticks(np.arange(len(df["salary"].value_counts())),df["salary"].value_counts().index)
plt.bar(np.arange(len(df["salary"].value_counts())),df["salary"].value_counts(),width=0.5)
for x,y in zip(np.arange(len(df["salary"].value_counts())),df["salary"].value_counts()):
    plt.text(x,y,y,ha = "center",va = "bottom")
plt.show()

In [40]:

sns.set_style(style="whitegrid")
sns.set_context(context="poster",font_scale=0.8)
sns.set_palette("summer")
ax = sns.countplot(x="salary",hue="Department",data=df)
ax.set_ylim(0,5000)
plt.legend(loc='upper right',ncol = 2)  # Other loc options include 'upper left', 'lower left', etc.
plt.show()

Histogram

In [41]:

f = plt.figure(figsize=(5,12))
f.add_subplot(3,1,1)
sns.histplot(df["satisfaction_level"],bins = 10,kde=True)
f.add_subplot(3,1,2)
sns.histplot(df["last_evaluation"],bins = 10,kde=True)
f.add_subplot(3,1,3)
sns.histplot(df["average_montly_hours"],bins = 10,kde=True)
plt.show()

Boxplot

In [42]:

sns.boxplot(x = df["time_spend_company"])

Out[42]:

<Axes: xlabel='time_spend_company'>

In [43]:

sns.boxplot(df["time_spend_company"],whis=3,saturation=0.75)

Out[43]:

<Axes: ylabel='time_spend_company'>
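The whis argument sets how far the whiskers may reach, as a multiple of the IQR beyond the quartiles; points past that reach are drawn as outliers. As a sketch of the effect, the code below rebuilds the time_spend_company values from the value counts shown earlier in this notebook and counts the points outside the fences for whis=1.5 (the default) and whis=3. The helper function name is a hypothetical one introduced for illustration.

```python
import numpy as np

# Value counts of time_spend_company as listed in the notebook output
values = np.array([2, 3, 4, 5, 6, 7, 8, 10])
counts = np.array([3244, 6443, 2557, 1473, 718, 188, 162, 214])
tenure = np.repeat(values, counts)

q1, q3 = np.percentile(tenure, [25, 75])
iqr = q3 - q1


def count_beyond_fences(data, whis):
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((data < lower) | (data > upper)).sum())


n_out_default = count_beyond_fences(tenure, 1.5)
n_out_wide = count_beyond_fences(tenure, 3)

print(q1, q3, iqr)
print(n_out_default, n_out_wide)
```

With these counts the quartiles are 3 and 4, so the default fences flag every tenure of 6 years or more, while whis=3 only flags tenures above 7 years.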

Line Plot

In [44]:

sub_df = df[["time_spend_company","left"]].groupby("time_spend_company").mean()
sub_df

Out[44]:

	left
time_spend_company
2	0.016338
3	0.246159
4	0.348064
5	0.565513
6	0.291086
7	0.000000
8	0.000000
10	0.000000

In [45]:

sns.pointplot(x = sub_df.index, y = sub_df["left"])
plt.show()

In [46]:

sns.pointplot(x = df["time_spend_company"], y = df["left"])
plt.show()

Pie Chart

In [47]:

sns.set_palette("Set1")
plt.figure(figsize=(8,8))
plt.pie(df["Department"].value_counts(normalize = True),
        explode = [0.1 if i == "sales" else 0 for i in df["Department"].value_counts().index],
        labels= df["Department"].value_counts().index,
        autopct = "%1.2f%%")
plt.show()