EasyVisa Project¶

Context:¶

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally as well as abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective:¶

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. As a data scientist, I have to analyze the data provided and, with the help of a classification model:

  • Facilitate the process of visa approvals.
  • Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

Data Description¶

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

  • case_id: ID of each visa application
  • continent: Continent of the employee
  • education_of_employee: Education level of the employee
  • has_job_experience: Does the employee have any job experience? Y = Yes; N = No
  • requires_job_training: Does the employee require any job training? Y = Yes; N = No
  • no_of_employees: Number of employees in the employer's company
  • yr_of_estab: Year in which the employer's company was established
  • region_of_employment: Information of foreign worker's intended region of employment in the US.
  • prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
  • unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
  • full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
  • case_status: Flag indicating if the Visa was certified or denied

Importing necessary libraries and data¶

In [232]:
# Installing the libraries with the specified version.
# !pip install numpy==1.25.2 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.1 seaborn==0.13.1 xgboost==2.0.3

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [233]:
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries for different ensemble classifiers
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)

from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

# Libraries to get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# To tune different models
from sklearn.model_selection import GridSearchCV

Importing data¶

In [234]:
# Mount Google Drive (needed when running on Google Colab)
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [235]:
# loading data
data = pd.read_csv('/content/drive/MyDrive/content/EasyVisa.csv')

Data Overview¶

  • Observations
  • Sanity checks

Shape of data

In [236]:
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data frame.")
There are 25480 rows and 12 columns in the data frame.

Displaying the first few rows of the dataset

In [237]:
data.head()
Out[237]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
0 EZYV01 Asia High School N N 14513 2007 West 592.2029 Hour Y Denied
1 EZYV02 Asia Master's Y N 2412 2002 Northeast 83425.6500 Year Y Certified
2 EZYV03 Asia Bachelor's N Y 44444 2008 West 122996.8600 Year Y Denied
3 EZYV04 Asia Bachelor's N N 98 1897 West 83434.0300 Year Y Denied
4 EZYV05 Africa Master's Y N 1082 2005 South 149907.3900 Year Y Certified

Displaying the last few rows of the dataset

In [238]:
data.tail()
Out[238]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
25475 EZYV25476 Asia Bachelor's Y Y 2601 2008 South 77092.57 Year Y Certified
25476 EZYV25477 Asia High School Y N 3274 2006 Northeast 279174.79 Year Y Certified
25477 EZYV25478 Asia Master's Y N 1121 1910 South 146298.85 Year N Certified
25478 EZYV25479 Asia Master's Y Y 1918 1887 West 86154.77 Year Y Certified
25479 EZYV25480 Asia Bachelor's Y N 3195 1960 Midwest 70876.91 Year Y Certified

Checking type of columns

In [239]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB

Checking for duplicates

In [240]:
data.duplicated().sum()
Out[240]:
0

Checking for null values

In [241]:
data.isnull().sum()
Out[241]:
0
case_id 0
continent 0
education_of_employee 0
has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0

Statistical summary of numeric data

In [242]:
data.describe(include='all').T
Out[242]:
count unique top freq mean std min 25% 50% 75% max
case_id 25480 25480 EZYV01 1 NaN NaN NaN NaN NaN NaN NaN
continent 25480 6 Asia 16861 NaN NaN NaN NaN NaN NaN NaN
education_of_employee 25480 4 Bachelor's 10234 NaN NaN NaN NaN NaN NaN NaN
has_job_experience 25480 2 Y 14802 NaN NaN NaN NaN NaN NaN NaN
requires_job_training 25480 2 N 22525 NaN NaN NaN NaN NaN NaN NaN
no_of_employees 25480.0 NaN NaN NaN 5667.04321 22877.928848 -26.0 1022.0 2109.0 3504.0 602069.0
yr_of_estab 25480.0 NaN NaN NaN 1979.409929 42.366929 1800.0 1976.0 1997.0 2005.0 2016.0
region_of_employment 25480 5 Northeast 7195 NaN NaN NaN NaN NaN NaN NaN
prevailing_wage 25480.0 NaN NaN NaN 74455.814592 52815.942327 2.1367 34015.48 70308.21 107735.5125 319210.27
unit_of_wage 25480 4 Year 22962 NaN NaN NaN NaN NaN NaN NaN
full_time_position 25480 2 Y 22773 NaN NaN NaN NaN NaN NaN NaN
case_status 25480 2 Certified 17018 NaN NaN NaN NaN NaN NaN NaN

Removing case_id since it is not required

In [243]:
data.drop("case_id", axis=1, inplace=True)

Fixing negative values in number of employees

In [244]:
# Understand how many cases we have
data.loc[data['no_of_employees'] < 0].shape
Out[244]:
(33, 11)

Observations:

  • There are 33 rows with a negative number of employees; these are most likely sign errors, so we convert them to their absolute values below.

In [245]:
# convert the negative values to their absolute values
data['no_of_employees'] = abs(data['no_of_employees'])
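As a quick sanity check, the absolute-value fix can be illustrated on a toy Series (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical employee counts, including erroneous negatives
counts = pd.Series([-26, 14513, -5, 2412], name="no_of_employees")

# Same fix as above: negative entries become their absolute values
fixed = counts.abs()

print(fixed.tolist())      # [26, 14513, 5, 2412]
print((fixed < 0).sum())   # 0 negative values remain
```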

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Global functions for EDA¶

In [246]:
# creating a copy of the data so that original data is not changed.
df = data.copy()
In [247]:
# User-defined function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(
    data, feature, xlabel, ylabel, figsize=(8, 6), kde=False, bins=None
):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    xlabel: label of x-axis
    ylabel: label of y-axis
    figsize: size of figure (default (8, 6))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots

    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="orange"
    )  # boxplot will be created and a star will indicate the mean value of the column

    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram

    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram

    ax_hist2.axvline(
        data[feature].median(), color="red", linestyle="-"
    )  # Add median to the histogram

    ax_box2.set_xlabel("", fontsize=16)  # remove label of 1st x-axis
    ax_hist2.set_xlabel(xlabel, fontsize=16)  # set 2nd x-axis label
    ax_hist2.set_ylabel(ylabel, fontsize=16)
    # set y-axis label
In [248]:
# User-defined function to create labeled barplots
def labeled_barplot(data, feature, xlabel, ylabel, perc=False, n=None):
    """
    Barplot with percentage to the left

    data: dataframe
    feature: dataframe column
    xlabel: label of x-axis
    ylabel: label of y-axis
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(8, 0.5 * count + 1))
    else:
        plt.figure(figsize=(8, 0.5 * n + 1))

    plt.yticks(fontsize=14)
    plt.xticks(fontsize=14)

    ax = sns.countplot(
        data=data,
        y=feature,
        palette="Set2",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_width() / total
            )  # percentage of each class of the category
        else:
            label = p.get_width()  # count of each level of the category

        y = p.get_y() + p.get_height() / 2
        x = p.get_width()

        ax.annotate(
            label,
            (x, y),
            ha="left",
            va="center",
            size=12,
            xytext=(0, 0),
            textcoords="offset points",
        )  # annotate the percentage

    ax.set_xlabel(xlabel, fontsize=16)  # set x-axis label
    ax.set_ylabel(ylabel, fontsize=16)  # set y-axis label

    plt.show()  # show the plot

Leading Questions:

  1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

  2. How does the visa status vary across different continents?

  3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

  4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

  5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

Univariate Analysis¶

Observations on education level¶

In [249]:
labeled_barplot(
    data=df,
    feature="education_of_employee",
    xlabel="Number of Applications",
    ylabel="Education Level",
    perc=True,
)
No description has been provided for this image

Observations:

  • The majority of the applicants have either bachelor's degrees (40.2%) or master's degrees (37.8%).
  • Only 8.6% of the applicants have doctorate degrees.

Observations on continents¶

In [250]:
labeled_barplot(data,
                "continent",
                xlabel="Number of Applications",
                ylabel="Continent of Origin",
                perc=True)
No description has been provided for this image

Observations:

  • The majority (66%) of the visa applicants are from Asia, which makes sense given the continent's large population.
  • The smallest fraction (<1%) of the applicants are from Oceania, which also makes sense given its very low population.
  • North America and Europe have similar shares of applicants (12.9% and 14.6%, respectively).

Observations on work experience¶

In [251]:
labeled_barplot(
    data=df,
    feature="has_job_experience",
    xlabel="Number of Applications",
    ylabel="Job Experience",
    perc=True,
)
No description has been provided for this image

Observations:

  • 58% of applicants have previous job experience.

Observations on payment intervals¶

In [252]:
labeled_barplot(
    data=df,
    feature="unit_of_wage",
    xlabel="Number of Applications",
    ylabel="Payment interval",
    perc=True,
)
No description has been provided for this image

Observations:

  • The vast majority of applicants are for jobs with wages paid on a yearly basis.

Observations on prevailing wage¶

In [253]:
histogram_boxplot(data=df,
    feature="prevailing_wage",
    xlabel="Number of Applications",
    ylabel="Wage"
    )
No description has been provided for this image

Observations:

  • The distribution of the prevailing wage is skewed to the right.
  • There is a very large spread in wages among applicants.
  • There is a large spike in wages close to 0, which warrants a closer look below.
In [254]:
df.loc[df['prevailing_wage'] < 100, 'unit_of_wage'].value_counts()
Out[254]:
count
unit_of_wage
Hour 176

  • The low wages behind the spike in the first histogram are paid hourly, which distorts the scale of the plot. No further treatment is required.

Observations on job training¶

In [255]:
labeled_barplot(
    data=df,
    feature="requires_job_training",
    xlabel="Number of Applications",
    ylabel="Training required?",
    perc=True,
)
No description has been provided for this image

Observations:

  • 88.4% of the applicants do not require any job training

Observations on region of employment¶

In [256]:
labeled_barplot(
    data=df,
    feature="region_of_employment",
    xlabel="Number of Applications",
    ylabel="Region",
    perc=True,
)
No description has been provided for this image

Observations:

  • The Northeast, South, and West regions account for roughly equal shares of applications.
  • The Island region accounts for only 1.5% of the applicants.

Observations on case status¶

In [257]:
labeled_barplot(
    data=df,
    feature="case_status",
    xlabel="Number of Applications",
    ylabel="Status",
    perc=True,
)
No description has been provided for this image

Observations:

  • 66.8% of the visas were certified.
  • 33.2% of applicants got their visa request denied.

Bivariate Analysis¶

Global functions¶

In [258]:
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [259]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Correlation check¶

In [260]:
# separate the numerical columns
cols_list = df.select_dtypes(include=np.number).columns.tolist()

# create the correlation matrix
plt.figure(figsize=(10, 5))
sns.heatmap(
    df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
No description has been provided for this image

Observations:

  • There is no notable correlation among the numerical variables in the dataset.

Education vs visa certification¶

In [261]:
stacked_barplot(df, "education_of_employee", "case_status")
case_status            Certified  Denied    All
education_of_employee                          
All                        17018    8462  25480
Bachelor's                  6367    3867  10234
High School                 1164    2256   3420
Master's                    7575    2059   9634
Doctorate                   1912     280   2192
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations:

  • The data reflects that the higher the education level, the higher the chances of visa approval.

Education vs region¶

In [262]:
plt.figure(figsize=(10, 5))
sns.heatmap(
    pd.crosstab(df["education_of_employee"], df["region_of_employment"]),
    annot=True,
    fmt="g",
    cmap="viridis",
)
plt.ylabel("Education")
plt.xlabel("Region")
plt.show()
No description has been provided for this image

Observations:

  • High School applicants are concentrated mostly in the South region, followed by the Northeast.
  • Bachelor's degrees are most requested in the South region, followed by the West.
  • Master's degrees are most requested in the Northeast region, followed by the South.
  • Doctorates are most requested in the West region, followed by the Northeast.

Region vs case status¶

In [263]:
stacked_barplot(df, "region_of_employment", "case_status")
case_status           Certified  Denied    All
region_of_employment                          
All                       17018    8462  25480
Northeast                  4526    2669   7195
West                       4100    2486   6586
South                      4913    2104   7017
Midwest                    3253    1054   4307
Island                      226     149    375
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations:

  • The Midwest has the highest visa approval rate (~76%).
  • The Island region has the lowest visa approval rate (~60%).

Continent vs case status¶

In [264]:
stacked_barplot(df, "continent", "case_status")
case_status    Certified  Denied    All
continent                              
All                17018    8462  25480
Asia               11012    5849  16861
North America       2037    1255   3292
Europe              2957     775   3732
South America        493     359    852
Africa               397     154    551
Oceania              122      70    192
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations:

  • Europe has the highest visa approval rate (~79%).
  • South America has the lowest visa approval rate (~58%).

Job experience vs training required¶

In [265]:
stacked_barplot(df, "has_job_experience", "requires_job_training")
requires_job_training      N     Y    All
has_job_experience                       
All                    22525  2955  25480
N                       8988  1690  10678
Y                      13537  1265  14802
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations:

  • If the applicant has job experience, they are less likely to require training.

Wage vs case status¶

In [266]:
distribution_plot_wrt_target(df, "prevailing_wage", "case_status")
No description has been provided for this image

Observations:

  • The median wage for the certified applications is slightly higher compared to denied applications.

Region vs wage¶

In [267]:
plt.figure(figsize=(10, 5))
sns.boxplot(df, x="region_of_employment", y="prevailing_wage")
plt.show()
No description has been provided for this image

Observations:

  • Wages are higher in the Midwest and Island regions.

Unit of wage vs case status¶

In [268]:
stacked_barplot(df, "unit_of_wage", "case_status")
case_status   Certified  Denied    All
unit_of_wage                          
All               17018    8462  25480
Year              16047    6915  22962
Hour                747    1410   2157
Week                169     103    272
Month                55      34     89
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Observations:

  • Applicants paid yearly are most likely to be certified, while applicants paid hourly are most likely to be denied.

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Missing values¶

In [269]:
df.isnull().sum()
Out[269]:
0
continent 0
education_of_employee 0
has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0

There are no missing values in any of the columns.

Feature engineering¶

  • The feature yr_of_estab should be standardized into a form that is easier to interpret across the data. Hence, it is transformed from year of establishment to years since establishment, taking 2016 as the reference year. This provides more value.
  • Similarly, prevailing_wage should be expressed in a single standard unit so it can be read and compared across all rows. It is therefore converted to hourly_wage, so that every value is interpreted on the same scale and further analysis can be done.
Years since establishment¶
In [270]:
# Adding new column
df['years_since_estab'] = 2016 - df['yr_of_estab']
In [271]:
# Dropping yr_of_estab
df.drop("yr_of_estab", axis=1, inplace=True)
Hourly wages¶

To convert wages to an hourly basis, the standard US full-time schedule for 2016 is used:

  • Monthly: (40 hours/week) x (52 weeks/year) / 12 months ≈ 173.33 hours/month, rounded to 173.
  • Weekly: 40 hours/week.
  • Yearly: (40 hours/week) x (52 weeks/year) = 2080 hours/year.
In [272]:
df["hourly_wage"] = df["prevailing_wage"]
In [273]:
# Converting yearly wages to hourly
df.loc[df.unit_of_wage == "Year", "hourly_wage"] = (
    df.loc[df.unit_of_wage == "Year", "hourly_wage"] / 2080
)
In [274]:
# Converting monthly wages to hourly
df.loc[df.unit_of_wage == "Month", "hourly_wage"] = (
    df.loc[df.unit_of_wage == "Month", "hourly_wage"] / 173
)
In [275]:
# Converting weekly wages to hourly
df.loc[df.unit_of_wage == "Week", "hourly_wage"] = (
    df.loc[df.unit_of_wage == "Week", "hourly_wage"] / 40
)
In [276]:
# Finally, prevailing_wage can be dropped
df.drop("prevailing_wage", axis=1, inplace=True)
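As a design note, the three per-unit divisions above could equivalently be written as a single division by a mapped divisor. A minimal sketch on made-up wage values, using the same hours-per-period assumptions (2080, 173, 40, and 1 for wages already hourly):

```python
import pandas as pd

# Hypothetical wages in different units (values are illustrative only)
sample = pd.DataFrame({
    "prevailing_wage": [2080.0, 173.0, 80.0, 25.0],
    "unit_of_wage": ["Year", "Month", "Week", "Hour"],
})

# Hours per pay period, matching the assumptions above
hours_per_unit = {"Year": 2080, "Month": 173, "Week": 40, "Hour": 1}

# One vectorized division instead of three .loc assignments
sample["hourly_wage"] = sample["prevailing_wage"] / sample["unit_of_wage"].map(hours_per_unit)

print(sample["hourly_wage"].tolist())  # [1.0, 1.0, 2.0, 25.0]
```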

Checking again the data including the new columns

In [277]:
df.head()
Out[277]:
continent education_of_employee has_job_experience requires_job_training no_of_employees region_of_employment unit_of_wage full_time_position case_status years_since_estab hourly_wage
0 Asia High School N N 14513 West Hour Y Denied 9 592.202900
1 Asia Master's Y N 2412 Northeast Year Y Certified 14 40.108486
2 Asia Bachelor's N Y 44444 West Year Y Denied 8 59.133106
3 Asia Bachelor's N N 98 West Year Y Denied 119 40.112514
4 Africa Master's Y N 1082 South Year Y Certified 11 72.070861
In [278]:
# Check statistical summary of numeric data in updated data
df.describe().T
Out[278]:
count mean std min 25% 50% 75% max
no_of_employees 25480.0 5667.089207 22877.917453 11.000000 1022.00000 2109.000000 3504.000000 602069.00000
years_since_estab 25480.0 36.590071 42.366929 0.000000 11.00000 19.000000 40.000000 216.00000
hourly_wage 25480.0 94.902995 278.176919 0.048077 22.64806 39.826663 60.012036 7004.39875

Observations:

  • The mean and median values of years_since_estab are 36.6 and 19 years, respectively.
  • The oldest employer was established 216 years before 2016.
  • The minimum hourly wage is ~0.05 and the maximum is ~7004.
  • The mean hourly wage is ~95.

Outlier detection and treatment¶

In [279]:
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
No description has been provided for this image

Observations:

  • Despite having some outliers in the dataset, these will not be removed, as they carry genuine information for the analysis.

Data preparation for modeling¶

In [280]:
df_m = df.copy()

  • We want to predict which visas will be certified.
  • We need to encode categorical features.
  • We'll split the data into train and test sets so that we can evaluate the model built on the train data.
Encoding categorical data¶
In [281]:
# case_status:
df_m.case_status = df_m.case_status.apply(lambda x: 1 if x == "Certified" else 0)
In [282]:
# has_job_experience:
df_m.has_job_experience = df.has_job_experience.apply(lambda x: 1 if x == "Y" else 0)
In [283]:
# requires_job_training:
df_m.requires_job_training = df_m.requires_job_training.apply(lambda x: 1 if x == "Y" else 0)
In [284]:
# full_time_position:
df_m.full_time_position = df_m.full_time_position.apply(lambda x: 1 if x == "Y" else 0)
In [285]:
# education_of_employee:
# Replace 'High School' with 1, 'Bachelor's' with 2, 'Master's' with 3, and 'Doctorate' with 4
df_m.education_of_employee = df_m.education_of_employee.apply(
    lambda x: 1
    if x == "High School"
    else (2 if x == "Bachelor's" else (3 if x == "Master's" else 4))
)
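The nested lambda above can also be expressed with `Series.map`, which makes the ordinal scale explicit and would surface any unexpected category as NaN instead of silently coding it as 4. A sketch on illustrative values:

```python
import pandas as pd

# Explicit ordinal scale for education level
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "Doctorate": 4}

# Illustrative column; in the notebook this would be df_m.education_of_employee
edu = pd.Series(["High School", "Doctorate", "Master's", "Bachelor's"])

print(edu.map(education_order).tolist())  # [1, 4, 3, 2]
```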
Dependent and independent variables¶
In [286]:
# split to train and test
X = df_m.drop(["case_status"], axis=1)
Y = df_m.case_status
Create dummy variables¶
In [287]:
# create dummy variables for categorical columns
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object"]).columns.tolist(),
    drop_first=True,
)
# transforming booleans into floats (1.0, 0.0)
X = X.astype(float)

# Check updated independent variables data frame
X.sample(5, random_state=1)
Out[287]:
education_of_employee has_job_experience requires_job_training no_of_employees full_time_position years_since_estab hourly_wage continent_Asia continent_Europe continent_North America continent_Oceania continent_South America region_of_employment_Midwest region_of_employment_Northeast region_of_employment_South region_of_employment_West unit_of_wage_Month unit_of_wage_Week unit_of_wage_Year
17639 2.0 1.0 0.0 567.0 1.0 24.0 12.905245 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
23951 2.0 0.0 0.0 619.0 1.0 78.0 31.932683 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
8625 3.0 0.0 0.0 2635.0 1.0 11.0 887.292100 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
20206 2.0 1.0 1.0 3184.0 1.0 30.0 23.767212 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
7471 2.0 1.0 0.0 4681.0 1.0 88.0 23.973649 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
Splitting data in train and test sets¶
In [288]:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
# Check number of rows in each data set
print("Number of rows in training data set =", X_train.shape[0])
print("\nNumber of rows in test data set =", X_test.shape[0])
Number of rows in training data set = 17836

Number of rows in test data set = 7644
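Since `stratify=Y` was passed, the ~66.8% certification rate is preserved in both splits. A minimal sketch with synthetic labels showing that the class proportion survives the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 2/3 ones, mirroring the certification rate
y = np.array([1] * 200 + [0] * 100)
X_fake = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y, test_size=0.30, random_state=1, stratify=y
)

# Both splits keep the 2/3 proportion of the positive class
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```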

Model evaluation criterion¶

The model can make wrong predictions in two ways:¶

  1. The model predicts that the visa application will be certified, but in reality it should be denied (a false positive).
  2. The model predicts that the visa application will be denied, but in reality it should be certified (a false negative).

Which case is more important?¶

  • Both cases are important:

  • If a visa is certified when it should have been denied, an unsuitable employee will get the position while US workers miss the opportunity to work in it.

  • If a visa is denied when it should have been certified, the US will lose a suitable human resource that could contribute to the economy.

How to reduce the losses?¶

  • The F1 score can be used as the evaluation metric: the higher the F1 score, the better the model balances false negatives and false positives.
  • We will use balanced class weights so that the model focuses equally on both classes.
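As a minimal sketch (with illustrative labels, not this project's data), the relationship between F1, precision, and recall, and what "balanced" class weights actually compute, looks like this:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 1 = certified, 0 = denied (not from this dataset)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4/5
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall, so it is high only
# when false positives AND false negatives are both low
assert np.isclose(f1, 2 * precision * recall / (precision + recall))

# class_weight="balanced" uses n_samples / (n_classes * class_count),
# giving the minority class proportionally more weight during training
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_true)
print({c: round(float(w), 3) for c, w in zip([0, 1], weights)})  # {0: 1.333, 1: 0.8}
```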

Global functions¶

  • The model_performance_classification_sklearn function will be used to evaluate model performance.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [289]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [290]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Building bagging and boosting models¶

Note

  1. Sample parameter grids have been provided for the necessary hyperparameter tuning. These grids are intended to balance model performance improvement against execution time. One can extend or reduce a parameter grid based on execution time and system configuration.
  2. Note that extending a parameter grid to further improve model performance will increase execution time.
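To see why extending a grid increases execution time: GridSearchCV fits one model per parameter combination per CV fold. A quick sketch (assuming the default cv=5, with a grid of the same shape as the decision-tree grid used below):

```python
from itertools import product

# Parameter grid (same shape as the decision-tree grid below)
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Total fits = parameter combinations x CV folds (GridSearchCV defaults to cv=5)
n_combinations = len(list(product(*param_grid.values())))
cv_folds = 5
print(n_combinations, "combinations ->", n_combinations * cv_folds, "fits")  # 48 -> 240
```

Each extra value added to any parameter multiplies the number of combinations, so grid size (and run time) grows multiplicatively, not additively.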

Decision Tree - Model Building and Hyperparameter Tuning¶

Default model - Decision Tree¶

model building¶
In [291]:
dtree_classifer_model = DecisionTreeClassifier(random_state=1)
dtree_classifer_model.fit(X_train, y_train)
Out[291]:
DecisionTreeClassifier(random_state=1)
Training set¶
In [292]:
confusion_matrix_sklearn(dtree_classifer_model, X_train, y_train)
In [293]:
decision_tree_perf_train = model_performance_classification_sklearn(
    dtree_classifer_model, X_train, y_train
)
decision_tree_perf_train
Out[293]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing set¶
In [294]:
confusion_matrix_sklearn(dtree_classifer_model, X_test, y_test)
In [295]:
decision_tree_perf_test = model_performance_classification_sklearn(
    dtree_classifer_model, X_test, y_test
)
decision_tree_perf_test
Out[295]:
Accuracy Recall Precision F1
0 0.658163 0.743193 0.744505 0.743849
Observations:¶
  • The decision tree is overfitting the training data.
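The overfitting shows up as a large gap between train and test F1 (1.00 vs. roughly 0.74 above). A minimal, self-contained sketch of that check on synthetic stand-in data (not the project's variables):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# An unrestricted tree grows until it memorizes the training set
model = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, model.predict(X_tr))
test_f1 = f1_score(y_te, model.predict(X_te))

# A large train-test gap is the signature of overfitting
print(f"train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}, gap = {train_f1 - test_f1:.3f}")
```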

Hyperparameter Tuning - Decision Tree¶

Decision Tree parameters:

param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
model building¶
In [296]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight="balanced", random_state=1)

# Grid of parameters to choose from
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, param_grid, scoring=scorer, n_jobs=-1)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of param_grid
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
Out[296]:
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, random_state=1)
Training set¶
In [297]:
confusion_matrix_sklearn(dtree_estimator, X_train, y_train)
In [298]:
dtree_estimator_model_train_perf = model_performance_classification_sklearn(
    dtree_estimator, X_train, y_train
)
dtree_estimator_model_train_perf
Out[298]:
Accuracy Recall Precision F1
0 0.715351 0.775036 0.793895 0.784352
Testing set¶
In [299]:
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
In [300]:
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
    dtree_estimator, X_test, y_test
)
dtree_estimator_model_test_perf
Out[300]:
Accuracy Recall Precision F1
0 0.714155 0.777473 0.790953 0.784155
Observations:¶
  • The overfitting is eliminated after hyperparameter tuning, and the test F1 score has increased by approximately 4%.
  • The model generalizes well and can be used for prediction.
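A stronger generalization check than a single train/test split is k-fold cross-validation. A minimal sketch (synthetic stand-in data and a shallow tree in the spirit of the tuned estimator, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

# A shallow, pruned tree similar in style to the tuned estimator above
tree = DecisionTreeClassifier(
    max_depth=4, max_leaf_nodes=15, class_weight="balanced", random_state=1
)

# 5-fold cross-validated F1: a stable score across folds supports the
# claim that the model generalizes
scores = cross_val_score(tree, X, y, cv=5, scoring="f1")
print("F1 per fold:", np.round(scores, 3), "| mean:", round(float(scores.mean()), 3))
```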

Bagging - Model Building and Hyperparameter Tuning¶

Default model - Bagging Classifier¶

model building¶
In [301]:
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train, y_train)
Out[301]:
BaggingClassifier(random_state=1)
Training set¶
In [302]:
confusion_matrix_sklearn(bagging_classifier, X_train, y_train)
In [303]:
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
    bagging_classifier, X_train, y_train
)
bagging_classifier_model_train_perf
Out[303]:
Accuracy Recall Precision F1
0 0.985255 0.986485 0.991395 0.988934
Testing set¶
In [304]:
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
In [305]:
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
    bagging_classifier, X_test, y_test
)
bagging_classifier_model_test_perf
Out[305]:
Accuracy Recall Precision F1
0 0.689168 0.767679 0.767078 0.767378
Observations:¶
  • The bagging classifier is also overfitting the training data.

Hyperparameter Tuning - Bagging Classifier¶

Bagging Tree parameters:

param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
model building¶
In [306]:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, param_grid, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of param_grid
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
Out[306]:
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=70,
                  random_state=1)
Training set¶
In [307]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_train, y_train)
In [308]:
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(
    bagging_estimator_tuned, X_train, y_train
)
bagging_estimator_tuned_model_train_perf
Out[308]:
Accuracy Recall Precision F1
0 0.995403 0.999916 0.993246 0.99657
Testing set¶
In [309]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)
In [310]:
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(
    bagging_estimator_tuned, X_test, y_test
)
bagging_estimator_tuned_model_test_perf
Out[310]:
Accuracy Recall Precision F1
0 0.729984 0.880509 0.755589 0.81328
Observations:¶
  • The bagging model is still overfitting even after tuning, so it is not a good model to use for prediction.

Default model - Random Forest¶

model building¶
In [311]:
rf_estimator = RandomForestClassifier(random_state=1, class_weight="balanced")
rf_estimator.fit(X_train, y_train)
Out[311]:
RandomForestClassifier(class_weight='balanced', random_state=1)
Training set¶
In [312]:
confusion_matrix_sklearn(rf_estimator, X_train, y_train)
In [313]:
rf_estimator_model_train_perf = model_performance_classification_sklearn(
    rf_estimator, X_train, y_train
)
rf_estimator_model_train_perf
Out[313]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing set¶
In [314]:
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
In [315]:
rf_estimator_model_test_perf = model_performance_classification_sklearn(
    rf_estimator, X_test, y_test
)
rf_estimator_model_test_perf
Out[315]:
Accuracy Recall Precision F1
0 0.720042 0.842703 0.762901 0.800819
Observations:¶
  • The random forest base model is overfitting the training data.

Hyperparameter Tuning - Random Forest Classifier¶

Random Forest parameters:

param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
model building¶
In [316]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)

param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, param_grid, scoring=acc_scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of param_grid
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
Out[316]:
RandomForestClassifier(max_samples=0.4, min_samples_leaf=3, n_estimators=110,
                       oob_score=True, random_state=1)
Training set¶
In [317]:
confusion_matrix_sklearn(rf_tuned, X_train, y_train)
In [318]:
rf_tuned_model_train_perf = model_performance_classification_sklearn(
    rf_tuned, X_train, y_train
)
rf_tuned_model_train_perf
Out[318]:
Accuracy Recall Precision F1
0 0.80259 0.911022 0.815157 0.860427
Testing set¶
In [319]:
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
In [320]:
rf_tuned_model_test_perf = model_performance_classification_sklearn(
    rf_tuned, X_test, y_test
)
rf_tuned_model_test_perf
Out[320]:
Accuracy Recall Precision F1
0 0.740712 0.867973 0.772086 0.817226
Observations:¶
  • The overfitting is eliminated after hyperparameter tuning, and the test F1 score has increased by approximately 1%.
  • The model generalizes well and can be used for prediction.

Boosting - Model Building and Hyperparameter Tuning¶

Default model - AdaBoost Classifier¶

model building¶
In [321]:
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train, y_train)
Out[321]:
AdaBoostClassifier(random_state=1)
Training set¶
In [322]:
confusion_matrix_sklearn(ab_classifier, X_train, y_train)
In [323]:
ab_classifier_model_train_perf = model_performance_classification_sklearn(
    ab_classifier, X_train, y_train
)
ab_classifier_model_train_perf
Out[323]:
Accuracy Recall Precision F1
0 0.737497 0.887518 0.759828 0.818724
Testing set¶
In [324]:
confusion_matrix_sklearn(ab_classifier, X_test, y_test)
In [325]:
ab_classifier_model_test_perf = model_performance_classification_sklearn(
    ab_classifier, X_test, y_test
)
ab_classifier_model_test_perf
Out[325]:
Accuracy Recall Precision F1
0 0.734432 0.88619 0.757408 0.816754
Observations:¶
  • The model is not overfitting and gives good overall performance.
  • Let's tune it to see whether the results improve further.

Hyperparameter Tuning - AdaBoost Classifier¶

AdaBoost parameters:

param_grid = {
    "estimator": [
        DecisionTreeClassifier(max_depth=2,random_state=1),
        DecisionTreeClassifier(max_depth=3,random_state=1),
    ],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": np.arange(0.01,0.1,0.05),
}
model building¶
In [326]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
param_grid = {
    # Let's try different max_depth for base_estimator
    "estimator": [
        DecisionTreeClassifier(max_depth=2,random_state=1),
        DecisionTreeClassifier(max_depth=3,random_state=1),
    ],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": np.arange(0.01,0.1,0.05),
}

# Type of scoring used to compare parameter  combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, param_grid, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of param_grid
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
Out[326]:
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3,
                                                    random_state=1),
                   learning_rate=0.060000000000000005, n_estimators=100,
                   random_state=1)
Training set¶
In [327]:
confusion_matrix_sklearn(abc_tuned, X_train, y_train)
In [328]:
abc_tuned_model_train_perf = model_performance_classification_sklearn(
    abc_tuned, X_train, y_train
)
abc_tuned_model_train_perf
Out[328]:
Accuracy Recall Precision F1
0 0.754878 0.87904 0.781318 0.827303
Testing set¶
In [329]:
confusion_matrix_sklearn(abc_tuned, X_test, y_test)
In [330]:
abc_tuned_model_test_perf = model_performance_classification_sklearn(
    abc_tuned, X_test, y_test
)
abc_tuned_model_test_perf
Out[330]:
Accuracy Recall Precision F1
0 0.742543 0.874829 0.770664 0.81945
Observations:¶
  • Tuning did not change overall performance much.
  • Precision improved by about 1%.
  • Accuracy also increased by about 1%.
  • The model can be used for prediction.

Default model - Gradient Boosting Classifier¶

model building¶
In [331]:
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train, y_train)
Out[331]:
GradientBoostingClassifier(random_state=1)
Training set¶
In [332]:
confusion_matrix_sklearn(gb_classifier, X_train, y_train)
In [333]:
gb_classifier_model_train_perf = model_performance_classification_sklearn(
    gb_classifier, X_train, y_train
)
gb_classifier_model_train_perf
Out[333]:
Accuracy Recall Precision F1
0 0.756896 0.879795 0.783041 0.828603
Testing set¶
In [334]:
confusion_matrix_sklearn(gb_classifier, X_test, y_test)
In [335]:
gb_classifier_model_test_perf = model_performance_classification_sklearn(
    gb_classifier, X_test, y_test
)
gb_classifier_model_test_perf
Out[335]:
Accuracy Recall Precision F1
0 0.744767 0.875808 0.77246 0.820894
Observations:¶
  • The model is not overfitting and gives good overall performance.
  • Let's tune it to see whether the results improve further.

Hyperparameter Tuning - Gradient Boosting Classifier¶

Gradient Boosting parameters:

param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
model building¶
In [336]:
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)

# Grid of parameters to choose from
parameters = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
Out[336]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.7, random_state=1,
                           subsample=0.9)
Training set¶
In [337]:
confusion_matrix_sklearn(gbc_tuned, X_train, y_train)
In [338]:
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
    gbc_tuned, X_train, y_train
)
gbc_tuned_model_train_perf
Out[338]:
Accuracy Recall Precision F1
0 0.750953 0.873332 0.780085 0.824079
Testing set¶
In [339]:
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
In [340]:
gbc_tuned_model_test_perf = model_performance_classification_sklearn(
    gbc_tuned, X_test, y_test
)
gbc_tuned_model_test_perf
Out[340]:
Accuracy Recall Precision F1
0 0.744113 0.870127 0.774542 0.819557
Observations:¶
  • The F1 score decreases slightly after tuning.
  • Since the model does not improve, the tuned version is not recommended for use.

Default model - XGBoost Classifier¶

model building¶
In [341]:
xgb_classifier = XGBClassifier(random_state=1, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)
Out[341]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
Training set¶
In [342]:
confusion_matrix_sklearn(xgb_classifier, X_train, y_train)
In [343]:
xgb_classifier_model_train_perf = model_performance_classification_sklearn(
    xgb_classifier, X_train, y_train
)
xgb_classifier_model_train_perf
Out[343]:
Accuracy Recall Precision F1
0 0.843294 0.933182 0.847591 0.88833
Testing set¶
In [344]:
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)
In [345]:
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
    xgb_classifier, X_test, y_test
)
xgb_classifier_model_test_perf
Out[345]:
Accuracy Recall Precision F1
0 0.725536 0.8476 0.766248 0.804874
Observations:¶
  • XGBoost appears to be overfitting slightly.
  • Let's check whether tuning helps.

Hyperparameter Tuning - XGBoost Classifier¶

Parameters for XGBoost:

param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
model building¶
In [346]:
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters to choose from
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, param_grid, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of param_grid
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
Out[346]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=50,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
Training set¶
In [347]:
confusion_matrix_sklearn(xgb_tuned, X_train, y_train)
In [348]:
xgb_tuned_model_train_perf = model_performance_classification_sklearn(
    xgb_tuned, X_train, y_train
)
xgb_tuned_model_train_perf
Out[348]:
Accuracy Recall Precision F1
0 0.762671 0.886091 0.785884 0.832985
Testing set¶
In [349]:
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)
In [350]:
xgb_tuned_model_test_perf = model_performance_classification_sklearn(
    xgb_tuned, X_test, y_test
)
xgb_tuned_model_test_perf
Out[350]:
Accuracy Recall Precision F1
0 0.743851 0.87522 0.771809 0.820268
Observations:¶
  • XGBoost does not improve after tuning, hence it is not selected for use.

Stacking classifier¶

model building¶
In [351]:
estimators = [
    ("AdaBoost", ab_classifier),
    ("Gradient Boosting", gbc_tuned),
    ("Random Forest", rf_tuned),
]

final_estimator = xgb_tuned

stacking_classifier = StackingClassifier(
    estimators=estimators, final_estimator=final_estimator
)

stacking_classifier.fit(X_train, y_train)
Out[351]:
StackingClassifier(estimators=[('AdaBoost', AdaBoostClassifier(random_state=1)),
                               ('Gradient Boosting',
                                GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                                           learning_rate=0.05,
                                                           max_features=0.7,
                                                           random_state=1,
                                                           subsample=0.9)),
                               ('Random Forest',
                                RandomForestClassifier(max_samples=0.4,
                                                       min_samples_leaf=3,
                                                       n_estimators=110,
                                                       oob_score=True,
                                                       random_stat...
                                                 feature_types=None, gamma=3,
                                                 grow_policy=None,
                                                 importance_type=None,
                                                 interaction_constraints=None,
                                                 learning_rate=0.05,
                                                 max_bin=None,
                                                 max_cat_threshold=None,
                                                 max_cat_to_onehot=None,
                                                 max_delta_step=None,
                                                 max_depth=None,
                                                 max_leaves=None,
                                                 min_child_weight=None,
                                                 missing=nan,
                                                 monotone_constraints=None,
                                                 multi_strategy=None,
                                                 n_estimators=50, n_jobs=None,
                                                 num_parallel_tree=None,
                                                 random_state=1, ...))
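The stacked ensemble shown above combines AdaBoost, Gradient Boosting, and Random Forest base learners with an XGBoost meta-learner. As a minimal runnable sketch of the same construction on synthetic data (substituting a logistic-regression meta-learner for XGBoost and omitting the `init=AdaBoostClassifier(...)` warm start, to keep the example dependency-free):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the visa dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Base learners mirror the hyperparameters shown in the repr above
estimators = [
    ("AdaBoost", AdaBoostClassifier(random_state=1)),
    ("Gradient Boosting", GradientBoostingClassifier(
        learning_rate=0.05, subsample=0.9, max_features=0.7, random_state=1)),
    ("Random Forest", RandomForestClassifier(
        n_estimators=110, max_samples=0.4, min_samples_leaf=3,
        oob_score=True, random_state=1)),
]

# Out-of-fold predictions from the base learners feed the meta-learner
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(), cv=3)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```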
Training set¶
In [352]:
confusion_matrix_sklearn(stacking_classifier, X_train, y_train)
[Figure: confusion matrix heatmap for the stacking classifier on the training set]
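The `confusion_matrix_sklearn` helper is defined earlier in the notebook and renders a heatmap (the image is not preserved in this export). Assuming it takes the same `(model, X, y)` signature, a minimal count-only sketch of an equivalent:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

def confusion_matrix_counts(model, X, y):
    """Return the confusion matrix (rows = true labels, cols = predictions)."""
    return confusion_matrix(y, model.predict(X))

# Tiny demo with a model that always predicts the positive class
X = np.zeros((4, 2))
y = np.array([0, 1, 1, 1])
model = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print(confusion_matrix_counts(model, X, y))  # [[0 1], [0 3]]
```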
In [353]:
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
    stacking_classifier, X_train, y_train
)
stacking_classifier_model_train_perf
Out[353]:
Accuracy Recall Precision F1
0 0.776632 0.883321 0.802241 0.840831
Testing set¶
In [354]:
confusion_matrix_sklearn(stacking_classifier, X_test, y_test)
[Figure: confusion matrix heatmap for the stacking classifier on the testing set]
In [355]:
stacking_classifier_model_test_perf = model_performance_classification_sklearn(
    stacking_classifier, X_test, y_test
)
stacking_classifier_model_test_perf
Out[355]:
Accuracy Recall Precision F1
0 0.743328 0.866405 0.775557 0.818468
Observations:¶
  • The stacking classifier gives a good overall performance, generalizing well from the training to the testing set.
  • Accuracy could be improved further.
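The `model_performance_classification_sklearn` helper used in the cells above is defined earlier in the notebook and is not shown in this excerpt. Assuming it wraps sklearn's standard metrics into the one-row tables seen above, a minimal equivalent could look like:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def model_performance(model, X, y):
    """One-row DataFrame of classification metrics, matching the tables above."""
    pred = model.predict(X)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y, pred)],
            "Recall": [recall_score(y, pred)],
            "Precision": [precision_score(y, pred)],
            "F1": [f1_score(y, pred)],
        }
    )

# Tiny demo: a model that always predicts the positive class
X = [[0], [0], [0], [0]]
y = [1, 1, 1, 0]
demo = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print(model_performance(demo, X, y))
```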

Model Performance Comparison and Conclusions¶

Actionable Insights and Recommendations¶

Training set¶
In [356]:
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        dtree_estimator_model_train_perf.T,
        bagging_classifier_model_train_perf.T,
        bagging_estimator_tuned_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        ab_classifier_model_train_perf.T,
        abc_tuned_model_train_perf.T,
        gb_classifier_model_train_perf.T,
        gbc_tuned_model_train_perf.T,
        xgb_classifier_model_train_perf.T,
        xgb_tuned_model_train_perf.T,
        stacking_classifier_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[356]:
Decision Tree Tuned Decision Tree Bagging Classifier Tuned Bagging Classifier Random Forest Tuned Random Forest Adaboost Classifier Tuned Adaboost Classifier Gradient Boost Classifier Tuned Gradient Boost Classifier XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 1.0 0.715351 0.985255 0.995403 1.0 0.802590 0.737497 0.754878 0.756896 0.750953 0.843294 0.762671 0.776632
Recall 1.0 0.775036 0.986485 0.999916 1.0 0.911022 0.887518 0.879040 0.879795 0.873332 0.933182 0.886091 0.883321
Precision 1.0 0.793895 0.991395 0.993246 1.0 0.815157 0.759828 0.781318 0.783041 0.780085 0.847591 0.785884 0.802241
F1 1.0 0.784352 0.988934 0.996570 1.0 0.860427 0.818724 0.827303 0.828603 0.824079 0.888330 0.832985 0.840831
Testing set¶
In [357]:
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        dtree_estimator_model_test_perf.T,
        bagging_classifier_model_test_perf.T,
        bagging_estimator_tuned_model_test_perf.T,
        rf_estimator_model_test_perf.T,
        rf_tuned_model_test_perf.T,
        ab_classifier_model_test_perf.T,
        abc_tuned_model_test_perf.T,
        gb_classifier_model_test_perf.T,
        gbc_tuned_model_test_perf.T,
        xgb_classifier_model_test_perf.T,
        xgb_tuned_model_test_perf.T,
        stacking_classifier_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Tuned Decision Tree",
    "Bagging Classifier",
    "Tuned Bagging Classifier",
    "Random Forest",
    "Tuned Random Forest",
    "Adaboost Classifier",
    "Tuned Adaboost Classifier",
    "Gradient Boost Classifier",
    "Tuned Gradient Boost Classifier",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[357]:
Decision Tree Tuned Decision Tree Bagging Classifier Tuned Bagging Classifier Random Forest Tuned Random Forest Adaboost Classifier Tuned Adaboost Classifier Gradient Boost Classifier Tuned Gradient Boost Classifier XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 0.658163 0.714155 0.689168 0.729984 0.720042 0.740712 0.734432 0.742543 0.744767 0.744113 0.725536 0.743851 0.743328
Recall 0.743193 0.777473 0.767679 0.880509 0.842703 0.867973 0.886190 0.874829 0.875808 0.870127 0.847600 0.875220 0.866405
Precision 0.744505 0.790953 0.767078 0.755589 0.762901 0.772086 0.757408 0.770664 0.772460 0.774542 0.766248 0.771809 0.775557
F1 0.743849 0.784155 0.767378 0.813280 0.800819 0.817226 0.816754 0.819450 0.820894 0.819557 0.804874 0.820268 0.818468
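The wide comparison table is easier to read by asking pandas which model tops each metric. A small sketch using a hypothetical condensed subset of the test-set numbers above:

```python
import pandas as pd

# Condensed subset of the test comparison table (values from the notebook output)
models_test = pd.DataFrame(
    {
        "Tuned Random Forest": [0.740712, 0.867973, 0.772086, 0.817226],
        "Gradient Boost Classifier": [0.744767, 0.875808, 0.772460, 0.820894],
        "Stacking Classifier": [0.743328, 0.866405, 0.775557, 0.818468],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# For each metric (row), report the model (column) with the highest score
print(models_test.idxmax(axis=1))
```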

Observations:

  • After validating both the base and tuned models for overall performance, absence of overfitting, accuracy, and precision, the Gradient Boost Classifier is the best model for predicting visa application outcomes.
Important features on final model¶
In [358]:
feature_names = X_train.columns
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: horizontal bar chart of feature importances for the Gradient Boost Classifier]
In [359]:
print(
    pd.DataFrame(
        gb_classifier.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                     Imp
education_of_employee           0.478724
has_job_experience              0.163514
unit_of_wage_Year               0.117938
continent_Europe                0.061375
region_of_employment_Midwest    0.031557
region_of_employment_South      0.021650
hourly_wage                     0.020223
continent_North America         0.018159
no_of_employees                 0.017129
years_since_estab               0.014633
region_of_employment_West       0.010282
continent_Asia                  0.009764
region_of_employment_Northeast  0.009227
continent_South America         0.008435
full_time_position              0.006630
unit_of_wage_Week               0.004165
requires_job_training           0.003508
unit_of_wage_Month              0.002378
continent_Oceania               0.000709
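Impurity-based importances like those above can overstate high-cardinality or continuous features. As a complementary sanity check (not part of the original notebook), permutation importance shuffles one feature at a time and measures the resulting drop in score; a self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in: 6 features, only 3 of which are informative
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=1)
clf = GradientBoostingClassifier(random_state=1).fit(X, y)

# Mean score drop over 5 shuffles per feature
result = permutation_importance(clf, X, y, n_repeats=5, random_state=1)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```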
Observations:¶

From EDA

  • 40.2% of applicants have a bachelor's degree and 37.8% have a master's; combined, these two categories represent 78% of all visa applicants.
  • The vast majority of candidates (66.2%) come from Asia.
  • More than half of applicants (58.1%) have job experience.
  • The largest share of applicants are paid on a yearly basis.
  • Most applicants (88.4%) do not require any job training.
  • Northeast and South are the regions with the most visa requests, at 28.2% and 27.5% respectively.
  • The approval rate is 66.8%, more than two-thirds of all applications.
  • Education level and visa approval are strongly related: the majority of applicants with doctorate and master's degrees get approved.
  • Midwest and South are the regions offering the most employment opportunities to applicants.

From Classifiers:

  • All models perform similarly, with the Gradient Boost Classifier showing a slight edge: it has the best test F1 score (0.821) and accuracy (0.745), is close to the best in recall (0.876), and fourth in precision (0.772).
  • The most important features for the model are education_of_employee (0.479), has_job_experience (0.164), and unit_of_wage_Year (0.118); these variables strongly drive visa certification odds.
  • continent_Europe and region_of_employment_Midwest are also influential, though far less so than the top three.

Profiles with higher chances of being certified:

  • Education level: having a doctorate or master's degree provides the highest chances, followed by a bachelor's degree.
  • Job experience: applicants with job experience under their belt are more likely to be approved.
  • Unit of wage: applicants paid on a yearly basis have better chances of being certified.
  • Other important features:
    • Continent: applicants from Europe have better odds of having their visa certified.
    • Region of employment: applicants who will work in the Midwest have a higher probability of getting a visa.

Profiles with fewer chances of being certified:

  • Education level: applicants with only a high school education are much less likely to be approved.
  • Job experience: applicants without any job experience are unlikely to get a visa.
  • Continent and unit of wage: applicants from Oceania who are paid on a monthly basis are much less likely to have their visa certified.

Suggestions:

  • Collect applicants' areas of specialization to better understand their profiles. Applicants without a high school diploma but with strong skills in shortage areas such as construction could still be good cases.
  • Company sector: capture the sector each prospective employer operates in; this would help identify which economic areas are most likely to hire foreign workers and sponsor visas.
  • Age and family status: information such as age and number of family members could also be predictive for the model and help clarify which profiles are most likely to be certified.