Case Study – Preprocessing
Data preprocessing denotes the set of tasks applied to the target dataset to ensure consistency in naming conventions, encoding structures, and attribute measures. Preprocessing mainly includes data integration and data cleaning. Data integration combines data from multiple sources into a single coherent dataset. The dataset is then cleaned, where the term ‘cleaning’ denotes processing the data to reduce noise and to treat missing values.
A data transformation procedure may be applied to the preprocessed dataset prior to classification. For example, normalization is used because neural network and regression-based techniques rely on distance measurements for analysis; it transforms attribute values into a small range such as [-1.0, +1.0] or [0.0, 1.0]. Occasionally, researchers follow aggregation or consolidation approaches for data transformation.
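As a quick illustration of the normalization just mentioned, here is a minimal min-max rescaling sketch that maps each column to the range [0.0, 1.0]. The column names and values are illustrative toy data, not taken from the loan dataset:

```python
import pandas as pd

# Toy data (illustrative only): min-max normalization rescales each
# column so that its minimum maps to 0.0 and its maximum to 1.0.
df = pd.DataFrame({'income': [150, 5000, 81000], 'loan': [9, 120, 700]})
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.min().tolist(), normalized.max().tolist())  # [0.0, 0.0] [1.0, 1.0]
```

The same formula, shifted and scaled, yields any target range such as [-1.0, +1.0].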
In this tutorial, we will perform data preprocessing for the “Loan Prediction Problem Dataset”. This case study is done using Python. You can download the dataset from here.
Loan Prediction Problem Dataset
Here is the description of the variables present in the dataset:
| Variable | Description | Type |
| --- | --- | --- |
| Loan_ID | Unique Loan ID | Object |
| Gender | Male / Female | Categorical |
| Married | Applicant married (Y/N) | Categorical |
| Dependents | Number of dependents | Categorical |
| Education | Education (Graduate / Not Graduate) | Categorical |
| Self_Employed | Self-employed (Y/N) | Categorical |
| ApplicantIncome | Applicant’s income | Numerical |
| CoapplicantIncome | Coapplicant’s income | Numerical |
| LoanAmount | Loan amount in thousands | Numerical |
| Loan_Amount_Term | Term of loan in months | Numerical |
| Credit_History | Credit history meets guidelines | Numerical |
| Property_Area | Urban / Semiurban / Rural | Categorical |
| Loan_Status | Loan approved (Y/N) – class label | Categorical |
Preprocessing and analysis in Python using Pandas
The Python module pandas is one of the most useful data-analysis libraries in Python. We will now use it to read the dataset and perform the preprocessing for this problem. Additionally, we need the scikit-learn module (imported as sklearn in the program) for label encoding. See the code below.
loan_preprocessing.py
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Reading the dataset into a dataframe using Pandas
df = pd.read_csv('loandata.csv') # the downloaded training file, renamed to loandata.csv
# Gathering Info
print('\nInfo of numeric columns: ')
print(df.describe()) # takes numeric values only
print('\nInfo of all columns: ')
print(df.columns)
print(df.dtypes)
print('\nSize Info: ')
print(df.shape)
# Checking missing values in the dataset
print('\nMissing values info in the dataset: ')
# isnull().sum() counts the missing values in each column
# (the aggregation runs down the rows, i.e. along axis=0, the default)
print(df.isnull().sum())
# Replacing missing values for numeric attributes using the mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
# Credit_History is binary, so we fill with its most frequent value (1.0)
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
# Replacing missing values for categorical attributes using the most frequent value
# Example: print(df['Gender'].value_counts()) gives the frequency count for 'Gender'
df['Gender'] = df['Gender'].fillna('Male')
df['Married'] = df['Married'].fillna('Yes')
df['Dependents'] = df['Dependents'].fillna('0')
df['Self_Employed'] = df['Self_Employed'].fillna('No')
# Removing unnecessary column(s)
# Example: Loan_ID is only an identifier, so it is dropped
df.drop(['Loan_ID'], axis=1, inplace=True)
print('\n', df.head(5))
# Converting all categorical variables into numeric by encoding the categories
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
print(df.dtypes)
# Saving the dataframe "df" as "loan.csv" on our local machine
# index=False prevents the row index from being written as an extra column
df.to_csv('loan.csv', index=False)
Output:
Info of numeric columns:
ApplicantIncome ... Credit_History
count 614.000000 ... 564.000000
mean 5403.459283 ... 0.842199
std 6109.041673 ... 0.364878
min 150.000000 ... 0.000000
25% 2877.500000 ... 1.000000
50% 3812.500000 ... 1.000000
75% 5795.000000 ... 1.000000
max 81000.000000 ... 1.000000
[8 rows x 5 columns]
Info of all columns:
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
Loan_ID object
Gender object
Married object
Dependents object
Education object
Self_Employed object
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area object
Loan_Status object
dtype: object
Size Info:
(614, 13)
Missing values info in the dataset:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Gender Married ... Property_Area Loan_Status
0 Male No ... Urban Y
1 Male Yes ... Rural N
2 Male Yes ... Urban Y
3 Male Yes ... Urban Y
4 Male No ... Urban Y
[5 rows x 12 columns]
Gender int32
Married int32
Dependents int32
Education int32
Self_Employed int32
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area int32
Loan_Status int32
dtype: object
N.B. Some of the columns are elided (shown as ‘...’) in the printed output above.
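The hard-coded fill values used earlier for the categorical columns ('Male', 'Yes', '0', 'No') are each column's most frequent value, found via value_counts(). A more general sketch fills with mode() directly, shown here on toy data rather than the actual loan dataset:

```python
import pandas as pd

# Toy data (illustrative only): fill missing categorical values with the
# column's most frequent value, without hard-coding it.
df = pd.DataFrame({'Gender': ['Male', None, 'Female', 'Male']})
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
print(df['Gender'].tolist())  # ['Male', 'Male', 'Female', 'Male']
```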
After the data preprocessing is done, the dataset (saved as ‘loan.csv‘) is ready for classification. But before moving on to classification, it is better to know how the preprocessed dataset is divided into training and test datasets. This is called distribution of the dataset, and it is applied before the actual classification task begins.
Distribution of Dataset
We can divide the dataset in two major ways using —
1. predefined division method
2. k-fold Cross-validation (CV) method
We will describe each of them in brief with a coding example.
1. Predefined division method
Using this method, we can divide the preprocessed dataset into training and test sets using a fraction value. Although the two resulting sets are disjoint, the evaluation depends on a single random partition, so an unrepresentative split can allow bias to creep in. As a result, it may affect the quality of classification.
For example, if we have a dataset of N (= 500) data records, then using the fraction value t = 0.3 (which indicates the portion of test data in the original dataset), the number of data records in the training set will be
N × (1 – t) = 500 × 0.7 = 500 × 7/10 = 350
and the number of data records in the test set will be
N × t = 500 × 0.3 = 500 × 3/10 = 150
Python code snippet:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('loan.csv') # Load the preprocessed dataset
X = df.drop('Loan_Status', axis=1) # Class label column is 'Loan_Status'
y = df['Loan_Status']
# Split data into training (70%) and test (30%) sets using predefined division
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
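We can sanity-check the arithmetic worked out above (350 training and 150 test records for N = 500 and t = 0.3) on a toy dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of N = 500 records; test_size=0.3 holds out 30% for testing
X = np.arange(500).reshape(-1, 1)
y = np.zeros(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))  # 350 150
```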
2. k-fold Cross-validation (CV) method
Using this method, we divide the preprocessed dataset into k equal-sized folds. In each of the k rounds, one fold serves as the test set and the remaining k − 1 folds form the training set, so within each round the two sets are completely disjoint. The assignment of records to folds is done randomly, and over all rounds every record is tested exactly once.
For example, if we have a dataset of N (=500) data records, using 10-fold CV (i.e. k=10), the number of data records in the training set will be
N × (k – 1)/k = 500 × 9/10 = 450
and the number of data records in the test set will be
N × 1/k = 500 × 1/10 = 50
Python code snippet:
import pandas as pd
from sklearn.model_selection import KFold
df = pd.read_csv('loan.csv') # Load the preprocessed dataset
X = df.drop('Loan_Status', axis=1) # Class label column is 'Loan_Status'
y = df['Loan_Status']
# Split data into training and test sets using 10-fold CV method
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
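The fold sizes worked out above (450 training and 50 test records per round for N = 500 and k = 10) can be verified on a toy dataset:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of N = 500 records split into k = 10 folds
X = np.arange(500).reshape(-1, 1)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
tested = []
for train_index, test_index in kf.split(X):
    assert set(train_index).isdisjoint(test_index)   # disjoint within each round
    assert len(train_index) == 450 and len(test_index) == 50
    tested.extend(test_index)
print(len(set(tested)))  # 500 -> every record appears in exactly one test fold
```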
In the next case study, we will perform classification (including distribution of the dataset) on this preprocessed dataset using several important ML-based classifiers.