Machine Learning Project - customer target and expected revenue in banks

By Ivory YAHSUAN TAI | Data Scientist | Master Student in Data Science in EMLyon Business School

Jupiter Notebook link: https://ivorydaae.github.io/ML_Banking/#index
Github link: https://github.com/ivorydaae/ML_Banking

Contact

If you have further suggestions with this project, or you would like to contact me for opportunities, don't hesitate to contact me.

Ivory YAHSUAN TAI
Based in Paris, France
Data Scientist | Master Student in Data Science in EMLyon Business School

Project Sumamary

Use customer data in banking to maximize revenue from marketing campaigns.

Purpose

The purpose of this project is to choose 100 clients of the bank to sell them one of the three offers in bank (Mutual Funds, Credit Card, Consumer Loan) to maximize the predicted revenue for the bank by using machine learning approaches.

Data

For the analysis, I use the data “DMML_Project_Dataset.xlsx”.
The excel file contains several tables (on 4 excel sheets):
§ Social-demographical data (client ID, age, gender, tenure in a bank)
§ Products and actual balance (current account, saving account, mutual funds, overdraft, credit card, consumer loan)
§ Inflow/outflow on C/A, aggregated card turnover (monthly average over past 3 months)
§ Clients actual sales and revenues. Data is available for 60 % of the clients.

Conditions:

The bank has capacity to contact around 100 clients with a marketing offer and each client can be targeted only once (you can't retarget).

Expected result:

§ Which clients are to be targeted with which offer?
§ What would be the expected revenue based on your strategy?


Import library

In [1]:
import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import & merge data

In [35]:
# Import excel sheets
Description = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Description')

Inflow_Outflow = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Inflow_Outflow')
Soc_Dem = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Soc_Dem')
Products_ActBalance = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Products_ActBalance')
Sales_Revenues = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Sales_Revenues')
In [36]:
## Merge
df = pd.merge(Soc_Dem, Inflow_Outflow, on='Client', how='inner')
df = pd.merge(df, Products_ActBalance, on='Client', how='inner')

df = df.replace(np.nan,0)

# df_all: Training data with targets
df_all = pd.merge(Sales_Revenues, df, on='Client', how='inner')
df_all = df_all.replace(np.nan,0)

Adding features (for modeling)

For the modeling later, we tried to add two features other than the original ones:

  • Saving_Ratio: to see if the client has savings or not
  • age_20: to see if the client is more than 20 years old
In [37]:
# df_feature: Training data without targets
df_feature = df.set_index('Client')

# Add feature 'Saving ratio'
df_feature['Saving_Ratio'] = (df_feature['ActBal_CA']+df_feature['ActBal_SA']+df_feature['ActBal_MF'])/(df_feature['ActBal_CA']+df_feature['ActBal_SA']+df_feature['ActBal_MF']+df_feature['ActBal_OVD']+df_feature['ActBal_CC']+df_feature['ActBal_CL'])
df_feature['age_20'] = (df_feature['Age']>20).apply(lambda x: int(x))
#df_feature['VolumeCred_2'] = (df_feature['VolumeCred']**2)



# Clean df_feature
df_feature = df_feature.replace(np.nan,0)
df_feature
Out[37]:
Sex Age Tenure VolumeCred VolumeCred_CA TransactionsCred TransactionsCred_CA VolumeDeb VolumeDeb_CA VolumeDebCash_Card ... Count_CC Count_CL ActBal_CA ActBal_SA ActBal_MF ActBal_OVD ActBal_CC ActBal_CL Saving_Ratio age_20
Client
909 M 21 27 789.129643 738.230000 4 3 450.678571 448.892857 178.571429 ... 0.0 1.0 4.710714 0.000000 0.000000 0.0 0.00 4291.996429 0.001096 1
1217 M 38 165 0.002143 0.002143 1 1 714.285714 714.285714 0.000000 ... 0.0 0.0 6752.244643 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
850 F 49 44 1392.402857 1392.402857 3 3 1226.345357 1226.345357 0.000000 ... 0.0 0.0 43.523214 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
1473 M 54 34 1787.127500 939.128929 14 5 3875.137857 3794.580714 357.142857 ... 1.0 1.0 29.024286 14447.801429 0.000000 0.0 653.91 1132.590357 0.890152 1
1038 M 29 106 0.006071 0.006071 1 1 0.000000 0.000000 0.000000 ... 0.0 0.0 27.035714 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
409 M 31 91 469.179643 469.179643 3 3 465.092857 465.092857 178.571429 ... 0.0 0.0 348.402857 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
384 M 23 160 87.500357 87.500357 2 2 88.443929 88.443929 50.000000 ... 0.0 0.0 2418.876786 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
977 M 46 59 71.428571 71.428571 1 1 76.103571 76.103571 75.000000 ... 0.0 0.0 2639.308571 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1
629 F 61 173 1064.190000 1064.190000 3 3 817.462143 817.462143 660.714286 ... 0.0 0.0 61.766786 0.000000 34387.583571 0.0 0.00 0.000000 1.000000 1
1466 M 63 97 742.597143 742.597143 4 4 624.428571 624.428571 71.428571 ... 0.0 0.0 21.627500 0.000000 0.000000 0.0 0.00 0.000000 1.000000 1

1587 rows × 31 columns

Selecting features (by correlation)

To select the features we compute the correlation of features, and generate a heat map.

Correlation Heat Map:

  • The darker the color, the higher the correlation between two features
In [38]:
# Correlation
corr = df_feature.corr()

# Heatmap
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(10, 6))
    ax = sns.heatmap(corr, mask=mask,square=True, cmap=sns.cm.rocket_r)

Delete one of the two features if their correlation > 0.92:

To select the features, we drop one of the two features if they have correlation more than 0.92.

In [63]:
#selecting features: correlation >= 0.92
columns = np.full((corr.shape[0],), True, dtype=bool)
columns
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.92:
            if columns[j]:
                print(corr[j:j+1].index)  ## print dropped features
                columns[j] = False
                

selected_columns = corr.columns[columns]
df_selected_c = df_feature[selected_columns]

# df_selected: data with selected features & targets  (train data only)
# df_selected_all: data with selected features & targets  (train+target data)
sr_c = Sales_Revenues.set_index('Client')
df_selected_all = pd.merge(sr_c, df_selected_c, on='Client', how='right')
df_selected = pd.merge(df_selected_c, Sales_Revenues, on='Client', how='inner')
df_selected_all   #target data: SMF-RCL == NAN
df_selected_all['Sale_MF'].isna().sum()
df_selected = df_selected.set_index('Client')
#df_selected
Index(['VolumeCred_CA'], dtype='object')
Index(['TransactionsCred_CA'], dtype='object')
Index(['VolumeDeb_CA'], dtype='object')

After this step, we have 2 dataframes:

  1. df_selected: data with selected features & targets (train data only)
  2. df_selected_all: data with selected features & targets (train+prediction data)

Prepare the data for modeling (to numpy)

In [40]:
#Manipuling data: Extract X & Ys(targets) into numpy data
df_selected_np = df_selected.to_numpy()
X = df_selected_np[:,:df_selected.shape[1]-6]

# y: all 6 targets
y = df_selected_np[:,df_selected.shape[1]-6:df_selected.shape[1]]

# seperate the targets
y_sale_MF = df_selected_np[:,df_selected.shape[1]-6]
y_sale_CC = df_selected_np[:,df_selected.shape[1]-5]
y_sale_CL = df_selected_np[:,df_selected.shape[1]-4]

y_revenue_MF = df_selected_np[:,df_selected.shape[1]-3]
y_revenue_CC = df_selected_np[:,df_selected.shape[1]-2]
y_revenue_CL = df_selected_np[:,df_selected.shape[1]-1]

#X[0]

Split train & test data

In [41]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2,
                                                            random_state=77)
In [42]:
ytrain_SMF = ytrain[:,0]
ytrain_SCC = ytrain[:,1]
ytrain_SCL = ytrain[:,2]
ytrain_RMF = ytrain[:,3]
ytrain_RCC = ytrain[:,4]
ytrain_RCL = ytrain[:,5]

Prepare prediction data (for later to predict sales & revenue)

Here we try to extract the data with clients who don't have any sales_revenue data for prediction later on:

In [43]:
# Target data: the data we want to predict 
df_target = df_selected_all[df_selected_all.index.isin(sr_c.index) == False]
df_target = df_target.replace(np.nan,0)
df_target_x = df_target.iloc[:,6:]

df_target_x_np = df_target_x.to_numpy()
Xtarget = df_target_x_np
#Xtarget

Xtarget: np.array data with features of the clients who are't sold anything yet.

Modeling

In [44]:
#import evaluation
from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate

#import Kfold, pipeline, scaling
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import PolynomialFeatures

#import Models
##CLF
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
##REG
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

Cross-Validation using KFold

In our project, we use kfold cross-validation in all the models.

In [45]:
## Cross- Val KFold
kfold = KFold(n_splits=10, random_state=77, shuffle=True)

I. Classification problem: SMF, SCC, SCL

For every model, we will generate a matrix of precision, recall, f1, accuracy scores to evaluate the model and fine-tuning them.

1. Logistic Regression

In [46]:
## Logistic Regression: clf_logreg

clf_logreg = Pipeline((
            ("scaler", StandardScaler()),
            ("logreg", LogisticRegression(penalty='l2',  C=100, max_iter=1000))
        ))

#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF    = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF        = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF       = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold)

##Sales Credit Card
precision_SCC = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC    = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC        = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC       = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold)

##Sales Consumer Loan
precision_SCL = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL    = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL        = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL       = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold)

#Print score
print("Logistic Regression:") 
print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0)) 
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0)) 
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0)) 
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0)) 
print("----------------------------------") 


print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0)) 
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0)) 
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0)) 
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0)) 
print("----------------------------------") 

print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0)) 
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0)) 
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0)) 
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0)) 
Logistic Regression:
MF precision: 69.28%
MF recall: 16.80%
MF f1: 25.48%
MF accuracy: 80.53%
----------------------------------
CC precision: 67.94%
CC recall: 17.16%
CC f1: 25.15%
CC accuracy: 75.92%
----------------------------------
CL precision: 61.24%
CL recall: 28.23%
CL f1: 37.98%
CL accuracy: 73.03%

> Logistic Regression result:

  1. SMF: ok - low recall and f1, high accuracy and precision
  2. SCC: bad - low recall and f1, not high precision, high accuracy
  3. SCL: ok - low recall, ok for precision and f1, high acc

2. Decision Tree

In [47]:
## Decision Tree Classifier: clf_tree

clf_tree = Pipeline((
            ("scaler", StandardScaler()),
            ("tree", DecisionTreeClassifier(max_depth =30,random_state=77)),
        ))

#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF    = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF        = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF       = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold)

##Sales Credit Card
precision_SCC = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC    = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC        = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC       = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold)

##Sales Consumer Loan
precision_SCL = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL    = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL        = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL       = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold)

#Print score
print("Decision Tree Classifier:") 

print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0)) 
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0)) 
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0)) 
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0)) 
print("----------------------------------") 


print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0)) 
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0)) 
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0)) 
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0)) 
print("----------------------------------") 

print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0)) 
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0)) 
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0)) 
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0)) 
Decision Tree Classifier:
MF precision: 21.72%
MF recall: 24.89%
MF f1: 22.72%
MF accuracy: 65.79%
----------------------------------
CC precision: 31.39%
CC recall: 35.13%
CC f1: 32.25%
CC accuracy: 64.47%
----------------------------------
CL precision: 34.83%
CL recall: 39.29%
CL f1: 36.45%
CL accuracy: 59.34%

> Decision Tree result:

  1. SMF: bad - low recall and f1, high accuracy and precision
  2. SCC: bad - low recall and f1, high accuracy and precisiony
  3. SCL: bad - low recall and f1, high accuracy and precisiony

3. Random Forest

In [48]:
## Random Forest: clf_forest

clf_forest = Pipeline((
            ("scaler", StandardScaler()),
            ("forest", RandomForestClassifier(n_estimators=100, max_depth=20, random_state=54)),
        ))

#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF    = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF        = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF       = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold)

##Sales Credit Card
precision_SCC = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC    = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC        = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC       = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold)

##Sales Consumer Loan
precision_SCL = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL    = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL        = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL       = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold)

#Print score
print("Random Forest Classifier:") 

print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0)) 
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0)) 
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0)) 
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0)) 
print("----------------------------------") 


print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0)) 
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0)) 
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0)) 
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0)) 
print("----------------------------------") 

print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0)) 
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0)) 
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0)) 
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0)) 
Random Forest Classifier:
MF precision: 53.33%
MF recall: 8.85%
MF f1: 14.61%
MF accuracy: 79.34%
----------------------------------
CC precision: 67.28%
CC recall: 19.44%
CC f1: 29.06%
CC accuracy: 77.37%
----------------------------------
CL precision: 53.59%
CL recall: 22.67%
CL f1: 31.42%
CL accuracy: 71.05%

> Forest result:

  1. SMF: bad - low recall and f1, ok precision, high acc
  2. SCC: ok - low recall and f1, ok precision, high acc
  3. SCL: bad - low recall and f1, ok precision, high acc

4. SVM

In [49]:
## SVM: clf_svm

clf_svm = Pipeline((
            ("scaler", StandardScaler()),
            ("LinearSVC", LinearSVC(random_state=77, tol=1, max_iter=1000)),
        ))
clf_svm = clf_svm.fit(Xtrain, ytrain_SMF)

#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF    = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF        = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF       = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold)

##Sales Credit Card
precision_SCC = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC    = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC        = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC       = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold)

##Sales Consumer Loan
precision_SCL = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL    = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL        = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL       = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold)

#Print score
print("SVM:") 

print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0)) 
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0)) 
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0)) 
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0)) 
print("----------------------------------") 


print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0)) 
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0)) 
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0)) 
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0)) 
print("----------------------------------") 

print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0)) 
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0)) 
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0)) 
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0)) 
SVM:
MF precision: 70.12%
MF recall: 13.77%
MF f1: 22.12%
MF accuracy: 80.39%
----------------------------------
CC precision: 70.86%
CC recall: 14.13%
CC f1: 21.88%
CC accuracy: 76.05%
----------------------------------
CL precision: 61.05%
CL recall: 26.45%
CL f1: 36.56%
CL accuracy: 72.89%

> SVM result:

  1. SMF: ok - low recall and f1, high precision and acc
  2. SCC: ok - low recall and f1, ok precision, high acc
  3. SCL: ok - low recall and f1, high precision and acc

Classification models choosing:

For each target variable, we will use the model with has the best scores:

  • SMF: Logistic Regression
  • SCC: Random Forest
  • SML: Logistic Regression

II. Regression problem: RMF, RCC, RCL

For every model, we will generate a matrix of Error, MAE, MSE to evaluate the model and fine-tuning them.

1. Linear Models

1.1 Linear Regression

In [50]:
## Linear Regression: reg_lin

reg_lin = Pipeline((
            ("scaler", StandardScaler()),
            ("linear", LinearRegression())
        ))

#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Credit Card
err_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Consumer Loan
err_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')


#Print score
print("Linear Regression:") 
print("MF Error: %.2f" % (err_RMF.mean())) 
print("MF MAE:   %.2f" % (MAE_RMF.mean()))
print("MF MSE:   %.2f" % (MSE_RMF.mean()))
print("----------------------------------") 

print("CC Error: %.2f" % (err_RCC.mean())) 
print("CC MAE:   %.2f" % (MAE_RCC.mean()))
print("CC MSE:   %.2f" % (MSE_RCC.mean()))
print("----------------------------------") 

print("CL Error: %.2f" % (err_RCL.mean())) 
print("CL MAE:   %.2f" % (MAE_RCL.mean()))
print("CL MSE:   %.2f" % (MSE_RCL.mean()))
Linear Regression:
MF Error: -0.60
MF MAE:   -3.85
MF MSE:   -121.56
----------------------------------
CC Error: -0.40
CC MAE:   -4.98
CC MSE:   -383.12
----------------------------------
CL Error: 0.03
CL MAE:   -4.60
CL MSE:   -41.88

> Linear Regression result:

  1. RMF: ok - ok MSE
  2. RCC: bad - high MSE
  3. RCL: good - low MSE

1.2 Lasso

In [51]:
## Linear Regression: reg_lin

reg_lasso = Pipeline((
            ("scaler", StandardScaler()),
            ("lasso", Lasso(alpha=1))
        ))

#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Credit Card
err_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')



##Revenues Consumer Loan
err_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')


#Print score
print("Lasso:") 
print("MF Error: %.2f" % (err_RMF.mean())) 
print("MF MAE:   %.2f" % (MAE_RMF.mean()))
print("MF MSE:   %.2f" % (MSE_RMF.mean()))
print("----------------------------------") 

print("CC Error: %.2f" % (err_RCC.mean())) 
print("CC MAE:   %.2f" % (MAE_RCC.mean()))
print("CC MSE:   %.2f" % (MSE_RCC.mean()))
print("----------------------------------") 

print("CL Error: %.2f" % (err_RCL.mean())) 
print("CL MAE:   %.2f" % (MAE_RCL.mean()))
print("CL: MSE:   %.2f" % (MSE_RCL.mean()))
Lasso:
MF Error: -0.05
MF MAE:   -3.34
MF MSE:   -111.89
----------------------------------
CC Error: -0.09
CC MAE:   -4.52
CC MSE:   -375.87
----------------------------------
CL Error: -0.03
CL MAE:   -4.98
CL: MSE:   -44.22

> Lasso result:

  1. RMF: ok - ok MSE
  2. RCC: bad - high MSE
  3. RCL: good - low MSE

2. Decision Tree regression

In [52]:
## Decision Tree regression: reg_tree

reg_tree = Pipeline((
            ("scaler", StandardScaler()),
            ("tree", DecisionTreeRegressor(criterion='mse',max_depth=100))
        ))

#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Credit Card
err_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')



##Revenues Consumer Loan
err_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')


#Print score
print("Decision Tree regression:") 
print("MF Error: %.2f" % (err_RMF.mean())) 
print("MF MAE:   %.2f" % (MAE_RMF.mean()))
print("MF MSE:   %.2f" % (MSE_RMF.mean()))
print("----------------------------------") 

print("CC Error: %.2f" % (err_RCC.mean())) 
print("CC MAE:   %.2f" % (MAE_RCC.mean()))
print("CC MSE:   %.2f" % (MSE_RCC.mean()))
print("----------------------------------") 

print("CL Error: %.2f" % (err_RCL.mean())) 
print("CL MAE:   %.2f" % (MAE_RCL.mean()))
print("CL MSE:   %.2f" % (MSE_RCL.mean()))
Decision Tree regression:
MF Error: -8.38
MF MAE:   -4.61
MF MSE:   -312.00
----------------------------------
CC Error: -54.75
CC MAE:   -6.20
CC MSE:   -991.96
----------------------------------
CL Error: -1.13
CL MAE:   -5.30
CL MSE:   -86.64

> Decision Tree result:

  1. RMF: bad - high MSE
  2. RCC: bad - very high MSE
  3. RCL: bad - high MSE

3. Random Forest regression

In [53]:
## Random Forest regression: reg_forest

reg_forest = Pipeline((
            ("scaler", StandardScaler()),
            ("forest", RandomForestRegressor(n_estimators=10, max_depth=1000))
        ))

#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Credit Card
err_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')



##Revenues Consumer Loan
err_RCL = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')


#Print score
print("Random Forest regression:") 
print("MF Error: %.2f" % (err_RMF.mean())) 
print("MF MAE:   %.2f" % (MAE_RMF.mean()))
print("MF MSE:   %.2f" % (MSE_RMF.mean()))
print("----------------------------------") 

print("CC Error: %.2f" % (err_RCC.mean())) 
print("CC MAE:   %.2f" % (MAE_RCC.mean()))
print("CC MSE:   %.2f" % (MSE_RCC.mean()))
print("----------------------------------") 

print("CL Error: %.2f" % (err_RCL.mean())) 
print("CL MAE:   %.2f" % (MAE_RCL.mean()))
print("CL MSE:   %.2f" % (MSE_RCL.mean()))
Random Forest regression:
MF Error: -1.35
MF MAE:   -4.19
MF MSE:   -142.84
----------------------------------
CC Error: -11.46
CC MAE:   -5.37
CC MSE:   -47.79
----------------------------------
CL Error: -0.17
CL MAE:   -4.91
CL MSE:   -86.64

> Random Forest result:

  1. RMF: ok - ok MSE
  2. RCC: good - low MSE
  3. RCL: bad - high MSE

4. SVR

In [54]:
## SVR: svr

reg_svr = Pipeline((
            ("scaler", StandardScaler()),
            ("svr", SVR(kernel='rbf', gamma='scale', C=10))
        ))

#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')


##Revenues Credit Card
err_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')

##Revenues Consumer Loan
err_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')

#Print score
print("SVR:") 
print("MF Error: %.2f" % (err_RMF.mean())) 
print("MF MAE:   %.2f" % (MAE_RMF.mean()))
print("MF MSE:   %.2f" % (MSE_RMF.mean()))
print("----------------------------------") 

print("CC Error: %.2f" % (err_RCC.mean())) 
print("CC MAE:   %.2f" % (MAE_RCC.mean()))
print("CC MSE:   %.2f" % (MSE_RCC.mean()))
print("----------------------------------") 

print("CL Error: %.2f" % (err_RCL.mean())) 
print("CL MAE:   %.2f" % (MAE_RCL.mean()))
print("CL MSE:   %.2f" % (MSE_RCL.mean()))
SVR:
MF Error: -0.04
MF MAE:   -2.21
MF MSE:   -114.26
----------------------------------
CC Error: -0.02
CC MAE:   -3.06
CC MSE:   -380.53
----------------------------------
CL Error: -0.06
CL MAE:   -3.65
CL MSE:   -46.20

> SVR result:

  1. RMF: ok - ok MSE
  2. RCC: bad - high MSE
  3. RCL: good - low MSE

Regression models choosing:

For each target variable, we will use the model with the least errors:

  • RMF: Lasso
  • RCC: Random Forest
  • RML: Linear Regression

Modeling: Summary

For each target variables, we choose these models:

  • SMF: Logistic Regression
  • SCC: Random Forest
  • SML: Logistic Regression
  • RMF: Lasso
  • RCC: Random Forest
  • RML: Linear Regression

Prediction

After fine-tuning the models, we will predict the sales and revenues on the clients who have'nt been sold anything yet.

Clf Prediction

  • pre_SMF: an array with the prediction of sales of mutual funds (0 or 1)
  • pre_SCC: an array with the prediction of sales of credit card (0 or 1)
  • pre_SCC: an array with the prediction of sales of consumer loan (0 or 1)
In [55]:
# SMF prediction: logreg
clf_logreg.fit(Xtrain, ytrain_SMF)
pre_SMF = clf_logreg.predict(Xtarget).astype(int)

# SCC prediction: forest
clf_forest.fit(Xtrain, ytrain_SCC)
pre_SCC = clf_forest.predict(Xtarget).astype(int)

# SCL prediction: logreg
clf_logreg.fit(Xtrain, ytrain_SCL)
pre_SCL = clf_logreg.predict(Xtarget).astype(int)

Reg Prediction

  • pre_RMF: an array with the prediction of revenues of mutual funds (0 or 1)
  • pre_RCC: an array with the prediction of revenues of credit card (0 or 1)
  • pre_RCC: an array with the prediction of revenues of consumer loan (0 or 1)
In [56]:
# RMF prediction: lasso
reg_lasso.fit(Xtrain, ytrain_RMF)
pre_RMF = reg_lasso.predict(Xtarget)

# RCC prediction: forest
reg_forest.fit(Xtrain, ytrain_RCC)
pre_RCC = reg_forest.predict(Xtarget)

# RCL prediction: lin
reg_lin.fit(Xtrain, ytrain_RCL)
pre_RCL = reg_lin.predict(Xtarget)

Select 100 clients to maximize revenue

In [57]:
## only one sale to one person
product_MF = np.multiply(pre_SMF,pre_RMF)
product_CC = np.multiply(pre_SCC,pre_RCC)
product_CL = np.multiply(pre_SCL,pre_RCL)

for i in range(product_MF.shape[0]):
    if product_CC[i] > product_MF[i]:
        product_MF[i] = 0.0
        pre_SMF[i] = 0
        pre_RMF[i] = 0.0
        if product_CL[i] > product_CC[i]:
            product_CC[i] = 0
            pre_SCC[i] = 0
            pre_RCC[i] = 0
        else:
            product_CL[i] = 0
            pre_SCL[i] = 0
            pre_RCL[i] = 0
    else:
        product_CC[i] = 0.0
        pre_SCC[i] = 0
        pre_RCC[i] = 0.0
        if product_MF[i] > product_CL[i]:
            product_CL[i] = 0
            pre_SCL[i] = 0
            pre_RCL[i] = 0
        else:
            product_MF[i] = 0
            pre_SMF[i] = 0
            pre_RMF[i] = 0

            
## create sale_type array
pre_sale = np.empty(product_CL.shape[0], dtype='object')
for i in range(product_CL.shape[0]):
    if pre_SMF[i] == 1:
        pre_sale[i] = 'MF'
    elif pre_SCC[i] == 1:
        pre_sale[i] = 'CC'
    elif pre_SCL[i] == 1:
        pre_sale[i] = 'CL'
In [58]:
# Add our prediction into data
df_target_x['Sale_MF'] = pre_SMF
df_target_x['Sale_CC'] = pre_SCC
df_target_x['Sale_CL'] = pre_SCL
df_target_x['Sale'] = pre_sale



df_target_x['Revenue_MF'] = product_MF
df_target_x['Revenue_CC'] = product_CC
df_target_x['Revenue_CL'] = product_CL

# select max revenue top 100 client
df_prediction = df_target_x
df_prediction['Revenue'] = df_prediction[['Revenue_MF', 'Revenue_CC', 'Revenue_CL']].max(axis=1)
list_clients = df_prediction.sort_values('Revenue',ascending=False).head(100)[['Sale', 'Revenue','Sale_MF', 'Sale_CC', 'Sale_CL','Revenue_MF', 'Revenue_CC', 'Revenue_CL']]

We select 100 clients with the top 100 revenues.

  • list_clients: dataframe with 100 clients we want to target, the offer to the clients, and the revenue

Output: list of 100 clients to target & expected revenue

  • clients_sale_rev: dataframe with clients id, sales offer, and expected revenue
  • clients_id: list of 100 clients we want to target
  • expected_revenue: the value of the sum of the expected revenue of all sales
In [60]:
clients_sale_rev = list_clients[['Sale','Revenue']]

#list of 100 clients' id
clients_id = list(clients_sale_rev.index)
In [68]:
#sum of expected revenue
expected_revenue = clients_sale_rev['Revenue'].sum()
expected_revenue
Out[68]:
908.054160150071

Export Output

list_clients.xlsx: excel file of the clients list (clients id, offer, expected revenue)

In [64]:
#client id, sale offer, and expected revenue
#from pandas import to_exel()
clients_sale_rev = list_clients[['Sale','Revenue']]

#Export the output
clients_sale_rev.to_excel('list_clients.xlsx')
clients_sale_rev
Out[64]:
Sale Revenue
Client
1365 CC 64.691214
1278 CC 44.025214
331 CC 43.826000
1077 CL 34.134872
1289 CL 30.702221