Jupiter Notebook link: https://ivorydaae.github.io/ML_Banking/#index
Github link: https://github.com/ivorydaae/ML_Banking
If you have further suggestions with this project, or you would like to contact me for opportunities, don't hesitate to contact me.
Ivory YAHSUAN TAI
Based in Paris, France
Data Scientist | Master Student in Data Science in EMLyon Business School
- E-mail: yahsuan.tai@sciencespo.fr
- Phone: +33 6 10 91 14 15
- LinkedIn: https://www.linkedin.com/in/yahsuantai/
- Github: https://github.com/ivorydaae/
Use customer data in banking to maximize revenue from marketing campaigns.
The purpose of this project is to choose 100 clients of the bank to sell them one of the three offers in bank (Mutual Funds, Credit Card, Consumer Loan) to maximize the predicted revenue for the bank by using machine learning approaches.
For the analysis, I use the data “DMML_Project_Dataset.xlsx”.
The excel file contains several tables (on 4 excel sheets):
§ Social-demographical data (client ID, age, gender, tenure in a bank)
§ Products and actual balance (current account, saving account, mutual funds, overdraft,
credit card, consumer loan)
§ Inflow/outflow on C/A, aggregated card turnover (monthly average over past 3
months)
§ Clients actual sales and revenues. Data is available for 60 % of the clients.
The bank has capacity to contact around 100 clients with a marketing offer and each client can be targeted only once (you can't retarget).
§ Which clients are to be targeted with which offer?
§ What would be the expected revenue based on your strategy?
import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Import excel sheets
Description = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Description')
Inflow_Outflow = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Inflow_Outflow')
Soc_Dem = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Soc_Dem')
Products_ActBalance = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Products_ActBalance')
Sales_Revenues = pd.read_excel('DMML_Project_Dataset.xlsx', sheet_name='Sales_Revenues')
## Merge
df = pd.merge(Soc_Dem, Inflow_Outflow, on='Client', how='inner')
df = pd.merge(df, Products_ActBalance, on='Client', how='inner')
df = df.replace(np.nan,0)
# df_all: Training data with targets
df_all = pd.merge(Sales_Revenues, df, on='Client', how='inner')
df_all = df_all.replace(np.nan,0)
For the modeling later, we tried to add two features other than the original ones:
# df_feature: Training data without targets
df_feature = df.set_index('Client')
# Add feature 'Saving ratio'
df_feature['Saving_Ratio'] = (df_feature['ActBal_CA']+df_feature['ActBal_SA']+df_feature['ActBal_MF'])/(df_feature['ActBal_CA']+df_feature['ActBal_SA']+df_feature['ActBal_MF']+df_feature['ActBal_OVD']+df_feature['ActBal_CC']+df_feature['ActBal_CL'])
df_feature['age_20'] = (df_feature['Age']>20).apply(lambda x: int(x))
#df_feature['VolumeCred_2'] = (df_feature['VolumeCred']**2)
# Clean df_feature
df_feature = df_feature.replace(np.nan,0)
df_feature
# Correlation
corr = df_feature.corr()
# Heatmap
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(10, 6))
ax = sns.heatmap(corr, mask=mask,square=True, cmap=sns.cm.rocket_r)
To select the features, we drop one of the two features if they have correlation more than 0.92.
#selecting features: correlation >= 0.92
columns = np.full((corr.shape[0],), True, dtype=bool)
columns
for i in range(corr.shape[0]):
for j in range(i+1, corr.shape[0]):
if corr.iloc[i,j] >= 0.92:
if columns[j]:
print(corr[j:j+1].index) ## print dropped features
columns[j] = False
selected_columns = corr.columns[columns]
df_selected_c = df_feature[selected_columns]
# df_selected: data with selected features & targets (train data only)
# df_selected_all: data with selected features & targets (train+target data)
sr_c = Sales_Revenues.set_index('Client')
df_selected_all = pd.merge(sr_c, df_selected_c, on='Client', how='right')
df_selected = pd.merge(df_selected_c, Sales_Revenues, on='Client', how='inner')
df_selected_all #target data: SMF-RCL == NAN
df_selected_all['Sale_MF'].isna().sum()
df_selected = df_selected.set_index('Client')
#df_selected
#Manipuling data: Extract X & Ys(targets) into numpy data
df_selected_np = df_selected.to_numpy()
X = df_selected_np[:,:df_selected.shape[1]-6]
# y: all 6 targets
y = df_selected_np[:,df_selected.shape[1]-6:df_selected.shape[1]]
# seperate the targets
y_sale_MF = df_selected_np[:,df_selected.shape[1]-6]
y_sale_CC = df_selected_np[:,df_selected.shape[1]-5]
y_sale_CL = df_selected_np[:,df_selected.shape[1]-4]
y_revenue_MF = df_selected_np[:,df_selected.shape[1]-3]
y_revenue_CC = df_selected_np[:,df_selected.shape[1]-2]
y_revenue_CL = df_selected_np[:,df_selected.shape[1]-1]
#X[0]
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2,
random_state=77)
ytrain_SMF = ytrain[:,0]
ytrain_SCC = ytrain[:,1]
ytrain_SCL = ytrain[:,2]
ytrain_RMF = ytrain[:,3]
ytrain_RCC = ytrain[:,4]
ytrain_RCL = ytrain[:,5]
Here we try to extract the data with clients who don't have any sales_revenue data for prediction later on:
# Target data: the data we want to predict
df_target = df_selected_all[df_selected_all.index.isin(sr_c.index) == False]
df_target = df_target.replace(np.nan,0)
df_target_x = df_target.iloc[:,6:]
df_target_x_np = df_target_x.to_numpy()
Xtarget = df_target_x_np
#Xtarget
Xtarget: np.array data with features of the clients who are't sold anything yet.
#import evaluation
from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate
#import Kfold, pipeline, scaling
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
#import Models
##CLF
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
##REG
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
In our project, we use kfold cross-validation in all the models.
## Cross- Val KFold
kfold = KFold(n_splits=10, random_state=77, shuffle=True)
For every model, we will generate a matrix of precision, recall, f1, accuracy scores to evaluate the model and fine-tuning them.
## Logistic Regression: clf_logreg
clf_logreg = Pipeline((
("scaler", StandardScaler()),
("logreg", LogisticRegression(penalty='l2', C=100, max_iter=1000))
))
#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF = cross_val_score(clf_logreg, Xtrain, ytrain_SMF, cv=kfold)
##Sales Credit Card
precision_SCC = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC = cross_val_score(clf_logreg, Xtrain, ytrain_SCC, cv=kfold)
##Sales Consumer Loan
precision_SCL = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL = cross_val_score(clf_logreg, Xtrain, ytrain_SCL, cv=kfold)
#Print score
print("Logistic Regression:")
print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0))
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0))
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0))
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0))
print("----------------------------------")
print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0))
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0))
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0))
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0))
print("----------------------------------")
print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0))
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0))
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0))
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0))
## Decision Tree Classifier: clf_tree
clf_tree = Pipeline((
("scaler", StandardScaler()),
("tree", DecisionTreeClassifier(max_depth =30,random_state=77)),
))
#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF = cross_val_score(clf_tree, Xtrain, ytrain_SMF, cv=kfold)
##Sales Credit Card
precision_SCC = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC = cross_val_score(clf_tree, Xtrain, ytrain_SCC, cv=kfold)
##Sales Consumer Loan
precision_SCL = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL = cross_val_score(clf_tree, Xtrain, ytrain_SCL, cv=kfold)
#Print score
print("Decision Tree Classifier:")
print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0))
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0))
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0))
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0))
print("----------------------------------")
print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0))
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0))
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0))
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0))
print("----------------------------------")
print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0))
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0))
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0))
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0))
## Random Forest: clf_forest
clf_forest = Pipeline((
("scaler", StandardScaler()),
("forest", RandomForestClassifier(n_estimators=100, max_depth=20, random_state=54)),
))
#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF = cross_val_score(clf_forest, Xtrain, ytrain_SMF, cv=kfold)
##Sales Credit Card
precision_SCC = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC = cross_val_score(clf_forest, Xtrain, ytrain_SCC, cv=kfold)
##Sales Consumer Loan
precision_SCL = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL = cross_val_score(clf_forest, Xtrain, ytrain_SCL, cv=kfold)
#Print score
print("Random Forest Classifier:")
print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0))
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0))
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0))
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0))
print("----------------------------------")
print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0))
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0))
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0))
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0))
print("----------------------------------")
print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0))
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0))
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0))
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0))
## SVM: clf_svm
clf_svm = Pipeline((
("scaler", StandardScaler()),
("LinearSVC", LinearSVC(random_state=77, tol=1, max_iter=1000)),
))
clf_svm = clf_svm.fit(Xtrain, ytrain_SMF)
#Scores
##Sales Mutual Funds
precision_SMF = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='precision')
recall_SMF = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='recall')
f1_SMF = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold, scoring='f1')
acc_SMF = cross_val_score(clf_svm, Xtrain, ytrain_SMF, cv=kfold)
##Sales Credit Card
precision_SCC = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='precision')
recall_SCC = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='recall')
f1_SCC = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold, scoring='f1')
acc_SCC = cross_val_score(clf_svm, Xtrain, ytrain_SCC, cv=kfold)
##Sales Consumer Loan
precision_SCL = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='precision')
recall_SCL = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='recall')
f1_SCL = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold, scoring='f1')
acc_SCL = cross_val_score(clf_svm, Xtrain, ytrain_SCL, cv=kfold)
#Print score
print("SVM:")
print("MF precision: %.2f%%" % (precision_SMF.mean()*100.0))
print("MF recall: %.2f%%" % (recall_SMF.mean()*100.0))
print("MF f1: %.2f%%" % (f1_SMF.mean()*100.0))
print("MF accuracy: %.2f%%" % (acc_SMF.mean()*100.0))
print("----------------------------------")
print("CC precision: %.2f%%" % (precision_SCC.mean()*100.0))
print("CC recall: %.2f%%" % (recall_SCC.mean()*100.0))
print("CC f1: %.2f%%" % (f1_SCC.mean()*100.0))
print("CC accuracy: %.2f%%" % (acc_SCC.mean()*100.0))
print("----------------------------------")
print("CL precision: %.2f%%" % (precision_SCL.mean()*100.0))
print("CL recall: %.2f%%" % (recall_SCL.mean()*100.0))
print("CL f1: %.2f%%" % (f1_SCL.mean()*100.0))
print("CL accuracy: %.2f%%" % (acc_SCL.mean()*100.0))
For each target variable, we will use the model with has the best scores:
For every model, we will generate a matrix of Error, MAE, MSE to evaluate the model and fine-tuning them.
## Linear Regression: reg_lin
reg_lin = Pipeline((
("scaler", StandardScaler()),
("linear", LinearRegression())
))
#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_lin, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Credit Card
err_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_lin, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Consumer Loan
err_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_lin, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')
#Print score
print("Linear Regression:")
print("MF Error: %.2f" % (err_RMF.mean()))
print("MF MAE: %.2f" % (MAE_RMF.mean()))
print("MF MSE: %.2f" % (MSE_RMF.mean()))
print("----------------------------------")
print("CC Error: %.2f" % (err_RCC.mean()))
print("CC MAE: %.2f" % (MAE_RCC.mean()))
print("CC MSE: %.2f" % (MSE_RCC.mean()))
print("----------------------------------")
print("CL Error: %.2f" % (err_RCL.mean()))
print("CL MAE: %.2f" % (MAE_RCL.mean()))
print("CL MSE: %.2f" % (MSE_RCL.mean()))
## Linear Regression: reg_lin
reg_lasso = Pipeline((
("scaler", StandardScaler()),
("lasso", Lasso(alpha=1))
))
#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_lasso, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Credit Card
err_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_lasso, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Consumer Loan
err_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_lasso, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')
#Print score
print("Lasso:")
print("MF Error: %.2f" % (err_RMF.mean()))
print("MF MAE: %.2f" % (MAE_RMF.mean()))
print("MF MSE: %.2f" % (MSE_RMF.mean()))
print("----------------------------------")
print("CC Error: %.2f" % (err_RCC.mean()))
print("CC MAE: %.2f" % (MAE_RCC.mean()))
print("CC MSE: %.2f" % (MSE_RCC.mean()))
print("----------------------------------")
print("CL Error: %.2f" % (err_RCL.mean()))
print("CL MAE: %.2f" % (MAE_RCL.mean()))
print("CL: MSE: %.2f" % (MSE_RCL.mean()))
## Decision Tree regression: reg_tree
reg_tree = Pipeline((
("scaler", StandardScaler()),
("tree", DecisionTreeRegressor(criterion='mse',max_depth=100))
))
#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_tree, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Credit Card
err_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_tree, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Consumer Loan
err_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_tree, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')
#Print score
print("Decision Tree regression:")
print("MF Error: %.2f" % (err_RMF.mean()))
print("MF MAE: %.2f" % (MAE_RMF.mean()))
print("MF MSE: %.2f" % (MSE_RMF.mean()))
print("----------------------------------")
print("CC Error: %.2f" % (err_RCC.mean()))
print("CC MAE: %.2f" % (MAE_RCC.mean()))
print("CC MSE: %.2f" % (MSE_RCC.mean()))
print("----------------------------------")
print("CL Error: %.2f" % (err_RCL.mean()))
print("CL MAE: %.2f" % (MAE_RCL.mean()))
print("CL MSE: %.2f" % (MSE_RCL.mean()))
## Random Forest regression: reg_forest
reg_forest = Pipeline((
("scaler", StandardScaler()),
("forest", RandomForestRegressor(n_estimators=10, max_depth=1000))
))
#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_forest, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Credit Card
err_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Consumer Loan
err_RCL = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_forest, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')
#Print score
print("Random Forest regression:")
print("MF Error: %.2f" % (err_RMF.mean()))
print("MF MAE: %.2f" % (MAE_RMF.mean()))
print("MF MSE: %.2f" % (MSE_RMF.mean()))
print("----------------------------------")
print("CC Error: %.2f" % (err_RCC.mean()))
print("CC MAE: %.2f" % (MAE_RCC.mean()))
print("CC MSE: %.2f" % (MSE_RCC.mean()))
print("----------------------------------")
print("CL Error: %.2f" % (err_RCL.mean()))
print("CL MAE: %.2f" % (MAE_RCL.mean()))
print("CL MSE: %.2f" % (MSE_RCL.mean()))
## SVR: svr
reg_svr = Pipeline((
("scaler", StandardScaler()),
("svr", SVR(kernel='rbf', gamma='scale', C=10))
))
#Scores
##Revenues Mutual Funds
err_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold)
MAE_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RMF = cross_val_score(reg_svr, Xtrain, ytrain_RMF, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Credit Card
err_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold)
MAE_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCC = cross_val_score(reg_svr, Xtrain, ytrain_RCC, cv=kfold, scoring='neg_mean_squared_error')
##Revenues Consumer Loan
err_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold)
MAE_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_absolute_error')
MSE_RCL = cross_val_score(reg_svr, Xtrain, ytrain_RCL, cv=kfold, scoring='neg_mean_squared_error')
#Print score
print("SVR:")
print("MF Error: %.2f" % (err_RMF.mean()))
print("MF MAE: %.2f" % (MAE_RMF.mean()))
print("MF MSE: %.2f" % (MSE_RMF.mean()))
print("----------------------------------")
print("CC Error: %.2f" % (err_RCC.mean()))
print("CC MAE: %.2f" % (MAE_RCC.mean()))
print("CC MSE: %.2f" % (MSE_RCC.mean()))
print("----------------------------------")
print("CL Error: %.2f" % (err_RCL.mean()))
print("CL MAE: %.2f" % (MAE_RCL.mean()))
print("CL MSE: %.2f" % (MSE_RCL.mean()))
For each target variable, we will use the model with the least errors:
For each target variables, we choose these models:
After fine-tuning the models, we will predict the sales and revenues on the clients who have'nt been sold anything yet.
# SMF prediction: logreg
clf_logreg.fit(Xtrain, ytrain_SMF)
pre_SMF = clf_logreg.predict(Xtarget).astype(int)
# SCC prediction: forest
clf_forest.fit(Xtrain, ytrain_SCC)
pre_SCC = clf_forest.predict(Xtarget).astype(int)
# SCL prediction: logreg
clf_logreg.fit(Xtrain, ytrain_SCL)
pre_SCL = clf_logreg.predict(Xtarget).astype(int)
# RMF prediction: lasso
reg_lasso.fit(Xtrain, ytrain_RMF)
pre_RMF = reg_lasso.predict(Xtarget)
# RCC prediction: forest
reg_forest.fit(Xtrain, ytrain_RCC)
pre_RCC = reg_forest.predict(Xtarget)
# RCL prediction: lin
reg_lin.fit(Xtrain, ytrain_RCL)
pre_RCL = reg_lin.predict(Xtarget)
## only one sale to one person
product_MF = np.multiply(pre_SMF,pre_RMF)
product_CC = np.multiply(pre_SCC,pre_RCC)
product_CL = np.multiply(pre_SCL,pre_RCL)
for i in range(product_MF.shape[0]):
if product_CC[i] > product_MF[i]:
product_MF[i] = 0.0
pre_SMF[i] = 0
pre_RMF[i] = 0.0
if product_CL[i] > product_CC[i]:
product_CC[i] = 0
pre_SCC[i] = 0
pre_RCC[i] = 0
else:
product_CL[i] = 0
pre_SCL[i] = 0
pre_RCL[i] = 0
else:
product_CC[i] = 0.0
pre_SCC[i] = 0
pre_RCC[i] = 0.0
if product_MF[i] > product_CL[i]:
product_CL[i] = 0
pre_SCL[i] = 0
pre_RCL[i] = 0
else:
product_MF[i] = 0
pre_SMF[i] = 0
pre_RMF[i] = 0
## create sale_type array
pre_sale = np.empty(product_CL.shape[0], dtype='object')
for i in range(product_CL.shape[0]):
if pre_SMF[i] == 1:
pre_sale[i] = 'MF'
elif pre_SCC[i] == 1:
pre_sale[i] = 'CC'
elif pre_SCL[i] == 1:
pre_sale[i] = 'CL'
# Add our prediction into data
df_target_x['Sale_MF'] = pre_SMF
df_target_x['Sale_CC'] = pre_SCC
df_target_x['Sale_CL'] = pre_SCL
df_target_x['Sale'] = pre_sale
df_target_x['Revenue_MF'] = product_MF
df_target_x['Revenue_CC'] = product_CC
df_target_x['Revenue_CL'] = product_CL
# select max revenue top 100 client
df_prediction = df_target_x
df_prediction['Revenue'] = df_prediction[['Revenue_MF', 'Revenue_CC', 'Revenue_CL']].max(axis=1)
list_clients = df_prediction.sort_values('Revenue',ascending=False).head(100)[['Sale', 'Revenue','Sale_MF', 'Sale_CC', 'Sale_CL','Revenue_MF', 'Revenue_CC', 'Revenue_CL']]
clients_sale_rev = list_clients[['Sale','Revenue']]
#list of 100 clients' id
clients_id = list(clients_sale_rev.index)
#sum of expected revenue
expected_revenue = clients_sale_rev['Revenue'].sum()
expected_revenue
list_clients.xlsx: excel file of the clients list (clients id, offer, expected revenue)
#client id, sale offer, and expected revenue
#from pandas import to_exel()
clients_sale_rev = list_clients[['Sale','Revenue']]
#Export the output
clients_sale_rev.to_excel('list_clients.xlsx')
clients_sale_rev