Attracting new customers is a challenge for any company, and the response rate to marketing campaigns is usually low among non-customers. Traditional Machine Learning algorithms, such as Logistic Regression, can classify potential customers and predict the result of marketing campaigns, but their predictive accuracy is limited when the data is imbalanced (Chawla, Bowyer, Hall & Kegelmeyer, 2002).
Datasets of responses to commercial campaigns are imbalanced: they generally include a small fraction of positive responses and a large fraction of negative responses. In this context, newer methods such as SMOTE (Synthetic Minority Over-sampling Technique) have demonstrated better performance on test samples, as measured by Area Under the Curve (AUC), than other methods such as simple Logistic Regression.
Analytical Framework and Data Sources
As defined by Chawla et al. (2002), SMOTE is “an over-sampling approach in which the minority class is over-sampled by creating synthetic examples rather than by over-sampling with replacement […] that helps the classifier to create larger and less specific decision regions”.
The SMOTE algorithm selects similar instances of the minority class using nearest neighbours and bootstrapping, and generates synthetic samples by interpolating between them. In this context, the algorithm was tested to identify non-customer profiles with a high propensity to respond positively to a marketing campaign, following the steps below:
- External data extraction to build analytical variables that describe the behavior of non-customers. This can be done by leveraging external data sources, such as debt in other banks and financial products.
- Feature selection to identify the relevant variables from a large number of financial and behavioral variables.
- Definition of the sample size using SMOTE, to test the predictive power of classification algorithms such as Logistic Regression.
- Finally, after fitting a Logistic Regression model with and without SMOTE, evaluation of performance on the test sample using the AUC indicator.
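The over-sampling step in the list above rests on SMOTE's core idea: a synthetic example is placed at a random point on the segment between a minority-class instance and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that idea, not the implementation used in this analysis. The function name `smote`, its parameters, and the toy data are illustrative assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        # Draw a base instance at random, with replacement.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority instances by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: two numeric features for each positive responder.
responders = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote(responders, n_synthetic=6, k=2)
```

Because every synthetic point lies between two real minority instances, the new examples stay inside the region already occupied by the minority class instead of duplicating existing rows.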
After comparing the results of Logistic Regression with SMOTE against the traditional process of Logistic Regression on the imbalanced sample, we found the following results:
- Although the two model designs achieved the same AUC (0.78), SMOTE-Logistic Regression improved the Specificity indicator from 0.71 to 0.73. In other words, SMOTE-Logistic Regression has more predictive power to identify non-customers who will refuse the credit card offer (Figure 1).
- Commercial evidence of the advantage of SMOTE-Logistic Regression comes from comparing the prioritization capacity of the two model designs. We conclude that SMOTE-Logistic Regression can identify almost 60 additional credit card sales among high-propensity non-customers compared with traditional Logistic Regression (Figure 2).
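For readers less familiar with the two indicators compared above, the sketch below shows how AUC and Specificity can be computed from labels and model scores. It is a minimal illustration under stated assumptions: the function names, the 0.5 decision threshold, and the toy data are hypothetical and are not taken from the analysis reported here.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity(y_true, scores, threshold=0.5):
    """Share of actual negatives predicted as negative: TN / (TN + FP)."""
    neg_scores = [s for y, s in zip(y_true, scores) if y == 0]
    true_neg = sum(s < threshold for s in neg_scores)
    return true_neg / len(neg_scores)


# Hypothetical labels (1 = accepted the offer) and predicted propensities.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
model_auc = auc(y_true, scores)
model_spec = specificity(y_true, scores)
```

AUC depends only on how the scores rank responders against non-responders, while Specificity depends on the chosen threshold, which is why two models can share the same AUC and still differ in Specificity.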
Figure 1: Comparison of performance of the models in test data.
Figure 2: Comparison of prioritization capacity between the two model designs.
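One way to quantify prioritization capacity is to count how many actual responders fall among the top-scored prospects under each model. The sketch below assumes that definition. The function name `responders_in_top`, the cut-off, and the two score lists are hypothetical and do not reproduce the figures reported here.

```python
def responders_in_top(y_true, scores, top_n):
    """Count actual responders among the top_n prospects ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:top_n])


# Hypothetical outcomes (1 = card sold) and scores from two model designs.
outcomes = [1, 0, 0, 1, 0, 1, 0, 0]
scores_plain = [0.9, 0.8, 0.2, 0.3, 0.1, 0.7, 0.6, 0.4]
scores_smote = [0.9, 0.3, 0.2, 0.8, 0.1, 0.7, 0.6, 0.4]

top_n = 3  # size of the high-propensity group to contact
captured_plain = responders_in_top(outcomes, scores_plain, top_n)
captured_smote = responders_in_top(outcomes, scores_smote, top_n)
extra_sales = captured_smote - captured_plain
```

The difference between the two counts plays the role of the additional sales identified by one design over the other for a campaign of fixed size.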
Open-source implementations of SMOTE are available in R, along with many tutorials.